Your Cloud Region Just Went Offline: What Happens Next
Quick summary
AWS UAE went offline in March 2026. Most developers had no tested failover. Here is the complete guide: RTO, RPO, active-passive vs active-active.
On March 1, 2026, AWS UAE went partially offline after Iranian drone strikes hit the data center. Most developers running workloads in that region had no tested failover plan. Here is what should have been in place — and what to build before the next regional outage.
What Actually Happens When a Cloud Region Goes Down
A cloud region outage follows a predictable sequence that most developers do not think through until they are inside it:
- Power or connectivity to specific availability zones fails
- AWS status dashboard shows degraded status for the region
- DNS resolvers still return the same IP addresses for your services
- Your application starts failing because those IPs are unreachable
- If no automation exists, you are now making decisions under pressure with incomplete information
Without prior planning, teams spend the first 30 minutes discovering the scope and the next hour deciding what to do. With automation in place, the same response can complete in minutes.
Active-Active vs Active-Passive
Active-active means your application runs simultaneously in two or more regions at all times. Traffic is distributed between them. When a region goes offline, its traffic shifts to the other regions immediately with no manual step.
Active-passive means your primary region handles all traffic and a secondary region sits on warm standby. When the primary fails, DNS or load balancer settings update to direct traffic to the secondary. There is a gap between the failure and the flip.
| | Active-Active | Active-Passive |
|---|---|---|
| Cost | ~2x infrastructure | ~1.2x (standby is smaller) |
| Failover time | Near-zero (automatic) | 2-10 minutes |
| Complexity | High | Medium |
| RTO achievable | Seconds | Minutes |
| Best for | Payments, real-time apps | Most APIs and web apps |
For most web applications, active-passive is the right starting point. It is significantly cheaper and simpler. Active-active is necessary when your business cannot tolerate even 5-10 minutes of downtime during a regional event.
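As a rough illustration of the active-passive DNS flip, here is a minimal sketch that builds a Route 53 failover record pair in the shape boto3's `change_resource_record_sets` accepts. The domain, IP addresses, hosted zone ID, and health check ID are placeholders, and the helper names are made up for this example. Treat it as a starting point, not a finished setup.

```python
from typing import Optional


def failover_record(name: str, target_ip: str, role: str,
                    health_check_id: Optional[str] = None, ttl: int = 60) -> dict:
    """Build one UPSERT change for a failover A record (role: PRIMARY or SECONDARY)."""
    if role not in ("PRIMARY", "SECONDARY"):
        raise ValueError("role must be PRIMARY or SECONDARY")
    record = {
        "Name": name,
        "Type": "A",
        "SetIdentifier": f"{name}-{role.lower()}",
        "Failover": role,
        "TTL": ttl,  # a short TTL (assumed 60s here) keeps stale answers short-lived
        "ResourceRecords": [{"Value": target_ip}],
    }
    if health_check_id is not None:
        record["HealthCheckId"] = health_check_id
    return {"Action": "UPSERT", "ResourceRecordSet": record}


def failover_change_batch(name: str, primary_ip: str, secondary_ip: str,
                          primary_health_check_id: str) -> dict:
    """Pair a health-checked PRIMARY record with a SECONDARY standby record."""
    return {
        "Comment": "active-passive failover pair",
        "Changes": [
            failover_record(name, primary_ip, "PRIMARY", primary_health_check_id),
            failover_record(name, secondary_ip, "SECONDARY"),
        ],
    }


def apply_change_batch(hosted_zone_id: str, change_batch: dict) -> None:
    """Send the batch to Route 53. Needs boto3 and AWS credentials."""
    import boto3  # imported lazily so the builders above run without AWS access

    boto3.client("route53").change_resource_record_sets(
        HostedZoneId=hosted_zone_id, ChangeBatch=change_batch
    )


if __name__ == "__main__":
    # Placeholder values only; nothing is sent to AWS here.
    batch = failover_change_batch(
        "api.example.com.", "192.0.2.10", "198.51.100.10", "hypothetical-health-check-id"
    )
    print(batch)
```

Keeping the builders separate from the API call means the record shape can be reviewed and unit tested without touching a live hosted zone.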
The Three Things That Must Be Ready Before Failover
1. Route 53 health checks configured
Route 53 health checks monitor your primary region endpoints. When consecutive checks fail within the threshold you set, Route 53 automatically serves the secondary region DNS record instead. Two settings matter most: failure threshold (how many consecutive failures before Route 53 marks the endpoint unhealthy) and health check interval (10 or 30 seconds). The default failure threshold is 3 checks at 30-second intervals — that is 90 seconds before failover DNS starts propagating. For payment flows, tighten this to 3 checks at 10-second intervals.
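The arithmetic behind those numbers is simple enough to sketch. The helper below multiplies interval by threshold to estimate detection time, and builds a `HealthCheckConfig` dictionary in the shape boto3's `create_health_check` expects. The `/health` path and the domain are assumptions for illustration, and the estimate ignores DNS TTL and resolver caching, which add to the real-world delay.

```python
import uuid


def detection_seconds(request_interval: int, failure_threshold: int) -> int:
    """Approximate time for consecutive failures to mark an endpoint unhealthy."""
    return request_interval * failure_threshold


def health_check_config(domain: str, path: str = "/health",
                        request_interval: int = 30, failure_threshold: int = 3) -> dict:
    """Build a HealthCheckConfig dict for an HTTPS endpoint."""
    if request_interval not in (10, 30):
        raise ValueError("Route 53 supports 10- or 30-second intervals")
    if not 1 <= failure_threshold <= 10:
        raise ValueError("failure threshold must be between 1 and 10")
    return {
        "Type": "HTTPS",
        "FullyQualifiedDomainName": domain,
        "Port": 443,
        "ResourcePath": path,  # hypothetical health endpoint
        "RequestInterval": request_interval,
        "FailureThreshold": failure_threshold,
    }


def create_health_check(config: dict) -> str:
    """Create the health check and return its ID. Needs boto3 and AWS credentials."""
    import boto3  # imported lazily so the helpers above run without AWS access

    response = boto3.client("route53").create_health_check(
        CallerReference=str(uuid.uuid4()), HealthCheckConfig=config
    )
    return response["HealthCheck"]["Id"]


if __name__ == "__main__":
    print(detection_seconds(30, 3))  # defaults: 90
    print(detection_seconds(10, 3))  # tightened: 30
    print(health_check_config("api.example.com", request_interval=10))
```

With the tightened settings the detection window drops from 90 seconds to 30, which is the difference the payment-flow recommendation above is aiming for.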
2. Cross-region database replica
Most failover plans that fail in practice fail here. The application failover works, but the database is only in the primary region. For AWS RDS, create a cross-region read replica in your secondary region. When the primary region goes down, promote the read replica to become the new primary. This takes 5-10 minutes manually or can be automated with Lambda.
Multi-AZ is not the same as multi-region. Multi-AZ protects against a single availability zone failing within one region. A regional outage takes down all availability zones in that region simultaneously.
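The promotion step can be automated along these lines. This is a minimal sketch, assuming a Lambda function in the secondary region that receives the replica identifier in its event. The event keys, polling interval, and timeout are illustrative choices, not values from AWS. It uses the boto3 RDS calls `promote_read_replica` and `describe_db_instances`.

```python
import time


def promote_replica(rds_client, replica_id: str, poll_seconds: float = 15.0,
                    timeout_seconds: float = 600.0, sleep=time.sleep) -> str:
    """Promote a read replica and wait until it is a standalone, available instance.

    Returns the endpoint address the application should switch to.
    """
    rds_client.promote_read_replica(DBInstanceIdentifier=replica_id)
    waited = 0.0
    while waited <= timeout_seconds:
        instance = rds_client.describe_db_instances(
            DBInstanceIdentifier=replica_id
        )["DBInstances"][0]
        # Promotion is done once the instance is available and no longer
        # reports a replication source.
        if (instance["DBInstanceStatus"] == "available"
                and not instance.get("ReadReplicaSourceDBInstanceIdentifier")):
            return instance["Endpoint"]["Address"]
        sleep(poll_seconds)
        waited += poll_seconds
    raise TimeoutError(f"{replica_id} was not promoted within {timeout_seconds} seconds")


def lambda_handler(event, context):
    """Hypothetical Lambda entry point; expects 'secondary_region' and 'replica_id'."""
    import boto3  # imported lazily so promote_replica can be tested with a fake client

    client = boto3.client("rds", region_name=event["secondary_region"])
    return {"endpoint": promote_replica(client, event["replica_id"])}
```

Passing the client in as an argument makes the logic testable against a stub in staging before it is ever pointed at a production replica. Promotion is one-way: once promoted, the instance stops replicating from the old primary, so the trigger for this function deserves its own safeguards.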
3. A failover drill on the calendar
A failover plan that has never been tested is not a failover plan. Run a quarterly drill where you deliberately shift traffic to your secondary region and verify that application functionality works, latency is acceptable from the secondary, database writes are handled correctly, and monitoring alerts fire as expected from the new region. The AWS UAE outage revealed that many teams had theoretical failover plans that broke on first contact with reality.
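One way to keep a drill honest is to record each verification as a named check and collect the failures in one place. The sketch below is a small harness for that. The four checks mirror the list above, but their bodies are stand-ins: in a real drill each would call your own endpoints, database, and monitoring.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class DrillCheck:
    name: str
    run: Callable[[], bool]  # returns True when the check passes


def run_drill(checks: List[DrillCheck]) -> Tuple[bool, List[str]]:
    """Run every check, even after a failure, and return (all_passed, failures)."""
    failures: List[str] = []
    for check in checks:
        try:
            passed = check.run()
        except Exception as exc:  # a crashing check counts as a failed check
            failures.append(f"{check.name}: raised {type(exc).__name__}: {exc}")
            continue
        if not passed:
            failures.append(f"{check.name}: check returned False")
    return (not failures, failures)


if __name__ == "__main__":
    # Stand-in checks; replace the lambdas with real probes of the secondary region.
    checks = [
        DrillCheck("application functionality", lambda: True),
        DrillCheck("latency acceptable from secondary", lambda: True),
        DrillCheck("database writes handled", lambda: True),
        DrillCheck("monitoring alerts fire", lambda: False),
    ]
    all_passed, failures = run_drill(checks)
    print("drill passed" if all_passed else "drill failed")
    for line in failures:
        print(" -", line)
```

Running every check instead of stopping at the first failure gives a complete picture from a single drill, which matters when drills only happen once a quarter.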
RTO and RPO: Define These Before an Outage
RTO (Recovery Time Objective) is how long your system can be offline before it causes serious business impact. RPO (Recovery Point Objective) is how much data loss you can tolerate, measured as time since the last successful sync or backup.
If your RTO is 1 hour, active-passive with manual database failover is fine. If your RTO is 5 minutes, you need automated DNS failover and a cross-region replica whose promotion is automated. If your RTO is under 1 minute, you need active-active architecture.
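The mapping from RTO to architecture can be written down as a small function. This sketch encodes the three anchor points above: under 1 minute, 5 minutes, and 1 hour. How it treats the range between 5 minutes and 1 hour is an assumption (it errs toward automation). The RPO helper compares observed replica lag against the target and is likewise illustrative.

```python
ACTIVE_ACTIVE = "active-active"
AUTOMATED_PASSIVE = "active-passive with automated DNS and database failover"
MANUAL_PASSIVE = "active-passive with manual database failover"


def recommend_architecture(rto_seconds: float) -> str:
    """Map a Recovery Time Objective to the least complex architecture that meets it."""
    if rto_seconds <= 0:
        raise ValueError("RTO must be positive")
    if rto_seconds < 60:
        return ACTIVE_ACTIVE
    if rto_seconds < 3600:
        # Assumption: anything tighter than one hour gets automated failover.
        return AUTOMATED_PASSIVE
    return MANUAL_PASSIVE


def meets_rpo(replica_lag_seconds: float, rpo_seconds: float) -> bool:
    """True when the data at risk (current replica lag) is within the RPO."""
    return replica_lag_seconds <= rpo_seconds


if __name__ == "__main__":
    for rto in (30, 300, 3600):
        print(rto, "->", recommend_architecture(rto))
    print(meets_rpo(replica_lag_seconds=12, rpo_seconds=60))
```

Writing the thresholds down as code forces the conversation the section argues for: someone has to pick the numbers before the outage, not during it.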
Most startups have never defined these numbers. The March 2026 outage is a good forcing function.
A Checklist for AWS Developers
- Route 53 health check on all production endpoints
- Health check interval set to 10 seconds, failure threshold at 3
- Cross-region RDS read replica in your secondary region
- Runbook documented for promoting the read replica to primary
- Static assets served from CloudFront, not region-specific storage
- Automated failover (Lambda or Step Functions) tested in staging
- Failover drill scheduled within next 90 days
For background on the March 2026 outage, see AWS UAE Data Centre Hit in March 2026. For how the underlying cable infrastructure affects regional connectivity, see What Happens When an Undersea Cable Is Cut.
Key Takeaways
- 90 seconds — default Route 53 time to begin DNS failover if health check settings are left at defaults
- 5-10 minutes — time to promote an RDS cross-region read replica to primary
- Multi-AZ is not multi-region — the most common misconception in cloud disaster recovery planning
- RTO and RPO — define these numbers before an outage, not during one; they determine which architecture you actually need
- For developers: run a failover drill this quarter. If your secondary region has never served real traffic, you do not know whether your failover actually works
- What to watch: AWS Global Resilience (announced late 2025) automates multi-region failover without manual DNS changes — watch for general availability and evaluate it against Route 53 health check-based approaches
Written by
Abhishek Gautam
Full Stack Developer & Software Engineer based in Delhi, India. Building web applications and SaaS products with React, Next.js, Node.js, and TypeScript. 8+ projects deployed across 7+ countries.