Your Cloud Region Just Went Offline: What Happens Next

Abhishek Gautam · 8 min read

Quick summary

AWS UAE went offline in March 2026. Most developers had no tested failover. Here is the complete guide: RTO, RPO, active-passive vs active-active.

On March 1, 2026, AWS UAE went partially offline after Iranian drone strikes hit the data center. Most developers running workloads in that region had no tested failover plan. Here is what should have been in place — and what to build before the next regional outage.

What Actually Happens When a Cloud Region Goes Down

A cloud region outage follows a predictable sequence that most developers do not think through until they are inside it:

  • Power or connectivity to specific availability zones fails
  • AWS status dashboard shows degraded status for the region
  • DNS resolvers still return the same IP addresses for your services
  • Your application starts failing because those IPs are unreachable
  • If no automation exists, you are now making decisions under pressure with incomplete information

Without prior planning, teams typically spend the first 30 minutes discovering the scope and the next hour deciding what to do. With tested automation in place, the entire response can complete in under 5 minutes.
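The gap between the third and fourth steps in the sequence above (DNS still answers, but the address behind it is dead) can be observed with a small probe. This is a minimal sketch using only the Python standard library; the function name `probe` and the shape of its return value are illustrative assumptions, not part of any AWS tooling.

```python
import socket


def probe(host: str, port: int, timeout: float = 2.0) -> dict:
    """Report name resolution and TCP reachability as two separate facts."""
    result = {"resolved": [], "reachable": False}
    try:
        infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror:
        return result  # the name did not resolve at all

    # DNS can keep answering long after the hosts behind it are gone.
    result["resolved"] = sorted({info[4][0] for info in infos})

    for family, socktype, proto, _canonname, sockaddr in infos:
        try:
            with socket.socket(family, socktype, proto) as sock:
                sock.settimeout(timeout)
                sock.connect(sockaddr)
                result["reachable"] = True
                break
        except OSError:
            continue  # refused, timed out, or unroutable: try the next address
    return result
```

During a regional outage, a result with a non-empty `resolved` list and `reachable` set to `False` is the signature to look for: the DNS record is intact, so nothing will change until something updates it.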

Active-Active vs Active-Passive

Active-active means your application runs simultaneously in two or more regions at all times. Traffic is distributed between them. When a region goes offline, its traffic shifts to the other regions immediately with no manual step.

Active-passive means your primary region handles all traffic and a secondary region sits on warm standby. When the primary fails, DNS or load balancer settings update to direct traffic to the secondary. There is a gap between the failure and the flip.

| | Active-Active | Active-Passive |
|---|---|---|
| Cost | ~2x infrastructure | ~1.2x (standby is smaller) |
| Failover time | Near-zero (automatic) | 2-10 minutes |
| Complexity | High | Medium |
| RTO achievable | Seconds | Minutes |
| Best for | Payments, real-time apps | Most APIs and web apps |

For most web applications, active-passive is the right starting point. It is significantly cheaper and simpler. Active-active is necessary when your business cannot tolerate even 5-10 minutes of downtime during a regional event.
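For the active-passive case, the DNS flip described above is usually expressed as a pair of Route 53 failover records: one PRIMARY tied to a health check, one SECONDARY. The sketch below only builds the change batch as a plain dictionary; the helper name `failover_change_batch`, the example IPs, and the 60-second TTL are assumptions for illustration.

```python
def failover_change_batch(name, primary_ip, secondary_ip, health_check_id, ttl=60):
    """Build a Route 53 ChangeBatch with PRIMARY and SECONDARY failover A records."""

    def record(role, ip, set_id, check_id=None):
        rrset = {
            "Name": name,
            "Type": "A",
            "SetIdentifier": set_id,  # must be unique per record in the set
            "Failover": role,         # "PRIMARY" or "SECONDARY"
            "TTL": ttl,               # short TTL so resolvers pick up the flip quickly
            "ResourceRecords": [{"Value": ip}],
        }
        if check_id:
            rrset["HealthCheckId"] = check_id
        return {"Action": "UPSERT", "ResourceRecordSet": rrset}

    return {
        "Comment": "active-passive failover pair",
        "Changes": [
            record("PRIMARY", primary_ip, "primary", health_check_id),
            record("SECONDARY", secondary_ip, "secondary"),
        ],
    }


# Applying it (requires boto3 and credentials; the zone ID is a placeholder):
# import boto3
# boto3.client("route53").change_resource_record_sets(
#     HostedZoneId="Z_EXAMPLE", ChangeBatch=batch)
```

Keeping the batch as data makes it easy to review in a pull request and to reuse in a staging hosted zone before it touches production.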

The Three Things That Must Be Ready Before Failover

1. Route 53 health checks configured

Route 53 health checks monitor your primary region endpoints. When consecutive checks fail within the threshold you set, Route 53 automatically serves the secondary region DNS record instead. Two settings matter most: failure threshold (how many consecutive failures before Route 53 marks the endpoint unhealthy) and health check interval (10 or 30 seconds). The default failure threshold is 3 checks at 30-second intervals — that is 90 seconds before failover DNS starts propagating. For payment flows, tighten this to 3 checks at 10-second intervals.
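The arithmetic behind those numbers is simply interval times threshold. The sketch below builds the request for boto3's `create_health_check` call and computes the detection time; the helper names, the `/health` path, and the HTTPS-on-443 choice are assumptions, and DNS TTL and resolver caching add further delay on top of the figure it returns.

```python
import uuid


def health_check_request(domain, path="/health", interval=30, threshold=3):
    """Build keyword arguments for route53.create_health_check (a sketch)."""
    if interval not in (10, 30):
        raise ValueError("Route 53 accepts a request interval of 10 or 30 seconds")
    if not 1 <= threshold <= 10:
        raise ValueError("failure threshold must be between 1 and 10")
    return {
        "CallerReference": str(uuid.uuid4()),  # idempotency token for the request
        "HealthCheckConfig": {
            "Type": "HTTPS",
            "FullyQualifiedDomainName": domain,
            "Port": 443,
            "ResourcePath": path,
            "RequestInterval": interval,
            "FailureThreshold": threshold,
        },
    }


def detection_seconds(interval, threshold):
    """Approximate time before the endpoint is marked unhealthy."""
    return interval * threshold


print(detection_seconds(30, 3))  # defaults: 90
print(detection_seconds(10, 3))  # tightened: 30

# Creating the check (requires boto3 and credentials):
# import boto3
# boto3.client("route53").create_health_check(
#     **health_check_request("api.example.com", interval=10))
```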

2. Cross-region database replica

Most failover plans that fail in practice fail here. The application failover works, but the database exists only in the primary region. For Amazon RDS, create a cross-region read replica in your secondary region. When the primary region goes down, promote the read replica to become the new primary. Promotion takes 5-10 minutes when done manually, or it can be automated with Lambda.
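The promotion step can be scripted so that it works the same from a runbook, a Lambda handler, or a drill. This is a minimal sketch, assuming a boto3 RDS client for the secondary region is passed in; the function name `promote_replica` and the identifiers in the usage comment are hypothetical.

```python
def promote_replica(rds_client, replica_id, wait=True):
    """Promote a cross-region read replica and report whether it is now standalone."""
    rds_client.promote_read_replica(DBInstanceIdentifier=replica_id)

    if wait:
        # Blocks until the instance reports "available" again. Promotion
        # involves a reboot, so this can take several minutes.
        waiter = rds_client.get_waiter("db_instance_available")
        waiter.wait(DBInstanceIdentifier=replica_id)

    response = rds_client.describe_db_instances(DBInstanceIdentifier=replica_id)
    instance = response["DBInstances"][0]
    # A promoted instance no longer lists a replication source.
    return not instance.get("ReadReplicaSourceDBInstanceIdentifier")


# Usage (requires boto3 and credentials; region and name are placeholders):
# import boto3
# rds = boto3.client("rds", region_name="eu-west-1")
# promote_replica(rds, "orders-replica")
```

Passing the client in, rather than creating it inside the function, keeps the region choice explicit and lets the function be exercised with a stub in tests. After promotion the application still needs to be pointed at the new endpoint, which this sketch does not cover.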

Multi-AZ is not the same as multi-region. Multi-AZ protects against a single availability zone failing within one region. A regional outage takes down all availability zones in that region simultaneously.

3. A failover drill on the calendar

A failover plan that has never been tested is not a failover plan. Run a quarterly drill where you deliberately shift traffic to your secondary region and verify that application functionality works, latency is acceptable from the secondary, database writes are handled correctly, and monitoring alerts fire as expected from the new region. The AWS UAE outage revealed that many teams had theoretical failover plans that broke on first contact with reality.
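A drill is easier to repeat when the verification steps are written down as code rather than remembered. The sketch below is one possible shape: each check is a named callable that returns True or False, and the runner reports which ones failed. The name `run_drill` and the check names in the usage example are hypothetical; the real checks would call your own endpoints and monitoring.

```python
from typing import Callable, Dict, List


def run_drill(checks: Dict[str, Callable[[], bool]]) -> List[str]:
    """Run every check and return the names of those that failed."""
    failed = []
    for name, check in checks.items():
        try:
            ok = bool(check())
        except Exception:
            # A check that crashes counts as a failure, not as a skipped step.
            ok = False
        if not ok:
            failed.append(name)
    return failed


# Example with placeholder checks standing in for real probes:
failures = run_drill({
    "application responds from secondary": lambda: True,
    "latency within budget": lambda: True,
    "database accepts writes": lambda: False,
    "alerts fire from new region": lambda: True,
})
print(failures)  # ['database accepts writes']
```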

RTO and RPO: Define These Before an Outage

RTO (Recovery Time Objective) is how long your system can be offline before it causes serious business impact. RPO (Recovery Point Objective) is how much data loss you can tolerate, measured as time since the last successful sync or backup.

If your RTO is 1 hour, active-passive with manual database failover is fine. If your RTO is 5 minutes, you need automated DNS failover plus automated replica promotion, or a database that is already writable in the secondary region. If your RTO is under 1 minute, you need an active-active architecture.
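Those three cases can be written down as a small decision helper, which is mostly useful as a prompt to put real numbers on the table. The article only gives three reference points (under 1 minute, 5 minutes, 1 hour), so the exact cutoffs between them in this sketch are an assumption, as are the function names.

```python
def recommend_architecture(rto_seconds: float) -> str:
    """Map an RTO to the architecture tiers discussed above (cutoffs are assumed)."""
    if rto_seconds < 60:
        return "active-active"
    if rto_seconds < 3600:
        return "active-passive with automated failover"
    return "active-passive with manual database failover"


def meets_rpo(replica_lag_seconds: float, rpo_seconds: float) -> bool:
    """Data written during the current replica lag is what a failover could lose."""
    return replica_lag_seconds <= rpo_seconds


print(recommend_architecture(30))    # active-active
print(recommend_architecture(300))   # active-passive with automated failover
print(recommend_architecture(3600))  # active-passive with manual database failover
print(meets_rpo(replica_lag_seconds=12, rpo_seconds=60))  # True
```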

Most startups have never defined these numbers. The March 2026 outage is a good forcing function.

A Checklist for AWS Developers

  • Route 53 health check on all production endpoints
  • Health check interval set to 10 seconds, failure threshold at 3
  • Cross-region RDS read replica in your secondary region
  • Runbook documented for promoting the read replica to primary
  • Static assets served from CloudFront, not region-specific storage
  • Automated failover (Lambda or Step Functions) tested in staging
  • Failover drill scheduled within the next 90 days

For background on the March 2026 outage, see AWS UAE Data Centre Hit in March 2026. For how the underlying cable infrastructure affects regional connectivity, see What Happens When an Undersea Cable Is Cut.

Key Takeaways

  • 90 seconds — default Route 53 time to begin DNS failover if health check settings are left at defaults
  • 5-10 minutes — time to promote an RDS cross-region read replica to primary
  • Multi-AZ is not multi-region — the most common misconception in cloud disaster recovery planning
  • RTO and RPO — define these numbers before an outage, not during one; they determine which architecture you actually need
  • For developers: run a failover drill this quarter. If your secondary region has never served real traffic, you do not know whether your failover actually works
  • What to watch: AWS Global Resilience (announced late 2025) automates multi-region failover without manual DNS changes — watch for general availability and evaluate it against Route 53 health check-based approaches

