What is the difference between RTO and RPO in cloud architecture?

RTO is Recovery Time Objective: how long a system can be offline before the business impact becomes unacceptable. RPO is Recovery Point Objective: how much data loss is acceptable, expressed as the time elapsed since the last successful backup or sync. An RTO of one hour means up to 60 minutes of downtime is tolerable. An RPO of five minutes means no more than five minutes of transactions can be lost. Both numbers should be defined before an outage, not during one.
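As a rough illustration of how the two objectives are measured, the sketch below compares one incident against an RTO of one hour and an RPO of five minutes. The timestamps and the names `Objectives` and `evaluate_incident` are hypothetical and chosen only for this example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Objectives:
    rto: timedelta  # maximum tolerable downtime
    rpo: timedelta  # maximum tolerable window of lost data


def evaluate_incident(objectives, last_sync, outage_start, service_restored):
    """Compare one incident's downtime and data-loss window to the objectives."""
    downtime = service_restored - outage_start
    # Anything written after the last successful sync is lost with the primary.
    data_loss_window = outage_start - last_sync
    return {
        "downtime": downtime,
        "data_loss_window": data_loss_window,
        "rto_met": downtime <= objectives.rto,
        "rpo_met": data_loss_window <= objectives.rpo,
    }


# Hypothetical incident: last sync 3 minutes before the outage, 45 minutes down.
objectives = Objectives(rto=timedelta(hours=1), rpo=timedelta(minutes=5))
result = evaluate_incident(
    objectives,
    last_sync=datetime(2026, 1, 15, 9, 57),
    outage_start=datetime(2026, 1, 15, 10, 0),
    service_restored=datetime(2026, 1, 15, 10, 45),
)
print("RTO met:", result["rto_met"], "| RPO met:", result["rpo_met"])
```

In this made-up case both objectives are met: 45 minutes of downtime against a 60-minute RTO, and a 3-minute data-loss window against a 5-minute RPO.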

What is the difference between multi-AZ and multi-region on AWS?

Multi-AZ deploys infrastructure across multiple availability zones within a single AWS region, protecting against the failure of one availability zone (one or more data centers) but not a full regional outage. When an entire region goes offline, all availability zones in that region are affected simultaneously. Multi-region deploys infrastructure in two or more geographically separate regions, which is the only configuration that protects against events like the March 2026 AWS UAE outage.

How do I set up automatic failover on AWS?

Configure Route 53 health checks on your primary region endpoints and create failover routing records that point to a secondary region. Set the health check request interval to 10 seconds and the failure threshold to 3 consecutive failures. Lower the DNS TTL on your records to 60 seconds or less so that resolvers pick up the change quickly. Also create a cross-region RDS read replica in the secondary region and document the promotion steps in a runbook.
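A minimal sketch of the failover records described above, written as a plain Python dictionary in the shape of a Route 53 `ChangeBatch`. The domain name, IP addresses, and health check ID are placeholders, and the helper `failover_change_batch` is a hypothetical name. Real deployments often point alias records at load balancers rather than at fixed IPs.

```python
def failover_change_batch(name, primary_ip, secondary_ip, health_check_id, ttl=60):
    """Build a ChangeBatch with one PRIMARY and one SECONDARY failover A record."""

    def record(role, ip, extra):
        record_set = {
            "Name": name,
            "Type": "A",
            "SetIdentifier": f"{role.lower()}-region",
            "Failover": role,  # "PRIMARY" or "SECONDARY"
            "TTL": ttl,        # keep low so resolvers re-query quickly
            "ResourceRecords": [{"Value": ip}],
        }
        record_set.update(extra)
        return {"Action": "UPSERT", "ResourceRecordSet": record_set}

    return {
        "Comment": "Active-passive failover between two regions",
        "Changes": [
            # The health check on the primary record is what triggers failover.
            record("PRIMARY", primary_ip, {"HealthCheckId": health_check_id}),
            record("SECONDARY", secondary_ip, {}),
        ],
    }


# Placeholder values from documentation address ranges; substitute your own.
batch = failover_change_batch(
    name="api.example.com.",
    primary_ip="192.0.2.10",
    secondary_ip="198.51.100.10",
    health_check_id="placeholder-health-check-id",
)
for change in batch["Changes"]:
    rrset = change["ResourceRecordSet"]
    print(rrset["Failover"], rrset["ResourceRecords"][0]["Value"], "TTL", rrset["TTL"])
```

With boto3 installed and credentials configured, a dictionary like this could be passed as the `ChangeBatch` argument of the Route 53 client's `change_resource_record_sets` call, together with your hosted zone ID.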

Should most startups use active-active or active-passive architecture?

Most startups should use active-passive. Active-active costs roughly 2x the infrastructure because both regions run full production capacity simultaneously. Active-passive keeps a smaller standby region and fails over in 2 to 10 minutes with proper Route 53 health checks configured. For most web applications and APIs, 2 to 10 minutes of downtime during a regional outage is acceptable and significantly cheaper than running active-active at all times.


Your Cloud Region Just Went Offline: What Happens Next

Abhishek Gautam · March 8, 2026 · 8 min read

Quick summary

AWS UAE went offline in March 2026. Most developers had no tested failover. Here is the complete guide: RTO, RPO, active-passive vs active-active.

On March 1, 2026, AWS UAE went partially offline after Iranian drone strikes hit the data center. Most developers running workloads in that region had no tested failover plan. Here is what should have been in place — and what to build before the next regional outage.

What Actually Happens When a Cloud Region Goes Down

A cloud region outage follows a predictable sequence that most developers do not think through until they are inside it:

Power or connectivity to specific availability zones fails
AWS status dashboard shows degraded status for the region
DNS resolvers still return the same IP addresses for your services
Your application starts failing because those IPs are unreachable
If no automation exists, you are now making decisions under pressure with incomplete information

Without prior planning, teams spend the first 30 minutes discovering the scope and the next hour deciding what to do. With automation in place, the response can complete within minutes.

Active-Active vs Active-Passive

Active-active means your application runs simultaneously in two or more regions at all times. Traffic is distributed between them. When a region goes offline, its traffic shifts to the other regions immediately with no manual step.

Active-passive means your primary region handles all traffic and a secondary region sits on warm standby. When the primary fails, DNS or load balancer settings update to direct traffic to the secondary. There is a gap between the failure and the flip.

	Active-Active	Active-Passive
Cost	~2x infrastructure	~1.2x (standby is smaller)
Failover time	Near-zero (automatic)	2-10 minutes
Complexity	High	Medium
RTO achievable	Seconds	Minutes
Best for	Payments, real-time apps	Most APIs and web apps

For most web applications, active-passive is the right starting point. It is significantly cheaper and simpler. Active-active is necessary when your business cannot tolerate even 2-10 minutes of downtime during a regional event.

The Three Things That Must Be Ready Before Failover

1. Route 53 health checks configured

Route 53 health checks monitor your primary region endpoints. When consecutive checks fail within the threshold you set, Route 53 automatically serves the secondary region DNS record instead. Two settings matter most: failure threshold (how many consecutive failures before Route 53 marks the endpoint unhealthy) and health check interval (10 or 30 seconds). The default is a failure threshold of 3 checks at 30-second intervals, which means roughly 90 seconds before Route 53 starts answering with the failover record. Clients keep using cached answers until the record TTL expires, so the TTL adds to that figure. For payment flows, tighten this to 3 checks at 10-second intervals.
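The timing above is simple arithmetic, sketched here under simplifying assumptions: detection time is treated as interval times threshold, and the DNS TTL is added as the longest a resolver may keep serving the old answer. The health check fields use the names from the Route 53 `HealthCheckConfig` structure. The domain, path, and 60-second TTL are placeholder values.

```python
# Hypothetical health check settings; the domain and path are placeholders.
health_check_config = {
    "Type": "HTTPS",
    "FullyQualifiedDomainName": "primary.example.com",
    "Port": 443,
    "ResourcePath": "/health",
    "RequestInterval": 10,  # seconds; Route 53 accepts 10 or 30
    "FailureThreshold": 3,  # consecutive failures before "unhealthy"
}


def detection_seconds(request_interval, failure_threshold):
    """Approximate time for consecutive failed checks to mark the endpoint unhealthy."""
    return request_interval * failure_threshold


def worst_case_failover_seconds(request_interval, failure_threshold, dns_ttl):
    """Detection time plus the longest a resolver may keep serving the old record."""
    return detection_seconds(request_interval, failure_threshold) + dns_ttl


DNS_TTL = 60  # seconds; assumed record TTL

scenarios = (
    ("defaults", 30, 3),
    (
        "tightened",
        health_check_config["RequestInterval"],
        health_check_config["FailureThreshold"],
    ),
)
for label, interval, threshold in scenarios:
    detect = detection_seconds(interval, threshold)
    total = worst_case_failover_seconds(interval, threshold, DNS_TTL)
    print(f"{label}: detected in ~{detect}s, clients switched within ~{total}s")
```

Under these assumptions the defaults give about 90 seconds to detect and 150 seconds until clients have switched, while the tightened settings give about 30 and 90 seconds. Actual behavior depends on how resolvers and clients honor the TTL.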

2. Cross-region database replica

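Most failover plans that fail in practice fail here. The application failover works, but the database is only in the primary region. For AWS RDS, create a cross-region read replica in your secondary region. When the primary region goes down, promote the read replica to become the new primary. This takes 5-10 minutes manually or can be automated with Lambda. Cross-region replication is asynchronous, so any writes that had not reached the replica when the primary failed are lost. That gap is what your RPO has to cover. Promotion is also one-way: the promoted instance stops replicating from the old primary.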
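The promotion step might be scripted roughly as follows. This is a sketch, not a complete runbook: the function takes an RDS client as a parameter (for example one created with boto3 in the secondary region), and the name `promote_replica` and the instance identifier are hypothetical. It uses the `promote_read_replica` call, the `db_instance_available` waiter, and `describe_db_instances`.

```python
def promote_replica(rds_client, replica_id):
    """Promote a cross-region read replica and return its endpoint address.

    rds_client is expected to behave like a boto3 RDS client created in the
    secondary region. Caveat: the instance can still report "available" for a
    moment right after the promote call, so a production script should also
    confirm that the instance no longer lists a replication source.
    """
    # Start the promotion; the replica becomes a standalone, writable instance.
    rds_client.promote_read_replica(DBInstanceIdentifier=replica_id)

    # Block until RDS reports the instance as available again.
    waiter = rds_client.get_waiter("db_instance_available")
    waiter.wait(DBInstanceIdentifier=replica_id)

    # Look up the endpoint the application should now write to.
    description = rds_client.describe_db_instances(DBInstanceIdentifier=replica_id)
    return description["DBInstances"][0]["Endpoint"]["Address"]
```

A Lambda handler or an operator's shell session could call this with a real client and then update the application's database connection setting to the returned address. How that setting is stored (parameter store, environment variable, DNS name) is left open here.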

Multi-AZ is not the same as multi-region. Multi-AZ protects against a single availability zone failing within one region. A regional outage takes down all availability zones in that region simultaneously.

3. A failover drill on the calendar

A failover plan that has never been tested is not a failover plan. Run a quarterly drill where you deliberately shift traffic to your secondary region and verify that application functionality works, latency is acceptable from the secondary, database writes are handled correctly, and monitoring alerts fire as expected from the new region. The AWS UAE outage revealed that many teams had theoretical failover plans that broke on first contact with reality.
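One way to keep a drill honest is to record each verification as an explicit pass or fail. The harness below is a hypothetical sketch: the check names mirror the four items above, and each lambda is a stand-in that you would replace with a real probe against your secondary region.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class DrillCheck:
    name: str
    run: Callable[[], bool]  # returns True when the check passes


def run_drill(checks):
    """Run every check and record pass/fail; an exception counts as a failure."""
    results = {}
    for check in checks:
        try:
            results[check.name] = bool(check.run())
        except Exception:
            results[check.name] = False
    return results


def drill_passed(results):
    """The drill passes only if every individual check passed."""
    return all(results.values())


# Stand-in checks with hard-coded outcomes, for illustration only.
checks = [
    DrillCheck("application responds from secondary", lambda: True),
    DrillCheck("latency from secondary within budget", lambda: True),
    DrillCheck("database writes succeed in secondary", lambda: True),
    DrillCheck("monitoring alerts fire from secondary", lambda: False),
]

results = run_drill(checks)
for name, ok in results.items():
    print("PASS" if ok else "FAIL", "-", name)
print("drill passed:", drill_passed(results))
```

Treating an exception as a failure is a deliberate choice: during a drill, a probe that cannot even run is as informative as one that returns a bad result.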

RTO and RPO: Define These Before an Outage

RTO (Recovery Time Objective) is how long your system can be offline before it causes serious business impact. RPO (Recovery Point Objective) is how much data loss you can tolerate, measured as time since the last successful sync or backup.

If your RTO is 1 hour, active-passive with manual database failover is fine. If your RTO is 5 minutes, you need automated DNS failover and automated replica promotion, because a manual promotion alone can use up the whole budget. If your RTO is under 1 minute, you need active-active architecture.
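The rule of thumb above can be written down as a small lookup. The three reference points (1 hour, 5 minutes, under 1 minute) come from the text. Where exactly the boundaries fall between them is an assumption made for this sketch, and `architecture_for_rto` is a hypothetical helper name.

```python
from datetime import timedelta

# Assumed cut-offs: the text gives three reference points, not exact boundaries.
ACTIVE_ACTIVE_BELOW = timedelta(minutes=1)
MANUAL_OK_FROM = timedelta(hours=1)


def architecture_for_rto(rto):
    """Suggest a starting architecture for a given Recovery Time Objective."""
    if rto < ACTIVE_ACTIVE_BELOW:
        return "active-active"
    if rto < MANUAL_OK_FROM:
        return "active-passive, automated failover"
    return "active-passive, manual database failover"


for rto in (timedelta(seconds=30), timedelta(minutes=5), timedelta(hours=1)):
    print(rto, "->", architecture_for_rto(rto))
```

Placing everything between one minute and one hour in the automated tier is the conservative reading. A team with a 45-minute RTO and a well-rehearsed manual runbook might reasonably choose differently.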

Most startups have never defined these numbers. The March 2026 outage is a good forcing function.

A Checklist for AWS Developers

Route 53 health check on all production endpoints
Health check interval set to 10 seconds, failure threshold at 3
Cross-region RDS read replica in your secondary region
Runbook documented for promoting the read replica to primary
Static assets served from CloudFront, not region-specific storage
Lambda- or Step Functions-based automated failover tested in staging
Failover drill scheduled within next 90 days

For background on the March 2026 outage, see AWS UAE Data Centre Hit in March 2026. For how the underlying cable infrastructure affects regional connectivity, see What Happens When an Undersea Cable Is Cut.

Key Takeaways

90 seconds — default Route 53 time to begin DNS failover if health check settings are left at defaults
5-10 minutes — time to promote an RDS cross-region read replica to primary
Multi-AZ is not multi-region — the most common misconception in cloud disaster recovery planning
RTO and RPO — define these numbers before an outage, not during one; they determine which architecture you actually need
For developers: run a failover drill this quarter. If your secondary region has never served real traffic, you do not know whether your failover actually works
What to watch: AWS Global Resilience (announced late 2025) automates multi-region failover without manual DNS changes — watch for general availability and evaluate it against Route 53 health check-based approaches