Reliability And Resilience · Chapter 49 of 51

Multi Region Active Active

Akhil Sharma


Multi-Region Active-Active Architecture (When One Region Isn't Enough)

Audience: platform architects and infrastructure engineers designing globally distributed systems.

This article assumes:

Your users are geographically distributed across continents.
Regional failures happen: AWS us-east-1 goes down, Azure has outages, natural disasters strike data centers.
Network latency across continents is physics-limited (speed of light).
Data consistency across regions is hard - CAP theorem is real.

Challenge: Your "multi-region" architecture fails the regional outage test

Scenario

You proudly announce: "We're multi-region! We have servers in US and EU."

Then AWS us-east-1 has a major outage.

Reality check:

Your US database was the "primary"
EU database was a "read replica"
EU can serve reads but not writes
No customer, in either region, can create orders, update profiles, or process payments
Failover requires 45 minutes of manual work

This is multi-region, but it's not active-active.

Interactive question (pause and think)

What's the difference between:

Multi-region deployment (servers in multiple regions)
Multi-region active-passive (failover capability)
Multi-region active-active (simultaneous operation)

Take 10 seconds.

Progressive reveal (question -> think -> answer)

Answer: They represent increasing levels of availability and complexity.

Multi-region deployment: Geographic distribution, but no failover plan
Active-passive: One region serves traffic, other is hot standby
Active-active: All regions serve traffic simultaneously, survive regional failures transparently

Real-world analogy (fire stations)

Imagine a city with two fire stations:

Active-passive: Fire Station A handles all calls. Fire Station B is staffed but idle and only activates if Station A burns down.

Active-active: Both stations handle calls simultaneously. If Station A is destroyed, Station B already handles 50% of the city. The impact is degradation, not catastrophe.

Key insight box

Active-active means every region can independently serve reads AND writes, with data eventually synchronized across regions.

Challenge question

If active-active is so much better, why doesn't everyone do it? What's the hidden cost?

Mental model - Active-active is a distributed systems problem, not a deployment problem

Scenario

Your VP says: "Just deploy to multiple regions. That's active-active."

You know it's not that simple.

Interactive question (pause and think)

You deploy identical application servers to US and EU regions. Which statement is true?

A. "We're now active-active."
B. "We need to solve data synchronization first."
C. "We need to solve routing first."
D. "B and C, plus conflict resolution, monitoring, testing..."

Progressive reveal

Answer: D.

Active-active isn't about deployment - it's about solving distributed data problems.

Mental model

Think of active-active as:

Distributed consensus problem: How do regions agree on state?
CAP theorem trade-off: Partitions between regions will happen, so during one you choose between consistency and availability
Eventual consistency game: Accept that regions will temporarily disagree

The challenges:

Data layer: How to keep databases synchronized
Routing layer: How to direct users to the right region
Conflict resolution: What happens when both regions modify the same data
Failure detection: How to know when a region is down
Operational complexity: How to deploy, test, and debug across regions

Real-world parallel (bank branches)

Two bank branches in different cities (US and EU):

The problem: Customer withdraws $100 in US branch, simultaneously tries to withdraw $100 in EU branch. Account has $100 balance.

Bad solution: Both branches allow withdrawal (now -$100, data inconsistency).

Active-passive solution: Only US branch can approve withdrawals (EU can't help if US is down).

Active-active solution: Both can approve, but they synchronize and detect conflicts. Maybe allow temporary overdraft, reconcile later.

Key insight box

Active-active is 20% deployment and 80% distributed data correctness. If you only solve deployment, you'll have data corruption at scale.

Challenge question

Can you have active-active for reads but active-passive for writes? Would that be useful?

Understanding the CAP theorem trade-offs in multi-region

Scenario

You're designing active-active architecture. Your database vendor says "Choose: strong consistency or high availability."

You want both. You can't have both (during network partitions).

CAP theorem reminder

C (Consistency): Every read sees the most recent write
A (Availability): Every request to a working node gets a non-error response
P (Partition tolerance): The system keeps working despite network failures between nodes

The theorem: During a network partition (P), you must choose between C and A.
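
The choice can be made concrete with a toy replica. This is a minimal sketch rather than any database's real behavior: the `Replica` class, its `mode` flag, and the `partitioned` attribute are hypothetical names invented for illustration.

```python
class Unavailable(Exception):
    """Raised when a CP replica refuses to answer during a partition."""


class Replica:
    def __init__(self, mode):
        assert mode in ("CP", "AP")
        self.mode = mode
        self.value = None
        self.partitioned = False  # True while the peer region is unreachable
        self.pending = []         # writes accepted locally but not yet replicated

    def write(self, value):
        if self.partitioned and self.mode == "CP":
            # Consistency first: refuse the write rather than diverge from the peer.
            raise Unavailable("cannot reach peer region")
        self.value = value
        if self.partitioned:
            # Availability first: accept now, replicate once the partition heals.
            self.pending.append(value)
        return "ok"


cp, ap = Replica("CP"), Replica("AP")
cp.partitioned = ap.partitioned = True

ap_result = ap.write("order-1")
try:
    cp.write("order-1")
    cp_result = "ok"
except Unavailable:
    cp_result = "rejected"
```

During the partition the AP replica keeps taking orders and queues them for later replication, while the CP replica turns the customer away. Neither is wrong; they fail in different ways.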

Visual: CAP trade-offs in multi-region

Multi-region consistency patterns

1. Strong consistency (CP)

2. Eventual consistency (AP)
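
One simple way regions can converge under eventual consistency is last-write-wins (LWW). The sketch below is an illustration under stated assumptions: the `Versioned` type and `lww_merge` function are hypothetical, and it assumes region clocks are roughly synchronized, which real deployments cannot take for granted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Versioned:
    value: str
    timestamp: float  # wall-clock write time in seconds (assumes roughly synced clocks)
    region: str       # tie-breaker so every region picks the same winner


def lww_merge(a, b):
    """Return the winning version; the result does not depend on argument order."""
    return max(a, b, key=lambda v: (v.timestamp, v.region))


us_write = Versioned("Alice Smith", 100.0, "us")
eu_write = Versioned("Alice Jones", 100.5, "eu")
winner = lww_merge(us_write, eu_write)
```

Both regions apply the same merge to the same pair of writes, so they end up with the same value. The cost is that the losing write is silently discarded, which is acceptable for a profile field but not for an account balance.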

3. Read-your-writes consistency
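
Read-your-writes can be approximated by having each session remember the version it last wrote and refusing to read anything older. This is a minimal sketch; `RegionStore` and `Session` are hypothetical classes, and replication is simulated by a manual `put`.

```python
class RegionStore:
    def __init__(self, name):
        self.name = name
        self.data = {}  # key -> (version, value)

    def put(self, key, version, value):
        self.data[key] = (version, value)

    def get(self, key):
        return self.data.get(key, (0, None))


class Session:
    """Remembers the highest version this user has written, per key."""

    def __init__(self):
        self.last_written = {}  # key -> (version, store that took the write)

    def write(self, store, key, value):
        version = store.get(key)[0] + 1
        store.put(key, version, value)
        self.last_written[key] = (version, store)

    def read(self, local, key):
        version, value = local.get(key)
        needed, origin = self.last_written.get(key, (0, local))
        if version >= needed:
            return value           # local replica has caught up with this user's write
        return origin.get(key)[1]  # otherwise read from the region that took the write


us, eu = RegionStore("us"), RegionStore("eu")
alice = Session()
alice.write(us, "cart", ["book"])
before_replication = alice.read(eu, "cart")  # EU is stale, so the read falls back to US
other_user = Session().read(eu, "cart")      # other sessions may still see stale data
eu.put("cart", 1, ["book"])                  # simulated replication arrives
after_replication = alice.read(eu, "cart")
```

Only the writer is guaranteed to see the write immediately; everyone else still gets eventual consistency, which is why this model suits per-user data such as a cart.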

Interactive question

You're building a global e-commerce platform. Which consistency model for each feature?

A. Product catalog: Strong / Eventual / Read-your-writes
B. Inventory count: Strong / Eventual / Read-your-writes
C. Shopping cart: Strong / Eventual / Read-your-writes
D. Payment processing: Strong / Eventual / Read-your-writes

Progressive reveal

Answer:

Product catalog: Eventual (stale product descriptions are tolerable)
Inventory count: Strong (can't oversell)
Shopping cart: Read-your-writes (user's own cart must be consistent)
Payment processing: Strong (no duplicate charges)

Key insight box

There's no "one size fits all" consistency model. Active-active requires choosing the right model per feature based on business requirements.

Challenge question

If you choose eventual consistency, how do you handle conflicts when both regions modify the same record simultaneously?

Core components of active-active architecture

Scenario

You're building active-active. What are the actual building blocks?

Architecture layers

Data tier patterns (the heart of active-active)

Pattern 1: Multi-master database replication

Pattern 2: Distributed databases (Cassandra, DynamoDB)

Pattern 3: Sharding by geography
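
With geographic sharding, every record has exactly one home region that accepts its writes, so two regions never write the same record concurrently. The sketch below shows the lookup; the `HOME_REGION` table, the default region, and both function names are hypothetical.

```python
# Hypothetical mapping from a user's country to the region that owns their data.
HOME_REGION = {"DE": "eu", "FR": "eu", "US": "us", "CA": "us"}
DEFAULT_REGION = "us"


def home_region(country_code):
    """Return the region that owns data for users from this country."""
    return HOME_REGION.get(country_code.upper(), DEFAULT_REGION)


def route_write(country_code, serving_region):
    """Return (region that must execute the write, whether the call crosses regions)."""
    owner = home_region(country_code)
    return owner, owner != serving_region


german_user_in_us = route_write("de", serving_region="us")
us_user_in_us = route_write("US", serving_region="us")
```

The trade-off is visible in the second element of the result: a German user served from the US region pays a cross-region round trip on every write.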

Application-level conflict resolution
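
One family of techniques for application-level conflict resolution is CRDTs, data types whose merge operation cannot conflict. The sketch below is a grow-only counter (G-Counter), the simplest CRDT; the class is written from scratch for illustration and is not taken from any library.

```python
class GCounter:
    """Grow-only counter: each region increments only its own slot."""

    def __init__(self, region):
        self.region = region
        self.counts = {}  # region name -> count contributed by that region

    def increment(self, n=1):
        self.counts[self.region] = self.counts.get(self.region, 0) + n

    def merge(self, other):
        # Element-wise max is commutative, associative, and idempotent,
        # so merges can arrive in any order, or more than once.
        for region, count in other.counts.items():
            self.counts[region] = max(self.counts.get(region, 0), count)

    def value(self):
        return sum(self.counts.values())


us, eu = GCounter("us"), GCounter("eu")
us.increment(3)   # three likes recorded in the US
eu.increment(2)   # two likes recorded in the EU at the same time
us.merge(eu)
eu.merge(us)
us.merge(eu)      # a duplicate merge changes nothing
```

Both regions converge on the same total without coordinating and without losing either side's increments, which last-write-wins cannot offer for counters.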

Key insight box

The data tier is where active-active lives or dies. Application tier is easy (stateless). Data tier requires careful consistency modeling.

Challenge question

How do you test active-active data synchronization? Simulating network partitions between regions is hard.

Routing users to regions - DNS, Anycast, or Application?

Scenario

You have servers in US and EU. User in New York makes a request. Where should it go?

Think about it

What's the goal?

Lowest latency (nearest region)
Load balancing (evenly distribute)
Failover (route away from unhealthy region)

Interactive question (pause and think)

Which routing approach gives you all three?

A. DNS-based (Route53 with latency-based routing)
B. Anycast (BGP announces same IP from multiple regions)
C. Application-level (smart client chooses region)
D. None - trade-offs required

Progressive reveal

Answer: D - each approach has trade-offs.

Routing approaches comparison

DNS-based routing

Anycast routing

Application-level routing
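
In application-level routing the client itself measures each region and orders them by preference. The sketch below assumes the client already has latency samples; the region names and numbers are made up, and `rank_regions` is a hypothetical helper.

```python
def rank_regions(latencies_ms, unhealthy=frozenset()):
    """Return healthy regions ordered from lowest to highest measured latency."""
    healthy = {r: ms for r, ms in latencies_ms.items() if r not in unhealthy}
    return sorted(healthy, key=healthy.get)


# Made-up latency samples a client might have collected in the background.
measured = {"us-east": 18.0, "eu-west": 92.0, "ap-south": 210.0}

preferred = rank_regions(measured)
after_outage = rank_regions(measured, unhealthy={"us-east"})
```

Keeping the full ranking, not just the winner, gives the client an ordered fallback list to use when its first choice stops responding.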

Hybrid approach (common in practice)

Production insight: How Netflix routes globally

Netflix uses:

DNS returns multiple region IPs
Client device tests each region (background)
Client chooses best region based on throughput test
Caches choice for session duration
Automatically retries other regions on failure

Key insight box

No single routing approach is perfect. Most production systems use DNS for coarse routing + application logic for fine-grained decisions.

Challenge question

User is in New York but their data is sharded to EU region (data residency). Should routing send them to US (low latency) or EU (where their data lives)?

Handling regional failures - detection and recovery

Scenario

Your EU region goes down. How quickly can you detect it and route traffic away?

Failure detection challenges

Challenge 1: What does "down" mean?

Challenge 2: Distributed failure detection
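
A single probe cannot tell "the region is down" apart from "my network path to the region is down". One common mitigation is to probe from several vantage points and require a majority to agree. The sketch below assumes probe results are already collected; `region_is_down` and the vantage point names are hypothetical.

```python
def region_is_down(probe_results, quorum=None):
    """probe_results maps a vantage point to True if its probe succeeded."""
    failures = sum(1 for ok in probe_results.values() if not ok)
    if quorum is None:
        quorum = len(probe_results) // 2 + 1  # strict majority of vantage points
    return failures >= quorum


# One prober has lost its own connectivity: not enough evidence to fail over.
one_bad_path = region_is_down({"us-probe": True, "asia-probe": True, "sa-probe": False})

# Most probers agree the region is unreachable.
real_outage = region_is_down({"us-probe": False, "asia-probe": False, "sa-probe": True})
```

Requiring agreement slows detection slightly but avoids failing over a healthy region because of one broken network path.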

Failover strategies

Strategy 1: DNS failover (slow)

Strategy 2: Anycast failover (fast)

Strategy 3: Application-level retry (fastest)
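
Application-level retry means the client moves to the next region as soon as a request fails, without waiting for DNS or BGP to converge. This is a minimal sketch: `call_with_failover` is a hypothetical helper, and `fake_send` stands in for a real network call.

```python
def call_with_failover(regions, send):
    """Try each region in order; return (region, response) from the first success."""
    errors = []
    for region in regions:
        try:
            return region, send(region)
        except ConnectionError as exc:
            errors.append(f"{region}: {exc}")
    raise ConnectionError("all regions failed: " + "; ".join(errors))


def fake_send(region):
    # Stand-in for a real request; here the EU region is simulated as unreachable.
    if region == "eu":
        raise ConnectionError("timed out")
    return "200 OK"


served_by, response = call_with_failover(["eu", "us"], fake_send)
```

Retrying a write in a second region is only safe if the operation is idempotent, for example by attaching a request ID that the server deduplicates on. Otherwise a request that succeeded but timed out on the way back may be applied twice.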

Failover decision matrix

Avoiding failover oscillation (flapping)
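
A standard way to damp flapping is hysteresis: require several consecutive failed checks before marking a region unhealthy, and more consecutive successes before trusting it again. The sketch below is illustrative; `HealthGate` and its default thresholds are assumptions, not recommended values.

```python
class HealthGate:
    def __init__(self, fail_after=3, recover_after=5):
        self.fail_after = fail_after        # consecutive failures before failing over
        self.recover_after = recover_after  # consecutive successes before failing back
        self.healthy = True
        self._streak = 0  # consecutive checks that disagree with the current state

    def observe(self, check_ok):
        """Record one health check result and return the damped health state."""
        if check_ok == self.healthy:
            self._streak = 0
        else:
            self._streak += 1
            threshold = self.fail_after if self.healthy else self.recover_after
            if self._streak >= threshold:
                self.healthy = check_ok
                self._streak = 0
        return self.healthy


gate = HealthGate(fail_after=3, recover_after=5)
checks = [False, True, False, False, False, True, True, True, True, True]
states = [gate.observe(ok) for ok in checks]
```

The isolated failure at the start is ignored, the run of three failures triggers failover, and the region is only readmitted after five clean checks in a row.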

Key insight box

Fast failure detection is worthless if you don't have fast recovery. Focus on application-level retries for the best user experience.

Challenge question

During failover, what happens to in-flight requests? Do they fail, get retried, or complete successfully?

The hidden costs of active-active

Scenario

Your CFO asks: "Why did our cloud bill double when we went active-active?"

You're running twice the infrastructure, but is that the only cost?

Cost dimensions

1. Infrastructure costs (obvious)

2. Data transfer costs (hidden)

3. Operational costs (often underestimated)

4. Performance costs (sometimes)

Cost-benefit analysis framework
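
A first-pass comparison sets the outage cost that active-active would avoid against what it costs to run. The sketch below is deliberately crude, and every number in it is a placeholder rather than data from a real system; the function names are hypothetical.

```python
def expected_annual_outage_cost(outages_per_year, hours_per_outage, cost_per_hour):
    """Expected yearly loss from regional outages under the current architecture."""
    return outages_per_year * hours_per_outage * cost_per_hour


def net_benefit(avoided_outage_cost, extra_annual_cost):
    """Positive means active-active pays for itself on outage cost alone."""
    return avoided_outage_cost - extra_annual_cost


# Placeholder inputs for illustration only.
avoided = expected_annual_outage_cost(
    outages_per_year=2, hours_per_outage=4, cost_per_hour=25_000
)
result = net_benefit(avoided, extra_annual_cost=300_000)
```

With these placeholder inputs the result is negative, so outage cost alone would not justify the spend. A fuller analysis would also weigh latency-driven revenue, compliance requirements, and the chance of an outage far worse than the average one.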

When NOT to use active-active

Key insight box

Active-active is expensive (money, complexity, operational overhead). Only use it when downtime costs exceed those expenses.

Challenge question

Your active-active setup costs $300K/year extra. Your biggest outage last year cost $50K. Should you downgrade to active-passive?

Final synthesis - Design your active-active architecture

Synthesis challenge

You're the architect for a global SaaS collaboration platform (think: Slack, Microsoft Teams).

Requirements:

10 million users across US (40%), EU (35%), Asia (25%)
Real-time messaging (< 100ms latency critical)
File storage (documents, images)
User presence (online/offline status)
Search (messages, files)
Compliance: EU data must stay in EU (GDPR)

Constraints:

Current: Single US region, experiencing EU latency complaints
Budget: Can double infrastructure costs
Team: 15 engineers, limited distributed systems expertise

Your tasks (pause and think)

Choose regions and topology (how many regions? which cloud regions?)
Define data residency strategy (what data goes where?)
Choose consistency model per feature (strong vs eventual)
Design routing strategy (how do users reach their region?)
Plan failover strategy (what happens when a region fails?)
Estimate costs and justify to CFO

Write down your architecture.

Progressive reveal (one possible solution)

1. Regions and topology:

2. Data residency strategy:

3. Consistency model per feature:

4. Routing strategy:

5. Failover strategy:

6. Cost estimate:

Key insight box

Active-active is a business decision, not just a technical one. The architecture must be justified by revenue, compliance, or risk mitigation.

Final challenge question

Your active-active architecture is live. Both US and EU regions write to the same workspace simultaneously (conflict). How do you resolve: last-write-wins, manual merge, or reject one write?

Appendix: Quick checklist (printable)

Active-active design checklist:

Choose regions based on user distribution (not just "US and EU")
Define data residency requirements (GDPR, data sovereignty)
Pick consistency model per feature (strong vs eventual)
Design conflict resolution strategy (LWW, CRDTs, manual)
Plan routing strategy (DNS, anycast, application-level)
Define failover thresholds and procedures
Estimate costs (infrastructure, data transfer, operational)

Data tier checklist:

Choose database technology (multi-master, distributed, sharded)
Configure replication (async vs sync)
Set replication lag SLOs (how stale is acceptable?)
Implement conflict detection and resolution
Plan schema migration across regions (coordinated rollout)
Test failure scenarios (partition, region down, lag spike)

Operational checklist:

Deploy monitoring per region (plus global aggregation)
Set up cross-region tracing (distributed traces)
Create runbooks for common failures
Practice failover (monthly Game Days)
Monitor replication lag (alert on thresholds)
Track cross-region data transfer costs

Testing checklist:

Test cross-region writes (verify replication)
Test conflict scenarios (simultaneous writes)
Test regional failures (simulate region down)
Test network partitions (split-brain scenarios)
Load test each region independently
Test failover time (measure actual vs target)

Cost optimization checklist:

Right-size regional capacity (based on actual traffic)
Optimize cross-region data transfer (compress, dedupe)
Use reserved instances (if traffic predictable)
Implement data lifecycle policies (archive old data)
Monitor cost per region (detect anomalies)

Red flags (reassess architecture):

Replication lag consistently > 10 seconds (data tier not scaling)
Cross-region transfer costs > 10% of infrastructure (over-replicating)
Frequent conflicts requiring manual resolution (sharding strategy wrong)
Failover takes > 5 minutes (routing strategy inadequate)
Operational overhead exceeds engineering capacity (too complex)
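
The three numeric red flags above are easy to check automatically. The sketch below encodes only the thresholds stated in the list; the `red_flags` function and its parameter names are hypothetical, and the inputs are assumed to come from your own monitoring.

```python
def red_flags(replication_lag_s, transfer_cost, infra_cost, failover_s):
    """Return the red flags triggered by the given measurements."""
    flags = []
    if replication_lag_s > 10:
        flags.append("replication lag > 10 s")
    if infra_cost > 0 and transfer_cost / infra_cost > 0.10:
        flags.append("cross-region transfer > 10% of infrastructure cost")
    if failover_s > 5 * 60:
        flags.append("failover > 5 minutes")
    return flags


lagging = red_flags(replication_lag_s=12, transfer_cost=5_000, infra_cost=100_000, failover_s=90)
costly_and_slow = red_flags(replication_lag_s=2, transfer_cost=15_000, infra_cost=100_000, failover_s=400)
```

The list says lag should be "consistently" above the threshold, so in practice the lag input would be a sustained measure such as a percentile over a window, not a single sample.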

Key Takeaways

Active-active serves traffic from multiple regions simultaneously — users are routed to the nearest region for lowest latency
Data replication across regions is the hardest challenge — asynchronous replication has lag, synchronous replication has latency
Conflict resolution is required when two regions write the same data — last-writer-wins, CRDTs, or application-level merge logic
DNS-based global load balancing routes users to the nearest healthy region — with automatic failover when a region goes down
Active-active is significantly more complex than active-passive — only worth the investment if your users are globally distributed and need low latency everywhere
