Audience: platform architects and infrastructure engineers designing globally distributed systems.
This article assumes:
You proudly announce: "We're multi-region! We have servers in US and EU."
Then AWS us-east-1 has a major outage.
Reality check:
This is multi-region, but it's not active-active.
What's the difference between:
Take 10 seconds.
Answer: They represent increasing levels of availability and complexity.
Imagine a city with two fire stations:
Active-passive: Fire Station A handles all calls. Fire Station B is staffed but idle and only activates if Station A burns down.
Active-active: Both stations handle calls simultaneously. If Station A is destroyed, Station B already handles 50% of the city. The impact is degradation, not catastrophe.
Active-active means every region can independently serve reads AND writes, with data eventually synchronized across regions.
If active-active is so much better, why doesn't everyone do it? What's the hidden cost?
Your VP says: "Just deploy to multiple regions. That's active-active."
You know it's not that simple.
You deploy identical application servers to US and EU regions. Which statement is true?
A. "We're now active-active."
B. "We need to solve data synchronization first."
C. "We need to solve routing first."
D. "B and C, plus conflict resolution, monitoring, testing..."
Answer: D.
Active-active isn't about deployment - it's about solving distributed data problems.
Think of active-active as:
The challenges:
Two bank branches in different cities (US and EU):
The problem: A customer withdraws $100 at the US branch and simultaneously tries to withdraw $100 at the EU branch. The account has a $100 balance.
Bad solution: Both branches allow withdrawal (now -$100, data inconsistency).
Active-passive solution: Only US branch can approve withdrawals (EU can't help if US is down).
Active-active solution: Both can approve, but they synchronize and detect conflicts. Maybe allow temporary overdraft, reconcile later.
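The active-active option can be sketched in a few lines. This is a toy model under stated assumptions, not a real banking API: each branch approves withdrawals against its own local copy of the balance, and a later reconciliation step merges both logs and flags the overdraft. The names `Branch`, `withdraw`, and `reconcile` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Branch:
    """One region's local view of an account (hypothetical model)."""
    name: str
    balance: int
    log: list = field(default_factory=list)

    def withdraw(self, amount: int) -> bool:
        # Approve using only local state; no cross-region call is made.
        if amount > self.balance:
            return False
        self.balance -= amount
        self.log.append(amount)
        return True


def reconcile(opening_balance: int, *branches: Branch) -> tuple[int, bool]:
    """Merge all branch logs; return (true balance, overdraft detected)."""
    total_withdrawn = sum(sum(b.log) for b in branches)
    true_balance = opening_balance - total_withdrawn
    return true_balance, true_balance < 0


us = Branch("US", balance=100)
eu = Branch("EU", balance=100)

# Both withdrawals succeed locally because neither branch has seen the other.
us_ok = us.withdraw(100)
eu_ok = eu.withdraw(100)

balance, overdrawn = reconcile(100, us, eu)
print(us_ok, eu_ok, balance, overdrawn)  # True True -100 True
```

The conflict is not prevented, only detected after the fact. Whether a temporary overdraft is acceptable is a business decision, not a technical one.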
Active-active is 20% deployment and 80% distributed data correctness. If you only solve deployment, you'll have data corruption at scale.
Can you have active-active for reads but active-passive for writes? Would that be useful?
You're designing active-active architecture. Your database vendor says "Choose: strong consistency or high availability."
You want both. You can't have both (during network partitions).
C (Consistency): Every read sees the most recent write.
A (Availability): Every request to a non-failing node gets a non-error response, though not necessarily the latest data.
P (Partition tolerance): The system keeps working despite network failures between nodes.
The theorem: During a network partition (P), you must choose between C and A.
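To make the trade-off concrete, here is a minimal sketch of a replica that behaves as either CP or AP while it is cut off from its peer. The class `RegionReplica` and its return strings are illustrative assumptions, not the behavior of any specific database.

```python
class RegionReplica:
    """Toy replica that behaves as CP or AP while partitioned (illustrative only)."""

    def __init__(self, mode: str):
        if mode not in ("CP", "AP"):
            raise ValueError("mode must be 'CP' or 'AP'")
        self.mode = mode
        self.partitioned = False
        self.data = {}
        self.pending = []  # writes accepted locally but not yet replicated

    def write(self, key: str, value: str) -> str:
        if not self.partitioned:
            self.data[key] = value
            return "replicated"
        if self.mode == "CP":
            # Give up availability: refuse the write rather than risk divergence.
            return "rejected"
        # Give up consistency: accept locally, replicate after the partition heals.
        self.data[key] = value
        self.pending.append((key, value))
        return "accepted-locally"


cp = RegionReplica("CP")
ap = RegionReplica("AP")
cp.partitioned = ap.partitioned = True

cp_result = cp.write("cart:42", "book")
ap_result = ap.write("cart:42", "book")
print(cp_result, ap_result)  # rejected accepted-locally
```

The AP replica's `pending` list is where conflicts are born: the peer region may have accepted a different write for the same key in the meantime.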
1. Strong consistency (CP)
2. Eventual consistency (AP)
3. Read-your-writes consistency
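Read-your-writes is the least familiar of the three, so a small sketch may help. It assumes a single version counter per store and a client session that remembers the highest version it has written. Real systems typically track this per key or per shard. All names here (`RegionStore`, `Session`, `read`, `write`) are hypothetical.

```python
class RegionStore:
    """Toy per-region store that tracks the last applied version."""

    def __init__(self, name: str):
        self.name = name
        self.version = 0
        self.data = {}

    def apply(self, version: int, key: str, value: str) -> None:
        self.data[key] = value
        self.version = version


class Session:
    """Remembers the highest version this client has written."""

    def __init__(self):
        self.last_written = 0


def write(session: Session, store: RegionStore, key: str, value: str) -> None:
    new_version = store.version + 1
    store.apply(new_version, key, value)
    session.last_written = new_version


def read(session: Session, key: str, nearest: RegionStore, fallback: RegionStore):
    # Use the nearest region only if it has caught up with this client's writes.
    store = nearest if nearest.version >= session.last_written else fallback
    return store.name, store.data.get(key)


us, eu = RegionStore("us"), RegionStore("eu")
session = Session()
write(session, us, "profile:name", "Ada")      # the write lands in US

before = read(session, "profile:name", nearest=eu, fallback=us)
eu.apply(us.version, "profile:name", "Ada")    # replication catches up
after = read(session, "profile:name", nearest=eu, fallback=us)
print(before, after)  # ('us', 'Ada') ('eu', 'Ada')
```

Other users may still see stale data from the EU store before replication completes. The guarantee covers only the client that made the write.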
You're building a global e-commerce platform. Which consistency model for each feature?
A. Product catalog: Strong / Eventual / Read-your-writes
B. Inventory count: Strong / Eventual / Read-your-writes
C. Shopping cart: Strong / Eventual / Read-your-writes
D. Payment processing: Strong / Eventual / Read-your-writes
Answer:
There's no "one size fits all" consistency model. Active-active requires choosing the right model per feature based on business requirements.
If you choose eventual consistency, how do you handle conflicts when both regions modify the same record simultaneously?
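Before a conflict can be handled, it has to be detected. One common technique is a version vector: each record carries a per-region counter, and two versions conflict when neither one dominates the other. The sketch below assumes version vectors are plain dictionaries keyed by region name. The function name `compare` is illustrative.

```python
def compare(a: dict, b: dict) -> str:
    """Compare two version vectors (region name -> write counter)."""
    regions = set(a) | set(b)
    a_ahead = any(a.get(r, 0) > b.get(r, 0) for r in regions)
    b_ahead = any(b.get(r, 0) > a.get(r, 0) for r in regions)
    if a_ahead and b_ahead:
        return "concurrent"  # neither saw the other's write: a true conflict
    if a_ahead:
        return "a_newer"
    if b_ahead:
        return "b_newer"
    return "equal"


# US and EU each modified the record without seeing the other's change.
print(compare({"us": 2, "eu": 1}, {"us": 1, "eu": 2}))  # concurrent
# EU's copy is simply behind; no conflict, just apply the newer version.
print(compare({"us": 2, "eu": 1}, {"us": 1, "eu": 1}))  # a_newer
```

Only the "concurrent" case needs a resolution policy. The other cases are ordinary replication lag.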
You're building active-active. What are the actual building blocks?
Pattern 1: Multi-master database replication
Pattern 2: Distributed databases (Cassandra, DynamoDB)
Pattern 3: Sharding by geography
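Of the three patterns, geographic sharding is the simplest to sketch: every user has one owning region, and writes that arrive elsewhere are forwarded to it. The directory `HOME_REGION` and the function `route_write` below are hypothetical. A real system would look up ownership in a replicated metadata store.

```python
HOME_REGION = {"alice": "us", "bob": "eu"}  # hypothetical user directory


def route_write(user_id: str, arrived_at: str, default: str = "us") -> dict:
    """Send each write to the region that owns the user's shard."""
    owner = HOME_REGION.get(user_id, default)
    return {"region": owner, "forwarded": owner != arrived_at}


print(route_write("bob", arrived_at="us"))    # {'region': 'eu', 'forwarded': True}
print(route_write("alice", arrived_at="us"))  # {'region': 'us', 'forwarded': False}
```

Because each record has exactly one writer region, write conflicts cannot occur. The price is a cross-region hop for any user who is away from their home region.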
The data tier is where active-active lives or dies. The application tier is comparatively easy when it is stateless; the data tier requires careful consistency modeling.
How do you test active-active data synchronization? Simulating network partitions between regions is hard.
You have servers in US and EU. User in New York makes a request. Where should it go?
What's the goal?
Which routing approach gives you all three?
A. DNS-based (Route53 with latency-based routing)
B. Anycast (BGP announces same IP from multiple regions)
C. Application-level (smart client chooses region)
D. None - trade-offs required
Answer: D - each approach has trade-offs.
DNS-based routing
Anycast routing
Application-level routing
Hybrid approach (common in practice)
Netflix uses:
No single routing approach is perfect. Most production systems use DNS for coarse routing + application logic for fine-grained decisions.
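The "application logic" half of that split can be as small as the following sketch: given measured latencies and a set of regions currently believed healthy, pick the fastest healthy one. The function `pick_region` and its inputs are assumptions for illustration. How latency and health are measured is out of scope here.

```python
def pick_region(latency_ms: dict, healthy: set) -> str:
    """Return the lowest-latency region among those marked healthy."""
    candidates = {r: ms for r, ms in latency_ms.items() if r in healthy}
    if not candidates:
        raise RuntimeError("no healthy region available")
    return min(candidates, key=candidates.get)


latencies = {"us-east": 12, "eu-west": 85}

print(pick_region(latencies, healthy={"us-east", "eu-west"}))  # us-east
print(pick_region(latencies, healthy={"eu-west"}))             # eu-west
```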
User is in New York but their data is sharded to EU region (data residency). Should routing send them to US (low latency) or EU (where their data lives)?
Your EU region goes down. How quickly can you detect it and route traffic away?
Challenge 1: What does "down" mean?
Challenge 2: Distributed failure detection
Strategy 1: DNS failover (slow)
Strategy 2: Anycast failover (fast)
Strategy 3: Application-level retry (fastest)
Fast failure detection is worthless if you don't have fast recovery. Focus on application-level retries for the best user experience.
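A minimal sketch of application-level retry, assuming the client knows an ordered list of regions and that the request is idempotent (retrying a non-idempotent write in a second region can apply it twice). The function `call_with_failover` and the `fake_send` stub are hypothetical. A real client would make an HTTP or RPC call with a timeout.

```python
def call_with_failover(regions, send, attempts_per_region=1):
    """Try each region in order; return (region, response) from the first success."""
    errors = []
    for region in regions:
        for _ in range(attempts_per_region):
            try:
                return region, send(region)
            except ConnectionError as exc:
                errors.append((region, str(exc)))
    raise ConnectionError(f"all regions failed: {errors}")


def fake_send(region):
    # Stand-in for a real network call: the EU region is down.
    if region == "eu":
        raise ConnectionError("eu unreachable")
    return "200 OK"


result = call_with_failover(["eu", "us"], fake_send)
print(result)  # ('us', '200 OK')
```

The user sees one slightly slower request instead of an error, and no DNS change or BGP reconvergence is involved.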
During failover, what happens to in-flight requests? Do they fail, get retried, or complete successfully?
Your CFO asks: "Why did our cloud bill double when we went active-active?"
You're running twice the infrastructure, but is that the only cost?
1. Infrastructure costs (obvious)
2. Data transfer costs (hidden)
3. Operational costs (often underestimated)
4. Performance costs (sometimes)
Active-active is expensive (money, complexity, operational overhead). Only use it when downtime costs exceed those expenses.
Your active-active setup costs $300K/year extra. Your biggest outage last year cost $50K. Should you downgrade to active-passive?
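A rough way to frame the question is a break-even count: how many outages of that size would active-active need to prevent each year to pay for itself? The calculation below uses only the two figures from the question and deliberately ignores costs that are harder to quantify, such as compliance exposure, churn, and reputational damage. The variable names are illustrative.

```python
extra_cost_per_year = 300_000    # extra annual spend on active-active (USD)
cost_of_biggest_outage = 50_000  # cost of last year's worst outage (USD)

# Number of outages of that size per year needed to justify the spend.
breakeven_outages = extra_cost_per_year / cost_of_biggest_outage
print(breakeven_outages)  # 6.0
```

On direct outage cost alone, the numbers favor downgrading. The answer only flips if some other factor, such as a contractual SLA or a regulatory requirement, carries a cost that this arithmetic leaves out.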
You're the architect for a global SaaS collaboration platform (think: Slack, Microsoft Teams).
Requirements:
Constraints:
Write down your architecture.
1. Regions and topology:
2. Data residency strategy:
3. Consistency model per feature:
4. Routing strategy:
5. Failover strategy:
6. Cost estimate:
Active-active is a business decision, not just a technical one. The architecture must be justified by revenue, compliance, or risk mitigation.
Your active-active architecture is live. Users in the US and EU regions write to the same workspace simultaneously, creating a conflict. How do you resolve it: last-write-wins, manual merge, or reject one write?
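Of those options, last-write-wins is the easiest to sketch. The version below assumes each write carries a millisecond timestamp and its origin region, and breaks timestamp ties on region name so that both regions pick the same winner. The `Write` type and `last_write_wins` function are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Write:
    value: str
    timestamp_ms: int
    region: str


def last_write_wins(a: Write, b: Write) -> Write:
    # Tie-break on region name so the result does not depend on argument order.
    return max(a, b, key=lambda w: (w.timestamp_ms, w.region))


us_write = Write("Team Alpha", timestamp_ms=1_000, region="us")
eu_write = Write("Team Beta", timestamp_ms=1_005, region="eu")

winner = last_write_wins(us_write, eu_write)
print(winner.value)  # Team Beta
```

Last-write-wins silently discards the losing write and depends on reasonably synchronized clocks across regions. That may be acceptable for a workspace name but probably not for message content, where a merge or an explicit rejection loses less.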
Active-active design checklist:
Data tier checklist:
Operational checklist:
Testing checklist:
Cost optimization checklist:
Red flags (reassess architecture):