Reliability And Resilience · Chapter 47 of 51

Blast Radius and Failure Domain Isolation

Akhil Sharma
20 min

Blast Radius and Failure Domain Isolation (Containing the Blast)

Audience: platform engineers and architects designing fault-tolerant distributed systems.

This article assumes:

  • Failures are inevitable: services crash, networks partition, dependencies time out.
  • Failures cascade: one component failure can trigger a domino effect.
  • Your system is larger than any one person understands completely.
  • Customers don't care about your internal architecture - they care about their experience.

Challenge: One microservice takes down your entire platform

Scenario

It's Black Friday. Your recommendation service crashes due to a memory leak.

Within 2 minutes:

  • The API gateway times out waiting for recommendations
  • The timeout causes the gateway's connection pool to fill up
  • New requests to ANY service (checkout, search, login) start failing
  • Your entire platform is down

All because of one non-critical feature: product recommendations.

Interactive question (pause and think)

What failed here?

  1. The recommendation service (it had the bug)
  2. The API gateway (it couldn't handle timeouts)
  3. The system architecture (one failure cascaded)
  4. The developers (they wrote buggy code)

Take 10 seconds.

Progressive reveal (question -> think -> answer)

Answer: (3) - architecture failure.

The bug is inevitable. The cascade is not.

Real-world analogy (ship bulkheads)

When the Titanic hit the iceberg, water flooded the forward compartments. But the bulkheads (walls) between compartments weren't sealed at the top. As the bow dipped, water spilled over, compartment by compartment, until the ship sank.

Modern ships have watertight compartments that seal completely. One flooded room doesn't sink the ship.

Your system needs the same isolation.

Key insight box

Blast radius is the scope of impact when a component fails. Failure domain isolation limits that scope through architectural boundaries.

Challenge question

If you could only add ONE isolation mechanism to your system right now, would you choose: timeouts, circuit breakers, or separate deployment units?


Mental model - Failures want to spread; isolation contains them

Scenario

Your system has three layers: frontend, backend, database.

Two design philosophies:

  • Tight coupling: "Fast is efficient. Share everything: connections, memory, threads."
  • Loose coupling: "Isolated is resilient. Assume dependencies will fail."

Interactive question (pause and think)

Which statement is true?

A. "Isolation adds complexity, tight coupling is simpler."
B. "Isolation wastes resources, sharing is more efficient."
C. "Isolation trades some efficiency for resilience."

Progressive reveal

Answer: C.

  • A is only true for trivial systems.
  • B measures the wrong efficiency (CPU/memory vs. customer impact).
  • C recognizes the trade-off explicitly.

Mental model

Think of failure domain isolation as:

  • Firebreaks in a forest (stop fire spread)
  • Circuit breakers in electrical systems (prevent overload cascades)
  • Quarantine zones in disease control (contain outbreaks)

The goal: when something fails, the failure stays local.

Real-world parallel (power grid)

The 2003 Northeast Blackout started with one overloaded transmission line in Ohio. It cascaded to 50 million people across 8 states and Canada.

Modern grids use isolation: circuit breakers segment regions. One grid section can fail without taking down neighboring states.

Key insight box

In distributed systems, components fail independently. In poorly designed systems, they fail together.

Challenge question

Is it possible to have TOO MUCH isolation? What are the costs of over-isolating?


Understanding blast radius dimensions

Scenario

Your database goes down. What's the blast radius?

It depends on WHICH dimension you're measuring.

Blast radius dimensions

Customer impact:

  • How many users affected?
  • Which features broken?
  • For how long?

Revenue impact:

  • How much money lost per minute?
  • Which revenue streams affected?

Component impact:

  • How many services degraded/failing?
  • Which data inconsistent?

Geographic impact:

  • One region? Multi-region?
  • Which compliance zones affected?

Time impact:

  • How long to detect?
  • How long to mitigate?
  • How long to fully recover?


Interactive question

Your payment service has a bug. Which blast radius dimension should you minimize FIRST?

  1. Customer impact (how many customers)
  2. Revenue impact (how much money)
  3. Time impact (how fast to fix)

Progressive reveal

Answer: Trick question - minimizing (1) is what minimizes (2), and (3) is how you achieve both.

Fast detection and mitigation reduce all blast radius dimensions.

Key insight box

Blast radius isn't a single number - it's a multi-dimensional surface. You optimize different dimensions with different techniques.

Challenge question

How do you measure blast radius BEFORE an incident happens? (Hint: Game Days)


Core isolation techniques (and their trade-offs)

Scenario

You're designing a multi-tenant SaaS platform. 1000 customers, ranging from small businesses to enterprises.

How do you prevent one customer's bad behavior from impacting others?

Isolation techniques catalog

1. Process isolation (bulkheads)

  • What: Separate thread pools, connection pools, or processes per tenant/feature
  • Prevents: Resource exhaustion cascade
  • Cost: Memory/CPU overhead

2. Network isolation (segmentation)

  • What: Separate VPCs, subnets, or service meshes per tenant/region
  • Prevents: Network-level blasts (DDoS, routing failures)
  • Cost: Infrastructure complexity

3. Data isolation (sharding)

  • What: Separate database instances, schemas, or tables per tenant
  • Prevents: Data corruption cascade, query load cascade
  • Cost: Operational overhead, harder queries

4. Deployment isolation (cells/clusters)

  • What: Separate Kubernetes clusters or server groups per region/tier
  • Prevents: Deployment failures, configuration errors
  • Cost: Duplication, slower rollouts

5. Circuit breakers and timeouts

  • What: Fail fast when dependencies are slow/down
  • Prevents: Waiting cascades, thread exhaustion
  • Cost: False positives during recovery


Example: process isolation with thread pools

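A minimal sketch of the bulkhead pattern in Go, using a buffered channel as a per-dependency semaphore. The `Bulkhead` type, names, and limits here are illustrative, not from a specific library:

```go
package main

import (
	"errors"
	"fmt"
)

// ErrBulkheadFull is returned when a call is rejected instead of queued.
var ErrBulkheadFull = errors.New("bulkhead full: rejecting call")

// Bulkhead caps concurrent calls to one dependency, using a buffered
// channel as a semaphore.
type Bulkhead struct {
	slots chan struct{}
}

func NewBulkhead(maxConcurrent int) *Bulkhead {
	return &Bulkhead{slots: make(chan struct{}, maxConcurrent)}
}

// Execute runs fn only if a slot is free; otherwise it fails fast,
// so a slow dependency cannot pile up waiting callers.
func (b *Bulkhead) Execute(fn func() error) error {
	select {
	case b.slots <- struct{}{}:
		defer func() { <-b.slots }()
		return fn()
	default:
		return ErrBulkheadFull
	}
}

func main() {
	// Separate pools per dependency: a slow recommendation service
	// can exhaust only its own 10 slots; checkout's 50 stay healthy.
	recommendations := NewBulkhead(10)
	checkout := NewBulkhead(50)
	_ = checkout
	err := recommendations.Execute(func() error { return nil })
	fmt.Println("recommendation call error:", err)
}
```

The key design choice is the `default` branch: rejecting immediately rather than queueing is what keeps the failure local.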

Production insight: Netflix's approach

Netflix uses "swim lanes" - separate connection pools per dependency. If the recommendation service is slow, only recommendation requests are affected. Checkout, search, and playback continue normally.

Key insight box

Isolation is about accepting some failures to prevent catastrophic failures. Trade local degradation for global stability.

Challenge question

You isolate tenants into separate database shards. One tenant's query runs a table scan that locks up their shard. Other tenants are fine. Is this good isolation or bad user experience?


Circuit breakers - failing fast to prevent cascades

Scenario

Your service calls a payment provider API. The provider is having issues:

  • 50% of requests timeout (30 seconds each)
  • Your service has 100 threads
  • Requests are coming in at 10/sec

Think about it

Without circuit breakers, what happens?

Interactive question (pause and think)

How long until your service is completely unresponsive?

A. 30 seconds
B. 3 minutes
C. Never - timeouts prevent thread exhaustion
D. 10 seconds

Pause and calculate.

Progressive reveal

Answer: D (10 seconds).

Math:

  • 100 threads available
  • 10 requests/sec
  • 50% timeout for 30 seconds each
  • 10 * 0.5 = 5 requests/sec tie up a thread for 30 sec each
  • Steady state would need 5 requests/sec * 30 sec = 150 threads
  • You only have 100 threads → fully saturated in 100/5 = 20 sec
  • But fast requests hold threads too, and users see failures well before full saturation - in practice around the 10-second mark

Circuit breaker states

A breaker moves through three states:

  • CLOSED: requests flow normally; failures are counted
  • OPEN: the failure threshold was exceeded; calls fail fast without touching the dependency
  • HALF-OPEN: after a cooldown, a few trial requests probe the dependency - success closes the circuit, failure reopens it

Circuit breaker implementation



Key insight box

Circuit breakers don't prevent failures - they prevent cascading failures. Fail fast to stay available.

Challenge question

Your circuit breaker opens. How do you communicate this to users: "Payment temporarily unavailable" or "Internal server error"?


Cell-based architecture - the ultimate isolation

Scenario

You run a global SaaS platform. A bad deployment takes down your entire us-west region.

Impact: 40% of customers offline.

Interactive question (pause and think)

How can you deploy updates without risking 40% of your customers?

A. Better testing (catch bugs before production)
B. Canary deployments (gradually roll out)
C. Cell architecture (customers isolated into groups)

Progressive reveal

Answer: All three, but C provides the strongest isolation guarantee.

What is cell-based architecture?

Traditional architecture:

  • All customers share the same infrastructure pool
  • One bad deployment affects everyone
  • Load balancer sprays traffic across all instances

Cell-based architecture:

  • Customers assigned to isolated "cells" (separate clusters)
  • Each cell is a complete, independent stack
  • Cells can fail independently


Cell design considerations

Cell sizing:

  • Too small (10 customers/cell): high overhead, many cells to manage
  • Too large (10,000 customers/cell): large blast radius
  • Sweet spot: 100-1,000 customers/cell (1-10% blast radius)

Customer assignment:

  • By ID hash: evenly distributed load
  • By tier: Enterprise customers in dedicated cells
  • By geography: comply with data residency
  • By feature usage: High-load customers in separate cells

Deployment strategy:

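A staged, cell-by-cell rollout might be declared like this (all field names are illustrative, not from a specific deployment tool):

```yaml
# Hypothetical staged rollout: one canary cell first, then widening
# waves, so a bad build can hurt at most a couple of cells at a time.
deployment:
  strategy: cell-by-cell
  waves:
    - name: canary
      cells: [cell-01]            # ~1% of customers
      bake_time: 2h
      rollback_if: error_rate > 0.1%
    - name: early
      cells: [cell-02, cell-03]
      bake_time: 4h
    - name: broad
      cells: remaining
      max_parallel: 2             # never touch more than 2 cells at once
```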

Production insight: AWS's use of cells

AWS uses "shuffle sharding" - each customer's requests are routed to a unique subset of cells. Even if one cell fails, most customers' other requests go to healthy cells.

Effective blast radius: much less than 1/N where N = number of cells.
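A sketch of how shuffle-shard assignment might be computed, seeding a deterministic permutation from the customer ID. The `shardFor` function and its parameters are hypothetical, not AWS's implementation:

```go
package main

import (
	"fmt"
	"hash/fnv"
	"math/rand"
)

// shardFor assigns a customer a deterministic pseudo-random subset of
// k cells out of n by seeding a permutation from the customer ID.
// With n=8 and k=2 there are 28 possible subsets, so two customers
// rarely share a complete shard and rarely fail together.
func shardFor(customerID string, n, k int) []int {
	h := fnv.New64a()
	h.Write([]byte(customerID))
	r := rand.New(rand.NewSource(int64(h.Sum64())))
	return r.Perm(n)[:k] // first k cells of a customer-specific shuffle
}

func main() {
	fmt.Println("customerA cells:", shardFor("customerA", 8, 2))
	fmt.Println("customerB cells:", shardFor("customerB", 8, 2))
}
```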

Key insight box

Cell-based architecture is the gold standard for blast radius isolation. Cost: operational complexity and reduced resource sharing efficiency.

Challenge question

You have 1000 servers. Monolithic pool vs 10 cells of 100 servers each. Which design can handle more total load? Which is more resilient?


The hidden cost of isolation - operational complexity

Scenario

Your SRE team is thrilled. You've implemented:

  • Cell-based architecture (10 cells)
  • Separate database clusters per cell
  • Separate Kubernetes clusters per cell
  • Circuit breakers on all dependencies

Then your CEO asks: "Why did our AWS bill triple?"

Interactive question (pause and think)

Isolation costs money. Where does the cost come from?

A. Duplicate infrastructure (less efficient sharing)
B. Operational overhead (more things to manage)
C. Reduced economies of scale
D. All of the above

Progressive reveal

Answer: D - isolation trades efficiency for resilience.

Hidden costs of isolation

1. Infrastructure duplication:

  • Every cell needs its own databases, load balancers, and control planes
  • Baseline cost scales roughly with per-cell overhead × number of cells

2. Operational overhead:

  • 10x more things to monitor
  • 10x more things to upgrade
  • 10x more configuration management
  • 10x more toil for runbooks

3. Cross-cell operations:

  • Customer wants to query all their data → scatter-gather across cells
  • Analytics pipeline needs data from all cells → complex fan-in
  • Global rate limiting → requires cross-cell coordination

4. Reduced caching efficiency:

  • Caches are per-cell, so hot data is duplicated and hit rates drop

When isolation is worth it

DO heavily isolate:

  • Regulated industries (finance, healthcare)
  • Multi-tenant SaaS with enterprise SLAs
  • Systems with catastrophic failure costs (life-critical, high revenue)

DON'T over-isolate:

  • Early-stage startups (premature optimization)
  • Internal tools (lower availability requirements)
  • Read-heavy systems with low mutation rate

Visual: isolation cost/benefit curve

Resilience gains flatten as isolation increases, while cost keeps climbing - the sweet spot is before the curve goes flat.

Key insight box

Isolation is not free. The art is finding the minimum isolation that meets your resilience and compliance requirements.

Challenge question

How would you measure if you're over-isolated or under-isolated? What metrics indicate the right balance?


Quota management - isolating greedy neighbors

Scenario

You run a multi-tenant API platform. CustomerA is running a badly written script that hammers your API:

  • 1000 requests/sec (10x normal)
  • 80% of your compute capacity
  • Other customers seeing latency spikes

Without quotas, one customer destroys the experience for everyone.

Think about it

How do you prevent this without manually blocking CustomerA?

Interactive question (pause and think)

Which quota strategy is most fair?

A. Hard limit (1000 req/sec per customer, block excess)
B. Rate limiting (X req/sec, queue excess with timeout)
C. Fair queuing (each customer gets equal share of capacity)
D. Tiered quotas (enterprise gets more than free tier)

Progressive reveal

Answer: Depends on product requirements, but typically D with B.

Quota implementation patterns

1. Token bucket (smooth rate limiting)


2. Fair queuing (weighted fair share)

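A simplified sketch of weighted fair queuing, in the style of deficit round robin. The `tenantQueue` type and `schedule` function are hypothetical; real DRR also resets a tenant's deficit when its queue empties:

```go
package main

import "fmt"

// tenantQueue holds one tenant's queued requests plus its weight:
// each round a tenant earns credit equal to its weight and may send
// that many requests, so capacity splits in weight proportion.
type tenantQueue struct {
	id      string
	weight  int      // share of capacity relative to other tenants
	deficit int      // accumulated unspent credit
	pending []string // queued request IDs
}

// schedule serves up to `budget` requests total across all tenants.
func schedule(tenants []*tenantQueue, budget int) []string {
	var served []string
	for len(served) < budget {
		progress := false
		for _, t := range tenants {
			t.deficit += t.weight
			for t.deficit > 0 && len(t.pending) > 0 && len(served) < budget {
				served = append(served, t.pending[0])
				t.pending = t.pending[1:]
				t.deficit--
				progress = true
			}
		}
		if !progress {
			break // every queue is empty
		}
	}
	return served
}

func main() {
	a := &tenantQueue{id: "A", weight: 3, pending: []string{"a1", "a2", "a3", "a4"}}
	b := &tenantQueue{id: "B", weight: 1, pending: []string{"b1", "b2"}}
	fmt.Println(schedule([]*tenantQueue{a, b}, 4)) // A gets ~3 of every 4 slots
}
```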

3. Quota enforcement layers

Enforce quotas at several layers, so excess traffic caught at the edge never reaches the data store:

  • Edge/CDN: coarse per-IP limits
  • API gateway: per-customer rate limits
  • Service: per-tenant fair queuing
  • Data store: per-tenant connection and query limits

Production insight: Stripe's quota approach

  • Free tier: 100 requests/sec
  • Growth tier: 1,000 requests/sec
  • Enterprise: custom, but still rate-limited (prevents accidents)
  • Burst allowance: 2x sustained rate for 60 seconds
  • Graceful degradation: return HTTP 429 with Retry-After header

Key insight box

Quotas are isolation for time-based resources. Without them, one customer's spike becomes everyone's outage.

Challenge question

A customer hits their quota at 11:59 PM. Their critical business process runs at midnight. Should your system allow burst overages for brief periods?


Measuring blast radius - knowing your exposure

Scenario

Your VP asks: "If our payment service goes down, how many customers are affected and for how long?"

You don't know. You've never measured it.

Interactive question (pause and think)

Which metric best represents blast radius?

A. Number of servers impacted
B. Percentage of customer requests failing
C. Revenue lost per minute
D. Time to detect + time to recover

Progressive reveal

Answer: B and C together, adjusted by D.

Blast radius = (% customers impacted) × (revenue impact rate) × (downtime duration)

Blast radius measurement framework

Pre-incident (design time):

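A design-time blast radius budget might be declared per failure mode, for example (all field names are illustrative):

```yaml
# Hypothetical design-time blast radius budget per failure mode.
failure_modes:
  - name: payment-provider-outage
    max_customers_impacted: 5%
    max_revenue_at_risk_usd_per_min: 2000
    detection_target: 60s
    mitigation_target: 5m
  - name: single-cell-failure
    max_customers_impacted: 1%
    detection_target: 30s
    mitigation_target: 2m
```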

During-incident (real-time monitoring):

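During an incident, the blast radius formula above can be computed continuously from live metrics. A sketch (the `blastRadius` function and its inputs are illustrative; real values would come from your metrics pipeline):

```go
package main

import "fmt"

// blastRadius combines the failing share of customer requests with a
// revenue rate and elapsed downtime, matching the formula above.
func blastRadius(failedReqs, totalReqs, revenuePerMin, minutesDown float64) (pctImpacted, revenueLost float64) {
	if totalReqs == 0 {
		return 0, 0
	}
	pctImpacted = 100 * failedReqs / totalReqs
	revenueLost = pctImpacted * revenuePerMin * minutesDown / 100
	return pctImpacted, revenueLost
}

func main() {
	// 1,200 of 10,000 req/min failing, $5,000/min revenue, 10 min down.
	pct, lost := blastRadius(1200, 10000, 5000, 10)
	fmt.Printf("%.0f%% of requests failing, ~$%.0f revenue at risk\n", pct, lost)
}
```

With 12% of requests failing for 10 minutes at $5,000/min of revenue, this reports roughly $6,000 at risk.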

Post-incident (retrospective):

  • Reconstruct the timeline: detection, mitigation, full recovery
  • Compute the actual blast radius with the formula above and compare it to your designed limits
  • Feed any gaps back into isolation design and Game Day scenarios

Key insight box

You can't reduce what you don't measure. Blast radius measurement must happen at design time, incident time, and retrospective time.

Challenge question

Two services: Service A has 10% error rate for 1 hour. Service B has 1% error rate for 10 hours. Which has a bigger blast radius?


Final synthesis - Design a resilient multi-tenant platform

Synthesis challenge

You're the architect for a new B2B SaaS analytics platform.

Requirements:

  • 10,000 customers (range: 10-employee startups to 50,000-employee enterprises)
  • 99.95% uptime SLA (22 min downtime/month)
  • Must support real-time and batch analytics
  • Compliance: some customers require data residency (EU, US)
  • Cost-sensitive: investors want efficient infrastructure spend

Constraints:

  • Limited ops team (5 SREs)
  • Can't afford 10,000 isolated cells

Your tasks (pause and think)

  1. Design your isolation strategy (what layers, what technique?)
  2. Define failure domains (what should fail together vs separately?)
  3. Choose cell sizing and customer assignment strategy
  4. Define blast radius SLO per failure mode
  5. Plan quota management approach
  6. Describe how you measure and improve blast radius over time

Write down your design.

Progressive reveal (one possible solution)

1. Isolation strategy (layered):

  • Tier 1: Geography (US, EU separate regions for data residency)
  • Tier 2: Customer size (Enterprise in dedicated cells, SMB shared)
  • Tier 3: Feature isolation (real-time vs batch separate compute)
  • Tier 4: Circuit breakers on all cross-service calls

2. Failure domains:

  • Enterprise customer data: dedicated cell (blast radius: 1 customer)
  • SMB customer data: shared cell of 100 customers (blast radius: 1% of customers)
  • Real-time ingestion: separate from batch (blast radius: 50% of features)
  • Batch processing: separate from real-time (blast radius: 50% of features)
  • Authentication service: regional, US and EU (blast radius: 50% of customers)

3. Cell sizing:

  • Enterprise tier: 1 customer per cell (largest 50 customers)
  • Growth tier: 100 customers per cell (next 950 customers)
  • Startup tier: 1000 customers per cell (remaining 9,000 customers)
  • Total: 50 + 10 + 9 = 69 cells (manageable by 5 SREs)

4. Blast radius SLOs:

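One possible shape for the blast radius SLOs, matching the failure domains above (field names are illustrative):

```yaml
# Illustrative blast radius SLOs per failure mode.
blast_radius_slos:
  enterprise_cell_failure:
    max_customers_impacted: 1
    max_duration: 15m
  smb_shared_cell_failure:
    max_customers_impacted: 1%   # one shared cell
    max_duration: 10m
  regional_auth_failure:
    max_customers_impacted: 50%  # one region
    max_duration: 5m
```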

5. Quota management:

  • Token bucket per customer (tier-based rates)
  • Enterprise: 1000 req/sec sustained, 5000 burst
  • Growth: 100 req/sec sustained, 500 burst
  • Startup: 10 req/sec sustained, 50 burst
  • Fair queuing within shared cells
  • HTTP 429 with Retry-After for quota exceeded

6. Blast radius improvement plan:

  • Run quarterly Game Days to validate isolation boundaries
  • Measure actual blast radius per incident against the designed limits
  • Adjust cell sizing and quotas as the customer mix evolves

Key insight box

Blast radius design is a balancing act: isolation vs cost, resilience vs complexity, safety vs efficiency. The right answer depends on your SLAs and risk tolerance.

Final challenge question

Your design achieves 99.95% uptime (22 min downtime/month). Should you over-engineer for 99.99% (4 min/month) or invest those resources elsewhere? How do you make that trade-off?


Appendix: Quick checklist (printable)

Blast radius design checklist:

  • Identify failure modes (enumerate what can go wrong)
  • Define failure domains (what should fail together?)
  • Choose isolation techniques (bulkheads, cells, circuit breakers)
  • Size cells appropriately (balance blast vs ops overhead)
  • Implement quotas (prevent greedy neighbors)
  • Add circuit breakers (prevent cascades)
  • Define blast radius SLOs (measure what matters)

Operational checklist:

  • Monitor blast radius in real-time during incidents
  • Run Game Days to validate isolation design
  • Measure actual blast radius vs designed limits
  • Track isolation costs (infra $ + operational overhead)
  • Review and adjust quarterly (as system evolves)

Circuit breaker checklist:

  • Timeout threshold (e.g., 1 second)
  • Failure threshold (e.g., 50% error rate over 10 sec)
  • Half-open test period (e.g., 30 seconds)
  • Success threshold to close (e.g., 3 consecutive successes)
  • Metrics and dashboards (state changes, trip events)
  • Alerts on circuit open (human awareness)

Cell-based architecture checklist:

  • Cell size determined (customer count or load-based)
  • Customer assignment strategy (hash, tier, geo)
  • Deployment strategy (canary cell, staged rollout)
  • Cross-cell operations minimized (data locality)
  • Observability per cell (separate dashboards/alerts)
  • Runbooks for cell-specific operations

Red flags (reassess isolation):

  • Same failure keeps taking down multiple cells
  • Operational overhead exceeds team capacity
  • Infrastructure costs 3x+ comparable monolithic design
  • Cross-cell operations are frequent and slow
  • Isolation is preventing needed features

Key Takeaways

  1. Blast radius is the scope of impact when something fails — a single-server failure affects one request; a regional outage affects millions
  2. Failure domain isolation limits the blast radius of any single failure — separate availability zones, regions, and cell architectures
  3. Cell-based architecture isolates groups of users into independent cells — a failure in one cell doesn't affect users in other cells
  4. Bulkheads prevent cascading failures — like watertight compartments in a ship, failing components don't bring down the entire system
Course complete! You've finished all 51 chapters of System Design Advanced.