Reliability And Resilience · Chapter 51 of 51

Cascading Failure Prevention

Akhil Sharma
20 min

Cascading Failure Prevention (Stop the Domino Effect)

Audience: system architects and SREs building resilient distributed systems that must survive partial failures.

This article assumes:

  • Failures in distributed systems are inevitable and often correlated.
  • A small failure can trigger a catastrophic chain reaction if unchecked.
  • Your system's weakest link will be discovered during your busiest hour.
  • Prevention is cheaper than recovery.

Challenge: One microservice takes down your entire platform

Scenario

Black Friday, 10 AM. Your recommendation service has a memory leak.

Timeline:

  • 10:00: Recommendation service starts running out of memory
  • 10:02: Service slows down (GC thrashing)
  • 10:03: API gateway timeouts waiting for recommendations
  • 10:04: API gateway connection pool fills up
  • 10:05: All API requests fail (checkout, search, login - everything)
  • 10:07: Full platform outage

Your entire e-commerce site is down because of a non-critical recommendation feature.

Interactive question (pause and think)

What failed here?

  1. The recommendation service (it had the bug)
  2. The API gateway (it couldn't handle timeouts)
  3. The system architecture (failure cascaded)
  4. The monitoring (it didn't alert fast enough)

Take 10 seconds.

Progressive reveal (question -> think -> answer)

Answer: (3) - architecture failure.

Bugs happen. The system should contain them, not amplify them.

Real-world analogy (forest fire)

A campfire in a forest:

Without firebreaks: Small campfire → spreads to tree → spreads to neighboring trees → entire forest burns.

With firebreaks: Small campfire → spreads to tree → hits firebreak (cleared area) → fire contained to small section.

Your distributed system needs firebreaks.

Key insight box

Cascading failures happen when one component's failure propagates to healthy components, creating a domino effect that exceeds the original blast radius.

Challenge question

Name three architectural patterns that would have prevented this cascade. Can you implement all three without significantly increasing latency?


Mental model - Failures amplify through dependency chains

Scenario

Your system has a dependency graph:

  Frontend → API Gateway → Service A → Service B

Service B slows down. Who gets affected?

Interactive question (pause and think)

If Service B starts taking 10 seconds instead of 100ms, what happens upstream?

A. Only Service A is affected (direct dependency)
B. API Gateway and Service A are affected (upstream dependencies)
C. Everything upstream fails (Frontend, API Gateway, Service A)
D. Even unrelated services fail (shared resources)

Progressive reveal

Answer: often C, and with shared resources, D. Every caller waiting on B holds a thread and a connection, so the slowdown propagates well beyond B's direct callers.

Mental model

Think of cascading failures as:

  • Dependency amplification: One slow service makes everything slow
  • Resource exhaustion: Waiting threads consume memory/connections
  • Retry storms: Failed requests get retried, making overload worse
  • Shared fate: Components share infrastructure (thread pools, databases)

The cascade pattern:

  Slow dependency → callers block waiting → threads and connections exhaust
  → callers become slow themselves → their callers time out and retry
  → load multiplies → total collapse

Real-world parallel (traffic jam)

One slow car on a single-lane road:

  1. Car slows to 20 mph
  2. Cars behind it slow down (backpressure)
  3. More cars arrive than can pass (queue builds)
  4. Queue grows until blocking the highway entrance
  5. Now the entire highway system is gridlocked

One slow component can gridlock an entire system.

Key insight box

Cascading failures exploit dependencies and shared resources. The failure propagates faster than you can respond.

Challenge question

Can you design a system where Service B's failure has ZERO impact on Service A? What would you have to sacrifice?


Understanding cascade amplification factors

Scenario

Service B handles 100 requests/second normally.

It degrades to 10 requests/second (10x slower).

How much does this affect upstream services?

Amplification factors

Factor 1: Retry amplification

  • Clients retry failed requests (say, up to 3 times each)
  • As Service B's error rate climbs, offered load approaches 4x the original traffic
  • The slower B gets, the more retry traffic it receives: a feedback loop

Factor 2: Timeout amplification

  • Each upstream caller holds a thread and a connection for the full timeout (10s instead of 100ms)
  • At the same arrival rate, requests in flight grow 100x
  • Upstream thread pools exhaust, and the callers themselves become unresponsive

Factor 3: Connection pool amplification

  • Slow responses keep connections checked out 100x longer
  • The pool fills; new requests queue behind it or fail immediately
  • Requests to healthy services that share the same pool fail too
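The in-flight growth above is just Little's law (concurrent requests = arrival rate × latency). A back-of-the-envelope sketch in Go, with the illustrative numbers from this section:

```go
package main

import "fmt"

// inFlight applies Little's law: concurrent requests = arrival rate * latency.
func inFlight(reqPerSec, latencySec float64) float64 {
	return reqPerSec * latencySec
}

func main() {
	// Healthy: 100 req/s at 100ms latency -> ~10 requests in flight.
	fmt.Println(inFlight(100, 0.1)) // 10
	// Degraded: same traffic at 10s latency -> ~1000 in flight,
	// far beyond a typical 100-connection pool.
	fmt.Println(inFlight(100, 10)) // 1000
}
```

A 10x latency increase means 10x more concurrent requests parked in your pools, which is why slowness, not just errors, exhausts resources.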

Visual: cascade timeline (illustrative)

  T+0s     Service B degrades to 10x slower
  T+30s    Upstream threads pile up waiting on B
  T+60s    Timeouts begin; clients retry, doubling load on B
  T+90s    Upstream thread and connection pools exhaust
  T+120s   Services that never call B fail too (shared pools, shared gateway)
  T+150s   Full outage

Key insight box

Cascading failures amplify through retries, timeouts, and resource exhaustion. A 10x slowdown can cause a 100% outage.

Challenge question

If retries make cascades worse, should you disable retries entirely? What's the right retry strategy?


Core cascade prevention patterns

Scenario

You're designing a system that must survive partial failures.

What patterns prevent cascades?

Pattern 1: Timeouts (fail fast)

Timeout best practices:

  • Set a timeout on every network call; never wait indefinitely
  • Keep child timeouts shorter than parent timeouts (hierarchical budgets)
  • Tune per dependency: tight for non-critical calls, generous only where needed
  • Propagate deadlines so the whole request shares one time budget

Pattern 2: Circuit breakers (stop trying)

Circuit breaker benefits:

  • Fail fast instead of burning a full timeout on every call
  • Remove load from the failing dependency so it can recover
  • Half-open probes detect recovery automatically
  • A natural attachment point for fallback behavior

Pattern 3: Bulkheads (isolate resources)

Bulkhead benefits:

  • A misbehaving dependency can exhaust only its own pool
  • Critical paths keep guaranteed capacity
  • Blast radius becomes bounded and measurable

Pattern 4: Load shedding (reject when overloaded)

Load shedding strategies:

  • Reject when in-flight requests or queue depth exceeds a limit (return 503)
  • Shed by priority: drop nice-to-have traffic before critical traffic
  • Reject at the edge, before any expensive work is done
  • Send Retry-After so well-behaved clients back off

Pattern 5: Backpressure (slow down upstream)

Backpressure patterns:

  • Bounded queues: reject or block producers when full
  • Rate-limit upstream callers
  • Pull-based consumption: the consumer sets the pace
  • Explicit overload signals (429/503) tell callers to slow down

Key insight

Cascade prevention requires multiple defenses: timeouts (fail fast), circuit breakers (stop trying), bulkheads (isolate), load shedding (reject), and backpressure (slow down).

Challenge question

You implement all five patterns. Your system is now "failure-proof." What happens when all regions fail simultaneously (correlated failure)?


Retry storms - when fixing makes it worse

Scenario

Service B is overloaded. Clients start seeing errors.

Clients retry. Now Service B gets 2x traffic (original + retries).

Service B falls over completely.

Think about it

Retries are supposed to help with transient failures. Why do they make overload worse?

Interactive question (pause and think)

Which retry strategy prevents retry storms?

A. Retry immediately on failure
B. Retry with exponential backoff
C. Retry with jitter (randomization)
D. Don't retry on overload errors (503)

Progressive reveal

Answer: B, C, and D together.

Retry storm amplification

  • Normal load: 1,000 req/s; Service B degrades and requests start failing
  • Every client retries each failure (say, 3 times): offered load approaches 4,000 req/s
  • A service that was already falling over at 1,000 req/s now faces 4x that
  • B cannot recover until the retries stop, and the retries won't stop until B recovers

Safe retry strategies

Strategy 1: Exponential backoff with jitter


Strategy 2: Don't retry on overload

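A sketch of a retry predicate that treats overload signals as non-retryable; retrying into an overloaded server only feeds the storm (the status classification is a common convention, not a universal rule):

```go
package main

import (
	"fmt"
	"net/http"
)

// shouldRetry decides whether a failed call is worth retrying.
func shouldRetry(status int) bool {
	switch status {
	case http.StatusServiceUnavailable, http.StatusTooManyRequests: // 503, 429
		return false // server says it is overloaded: back off, don't pile on
	case http.StatusBadRequest, http.StatusNotFound: // 4xx won't succeed on retry
		return false
	default:
		// Other 5xx (500/502/504) may be transient and worth one careful retry.
		return status >= 500
	}
}

func main() {
	fmt.Println(shouldRetry(503)) // false
	fmt.Println(shouldRetry(500)) // true
}
```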

Strategy 3: Retry budget

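A retry-budget sketch: retries are capped at a fraction of total requests, so even under total failure, retries add a bounded amount of extra load instead of multiplying traffic (the 10% ratio and `RetryBudget` type are illustrative, and this single-goroutine version omits locking):

```go
package main

import "fmt"

// RetryBudget allows at most ratio * requests retries.
type RetryBudget struct {
	ratio    float64 // e.g. 0.1 allows retries for 10% of requests
	requests float64
	retries  float64
}

func (b *RetryBudget) RecordRequest() { b.requests++ }

// AllowRetry spends budget if any remains; otherwise the caller
// should fail the request rather than retry.
func (b *RetryBudget) AllowRetry() bool {
	if b.retries >= b.requests*b.ratio {
		return false // budget exhausted: no retry storm possible
	}
	b.retries++
	return true
}

func main() {
	b := &RetryBudget{ratio: 0.1}
	for i := 0; i < 100; i++ {
		b.RecordRequest()
	}
	allowed := 0
	for i := 0; i < 50; i++ { // 50 failures all want to retry
		if b.AllowRetry() {
			allowed++
		}
	}
	fmt.Println(allowed) // 10 -- only 10% of 100 requests may be retried
}
```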

Production insight: Google's retry strategy

As described in the Google SRE book's "Handling Overload" chapter:

  • Per-request cap: retry a given request at most a few (about 3) times, then fail it
  • Per-client retry budget: retries limited to roughly 10% of that client's total requests
  • Retry at only one layer of the stack, so retries don't multiply across layers

Key insight box

Retries amplify failures. Safe retry strategies use exponential backoff, jitter, and don't retry on overload signals.

Challenge question

Your service receives a retry storm (1000 req/sec retries on top of 1000 req/sec normal traffic). How do you detect and mitigate it in real-time?


Graceful degradation - failing partially instead of completely

Scenario

Your recommendation service is down. Should your entire product page fail?

Graceful degradation strategies

Strategy 1: Feature flags for non-critical features

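A sketch of degrading via a feature flag: the product page keeps working when the optional recommendations feature is disabled or failing. The in-memory `flags` map stands in for a dynamic config system you can flip during an incident:

```go
package main

import "fmt"

// flags is a stand-in for a dynamic feature-flag store.
var flags = map[string]bool{"recommendations": true}

type ProductPage struct {
	Product         string
	Recommendations []string // nil when the feature is degraded
}

// renderProductPage keeps the critical path (the product itself) alive
// even when the optional recommendations feature is off or erroring.
func renderProductPage(product string, fetchRecs func() ([]string, error)) ProductPage {
	page := ProductPage{Product: product}
	if !flags["recommendations"] {
		return page // degraded: the page still works without recs
	}
	if recs, err := fetchRecs(); err == nil {
		page.Recommendations = recs
	} // on error: omit the section instead of failing the whole page
	return page
}

func main() {
	flags["recommendations"] = false // flipped during the incident
	page := renderProductPage("widget", nil)
	fmt.Println(page.Product, page.Recommendations == nil) // widget true
}
```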

Strategy 2: Fallback values

For example:

  • Recommendations unavailable → fall back to cached best-sellers
  • Shipping-estimate service down → show a conservative default estimate
  • Prefer slightly stale or generic data over an error

Strategy 3: Static fallback content


Strategy 4: Partial responses

  • Return the core fields even when optional sections (reviews, recommendations) time out
  • Mark missing sections so clients can render placeholders instead of erroring

Criticality classification

  • Critical (never degrade): checkout, payment, login
  • Important (degrade only under severe load): search, cart
  • Nice-to-have (first to shed): recommendations, reviews, analytics

Key insight box

Not all features are equally critical. Graceful degradation means failing non-critical features without affecting critical paths.

Challenge question

You've classified features into critical/important/nice-to-have. During an incident, should you proactively disable nice-to-have features to preserve capacity for critical features?


Correlated failures - when everything fails at once

Scenario

All your circuit breakers, timeouts, and bulkheads are perfect.

Then: AWS us-east-1 has a major outage. All regions depend on it for authentication.

Your entire platform is down. No amount of cascade prevention helped.

Correlated failure patterns

Pattern 1: Shared infrastructure

  • Every region authenticates through one region: that region becomes a global single point of failure
  • Other common shared fates: one database, one DNS provider, one config service

Pattern 2: Thundering herd after recovery

  • The dependency recovers; every client reconnects and retries at the same moment
  • The synchronized surge knocks it over again, repeating the outage

Pattern 3: Deployment-induced cascades

  • A bad binary or config rolls out to all instances at once
  • Every replica fails the same way at the same time, so redundancy doesn't help

Defense against correlated failures

Defense 1: Redundant infrastructure

  • Independent authentication per region; no hard cross-region dependencies
  • Diversity in providers: DNS, CDN, cloud regions

Defense 2: Shuffle sharding

  • Assign each customer a small, deterministic subset of instances
  • Two customers rarely share the same full subset, so one customer's overload stays contained
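Shuffle sharding can be sketched as a deterministic hash-based assignment; the hashing scheme here is an illustrative toy, not the AWS algorithm:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// shardFor deterministically assigns a customer a small subset of
// instances. A noisy customer can only overload its own shard,
// and few other customers share that exact shard.
// Assumes shardSize <= totalInstances.
func shardFor(customer string, totalInstances, shardSize int) []int {
	h := fnv.New64a()
	h.Write([]byte(customer))
	seed := h.Sum64()

	shard := make([]int, 0, shardSize)
	seen := make(map[int]bool)
	for len(shard) < shardSize {
		// Simple LCG step to derive a stream of pseudo-random indices.
		seed = seed*6364136223846793005 + 1442695040888963407
		idx := int(seed % uint64(totalInstances))
		if !seen[idx] {
			seen[idx] = true
			shard = append(shard, idx)
		}
	}
	return shard
}

func main() {
	fmt.Println(shardFor("customer-a", 100, 4))
	fmt.Println(shardFor("customer-b", 100, 4))
	// Same customer always maps to the same shard:
	fmt.Println(shardFor("customer-a", 100, 4))
}
```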

Defense 3: Chaos engineering

  • Deliberately inject failures (kill instances, add latency, partition networks) to verify your isolation actually works
  • Run game days rehearsing multi-failure scenarios before they happen for real

Key insight box

Cascades can be prevented, but correlated failures require architectural redundancy. Build systems where failures are independent.

Challenge question

Your system survives individual component failures perfectly. How do you test that it survives multiple simultaneous failures?


Final synthesis - Design a cascade-resistant architecture

Synthesis challenge

You're the architect for a global payment processing platform.

Requirements:

  • 100K transactions/second peak
  • 99.99% uptime SLA (52 minutes downtime/year)
  • Must survive: database failures, region outages, dependency timeouts
  • Payment must succeed or fail deterministically (no "maybe" states)

Constraints:

  • Complex dependency graph: 15 microservices
  • Shared database for transactions (consistency required)
  • Third-party dependencies: fraud detection, KYC, card networks
  • Cannot tolerate retry storms (could double-charge customers)

Your tasks (pause and think)

  1. Draw dependency graph and identify cascade risks
  2. Apply timeout strategy per service
  3. Design circuit breaker policy
  4. Implement bulkheads for critical paths
  5. Define retry policy that won't cause double-charges
  6. Plan graceful degradation for non-critical features
  7. Design defense against correlated failures

Write down your architecture.

Progressive reveal (one possible solution)

1. Dependency graph and cascade risks:

  • Critical path: API gateway → payment orchestrator → transaction database → card network
  • Highest risks: the shared transaction database (shared fate across all 15 services), slow third-party calls (fraud, KYC, card networks), and retry-driven double charges

2. Timeout strategy:

  • One end-to-end deadline per payment (for example 3s), propagated to every downstream call
  • Per-dependency budgets inside it: fraud detection tight (hundreds of milliseconds), the card network getting the largest share
  • Every child timeout strictly shorter than the parent's remaining budget

3. Circuit breaker policy:

  • Breakers on every third-party dependency
  • Open at roughly 50% error rate over a rolling window; half-open probe after 30-60 seconds
  • Fraud breaker open → fall back to conservative rules-based screening or queue for review

4. Bulkheads:

  • Separate connection and worker pools per downstream (card network vs. fraud vs. KYC)
  • Reserved capacity for the payment path; reporting and admin traffic isolated

5. Retry policy:

  • Idempotency key on every payment request, so a retried request can never double-charge
  • Exponential backoff with jitter, a small per-request cap, and a global retry budget
  • Never retry on explicit overload signals (429/503)

6. Graceful degradation:

  • Receipts, notifications, and analytics are asynchronous and may lag
  • A fraud-service outage degrades to rules-based checks, never to skipping fraud entirely
  • The payment/ledger path itself never degrades in correctness

7. Correlated failure defenses:

  • Active-active across at least two regions, each with independent authentication
  • Staggered, canaried deployments so a bad release cannot hit all instances at once
  • Regular chaos testing and game days for multi-failure scenarios
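The retry policy hinges on idempotency keys. A sketch of the mechanism, with an in-memory map standing in for the durable store a real payment system would use:

```go
package main

import "fmt"

// processed remembers results by idempotency key, so a retried
// payment request replays the original result instead of charging twice.
var processed = map[string]string{}

func chargeOnce(idempotencyKey string, charge func() string) string {
	if result, ok := processed[idempotencyKey]; ok {
		return result // retry: replay stored result, no second charge
	}
	result := charge()
	processed[idempotencyKey] = result
	return result
}

func main() {
	charges := 0
	doCharge := func() string { charges++; return "charged $42" }

	fmt.Println(chargeOnce("key-123", doCharge)) // charged $42
	fmt.Println(chargeOnce("key-123", doCharge)) // charged $42 (replayed)
	fmt.Println(charges)                         // 1 -- the retry did not double-charge
}
```

With this in place, retries become safe by construction, which is what makes the backoff-and-budget policy above tolerable in a payment system.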

Key insight box

Cascade prevention in payment systems requires defense-in-depth: timeouts, circuit breakers, bulkheads, careful retry policies, and graceful degradation.

Final challenge question

Your payment system is cascade-resistant. But what if the cascade starts in YOUR system and propagates to your CUSTOMERS' systems? How do you prevent your failures from cascading downstream?


Appendix: Quick checklist (printable)

Cascade prevention design:

  • Map dependency graph (understand what depends on what)
  • Identify shared resources (database, cache, auth)
  • Classify features by criticality (critical, important, nice-to-have)
  • Design failure modes (what should fail together, what should isolate)
  • Plan fallback strategies (cache, default values, omit)

Timeout configuration:

  • Set timeouts on all network calls (no infinite waits)
  • Use hierarchical timeouts (child < parent)
  • Configure per-dependency timeouts (critical vs non-critical)
  • Test timeout behavior (simulate slow dependencies)
  • Monitor timeout frequency (alert on unusual spikes)

Circuit breaker setup:

  • Implement circuit breakers on external dependencies
  • Set failure thresholds (50% error rate typical)
  • Configure timeout before half-open (30-60 seconds)
  • Define fallback behavior (cache, default, error)
  • Monitor circuit state changes (closed, open, half-open)

Bulkhead implementation:

  • Separate thread pools per dependency
  • Separate connection pools per service
  • Size bulkheads based on capacity (don't over-allocate)
  • Monitor bulkhead utilization (alert when full)
  • Test bulkhead isolation (one full bulkhead shouldn't affect others)

Retry strategy:

  • Use exponential backoff (not immediate retry)
  • Add jitter (prevent thundering herd)
  • Don't retry on overload (503, 429)
  • Implement retry budgets (limit retries per time window)
  • Use idempotency keys (prevent duplicate side effects)

Graceful degradation:

  • Define fallback for each non-critical feature
  • Use cached data when fresh data unavailable
  • Return partial responses (not all-or-nothing)
  • Skip optional features under load
  • Communicate degradation to users (if visible)

Monitoring:

  • Track timeout frequency per service
  • Monitor circuit breaker state transitions
  • Alert on bulkhead saturation
  • Measure retry rates (detect retry storms)
  • Track cascading failure metrics (time-to-cascade, blast radius)

Red flags (redesign needed):

  • Cascading failures happen frequently (poor isolation)
  • Same failure affects unrelated features (shared fate)
  • Retry storms observed regularly (bad retry strategy)
  • Circuit breakers always open (dependency not reliable enough)
  • Timeouts too aggressive (false positives) or too lenient (thread exhaustion)

Key Takeaways

  1. Cascading failures occur when one component's failure overloads others — creating a domino effect that brings down the entire system
  2. Circuit breakers stop calling failing services — after a threshold of errors, the circuit opens and fails fast instead of waiting for timeouts
  3. Bulkheads isolate failures — separate thread pools, connection pools, or processes for different dependencies
  4. Timeouts and retries with backoff prevent resource exhaustion — without timeouts, threads pile up waiting for unresponsive services
  5. Load shedding drops excess requests gracefully — returning 503 to some users is better than crashing and returning 503 to all users

Course complete! You've finished all 51 chapters of System Design Advanced.
