Reliability And Resilience · Chapter 51 of 51

Cascading Failure Prevention

Akhil Sharma
20 min

Cascading Failure Prevention (Stop the Domino Effect)

Audience: system architects and SREs building resilient distributed systems that must survive partial failures.

This article assumes:

  • Failures in distributed systems are inevitable and often correlated.
  • A small failure can trigger a catastrophic chain reaction if unchecked.
  • Your system's weakest link will be discovered during your busiest hour.
  • Prevention is cheaper than recovery.

Challenge: One microservice takes down your entire platform

Scenario

Black Friday, 10 AM. Your recommendation service has a memory leak.

Timeline:

  • 10:00: Recommendation service starts running out of memory
  • 10:02: Service slows down (GC thrashing)
  • 10:03: API gateway timeouts waiting for recommendations
  • 10:04: API gateway connection pool fills up
  • 10:05: All API requests fail (checkout, search, login - everything)
  • 10:07: Full platform outage

Your entire e-commerce site is down because of a non-critical recommendation feature.

Interactive question (pause and think)

What failed here?

  1. The recommendation service (it had the bug)
  2. The API gateway (it couldn't handle timeouts)
  3. The system architecture (failure cascaded)
  4. The monitoring (it didn't alert fast enough)

Take 10 seconds.

Progressive reveal (question -> think -> answer)

Answer: (3) - architecture failure.

Bugs happen. The system should contain them, not amplify them.

Real-world analogy (forest fire)

A campfire in a forest:

Without firebreaks: Small campfire → spreads to tree → spreads to neighboring trees → entire forest burns.

With firebreaks: Small campfire → spreads to tree → hits firebreak (cleared area) → fire contained to small section.

Your distributed system needs firebreaks.

Key insight box

Cascading failures happen when one component's failure propagates to healthy components, creating a domino effect that exceeds the original blast radius.

Challenge question

Name three architectural patterns that would have prevented this cascade. Can you implement all three without significantly increasing latency?


Mental model - Failures amplify through dependency chains

Scenario

Your system has a dependency graph:

  Frontend → API Gateway → Service A → Service B

Service B slows down. Who gets affected?

Interactive question (pause and think)

If Service B starts taking 10 seconds instead of 100ms, what happens upstream?

A. Only Service A is affected (direct dependency)
B. API Gateway and Service A are affected (upstream dependencies)
C. Everything upstream fails (Frontend, API Gateway, Service A)
D. Even unrelated services fail (shared resources)

Progressive reveal

Answer: often C, and with shared resources, D. Every caller waiting on B holds a thread and a connection, so the slowdown propagates well beyond B's direct callers.

Mental model

Think of cascading failures as:

  • Dependency amplification: One slow service makes everything slow
  • Resource exhaustion: Waiting threads consume memory/connections
  • Retry storms: Failed requests get retried, making overload worse
  • Shared fate: Components share infrastructure (thread pools, databases)

The cascade pattern:

  Slow dependency → callers block waiting → threads and connections exhaust
  → callers become slow themselves → their callers time out and retry
  → load multiplies → total collapse

Real-world parallel (traffic jam)

One slow car on a single-lane road:

  1. Car slows to 20 mph
  2. Cars behind it slow down (backpressure)
  3. More cars arrive than can pass (queue builds)
  4. Queue grows until blocking the highway entrance
  5. Now the entire highway system is gridlocked

One slow component can gridlock an entire system.

Key insight box

Cascading failures exploit dependencies and shared resources. The failure propagates faster than you can respond.

Challenge question

Can you design a system where Service B's failure has ZERO impact on Service A? What would you have to sacrifice?


Understanding cascade amplification factors

Scenario

Service B handles 100 requests/second normally.

It degrades to 10 requests/second (10x slower).

How much does this affect upstream services?

Amplification factors

Factor 1: Retry amplification

  • Clients retry failed requests (say, up to 3 times each)
  • As Service B's error rate climbs, offered load approaches 4x the original traffic
  • The slower B gets, the more retry traffic it receives: a feedback loop

Factor 2: Timeout amplification

  • Each upstream caller holds a thread and a connection for the full timeout (10s instead of 100ms)
  • At the same arrival rate, requests in flight grow 100x
  • Upstream thread pools exhaust, and the callers themselves become unresponsive

Factor 3: Connection pool amplification

  • Slow responses keep connections checked out 100x longer
  • The pool fills; new requests queue behind it or fail immediately
  • Requests to healthy services that share the same pool fail too
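The in-flight growth above is just Little's law (concurrent requests = arrival rate × latency). A back-of-the-envelope sketch in Go, with the illustrative numbers from this section:

```go
package main

import "fmt"

// inFlight applies Little's law: concurrent requests = arrival rate * latency.
func inFlight(reqPerSec, latencySec float64) float64 {
	return reqPerSec * latencySec
}

func main() {
	// Healthy: 100 req/s at 100ms latency -> ~10 requests in flight.
	fmt.Println(inFlight(100, 0.1)) // 10
	// Degraded: same traffic at 10s latency -> ~1000 in flight,
	// far beyond a typical 100-connection pool.
	fmt.Println(inFlight(100, 10)) // 1000
}
```

A 10x latency increase means 10x more concurrent requests parked in your pools, which is why slowness, not just errors, exhausts resources.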

Visual: cascade timeline (illustrative)

  T+0s     Service B degrades to 10x slower
  T+30s    Upstream threads pile up waiting on B
  T+60s    Timeouts begin; clients retry, doubling load on B
  T+90s    Upstream thread and connection pools exhaust
  T+120s   Services that never call B fail too (shared pools, shared gateway)
  T+150s   Full outage

Key insight box

Cascading failures amplify through retries, timeouts, and resource exhaustion. A 10x slowdown can cause a 100% outage.

Challenge question

If retries make cascades worse, should you disable retries entirely? What's the right retry strategy?


Core cascade prevention patterns

Scenario

You're designing a system that must survive partial failures.

What patterns prevent cascades?

Pattern 1: Timeouts (fail fast)

Timeout best practices:

  • Set a timeout on every network call; never wait indefinitely
  • Keep child timeouts shorter than parent timeouts (hierarchical budgets)
  • Tune per dependency: tight for non-critical calls, generous only where needed
  • Propagate deadlines so the whole request shares one time budget

Pattern 2: Circuit breakers (stop trying)

Circuit breaker benefits:

  • Fail fast instead of burning a full timeout on every call
  • Remove load from the failing dependency so it can recover
  • Half-open probes detect recovery automatically
  • A natural attachment point for fallback behavior

Pattern 3: Bulkheads (isolate resources)

Bulkhead benefits:

  • A misbehaving dependency can exhaust only its own pool
  • Critical paths keep guaranteed capacity
  • Blast radius becomes bounded and measurable

Pattern 4: Load shedding (reject when overloaded)

Load shedding strategies:

  • Reject when in-flight requests or queue depth exceeds a limit (return 503)
  • Shed by priority: drop nice-to-have traffic before critical traffic
  • Reject at the edge, before any expensive work is done
  • Send Retry-After so well-behaved clients back off

Pattern 5: Backpressure (slow down upstream)

Backpressure patterns:

  • Bounded queues: reject or block producers when full
  • Rate-limit upstream callers
  • Pull-based consumption: the consumer sets the pace
  • Explicit overload signals (429/503) tell callers to slow down

Key insight

Cascade prevention requires multiple defenses: timeouts (fail fast), circuit breakers (stop trying), bulkheads (isolate), load shedding (reject), and backpressure (slow down).

Challenge question

You implement all five patterns. Your system is now "failure-proof." What happens when all regions fail simultaneously (correlated failure)?


Retry storms - when fixing makes it worse

Scenario

Service B is overloaded. Clients start seeing errors.

Clients retry. Now Service B gets 2x traffic (original + retries).

Service B falls over completely.

Think about it

Retries are supposed to help with transient failures. Why do they make overload worse?

Interactive question (pause and think)

Which retry strategy prevents retry storms?

A. Retry immediately on failure
B. Retry with exponential backoff
C. Retry with jitter (randomization)
D. Don't retry on overload errors (503)

Progressive reveal

Answer: B, C, and D together.

Retry storm amplification

  • Normal load: 1,000 req/s; Service B degrades and requests start failing
  • Every client retries each failure (say, 3 times): offered load approaches 4,000 req/s
  • A service that was already falling over at 1,000 req/s now faces 4x that
  • B cannot recover until the retries stop, and the retries won't stop until B recovers

Safe retry strategies

Strategy 1: Exponential backoff with jitter


Strategy 2: Don't retry on overload

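A sketch of a retry predicate that treats overload signals as non-retryable; retrying into an overloaded server only feeds the storm (the status classification is a common convention, not a universal rule):

```go
package main

import (
	"fmt"
	"net/http"
)

// shouldRetry decides whether a failed call is worth retrying.
func shouldRetry(status int) bool {
	switch status {
	case http.StatusServiceUnavailable, http.StatusTooManyRequests: // 503, 429
		return false // server says it is overloaded: back off, don't pile on
	case http.StatusBadRequest, http.StatusNotFound: // 4xx won't succeed on retry
		return false
	default:
		// Other 5xx (500/502/504) may be transient and worth one careful retry.
		return status >= 500
	}
}

func main() {
	fmt.Println(shouldRetry(503)) // false
	fmt.Println(shouldRetry(500)) // true
}
```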

Strategy 3: Retry budget

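A retry-budget sketch: retries are capped at a fraction of total requests, so even under total failure, retries add a bounded amount of extra load instead of multiplying traffic (the 10% ratio and `RetryBudget` type are illustrative, and this single-goroutine version omits locking):

```go
package main

import "fmt"

// RetryBudget allows at most ratio * requests retries.
type RetryBudget struct {
	ratio    float64 // e.g. 0.1 allows retries for 10% of requests
	requests float64
	retries  float64
}

func (b *RetryBudget) RecordRequest() { b.requests++ }

// AllowRetry spends budget if any remains; otherwise the caller
// should fail the request rather than retry.
func (b *RetryBudget) AllowRetry() bool {
	if b.retries >= b.requests*b.ratio {
		return false // budget exhausted: no retry storm possible
	}
	b.retries++
	return true
}

func main() {
	b := &RetryBudget{ratio: 0.1}
	for i := 0; i < 100; i++ {
		b.RecordRequest()
	}
	allowed := 0
	for i := 0; i < 50; i++ { // 50 failures all want to retry
		if b.AllowRetry() {
			allowed++
		}
	}
	fmt.Println(allowed) // 10 -- only 10% of 100 requests may be retried
}
```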

Production insight: Google's retry strategy

As described in the Google SRE book's "Handling Overload" chapter:

  • Per-request cap: retry a given request at most a few (about 3) times, then fail it
  • Per-client retry budget: retries limited to roughly 10% of that client's total requests
  • Retry at only one layer of the stack, so retries don't multiply across layers

Key insight box

Retries amplify failures. Safe retry strategies use exponential backoff, jitter, and don't retry on overload signals.

Challenge question

Your service receives a retry storm (1000 req/sec retries on top of 1000 req/sec normal traffic). How do you detect and mitigate it in real-time?


Graceful degradation - failing partially instead of completely

Scenario

Your recommendation service is down. Should your entire product page fail?

Graceful degradation strategies

Strategy 1: Feature flags for non-critical features

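A sketch of degrading via a feature flag: the product page keeps working when the optional recommendations feature is disabled or failing. The in-memory `flags` map stands in for a dynamic config system you can flip during an incident:

```go
package main

import "fmt"

// flags is a stand-in for a dynamic feature-flag store.
var flags = map[string]bool{"recommendations": true}

type ProductPage struct {
	Product         string
	Recommendations []string // nil when the feature is degraded
}

// renderProductPage keeps the critical path (the product itself) alive
// even when the optional recommendations feature is off or erroring.
func renderProductPage(product string, fetchRecs func() ([]string, error)) ProductPage {
	page := ProductPage{Product: product}
	if !flags["recommendations"] {
		return page // degraded: the page still works without recs
	}
	if recs, err := fetchRecs(); err == nil {
		page.Recommendations = recs
	} // on error: omit the section instead of failing the whole page
	return page
}

func main() {
	flags["recommendations"] = false // flipped during the incident
	page := renderProductPage("widget", nil)
	fmt.Println(page.Product, page.Recommendations == nil) // widget true
}
```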

Strategy 2: Fallback values

For example:

  • Recommendations unavailable → fall back to cached best-sellers
  • Shipping-estimate service down → show a conservative default estimate
  • Prefer slightly stale or generic data over an error

Strategy 3: Static fallback content


Strategy 4: Partial responses

  • Return the core fields even when optional sections (reviews, recommendations) time out
  • Mark missing sections so clients can render placeholders instead of erroring

Criticality classification

  • Critical (never degrade): checkout, payment, login
  • Important (degrade only under severe load): search, cart
  • Nice-to-have (first to shed): recommendations, reviews, analytics

Key insight box

Not all features are equally critical. Graceful degradation means failing non-critical features without affecting critical paths.

Challenge question

You've classified features into critical/important/nice-to-have. During an incident, should you proactively disable nice-to-have features to preserve capacity for critical features?


Correlated failures - when everything fails at once

Scenario

All your circuit breakers, timeouts, and bulkheads are perfect.

Then: AWS us-east-1 has a major outage. All regions depend on it for authentication.

Your entire platform is down. No amount of cascade prevention helped.

Correlated failure patterns

Pattern 1: Shared infrastructure

  • Every region authenticates through one region: that region becomes a global single point of failure
  • Other common shared fates: one database, one DNS provider, one config service

Pattern 2: Thundering herd after recovery

  • The dependency recovers; every client reconnects and retries at the same moment
  • The synchronized surge knocks it over again, repeating the outage

Pattern 3: Deployment-induced cascades

  • A bad binary or config rolls out to all instances at once
  • Every replica fails the same way at the same time, so redundancy doesn't help

Defense against correlated failures

Defense 1: Redundant infrastructure

  • Independent authentication per region; no hard cross-region dependencies
  • Diversity in providers: DNS, CDN, cloud regions

Defense 2: Shuffle sharding

  • Assign each customer a small, deterministic subset of instances
  • Two customers rarely share the same full subset, so one customer's overload stays contained
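Shuffle sharding can be sketched as a deterministic hash-based assignment; the hashing scheme here is an illustrative toy, not the AWS algorithm:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// shardFor deterministically assigns a customer a small subset of
// instances. A noisy customer can only overload its own shard,
// and few other customers share that exact shard.
// Assumes shardSize <= totalInstances.
func shardFor(customer string, totalInstances, shardSize int) []int {
	h := fnv.New64a()
	h.Write([]byte(customer))
	seed := h.Sum64()

	shard := make([]int, 0, shardSize)
	seen := make(map[int]bool)
	for len(shard) < shardSize {
		// Simple LCG step to derive a stream of pseudo-random indices.
		seed = seed*6364136223846793005 + 1442695040888963407
		idx := int(seed % uint64(totalInstances))
		if !seen[idx] {
			seen[idx] = true
			shard = append(shard, idx)
		}
	}
	return shard
}

func main() {
	fmt.Println(shardFor("customer-a", 100, 4))
	fmt.Println(shardFor("customer-b", 100, 4))
	// Same customer always maps to the same shard:
	fmt.Println(shardFor("customer-a", 100, 4))
}
```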

Defense 3: Chaos engineering

  • Deliberately inject failures (kill instances, add latency, partition networks) to verify your isolation actually works
  • Run game days rehearsing multi-failure scenarios before they happen for real

Key insight box

Cascades can be prevented, but correlated failures require architectural redundancy. Build systems where failures are independent.

Challenge question

Your system survives individual component failures perfectly. How do you test that it survives multiple simultaneous failures?


Final synthesis - Design a cascade-resistant architecture

Synthesis challenge

You're the architect for a global payment processing platform.

Requirements:

  • 100K transactions/second peak
  • 99.99% uptime SLA (52 minutes downtime/year)
  • Must survive: database failures, region outages, dependency timeouts
  • Payment must succeed or fail deterministically (no "maybe" states)

Constraints:

  • Complex dependency graph: 15 microservices
  • Shared database for transactions (consistency required)
  • Third-party dependencies: fraud detection, KYC, card networks
  • Cannot tolerate retry storms (could double-charge customers)

Your tasks (pause and think)

  1. Draw dependency graph and identify cascade risks
  2. Apply timeout strategy per service
  3. Design circuit breaker policy
  4. Implement bulkheads for critical paths
  5. Define retry policy that won't cause double-charges
  6. Plan graceful degradation for non-critical features
  7. Design defense against correlated failures

Write down your architecture.

Progressive reveal (one possible solution)

1. Dependency graph and cascade risks:

  • Critical path: API gateway → payment orchestrator → transaction database → card network
  • Highest risks: the shared transaction database (shared fate across all 15 services), slow third-party calls (fraud, KYC, card networks), and retry-driven double charges

2. Timeout strategy:

  • One end-to-end deadline per payment (for example 3s), propagated to every downstream call
  • Per-dependency budgets inside it: fraud detection tight (hundreds of milliseconds), the card network getting the largest share
  • Every child timeout strictly shorter than the parent's remaining budget

3. Circuit breaker policy:

  • Breakers on every third-party dependency
  • Open at roughly 50% error rate over a rolling window; half-open probe after 30-60 seconds
  • Fraud breaker open → fall back to conservative rules-based screening or queue for review

4. Bulkheads:

  • Separate connection and worker pools per downstream (card network vs. fraud vs. KYC)
  • Reserved capacity for the payment path; reporting and admin traffic isolated

5. Retry policy:

  • Idempotency key on every payment request, so a retried request can never double-charge
  • Exponential backoff with jitter, a small per-request cap, and a global retry budget
  • Never retry on explicit overload signals (429/503)

6. Graceful degradation:

  • Receipts, notifications, and analytics are asynchronous and may lag
  • A fraud-service outage degrades to rules-based checks, never to skipping fraud entirely
  • The payment/ledger path itself never degrades in correctness

7. Correlated failure defenses:

  • Active-active across at least two regions, each with independent authentication
  • Staggered, canaried deployments so a bad release cannot hit all instances at once
  • Regular chaos testing and game days for multi-failure scenarios
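The retry policy hinges on idempotency keys. A sketch of the mechanism, with an in-memory map standing in for the durable store a real payment system would use:

```go
package main

import "fmt"

// processed remembers results by idempotency key, so a retried
// payment request replays the original result instead of charging twice.
var processed = map[string]string{}

func chargeOnce(idempotencyKey string, charge func() string) string {
	if result, ok := processed[idempotencyKey]; ok {
		return result // retry: replay stored result, no second charge
	}
	result := charge()
	processed[idempotencyKey] = result
	return result
}

func main() {
	charges := 0
	doCharge := func() string { charges++; return "charged $42" }

	fmt.Println(chargeOnce("key-123", doCharge)) // charged $42
	fmt.Println(chargeOnce("key-123", doCharge)) // charged $42 (replayed)
	fmt.Println(charges)                         // 1 -- the retry did not double-charge
}
```

With this in place, retries become safe by construction, which is what makes the backoff-and-budget policy above tolerable in a payment system.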

Key insight box

Cascade prevention in payment systems requires defense-in-depth: timeouts, circuit breakers, bulkheads, careful retry policies, and graceful degradation.

Final challenge question

Your payment system is cascade-resistant. But what if the cascade starts in YOUR system and propagates to your CUSTOMERS' systems? How do you prevent your failures from cascading downstream?


Appendix: Quick checklist (printable)

Cascade prevention design:

  • Map dependency graph (understand what depends on what)
  • Identify shared resources (database, cache, auth)
  • Classify features by criticality (critical, important, nice-to-have)
  • Design failure modes (what should fail together, what should isolate)
  • Plan fallback strategies (cache, default values, omit)

Timeout configuration:

  • Set timeouts on all network calls (no infinite waits)
  • Use hierarchical timeouts (child < parent)
  • Configure per-dependency timeouts (critical vs non-critical)
  • Test timeout behavior (simulate slow dependencies)
  • Monitor timeout frequency (alert on unusual spikes)

Circuit breaker setup:

  • Implement circuit breakers on external dependencies
  • Set failure thresholds (50% error rate typical)
  • Configure timeout before half-open (30-60 seconds)
  • Define fallback behavior (cache, default, error)
  • Monitor circuit state changes (closed, open, half-open)

Bulkhead implementation:

  • Separate thread pools per dependency
  • Separate connection pools per service
  • Size bulkheads based on capacity (don't over-allocate)
  • Monitor bulkhead utilization (alert when full)
  • Test bulkhead isolation (one full bulkhead shouldn't affect others)

Retry strategy:

  • Use exponential backoff (not immediate retry)
  • Add jitter (prevent thundering herd)
  • Don't retry on overload (503, 429)
  • Implement retry budgets (limit retries per time window)
  • Use idempotency keys (prevent duplicate side effects)

Graceful degradation:

  • Define fallback for each non-critical feature
  • Use cached data when fresh data unavailable
  • Return partial responses (not all-or-nothing)
  • Skip optional features under load
  • Communicate degradation to users (if visible)

Monitoring:

  • Track timeout frequency per service
  • Monitor circuit breaker state transitions
  • Alert on bulkhead saturation
  • Measure retry rates (detect retry storms)
  • Track cascading failure metrics (time-to-cascade, blast radius)

Red flags (redesign needed):

  • Cascading failures happen frequently (poor isolation)
  • Same failure affects unrelated features (shared fate)
  • Retry storms observed regularly (bad retry strategy)
  • Circuit breakers always open (dependency not reliable enough)
  • Timeouts too aggressive (false positives) or too lenient (thread exhaustion)

Key Takeaways

  1. Cascading failures occur when one component's failure overloads others — creating a domino effect that brings down the entire system
  2. Circuit breakers stop calling failing services — after a threshold of errors, the circuit opens and fails fast instead of waiting for timeouts
  3. Bulkheads isolate failures — separate thread pools, connection pools, or processes for different dependencies
  4. Timeouts and retries with backoff prevent resource exhaustion — without timeouts, threads pile up waiting for unresponsive services
  5. Load shedding drops excess requests gracefully — returning 503 to some users is better than crashing and returning 503 to all users

Course complete! You've finished all 51 chapters of System Design Advanced.
