Audience: system architects and SREs building resilient distributed systems that must survive partial failures.
This article assumes:
Black Friday, 10 AM. Your recommendation service has a memory leak.
Timeline:
Your entire e-commerce site is down because of a non-critical recommendation feature.
What failed here?
Take 10 seconds.
Answer: (3) - architecture failure.
Bugs happen. The system should contain them, not amplify them.
A campfire in a forest:
Without firebreaks: Small campfire → spreads to tree → spreads to neighboring trees → entire forest burns.
With firebreaks: Small campfire → spreads to tree → hits firebreak (cleared area) → fire contained to small section.
Your distributed system needs firebreaks.
Cascading failures happen when one component's failure propagates to healthy components, creating a domino effect that exceeds the original blast radius.
Name three architectural patterns that would have prevented this cascade. Can you implement all three without significantly increasing latency?
Your system has a dependency graph:
Service B slows down. Who gets affected?
If Service B starts taking 10 seconds instead of 100ms, what happens upstream?
A. Only Service A is affected (direct dependency)
B. API Gateway and Service A are affected (upstream dependencies)
C. Everything upstream fails (Frontend, API Gateway, Service A)
D. Even unrelated services fail (shared resources)
Answer: often C or D. A slow dependency holds threads and connections open in every caller, so the stall climbs the call chain (C), and shared resources such as connection pools or databases drag in services that never call Service B at all (D).
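To make the fan-out concrete, here is a small sketch that walks a dependency graph upstream from a failed service. The graph mirrors the Frontend → API Gateway → Service A → Service B chain above; the dictionary encoding and function name are illustrative assumptions.

```python
from collections import deque

# depends_on[X] = services that X calls (its downstream dependencies).
# Hypothetical graph matching the chain discussed above.
depends_on = {
    "Frontend": ["API Gateway"],
    "API Gateway": ["Service A"],
    "Service A": ["Service B"],
    "Service B": [],
}

def blast_radius(failed: str) -> set:
    """Return every service whose call path reaches the failed one."""
    # Invert the graph: callers_of[X] = services that call X directly.
    callers_of = {svc: [] for svc in depends_on}
    for svc, deps in depends_on.items():
        for dep in deps:
            callers_of[dep].append(svc)
    # Breadth-first search upstream from the failed service.
    affected, queue = set(), deque([failed])
    while queue:
        svc = queue.popleft()
        for caller in callers_of[svc]:
            if caller not in affected:
                affected.add(caller)
                queue.append(caller)
    return affected

print(sorted(blast_radius("Service B")))
# -> ['API Gateway', 'Frontend', 'Service A']
```

In a simple chain like this one, every upstream service is in the blast radius, which is exactly answer C; shared-resource coupling (answer D) does not even appear in the dependency graph, which is what makes it so dangerous.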
Think of cascading failures as:
The cascade pattern:
One slow car on a single-lane road:
One slow component can gridlock an entire system.
Cascading failures exploit dependencies and shared resources. The failure propagates faster than you can respond.
Can you design a system where Service B's failure has ZERO impact on Service A? What would you have to sacrifice?
Service B handles 100 requests/second normally.
It degrades to 10 requests/second (10x slower).
How much does this affect upstream services?
Factor 1: Retry amplification
Factor 2: Timeout amplification
Factor 3: Connection pool amplification
Cascading failures amplify through retries, timeouts, and resource exhaustion. A 10x slowdown can cause a 100% outage.
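The retry-amplification factor can be sketched with back-of-the-envelope arithmetic. The numbers below match the 100 → 10 req/sec example above; the assumption that every failed request is retried three times while the service stays saturated is illustrative.

```python
# Back-of-the-envelope: how retries amplify load on a degraded service.
capacity = 10          # degraded capacity, req/sec
incoming = 100         # normal demand, req/sec
max_retries = 3        # each failed request is retried up to 3 times

# With no retries, the service simply fails (incoming - capacity) req/sec.
failures = incoming - capacity                  # 90 req/sec fail
# Each failure generates up to `max_retries` extra requests, and those
# retries also fail while the service remains saturated.
retry_traffic = failures * max_retries          # 270 extra req/sec
effective_load = incoming + retry_traffic       # 370 req/sec

print(f"demand: {incoming} req/sec, with retries: {effective_load} req/sec")
print(f"overload factor: {effective_load / capacity:.0f}x capacity")
```

A service running at 10% capacity now faces 37x its capacity in offered load, which is why a 10x slowdown so often becomes a total outage rather than a 10x slowdown for users.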
If retries make cascades worse, should you disable retries entirely? What's the right retry strategy?
You're designing a system that must survive partial failures.
What patterns prevent cascades?
Timeout best practices:
Circuit breaker benefits:
Bulkhead benefits:
Load shedding strategies:
Backpressure patterns:
Cascade prevention requires multiple defenses: timeouts (fail fast), circuit breakers (stop trying), bulkheads (isolate), load shedding (reject), and backpressure (slow down).
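As one concrete instance of these defenses, here is a minimal circuit-breaker sketch. The `CircuitBreaker` class, its `call` wrapper, and the thresholds are illustrative assumptions, not a specific library's API.

```python
import time

class CircuitBreaker:
    """Minimal sketch: closed -> open after N failures -> half-open probe."""

    def __init__(self, failure_threshold=5, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout     # seconds before a probe
        self.failures = 0
        self.opened_at = None                  # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                # Open: fail fast instead of hammering a sick dependency.
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None              # half-open: allow one probe
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()   # trip the breaker
            raise
        self.failures = 0                      # success closes the circuit
        return result
```

The key property for cascade prevention is the fast failure while open: callers get an immediate error instead of tying up a thread and a connection for the full timeout.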
You implement all five patterns. Your system is now "failure-proof." What happens when all regions fail simultaneously (correlated failure)?
Service B is overloaded. Clients start seeing errors.
Clients retry. Now Service B gets 2x traffic (original + retries).
Service B falls over completely.
Retries are supposed to help with transient failures. Why do they make overload worse?
Which retry strategy prevents retry storms?
A. Retry immediately on failure
B. Retry with exponential backoff
C. Retry with jitter (randomization)
D. Don't retry on overload errors (503)
Answer: B, C, and D together. Backoff spreads retries out over time, jitter keeps clients from retrying in synchronized waves, and skipping retries on overload signals stops you from feeding more traffic to a service that is already drowning.
Strategy 1: Exponential backoff with jitter
Strategy 2: Don't retry on overload
Strategy 3: Retry budget
Retries amplify failures. Safe retry strategies use exponential backoff, jitter, and don't retry on overload signals.
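The three strategies can be combined in one small client-side wrapper. `OverloadError`, the parameter names, and the defaults below are assumptions for illustration:

```python
import random
import time

class OverloadError(Exception):
    """Dependency signalled overload (e.g. an HTTP 503)."""

def call_with_retries(fn, max_attempts=4, base_delay=0.1, max_delay=2.0,
                      sleep=time.sleep):
    """Safe retries: exponential backoff + full jitter, never on overload."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except OverloadError:
            raise                      # Strategy 2: overload means stop, not retry
        except Exception:
            if attempt == max_attempts - 1:
                raise                  # attempt budget exhausted: surface the error
            # Strategy 1: exponential backoff with full jitter, so that
            # clients don't all retry at the same instant.
            sleep(random.uniform(0, min(max_delay, base_delay * 2 ** attempt)))
```

Note that `max_attempts` is only a per-call budget; a production retry budget (Strategy 3) is typically enforced service-wide, e.g. retries capped at some percentage of total request volume.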
Your service receives a retry storm (1000 req/sec retries on top of 1000 req/sec normal traffic). How do you detect and mitigate it in real-time?
Your recommendation service is down. Should your entire product page fail?
Strategy 1: Feature flags for non-critical features
Strategy 2: Fallback values
Strategy 3: Static fallback content
Strategy 4: Partial responses
Not all features are equally critical. Graceful degradation means failing non-critical features without affecting critical paths.
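Strategies 2 and 3 in miniature: if the non-critical recommendation call fails, ship a static fallback instead of failing the page. The function names and the fallback list are hypothetical:

```python
# Static fallback content for the non-critical recommendations widget.
POPULAR_ITEMS = ["best-seller-1", "best-seller-2"]

def recommendations_with_fallback(fetch, user_id, timeout_s=0.2):
    """Non-critical call: degrade to popular items on any failure."""
    try:
        # Tight timeout: a slow recommendation service must not slow the page.
        return fetch(user_id, timeout=timeout_s)
    except Exception:
        # The critical path (rendering the product page) keeps working;
        # the user sees generic instead of personalized recommendations.
        return POPULAR_ITEMS
```

The tight timeout matters as much as the fallback: without it, the "graceful" degradation still holds a thread for the full duration of the dependency's stall.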
You've classified features into critical/important/nice-to-have. During an incident, should you proactively disable nice-to-have features to preserve capacity for critical features?
All your circuit breakers, timeouts, and bulkheads are perfect.
Then: AWS us-east-1 has a major outage. All regions depend on it for authentication.
Your entire platform is down. No amount of cascade prevention helped.
Pattern 1: Shared infrastructure
Pattern 2: Thundering herd after recovery
Pattern 3: Deployment-induced cascades
Defense 1: Redundant infrastructure
Defense 2: Shuffle sharding
Defense 3: Chaos engineering
Cascades can be prevented, but correlated failures require architectural redundancy. Build systems where failures are independent.
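Shuffle sharding (Defense 2) can be sketched as a deterministic per-customer server assignment; the hashing scheme below is one illustrative way to do it, not the canonical algorithm.

```python
import hashlib

def shard_for(customer_id: str, servers: list, shard_size: int = 2) -> list:
    """Deterministically pick `shard_size` servers for a customer."""
    # Derive a stable per-customer ranking of servers by hashing
    # the (customer, server) pair; same input always gives same shard.
    def rank(server):
        key = f"{customer_id}:{server}".encode()
        return hashlib.sha256(key).hexdigest()
    return sorted(servers, key=rank)[:shard_size]
```

With 8 servers and shards of size 2 there are 28 possible shards, so two customers rarely share the exact same pair; a poison request from one customer takes down its shard, not the whole fleet, and failures stay closer to independent.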
Your system survives individual component failures perfectly. How do you test that it survives multiple simultaneous failures?
You're the architect for a global payment processing platform.
Requirements:
Constraints:
Write down your architecture.
1. Dependency graph and cascade risks:
2. Timeout strategy:
3. Circuit breaker policy:
4. Bulkheads:
5. Retry policy:
6. Graceful degradation:
7. Correlated failure defenses:
Cascade prevention in payment systems requires defense-in-depth: timeouts, circuit breakers, bulkheads, careful retry policies, and graceful degradation.
Your payment system is cascade-resistant. But what if the cascade starts in YOUR system and propagates to your CUSTOMERS' systems? How do you prevent your failures from cascading downstream?
Cascade prevention design:
Timeout configuration:
Circuit breaker setup:
Bulkhead implementation:
Retry strategy:
Graceful degradation:
Monitoring:
Red flags (redesign needed):