Chaos testing (Chaos Engineering) is the practice of deliberately breaking things in production to prove the system can handle failures. Netflix pioneered it with Chaos Monkey.
Philosophy: "Don't wait for things to break. Break them yourself and see what happens!"
Like fire drills in a building:
Don't wait for a real fire to test the evacuation plan!
Chaos Test:
Pull fire alarm (simulate disaster)
Observe: Do people know where to go?
Measure: How long does evacuation take?
Find: Are there bottlenecks?
Improve: Fix issues before real emergency
Test your disaster response BEFORE disaster strikes!
Netflix's Chaos Monkey: Randomly terminates instances in production
Why?
Forces engineers to build resilient systems
Identifies weak points
Proves system can handle failures
No surprises during real outages
Philosophy: "If we never test our backup systems, how do we know they work when we need them?"
Experiment: "Random Server Crash"
Hypothesis: "If one application server crashes, the system should remain available with no customer impact"
Steady State:
All servers healthy
Response time: 100ms
Error rate: 0%
Introduce Chaos: → Kill a random application server (simulate hardware failure)
Observe:
✅ Load balancer detects failure (5 seconds)
✅ Traffic rerouted to healthy servers
✅ Response time: 120ms (slight increase)
✅ Error rate: 0% (no errors!)
⚠️ Auto-scaling triggered (new instance launched)
Result: System resilient! Experiment confirms hypothesis ✅
Learnings:
Health check interval: 5 seconds (acceptable)
Auto-scaling works
No manual intervention needed
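To confirm the hypothesis you need numbers for the steady state before, during, and after the experiment. A tiny probe like the sketch below can sample response time and error rate; the health URL is an assumption:

```python
# steady_state_probe.py: sample latency and error rate for the steady state.
# The URL is hypothetical; point it at whatever endpoint defines "healthy".
import time
import requests

URL = "https://example.com/api/health"
SAMPLES = 50

errors = 0
latencies = []
for _ in range(SAMPLES):
    start = time.monotonic()
    try:
        resp = requests.get(URL, timeout=3)
        if resp.status_code >= 500:
            errors += 1
    except requests.RequestException:
        errors += 1
    latencies.append((time.monotonic() - start) * 1000)  # milliseconds

print(f"avg latency: {sum(latencies) / len(latencies):.0f}ms")
print(f"error rate: {errors / SAMPLES:.1%}")
```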
Experiment: "Network Split"
Setup: Microservices architecture
Frontend → API Gateway → User Service / Order Service / Payment Service
Introduce Chaos:
→ Block network traffic between Order Service and Payment Service (simulate network failure)
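On Linux hosts, one way to simulate this split is an iptables DROP rule, as in the sketch below (requires root; the payment service IP is a hypothetical example, and the rule must be removed afterwards):

```python
# network_split.py: block traffic from Order Service to Payment Service.
# Linux-only sketch; the IP address is an assumption.
import subprocess

PAYMENT_SERVICE_IP = "10.0.3.42"

def split():
    # Drop all outbound packets destined for the payment service.
    subprocess.run(
        ["iptables", "-A", "OUTPUT", "-d", PAYMENT_SERVICE_IP, "-j", "DROP"],
        check=True,
    )

def heal():
    # Delete the rule to restore connectivity.
    subprocess.run(
        ["iptables", "-D", "OUTPUT", "-d", PAYMENT_SERVICE_IP, "-j", "DROP"],
        check=True,
    )
```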
Observe:
❌ Bad Design:
Order Service: Tries to call Payment
Payment Service: (unreachable)
Order Service: Hangs for 60 seconds
User: Sees timeout error 💥
Shopping cart: Lost! 💥
✅ Good Design:
Order Service: Tries to call Payment
Payment Service: (unreachable)
Order Service: Timeout after 3 seconds
Order Service: Enqueues order for retry
Order Service: Returns "Order processing" to user
User: Sees confirmation "We're processing your order"
Background job: Retries payment when network restored
Result: Bad design exposed by chaos testing! Fix: implement a circuit breaker and async processing
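One way to get the good-design behavior is a short timeout plus a circuit breaker in front of the payment call, falling back to a retry queue. A minimal sketch, with all names hypothetical and the payment call stubbed out to simulate the outage:

```python
# circuit_breaker.py: timeout + circuit breaker + retry queue (sketch).
import queue
import time

retry_queue = queue.Queue()

def charge_payment(order, timeout):
    # Stand-in for the real payment call; simulates the unreachable service.
    raise TimeoutError("payment service unreachable")

class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after=30.0):
        self.max_failures = max_failures  # consecutive failures before opening
        self.reset_after = reset_after    # seconds before a trial call is allowed
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # half-open: let one trial call through
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0  # success closes the circuit
        return result

breaker = CircuitBreaker()

def place_order(order):
    try:
        breaker.call(charge_payment, order, timeout=3)
        return "Order confirmed"
    except Exception:
        retry_queue.put(order)  # background job retries when network heals
        return "We're processing your order"

print(place_order({"id": 1}))  # -> "We're processing your order"
```

After three failures the breaker fails fast instead of letting every request hang for its full timeout, which is what protects the user experience during the split.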
Experiment: "Slow Database"
Normal: Database queries return in 10ms
Chaos: Add 1000ms latency to all database calls
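One way to inject that latency on the database host is Linux netem, as in this sketch (requires root; `eth0` is an assumption about the interface name):

```python
# slow_db.py: add 1000 ms of latency on the DB host's network interface.
# Linux-only sketch using tc/netem.
import subprocess

subprocess.run(
    ["tc", "qdisc", "add", "dev", "eth0", "root", "netem", "delay", "1000ms"],
    check=True,
)
# To undo after the experiment:
#   tc qdisc del dev eth0 root netem
```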
Observe:
Endpoints affected:
GET /api/users → 1050ms (was 50ms) ❌
GET /api/products → 1020ms (was 30ms) ❌
GET /api/health → 5ms ✅ (doesn't hit DB)
Cascading effects:
Request queues backing up
Thread pool exhaustion
Memory usage increasing
Timeouts occurring
Improvements needed:
✅ Add database read replicas
✅ Implement caching layer
✅ Add query timeouts
✅ Circuit breaker for DB calls
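A query timeout plus a cache fallback covers two of these fixes at once. A sketch, assuming Postgres via psycopg2; the connection string, table, and in-memory "cache" are stand-ins:

```python
# db_timeout.py: query timeout with stale-cache fallback (sketch).
import psycopg2

# statement_timeout makes Postgres abort queries that exceed 200 ms.
conn = psycopg2.connect("dbname=shop", options="-c statement_timeout=200")
cache = {}  # stand-in for a real cache such as Redis

def get_products():
    try:
        with conn.cursor() as cur:
            cur.execute("SELECT id, name FROM products")
            rows = cur.fetchall()
        cache["products"] = rows  # refresh the cache on success
        return rows
    except psycopg2.errors.QueryCanceled:
        conn.rollback()  # clear the aborted transaction
        return cache.get("products", [])  # serve stale data instead of hanging
```

Failing fast at 200ms and serving stale data keeps threads free, which prevents the queue backup and thread pool exhaustion seen above.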
Experiment: "Memory Leak"
Introduce Chaos: → Gradually consume memory (simulate a leak)
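Simulating the leak can be as simple as a loop that retains memory; a sketch, with an arbitrary growth rate:

```python
# memory_hog.py: simulate a gradual memory leak for the experiment.
import time

leaked = []
while True:
    leaked.append(bytearray(50 * 1024 * 1024))  # retain ~50 MB more
    print(f"holding ~{len(leaked) * 50} MB")
    time.sleep(10)  # grow slowly so monitoring and alerts have time to react
```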
Observe timeline:
T+0: Memory: 2GB / 8GB (25%)
T+10: Memory: 4GB / 8GB (50%)
T+20: Memory: 6GB / 8GB (75%) ⚠️
T+25: Memory: 7GB / 8GB (87%) ⚠️
→ Alerts triggered ✅
→ On-call engineer paged ✅
T+30: Memory: 8GB / 8GB (100%) ❌
→ Process killed by OS ❌
→ No graceful degradation ❌
What should happen:
✅ Alert at 75% memory usage
✅ Graceful degradation (reject new requests)
✅ Auto-restart before OOM
✅ Load balancer health check fails
✅ Traffic diverted to healthy instances
Improvements:
✅ Implement memory monitoring
✅ Add graceful shutdown
✅ Configure OOM killer properly
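A memory watchdog that degrades gracefully and fails the health check before the OOM killer fires could look like this sketch (uses psutil and Flask; the thresholds and endpoint names are assumptions):

```python
# memory_guard.py: reject new work and fail health checks under memory pressure.
import psutil
from flask import Flask

app = Flask(__name__)
DEGRADE_AT = 75  # percent: start rejecting new work
FAIL_AT = 90     # percent: report unhealthy so the load balancer drains us

@app.route("/api/health")
def health():
    used = psutil.virtual_memory().percent
    if used >= FAIL_AT:
        return "unhealthy", 503  # load balancer diverts traffic
    return "ok", 200

@app.route("/api/orders", methods=["POST"])
def create_order():
    if psutil.virtual_memory().percent >= DEGRADE_AT:
        return "try again later", 429  # graceful degradation
    return "order accepted", 202
```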
Chaos Monkey (Netflix)
Randomly terminates instances
Production testing
AWS-focused
Chaos Toolkit
Declarative chaos experiments
Multiple platforms
Hypothesis-driven
Gremlin
Commercial chaos engineering platform
CPU, memory, network, disk attacks
Safe rollback
Litmus (Kubernetes)
Chaos for Kubernetes
Pod failures, network chaos
GitOps-friendly
Example Chaos Toolkit experiment:
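A sketch of such an experiment in YAML, assuming the chaostoolkit-aws extension is installed; the health URL and the tag filter are hypothetical:

```yaml
# experiment.yaml: kill one tagged instance, verify the API stays healthy.
version: 1.0.0
title: System survives losing one application server
description: Terminate a random tagged EC2 instance and verify availability.
steady-state-hypothesis:
  title: API responds with 200
  probes:
    - type: probe
      name: api-health-check
      tolerance: 200            # expected HTTP status code
      provider:
        type: http
        url: https://example.com/api/health
method:
  - type: action
    name: terminate-random-instance
    provider:
      type: python
      module: chaosaws.ec2.actions
      func: terminate_instance
      arguments:
        filters:
          - Name: tag:chaos-target   # hypothetical opt-in tag
            Values: ["true"]
rollbacks: []
```

Running `chaos run experiment.yaml` checks the steady-state hypothesis, applies the method, then checks the hypothesis again to decide whether the system deviated.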
Rolling out chaos testing safely is a staged progression:
Stage 1: Sandbox Testing
Environment: Local/Dev
Impact: Zero user impact
Goal: Learn the tools
Stage 2: Staging Testing
Environment: Staging/QA
Impact: Zero user impact
Goal: Develop experiments
Stage 3: Production Testing (Off-hours)
Environment: Production
Time: 2 AM, low traffic
Impact: Minimal user exposure
Goal: Validate in real environment
Stage 4: Production Testing (Business hours)
Environment: Production
Time: Normal hours
Impact: Real user exposure
Goal: Prove resilience under realistic conditions
Note: Only reach Stage 4 after building confidence in Stages 1-3!
Key characteristics: