Pushing your system past its limits to find the breaking point — because you'd rather discover it in a test than during Black Friday.
Stress testing pushes the system beyond normal limits to find breaking points.
Normal load: 1,000 users
Stress test: 10,000 → 50,000 → 100,000 users
Goal: find where and how the system fails
Like testing how much weight a bridge can hold:
Design capacity: 100 cars
Stress test:
✓ 150 cars → Bridge holding ✅
✓ 200 cars → Bridge holding ✅
✓ 250 cars → Bridge sagging ⚠️
✗ 300 cars → Bridge collapses! 💥
Now you know: the safety limit is 250 cars
Can post warning signs at 200 cars
Ramp test: gradually increase the load:
Hour 1: 1,000 users → Response time: 100ms ✅
Hour 2: 5,000 users → Response time: 150ms ✅
Hour 3: 10,000 users → Response time: 200ms ✅
Hour 4: 20,000 users → Response time: 500ms ⚠️
Hour 5: 30,000 users → Response time: 2000ms ❌ Error rate: 5% ❌
Breaking point: ~25,000 concurrent users
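Here's what that ramp could look like as a k6 script (a minimal sketch: the URL is a placeholder, the durations are compressed from hours to minutes, and the thresholds are example values):

```javascript
import http from 'k6/http';
import { check, sleep } from 'k6';

// Ramp through the same load levels as the table above.
// Real ramp tests often hold each level much longer.
export const options = {
  stages: [
    { duration: '5m', target: 1000 },  // normal load
    { duration: '5m', target: 5000 },
    { duration: '5m', target: 10000 },
    { duration: '5m', target: 20000 }, // degradation expected here
    { duration: '5m', target: 30000 }, // breaking point expected here
  ],
  thresholds: {
    http_req_duration: ['p(95)<1000'], // flag runs where p95 latency exceeds 1s
    http_req_failed: ['rate<0.01'],    // flag runs where more than 1% of requests fail
  },
};

export default function () {
  const res = http.get('https://test.example.com/'); // placeholder target
  check(res, { 'status is 200': (r) => r.status === 200 });
  sleep(1);
}
```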
Spike test: a sudden traffic spike:
Normal: 1,000 users
↓
Spike: 50,000 users (in 10 seconds!)
↓
Observe: how does the system handle the sudden surge?
Scenarios:
✅ Good: System auto-scales, handles load
⚠️ OK: System slows down but recovers
❌ Bad: System crashes, needs manual restart
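In k6, a spike is just a stage with a very short duration (a sketch; the URL is a placeholder):

```javascript
import http from 'k6/http';
import { sleep } from 'k6';

// Spike test: jump from 1,000 to 50,000 VUs in 10 seconds,
// hold briefly, then drop back and watch for recovery.
export const options = {
  stages: [
    { duration: '2m', target: 1000 },   // normal load
    { duration: '10s', target: 50000 }, // sudden spike
    { duration: '3m', target: 50000 },  // sustain the surge
    { duration: '10s', target: 1000 },  // drop back
    { duration: '5m', target: 1000 },   // does the system recover?
  ],
};

export default function () {
  http.get('https://test.example.com/'); // placeholder target
  sleep(1);
}
```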
Soak test: sustained load over a long period:
Load: 10,000 users
Duration: 48 hours continuous
Watch for:
❌ Memory leaks (RAM usage grows over time)
❌ Connection pool exhaustion
❌ Disk space issues (logs filling up)
❌ Database connection leaks
✅ System remains stable
Example finding:
Hour 1: Memory usage: 2GB
Hour 12: Memory usage: 4GB
Hour 24: Memory usage: 6GB ⚠️
Memory leak detected! Usage grows ~2GB every 12 hours.
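A soak test is the same kind of script with one long, flat stage; a leak like the one above shows up in server-side monitoring, not in the script output (sketch; placeholder URL):

```javascript
import http from 'k6/http';
import { sleep } from 'k6';

// Soak test: hold 10,000 VUs for 48 hours and watch server metrics
// (RAM, connection pools, disk) for slow degradation.
export const options = {
  stages: [
    { duration: '30m', target: 10000 }, // ramp up
    { duration: '48h', target: 10000 }, // sustained load
    { duration: '30m', target: 0 },     // ramp down
  ],
};

export default function () {
  http.get('https://test.example.com/'); // placeholder target
  sleep(1);
}
```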
Scenario: major news event (election, sports final)
Normal traffic: 10,000 concurrent users
Expected surge: 200,000 concurrent users
Stress Test Setup:
Tool: Apache JMeter, Gatling, or k6
Test script (ramp up to peak load):
Start: 10,000 virtual users
Every minute: Add 20,000 users
Peak: 200,000 users
Duration: 30 minutes at peak
Ramp down: Gradual decrease
User Behavior:
Load homepage (80% of requests)
Read article (15% of requests)
Post comment (5% of requests)
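A k6 sketch of this plan, with the ramp built in a loop and each virtual user picking an action using the 80/15/5 weighting above (the site URL, endpoints, and comment payload are all placeholders):

```javascript
import http from 'k6/http';
import { sleep } from 'k6';

// Start at 10,000 VUs and add ~20,000 per minute up to 200,000.
const stages = [{ duration: '1m', target: 10000 }];
for (let target = 30000; target <= 190000; target += 20000) {
  stages.push({ duration: '1m', target });
}
stages.push({ duration: '1m', target: 200000 });  // reach peak
stages.push({ duration: '30m', target: 200000 }); // 30 minutes at peak
stages.push({ duration: '5m', target: 0 });       // gradual ramp down

export const options = { stages };

const BASE = 'https://news.example.com'; // placeholder site

export default function () {
  const r = Math.random();
  if (r < 0.8) {
    http.get(`${BASE}/`); // 80%: load homepage
  } else if (r < 0.95) {
    http.get(`${BASE}/article/123`); // 15%: read article
  } else {
    // 5%: post comment (hypothetical endpoint and payload)
    http.post(
      `${BASE}/comments`,
      JSON.stringify({ articleId: 123, text: 'Great article!' }),
      { headers: { 'Content-Type': 'application/json' } }
    );
  }
  sleep(1);
}
```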
Metrics to Monitor:
Performance: response time, error rate, throughput
Resources: CPU, memory, database connections
Results:
Phase 1: 0-50,000 users
Response time: 100-200ms ✅
CPU: 40% ✅
Memory: Stable ✅
Errors: 0% ✅
Phase 2: 50,000-100,000 users
Response time: 200-400ms ⚠️
CPU: 70% ⚠️
Memory: Stable ✅
Errors: 0.1% ⚠️
Phase 3: 100,000-150,000 users
Response time: 400-1000ms ❌
CPU: 95% ❌
Memory: Growing slowly ⚠️
Errors: 2% ❌
Database: Connection pool exhausted! 💥
Phase 4: 150,000+ users
Response time: >5000ms or timeout ❌
CPU: 100% (maxed) ❌
Memory: OOM errors ❌
Errors: 25% ❌
Database: Deadlocks occurring ❌
Finding: System breaks at ~120,000 concurrent users
Bottlenecks Identified:
Database connection pool too small (max 100 connections)
Application servers CPU-bound
No caching layer for homepage
No rate limiting
Recommendations:
✅ Increase database connection pool to 500
✅ Add Redis caching for homepage
✅ Scale to 5 application servers
✅ Implement CDN for static assets
✅ Add rate limiting (100 req/min per IP)
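The rate-limiting recommendation could look like this if the application were a Node.js/Express service (an assumption; the stack isn't specified), using the express-rate-limit package:

```javascript
import express from 'express';
import rateLimit from 'express-rate-limit';

const app = express();

// 100 requests per minute per IP, matching the recommendation above.
app.use(rateLimit({
  windowMs: 60 * 1000, // 1-minute window
  max: 100,            // limit each IP to 100 requests per window
}));

app.get('/', (req, res) => res.send('homepage'));
app.listen(3000);
```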
After fixes, re-test:
New capacity: 300,000 concurrent users! ✅
Popular stress testing tools:
Apache JMeter
Gatling
k6 (by Grafana)
Locust (Python)
Example k6 script:
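A minimal sketch (the URL and thresholds are placeholder values):

```javascript
import http from 'k6/http';
import { check, sleep } from 'k6';

export const options = {
  stages: [
    { duration: '2m', target: 100 }, // ramp up to 100 VUs
    { duration: '5m', target: 100 }, // hold
    { duration: '2m', target: 0 },   // ramp down
  ],
  thresholds: {
    http_req_duration: ['p(95)<500'], // fail the run if p95 latency > 500ms
    http_req_failed: ['rate<0.01'],   // fail the run if error rate > 1%
  },
};

export default function () {
  const res = http.get('https://test.example.com/'); // placeholder URL
  check(res, { 'status is 200': (r) => r.status === 200 });
  sleep(1);
}
```

Run it with `k6 run script.js` and scale the stage targets up for a real stress test.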
Key characteristics:
Pushes the system well beyond its expected load
Aims to find the breaking point, not to verify normal performance
Observes how the system fails and whether it recovers gracefully