Game Days and DiRT Testing (Chaos Engineering in Production)
Audience: site reliability engineers and platform teams running critical production systems.
This article assumes:
- Your system will fail in production - it's a question of when, not if.
- Incidents happen at the worst possible times: holidays, weekends, during launches.
- Your runbooks are outdated: they describe the system from 6 months ago.
- Team knowledge is unevenly distributed: only 2 people know how the payment flow really works.
Challenge: Your perfectly designed system fails anyway
Scenario
It's 3 AM. Your payment processor is down. The on-call engineer:
- Has never seen this failure mode before
- Can't find the runbook
- Doesn't know who owns the payment service
- Accidentally makes it worse by restarting the wrong components
Your monitoring says everything is green. Customers say they can't pay.
Interactive question (pause and think)
What failed here?
1. The monitoring system
2. The on-call engineer's skills
3. The organization's preparation for failure
4. All of the above
Take 10 seconds.
Progressive reveal (question -> think -> answer)
Answer: (3), which contributes to (4).
The engineer is competent. The monitoring works (for known failure modes). But the organization never practiced this scenario.
Real-world analogy (fire drills)
Buildings don't just have fire extinguishers and exit signs. They run fire drills. Why? Because panic creates tunnel vision, people forget procedures, and muscle memory matters during chaos.
Your production system is no different.
Key insight box
Game Days and DiRT (Disaster Recovery Testing) are controlled chaos experiments that expose gaps before real incidents do.
Challenge question
If you could only practice one failure scenario per quarter, which category would reveal the most organizational gaps: network failures, data corruption, or complete region outages?
Mental model - Failure is a learning opportunity, not a risk
Scenario
Your VP asks: "Why would we intentionally break production?"
Two philosophies collide:
- Traditional: minimize all risk, avoid disruption at all costs
- Resilience engineering: controlled small failures prevent catastrophic large failures
Interactive question (pause and think)
Which statement reflects reality?
A. "Our system is too critical to intentionally break."
B. "If we're afraid to test failures, we're not ready for real failures."
C. "Chaos engineering is just for Netflix and Amazon."
Progressive reveal
Answer: B.
- A reflects risk aversion that leads to fragility.
- C is a dangerous myth - chaos engineering scales to any system that must be reliable.
Mental model
Think of Game Days as:
- Rehearsals before the performance
- Flight simulators for your engineering team
- Stress tests for both systems and people
The goal isn't to break things - it's to learn what breaks, how teams respond, and where documentation/automation gaps exist.
Real-world parallel (hospital emergency drills)
Hospitals run mass casualty drills. They don't wait for a real disaster to find out that:
- The backup generator is in the wrong location
- Nurses don't know the evacuation protocol
- The emergency contact list is outdated
Your production system deserves the same rigor.
Key insight box
Game Days reveal unknown unknowns. Runbooks describe known knowns. The gap between them is where incidents live.
Challenge question
What's more dangerous: a failure you've practiced recovering from, or a failure you've never seen but your monitoring claims to detect?
What "Game Days" vs "DiRT" vs "Chaos Engineering" actually mean
Scenario
Your team debates what to call these exercises. The terms are often confused.
Definitions
Game Days:
- Broader organizational exercise
- Includes incident response, communication, decision-making
- Often involves multiple teams
- Simulates complete failure scenarios (e.g., "AWS us-east-1 is down")
DiRT (Disaster Recovery Testing):
- Focuses on disaster recovery capabilities
- Tests: backups, failover, data restoration, region evacuation
- Validates recovery time objectives (RTO) and recovery point objectives (RPO)
- Often run during scheduled maintenance windows
Chaos Engineering:
- Continuous, automated failure injection
- Hypothesis-driven experiments
- Often runs in the production environment
- Smaller blast radius, higher frequency
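The difference is easiest to see in code: chaos engineering treats each injection as a hypothesis with a budgeted blast radius and an abort threshold. A minimal sketch - all field names are illustrative, not taken from any particular tool:

```python
from dataclasses import dataclass

@dataclass
class ChaosExperiment:
    """One hypothesis-driven chaos experiment (field names are illustrative)."""
    hypothesis: str          # what we believe the system will do
    failure_injected: str    # e.g. "terminate 1 of 5 cache nodes"
    blast_radius_pct: float  # share of traffic allowed to be affected
    abort_threshold: float   # error rate at which we stop immediately

    def should_abort(self, observed_error_rate: float) -> bool:
        # Abort the moment observed impact exceeds what we budgeted for.
        return observed_error_rate > self.abort_threshold

exp = ChaosExperiment(
    hypothesis="Killing one cache node keeps checkout errors under 0.5%",
    failure_injected="terminate 1 of 5 cache nodes",
    blast_radius_pct=1.0,
    abort_threshold=0.005,
)
```

The point is the shape, not the class: every experiment states up front what it expects and when it must stop.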
Visual: scope and frequency
Interactive question
Your payment service handles $10M/hour. Which approach do you start with?
1. Monthly Game Days with full region failover
2. DiRT testing quarterly during maintenance windows
3. Automated chaos experiments on 1% traffic daily
Progressive reveal
Answer: Start with (2), graduate to (3), use (1) for organizational readiness.
Starting with Game Days (full region failover) on a critical service without a DiRT foundation is reckless.
Key insight box
Start small, automate, then scale. Chaos maturity is a journey: manual → scheduled → continuous.
Challenge question
What happens if you run chaos experiments but never practice the human incident response process?
Core components of a Game Day
Scenario
You're planning your first Game Day. What do you actually need?
Typical components:
- Scenario design (what will we break?)
- Blast radius definition (how much can we afford to impact?)
- Success criteria (what does "handled well" look like?)
- Observers and facilitators (who's running the exercise?)
- Communication plan (how do we coordinate?)
- Rollback plan (when do we stop?)
- Post-mortem and remediation tracking
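Several of these components can be enforced mechanically before anyone touches production. A small sketch of a plan checklist - the required keys are an assumption modeled directly on the list above:

```python
# Required Game Day components, mirroring the checklist above (names assumed).
REQUIRED_COMPONENTS = {
    "scenario", "blast_radius", "success_criteria",
    "facilitator", "communication_plan", "rollback_plan", "postmortem_owner",
}

def missing_components(plan: dict) -> set:
    """Return which required Game Day components the plan has not defined."""
    return {key for key in REQUIRED_COMPONENTS if not plan.get(key)}

plan = {
    "scenario": "Primary database fails over to replica",
    "blast_radius": "staging cluster only",
    "success_criteria": "failover completes in < 5 min, zero data loss",
    "rollback_plan": "abort if replica lag > 60s",
}
```

Refuse to schedule the exercise until `missing_components(plan)` is empty.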
Visual: Game Day phases
Interactive question
You're designing a Game Day: "Primary database fails over to replica."
Which element is most critical to define first?
A. The exact failure mechanism (network partition vs process crash)
B. Rollback criteria (when to abort)
C. Success metrics (what good looks like)
Progressive reveal
Answer: B, then C, then A.
Without rollback criteria, you risk turning a Game Day into a real incident. Without success metrics, you can't learn. The failure mechanism matters less than knowing when to stop.
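Rollback criteria only protect you if they are wired into the exercise itself. A minimal sketch of an injection wrapper that always rolls back, with `inject` and `poll_error_rate` standing in for whatever tooling you actually use (hypothetical hooks):

```python
def run_with_abort(inject, poll_error_rate, abort_above=0.01, checks=5):
    """Drive a failure injection, but stop the moment error rate exceeds budget.

    `inject` and `poll_error_rate` are stand-ins for your own tooling.
    """
    inject(start=True)
    try:
        for _ in range(checks):
            if poll_error_rate() > abort_above:
                return "aborted"
        return "completed"
    finally:
        inject(start=False)  # the injection is always rolled back, even on abort

# Demo with fake hooks: error rate spikes on the third check.
state = {"injecting": False}
rates = iter([0.002, 0.003, 0.05])
result = run_with_abort(lambda start: state.update(injecting=start),
                        lambda: next(rates))
```

The `finally` clause is the rollback criterion made executable: no code path leaves the failure injected.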
Failure catalog (organizational reality)
At scale, Game Days expose:
- Runbook rot: documentation is 3 versions behind
- Tribal knowledge: only senior engineers know critical workflows
- Coordination gaps: teams don't know who to call
- Alert fatigue: important alerts buried in noise
- Hidden dependencies: service A fails, but service C dies (undocumented coupling)
- Tooling friction: dashboards don't show the right data
- Authority confusion: who has permission to make rollback decisions?
Key insight box
Game Days let you fail fast in a controlled environment so real incidents don't make you fail slowly in production.
Challenge question
If your Game Day runs perfectly and nothing breaks, what went wrong with your scenario design?
Designing safe chaos experiments - the blast radius game
Scenario
You want to test "what if Kafka loses a partition."
But your Kafka cluster handles:
- User authentication events (critical)
- Analytics events (non-critical)
- Payment authorizations (ultra-critical)
- Recommendation updates (non-critical)
Think about it
How do you test Kafka failures without taking down payments?
Interactive question (pause and think)
What's the right blast radius strategy?
1. Test in production, but only on 0.1% of traffic
2. Test in staging with full traffic
3. Test in production, but only on non-critical topics
4. Shadow test: duplicate traffic to an isolated Kafka cluster
Pause.
Progressive reveal
Answer depends on maturity, but typically: start with (2), then (4), then (3), finally (1).
Blast radius control techniques
Traffic-based:
- Canary: 1% → 5% → 25% → 100%
- Geography: test in low-traffic regions first
- User segment: employees, beta users, then general population
Infrastructure-based:
- Shadow environments (duplicate traffic, isolated infrastructure)
- Feature flags (degrade gracefully without full failure)
- Read replicas (test non-critical reads, not writes)
Time-based:
- Business hours vs off-hours
- Low-traffic days (Sunday 2 AM)
- Scheduled maintenance windows
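The canary progression above can be automated so that expansion never outruns the health signal. A sketch, with both callbacks as placeholders for your real traffic-shifting and metrics APIs:

```python
def canary_ramp(set_traffic_pct, healthy, steps=(1, 5, 25, 100)):
    """Expand the blast radius one step at a time; retreat to 0% on the first
    bad signal. Both callbacks are placeholders for your own APIs."""
    for pct in steps:
        set_traffic_pct(pct)
        if not healthy():
            set_traffic_pct(0)  # retreat: blast radius back to zero
            return pct          # the step at which we stopped
    return steps[-1]            # full rollout reached safely

# Demo with fake hooks: health degrades at the 25% step.
seen = []
checks = iter([True, True, False])
stopped_at = canary_ramp(seen.append, lambda: next(checks))
```

Note the asymmetry: expansion is gradual, retreat is immediate - that is blast radius discipline in four lines.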
Visual: blast radius expansion
Production insight
Netflix's Chaos Monkey started simple: randomly kill instances, nothing more sophisticated. They didn't start with region-level failures.
Key insight box
Blast radius discipline is not cowardice - it's engineering rigor. Start small, measure, expand.
Challenge question
How would you design a chaos experiment for a system where "safe" and "critical" paths share the same database?
DiRT testing - validating disaster recovery for real
Scenario
Your RTO (Recovery Time Objective) SLA says "4 hours to restore service."
But when was the last time you actually tested a full restore from backup?
Interactive question (pause and think)
Your database backup runs nightly. Which statement is true?
A. "Backups that run successfully are restorable."
B. "Backups must be tested separately from creation."
C. "Backup validation is too risky to do often."
Progressive reveal
Answer: B.
Countless companies have learned the hard way: successful backups ≠ restorable backups.
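The test is simple to state: a backup only counts once you have opened it and checked it against the source. A minimal, runnable sketch using SQLite as a stand-in for your real database (the `orders` table is invented for the demo):

```python
import os
import sqlite3
import tempfile

def backup_and_verify(src_path: str, backup_path: str) -> bool:
    """Take a backup AND prove it is restorable: open the copy read-only,
    run an integrity check, and compare row counts against the source."""
    src = sqlite3.connect(src_path)
    dst = sqlite3.connect(backup_path)
    src.backup(dst)  # create the backup
    dst.close()

    # Verification: a backup that merely "ran successfully" proves nothing.
    check = sqlite3.connect(f"file:{backup_path}?mode=ro", uri=True)
    intact = check.execute("PRAGMA integrity_check").fetchone()[0] == "ok"
    same_rows = (src.execute("SELECT COUNT(*) FROM orders").fetchone()
                 == check.execute("SELECT COUNT(*) FROM orders").fetchone())
    check.close()
    src.close()
    return intact and same_rows

# Demo on a throwaway database with an assumed `orders` table.
workdir = tempfile.mkdtemp()
db = os.path.join(workdir, "prod.db")
bak = os.path.join(workdir, "prod.backup.db")
conn = sqlite3.connect(db)
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, total REAL)")
conn.executemany("INSERT INTO orders (total) VALUES (?)", [(9.99,), (42.0,)])
conn.commit()
conn.close()
```

For a production database the mechanics differ (restore to a scratch instance, replay to a point in time), but the principle is identical: the restore path is tested code, not an assumption.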
DiRT testing dimensions
Data recovery:
- Full restore from backup
- Point-in-time recovery (PITR)
- Corrupted data rollback
- Incremental vs full backup validation
Infrastructure recovery:
- Region failover (active-passive)
- Zone failover within region
- Bare-metal restore (for on-prem)
- Kubernetes cluster rebuild
Application recovery:
- Stateless service redeploy
- Stateful service migration
- Database schema migration after recovery
- Configuration restore (IaC validation)
Organizational recovery:
- Runbook walkthrough
- Incident command structure
- Communication tree (who calls whom?)
- Third-party vendor coordination
Visual: DiRT test matrix
Real-world DiRT failure stories
The backup that wasn't:
A company had 7 years of database backups. During a ransomware attack, it discovered:
- Backup files were corrupted (silent failures for 2 years)
- Restore scripts referenced decommissioned servers
- No one had actually restored in 4 years
The runbook fiction:
"Step 3: SSH to the failover database."
Problem: failover database IP changed 6 months ago, runbook never updated.
The permission trap:
During a region failover drill, the team discovered that junior on-call engineers lacked the AWS IAM permissions to create Route53 records.
Key insight box
DiRT testing answers: "Can we actually do what we claim we can do?" Most organizations are surprised by the answer.
Testing strategy (pragmatic)
Challenge question
Your DiRT test succeeds: you restored from backup in 30 minutes. But your RTO is 4 hours. Should you update your SLA to 30 minutes, or keep the 4-hour buffer? Why?
The "we'll just manually fix it" fallacy
Scenario
During a Game Day, the team manually recovers in 20 minutes.
Engineers conclude: "We're good. Our runbook works."
Interactive question (pause and think)
What's wrong with this conclusion?
1. Nothing - manual recovery is fine if it's fast
2. Manual steps don't scale to 3 AM on a holiday
3. A different on-call engineer might not know the trick
4. Both 2 and 3
Progressive reveal
Answer: 4.
The automation maturity ladder
Level 0: No runbook
- Recovery depends on who's on call
- Institutional knowledge in people's heads
- Each incident is novel
Level 1: Documentation exists
- Runbook with manual steps
- Better than nothing, but:
- Runbooks go stale
- Steps are error-prone under pressure
- Requires specific knowledge/access
Level 2: Semi-automated scripts
- Some steps scripted
- Engineer still makes decisions
- Faster, less error-prone
Level 3: Push-button recovery
- Single command triggers full recovery
- Engineer validates but doesn't execute
- Consistent outcomes
Level 4: Automated detection + recovery
- System self-heals
- Human notified but not required
- SRE dream state
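Level 3 is worth sketching because it is mostly glue: one entry point runs the scripted steps and validates each before proceeding. The `(name, action, check)` step shape here is an assumption, not a standard:

```python
def push_button_recovery(steps):
    """Level-3 recovery: one call runs every scripted step, validating each
    before moving on - the on-call engineer approves, but never improvises.

    Each step is (name, action, check); all three are your own hooks (assumed).
    """
    completed = []
    for name, action, check in steps:
        action()
        if not check():
            return completed, f"halted at: {name}"  # stop rather than guess
        completed.append(name)
    return completed, "recovered"

# Demo with fake steps for a database failover.
log = []
steps = [
    ("promote replica", lambda: log.append("promote"), lambda: True),
    ("repoint DNS",     lambda: log.append("dns"),     lambda: True),
]
done, status = push_button_recovery(steps)
```

The design choice that matters: on a failed check the script halts and reports, rather than continuing - a half-applied recovery is worse than a paused one.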
Real-world failure: The 3 AM scenario
Key insight box
If your recovery depends on heroics, it's not recovery - it's luck.
Challenge question
You've automated 90% of your incident response. The remaining 10% requires human judgment calls. How do you practice that 10%?
Game Day scenario library - what to practice
Scenario
You have limited time. Which failures matter most?
Core scenarios (start here)
Infrastructure failures:
1. Primary database crash
2. Entire availability zone down
3. Full region outage
4. DNS provider failure
5. CDN/load balancer failure
Application failures:
6. Memory leak OOM kills
7. Dependency timeout (third-party API down)
8. Bad configuration push breaks the app
9. Deployment rollout stuck halfway
10. Infinite retry loop (thundering herd)
Data failures:
11. Corrupted data written to database
12. Accidental table drop
13. Replication lag spike
14. Disk full (logs or data)
15. Backup restoration needed
Organizational failures:
16. On-call engineer unavailable
17. Subject matter expert on vacation
18. Escalation tree outdated
19. Cross-team dependency broken
20. Third-party vendor unresponsive
Interactive question
You can only practice 3 scenarios this quarter. Your system is an e-commerce checkout flow. Which 3?
A. DNS failure, database crash, CDN down
B. Region outage, payment provider timeout, on-call unavailable
C. Memory leak, configuration bug, disk full
Pause and think about your actual blast radius.
Progressive reveal
Answer: B - it reveals the most organizational gaps.
- A tests infrastructure (important but often well-automated)
- B tests both technical and human response
- C tests application issues (valuable but narrower scope)
Scenario design template
Production insight
Companies often test "big obvious" failures (region down) but miss "slow degradation" failures (API timeouts, connection pool exhaustion, gradual memory leaks).
Slow failures are harder to detect and often cause worse cascading effects.
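Slow degradation is also the easiest failure to inject: instead of killing a dependency, wrap its calls with added latency so timeouts and connection pools get rehearsed too. A sketch (the payment call is a stand-in):

```python
import time

def with_injected_latency(call, extra_seconds):
    """Chaos for the 'slow, not down' failure mode: wrap a dependency call
    with artificial latency so timeout and pool behavior gets exercised."""
    def slow_call(*args, **kwargs):
        time.sleep(extra_seconds)  # the dependency degrades instead of dying
        return call(*args, **kwargs)
    return slow_call

# Example: a fake payment-provider call (invented), slowed by 50 ms.
slow_charge = with_injected_latency(lambda amount: "authorized", 0.05)
```

Run an experiment like this at latencies just below and just above your configured timeouts - that boundary is where retry storms and pool exhaustion hide.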
Key insight box
The best Game Day scenario is the one you're most afraid to run. That's where your knowledge gaps hide.
Challenge question
Design a Game Day that tests your team's response to a security incident (compromised credentials) rather than an infrastructure failure. What changes?
Measuring chaos engineering success
Scenario
You've run 10 Game Days. How do you know if you're getting more resilient?
Interactive question (pause and think)
Which metric best indicates chaos engineering maturity?
A. Number of Game Days run
B. Percentage of scenarios that "pass"
C. Time to recovery improvement over time
D. Number of issues found and fixed
Progressive reveal
Answer: C and D together.
- A is vanity (quantity ≠ quality)
- B is misleading (passing means you're not testing hard enough)
- C shows improving resilience
- D shows learning velocity
Metrics to track
System resilience metrics:
- Mean Time To Detect (MTTD)
- Mean Time To Recover (MTTR)
- Blast radius size (customers impacted)
- Automated vs manual recovery ratio
Organizational metrics:
- Runbook accuracy rate
- Cross-team coordination time
- Escalation path effectiveness
- Post-incident action item completion rate
Learning metrics:
- Unknown-unknowns discovered per Game Day
- Automation coverage increase
- Documentation freshness score
- Team confidence survey results
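MTTD and MTTR fall straight out of the incident timestamps you should already be recording per Game Day. A sketch (the timestamps below are invented):

```python
from datetime import datetime
from statistics import mean

def mttd_mttr(incidents):
    """Mean time to detect / recover, in minutes, from (started, detected,
    resolved) timestamps. Track the trend across Game Days, not one number."""
    ttd = [(detected - started).total_seconds() / 60
           for started, detected, _ in incidents]
    ttr = [(resolved - started).total_seconds() / 60
           for started, _, resolved in incidents]
    return mean(ttd), mean(ttr)

# Two Game Day incidents: detection improved from 10 min to 4 min.
incidents = [
    (datetime(2024, 1, 6, 3, 0), datetime(2024, 1, 6, 3, 10), datetime(2024, 1, 6, 4, 0)),
    (datetime(2024, 2, 3, 2, 0), datetime(2024, 2, 3, 2, 4),  datetime(2024, 2, 3, 2, 30)),
]
```

Compare these values per quarter: resilience shows up as both means trending down while scenarios get harder.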
Visual: chaos maturity scorecard
Continuous improvement loop
Key insight box
Chaos engineering success is not "zero failures" - it's "faster learning and recovery from inevitable failures."
Challenge question
Your Game Day found 15 issues. You have budget to fix 5 this quarter. How do you prioritize?
Final synthesis - Build your own chaos program
Synthesis challenge
You're the SRE lead for a fintech company's payment processing platform.
Requirements:
- Handle $500M transactions/day
- 99.99% uptime SLA (52 minutes downtime/year)
- Multi-region active-active
- PCI compliance requires audit trail of all recovery tests
- Team: 10 SREs, 40 developers across 6 teams
- Current state: no Game Days ever run, some monitoring, manual runbooks
Your tasks (pause and think)
- Design your chaos maturity roadmap (6-month plan)
- Choose first 3 scenarios to test (and justify)
- Define success metrics and tracking
- Build org buy-in plan (how to convince leadership?)
- Define blast radius controls for production chaos
- Create remediation process for gaps found
Write down your answers.
Progressive reveal (one possible solution)
1. Chaos maturity roadmap:
- Month 1-2: DiRT testing in staging (database restore, failover)
- Month 3-4: First production Game Day (read replica failure, off-hours)
- Month 5-6: Continuous chaos (automated, 1% traffic, analytics paths first)
2. First 3 scenarios:
- Database primary failover (tests infra + runbooks)
- Payment provider timeout (tests circuit breakers + fallbacks)
- Region failure (tests multi-region claims + org response)
Justification: Cover data layer, external dependencies, and geography - the three highest-risk domains.
3. Success metrics:
- MTTR trending down month-over-month
- Automated recovery % increasing
- Post-Game-Day action item closure rate
- SRE confidence score (survey)
4. Org buy-in:
- Start with "We almost lost $2M last outage - Game Days prevent that"
- Show insurance industry parallel: they practice disasters
- Pitch as "SLA protection" not "breaking things for fun"
- Get executive sponsor from engineering leadership
5. Blast radius controls:
- Start with off-hours, low-traffic regions
- Implement kill-switch (auto-rollback if error rate > 1%)
- Feature flag critical paths (degrade, don't fail)
- Shadow environment for aggressive tests
6. Remediation process:
- Every gap found becomes a tracked action item with an owner and a due date
- Re-run the same scenario after fixes land to confirm closure
- Feed recurring gaps into next quarter's scenario selection
Key insight box
A chaos program is not a one-time project - it's a cultural shift from "avoid failures" to "practice failures."
Final challenge question
Your CEO asks: "How much will this chaos engineering program cost us in downtime and lost revenue?" How do you answer?
Appendix: Quick checklist (printable)
Before your first Game Day:
- Scenario, blast radius, and success criteria written down
- Rollback/abort criteria agreed (and a kill-switch in place)
- Facilitator, observers, and communication plan assigned
- Stakeholders notified - no surprise pages
During Game Day:
- Timestamp every action and observation
- Watch abort criteria continuously
- Let responders work from runbooks, not from facilitator hints
After Game Day:
- Blameless post-mortem within a week
- Action items tracked to completion
- Runbooks and automation updated with what you learned
Chaos maturity progression:
- Manual tabletop -> scheduled DiRT and Game Days -> continuous automated chaos
Red flags (stop and reassess):
- Abort criteria triggered and nobody noticed
- Impact spread beyond the agreed blast radius
- Findings from the last Game Day are still unfixed
Key Takeaways
- Game Days are scheduled exercises where teams deliberately inject failures - testing incident response processes, not just system resilience
- DiRT (Disaster Recovery Testing) verifies that recovery procedures actually work - backup restores, failover mechanisms, and runbooks
- Start with tabletop exercises before live failures - walk through failure scenarios on a whiteboard before injecting them in production
- Document everything - the goal is to find gaps in processes and runbooks, then fix them before a real incident
- Run Game Days regularly - systems change, teams change, and institutional knowledge decays; quarterly exercises maintain readiness