Audience: advanced distributed systems engineers, SREs, and platform developers.
Goal: build an intuition for why thundering herds happen, how they cascade across distributed components, and how to design mitigations with clear trade-offs.
Scenario / challenge It’s 8:59 AM. A coffee shop has a sign: “Fresh croissants at 9:00”.
At 9:00 sharp, the door opens. Everyone rushes the counter at once.
Now map that to distributed systems:
A thundering herd occurs when many actors simultaneously contend for a resource or react to the same event, causing a burst of load that degrades throughput and increases latency—often making the event take longer, which triggers more retries, which worsens the herd.
Interactive question (pause and think) If you had to define “thundering herd” in one sentence for your on-call runbook, what would you write?
(Stop here. Write your sentence. Then continue.)
**Answer (question -> think -> answer)** A good runbook definition is:
Thundering herd: a large number of clients/workers wake up or retry at the same time and simultaneously hit the same dependency, causing a load spike and cascading failures.
Real-world parallel The cafe had enough ovens and staff for steady arrivals. The failure mode was synchronized arrivals.
Key insight: It’s not just “high traffic.” It’s synchronized traffic—many actors doing the same thing at the same time.
Challenge question What makes a herd worse: (A) more clients, (B) tighter synchronization, or (C) slower dependency? (You can pick multiple.)
Scenario / challenge Imagine 1,000 people want coffee.
In distributed systems, synchronization is often accidental: shared cron schedules, identical TTLs, fixed retry intervals, fleet-wide deploys.
Interactive question (pause and think) Which of these is most likely to produce accidental synchronization?
1) Randomized exponential backoff
2) A cron job that fires at the top of every hour on every host
3) Cache entries written at deploy time with identical TTLs
4) A fixed 5-second retry delay after failures
Pause. Pick all that apply.
Correct: 2, 3, 4.
Randomized exponential backoff is designed to break synchronization.
Real-world parallel This is like a restaurant where every table gets their bill at the same time and all try to pay at the single register.
Key insight: A herd is often created by uniformity (same timers, same TTLs, same triggers). Randomness is a tool.
Challenge question Name one place in your system where uniform timing exists (cron, TTL, retries, heartbeats). Could it herd?
Scenario / challenge You run 5,000 workers that process jobs. They coordinate using a distributed lock stored in Redis or ZooKeeper.
At time T: the lock is released (or its holder dies), and all 5,000 workers wake and try to acquire it at once.
Redis/ZK sees a sudden spike: thousands of near-simultaneous acquire requests, plus the connection and session churn that comes with them.
Now the herd escalates into a retry storm.
Interactive question (pause and think) What’s the most dangerous part?
A) Everyone tries once B) Everyone retries with the same timeout C) The lock service gets slower, increasing timeouts
Pause.
Correct: B + C together.
The deadly combo is: identical timeouts (everyone gives up at the same moment) plus identical retry delays (everyone comes back at the same moment).
This creates positive feedback: slower dependency -> more timeouts -> more retries -> even slower.
Real-world parallel A crowd at the door isn’t the worst part; the worst part is when the crowd starts pushing harder because progress is slow.
Key insight: Herds become outages when they couple with timeouts + retries.
Challenge question If you could change only one thing—timeouts, retries, or lock notification—what would you change first and why?
Scenario / challenge Think of thundering herd as a pattern that appears in many costumes. Here are common distributed systems hotspots:
Interactive question (matching exercise) Match the trigger to the herd type:
| Trigger | Herd Type (pick) |
|---|---|
| Cache entry expires at the same second across fleet | A. Connection storm / B. Cache stampede / C. Leader election |
| Load balancer restarts and drops all connections | A. Connection storm / B. Cache stampede / C. Leader election |
| ZooKeeper session expires for leader | A. Connection storm / B. Cache stampede / C. Leader election |
Pause and match.
Real-world parallel Different “events” (door opens, announcement, power flicker) can cause the same crowd dynamic.
Key insight: Herds are often cross-layer: a network event triggers reconnects which triggers auth which triggers DB load.
Challenge question Pick one dependency (DB, Redis, Kafka, etc.). List two ways a herd could hit it.
Scenario / challenge You see a latency spike on your database.
Metrics: QPS and latency show a sharp, short, near-rectangular spike, then return to baseline.
Interactive question (which statement is true?) Which statement is most likely true?
1) Traffic grew organically.
2) A synchronized (herd) event caused the spike.
3) The database is under-provisioned.
Pause and pick.
Most likely: 2.
Organic growth rarely creates a sharp, short, rectangular spike. Herd events often do.
But note: 2 doesn’t exclude 3—under-provisioning makes the herd more damaging.
Key insight: Herds often look like impulses: sudden spikes tied to a coordinated trigger.
Challenge question What graph shape would you expect from exponential backoff versus fixed-interval retries?
Scenario / challenge Picture a bathtub:
Normal traffic: water in roughly equals drain.
Herd: someone dumps a bucket in instantly.
If the drain is narrow, water level rises (queue grows), and latency increases.
Interactive question (pause and think) If a service can handle 2,000 req/s, and a herd sends 20,000 requests in 1 second, what happens?
Pause.
The burst exceeds capacity by 18,000 requests; draining at 2,000 req/s takes ~9 more seconds, so you've created a roughly 10-second backlog even if no more requests arrive.
In real systems, more requests keep arriving, and retries add more water.
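The bathtub arithmetic above can be written out directly; a minimal Python model (function names are illustrative):

```python
def backlog_after_burst(burst, capacity_rps):
    """Requests still queued after one second of a burst."""
    return max(0, burst - capacity_rps)

def seconds_to_drain(backlog, capacity_rps, ongoing_rps=0):
    """Time to clear a backlog given the spare capacity left over
    after serving steady ongoing traffic."""
    spare = capacity_rps - ongoing_rps
    if spare <= 0:
        return float("inf")  # the queue never drains; you must shed load
    return backlog / spare

backlog = backlog_after_burst(20_000, 2_000)    # 18,000 requests queued
print(seconds_to_drain(backlog, 2_000))         # 9.0 s with no ongoing traffic
print(seconds_to_drain(backlog, 2_000, 1_500))  # 36.0 s at 75% utilization
```

Note how ongoing traffic makes it worse: the closer you already run to capacity, the less spare drain you have for the burst.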
Key insight: Herds create burstiness. Burstiness is what queueing systems punish with high tail latency.
Challenge question What’s more effective: increasing drain size (capacity) or preventing bucket dumps (desynchronization)? When?
Scenario / challenge Your API uses Redis as a cache in front of Postgres.
The key user:123 is cached for 60 seconds. At t=60s, the key expires.
Suddenly: every request for user:123 misses the cache, and hundreds of concurrent queries for the same row hit Postgres at once.
Interactive question (pause and think) Which is the root synchronizer here?
A) Redis is slow B) TTL expiry aligns requests C) Postgres is too small
Pause.
Correct: B.
Redis being slow or Postgres being small are amplifiers, but the synchronizer is TTL alignment.
[IMAGE: diagram of cache stampede timeline] A timeline showing many clients hitting cache; at TTL boundary cache misses spike; DB load spikes; then cache repopulates. Include a second line showing retries aligning if timeouts are uniform.
[CODE: Python, demonstrate cache stampede + singleflight]
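A minimal Python sketch of the stampede and its fix via singleflight (request coalescing). Class and key names are illustrative, and a production version would also need to propagate the leader's errors to waiters:

```python
import threading
import time

class SingleFlight:
    """Coalesce concurrent calls for the same key into one execution;
    everyone else waits for the leader's result."""
    def __init__(self):
        self._mu = threading.Lock()
        self._inflight = {}  # key -> (Event, result holder)

    def do(self, key, fn):
        with self._mu:
            entry = self._inflight.get(key)
            if entry is None:
                entry = (threading.Event(), {})
                self._inflight[key] = entry
                leader = True
            else:
                leader = False
        event, holder = entry
        if leader:
            try:
                holder["value"] = fn()
            finally:
                with self._mu:
                    del self._inflight[key]
                event.set()
            return holder["value"]
        event.wait()
        return holder["value"]

cache, calls, sf = {}, {"db": 0}, SingleFlight()

def load_user():
    calls["db"] += 1   # simulate one slow Postgres query
    time.sleep(0.05)
    return {"id": 123}

def get_user():
    if "user:123" not in cache:   # TTL just expired: everyone misses
        cache["user:123"] = sf.do("user:123", load_user)
    return cache["user:123"]

threads = [threading.Thread(target=get_user) for _ in range(100)]
for t in threads: t.start()
for t in threads: t.join()
print(calls["db"])  # 1: a single DB query served all 100 concurrent misses
```

Without the `SingleFlight` wrapper, all 100 threads would run `load_user` themselves, which is exactly the stampede hitting Postgres.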
Key insight: Cache stampede is a herd where the “door opening” is cache miss.
Challenge question If you add a larger Redis cluster, does it solve the stampede? Why or why not?
Scenario / challenge You’re the platform owner. You can’t “just fix it” with one knob.
You need a toolbox: jitter, exponential backoff, request coalescing, soft TTLs (serve stale), rate limiting, circuit breakers, and sharding.
Interactive question (decision game) You can implement only two mitigations this quarter. Your main failure is cache stampede.
Pick two:
A) Add DB replicas B) Add jitter to TTLs C) Implement request coalescing D) Increase cache TTL from 60s to 10m
Pause and pick.
Best default picks: B + C.
DB replicas (A) help but can still be herded and can increase cost. Longer TTL (D) helps but risks staleness and still herds at 10m boundaries unless jittered.
Key insight: Effective mitigations either reduce synchronization or reduce contention scope.
Challenge question Which mitigation reduces work vs reduces timing alignment? List one of each.
Misconception
“If we scale the database/Redis/Kafka enough, herds go away.”
Interactive question (pause and think) If you double capacity, do you necessarily eliminate a herd? Why?
Explanation Capacity helps, but herds are about burstiness and coordination.
Even huge systems can be toppled by synchronized retries: the burst arrives in seconds, far faster than autoscaling or failover can absorb it.
Key insight: Scaling is an amplifier control, not a synchrony control.
Challenge question When is “add capacity” the right first move anyway?
Scenario / challenge A downstream service starts returning 500s for 2 seconds.
Clients have: a 2-second timeout and a fixed 2-second retry delay.
At t=0, many requests fail. At t=2s, they all retry. If the service is still recovering, it gets hammered again.
Interactive question (pause and think) Why does recovery take longer?
Pause.
Because the recovering service experiences load exactly when it’s weakest.
Recovery is a phase where caches are cold, JIT is warming, connections are re-established, and background tasks run.
[IMAGE: retry alignment wave diagram] Plot showing synchronized retries as periodic spikes; show how adding jitter turns spikes into a spread-out band.
[CODE: Go, exponential backoff with jitter]
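The placeholder above asks for Go; for consistency with the other examples, here is a minimal Python sketch of exponential backoff with full jitter (the "full jitter" variant: the whole delay is randomized, not just a fraction; names and values are illustrative):

```python
import random
import time

def full_jitter_delay(attempt, base_s=0.1, cap_s=30.0):
    """Uniform random delay in [0, min(cap, base * 2**attempt)).
    The randomness is what breaks retry synchronization across clients."""
    return random.uniform(0, min(cap_s, base_s * (2 ** attempt)))

def call_with_retries(fn, max_attempts=6):
    """Retry fn on exception with full-jitter backoff; illustrative only."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            time.sleep(full_jitter_delay(attempt))
```

With fixed delays, every client that failed at t=0 retries at t=2s; with full jitter, their retries smear across a widening window instead.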
Key insight: A retry policy is a distributed coordination mechanism. Bad retries coordinate clients to fail together.
Scenario / challenge In a restaurant, a table of 10 doesn’t send 10 people to the counter. One person orders for everyone.
In systems: request coalescing (singleflight) lets one request do the fetch or recomputation while concurrent callers for the same key wait and share its result.
Interactive question (pause and think) What’s the main downside of coalescing?
A) Higher tail latency for waiters B) Increased DB load C) More cache misses
Pause.
Correct: A.
Waiters pay extra latency, but system stability improves.
Key insight: Coalescing trades latency for stability.
Challenge question Where would you place coalescing: client-side, API layer, or cache layer? What changes with each?
Scenario / challenge At 9:00, croissants are supposed to be fresh. But if the bakery is overwhelmed, you can: serve yesterday's croissants now and bake the fresh batch in the background.
In caching: serve the stale value after its soft TTL while a single request (or a background task) refreshes it; only past the hard TTL do you force a synchronous reload.
Interactive question (pause and think) Why does serving stale reduce herd risk?
Pause.
Because it prevents a synchronized “miss” event. You avoid forcing everyone onto the recomputation path at the same moment.
Trade-offs: readers can observe data as old as the hard TTL, correctness-sensitive reads can't use it, and you take on background-refresh machinery to operate.
[CODE: Java/Kotlin, cache with soft TTL + background refresh]
Key insight: Soft TTL converts a hard cliff into a slope.
Misconception
“Polling is inefficient; watches/notifications are always superior.”
Scenario / challenge You replace polling with watch notifications for a lock.
Interactive question (pause and think) What new failure mode might you introduce?
Explanation Broadcast notifications can create a wake-up storm: when the lock is released, the service notifies all N waiters at once; all N race to acquire it, one wins, and N-1 pile up again.
Polling can be inefficient, but it naturally spreads load if polling intervals are randomized.
Key insight: Notifications reduce steady-state overhead but can increase synchronization risk.
Challenge question If you must use notifications, what’s one technique to avoid everyone acting at once?
Scenario / challenge You use ZooKeeper/etcd/Consul for leader election.
When the leader dies: every candidate's election timer fires, candidates flood the coordination service with vote/lock requests, watches fire to all clients, and clients reconnect and re-resolve the new leader.
That's multiple herds stacked: an election herd on the coordination service, a notification herd to clients, and a reconnect/catch-up herd onto the new leader.
Interactive question (pause and think) Which component is most at risk?
A) The coordination service (etcd/ZK) B) The new leader C) The clients
Pause.
Often: A and B.
[IMAGE: layered herd diagram] Show leader failure leading to election attempts, watch notifications, client reconnects, and downstream load. Annotate where randomized election timeouts help.
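As the diagram annotation suggests, randomized election timeouts (the Raft approach) keep candidates from colliding; a tiny Python sketch with illustrative values:

```python
import random

def election_timeout_ms(base_ms=150, spread_ms=150):
    """Each node waits base + U(0, spread) before starting an election,
    so after a leader dies the candidates' timers fire at different
    times and the first one to fire usually wins uncontested."""
    return base_ms + random.uniform(0, spread_ms)
```

The same shape applies to any "react to a global event" path: add a per-node random delay so the reaction is staggered, not simultaneous.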
Key insight: Coordination events are global triggers. Global triggers create global herds.
Challenge question How can you reduce the “blast radius” of a leader event?
Scenario / challenge You manage 50,000 instances that refresh tokens every 60 minutes.
If they were all deployed at the same time, they’ll all refresh at the same time.
Interactive question (choose a jitter strategy) Which strategy best reduces herd risk while keeping average refresh near 60 minutes?
1) Refresh exactly every 60 minutes.
2) Refresh every 60 minutes plus up to 1 minute of random delay.
3) Refresh every 55-60 minutes, chosen randomly per refresh.
4) Give each instance a random offset spread uniformly across the full interval.
Pause.
Generally best: 4 (maximal spreading), but it depends on token expiry constraints.
If tokens must refresh before expiry, you pick a jitter window that preserves safety margin.
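A sketch of that safety-margin-aware jitter, assuming a fleet that refreshes on a fixed interval (function name and values are illustrative):

```python
import random

def first_refresh_delay_s(interval_s=3600.0, safety_margin_s=300.0):
    """Spread each instance's first refresh uniformly across the interval
    minus a safety margin, so a fleet deployed at the same moment does not
    refresh in lockstep, while every token still renews before expiry."""
    return random.uniform(0, interval_s - safety_margin_s)
```

After the randomized first refresh, refreshing every `interval_s` keeps the phases permanently spread (until a fleet-wide event re-synchronizes them).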
Key insight: Jitter is controlled randomness: you trade predictability for stability.
Challenge question What’s the largest jitter window you can safely introduce for your refresh/TTL without violating correctness?
Scenario / challenge A load balancer restarts, dropping 100,000 idle connections.
Clients reconnect immediately: 100,000 TCP connects, 100,000 TLS handshakes, and often 100,000 auth checks arrive within a few seconds.
Interactive question (pause and think) Why is TLS a herd amplifier?
Pause.
Because TLS handshakes are CPU-expensive and often involve shared resources: asymmetric crypto on server CPUs, session caches, and certificate/OCSP or auth backends.
Mitigations: jittered reconnect backoff, connection draining instead of hard drops, TLS session resumption, and server-side accept/handshake rate limits.
[CODE: Rust, connection retry loop with jittered backoff]
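The placeholder asks for Rust; in Python for consistency, a reconnect loop using decorrelated jitter, where each delay is drawn relative to the previous one (parameter values are illustrative):

```python
import random
import time

def reconnect(connect, base_s=0.5, cap_s=60.0, max_attempts=8):
    """Retry `connect` with decorrelated jitter: each sleep is uniform in
    [base, previous_sleep * 3], capped. Across a fleet this smears 100k
    simultaneous reconnects into a spread-out band instead of a spike."""
    sleep_s = base_s
    last_err = None
    for _ in range(max_attempts):
        try:
            return connect()
        except OSError as err:
            last_err = err
            sleep_s = min(cap_s, random.uniform(base_s, sleep_s * 3))
            time.sleep(sleep_s)
    raise ConnectionError("all reconnect attempts failed") from last_err
```

Catching only `OSError` (connection-level failures) is deliberate: application-level errors usually should not trigger blind reconnect loops.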
Key insight: Connection storms are herds that attack your control plane (handshakes/auth), not just data plane.
Scenario / challenge Service A calls Service B calls Service C (DB).
A has 10,000 RPS. B has a cache; C is a DB.
A small cache stampede in B triggers more calls to C. C slows. B times out and retries. A sees errors and retries.
Now the herd has propagated upstream.
Interactive question (pause and think) Where should you stop the herd?
A) At the DB B) At B (closest to cache) C) At A (at the edge)
Pause.
Best answer: multiple layers, but the highest leverage is often closest to the trigger (B) plus edge protection (A).
[IMAGE: cascade diagram with retry amplification] Show call chain and how retries multiply load (e.g., 1 request becomes 3). Include where circuit breaking cuts the loop.
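The diagram mentions circuit breaking as the loop cutter; a minimal Python sketch of a count-based breaker (thresholds and the injectable clock are illustrative):

```python
import time

class CircuitBreaker:
    """Open after `threshold` consecutive failures, fail fast while open,
    then allow one trial call after `reset_timeout_s` (half-open)."""
    def __init__(self, threshold=5, reset_timeout_s=30.0, clock=time.monotonic):
        self._threshold = threshold
        self._reset_s = reset_timeout_s
        self._clock = clock
        self._failures = 0
        self._opened_at = None

    def call(self, fn):
        now = self._clock()
        if self._opened_at is not None and now - self._opened_at < self._reset_s:
            raise RuntimeError("circuit open: failing fast")
        half_open = self._opened_at is not None
        try:
            result = fn()
        except Exception:
            if half_open or self._failures + 1 >= self._threshold:
                self._opened_at = now   # (re)open: stop hammering the dep
                self._failures = 0
            else:
                self._failures += 1
            raise
        self._opened_at = None          # success closes the circuit
        self._failures = 0
        return result
```

Placed in B's client to C, this converts "B times out and retries forever" into fast failures that give C room to recover.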
Key insight: Herds are contagious. Design bulkheads so one herd doesn’t infect the whole system.
Challenge question Where do you have bulkheads today (thread pools, connection pools, rate limits)? Where are you missing them?
Scenario / challenge You need to pick mitigations that match your correctness constraints.
Interactive question (pause and think) Which row would you choose if you cannot serve stale data but can tolerate extra latency under miss?
Comparison table
| Technique | Breaks synchronization? | Reduces duplicate work? | Adds latency? | Risks correctness? | Typical use |
|---|---|---|---|---|---|
| Jitter (TTL/retry) | Yes | No | No | No | retries, refreshes, cache expiry |
| Exponential backoff + jitter | Yes | No | Yes (under failure) | No | downstream errors/timeouts |
| Request coalescing (singleflight) | Indirectly | Yes | Yes (waiters) | No | cache miss recomputation |
| Serve stale (soft TTL) | Yes | Yes (if refresh is coalesced) | No | Yes (stale data) | read-heavy caches |
| Rate limiting | No | No | Yes (rejections) | No | protect dependencies |
| Circuit breaker | No | No | Yes (fast-fail) | No | stop hammering failing deps |
| Queueing / load shedding | Yes (shapes bursts) | No | Yes | No | ingress protection |
| Sharding/partitioning | Yes (reduces domain) | No | No | No | locks, hot keys |
If you cannot serve stale but can tolerate extra latency under miss, request coalescing plus jitter/backoff are usually the first safe moves.
Key insight: There’s no universal fix; you choose based on correctness constraints (staleness), latency budgets, and operational complexity.
Challenge question Which technique is most appropriate when correctness forbids staleness?
Scenario / challenge You’re reviewing a PR that changes retry logic and cache semantics.
Interactive question (which statements are true?) Pick the true statements:
1) Jitter makes latency so unpredictable that it should be avoided.
2) Request coalescing adds latency for waiting callers but reduces load on the dependency.
3) Exponential backoff eliminates the need for rate limiting.
4) Serving slightly stale data can keep tail latency stable during refresh storms.
Pause and pick.
True: 2 and 4.
Key insight: Many herd mitigations improve system latency distribution at the cost of some request latency.
Challenge question Which metric would you prioritize during a herd: p50 latency, p99 latency, error rate, or dependency saturation? Why?
Scenario / challenge You’re on call. Something spiked. You need to know if it’s a herd.
Interactive question (pause and think) What’s one metric that differentiates “herd” from “organic traffic”? (Hint: correlation.)
Explanation Signals that scream "herd": near-vertical edges in QPS, many clients spiking in the same second (cross-client correlation), spikes aligned to a TTL boundary, deploy, or leader change, and retry counts rising together with error rate.
[IMAGE: dashboard layout for herd detection] A dashboard with panels: QPS with burst edges, error rate, retry rate, dependency saturation, and event markers (deploys, elections, TTL boundaries).
Herds are about correlation and burst edges (sharp jump in a short time).
Key insight: Herd detection is often about finding the synchronizer: TTL boundary, deploy, leader election, network flap.
Challenge question What client-side metric do you wish you had during outages that would help confirm a herd?
Scenario / challenge You want herd resilience to be structural, not a pile of ad-hoc patches.
Interactive question (pause and think) Which of these reduces global synchronization points the most?
A) One global lock B) Sharded locks per partition C) Broadcast notifications to all clients
Correct: B.
Patterns: per-process L1 caches in front of a shared L2, sharded locks per partition, notifications scoped to interested subscribers only, and per-tenant rate limits.
The herd gets absorbed at L1: most synchronized reads are served locally, and only a trickle reaches the shared tier.
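Option B, sharded locks, can be sketched in a few lines: hash each key to one of N locks so contention is per-shard instead of global (shard count is illustrative):

```python
import threading
import zlib

NUM_SHARDS = 64
_shards = [threading.Lock() for _ in range(NUM_SHARDS)]

def lock_for(key: str) -> threading.Lock:
    """Stable key -> shard mapping; callers working on different
    partitions almost never contend on the same lock."""
    return _shards[zlib.crc32(key.encode("utf-8")) % NUM_SHARDS]

# usage: serialize work per key without a global bottleneck
with lock_for("user:123"):
    pass  # critical section for this partition only
```

The same idea carries to distributed locks: a lock key per partition in Redis/ZK instead of one global lock.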
Key insight: Herd prevention is often about avoiding global synchronization points.
Challenge question Where do you still have a “global synchronization point” in your architecture?
Scenario / challenge Even with jittered TTLs, one key is requested 1,000x more than others.
That key becomes a “VIP croissant” everyone wants.
Interactive question (pause and think) Why do hot keys make herds more likely?
Pause.
Because hot keys concentrate demand, making any miss/expiry event disproportionately impactful.
Mitigations for hot keys: replicate the key into N copies and read a random copy, cache the hottest keys in-process, coalesce recomputation at every layer, and refresh hot keys early and probabilistically rather than at a hard expiry.
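One of those mitigations, key replication, sketched in Python (the replica count and key-suffix scheme are illustrative; writers must update or invalidate every copy):

```python
import random

REPLICAS = 8

def read_key(hot_key: str) -> str:
    """Readers pick one replica at random, spreading the hot key's
    load across REPLICAS independent cache entries."""
    return f"{hot_key}#{random.randrange(REPLICAS)}"

def write_keys(hot_key: str) -> list:
    """Writers touch every replica so readers never see a torn update."""
    return [f"{hot_key}#{i}" for i in range(REPLICAS)]
```

This trades N-times write amplification and staleness-window complexity for an N-fold reduction in per-entry fan-in.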
Key insight: Herd risk is proportional to fan-in (how many callers share the same key/resource).
Challenge question How would you identify hot keys safely in production without logging sensitive IDs?
Scenario / challenge You operate an e-commerce checkout.
Constraints: inventory counts must not be stale (overselling is unacceptable), checkout latency has a strict SLA, and the DB is near capacity at peak.
Interactive question (pause and design) Pick a plan:
A) Increase TTL to 5 minutes for all keys. B) Add per-key singleflight for inventory keys; add jittered TTL for non-critical keys. C) Serve stale inventory for 1 minute. D) Add more DB replicas only.
Pause.
Best: B.
Key insight: Correctness constraints determine which herd mitigations are legal.
Challenge question What would you do if singleflight increases tail latency beyond SLA—do you relax SLA, add capacity, or change UX?
Scenario / challenge You want patterns that scale operationally across fleets.
Examples: rate limiting and load shedding at the edge, per-process caches with jittered soft TTLs, queues between ingestion and processing, coalesced refresh paths in front of databases, and circuit breakers in every dependency client.
Key insight: Mature systems treat herds as inevitable and build “shock absorbers” at multiple layers.
Challenge question Where is your system missing a shock absorber: edge, cache, queue, DB, or coordination?
Misconception
“If p99 is bad, send duplicates (hedge) and take the fastest response.”
Scenario / challenge You enable hedging globally during peak traffic.
Interactive question (pause and think) What happens during a partial outage?
Explanation Hedging can reduce tail latency in stable systems but can amplify herds during partial outages: duplicates multiply load exactly when capacity is reduced, and slower responses trigger even more hedges, another positive feedback loop.
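A sketch of load-gated hedging in Python: fire a second attempt only if the first is slow and a gate (for example, recent error rate) allows it. All names and the 50 ms hedge delay are illustrative:

```python
import concurrent.futures as cf

_pool = cf.ThreadPoolExecutor(max_workers=32)

def hedged_call(fn, hedge_after_s=0.05, allow_hedge=lambda: True):
    """Return the first result of up to two attempts. The gate lets you
    disable hedging automatically when the dependency looks degraded, so
    hedges don't double the load during a partial outage."""
    first = _pool.submit(fn)
    try:
        return first.result(timeout=hedge_after_s)
    except cf.TimeoutError:
        pass
    if not allow_hedge():
        return first.result()  # degraded: wait, add no duplicate load
    second = _pool.submit(fn)
    done, _ = cf.wait([first, second], return_when=cf.FIRST_COMPLETED)
    return done.pop().result()
```

In practice the gate would read a sliding-window error rate or a concurrency limit; here it is just an injected callable.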
Key insight: Hedging is a latency optimization that must be gated by load and error conditions.
Challenge question What signal would you use to disable hedging automatically?
Scenario / challenge You are building a distributed feature flag service.
Interactive question (pause and think) Design a plan that avoids thundering herds across: client flag polling, flag-change propagation, cache expiry, and failure/retry behavior.
Write down: your synchronizers, the jitter window for each, and where one request can do the work for many.
**Answer (suggested design)** Poll with a per-client random offset and jittered interval; propagate changes as versioned snapshots through an edge cache with soft TTLs; coalesce origin fetches per flag version; on errors, back off exponentially with full jitter and keep serving the last known snapshot instead of hammering the control plane.
Key insight: Herd resilience is a system property: timing, caching, retries, and protection mechanisms must align.
Final challenge questions
1. What is the single biggest synchronizer in your system today (TTL boundary, cron, deploy, leader event)?
2. Which two mitigations from this guide could you ship this quarter, and what do they trade away?
3. How would you detect the next herd before your dependency does?