Courses 0%
21
Caching and Performance · Chapter 21 of 51

Thundering Herd Problem

Akhil
Akhil Sharma
20 min

Thundering Herd Problem (Distributed Systems) —

Audience: advanced distributed systems engineers, SREs, and platform developers.

Goal: build an intuition for why thundering herds happen, how they cascade across distributed components, and how to design mitigations with clear trade-offs.


“The cafe door just opened” (What is a thundering herd?)

  • Scenario / challenge It’s 8:59 AM. A coffee shop has a sign: “Fresh croissants at 9:00”.

    At 9:00 sharp, the door opens. Everyone rushes the counter at once.

    • The cashier can’t take all orders simultaneously.
    • People bump into each other.
    • Some leave.
    • The shop could have served everyone smoothly if arrivals were spread out.

    Now map that to distributed systems:

    • The croissant is a shared resource becoming available (a lock, a leader, a cache entry, a DB row, a token bucket refill, a configuration update, a queue message, a connection slot).
    • The crowd is many clients/workers waiting.
    • The door opening is an event that wakes everyone at once (timeout, watch notification, lock release, cache expiry, leader failure, deployment restart).

    A thundering herd occurs when many actors simultaneously contend for a resource or react to the same event, causing a burst of load that degrades throughput and increases latency—often making the event take longer, which triggers more retries, which worsens the herd.

  • Interactive question (pause and think) If you had to define “thundering herd” in one sentence for your on-call runbook, what would you write?

    (Stop here. Write your sentence. Then continue.)

  • ** (question -> think -> answer)** A good runbook definition is:

    Thundering herd: a large number of clients/workers wake up or retry at the same time and simultaneously hit the same dependency, causing a load spike and cascading failures.

  • Real-world parallel The cafe had enough ovens and staff for steady arrivals. The failure mode was synchronized arrivals.


  • Key insight: It’s not just “high traffic.” It’s synchronized traffic—many actors doing the same thing at the same time.

  • Challenge question What makes a herd worse: (A) more clients, (B) tighter synchronization, or (C) slower dependency? (You can pick multiple.)


Synchronization is a hidden amplifier

  • Scenario / challenge Imagine 1,000 people want coffee.

    • If they arrive randomly over 10 minutes, the shop handles it.
    • If they arrive in the same 1 second, the shop collapses.

    In distributed systems, synchronization is often accidental:

    • identical timeouts
    • identical cache TTLs
    • identical cron schedules
    • identical retry policies
    • “wake everyone” notifications
  • Interactive question (pause and think) Which of these is most likely to produce accidental synchronization?

    1. Randomized exponential backoff
    2. Fixed retry interval of 1 second
    3. Cache TTL = 60 seconds for all keys, set at the same time
    4. A watch notification broadcast to all listeners

    Pause. Pick all that apply.


  • Correct: 2, 3, 4.

    • Fixed retry = synchronized retries.
    • Same TTL set simultaneously = synchronized expiry.
    • Broadcast watch = synchronized wake-up.

    Randomized exponential backoff is designed to break synchronization.

  • Real-world parallel This is like a restaurant where every table gets their bill at the same time and all try to pay at the single register.


  • Key insight: A herd is often created by uniformity (same timers, same TTLs, same triggers). Randomness is a tool.

  • Challenge question Name one place in your system where uniform timing exists (cron, TTL, retries, heartbeats). Could it herd?


Failure story: “The lock is free!” (and then everything burns)

  • Scenario / challenge You run 5,000 workers that process jobs. They coordinate using a distributed lock stored in Redis or ZooKeeper.

    • Only one worker should hold the lock.
    • Everyone else waits.

    At time T:

    • The lock holder releases the lock.
    • All waiting workers are notified (or they all poll and discover it).
    • Thousands attempt to acquire simultaneously.

    Redis/ZK sees a sudden spike:

    • CPU spikes
    • network spikes
    • lock acquisition latency rises
    • clients time out
    • clients retry

    Now the herd escalates into a retry storm.

  • Interactive question (pause and think) What’s the most dangerous part?

    A) Everyone tries once B) Everyone retries with the same timeout C) The lock service gets slower, increasing timeouts

    Pause.


  • Correct: B + C together.

    The deadly combo is:

    1. synchronized attempts
    2. synchronized timeouts
    3. retries that align again

    This creates positive feedback: slower dependency -> more timeouts -> more retries -> even slower.

  • Real-world parallel A crowd at the door isn’t the worst part; the worst part is when the crowd starts pushing harder because progress is slow.


  • Key insight: Herds become outages when they couple with timeouts + retries.

  • Challenge question If you could change only one thing—timeouts, retries, or lock notification—what would you change first and why?


Where thundering herds show up in distributed environments

  • Scenario / challenge Think of thundering herd as a pattern that appears in many costumes. Here are common distributed systems hotspots:

    1. Cache stampede (many clients miss cache and hit DB)
    2. Lock contention (many clients attempt lock acquisition)
    3. Leader election / failover (many nodes react to leader loss)
    4. Service discovery updates (all clients refresh endpoints)
    5. Connection storms (all clients reconnect after a blip)
    6. Queue consumer wakeups (polling intervals align)
    7. Rate limiter refill boundaries (tokens replenish at once)
    8. Config refresh / feature flag updates (broadcast triggers)
  • Interactive question (matching exercise) Match the trigger to the herd type:

    TriggerHerd Type (pick)
    Cache entry expires at the same second across fleetA. Connection storm / B. Cache stampede / C. Leader election
    Load balancer restarts and drops all connectionsA. Connection storm / B. Cache stampede / C. Leader election
    ZooKeeper session expires for leaderA. Connection storm / B. Cache stampede / C. Leader election

    Pause and match.


    • Cache expiry -> B. Cache stampede
    • Dropped connections -> A. Connection storm
    • Session expires -> C. Leader election
  • Real-world parallel Different “events” (door opens, announcement, power flicker) can cause the same crowd dynamic.


  • Key insight: Herds are often cross-layer: a network event triggers reconnects which triggers auth which triggers DB load.

  • Challenge question Pick one dependency (DB, Redis, Kafka, etc.). List two ways a herd could hit it.


Decision game: “Is it a herd or just load?”

  • Scenario / challenge You see a latency spike on your database.

    Metrics:

    • QPS jumps 10x for 30 seconds.
    • CPU hits 95%.
    • Errors spike.
    • Then it recovers.
  • Interactive question (which statement is true?) Which statement is most likely true?

    1. It’s just organic traffic growth.
    2. Something synchronized happened (expiry, retry alignment, reconnect).
    3. The DB is under-provisioned; add more CPU.
    4. The network is flaky; ignore it.

    Pause and pick.


  • Most likely: 2.

    Organic growth rarely creates a sharp, short, rectangular spike. Herd events often do.

    But note: 2 doesn’t exclude 3—under-provisioning makes the herd more damaging.


  • Key insight: Herds often look like impulses: sudden spikes tied to a coordinated trigger.

  • Challenge question What graph shape would you expect from exponential backoff versus fixed-interval retries?


The bathtub and the drain (queueing view)

  • Scenario / challenge Picture a bathtub:

    • Water in = incoming requests.
    • Drain = service capacity.

    Normal traffic: water in roughly equals drain.

    Herd: someone dumps a bucket in instantly.

    If the drain is narrow, water level rises (queue grows), and latency increases.

  • Interactive question (pause and think) If a service can handle 2,000 req/s, and a herd sends 20,000 requests in 1 second, what happens?

    Pause.


  • You’ve created a ~10-second backlog even if no more requests arrive.

    In real systems, more requests keep arriving, and retries add more water.


  • Key insight: Herds create burstiness. Burstiness is what queueing systems punish with high tail latency.

  • Challenge question What’s more effective: increasing drain size (capacity) or preventing bucket dumps (desynchronization)? When?


Cache stampede (classic thundering herd)

  • Scenario / challenge Your API uses Redis as a cache in front of Postgres.

    • Key user:123 cached for 60 seconds.
    • Many clients request user:123.

    At t=60s, the key expires.

    Suddenly:

    • Redis miss for everyone.
    • Everyone hits Postgres to recompute.
    • Postgres gets overloaded.
    • Requests slow.
    • Timeouts trigger retries.
  • Interactive question (pause and think) Which is the root synchronizer here?

    A) Redis is slow B) TTL expiry aligns requests C) Postgres is too small

    Pause.


  • Correct: B.

    Redis being slow or Postgres being small are amplifiers, but the synchronizer is TTL alignment.

  • [IMAGE: diagram of cache stampede timeline] A timeline showing many clients hitting cache; at TTL boundary cache misses spike; DB load spikes; then cache repopulates. Include a second line showing retries aligning if timeouts are uniform.

  • [CODE: Python, demonstrate cache stampede + singleflight]

python
  • ' '

    Key insight: Cache stampede is a herd where the “door opening” is cache miss.

  • Challenge question If you add a larger Redis cluster, does it solve the stampede? Why or why not?


Mitigation toolbox overview (and why each has trade-offs)

  • Scenario / challenge You’re the platform owner. You can’t “just fix it” with one knob.

    You need a toolbox:

    1. Jitter (randomize timing)
    2. Exponential backoff (spread retries)
    3. Request coalescing / singleflight (one recomputation per key)
    4. Leases / soft TTL (serve stale while refreshing)
    5. Rate limiting (protect dependencies)
    6. Circuit breakers (stop hammering)
    7. Queueing / load shedding (shape bursts)
    8. Hierarchical caching (local cache, CDN)
    9. Partitioning / sharding (reduce contention domain)
    10. Leader/lock design (avoid waking everyone)
  • Interactive question (decision game) You can implement only two mitigations this quarter. Your main failure is cache stampede.

    Pick two:

    A) Add DB replicas B) Add jitter to TTLs C) Implement request coalescing D) Increase cache TTL from 60s to 10m

    Pause and pick.

    Best default picks: B + C.

    • Jitter reduces synchronization.
    • Coalescing reduces duplicate work.

    DB replicas (A) help but can still be herded and can increase cost. Longer TTL (D) helps but risks staleness and still herds at 10m boundaries unless jittered.

    Key insight: Effective mitigations either reduce synchronization or reduce contention scope.

  • Challenge question Which mitigation reduces work vs reduces timing alignment? List one of each.


“Just add capacity”

  • Misconception

    “If we scale the database/Redis/Kafka enough, herds go away.”

  • Interactive question (pause and think) If you double capacity, do you necessarily eliminate a herd? Why?

  • Explanation Capacity helps, but herds are about burstiness and coordination.

    Even huge systems can be toppled by synchronized retries:

    • A dependency slows slightly.
    • timeouts trigger.
    • thousands retry in lockstep.
    • burst exceeds even scaled capacity.

    Key insight: Scaling is an amplifier control, not a synchrony control.

  • Challenge question When is “add capacity” the right first move anyway?


Retry storms: the herd’s evil twin

  • Scenario / challenge A downstream service starts returning 500s for 2 seconds.

    Clients have:

    • timeout = 2s
    • retry every 2s
    • no jitter

    At t=0, many requests fail. At t=2s, they all retry. If the service is still recovering, it gets hammered again.

  • Interactive question (pause and think) Why does recovery take longer?

    Pause.

    Because the recovering service experiences load exactly when it’s weakest.

    Recovery is a phase where caches are cold, JIT is warming, connections are re-established, and background tasks run.

  • [IMAGE: retry alignment wave diagram] Plot showing synchronized retries as periodic spikes; show how adding jitter turns spikes into a spread-out band.

  • [CODE: Go, exponential backoff with jitter]

python
javascript

Key insight: A retry policy is a distributed coordination mechanism. Bad retries coordinate clients to fail together.

  • Challenge question What is the difference between exponential backoff and exponential backoff with jitter in terms of herd risk?

Request coalescing (singleflight): “One person orders for the table”

  • Scenario / challenge In a restaurant, a table of 10 doesn’t send 10 people to the counter. One person orders for everyone.

    In systems:

    • Many concurrent requests want the same missing cache key.
    • Instead of all hitting DB, you let one do the recomputation.
    • Others wait and reuse the result.
  • Interactive question (pause and think) What’s the main downside of coalescing?

    A) Higher tail latency for waiters B) Increased DB load C) More cache misses

    Pause.

    Correct: A.

    Waiters pay extra latency, but system stability improves.

    Key insight: Coalescing trades latency for stability.

  • Challenge question Where would you place coalescing: client-side, API layer, or cache layer? What changes with each?


Serving stale (soft TTL): “Sell yesterday’s croissant while baking”

  • Scenario / challenge At 9:00, croissants are supposed to be fresh. But if the bakery is overwhelmed, you can:

    • serve yesterday’s croissant immediately
    • bake fresh ones in the background

    In caching:

    • Hard TTL: after expiry, you must recompute.
    • Soft TTL: after soft expiry, you can serve stale for a limited window while refreshing asynchronously.
  • Interactive question (pause and think) Why does serving stale reduce herd risk?

    Pause.

    Because it prevents a synchronized “miss” event. You avoid forcing everyone onto the recomputation path at the same moment.

  • Trade-offs

    • You may serve stale data.
    • You need background refresh logic.
    • You need safeguards to avoid serving stale forever.
  • [CODE: Java/Kotlin, cache with soft TTL + background refresh]

python
javascript

Key insight: Soft TTL converts a hard cliff into a slope.

  • Challenge question What’s a safe maximum “stale window” for your domain (pricing vs user profile vs feature flags)?

“Broadcast notifications are always better than polling”

  • Misconception

    “Polling is inefficient; watches/notifications are always superior.”

  • Scenario / challenge You replace polling with watch notifications for a lock.

  • Interactive question (pause and think) What new failure mode might you introduce?

  • Explanation Broadcast notifications can create a wake-up storm:

    • lock release triggers watch event to all waiters
    • config update triggers all clients to refresh simultaneously

    Polling can be inefficient, but it naturally spreads load if polling intervals are randomized.

    Key insight: Notifications reduce steady-state overhead but can increase synchronization risk.

  • Challenge question If you must use notifications, what’s one technique to avoid everyone acting at once?


Distributed locks and leader election: herd patterns in coordination systems

  • Scenario / challenge You use ZooKeeper/etcd/Consul for leader election.

    When the leader dies:

    • all followers detect session loss
    • all attempt election
    • all update watchers
    • clients refresh endpoints

    That’s multiple herds stacked:

    1. Election herd (candidates)
    2. Watch herd (listeners)
    3. Reconnect herd (clients)
  • Interactive question (pause and think) Which component is most at risk?

    A) The coordination service (etcd/ZK) B) The new leader C) The clients

    Pause.

    Often: A and B.

    • Coordination service gets hammered by compare-and-swap writes and watch fanout.
    • New leader gets hammered by all clients reconnecting.
  • [IMAGE: layered herd diagram] Show leader failure leading to election attempts, watch notifications, client reconnects, and downstream load. Annotate where randomized election timeouts help.

  • ' '

    Key insight: Coordination events are global triggers. Global triggers create global herds.

  • Challenge question How can you reduce the “blast radius” of a leader event?


Techniques to break synchronization (jitter, staggering, and randomization)

  • Scenario / challenge You manage 50,000 instances that refresh tokens every 60 minutes.

    If they were all deployed at the same time, they’ll all refresh at the same time.

  • Interactive question (choose a jitter strategy) Which strategy best reduces herd risk while keeping average refresh near 60 minutes?

    1. Refresh exactly every 60 minutes.
    2. Refresh every 60 minutes +/- random(0..5 minutes).
    3. Refresh every 60 minutes +/- random(0..30 minutes).
    4. Refresh at a random time in the next hour, every hour.

    Pause.


  • Generally best: 4 (maximal spreading), but it depends on token expiry constraints.

    If tokens must refresh before expiry, you pick a jitter window that preserves safety margin.

    Key insight: Jitter is controlled randomness: you trade predictability for stability.

  • Challenge question What’s the largest jitter window you can safely introduce for your refresh/TTL without violating correctness?


Connection storms: “Everyone calls back at once”

  • Scenario / challenge A load balancer restarts, dropping 100,000 idle connections.

    Clients reconnect immediately:

    • TLS handshakes spike
    • auth checks spike
    • connection pool warms
    • downstream services get hit
  • Interactive question (pause and think) Why is TLS a herd amplifier?

    Pause.


  • Because TLS handshakes are CPU-expensive and often involve shared resources:

    • CPU on edge
    • certificate operations
    • session cache
    • sometimes external identity or OCSP checks
  • Mitigations

    • connection backoff + jitter
    • connection rate limiting on clients
    • server-side accept queue tuning
    • reuse sessions / TLS tickets
    • keepalive tuning
  • [CODE: Rust, connection retry loop with jittered backoff]

python
javascript

Key insight: Connection storms are herds that attack your control plane (handshakes/auth), not just data plane.

  • Challenge question If you can only change server-side settings (not clients), what can you do to survive reconnect storms?

Herds across microservices: cascading failure anatomy

  • Scenario / challenge Service A calls Service B calls Service C (DB).

    A has 10,000 RPS. B has a cache; C is a DB.

    A small cache stampede in B triggers more calls to C. C slows. B times out and retries. A sees errors and retries.

    Now the herd has propagated upstream.

  • Interactive question (pause and think) Where should you stop the herd?

    A) At the DB B) At B (closest to cache) C) At A (at the edge)

    Pause.


  • Best answer: multiple layers, but the highest leverage is often closest to the trigger (B) plus edge protection (A).

    • Stop stampede at B (coalescing, stale serving where legal).
    • Stop retry storms at A (backoff, budgets, circuit breaker).
  • [IMAGE: cascade diagram with retry amplification] Show call chain and how retries multiply load (e.g., 1 request becomes 3). Include where circuit breaking cuts the loop.

  • ' '

    Key insight: Herds are contagious. Design bulkheads so one herd doesn’t infect the whole system.

  • Challenge question Where do you have bulkheads today (thread pools, connection pools, rate limits)? Where are you missing them?


Comparison table: mitigation techniques and trade-offs

  • Scenario / challenge You need to pick mitigations that match your correctness constraints.

  • Interactive question (pause and think) Which row would you choose if you cannot serve stale data but can tolerate extra latency under miss?

  • Comparison table

    TechniqueBreaks synchronization?Reduces duplicate work?Adds latency?Risks correctness?Typical use
    Jitter (TTL/retry)YesNoNoNoretries, refreshes, cache expiry
    Exponential backoff + jitterYesNoYes (under failure)Nodownstream errors/timeouts
    Request coalescing (singleflight)IndirectlyYesYes (waiters)Nocache miss recomputation
    Serve stale (soft TTL)YesYes (if refresh is coalesced)NoYes (stale data)read-heavy caches
    Rate limitingNoNoYes (rejections)Noprotect dependencies
    Circuit breakerNoNoYes (fast-fail)Nostop hammering failing deps
    Queueing / load sheddingYes (shapes bursts)NoYesNoingress protection
    Sharding/partitioningYes (reduces domain)NoNoNolocks, hot keys

  • If you cannot serve stale but can tolerate extra latency under miss, request coalescing plus jitter/backoff are usually the first safe moves.

  • ' '

    Key insight: There’s no universal fix; you choose based on correctness constraints (staleness), latency budgets, and operational complexity.

  • Challenge question Which technique is most appropriate when correctness forbids staleness?


Quiz: Which statement is true? (Herd edition)

  • Scenario / challenge You’re reviewing a PR that changes retry logic and cache semantics.

  • Interactive question (which statements are true?) Pick the true statements:

    1. Adding jitter to retries can reduce average latency during failures.
    2. Request coalescing can increase p99 latency while decreasing overall load.
    3. Serving stale data can eliminate cache stampedes without any correctness risk.
    4. Broadcast notifications can cause thundering herds.

    Pause and pick.


  • True: 2 and 4.

    • (1) Jitter usually increases latency for some requests during failure; it reduces synchronized load and improves recovery.
    • (3) Serving stale always has correctness risk; it may be acceptable depending on domain.
  • ' '

    Key insight: Many herd mitigations improve system latency distribution at the cost of some request latency.

  • Challenge question Which metric would you prioritize during a herd: p50 latency, p99 latency, error rate, or dependency saturation? Why?


Observability: how to detect a thundering herd in production

  • Scenario / challenge You’re on call. Something spiked. You need to know if it’s a herd.

  • Interactive question (pause and think) What’s one metric that differentiates “herd” from “organic traffic”? (Hint: correlation.)

  • Explanation Signals that scream “herd”

    • Highly correlated spikes across many clients at the same timestamp
    • Retry rate spikes (client-side metrics)
    • Cache miss rate spikes aligned with TTL boundaries
    • Connection establishment rate spikes
    • Lock acquisition attempts spike
    • Queue depth spikes
  • [IMAGE: dashboard layout for herd detection] A dashboard with panels:

    • request rate
    • error rate
    • retries
    • cache hit ratio
    • DB QPS
    • connection handshakes
    • lock attempts
    • saturation (CPU, IO) Plus an annotation lane for deploys, leader changes, and cache flushes.

  • Herds are about correlation and burst edges (sharp jump in a short time).

  • ' '

    Key insight: Herd detection is often about finding the synchronizer: TTL boundary, deploy, leader election, network flap.

  • Challenge question What client-side metric do you wish you had during outages that would help confirm a herd?


Design patterns to prevent herds (by architecture)

  • Scenario / challenge You want herd resilience to be structural, not a pile of ad-hoc patches.

  • Interactive question (pause and think) Which of these reduces global synchronization points the most?

    A) One global lock B) Sharded locks per partition C) Broadcast notifications to all clients


  • B.

  • Patterns

    Pattern 1: Per-key singleflight at the edge

    • Put coalescing close to users.
    • Benefit: reduces load early.
    • Risk: memory pressure, coordination overhead.

    Pattern 2: Hierarchical caches

    • L1 in-process cache
    • L2 distributed cache
    • DB

    The herd gets absorbed at L1.

    Pattern 3: Staggered rollouts and cron jitter

    • Avoid deploying everything at once.
    • Avoid cron at ":00".

    Pattern 4: Token bucket smoothing

    • Refill continuously rather than in discrete steps.

    Pattern 5: Avoid global locks

    • Prefer sharded locks, leases, or optimistic concurrency.
  • ' '

    Key insight: Herd prevention is often about avoiding global synchronization points.

  • Challenge question Where do you still have a “global synchronization point” in your architecture?


Edge cases: hot keys and uneven popularity

  • Scenario / challenge Even with jittered TTLs, one key is requested 1,000x more than others.

    That key becomes a “VIP croissant” everyone wants.

  • Interactive question (pause and think) Why do hot keys make herds more likely?

    Pause.


  • Because hot keys concentrate demand, making any miss/expiry event disproportionately impactful.

  • Mitigations for hot keys

    • pre-warm caches
    • longer TTL for hot keys
    • proactive refresh
    • sharding by adding a random suffix (be careful: correctness)
    • replicate hot key values locally
  • ' '

    Key insight: Herd risk is proportional to fan-in (how many callers share the same key/resource).

  • Challenge question How would you identify hot keys safely in production without logging sensitive IDs?


Hands-on thought lab: choose a mitigation plan

  • Scenario / challenge You operate an e-commerce checkout.

    • Product inventory is cached for 30 seconds.
    • At flash sale start, traffic increases 20x.
    • Inventory keys expire around the same time.
    • DB is the source of truth and must be correct.
  • Constraints

    • You cannot serve stale inventory counts.
    • You can serve stale product description.
  • Interactive question (pause and design) Pick a plan:

    A) Increase TTL to 5 minutes for all keys. B) Add per-key singleflight for inventory keys; add jittered TTL for non-critical keys. C) Serve stale inventory for 1 minute. D) Add more DB replicas only.

    Pause.


  • Best: B.

    • Correctness forbids stale inventory, so you must reduce duplicate recomputation and desynchronize non-critical paths.
    • You might also add admission control and rate limiting to protect DB.
  • ' '

    Key insight: Correctness constraints determine which herd mitigations are legal.

  • Challenge question What would you do if singleflight increases tail latency beyond SLA—do you relax SLA, add capacity, or change UX?


Real-world usage patterns (what big systems do)

  • Scenario / challenge You want patterns that scale operationally across fleets.

  • Examples

    CDN and edge caching

    • Use stale-while-revalidate semantics.
    • Use request collapsing at edge.

    Large microservice fleets

    • Enforce retry budgets (limit retries as a fraction of traffic).
    • Use hedged requests carefully (hedging can create mini-herds if misused).

    Coordination systems

    • Use watch batching and rate-limited watch delivery.
    • Use randomized election timeouts (e.g., Raft’s randomized election timer to prevent split votes and synchronized elections).

    Databases

    • Use admission control to prevent overload collapse.
    • Use connection limits to avoid connection storms.
  • ' '

    Key insight: Mature systems treat herds as inevitable and build “shock absorbers” at multiple layers.

  • Challenge question Where is your system missing a shock absorber: edge, cache, queue, DB, or coordination?


“Hedged requests always improve tail latency”

  • Misconception

    “If p99 is bad, send duplicates (hedge) and take the fastest response.”

  • Scenario / challenge You enable hedging globally during peak traffic.

  • Interactive question (pause and think) What happens during a partial outage?

  • Explanation Hedging can reduce tail latency in stable systems but can amplify herds during partial outages:

    • duplicates double load exactly when capacity is constrained
    • can convert a small slowdown into a collapse
  • ' '

    Key insight: Hedging is a latency optimization that must be gated by load and error conditions.

  • Challenge question What signal would you use to disable hedging automatically?


Design a herd-resilient subsystem

  • Scenario / challenge You are building a distributed feature flag service.

    • 200,000 clients (mobile + backend) fetch flags.
    • Flags update a few times per day.
    • Clients refresh every 60 seconds.
    • During incidents, the flag service may be partially unavailable.
  • Interactive question (pause and think) Design a plan that avoids thundering herds across:

    1. refresh schedule
    2. update propagation
    3. retry behavior
    4. cache behavior

    Write down:

    • How do you spread refreshes?
    • Do you use polling or push?
    • How do you handle misses?
    • How do you prevent retry storms?
  • ** (suggested design)**

    • Refresh jitter: each client refreshes at a random offset within the minute (or longer), with safety margin.
    • Push updates: optional, but avoid “everyone fetch now” by including randomized delay before fetch; or push deltas.
    • Local caching: clients keep last-known-good flags with a hard expiry far out; serve stale flags if service down (domain-dependent).
    • Retry policy: exponential backoff with full jitter; enforce retry budgets.
    • Server protection: rate limit by client identity; circuit breaker; admission control.
    • Rollouts: stagger deployments so not all clients start at once.
  • ' '

    Key insight: Herd resilience is a system property: timing, caching, retries, and protection mechanisms must align.

  • Final challenge questions

    1. What’s the single most important knob to prevent a global herd in this system?
    2. What’s the hardest correctness trade-off?
    3. Where would you add observability to prove your design works?

Appendix A — Quick reference checklist

  • Are retries jittered and bounded?
  • Do we have retry budgets?
  • Are cache TTLs jittered?
  • Do we coalesce recomputation per key?
  • Can we serve stale safely (soft TTL)?
  • Do we have rate limits and admission control?
  • Do we avoid global locks / global triggers?
  • Do we protect coordination systems from watch storms?
  • Can we survive reconnect storms?

Key Takeaways

  1. Thundering herd occurs when a cache entry expires and hundreds of concurrent requests hit the database simultaneously — causing a load spike that can cascade into an outage
  2. Cache stampede prevention: use locking so only one request regenerates the cache — all other requests wait for the result or serve stale data
  3. Stale-while-revalidate serves the expired value while refreshing in the background — users get instant responses, the cache updates asynchronously
  4. Jittered TTLs prevent synchronized expiration — adding randomness to TTLs ensures not all keys expire at the same moment
Chapter complete!

Course Complete!

You've finished all 51 chapters of

System Design Advanced

Browse courses
Up next Adaptive Caching With Maching learning
Continue