Data Management · Chapter 32 of 51

Hot Partition Detection and Mitigation

Akhil Sharma

 20 min 

← → to navigate

Hot Partition Detection and Mitigation

When one database shard receives 100x more traffic than the others, your entire system's performance is limited by that single partition.

Hot Partition Detection and Mitigation

Audience: engineers building/operating distributed databases, stream processors, caches, and event-driven systems.

Goal: learn to detect, diagnose, and mitigate hot partitions without trading one failure mode for another.

The "One Counter Kills the Cluster" Incident

Its 10:17 AM. Your on-call phone rings.

p99 latency jumped from 35 ms -> 2.8 s.
CPU on one storage node is pegged at 100%.
The rest of the cluster looks bored.
Autoscaling added nodes, but nothing improved.

A product manager messages: "We launched a promotion banner. Traffic is up, but why is one node dying?"

Pause and think

If the system is "distributed," why would a single node become the bottleneck?

A) Bad luck: random skew
B) Cache miss storm
C) Hot partition / hot key
D) Network packet loss

Dont answer yet - keep reading. Well come back with evidence.

Real-world analogy (coffee shop chain): Imagine a coffee shop chain with 20 baristas across 5 counters. A new promotion says: "Free espresso if you ask for 'THE SPECIAL'." Suddenly, everyone queues at counter #3 because thats the only counter that knows the special phrase. The chain is "distributed," but the workload isnt.

Key insight:

Hot partitions happen when routing concentrates load on a subset of shards/partitions/keys, turning a distributed system into a single-threaded system.

Production note:

Autoscaling often fails here because the bottleneck is not cluster capacity; its placement + routing. Unless the hot partition can move/split or the hot key can be sharded/cached, new nodes remain idle.

What Exactly Is a Hot Partition?

Scenario

You have a partitioned dataset or stream:

a distributed DB table sharded by user_id
a Kafka topic partitioned by account_id
a cache keyed by session_id

Everything is fine until one key/partition starts receiving a disproportionate share of:

reads
writes
updates
compactions
locks
replication traffic

Interactive question (pause and think)

Which of the following is a hot partition?

One partition has 10x the QPS of others.
One partition has 10x the data size of others.
One partition has 10x the compaction I/O of others.
One partition has 10x the lock contention of others.

Pause and think.

Progressive reveal (answer) All of them can be "hot," depending on what resource is saturated.

Hot by traffic (QPS/throughput)
Hot by size (storage/compaction/rebalancing time)
Hot by contention (locks, MVCC conflicts)
Hot by replication (leader/follower imbalance)

Restaurant analogy: A restaurant has 10 kitchens (shards). If 70% of orders are "Burger #1," and only Kitchen 7 makes it, you get a hot kitchen. If Kitchen 7 also stores all frozen buns (data size), its hot in multiple dimensions.

Key insight:

"Hot partition" is not just a metric; its a resource dominance pattern: one partition dominates CPU, I/O, memory, network, or coordination.

Clarification:

"Hot partition" is often used loosely to mean "hot shard leader." In leader-based systems (Kafka, Raft groups, many DBs), the leader is the write bottleneck and often the read bottleneck too (depending on read routing).

Why Hot Partitions Are So Dangerous in Distributed Systems

Scenario

Your system has:

partition leaders
replication
background maintenance (compaction, GC)
load balancers

A hot partition doesnt just slow itself down; it triggers feedback loops.

Think about it

What happens when a partition leader is overloaded?

clients retry -> more load
replicas fall behind -> read repair or leader elections
timeouts -> circuit breakers open -> traffic shifts -> new hot spots
compaction lags -> read amplification increases -> even more CPU

Progressive reveal Hot partitions are dangerous because they create nonlinear degradation:

Queueing effects: latency grows superlinearly as utilization approaches 100% (classic M/M/1 intuition: as rho->1, wait time explodes).
Retry storms: timeouts cause clients to amplify traffic.
Replica divergence: follower lag forces leader to do extra work (catch-up, read repair, snapshotting).
Maintenance starvation: compaction/checkpointing falls behind.
Failover cascades: overloaded leader fails -> new leader inherits hot partition -> repeats.

Delivery service analogy: If one neighborhood suddenly orders 10x more deliveries, one depot gets slammed. Drivers start missing deadlines, customers reorder ("retry"), depot backlog grows, and eventually the depot shuts down - forcing the next closest depot to take over and also collapse.

Key insight:

Hot partitions are "distributed" failure modes with centralized symptoms.

Production insight:

Many outages are caused by control-plane reactions (retries, rebalances, elections) rather than the initial skew.
During incidents, prefer mitigations that reduce work (shed, cache, coalesce) over those that move work (rebalance) unless youre sure the hotspot is placement-only.

How Partitioning Creates Hot Spots

Mental model

Partitioning is a function:

partition = f(key)

Common choices:

hash partitioning: hash(key) % N
range partitioning: key in [a, b)
consistent hashing / ring
application routing tables](streamdown:incomplete-link)

Hot partitions happen when:

key distribution is skewed
time-based locality concentrates writes
"celebrity keys" dominate
partition counts are too low
leadership is imbalanced

Interactive exercise: Match the cause to the symptom

Match each cause (left) to the likely symptom (right):

Cause	Symptom
A) Range partitioning by timestamp	1) Latest partition hottest, oldest cold
B) Hash by user_id, but one user is a bot	2) One key dominates QPS
C) Partition leader placement uneven	3) One node has many leaders
D) Compaction debt on one shard	4) I/O spikes, read amp increases

Pause and think.

Answer A->1, B->2, C->3, D->4.

Coffee shop analogy: If you route customers by last-name range (A-F, G-L, ...), and a celebrity "Kardashian" shows up with 1,000 fans, the G-L counter melts.

Key insight:

Hot partitions are often "correct behavior" of the partition function under skewed inputs.

Network assumption (explicit):

Assume an asynchronous network: variable latency, occasional packet loss, partitions possible. This matters because retries, timeouts, and leader elections interact with hotspots.

Decision Game - Is This a Hot Partition or Something Else?

Scenario

You observe:

One node high CPU
p99 latency high
cluster-wide QPS stable

Which statement is most likely true?

"Its a network issue; hot partitions dont affect only one node."
"Its a GC pause; hot partitions would show high QPS everywhere."
"Its a hot partition; one shard is overloaded while others idle."
"Its a slow disk; hot partitions require skewed keys, not hardware."

Pause and think.

Progressive reveal 3 is most likely.

But you must prove it. Hardware issues can mimic hot partitions (and vice versa). The distinguishing feature is correlation with partition/key-level metrics.

Key insight:

Treat "hot partition" as a hypothesis that must be validated with per-partition telemetry.

Production checklist to disambiguate:

If CPU is high but partition QPS is not skewed: suspect GC, kernel issues, noisy neighbor, or a single expensive query pattern.
If disk latency spikes correlate with one shards compaction: suspect storage engine maintenance.
If network retransmits correlate with one node: suspect NIC/ToR issues.

Observability Prerequisites - You Cant Fix What You Cant See

Scenario

Your monitoring shows node-level CPU and disk, but nothing about partitions.

You attempt mitigation:

add nodes
increase thread pools
raise timeouts

Nothing works.

Pause and think

Why doesnt adding nodes help?

Progressive reveal Because the hot partition is still mapped to the same leader/owner. Without re-partitioning or rebalancing, youve only added idle capacity.

What you need (minimum viable hot-partition observability)

Per partition/shard (and ideally per key for top offenders):

QPS (reads/writes)
bytes in/out
latency histograms
queue depth / in-flight requests
CPU time attributed (or service time)
lock wait time / conflict rate
compaction backlog / LSM levels / SST counts
replication lag (leader->follower)
error rates / timeouts

[IMAGE] Partition heatmap dashboard (ASCII mock)

Key insight:

Node metrics tell you where pain is. Partition metrics tell you why.

Production insight:

Be careful with high-cardinality labels (e.g., key=...). Prefer:
- top-K sampling exported separately
- sketches (Count-Min, SpaceSaving)
- tracing logs sampled, analyzed offline

Detection - From "Somethings Slow" to "Partition 17 Is Hot"

Scenario

Your cluster has 256 partitions. You suspect partition 17 is hot.

Detection patterns (ranked by usefulness)

Heatmaps: partition vs time colored by QPS/latency/CPU.
Top-N partitions: list partitions by request rate or service time.
Skew indices: quantify imbalance (Gini coefficient, p95/p50 ratio across partitions).
Queueing indicators: per-partition queue depth, saturation.
Key sampling: top keys within hot partition.

Interactive exercise: Choose a detection metric

You have to pick one metric to page on for hot partitions. Which is best?

A) Node CPU > 90%
B) Partition QPS ratio (max / median) > 10
C) Cluster-wide QPS > baseline + 30%
D) Disk utilization > 80%

Pause and think.

Progressive reveal B is the most directly targeted.

Node CPU (A) is too indirect. Cluster QPS (C) misses skew. Disk utilization (D) is slow-moving and noisy.

Key insight:

Hot partition detection is fundamentally about relative imbalance, not absolute load.

Alerting production note:

Page on skew + impact (e.g., skew>10x AND p99>threshold AND queue depth rising) to avoid false positives during normal diurnal patterns.

Quantifying Skew (So You Dont Argue in Slack)

Scenario

Two engineers disagree:

Engineer 1: "Partition 17 is hot."
Engineer 2: "Its just normal variance."

You need a number.

Options to quantify skew

Max/Median ratio

Simple, intuitive
Sensitive to outliers

p95/p50 across partitions

More robust than max

Gini coefficient

Measures inequality across partitions
Great for dashboards and alerts

Entropy / KL divergence

Useful if you know expected distribution

Mini exercise (pause and think)

If you have 10 partitions and QPS distribution:

[10, 9, 11, 10, 10, 9, 10, 10, 10, 100]

What is max/median?

Pause and compute.

Answer Median is ~10, max is 100 -> 10x skew.

python

Key insight:

Use a skew metric to move from "feels hot" to "measurably hot."

Production insight:

Track skew separately for reads, writes, and bytes. A partition can be "hot by bytes" (large scans) without being "hot by QPS."

Common Misconceptions (Hot Partitions Edition)

Misconception 1: "Hash partitioning prevents hot partitions."

Reality: Hashing spreads random keys well, but it cannot fix skewed popularity (celebrity keys). If 40% of traffic targets one key, hashing faithfully routes 40% to one partition.

Misconception 2: "Just add more partitions."

Reality: More partitions can reduce average load per partition, but the hottest key still maps to exactly one partition. You may reduce collateral damage, but not the root cause.

Misconception 3: "Caching always fixes hot keys."

Reality: Caches help read-heavy hot keys, but:

write-heavy hot keys still hurt
cache stampedes can amplify load
cache invalidation can create bursts

Misconception 4: "Rebalancing solves it."

Reality: Rebalancing moves partitions across nodes, but if the partition is intrinsically hot, youre just moving the hotspot around.

Key insight:

Hot partitions are usually data-model problems or routing problems, not simply capacity problems.

Root Causes - The Usual Suspects

Scenario

You found partition 17 is hot. Now you need to know why.

Root cause categories

Hot key (single key dominates)

counters (global like count)
"latest" feed key
feature flag key
session store for one mega-session

Hot range (range partitioning)

time-based partitions (todays data)
monotonically increasing IDs

Leadership hotspot

too many leaders on one node
leader election pinned

Background work hotspot

compaction backlog
repair/rebuild
checkpointing

Client routing bug

all clients pinned to one partition due to bad hashing

Interactive question

Which is the most common hot key in real systems?

A) user:12345
B) global_counter
C) config:rollout_percentage
D) All of the above

Pause and think.

Answer D.

Stadium analogy: In a stadium with many entrances, if everyone insists on using "Gate A" because its the famous one, Gate A becomes the hot partition.

Key insight:

Hot keys are often created by product features that introduce global coordination.

Diagnosing Hot Keys (Inside a Hot Partition)

Scenario

Partition 17 has 10x QPS. Is it one key or many?

Techniques

Heavy hitter detection

Count-Min Sketch
SpaceSaving algorithm
Top-K sampling

Distributed tracing with key tags

careful: high-cardinality tags can blow up your metrics backend

Request logs sampling

sample 1/1000 requests, compute key frequency offline

Storage engine introspection

per-key lock stats
per-range read/write bytes

Hot partition internal distribution (ASCII)

Decision game

If the hot partition is caused by one key, which mitigation is usually most effective?

A) Move the partition to a bigger node
B) Increase replication factor
C) Split the key (key salting / sharded counters)
D) Increase client timeouts

Pause and think.

Answer C.

Key insight:

If the hotspot is a single key, you need key-level sharding, not partition-level shuffling.

Mitigation Strategy Map (Pick the Right Lever)

Scenario

You have 30 minutes to stabilize production, then a week to fix properly.

Two-phase approach

Immediate stabilization (minutes-hours)

shed load
rate limit
cache
move leadership
isolate noisy tenants

Structural fix (days-weeks)

redesign partition key
split ranges
introduce write aggregation
change data model

Comparison table: Mitigation techniques

Technique	Works best for	Pros	Cons / Trade-offs
Rate limiting	sudden bursts / noisy neighbors	fast, safe	drops requests, needs fairness
Request hedging	tail latency from rare slow replicas	reduces p99	can amplify load if misused
Caching	read-hot keys	huge win	stampedes, staleness
Key salting	single hot key	spreads writes	complicates reads/queries
Range splitting	hot ranges	localizes growth	operational complexity
Leader rebalancing	leader hotspots	quick relief	moves problem if intrinsic
Adaptive partitioning	evolving skew	automatic	complex, can thrash
Write coalescing	counters/aggregates	reduces write QPS	increases staleness

Key insight:

The mitigation must match the hotspots shape: key, range, leader, or maintenance.

Immediate Tactics - Stop the Bleeding

Scenario

Partition 17 leader is at 100% CPU. p99 is exploding. Customers are timing out.

Tactic 1: Backpressure and load shedding

Interactive question: Where should you apply backpressure first?

A) At the storage node
B) At the client
C) At the load balancer / API gateway
D) Everywhere simultaneously

Pause and think.

Progressive reveal Start at C (edge) if possible, because it prevents unnecessary work from entering the system. Then propagate B (client) via retry budgets. Storage-node backpressure (A) is necessary but often too late.

Restaurant analogy: Stop people at the door before they occupy tables and overwhelm the kitchen.

Key insight:

Backpressure is most effective when applied upstream.

Production insight:

Prefer load shedding with explicit errors (429/503 with Retry-After) over silent timeouts; timeouts trigger retries and amplify load.

Tactic 2: Retry budgets (anti-retry storm)

If clients retry blindly, they amplify hotspots.

Use exponential backoff + jitter
Cap retries
Prefer deadline propagation
Use retry budgets: a fixed percentage of traffic allowed to be retries

python

Key insight:

A hot partition often becomes catastrophic only after retries turn it into a traffic multiplier.

Production insight:

Implement retry budgets per client / per endpoint / per partition when possible.
Ensure idempotency (idempotency keys, conditional writes) so retries dont corrupt state.

Tactic 3: Hot key caching + request coalescing

For read-heavy hot keys:

cache at edge or service layer
use singleflight / request coalescing to prevent stampedes

javascript

Trade-offs:

staleness
cache invalidation complexity

Key insight:

Caching helps when the hot resource is read compute, not write serialization.

Failure scenario:

If the backing store is down, singleflight can turn into a "shared failure." Consider:
- negative caching with short TTL
- serving stale (SWR)
- circuit breakers to avoid hammering the store

Tactic 4: Move leadership (when leadership is the hotspot)

If one node hosts too many leaders:

rebalance leaders
ensure rack/zone spread
avoid "preferred leader" pinning

Failure scenario: Leader moves can cause:

temporary unavailability
follower catch-up load
client metadata cache invalidation

Key insight:

Leader rebalancing fixes placement hotspots, not intrinsic key hotspots.

Production insight:

Use load-aware leader placement (CPU/service-time/queue depth) rather than round-robin.
Rate-limit leader moves; treat them like deploys.

Structural Fixes - Designing for Skew

Scenario

Your product has a "global like counter" updated on every page view. Thats a write-hot key.

Option 1: Sharded counters (key salting)

Instead of:

likes:global -> single key

Use:

likes:global:shard_i for i in [0..k)
reads aggregate across shards](streamdown:incomplete-link)

Pause and think Whats the main trade-off?

A) Higher write latency
B) Higher read complexity and eventual consistency
C) More disk usage only
D) Requires strong consistency across shards

Answer B.

python

Analogy: Instead of one cashier handling all tips, each cashier keeps a jar; at closing time, you sum them.

Key insight:

Sharding a hot key turns serialization into parallelism, at the cost of aggregation complexity.

Consistency model clarification:

Sharded counters typically provide eventual consistency for reads of the total.
If you need strict monotonic reads, youll pay with coordination (e.g., single-writer, consensus, or transactional aggregation), which can reintroduce hotspots.

Option 2: Write buffering and coalescing

For counters, metrics, and "last seen" fields:

buffer writes in memory
flush periodically
coalesce multiple updates

Failure scenario:

buffer loss on crash
duplicates on retry

Mitigate with:

WAL for buffer
idempotent updates

Key insight:

Coalescing reduces write QPS by changing semantics from "every event" to "latest/aggregate."

Performance note:

Coalescing reduces write amplification and lock contention, but increases tail latency for "freshness." Make freshness an explicit product requirement.

Option 3: Range partitioning with split/merge (hot ranges)

If partitioning by time or increasing IDs:

newest range is hottest

Mitigations:

pre-split future ranges
split hot ranges dynamically
merge cold ranges

[IMAGE] Range partition timeline (ASCII)

Key insight:

Range splits convert a hot range into multiple smaller ranges, but require routing updates and can cause rebalancing churn.

Failure scenario:

Splits/merges are metadata operations; if metadata is stored in a consensus system, the metadata store can become the new hot partition.

Option 4: Adaptive partitioning (automatic skew handling)

Systems may:

detect hotspots
create sub-partitions
migrate keys

Trade-offs:

complexity
oscillation (thrashing)
metadata overhead

Think about it: How do you avoid thrashing?

Progressive reveal Use hysteresis:

only split when skew persists for N minutes
only merge when cold persists
rate-limit moves

Key insight:

Automation needs stability controls (hysteresis, budgets) to avoid becoming its own outage.

Production insight:

Also add blast-radius controls:
- max concurrent moves
- max bandwidth for rebalancing
- "do not move" list for currently-hot partitions during incidents

Decision Game - Pick a Mitigation Under Constraints

Scenario

You run a multi-tenant event ingestion pipeline.

Partition key: tenant_id
One tenant just onboarded a batch job and now produces 60% of traffic.
You must keep other tenants healthy.

Which mitigation is best?

A) Increase partitions and keep keying by tenant_id B) Add per-tenant rate limits and isolate heavy tenant into dedicated partitions C) Reduce replication factor cluster-wide D) Disable retries for everyone

Pause and think.

Answer B.

Why:

preserves fairness
isolates noisy neighbor
doesnt punish all tenants

Key insight:

Hot partitions are often a fairness problem disguised as a performance problem.

Failure Scenarios You Must Plan For

Scenario 1: Hot partition triggers leader failover loop

Sequence:

Partition leader overloaded
health check fails
leader election
new leader takes over but is also overloaded
repeat

Mitigations:

health checks based on correct signals (avoid CPU-only)
election dampening
load-aware leader selection

Production insight:

Prefer health checks based on ability to serve (queue depth, request timeouts, internal saturation) rather than raw CPU.

Scenario 2: Rebalancing amplifies hotspot

Rebalancing moves a hot shard:

triggers data copy
saturates network/disk
increases latency for everyone

Mitigations:

throttle rebalancing
isolate rebalancing traffic
avoid moving hottest shards during incident

Scenario 3: Cache stampede on a hot read key

Mitigations:

request coalescing
stale-while-revalidate
probabilistic early refresh

Scenario 4: Hot partition + tail latency = retry storm

Mitigations:

retry budgets
idempotency keys
hedged requests with caps

Key insight:

Hot partitions rarely kill you alone; they kill you via control-plane reactions (retries, elections, rebalancing).

Distributed Coordination Angle - Hot Partitions in Consensus Systems

Scenario

You store metadata in a consensus-backed store (Raft/Paxos) and notice one Raft group is overloaded.

Why it happens

One Raft group holds "directory" metadata
Many clients read/write that group
Leader becomes bottleneck because:
- it serializes log appends
- it handles read leases
- it replicates to followers

Pause and think Which is true about a hot Raft group?

Followers can absorb leader write load automatically.
Adding more followers increases write throughput linearly.
The leader is a natural hotspot for writes.
Raft eliminates hotspots by design.

Answer 3.

Mitigations (consensus-specific)

shard metadata across multiple Raft groups
avoid global monotonic counters in one group
use leases/read-only queries on followers (if safe)
batch proposals

Multiple Raft groups with one hot group (ASCII)

Key insight:

Consensus groups are partitions too; a "hot partition" can be a hot Raft group.

CAP/consistency note:

Linearizable writes require leader coordination; under network partitions you must choose:
- CP: reject/unavailable rather than risk split-brain
- AP: accept writes locally and reconcile later (not Rafts model)
Many "hot partition" mitigations (caching, coalescing, async aggregation) intentionally relax consistency to preserve availability and latency.

Stream Processing Angle - Hot Partitions in Kafka/Flink/Spark

Scenario

Kafka topic has 48 partitions. Consumer group lags, but only on partitions 3 and 17.

Why it happens

producer keys skewed
"unknown" key defaulted to constant
partitioner bug

Interactive question If you increase consumer parallelism, will lag disappear?

Pause and think.

Progressive reveal Only if you can increase partition parallelism or change key distribution. If one partition is hot, one consumer still owns it (within a consumer group), so lag persists.

Mitigations

change partition key (include more entropy)
use sticky partitioning carefully
split topics / add partitions (with caveats)
implement two-stage aggregation: pre-aggregate by random shard, then final aggregate

javascript

Key insight:

In streams, hot partitions show up as consumer lag concentration per partition.

Ordering clarification:

Salting changes ordering guarantees. You can preserve ordering within a sub-stream (e.g., per order_id) but not necessarily for all events of a tenant unless you keep them on one partition.

Trade-offs - Consistency, Ordering, and the Cost of Spreading Load

Scenario

You want to "just salt the key." But your product requires:

per-user ordering
read-your-writes
uniqueness constraints

Pause and think Which property is hardest to preserve while mitigating hot keys?

A) Availability
B) Ordering
C) Latency
D) Storage cost

Answer Often B (ordering).

Comparison table: Mitigation vs semantics

Mitigation	Ordering	Strong consistency	Complexity
Key salting	No (unless careful)	Harder	Medium
Two-stage aggregation	Per shard yes, global no	Harder	High
Leader rebalancing	Yes	Yes	Low
Caching	Reads yes	No (stale)	Medium
Range splitting	Within range yes	Yes	Medium

Key insight:

Hot partition mitigation is often a negotiation between throughput and semantic guarantees.

CAP framing (practical):

Under partitions, you cant have both strict consistency and availability. Many mitigations intentionally move you toward AP for reads (cache/SWR) while keeping CP for writes (leader/consensus), or vice versa depending on product needs.

Operational Playbook - Step-by-Step Incident Response

Scenario

Its happening again. You have 15 minutes to restore SLOs.

Step 0: Confirm symptom pattern

p95/p99 latency up
one node or one partition saturated

Step 1: Identify the hot partition

top partitions by service time
heatmap

Step 2: Determine hotspot type

Decision tree (pause and think):

If one partition hot, but many keys inside it are evenly used -> likely range or tenant skew.
If one key dominates -> hot key.
If node hot due to many leaders -> leader placement.
If I/O hot with compaction backlog -> maintenance hotspot.

Progressive reveal Use the above to pick mitigation.

Step 3: Apply safe immediate mitigations

rate limit or shed load
disable aggressive retries
enable caching/coalescing
move leaders (if safe)

Step 4: Post-incident: structural fix

change partitioning
implement key sharding
add fairness controls

Key insight:

Incidents are won by classification speed: identify the hotspot type quickly.

Production insight:

During the incident, freeze "helpful automation" that can amplify churn:
- aggressive rebalancers
- auto-leader-movers
- auto-scaling that triggers re-sharding without guardrails

Designing Alerts That Dont Cry Wolf

Scenario

You add an alert "max/median partition QPS > 5" and get paged constantly during normal diurnal patterns.

Better alert design

alert on sustained skew (e.g., >10x for 10 minutes)
combine skew with saturation (queue depth/latency)
separate read vs write skew
add "top key share" if available

Matching exercise

Match alert type to what it catches:

Alert	Catches
A) max/median QPS skew	1) uneven traffic distribution
B) partition queue depth	2) saturation / backpressure
C) replication lag on one partition	3) leader overload / network
D) compaction backlog	4) LSM maintenance debt

Pause and think.

Answer A->1, B->2, C->3, D->4.

Key insight:

Alerts should detect imbalance + impact, not just imbalance.

Real-World Patterns (What Big Systems Actually Do)

Pattern 1: "Celebrity user" isolation

detect top users/tenants
route them to dedicated shards

Pattern 2: Multi-level partitioning

first by tenant
then by hashed sub-key

Pattern 3: Hierarchical aggregation

edge aggregation -> regional -> global

Pattern 4: Load-aware placement

place hot shards on stronger nodes
keep headroom

Pattern 5: Explicit fairness

per-tenant quotas
weighted fair queueing

Key insight:

At scale, hot partition mitigation becomes traffic engineering + fairness, not only data modeling.

Worked Example - Hot Partition in a Sharded KV Store

Scenario

You operate a sharded KV store with:

128 shards
shard leader per shard
replication factor 3
hash partitioning by key

A new feature introduces key feature_flags:global read on every request and updated frequently.

Step-by-step diagnosis

Heatmap shows shard 42 has 25x read QPS.
Within shard 42, top key is feature_flags:global with 90% of reads.
CPU on shard 42 leader is saturated; followers are fine.

Pause and think Whats the best fix?

A) Increase shard count to 1024
B) Cache feature_flags:global in application with TTL + singleflight
C) Move shard 42 to a bigger node
D) Increase replication factor to 5

Progressive reveal B is the best first fix.

Why:

It targets the read-hot key directly.
It avoids changing semantics much (TTL acceptable).

Longer-term you might also:

move feature flags to a separate system optimized for fanout
push updates via pub/sub

typescript

Key insight:

Read-hot global config is best handled by fanout distribution, not by hammering storage.

When Mitigation Backfires (And How to Notice)

Scenario

You deployed key salting for a hot counter. Writes spread out, but now reads are slow and expensive.

Failure modes

fan-in reads across many shards increase latency
partial failures cause undercount
inconsistent reads cause UI jitter

Pause and think How do you make reads cheaper?

Progressive reveal Introduce a materialized aggregate:

background job periodically sums shards into likes:global:total
reads hit total
writes continue to shards

Trade-offs:

eventual consistency
more moving parts

Key insight:

Many mitigations shift cost from write path to read path (or vice versa). Measure both.

Production insight:

For fan-in reads, consider:
- parallel reads with timeouts + partial aggregation
- caching the aggregate
- storing per-shard deltas and using a streaming aggregator

Final Synthesis - The Hot Partition Gauntlet

Scenario

Youre designing a new multi-region system:

user activity events
near-real-time counters
per-tenant dashboards

You must choose:

partition keys
alerting
mitigation strategies

Synthesis challenge (pause and think)

Given requirements:

per-tenant ordering for events
dashboards tolerate 30s staleness
one tenant may be 100x larger than median

Design a strategy:

What is your primary partition key?
How do you prevent one tenant from melting the cluster?
How do you handle global counters?
What do you alert on?

Pause and write your answers.

Progressive reveal (one strong solution)

Partition by tenant_id, but sub-partition within tenant for scalable processing:
- partition_key = (tenant_id, shard = hash(event_id) % k(tenant_id))
- keep ordering where needed by using per-stream keys for ordered subsets.
Enforce per-tenant quotas and isolate heavy tenants:
- dedicated partitions or dedicated topic
- weighted fair queueing
Global counters:
- sharded counters + periodic aggregation
- or stream-based aggregation pipeline (edge -> regional -> global)
Alerting:
- skew (max/median) + saturation (queue depth/latency)
- consumer lag concentration per partition
- top key share within partition (if measurable)

Key insight:

The best hot-partition strategy is proactive: design for skew, enforce fairness, and build observability that pinpoints hotspots quickly.

Multi-region note:

Cross-region replication can turn a hot partition into a WAN hotspot (egress, replication lag). Consider:
- local writes + async replication for dashboards (AP-ish reads)
- per-region aggregation then global rollup

Closing Challenge Questions

In your current system, what are the top 5 keys/tenants by traffic? Do you know?
Do you have per-partition queue depth and latency histograms?
What happens when a leader is overloaded - do clients retry responsibly?
Which mitigation would you deploy in 15 minutes vs 2 weeks?

Appendix: Quick Checklist

Partition-level metrics exist and are cheap enough to keep
Skew metric (max/median, p95/p50, Gini) tracked over time
Top keys for hot partitions discoverable (sampling/sketch)
Retry budgets and deadline propagation implemented
Rate limiting / fairness controls per tenant
Safe leader rebalancing and throttled rebalancing
Structural mitigations: salting, splitting, aggregation, caching

Key Takeaways

Hot partitions occur when traffic concentrates on a single shard or partition — caused by celebrity users, viral content, or skewed partition keys
Detect hot partitions by monitoring per-partition latency and throughput — a sudden spike in one partition while others are idle is the signal
Key splitting distributes hot keys across multiple partitions — append a random suffix to spread load, aggregate at read time
Local caching absorbs repeated reads for hot keys — application-level caches reduce traffic to the hot partition

Previous Zero Downtime Migration Up next Time Series Database Optimization

Chapter complete!

Up next Time Series Database Optimization

Continue