The storage engine behind Cassandra, RocksDB, and LevelDB — optimized for write-heavy workloads by turning random writes into sequential I/O.
Audience: engineers who already know what an LSM tree is, and now want to reason about compaction strategies under distributed workloads, failures, and operational constraints.
You run a distributed key-value store backed by an LSM-tree engine. Writes are blazing fast. Then:
You look at metrics and see the same culprit every time: compaction.
If compaction is “just background housekeeping,” why does it dominate user-visible performance?
Take 10 seconds.
...
Compaction is not just housekeeping; it’s the mechanism that defines your read amplification, write amplification, space amplification, and failure recovery behavior. In distributed systems, compaction also interacts with:
Compaction isn’t a background detail. It’s a core part of your storage protocol.
Key insight
Choosing a compaction strategy is choosing a performance envelope and a failure mode profile.
Imagine a busy coffee shop.
Compaction is the staff periodically:
Different compaction strategies are like different restaurant policies:
Key insight
Compaction is the trade-off engine: it converts write-optimized ingestion into read-optimized layout—at a cost.
Compaction typically:
A. Compaction is required mostly to reclaim disk space.
B. Compaction is required mostly to reduce the number of files and thus reduce read amplification.
C. Compaction is required mostly to maintain invariants that make point lookups and range scans efficient.
Pause and think.
...
Answer reveal: B and C are the key reasons; A applies only sometimes. Many systems can tolerate temporary space bloat but cannot tolerate unbounded read amplification.
Key insight
Space reclamation is a side effect; the main job is controlling read amplification and maintaining layout invariants.
You must choose compaction strategy defaults for a distributed store with:
You’re deciding between size-tiered compaction (STC) and leveled compaction (LC).
Which one would you pick if your top priority is consistent read latency? Which if your top priority is max write throughput?
...
But distributed reality complicates this. We’ll build up the reasoning.
Key insight
In distributed systems, compaction choice affects not just local performance but also replication lag, repair cost, and operational stability.
Match each statement to STC or LC.
Pause and think.
...
Answer reveal:
1 -> STC (lower write amplification in many cases)
2 -> LC (lower read amplification)
3 -> STC (more duplicates across overlapping runs)
4 -> LC (more predictable range scans)
Key insight
STC trades space (and read amplification) for write efficiency; LC trades write amplification for read predictability.
You run a delivery hub. Small packages (fresh SSTables) arrive constantly. To reduce clutter, you wait until you have, say, 4 similarly sized bundles, then you merge them into a larger bundle.
That’s STC.
If SSTables overlap, what happens to point reads?
...
Answer reveal: Point reads may need to check multiple SSTables (and their bloom filters) because the key could be in any overlapping run. Read amplification can grow with the number of runs.
Key insight
Overlap is STC’s defining feature: it preserves write efficiency but makes reads less predictable.
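To make that read path concrete, here is a minimal Python sketch (toy data structures, not any real engine's API) of a point read over overlapping STC runs: every run whose filter passes must actually be consulted, newest first.

```python
import hashlib

class Run:
    """A toy immutable sorted run (SSTable)."""
    def __init__(self, entries):
        self.data = dict(entries)
        # Toy "bloom filter": a set of hash prefixes. Real filters are
        # probabilistic bit arrays, but the read-path shape is the same.
        self.filter = {self._h(k) for k in self.data}

    @staticmethod
    def _h(key):
        return hashlib.md5(key.encode()).hexdigest()[:8]

    def maybe_contains(self, key):
        return self._h(key) in self.filter

def point_read(runs, key):
    """Overlapping runs force a newest-first walk: every run whose
    filter passes must actually be read."""
    consulted = 0
    for run in runs:  # ordered newest -> oldest
        if run.maybe_contains(key):
            consulted += 1
            if key in run.data:
                return run.data[key], consulted
    return None, consulted

# Three overlapping runs; "a" appears twice (the newest version wins).
runs = [Run([("a", 1)]), Run([("b", 2)]), Run([("a", 0), ("c", 3)])]
print(point_read(runs, "a"))  # (1, 1): newest version shadows the old one
print(point_read(runs, "c"))  # (3, 1): filters skip the first two runs
```

With more overlapping runs, the `consulted` count (and its variance) grows: that is read amplification in miniature.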
Reality: STC compaction can be bursty.
In distributed systems, IO spikes can align across nodes (same workload patterns), causing correlated latency events.
Key insight
STC’s burstiness can create synchronized compaction storms across a cluster.
Consider a replicated LSM store:
Independent compaction can leave replicas in different compaction states at any moment, so the same read costs a different amount on each replica.
If your coordinator does hedged reads (send to multiple replicas), which replica is likely to win?
...
Answer reveal: The replica with lower read amplification (fewer SSTables to consult, better cache locality) wins more often, becoming “hotter,” potentially causing load imbalance.
Key insight
Compaction strategy can indirectly create replica load skew via tail-latency winners.
In STC, because overlapping runs persist, tombstones may take longer to be fully purged.
1. “STC tends to purge tombstones quickly because it merges many files at once.”
2. “STC can keep tombstones around longer because overlap delays full key history convergence.”
Pause and think.
...
Answer reveal: 2.
Distributed impact:
Key insight
STC can be painful for delete-heavy or TTL-heavy workloads unless tuned (or paired with time-window compaction).
Instead of piles of tickets on the counter, you install shelves:
Rule: except for L0, each shelf contains files that do not overlap in key range.
When a shelf exceeds its size limit, you move some files down by merging them with overlapping files in the next shelf.
If levels don’t overlap, what happens to point reads?
...
Answer reveal: A point read typically consults:
- the L0 files whose ranges cover the key (L0 files may overlap), and
- at most one file per level below L0, because non-overlap guarantees a single candidate.
So read amplification is bounded and predictable.
Key insight
LC buys predictable reads by enforcing non-overlap—at the cost of rewriting data many times.
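For contrast with the STC read path, here is a minimal sketch (hypothetical layout, not a real engine's API) of a leveled lookup; L0, which may overlap, is omitted. Because files within a level do not overlap, a binary search picks at most one file per level.

```python
import bisect

# Hypothetical leveled layout: each level is a sorted list of
# (min_key, max_key, data) with no overlap within a level (the shelf rule).
levels = [
    [("a", "f", {"b": 1}), ("g", "m", {"h": 2})],   # L1
    [("a", "z", {"b": 0, "t": 3})],                 # L2
]

def file_for(level, key):
    """Binary-search the single file (if any) whose range covers key."""
    i = bisect.bisect_right([f[0] for f in level], key) - 1
    if i >= 0 and level[i][0] <= key <= level[i][1]:
        return level[i]
    return None

def point_read(levels, key):
    consulted = 0
    for level in levels:          # top (newest) to bottom
        f = file_for(level, key)
        if f is not None:
            consulted += 1        # at most one file per level
            if key in f[2]:
                return f[2][key], consulted
    return None, consulted

print(point_read(levels, "t"))  # (3, 1): one file's range covers "t"
```

The worst case is therefore one consult per level (plus however many L0 files overlap), which is exactly the bounded behavior LC pays for.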
LC can be worse when the workload is write-heavy, the IO budget is small, or ingestion is bursty: the repeated rewrites across levels consume bandwidth that foreground writes and replication need.
Key insight
LC is not “better,” it’s more read-optimized and more write-expensive.
With LC, each replica tends to have a similar, bounded number of SSTables to consult per read.
This reduces variance across replicas and stabilizes quorum read latency.
If you do quorum reads (R=2 of 3) and you pick the fastest two responses, how does LC help tail latency?
...
Answer reveal: By reducing per-replica variance, LC reduces the chance that one replica is dramatically slower due to checking many overlapping runs. That tightens the distribution, improving p99.
Key insight
LC’s biggest distributed advantage is variance reduction—not just lower average read cost.
Analogy: compaction is a billing plan. With LC you prepay IO at write time; with STC you pay per use at read time.
Key insight
Compaction strategy is a billing model: you decide whether to spend IO budget on writes or reads.
| Dimension | Size-Tiered (STC) | Leveled (LC) | Distributed consequence |
|---|---|---|---|
| Read amplification (point) | Higher, variable | Lower, bounded | Tail latency variance affects quorum/hedged reads |
| Write amplification | Lower (often) | Higher (rewrite across levels) | Replication lag if compaction steals IO from WAL/fsync |
| Space amplification | Higher (duplicates across runs) | Lower (less duplication) | More disk pressure -> more rebalancing events |
| Compaction pattern | Bursty | More continuous | Cluster-wide IO storms vs steady background |
| Tombstone GC | Can be slower | Typically faster/more predictable | Repair traffic and TTL workloads |
| Range scans | Often worse (many overlaps) | Better (organized ranges) | Analytics queries less disruptive |
| Operational tuning | Simpler knobs but tricky bursts | More knobs; predictable once tuned | SLO management and capacity planning |
Key insight
In a cluster, variance and correlation matter as much as averages.
Compaction is local work with global consequences. It interacts with:
A) STC + heavy deletes + frequent repairs
B) LC + write-heavy workload + small IO budget
C) Either strategy + no compaction throttling
Pause and think.
...
Answer reveal: All can be dangerous, but C is the cluster killer: unthrottled compaction can starve foreground IO and replication.
Key insight
Compaction must be treated as a first-class resource consumer in your distributed scheduler.
Goal: directional reasoning, not exact formulas.
WA ~= total bytes written to storage / bytes ingested.
RA ~= number of SSTables consulted per read (plus IO per consult).
SA ~= physical bytes stored / logical live bytes.
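These ratios can be wired into a tiny helper for back-of-envelope comparisons; the traffic numbers below are made up purely for illustration.

```python
def amplification(bytes_ingested, bytes_written_total,
                  sstables_per_read, physical_bytes, live_bytes):
    """Directional ratios only, matching the definitions above."""
    return {
        "WA": bytes_written_total / bytes_ingested,
        "RA": sstables_per_read,
        "SA": physical_bytes / live_bytes,
    }

# Hypothetical day of traffic: 100 GB ingested, compaction rewrote it
# several times on the way down the levels (500 GB total written),
# reads touch ~3 SSTables, and duplicates/tombstones inflate disk by 30%.
print(amplification(100, 500, 3, 130, 100))
# {'WA': 5.0, 'RA': 3, 'SA': 1.3}
```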
If your workload is 95% point reads and you have strict p99 latency SLO, which amplification dominates your choice?
...
Answer reveal: Read amplification dominates, so LC often wins, or you must heavily tune STC and caching.
Key insight
For SLO-driven systems, the tail behavior of RA is often the deciding factor.
Match the workload to the likely better default:
Workloads:
Strategies:
Pause and think.
...
Answer reveal (typical):
1 -> It depends; consider TWCS (or STC tuned for TTL)
2 -> LC
3 -> LC (range scans benefit)
4 -> STC (lower WA, but must throttle)
Key insight
Real systems often use hybrids (TWCS, tiered+leveled) to match time-based access patterns.
Tenant A: write-heavy ingestion. Tenant B: latency-sensitive reads.
Compaction steals IO from reads and writes. If you don’t schedule it, the disk becomes a battleground.
If you throttle compaction too much, what happens?
...
Answer reveal: You accumulate compaction debt: file and run counts grow, read amplification climbs, space amplification grows, and eventually the engine stalls writes anyway, more violently than steady throttling would have.
Key insight
Throttling compaction is like deferring maintenance on a delivery fleet: debt compounds.
Which knob is most directly tied to preventing write stalls in LC?
A) bloom filter bits-per-key
B) L0 slowdown/stop triggers
C) compression algorithm
...
Answer reveal: B.
Key insight
LC turns compaction into a backpressure mechanism: L0 is the emergency brake.
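A sketch of that emergency brake, with made-up trigger values modeled on the L0 slowdown/stop knobs common in leveled engines (real defaults differ per engine and version):

```python
# Illustrative trigger values, not any engine's real defaults.
L0_SLOWDOWN_TRIGGER = 20   # start delaying writes
L0_STOP_TRIGGER = 36       # refuse writes until compaction catches up

def write_admission(l0_file_count):
    """How an incoming write is treated under current L0 pressure."""
    if l0_file_count >= L0_STOP_TRIGGER:
        return "stall"      # the emergency brake
    if l0_file_count >= L0_SLOWDOWN_TRIGGER:
        return "delayed"    # shed load gradually instead of cliff-stalling
    return "accepted"

print(write_admission(5), write_admission(25), write_admission(40))
# accepted delayed stall
```

The slowdown tier exists so backpressure arrives as a gradient rather than a cliff: ingestion throttles before the hard stop is reached.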
Compaction preserves semantics, but failures happen mid-compaction.
Pause and think: what could go wrong?
...
Answer reveal: Engines treat compaction output as new immutable files and atomically install them via a manifest/version edit.
Failure modes:
Distributed consequence:
Key insight
Compaction relies on atomic metadata updates; correctness is easy, but post-crash performance can drift.
Compaction affects the cost of reconciliation:
Pause and think: does compaction reduce the amount of data that must be repaired across replicas?
...
Answer reveal: Not directly. Repair is about logical divergence, not physical layout. Compaction changes scan efficiency and tombstone retention.
Key insight
Compaction doesn’t fix inconsistency; it changes the cost profile of detecting and healing it.
Reality: compaction is local work with global consequences.
Key insight
Local IO contention becomes global tail latency through coordinated request fan-out.
STCS (size-tiered), LCS (leveled), TWCS (time-window).
Often leveled compaction for predictable reads.
Minor/major compactions; major compaction can be disruptive.
Key insight
Strict latency SLOs often push systems toward leveled; time-series pushes toward time-window.
In LC:
Analogy: clinic waiting room overflow slows the whole clinic.
Key insight
Keeping L0 under control is the difference between stable latency and meltdown.
1. STC is always better for range scans because it rewrites less.
2. LC is generally better for range scans because non-overlapping levels reduce redundant reads.
3. Range scans are unaffected by compaction strategy because data is sorted in all SSTables.
...
Answer reveal: 2.
Key insight
Sorted runs aren’t enough; overlap determines redundant scanning.
Cache hit rate is 95%, yet p99 still spikes.
Pause and think: how?
...
Answer reveal: Tail reads may touch many SSTables, causing extra metadata lookups and cache misses on indexes/filters.
Key insight
Compaction shapes the working set of metadata, not just data blocks.
Compaction strategy affects data movement:
Pause and think: which strategy is more likely to cause extra network transfer during range movement?
...
Answer reveal: STC.
Key insight
Overlap increases data movement amplification during rebalancing.
LC can reduce redundant reads during scans, but repair cost is still dominated by data size + network + hashing.
Key insight
LC improves scan efficiency; it doesn’t make repair free.
Toy model: LC with 6 overlapping files in L0 and 6 non-overlapping levels below.
Question: for a point lookup, what’s the maximum SSTables you might consult?
...
Answer reveal: roughly 12 (6 from L0 + 1 per level).
Now STC: 4 tiers with 10 overlapping runs each -> worst case ~40.
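The same arithmetic as a sketch, using the toy numbers above:

```python
def lc_worst_case(l0_files, levels_below):
    # L0 files may all overlap the key; below L0, at most one file per level.
    return l0_files + levels_below

def stc_worst_case(tiers, runs_per_tier):
    # Every overlapping run in every tier might hold the key.
    return tiers * runs_per_tier

print(lc_worst_case(6, 6))     # 12
print(stc_worst_case(4, 10))   # 40
```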
Key insight
Bounded worst-case is a major reason LC is used for latency-sensitive systems.
Decision game: You see frequent write stalls in LC. Which change is most likely to help without changing hardware?
A) Increase L0 stop trigger
B) Increase compaction parallelism / background threads
C) Decrease bloom filter false positive rate
...
Answer reveal: often B (if headroom exists) or A (if you can tolerate more L0 overlap). C helps reads but not compaction throughput.
Key insight
Write stalls mean compaction throughput < ingestion rate.
Compaction cannot drop overwritten versions/tombstones visible to an active snapshot.
Effects:
Distributed consequence: one long-running query on one replica can increase disk usage and trigger rebalancing/throttling.
Key insight
Snapshots turn compaction from garbage collection into version management.
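A minimal sketch of the snapshot check, assuming per-entry sequence numbers and a list of active snapshot seqnos (illustrative, not a real engine's API): an older version is droppable only if no snapshot falls between it and the version that shadows it.

```python
def can_drop(old_seq, newer_seq, snapshots):
    """An older version (or a tombstone-shadowed value) is droppable only
    if no active snapshot sits between the two versions; such a snapshot
    must still be able to read the older one."""
    assert newer_seq > old_seq
    return not any(old_seq <= s < newer_seq for s in snapshots)

print(can_drop(10, 20, snapshots=[]))    # True: nobody can still see seq 10
print(can_drop(10, 20, snapshots=[15]))  # False: snapshot 15 reads seq 10
```

One long-lived snapshot pins every shadowed version written after it, which is how a single long-running query inflates disk usage.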
Hot partitions flush more, create more L0 files, and can trigger compaction hotspotting.
Pause and think: which strategy is more sensitive to hot partitions causing write stalls?
...
Answer reveal: often LC due to L0 stop/slowdown behavior.
Key insight
LC enforces invariants continuously; hot spots violate them quickly.
| Symptom | Likely cause | Strategy association |
|---|---|---|
| Periodic IO spikes, p99 spikes at same time daily | bursty compaction | STC more common |
| Sustained high disk write throughput | continuous compaction | LC |
| Write stalls / backpressure | L0 too large, compaction behind | LC |
| Disk usage keeps growing | deletes + insufficient compaction + snapshots | both; STC often worse |
| Quorum reads often wait for 1 slow replica | replica variance in RA | STC |
Key insight
Graph L0 file count, pending compactions, compaction bytes/sec, and disk util together.
New node bootstraps data and then triggers compaction to reach steady-state (“compaction aftershock”).
Pause and think: which strategy tends to make bootstrap more painful?
...
Answer reveal: often LC, because it must build leveled invariants (rewriting data multiple times) unless you have a bulk-load path.
Key insight
Bulk load + leveled invariants needs special handling to avoid rewriting the dataset multiple times.
Key insight
Hybrids exist because workloads are not stationary: recent and cold data have different access patterns.
Recent data: optimize ingestion, accept overlap. Old data: optimize scans/reads, enforce order.
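A sketch of that temperature split, with an illustrative one-day hot window (names and thresholds are assumptions, loosely TWCS-shaped):

```python
from datetime import datetime, timedelta

def policy_for(newest_timestamp, now, hot_window=timedelta(days=1)):
    """Pick a compaction policy by the age of an SSTable's newest data."""
    if now - newest_timestamp <= hot_window:
        return "size-tiered"   # recent: cheap ingestion, tolerate overlap
    return "leveled"           # cold: enforce order for scans and reads

now = datetime(2024, 1, 8)
print(policy_for(datetime(2024, 1, 8), now))  # size-tiered
print(policy_for(datetime(2024, 1, 1), now))  # leveled
```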
Key insight
The best strategy is often “different strategies for different temperatures.”
3-region active-active, local quorum reads, cross-region latency high.
Question: would you choose the same strategy everywhere?
...
Answer reveal: Not necessarily. Read-heavy regions may prefer LC; ingestion-heavy regions may prefer STC/TWCS. But this increases operational complexity (repair, disk-usage differences, streaming efficiency).
Key insight
Geo deployments can use compaction policy as specialization, but it complicates ops.
...
Answer reveal: true: 1, 3, 4. False: 2.
Key insight
Read amplification is about how many overlapping runs remain, not just merge size.
Use when: write throughput and ingestion cost dominate, read latency is relaxed or mostly absorbed by caches, and occasional bursty IO is acceptable.
Mitigate with: bloom filters, metadata caching, smoothing/throttling, TWCS for TTL.
Use when: strict read-latency SLOs (point or range) dominate, and you can afford the extra write amplification and sustained compaction throughput it requires.
Mitigate with: adequate compaction throughput, IO isolation, careful L0 triggers, bulk-load paths.
Key insight
Choose based on what you can afford under failure and rebuild, not just steady state.
Hardware helps, but storms can saturate any disk; correlated behavior still breaks SLOs.
Better: compaction budgets, IO isolation, and planning for rebuild/repair capacity.
Key insight
Capacity planning must include compaction + repair + rebuild, not just foreground QPS.
200-node cluster, RF=3, 80% writes, p99 read SLO 50ms, NVMe shared, ingestion peaks at noon, TTL=7 days, weekly node replacements.
Choose:
Pause and think (write down your answer).
...
Answer reveal (one reasonable solution): TWCS for the TTL'd data, so whole expired 7-day windows drop cheaply; compaction throttling and IO isolation on the shared NVMe to protect the 50ms p99 during noon ingestion peaks; and a bulk-load/streaming path so weekly node replacements don't rewrite the dataset through the levels.
Key insight
The best design is the one whose failure behavior you can operate.