System Design: Key-Value Store
Design a distributed key-value store from scratch, supporting billions of keys with tunable consistency and high availability. Covers LSM trees, Dynamo-style architecture, consistent hashing, and vector clocks.
Requirements
Functional Requirements:
- GET, PUT, DELETE operations on arbitrary byte keys and values
- Keys up to 1 KB; values up to 10 MB
- Tunable consistency: eventual consistency (AP) or strong consistency (CP) configurable per operation
- Conditional writes (compare-and-swap)
- TTL support per key
- Batch GET and PUT operations
Non-Functional Requirements:
- Sub-10ms p99 GET latency
- Sub-20ms p99 PUT latency
- 1 million ops/sec aggregate throughput
- Horizontal scalability to 100 billion keys
- 99.999% availability (AP mode) or 99.9% (CP mode)
Scale Estimation
With 100 billion keys at an average of 1 KB per key-value pair, total data is 100 TB. Consistent hashing across 1,000 nodes with 3x replication means each node stores ~300 GB (100 TB * 3 / 1,000). A 512 GB SSD per node fits this with headroom left for compaction scratch space. At 1 million ops/sec across 1,000 nodes, each node handles 1,000 ops/sec, which is trivially achievable; the bottleneck is more likely network or replication overhead than per-node compute. An LSM-tree storage engine on each node can sustain 100,000+ writes/sec with efficient compaction.
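The arithmetic above, restated as a runnable back-of-envelope check; the constants are exactly the ones assumed in this section:

```python
# Back-of-envelope capacity check for the numbers in this section.
keys = 100e9                 # 100 billion keys
bytes_per_pair = 1e3         # 1 KB average key + value
replication = 3
nodes = 1_000
target_ops = 1_000_000       # aggregate ops/sec

raw_tb = keys * bytes_per_pair / 1e12                      # unique data
per_node_gb = raw_tb * 1e12 * replication / nodes / 1e9    # with replicas
ops_per_node = target_ops / nodes

print(raw_tb, per_node_gb, ops_per_node)  # 100.0 TB, 300.0 GB, 1000.0 ops/sec
```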
High-Level Architecture
The design follows Amazon Dynamo's architecture: a leaderless, AP distributed key-value store with consistent hashing and quorum-based reads and writes. Every node is equal; there are no dedicated coordinators or leaders for data operations. The client (or a coordinator node in the request path) hashes the key to determine its position on the consistent hash ring, identifies the N (typically 3) nodes responsible for that position (the preference list), and sends reads and writes to those N nodes using quorums (N=3 replicas, R=2 for reads, W=2 for writes). As long as R+W > N, every read quorum overlaps every write quorum in at least one replica, so a read always contacts at least one node holding the most recent successful write; note that sloppy quorums during failures weaken this into a probabilistic guarantee.
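A minimal sanity check of that quorum invariant, using the N, R, W values from this section; the function and its names are illustrative:

```python
# Validate quorum parameters: R + W > N guarantees read/write overlap.
def validate_quorum(n: int, r: int, w: int) -> None:
    if r + w <= n:
        raise ValueError(f"R+W must exceed N for overlap: R={r}, W={w}, N={n}")
    if 2 * w <= n:
        # Two write quorums of size W can be disjoint when 2W <= N,
        # so W > N/2 is also needed to serialize conflicting writes.
        print("warning: disjoint write quorums possible")

validate_quorum(3, 2, 2)  # the N=3, R=2, W=2 configuration above
```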
For writes, the coordinator sends the PUT to all N nodes in the preference list. Each node writes to its local LSM-tree and returns an acknowledgment; the write is considered successful after W=2 acknowledgments, and the remaining replica completes asynchronously. If a target replica is temporarily unavailable, hinted handoff stores the write on a healthy node along with a hint naming the intended owner, and that node forwards the write once the owner recovers. Vector clocks version the values: each PUT stamps the new version with a vector clock increment for the coordinating node. On read, if multiple versions are returned (concurrent writes from different coordinators), the client is responsible for reconciling them (semantic reconciliation, e.g., last-write-wins or an application-specific merge).
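A self-contained sketch of this write path, modeling replicas as in-process objects rather than RPC targets; Node, its fields, and the fallback selection are simplifying assumptions, not Dynamo's actual interfaces:

```python
# Toy quorum write with hinted handoff, simulated in one process.
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    up: bool = True
    data: dict = field(default_factory=dict)
    hints: list = field(default_factory=list)   # (intended_owner, key, value)

    def put(self, key, value):
        if not self.up:
            raise ConnectionError(f"{self.name} unreachable")
        self.data[key] = value

def coordinate_put(key, value, preference_list, all_nodes, w=2):
    acks = 0
    for node in preference_list:
        try:
            node.put(key, value)
            acks += 1
        except ConnectionError:
            # Hinted handoff: park the write on a healthy node outside the
            # preference list, tagged with the intended owner's name.
            fallback = next(n for n in all_nodes
                            if n.up and n not in preference_list)
            fallback.hints.append((node.name, key, value))
            acks += 1   # sloppy quorum: the hinted replica counts toward W
    return acks >= w

nodes = [Node("A"), Node("B"), Node("C"), Node("D")]
nodes[2].up = False                                        # C is partitioned away
print(coordinate_put("cart:42", b"v1", nodes[:3], nodes))  # True
print(nodes[3].hints)   # [('C', 'cart:42', b'v1')]: D holds C's write
```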
For reads, the coordinator sends GET to R=2 nodes from the preference list. If both return the same version (same vector clock), the response is immediate. If versions diverge (conflicting concurrent writes), the coordinator returns all versions to the client along with their vector clocks, letting the client resolve the conflict. This "always writeable" design sacrifices consistency for availability — the system never rejects a write due to network partitions.
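A sketch of the read fast path and the conflict case, assuming the R replica responses have already arrived as (value, vector clock) pairs; the clock representation matches the vector-clock section below:

```python
# Quorum read: agree fast when clocks match, surface siblings otherwise.
def coordinate_get(replica_responses, r=2):
    versions = replica_responses[:r]
    first_clock = versions[0][1]
    if all(clock == first_clock for _, clock in versions):
        return versions[0][0]    # identical vector clocks: return the value
    return versions              # divergent versions: client reconciles

print(coordinate_get([(b"v2", {"C": 2}), (b"v2", {"C": 2})]))   # b'v2'
print(coordinate_get([(b"v2", {"C": 2}), (b"v3", {"D": 1})]))   # both siblings
```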
Core Components
LSM-Tree Storage Engine
Each node uses an LSM-tree (Log-Structured Merge Tree) as its local storage engine. Writes go to an in-memory memtable (a red-black tree, sorted by key). When the memtable reaches a threshold (e.g., 64 MB), it is flushed to disk as an immutable SSTable (Sorted String Table) file. SSTables are organized in levels: L0 holds the freshest flushes (up to 4 files before compaction triggers), and deeper levels have size targets that grow roughly 10x per level (L1 ≈ 10 MB, L2 ≈ 100 MB, and so on). Compaction merges overlapping SSTables from adjacent levels, removing deleted keys (tombstones) and collapsing duplicate versions of a key. A read checks the memtable, then L0 SSTables, then L1 and below, with an O(log n) index lookup per SSTable. Bloom filters (one per SSTable) short-circuit lookups for keys a given SSTable cannot contain.
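A toy version of that write/read path under heavy simplifying assumptions: a plain dict stands in for the red-black-tree memtable, sorted tuple lists stand in for SSTables, and the flush threshold is an entry count rather than 64 MB; real engines add a WAL, bloom filters, and background compaction:

```python
# Minimal LSM sketch: memtable flushes to immutable sorted runs.
import bisect

MEMTABLE_LIMIT = 4   # entries, standing in for the 64 MB byte threshold

class ToyLSM:
    def __init__(self):
        self.memtable = {}
        self.sstables = []   # newest first; each is a sorted [(key, value)]

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= MEMTABLE_LIMIT:
            # Flush: freeze the memtable as an immutable sorted run.
            self.sstables.insert(0, sorted(self.memtable.items()))
            self.memtable = {}

    def get(self, key):
        # Newest data wins: memtable first, then SSTables young to old.
        if key in self.memtable:
            return self.memtable[key]
        for table in self.sstables:
            # Binary search stands in for the index block + bloom filter.
            i = bisect.bisect_left(table, (key,))
            if i < len(table) and table[i][0] == key:
                return table[i][1]
        return None

db = ToyLSM()
for k in "abcd":
    db.put(k, k.upper())     # fourth put triggers a flush
db.put("a", "A2")            # newer version shadows the flushed one
print(db.get("a"), db.get("c"), db.get("z"))   # A2 C None
```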
Consistent Hashing Ring
The key space is a 128-bit integer ring (e.g., the MD5 hash of the key, which is what Dynamo used; SHA-1 would yield a 160-bit ring). Each physical node is assigned 150–200 virtual nodes (vnodes) on the ring to keep key distribution even, including across heterogeneous node sizes. The preference list for a key consists of the N nodes whose vnode positions are immediately clockwise of the key's hash position, skipping vnodes that belong to a physical node already chosen (so that N distinct physical nodes hold replicas). Ring state is maintained by a gossip protocol: each node periodically exchanges ring state with two randomly selected peers, converging to global consistency in O(log N) rounds.
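A compact version of the ring with virtual nodes, assuming MD5 as above and a scaled-down vnode count for readability; Ring and its method names are illustrative:

```python
# Consistent-hash ring with vnodes and clockwise preference-list lookup.
import bisect
import hashlib

def h(s: str) -> int:
    return int(hashlib.md5(s.encode()).hexdigest(), 16)   # 128-bit position

class Ring:
    def __init__(self, nodes, vnodes=8):   # 150-200 vnodes in production
        # Each physical node owns several positions on the ring.
        self.points = sorted((h(f"{n}#{i}"), n)
                             for n in nodes for i in range(vnodes))

    def preference_list(self, key: str, n: int = 3):
        start = bisect.bisect(self.points, (h(key), ""))
        owners, seen = [], set()
        for j in range(len(self.points)):
            _, node = self.points[(start + j) % len(self.points)]
            if node not in seen:   # skip vnodes of already-chosen nodes
                seen.add(node)
                owners.append(node)
            if len(owners) == n:
                break
        return owners

ring = Ring(["node-a", "node-b", "node-c", "node-d"])
print(ring.preference_list("user:1234"))   # 3 distinct physical nodes
```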
Vector Clocks for Conflict Detection
A vector clock is a map of (node_id → counter), with missing entries treated as 0. Each PUT through coordinator C increments C's counter in the clock: {C: 5, D: 3, ...}. Given two versions with clocks VC1 and VC2: VC1 descends from VC2 if every counter in VC1 is ≥ the corresponding counter in VC2, meaning VC1 is a later version and there is no conflict. If neither VC1 nor VC2 dominates the other (e.g., VC1 = {C:5, D:3}, VC2 = {C:4, D:4}), the writes were concurrent: a conflict. The client receives both values and resolves them: for shopping cart data, the semantically correct merge is the union of both carts; for monotonic counters, the maximum; for arbitrary data, last-write-wins on a physical timestamp is common but risks silently dropping a concurrent write.
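The descends/concurrent test, sketched directly from the definition above (clocks as plain dicts, absent counters read as zero):

```python
# Vector clock comparison: descent, equality, or concurrency.
def descends(a: dict, b: dict) -> bool:
    # a descends from b if a matches or exceeds every counter in b.
    return all(a.get(node, 0) >= count for node, count in b.items())

def compare(a: dict, b: dict) -> str:
    if descends(a, b) and descends(b, a):
        return "equal"
    if descends(a, b):
        return "a-newer"
    if descends(b, a):
        return "b-newer"
    return "concurrent"   # neither dominates: a genuine conflict

print(compare({"C": 5, "D": 3}, {"C": 4, "D": 3}))   # a-newer
print(compare({"C": 5, "D": 3}, {"C": 4, "D": 4}))   # concurrent
```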
Database Design
Each node stores data in an LSM-tree with a RocksDB-compatible layout: SSTable files on disk (.sst), a manifest file tracking SSTable metadata, and a write-ahead log (WAL) for durability. WAL durability is configurable: fsync per write, fsync per second, or no fsync. Metadata (consistent hash ring state, node membership, vnode assignments) is stored in a distributed configuration store (ZooKeeper or etcd) and cached locally on each node. TTL is implemented by storing an expiry timestamp alongside each value; expired keys are lazily deleted on read and proactively cleaned by a background thread that drops expired entries during compaction.
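A sketch of the lazy-expiry read path described above, with the expiry timestamp stored inline next to the value; the plain dict stands in for the storage engine:

```python
# Lazy TTL expiry on read; compaction sweeps whatever reads never touch.
import time

def get_with_ttl(store: dict, key: str):
    entry = store.get(key)
    if entry is None:
        return None
    value, expires_at = entry
    if expires_at is not None and time.time() >= expires_at:
        del store[key]   # lazily delete the expired entry on this read
        return None
    return value

store = {"session:9": (b"token", time.time() + 30)}   # 30-second TTL
print(get_with_ttl(store, "session:9"))               # b'token' while fresh
```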
API Design
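The functional requirements above imply a small client-facing surface. A hedged sketch of that interface follows; names and signatures are illustrative, not a published API, with consistency level and TTL riding along as per-call options:

```python
# Illustrative client API covering the operations from the requirements.
from typing import Optional, Protocol

class KVStore(Protocol):
    def get(self, key: bytes, consistency: str = "eventual") -> Optional[bytes]: ...
    def put(self, key: bytes, value: bytes,
            ttl_seconds: Optional[int] = None,
            consistency: str = "eventual") -> None: ...
    def delete(self, key: bytes) -> None: ...
    def compare_and_swap(self, key: bytes, expected: Optional[bytes],
                         new: bytes) -> bool: ...   # conditional write
    def batch_get(self, keys: list[bytes]) -> dict[bytes, Optional[bytes]]: ...
    def batch_put(self, items: dict[bytes, bytes]) -> None: ...
```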
Scaling & Bottlenecks
Compaction is the primary source of write amplification. LSM-tree write amplification can reach 10–30x: a single user write is rewritten by compaction merges at multiple levels before the data settles in the lowest level, so the disk sees 10–30x more bytes written than user-visible writes. High write amplification also wears SSDs out faster. Mitigations: (1) a tiered compaction policy instead of leveled (lower write amplification at the cost of higher read amplification); (2) larger memtables (fewer, bigger flushes); (3) higher-endurance SSDs (TLC over QLC NAND). Read amplification is bounded by the number of levels, and bloom filters keep it to typically 1–3 disk reads per GET.
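A rough model of leveled-compaction write amplification, assuming the ~10x level fan-out from the storage-engine section; this is a common approximation, not an exact accounting:

```python
# Leveled compaction rewrites a byte roughly `fanout` times per level
# it crosses, so amplification grows with fan-out and tree depth.
fanout = 10          # each level ~10x larger than the one above it
deep_levels = 3      # levels the data crosses below L0
write_amp = 1 + fanout * deep_levels   # 1 flush + compaction rewrites
print(write_amp)     # 31x: the top of the 10-30x range cited above
```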
Hot key problem: in AP mode, all writes for a hot key route to the same 3 preference list nodes. These 3 nodes receive disproportionate load. Shard splitting (treating a hot key as multiple virtual keys: key#0 through key#N distributed across different ring positions) handles write-heavy hot keys. For read-heavy hot keys, adding dedicated read replicas or application-side caching (local LRU) absorbs the load. Adaptive replication (automatically increasing replica count for hot keys, from 3 to 5+) is a more elegant but complex solution.
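A sketch of shard splitting for a write-heavy hot key, with an assumed shard count and key-salting scheme; reads must fan out to every shard and merge application-side:

```python
# Hot-key salting: spread one logical key across SHARDS ring positions.
import random

SHARDS = 8

def salted_key(key: str) -> str:
    # Each write picks a random shard, spreading load over 8 preference lists.
    return f"{key}#{random.randrange(SHARDS)}"

def all_shards(key: str) -> list[str]:
    # Reads fan out to every shard and merge (e.g., sum per-shard counters).
    return [f"{key}#{i}" for i in range(SHARDS)]

print(salted_key("views:homepage"))   # e.g. views:homepage#3
print(all_shards("views:homepage"))   # views:homepage#0 ... #7
```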
Key Trade-offs
- AP vs. CP: Dynamo-style AP stores are always writable but risk stale reads and require client-side conflict resolution; Raft-based CP stores guarantee linearizability but cannot accept writes on the minority side of a network partition
- Leaderless vs. leader-based replication: Leaderless replication (Dynamo) spreads write load across replicas but requires quorum coordination and conflict resolution; leader-based replication (Raft) simplifies consistency but funnels all writes through a single leader
- LSM-tree vs. B-tree: LSM-trees have higher write throughput (sequential appends, deferred compaction) but slower reads and higher write amplification; B-trees have predictable read performance but slower writes (in-place random I/O)
- Synchronous vs. asynchronous WAL: Synchronous fsync per write ensures zero data loss on crash but adds 1–5 ms per write; an async WAL (batch fsync every 100 ms) improves throughput ~10x at the risk of losing up to 100 ms of writes on a hard crash (see the group-commit sketch below)
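For that last trade-off, a minimal group-commit sketch: appends land in the OS page cache immediately and a background loop issues one fsync per interval; the class name and log path are illustrative:

```python
# Async WAL via group commit: one fsync amortized over a 100 ms batch.
import os
import threading
import time

class BatchedWAL:
    def __init__(self, path: str, interval: float = 0.1):
        self.fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND)
        self.interval = interval   # the loss window on a hard crash
        threading.Thread(target=self._sync_loop, daemon=True).start()

    def append(self, record: bytes) -> None:
        os.write(self.fd, record + b"\n")   # reaches the OS page cache only

    def _sync_loop(self) -> None:
        while True:
            time.sleep(self.interval)
            os.fsync(self.fd)   # durability point for the whole batch

wal = BatchedWAL("/tmp/kv.wal")
wal.append(b"PUT user:1 v1")
time.sleep(0.15)   # allow one sync cycle before the process exits
```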