System Design: Distributed Configuration Management

Design a distributed configuration management system like etcd or Consul that provides strongly consistent key-value storage with watch semantics for dynamic service configuration at scale.

Requirements

Functional Requirements:

  • Store and retrieve configuration key-value pairs with versioning
  • Watch for changes to specific keys or key prefixes and receive real-time notifications
  • Atomic compare-and-swap (CAS) operations for optimistic concurrency control (see the sketch after this list)
  • Organize configs in a hierarchical namespace: /env/service/key
  • Support leases (TTL-based expiry) for ephemeral config and leader election primitives
  • Provide transactions: multi-key read-write operations that succeed or fail atomically
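
To make the CAS and transaction requirements concrete, here is roughly what a compare-and-swap write looks like against etcd's v3 Go client (clientv3); the key and values are illustrative:

```go
package main

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"localhost:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()
	ctx := context.Background()

	// Read the current value and remember the revision we saw.
	resp, err := cli.Get(ctx, "/prod/payments/max-retries")
	if err != nil || len(resp.Kvs) == 0 {
		panic("key missing")
	}
	rev := resp.Kvs[0].ModRevision

	// CAS: the Put applies only if nobody modified the key since our read;
	// otherwise the Else branch returns the current value for a retry.
	txn, err := cli.Txn(ctx).
		If(clientv3.Compare(clientv3.ModRevision("/prod/payments/max-retries"), "=", rev)).
		Then(clientv3.OpPut("/prod/payments/max-retries", "5")).
		Else(clientv3.OpGet("/prod/payments/max-retries")).
		Commit()
	if err != nil {
		panic(err)
	}
	fmt.Println("CAS succeeded:", txn.Succeeded)
}
```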

Non-Functional Requirements:

  • Strong consistency: all reads reflect the most recent committed write (linearizable)
  • 99.999% availability with a 3-node or 5-node cluster tolerating (n-1)/2 node failures (1 of 3 nodes, 2 of 5)
  • Read latency under 5ms p99; write latency under 20ms p99 (Raft consensus round trip)
  • Support 10,000 clients with active watches, receiving notifications within 100ms of a change

Scale Estimation

A large deployment: 10,000 microservice instances × 5 config keys watched each = 50,000 active watches. On a config change, 50,000 watch notifications must be sent within 100ms — a push fan-out of 50,000 gRPC stream writes. Config data volume is small: 100,000 keys × 1 KB average = 100 MB total, easily fitting in memory on each node. Write rate: 100 config changes/sec (high for a config system). Read rate: 5,000/sec from services fetching config at startup and on watch miss.
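
The same arithmetic, restated as a quick sanity check (the constants mirror the estimate above):

```go
package main

import "fmt"

func main() {
	const (
		instances     = 10_000  // microservice instances
		watchedPer    = 5       // config keys watched per instance
		totalKeys     = 100_000 // stored config keys
		avgValueBytes = 1 << 10 // 1 KB average value
	)
	fmt.Println("active watches:", instances*watchedPer)            // 50,000
	fmt.Println("fan-out per change:", instances*watchedPer)        // worst case: every watcher matches
	fmt.Println("data volume MB:", totalKeys*avgValueBytes/(1<<20)) // ~97 MB, call it 100 MB
}
```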

High-Level Architecture

The system is a replicated state machine built on the Raft consensus algorithm. A cluster of 3 or 5 nodes elects a leader; all writes go through the leader, which replicates log entries to followers before committing. Once a quorum (majority) acknowledges the entry, it's committed and applied to the state machine (the in-memory B-tree of key-value pairs). This ensures all nodes eventually have identical state, and any majority sub-cluster can continue operating.

Clients connect to any node but are redirected to the leader for writes. Reads can be served by any node with stale-read semantics, or by the leader with linearizable semantics (the default). To avoid stale reads from followers, etcd uses a ReadIndex mechanism: before serving a read, the leader confirms it's still the leader by checking that a majority of followers are still reachable, then applies the read at the confirmed commit index.
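
A minimal sketch of that ReadIndex flow on the leader; the Node type, its fields, and the quorum-heartbeat hook are illustrative stand-ins for real Raft plumbing:

```go
package readindex

import (
	"context"
	"errors"
)

var (
	ErrNotLeader      = errors.New("not leader")
	ErrLeadershipLost = errors.New("leadership lost")
)

// Node carries just enough state for the read path.
type Node struct {
	leader      bool
	commitIndex uint64
	applied     chan uint64 // receives the state machine's applied index
	store       map[string][]byte
	heartbeatOK func(context.Context) bool // quorum heartbeat round-trip
}

// LinearizableGet serves a read without writing to the Raft log.
func (n *Node) LinearizableGet(ctx context.Context, key string) ([]byte, error) {
	if !n.leader {
		return nil, ErrNotLeader // client retries against the current leader
	}
	readIndex := n.commitIndex // capture the commit index at read time

	// Confirm leadership: a quorum must still acknowledge this node, or a
	// newer leader may have committed writes this node hasn't seen.
	if !n.heartbeatOK(ctx) {
		return nil, ErrLeadershipLost
	}

	// Wait until the state machine has applied up to readIndex; the read
	// then reflects every write committed before it began.
	for idx := range n.applied {
		if idx >= readIndex {
			break
		}
	}
	return n.store[key], nil
}
```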

The watch mechanism is implemented as a long-lived gRPC stream from client to server. The server maintains a watch registry: a map from key/prefix to list of active watchers. When a write commits, the server scans the watch registry and sends events to all matching watchers via their gRPC streams. This fan-out is O(watchers_per_key) per write — for popular config keys, this can be large.
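
A toy version of the watch registry; buffered channels stand in for gRPC streams, and where this sketch scans every registered prefix, a production system would index them (etcd uses an interval tree):

```go
package watch

import (
	"strings"
	"sync"
)

// Event is a committed change delivered to watchers.
type Event struct {
	Key, Value string
	Revision   int64
}

// Registry maps watched prefixes to subscriber channels.
type Registry struct {
	mu       sync.RWMutex
	watchers map[string][]chan Event // prefix -> subscribers
}

func NewRegistry() *Registry {
	return &Registry{watchers: make(map[string][]chan Event)}
}

// Watch registers interest in a key prefix and returns an event channel.
func (r *Registry) Watch(prefix string) <-chan Event {
	ch := make(chan Event, 64)
	r.mu.Lock()
	r.watchers[prefix] = append(r.watchers[prefix], ch)
	r.mu.Unlock()
	return ch
}

// Notify fans a committed write out to every matching watcher. This is the
// O(watchers_per_key) step discussed above.
func (r *Registry) Notify(ev Event) {
	r.mu.RLock()
	defer r.mu.RUnlock()
	for prefix, subs := range r.watchers {
		if !strings.HasPrefix(ev.Key, prefix) {
			continue
		}
		for _, ch := range subs {
			select {
			case ch <- ev: // deliver
			default: // buffer full: drop here; a real system cancels the watch
			}
		}
	}
}
```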

Core Components

Raft Consensus Engine

Raft divides time into terms. Each term begins with a leader election: if a follower doesn't hear from the leader within an election timeout (150-300ms, randomized), it increments its term and sends RequestVote RPCs. A node grants its vote to the first candidate that asks, provided the candidate's log is at least as up-to-date as its own. Once a candidate gets votes from a majority, it becomes leader and sends AppendEntries heartbeats. The leader appends client writes as log entries and sends them to followers; once a majority acknowledges, the entry is committed. Log entries are persisted to disk (WAL — write-ahead log) before being acknowledged, ensuring durability.
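
A sketch of the follower side of that loop, with vote counting, log comparison, and RPC plumbing elided; the randomized timeout is the load-bearing detail:

```go
package election

import (
	"math/rand"
	"time"
)

// Node holds just enough state for the election-timeout loop.
type Node struct {
	id        int
	term      uint64
	votedFor  int
	heartbeat chan struct{} // signaled on AppendEntries from the leader
}

func (n *Node) requestVotes() { /* send RequestVote RPCs to peers (elided) */ }

// runFollower waits for heartbeats; on timeout it becomes a candidate.
func (n *Node) runFollower() {
	for {
		// Randomized timeout in [150ms, 300ms) keeps candidates from
		// repeatedly splitting the vote.
		timeout := 150*time.Millisecond + time.Duration(rand.Intn(150))*time.Millisecond
		select {
		case <-n.heartbeat:
			// Leader is alive; loop and wait again.
		case <-time.After(timeout):
			n.term++            // start a new term
			n.votedFor = n.id   // vote for ourselves
			go n.requestVotes() // ask peers for votes
			return              // now a candidate
		}
	}
}
```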

MVCC Storage Engine

The storage layer uses multi-version concurrency control (MVCC). Every key has a revision history: each write creates a new revision rather than overwriting. This enables: (1) watch delivery from a specific revision (catching up on missed events after a reconnect), (2) transactions that check a key's current version before writing (optimistic locking), and (3) snapshot reads at a consistent point in time. Keys are stored in a B-tree indexed by (key, revision). Old revisions are compacted periodically: compaction keeps only the most recent revisions per key, either a fixed count or everything newer than a configured age.
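
A toy MVCC store showing the revision mechanics; an append-only slice stands in for the (key, revision) B-tree:

```go
package mvcc

// entry is one revision of one key.
type entry struct {
	key   string
	rev   int64
	value string
}

// Store appends a new revision on every Put instead of overwriting.
type Store struct {
	rev     int64
	entries []entry // always sorted by revision
}

// Put records a new revision of key and returns the store-wide revision.
func (s *Store) Put(key, value string) int64 {
	s.rev++
	s.entries = append(s.entries, entry{key, s.rev, value})
	return s.rev
}

// GetAt returns key's value as of revision rev: the newest revision of key
// that is <= rev. Passing the current revision reads the latest value; an
// older revision gives a consistent snapshot read.
func (s *Store) GetAt(key string, rev int64) (string, bool) {
	for i := len(s.entries) - 1; i >= 0; i-- {
		e := s.entries[i]
		if e.key == key && e.rev <= rev {
			return e.value, true
		}
	}
	return "", false
}

// Compact drops revisions at or below rev, keeping each key's newest
// surviving revision so reads at the compaction point still succeed.
func (s *Store) Compact(rev int64) {
	latest := map[string]int64{}
	for _, e := range s.entries {
		if e.rev <= rev && e.rev > latest[e.key] {
			latest[e.key] = e.rev
		}
	}
	kept := s.entries[:0]
	for _, e := range s.entries {
		if e.rev > rev || e.rev == latest[e.key] {
			kept = append(kept, e)
		}
	}
	s.entries = kept
}
```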

Lease Manager

Leases are TTL-based handles attached to keys. A key with a lease expires when the lease TTL elapses, unless the lease is renewed; clients renew by sending KeepAlive RPCs to the leader. If a client crashes and stops renewing, the lease expires and all attached keys are automatically deleted. This enables two patterns. Distributed leader election: the first service to create a key under a lease wins the election; when that service crashes, the key expires and another service can claim leadership. Ephemeral service registration: services register their address under a 10-second lease and renew every 3 seconds, so a crashed service's registration disappears within 10 seconds.
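
A sketch of the lease table on the leader, assuming a deletion hook into the KV store; the scan interval and field layout are illustrative:

```go
package lease

import (
	"sync"
	"time"
)

// Lease ties a set of keys to a TTL; if KeepAlive isn't called before the
// deadline, every attached key is deleted.
type Lease struct {
	ttl      time.Duration
	deadline time.Time
	keys     []string
}

// Manager is a toy lease table; a ticker stands in for the expiry scan.
type Manager struct {
	mu     sync.Mutex
	leases map[int64]*Lease
	nextID int64
	del    func(key string) // deletion hook into the KV store
}

func NewManager(del func(string)) *Manager {
	m := &Manager{leases: map[int64]*Lease{}, del: del}
	go m.expireLoop()
	return m
}

// Grant creates a lease with the given TTL and returns its ID.
func (m *Manager) Grant(ttl time.Duration) int64 {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.nextID++
	m.leases[m.nextID] = &Lease{ttl: ttl, deadline: time.Now().Add(ttl)}
	return m.nextID
}

// Attach binds a key to a lease; the key dies with the lease.
func (m *Manager) Attach(id int64, key string) {
	m.mu.Lock()
	defer m.mu.Unlock()
	if l, ok := m.leases[id]; ok {
		l.keys = append(l.keys, key)
	}
}

// KeepAlive pushes the deadline out by one TTL (the client's renewal RPC).
func (m *Manager) KeepAlive(id int64) {
	m.mu.Lock()
	defer m.mu.Unlock()
	if l, ok := m.leases[id]; ok {
		l.deadline = time.Now().Add(l.ttl)
	}
}

// expireLoop deletes attached keys once a lease's deadline passes.
func (m *Manager) expireLoop() {
	for now := range time.Tick(time.Second) {
		m.mu.Lock()
		for id, l := range m.leases {
			if now.After(l.deadline) {
				for _, k := range l.keys {
					m.del(k)
				}
				delete(m.leases, id)
			}
		}
		m.mu.Unlock()
	}
}
```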

Database Design

The storage engine is entirely in-memory (B-tree) with a write-ahead log (WAL) for durability. On startup, the node replays the WAL to reconstruct the in-memory state, or loads from the most recent snapshot (a serialized B-tree checkpoint) and replays only the WAL entries after the snapshot. Snapshots are taken when the WAL grows beyond a threshold (e.g., 10,000 entries). The snapshot + WAL approach bounds startup recovery time regardless of total write history.
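
A sketch of that recovery path with simplified types; only entries past the snapshot's index replay, which is what bounds startup time:

```go
package recovery

// Entry is one committed WAL record.
type Entry struct {
	Index uint64
	Key   string
	Value string
}

// Store is the in-memory state machine (a map stands in for the B-tree).
type Store struct {
	data      map[string]string
	lastIndex uint64
}

func (s *Store) Apply(e Entry) {
	s.data[e.Key] = e.Value
	s.lastIndex = e.Index
}

// Recover rebuilds state from the newest snapshot plus the WAL tail.
// snapshot carries the state as of snapIndex; older WAL entries are skipped.
func Recover(snapshot *Store, snapIndex uint64, wal []Entry) *Store {
	if snapshot == nil {
		snapshot = &Store{data: map[string]string{}}
	}
	for _, e := range wal {
		if e.Index <= snapIndex {
			continue // already reflected in the snapshot
		}
		snapshot.Apply(e)
	}
	return snapshot
}
```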

Metadata (cluster membership, term, vote) is persisted separately so it survives restarts. The backend storage is pluggable: etcd keeps its durable keyspace in bbolt (a Go B-tree key-value store with ACID transactions), while the WAL and snapshots use their own append-only file formats.

API Design
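
The client-facing API maps directly onto the functional requirements: reads at an optional revision, versioned writes, watches that resume from a revision, multi-key transactions, and lease operations. As a sketch, here it is as a Go interface, with names loosely following etcd's v3 gRPC services (KV, Watch, Lease); the exact signatures are illustrative:

```go
package api

import (
	"context"
	"time"
)

// Event, Compare, and Op are placeholders for the wire types.
type (
	Event   struct{ Key, Value string; Rev int64 }
	Compare struct{} // a guard such as "ModRevision(key) == n"
	Op      struct{} // a Put, Get, or Delete executed inside a Txn
)

// ConfigStore is the client-facing surface.
type ConfigStore interface {
	// Get returns a key's value and mod revision; rev = 0 reads the latest,
	// rev > 0 is a snapshot read at that revision.
	Get(ctx context.Context, key string, rev int64) (value []byte, modRev int64, err error)

	// Put writes a value, optionally bound to a lease, and returns the new
	// store-wide revision.
	Put(ctx context.Context, key string, value []byte, leaseID int64) (rev int64, err error)

	// Delete removes a key or everything under a prefix.
	Delete(ctx context.Context, keyOrPrefix string) error

	// Txn applies the "then" ops atomically if every compare holds,
	// otherwise the "else" ops; this is the CAS primitive generalized.
	Txn(ctx context.Context, cmps []Compare, then, els []Op) (succeeded bool, err error)

	// Watch streams events for a key or prefix starting at fromRev, so a
	// reconnecting client can catch up on missed changes.
	Watch(ctx context.Context, keyOrPrefix string, fromRev int64) (<-chan Event, error)

	// LeaseGrant creates a TTL lease; LeaseKeepAlive renews it.
	LeaseGrant(ctx context.Context, ttl time.Duration) (leaseID int64, err error)
	LeaseKeepAlive(ctx context.Context, leaseID int64) error
}
```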

Scaling & Bottlenecks

Raft serializes all writes through the leader — horizontal write scaling is impossible in a single cluster. Write throughput is bounded by: leader CPU, network RTT to followers, and fsync latency (WAL writes). Mitigation: batch writes (group multiple client writes into a single Raft log entry); use fast SSDs for WAL (fsync latency dominates); keep cluster size at 3 or 5 nodes (adding more followers increases quorum size and write latency).
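
A sketch of the write-batching loop on the leader; maxBatch and maxWait are illustrative knobs trading a couple of milliseconds of latency for amortized fsync and replication cost:

```go
package batching

import "time"

// Proposal is one client write waiting to enter the Raft log.
type Proposal struct{ Key, Value string }

// batchLoop groups proposals so one Raft entry (one fsync, one replication
// round) carries many client writes.
func batchLoop(proposals <-chan Proposal, propose func([]Proposal)) {
	const (
		maxBatch = 128
		maxWait  = 2 * time.Millisecond
	)
	for p := range proposals {
		batch := []Proposal{p}
		deadline := time.After(maxWait)
	fill:
		for len(batch) < maxBatch {
			select {
			case next, ok := <-proposals:
				if !ok {
					break fill
				}
				batch = append(batch, next)
			case <-deadline:
				break fill
			}
		}
		propose(batch) // one log entry; the leader fsyncs and replicates once
	}
}
```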

Watch fan-out becomes a bottleneck when many clients watch a single frequently-changing key: the leader must send N watch events for each write, where N is the watcher count. Mitigation: use a dedicated watch gRPC server that pulls from the Raft commit stream and fans out asynchronously, decoupling write commit latency from watch delivery latency. Watch events can also be batched: send multiple events in a single gRPC stream write to reduce syscall overhead.
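
A sketch of that decoupling: the Raft apply loop only enqueues onto the commit channel, and a dispatcher goroutine drains whatever is already queued so a burst of commits becomes one stream write (sizes illustrative):

```go
package watchfan

// Event is a committed change bound for watchers.
type Event struct {
	Key, Value string
	Rev        int64
}

// dispatcher runs off the commit path, batching consecutive events into a
// single stream write.
func dispatcher(commits <-chan Event, send func([]Event)) {
	for ev := range commits {
		batch := []Event{ev}
	drain:
		for len(batch) < 256 {
			select {
			case next, ok := <-commits:
				if !ok {
					break drain
				}
				batch = append(batch, next)
			default:
				break drain // queue momentarily empty: flush what we have
			}
		}
		send(batch) // one gRPC stream write carrying many events
	}
}
```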

Key Trade-offs

  • Linearizable vs. serializable reads: Linearizable reads (etcd's default) guarantee seeing the latest write but require confirming leader status on each read (adds latency); serializable reads (follower reads) are faster but may return stale data
  • 3-node vs. 5-node cluster: 3-node tolerates 1 failure and has lower write latency (quorum = 2); 5-node tolerates 2 simultaneous failures but adds write latency (quorum = 3) and cost
  • Compaction aggressiveness: Aggressive compaction frees memory but loses history, breaking watchers that were disconnected for too long; conservative compaction preserves history at the cost of memory growth
  • Lease TTL length: Short TTLs (1-2s) detect failures quickly but require frequent KeepAlive traffic and risk false expiry during network hiccups; long TTLs (30s) are more stable but mean slow failure detection
