
System Design: Distributed Lock Service

Design a distributed lock service that provides mutual exclusion across processes using Redlock, ZooKeeper, or lease-based approaches, with fencing tokens to handle process pauses and clock skew.

16 min read · Updated Jan 15, 2025
Tags: system-design, distributed-lock, redlock, zookeeper, infrastructure

Requirements

Functional Requirements:

  • Acquire an exclusive lock on a named resource across multiple distributed processes
  • Locks expire automatically (TTL-based) if the holder crashes without releasing
  • Provide lock renewal (keepalive) to extend TTL for long-running operations
  • Return a fencing token (monotonically increasing number) with each lock acquisition for safe conditional writes
  • Support tryLock (non-blocking, returns immediately) and lock (blocking with timeout)
  • Fair locking: processes waiting for a lock acquire it in FIFO order

Non-Functional Requirements:

  • Lock acquisition latency under 10ms p99
  • Safety: at most one process holds a given lock at any time (mutual exclusion)
  • Liveness: a failed lock holder's lock expires within TTL + clock skew seconds
  • Survive failure of up to ⌊(n-1)/2⌋ of n lock-service nodes without losing lock safety

Scale Estimation

A large deployment: 10,000 services, each acquiring ~10 locks/sec = 100,000 lock operations/sec. Most locks are uncontended (the same process acquires the same lock repeatedly with no competition) — these are fast single-Redis operations. Contended locks (multiple processes competing) require a queue: at 100 waiting processes per lock, each acquire triggers a notification to the next waiter. Total active locks: ~50,000 at any time. Lock metadata (name, TTL, token, holder): ~200 bytes × 50,000 = 10 MB in Redis.

High-Level Architecture

Three approaches, each with different safety guarantees:

Single Redis node: Use SET key value NX PX ttl_ms (SET if Not eXists with expiry). Simple, fast (<1ms), but a single point of failure — Redis crash loses all lock state. Not suitable for strong safety guarantees.
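
A minimal sketch of the single-node variant, assuming the go-redis v9 client; the lock key prefix and the use of a random holder value are illustrative conventions, not part of Redis itself:

```go
package lock

import (
	"context"
	"time"

	"github.com/google/uuid"
	"github.com/redis/go-redis/v9"
)

// TryLock issues SET key value NX PX ttl against a single Redis node.
// The random holder value identifies the owner so only the owner can release.
func TryLock(ctx context.Context, rdb *redis.Client, name string, ttl time.Duration) (holder string, ok bool, err error) {
	holder = uuid.NewString()
	ok, err = rdb.SetNX(ctx, "lock:"+name, holder, ttl).Result()
	return holder, ok, err
}

// unlockScript deletes the key only if this holder still owns it; the Lua
// script makes the check-and-delete atomic on the Redis server.
var unlockScript = redis.NewScript(`
if redis.call("GET", KEYS[1]) == ARGV[1] then
  return redis.call("DEL", KEYS[1])
end
return 0`)

func Unlock(ctx context.Context, rdb *redis.Client, name, holder string) error {
	return unlockScript.Run(ctx, rdb, []string{"lock:" + name}, holder).Err()
}
```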

Redlock (Redis Cluster): Acquire lock on N independent Redis nodes (no replication between them). A lock is acquired if and only if a majority (⌊N/2⌋+1) of nodes are successfully locked within a validity window (total TTL minus acquisition time). Release requires releasing on all nodes. Redlock provides safety against single-node failures but has known weaknesses: if a process pauses (GC, VM migration) after acquiring the majority but before using the lock, the TTL may expire and another process can acquire — fencing tokens are essential to handle this.
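
A compressed sketch of the acquisition step, continuing the package from the previous sketch (it reuses its unlockScript helper) and ignoring retries and backoff; the validity-window check after counting the majority is the subtle part:

```go
// AcquireRedlock tries to take the same lock on every node and succeeds only
// if a majority responded OK and the lock is still valid afterwards (TTL
// minus the time spent acquiring, minus a small clock-drift allowance).
func AcquireRedlock(ctx context.Context, nodes []*redis.Client, name, holder string, ttl time.Duration) bool {
	start := time.Now()
	acquired := 0
	for _, n := range nodes {
		ok, err := n.SetNX(ctx, "lock:"+name, holder, ttl).Result()
		if err == nil && ok {
			acquired++
		}
	}
	drift := time.Duration(float64(ttl)*0.01) + 2*time.Millisecond
	validity := ttl - time.Since(start) - drift
	if acquired >= len(nodes)/2+1 && validity > 0 {
		return true
	}
	// Failed: best-effort release on all nodes so partial locks don't linger.
	for _, n := range nodes {
		unlockScript.Run(ctx, n, []string{"lock:" + name}, holder)
	}
	return false
}
```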

etcd/ZooKeeper: Create an ephemeral node (ZooKeeper) or a key with a lease (etcd). The lock holder maintains the session/lease via heartbeats. Session expiry (if the holder crashes) automatically deletes the node/key, releasing the lock. ZooKeeper's sequential ephemeral nodes implement fair queuing: each waiter creates a node with a sequence number and watches the node immediately before it — when that node is deleted (previous holder released), the watcher acquires the lock. This avoids thundering herd.
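
A sketch of the classic ZooKeeper lock recipe using the go-zookeeper/zk client (the /locks/{name} path layout and node prefix are illustrative); because each waiter watches only its predecessor, a release wakes exactly one process:

```go
package zklock

import (
	"sort"
	"strings"

	"github.com/go-zookeeper/zk"
)

// Acquire creates a sequential ephemeral znode and blocks until it is the
// lowest-numbered child, watching only the node immediately before it.
func Acquire(conn *zk.Conn, lockPath string) (string, error) {
	me, err := conn.Create(lockPath+"/waiter-", nil,
		zk.FlagEphemeral|zk.FlagSequence, zk.WorldACL(zk.PermAll))
	if err != nil {
		return "", err
	}
	for {
		children, _, err := conn.Children(lockPath)
		if err != nil {
			return "", err
		}
		sort.Strings(children)
		myName := strings.TrimPrefix(me, lockPath+"/")
		idx := sort.SearchStrings(children, myName)
		if idx == 0 {
			return me, nil // lowest sequence number: we hold the lock
		}
		// Watch the predecessor; when its znode is deleted, re-check.
		prev := lockPath + "/" + children[idx-1]
		exists, _, ch, err := conn.ExistsW(prev)
		if err != nil {
			return "", err
		}
		if exists {
			<-ch // blocks until the predecessor changes (typically: is deleted)
		}
	}
}
```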

Core Components

Fencing Token Generator

Each lock acquisition returns a fencing token — a monotonically increasing integer (the etcd revision or ZooKeeper zxid at acquisition time). The token is passed with every write operation to the protected resource. The resource's storage layer (database, file system, distributed storage) rejects writes with a token lower than the last seen token. This ensures that a slow process that resumes after its lock expired cannot corrupt state — its stale token is rejected even if it doesn't know its lock has expired. Without fencing tokens, a process pause followed by lock expiry and re-acquisition by another process leads to both processes thinking they hold the lock simultaneously.
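
A sketch of what the storage side of fencing can look like, assuming the protected resource is a SQL table with a last_token column (the schema, column names, and Postgres-style placeholders are illustrative assumptions):

```go
package fencing

import (
	"context"
	"database/sql"
	"errors"
)

// WriteWithFence applies an update only if our fencing token is at least as
// high as the highest token the row has seen, and records the token. A
// process whose lock expired (and was re-acquired with a higher token) gets
// zero rows affected and must abort instead of corrupting state.
func WriteWithFence(ctx context.Context, db *sql.DB, id, value string, token int64) error {
	res, err := db.ExecContext(ctx,
		`UPDATE resources
		    SET value = $1, last_token = $2
		  WHERE id = $3 AND last_token <= $2`,
		value, token, id)
	if err != nil {
		return err
	}
	n, err := res.RowsAffected()
	if err != nil {
		return err
	}
	if n == 0 {
		return errors.New("stale fencing token: lock was lost, aborting write")
	}
	return nil
}
```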

Lease Manager (etcd-based)

Implemented on etcd leases. Lock acquisition: write key /locks/{name} with value {holder_id, token} using a transaction: txn(compare: key doesn't exist, success: put key with lease_id, failure: get current holder). The etcd transaction is atomic — exactly one writer succeeds. The winner must renew the lease via KeepAlive RPCs before TTL expiry. If the winner crashes, the lease expires and the key is automatically deleted. Waiters watch the key: when it's deleted, all waiters retry the transaction simultaneously — only one wins. For fair ordering, waiters use a separate queue key with sequential writes.
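
A sketch of that acquisition transaction with the etcd clientv3 Go client; the key layout follows the description above, and error handling and lease cleanup are trimmed:

```go
package etcdlock

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// Acquire tries to create /locks/{name} under a fresh lease. Exactly one
// writer wins the transaction; the winner's fencing token is the revision at
// which the put happened, which etcd increments monotonically.
func Acquire(ctx context.Context, cli *clientv3.Client, name, holderID string, ttl time.Duration) (clientv3.LeaseID, int64, error) {
	lease, err := cli.Grant(ctx, int64(ttl.Seconds()))
	if err != nil {
		return 0, 0, err
	}
	key := "/locks/" + name
	resp, err := cli.Txn(ctx).
		If(clientv3.Compare(clientv3.CreateRevision(key), "=", 0)). // key does not exist
		Then(clientv3.OpPut(key, holderID, clientv3.WithLease(lease.ID))).
		Else(clientv3.OpGet(key)). // lost the race: read the current holder
		Commit()
	if err != nil {
		return 0, 0, err
	}
	if !resp.Succeeded {
		return 0, 0, fmt.Errorf("lock %q already held", name)
	}
	// Renew the lease in the background; if this process dies, the lease
	// expires and etcd deletes the key, releasing the lock automatically.
	// (Real code must also consume the returned keepalive channel.)
	if _, err := cli.KeepAlive(context.Background(), lease.ID); err != nil {
		return 0, 0, err
	}
	return lease.ID, resp.Header.Revision, nil
}
```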

Deadlock Detection

Lock-wait graphs detect deadlocks: process A holds lock X and waits for lock Y; process B holds lock Y and waits for lock X — a cycle. A background detector periodically builds the wait graph from active lock state and finds cycles. On cycle detection: the process with the lowest-priority or newest lock acquisition is chosen as the victim and its lock is forcibly released (by deleting the key in etcd). The victim's next KeepAlive fails, signaling that its lock was revoked. Deadlock detection adds overhead and is only practical in systems where deadlocks are possible (multiple locks per operation).
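
A sketch of the cycle check over a wait-for graph built from active lock state (how the graph is assembled from etcd keys is assumed); it is a standard depth-first search with a recursion stack:

```go
// hasCycle reports whether the wait-for graph contains a cycle, i.e. a
// deadlock. waitsFor maps a process ID to the IDs of the lock holders it is
// currently blocked behind.
func hasCycle(waitsFor map[string][]string) bool {
	const (
		unvisited = 0
		inStack   = 1
		done      = 2
	)
	state := make(map[string]int)
	var visit func(p string) bool
	visit = func(p string) bool {
		state[p] = inStack
		for _, next := range waitsFor[p] {
			switch state[next] {
			case inStack:
				return true // back edge: cycle found
			case unvisited:
				if visit(next) {
					return true
				}
			}
		}
		state[p] = done
		return false
	}
	for p := range waitsFor {
		if state[p] == unvisited && visit(p) {
			return true
		}
	}
	return false
}
```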

Database Design

For the etcd-based implementation, lock state is stored as etcd keys with leases. Each lock key stores: holder_id (which process holds the lock), acquired_at timestamp, fencing_token (the etcd revision at acquisition), and lock metadata (resource name, TTL). Lease IDs are bound to the holder's etcd session; session expiry cascades to lease expiry and key deletion.

For audit and debugging, lock acquisition and release events are written to an append-only log (separate from the lock state). This enables diagnosis of contention hot spots, average hold times, and deadlock frequencies — valuable for capacity planning and identifying algorithmic inefficiencies in client code.

API Design
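
The functional requirements above imply a small client-facing surface: blocking and non-blocking acquisition, renewal, release, and a fencing token returned with every successful acquire. One possible Go interface, sketched with illustrative names rather than a published API:

```go
package lockapi

import (
	"context"
	"time"
)

// Lock describes a held lock as the client sees it.
type Lock struct {
	Name         string
	HolderID     string
	FencingToken int64     // monotonically increasing; pass with every protected write
	ExpiresAt    time.Time // TTL deadline unless renewed
}

type LockService interface {
	// Lock blocks until the lock is acquired or the timeout elapses.
	Lock(ctx context.Context, name string, ttl, timeout time.Duration) (*Lock, error)
	// TryLock returns immediately; ok is false if the lock is held elsewhere.
	TryLock(ctx context.Context, name string, ttl time.Duration) (l *Lock, ok bool, err error)
	// Renew extends the TTL of a held lock (keepalive for long-running operations).
	Renew(ctx context.Context, l *Lock, ttl time.Duration) error
	// Unlock releases the lock; it is a no-op if the lock has already expired.
	Unlock(ctx context.Context, l *Lock) error
}
```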

Scaling & Bottlenecks

Hot locks (one resource locked thousands of times per second) bottleneck on the single serialization point — all acquires for a given lock serialize. Mitigation: partition the protected resource (instead of one global lock, use per-shard locks) to reduce contention. Read-write locks (allow multiple readers, one writer) increase concurrency for read-heavy workloads. Optimistic locking (compare-and-swap rather than mutual exclusion) eliminates locks entirely when conflicts are rare.
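
A tiny illustration of the partitioning mitigation: derive the lock name from the shard of the item being protected rather than using one global name (the shard count of 64 is an arbitrary example):

```go
package lockname

import (
	"fmt"
	"hash/fnv"
)

// shardedLockName spreads contention for one logical resource across 64
// locks by hashing the item being protected into a shard suffix.
func shardedLockName(resource, itemID string) string {
	h := fnv.New32a()
	h.Write([]byte(itemID))
	return fmt.Sprintf("%s/shard-%d", resource, h.Sum32()%64)
}
```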

For the Redlock approach, acquiring locks on 5 independent Redis nodes adds latency proportional to the slowest node contacted (typically 5-15ms). The algorithm requires a majority of the acquires (at least 3 of 5) to complete within the validity window (TTL minus acquisition time). If Redis nodes are slow, acquisitions fail and must be retried with exponential backoff; under high contention, retry storms can develop. Mitigation: use a consensus-based approach (etcd), where clients send lock operations to a single leader and majority replication happens inside the consensus layer, rather than requiring each client to lock a majority of independent nodes itself.

Key Trade-offs

  • Redlock vs. ZooKeeper/etcd: Redlock is simpler to deploy (just Redis) but has theoretical safety issues under network partitions and process pauses; ZooKeeper/etcd provide stronger guarantees via consensus but are heavier dependencies
  • TTL length: Short TTLs (1-5s) limit the window where a crashed holder's lock blocks others but risk expiry during transient pauses (GC, network blip); long TTLs (30s+) are robust but delay recovery after holder crash
  • Fair vs. unfair locking: Fair (FIFO) locking prevents starvation but adds queue management overhead; unfair (whoever tries first wins) is simpler and has lower latency but can starve low-frequency waiters
  • Fencing tokens vs. no fencing: Fencing tokens are essential for safety under process pauses but require the protected resource to support conditional writes — many storage systems (S3, older databases) don't natively support token-based conditional writes
