System Design: Distributed Cache (Redis Cluster)
Design a distributed caching system like Redis Cluster that provides sub-millisecond read latency, consistent hashing, and automatic failover for high-throughput applications. Covers eviction policies, replication, and hot key handling.
Requirements
Functional Requirements:
- Store key-value pairs with configurable TTL
- Support data structures: strings, hashes, lists, sets, sorted sets
- Sub-millisecond read and write latency
- Automatic partitioning across multiple nodes (consistent hashing)
- Replication: primary-replica pairs for fault tolerance
- Pub/Sub messaging for real-time notifications
Non-Functional Requirements:
- 1 million operations/sec aggregate throughput
- P99 latency under 1ms for GET operations
- Support 100 TB total cached data across cluster
- Automatic failover in under 30 seconds
- Horizontal scaling — add nodes without downtime
Scale Estimation
With 100 TB of cached data and an average key-value size of 1 KB, the cache holds roughly 100 billion keys. Redis adds ~70 bytes of overhead per key (dict entry, SDS string header, and LRU clock). Total memory: 100B * (1,024 + 70) bytes ≈ 109 TB of RAM required. With 512 GB of RAM per cache node, about 214 nodes are needed; Redis Cluster supports up to 1,000 nodes. At 1 million ops/sec across ~200 nodes, each node handles only 5,000 ops/sec, well within Redis's single-node limit of 100,000+ ops/sec. In practice, deployments use fewer, larger nodes (vertical scaling with high-memory instances).
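The same arithmetic as a quick sanity check (pure Python, decimal units; these are the estimates above, not measurements):

```python
# Back-of-envelope check of the capacity estimate above.
NUM_KEYS = 100 * 10**9            # 100 TB / ~1 KB per entry
PER_KEY_BYTES = 1024 + 70         # value + Redis per-key overhead

total_bytes = NUM_KEYS * PER_KEY_BYTES
print(f"RAM needed: {total_bytes / 10**12:.0f} TB")        # ~109 TB
print(f"nodes @ 512 GB: {total_bytes / (512 * 10**9):.0f}")  # ~214 nodes
print(f"ops/sec per node: {1_000_000 // 200:,}")           # 5,000
```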
High-Level Architecture
Redis Cluster partitions the key space into 16,384 hash slots. Each primary node owns a subset of slots (e.g., 16 primaries, each owning 1,024 slots). The slot mapping (CRC16 of the key, mod 16,384) determines which slot, and therefore which node, owns each key. Each primary node has 1–3 replica nodes for fault tolerance. A cluster bus (gossip protocol on the data port + 10,000, e.g., 16379 for 6379) propagates node state, slot ownership changes, and failure detection between all nodes. Clients use a cluster-aware client library (redis-py-cluster, Jedis Cluster) that maintains the slot-to-node mapping and routes requests directly to the correct node.
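A minimal sketch of the slot computation a cluster-aware client performs, including hash-tag handling (CRC16-CCITT/XModem is the variant Redis uses; this is an illustration, not any library's internal code):

```python
def crc16_xmodem(data: bytes) -> int:
    """CRC16-CCITT (XModem), the checksum Redis Cluster hashes keys with."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ 0x1021) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc

def key_hash_slot(key: str) -> int:
    """Map a key to one of 16,384 slots, honoring {hash tag} semantics:
    if the key contains a non-empty {...} section, only that part is hashed."""
    start = key.find("{")
    if start != -1:
        end = key.find("}", start + 1)
        if end != -1 and end != start + 1:
            key = key[start + 1:end]
    return crc16_xmodem(key.encode()) % 16384

# Both keys land on the same slot, so multi-key operations across them work:
assert key_hash_slot("user:{123}:profile") == key_hash_slot("user:{123}:cart")
```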
For writes, the client computes the slot for the key and sends the command to the primary node owning that slot. The primary executes the command and asynchronously replicates it to its replicas. Replication is asynchronous by default (non-blocking): the primary acknowledges the write before replicas confirm. This gives lower write latency but risks losing a few recent writes if the primary fails before replication completes. The WAIT command blocks until a given number of replicas acknowledge, approximating synchronous replication for critical writes. Reads can be served by replicas (via the READONLY command on a replica connection) to distribute read load.
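A sketch of this write path using redis-py's cluster client (redis>=4.1; the hostname is a placeholder, and note that WAIT is evaluated by whichever node receives it):

```python
from redis.cluster import RedisCluster

# Hypothetical seed node; the client discovers the rest of the cluster.
rc = RedisCluster(host="cache-node-1.internal", port=6379)

# Default write: the primary acknowledges before replicas confirm (async).
rc.set("user:123:session", "token-abc", ex=3600)

# Critical write: block until >=1 replica acknowledges, or 100 ms elapse.
rc.set("order:987:status", "paid")
rc.execute_command("WAIT", 1, 100)
```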
Automatic failover: if a primary node fails, its replicas detect the failure via heartbeat timeouts and initiate an election; the replica with the most up-to-date replication offset is favored and must win a majority vote from the other primaries over the cluster bus. The winning replica promotes itself to primary and takes ownership of the failed node's slots. The rest of the cluster stays available throughout failover (only the failing node's slots are briefly unavailable). Failover completes in 15–30 seconds, tunable via cluster-node-timeout.
Core Components
Hash Slot Partitioning
Redis Cluster uses 16,384 fixed hash slots (rather than a consistent hash ring) for predictable, configurable partitioning. Each key maps to a slot via CRC16(key) mod 16,384. Hash tags ({user}) co-locate related keys on the same slot for multi-key operations. Slot migration (CLUSTER SETSLOT plus the MIGRATE command) moves individual slots between nodes without downtime, enabling live rebalancing when nodes are added or removed. Migrating a slot involves marking the slot as migrating on the source and importing on the target, moving keys in batches (e.g., 1,000 at a time), then atomically flipping slot ownership in the cluster state.
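The migration sequence, sketched with raw commands against direct connections to the two nodes involved (hostnames are placeholders; in practice redis-cli --cluster reshard drives this):

```python
import redis

SLOT = 42  # example slot
source = redis.Redis(host="node-a.internal", port=6379)
target = redis.Redis(host="node-b.internal", port=6379)
source_id = source.execute_command("CLUSTER", "MYID")
target_id = target.execute_command("CLUSTER", "MYID")

# 1. Mark the slot importing on the target, migrating on the source.
target.execute_command("CLUSTER", "SETSLOT", SLOT, "IMPORTING", source_id)
source.execute_command("CLUSTER", "SETSLOT", SLOT, "MIGRATING", target_id)

# 2. Move keys in batches until the slot is empty on the source.
while keys := source.execute_command("CLUSTER", "GETKEYSINSLOT", SLOT, 1000):
    source.execute_command("MIGRATE", "node-b.internal", 6379, "", 0, 5000,
                           "KEYS", *keys)

# 3. Flip slot ownership in the cluster state.
for node in (source, target):
    node.execute_command("CLUSTER", "SETSLOT", SLOT, "NODE", target_id)
```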
Eviction Policies
Redis supports 8 eviction policies, triggered when memory usage reaches maxmemory: noeviction (return an error on writes; safe for critical data); allkeys-lru (evict the least recently used key from all keys; general purpose); allkeys-lfu (evict the least frequently used; better for skewed workloads); allkeys-random (evict a random key; for uniform access patterns); volatile-lru, volatile-lfu, and volatile-random (the same strategies restricted to keys with a TTL set, preserving non-expiring keys); and volatile-ttl (evict the key with the nearest expiry). LFU-based eviction (Redis 4.0+) uses a probabilistic, Morris-style counter per key: 8 bits represent frequency with logarithmic scaling, updated probabilistically on each access. LFU outperforms LRU for Zipf-distributed access patterns (most caches).
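The probabilistic increment, paraphrased in Python from Redis's LFULogIncr (constants match Redis defaults):

```python
import random

LFU_INIT_VAL = 5        # new keys start here so they survive the first sweep
LFU_LOG_FACTOR = 10     # redis.conf lfu-log-factor default

def lfu_log_incr(counter: int) -> int:
    """Bump an 8-bit LFU counter with probability 1/(base*factor + 1).

    The probability shrinks as the counter grows, so 255 can represent
    millions of accesses (logarithmic scaling)."""
    if counter == 255:
        return 255
    base = max(counter - LFU_INIT_VAL, 0)
    if random.random() < 1.0 / (base * LFU_LOG_FACTOR + 1):
        counter += 1
    return counter
```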
Hot Key Handling
Hot keys (a single key receiving millions of requests/sec) create node-level bottlenecks that partitioning cannot solve: all requests for a hot key route to the same node. Solutions (option 2 is sketched below): (1) Local cache: an application-side in-process LRU cache (e.g., Caffeine) serves the top-100 hot keys with a short (~100 ms) TTL, absorbing ~99% of hot-key traffic before it hits Redis; (2) Key sharding: append a random suffix to the hot key (e.g., user:123:0 through user:123:9), spreading requests across 10 keys on different slots, with a read-merge step in the application; (3) Read replicas: route reads for hot keys to replicas (READONLY mode), distributing traffic across the primary plus N replicas; (4) Request coalescing: use a singleflight pattern so only one backend request is in flight per key at any time.
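Option (2) sketched for a hot counter, including the read-merge step (the host is a placeholder, and plain redis.Redis stands in for a cluster client):

```python
import random
import redis

N_SHARDS = 10
r = redis.Redis(host="cache.internal", port=6379)  # placeholder endpoint

def incr_hot_counter(key: str) -> None:
    # Writes scatter across 10 suffixed keys; without a {hash tag},
    # each suffix hashes to its own slot and (likely) its own node.
    r.incr(f"{key}:{random.randrange(N_SHARDS)}")

def read_hot_counter(key: str) -> int:
    # Read-merge: sum all shards. This trades one read for N, so it suits
    # write-hot counters rather than read-hot keys.
    return sum(int(r.get(f"{key}:{i}") or 0) for i in range(N_SHARDS))
```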
Database Design
Redis stores all data in memory, with optional persistence to disk. RDB (Redis Database) persistence creates point-in-time snapshots by forking the process (copy-on-write) and serializing all key-value pairs to a compact binary file. RDB is suitable for periodic backup (every 1–15 minutes). AOF (Append Only File) logs every write command; replaying the AOF recovers the full dataset after a restart. AOF with fsync-every-second provides <1 second RPO. Combination mode (AOF + RDB) is common in production: AOF ensures recent write durability, RDB provides fast startup. Cluster-level persistence requires configuring both RDB and AOF on each primary node.
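The persistence settings described above, applied at runtime via CONFIG SET (they can equally live in redis.conf; the host is a placeholder):

```python
import redis

r = redis.Redis(host="node-a.internal", port=6379)

r.config_set("appendonly", "yes")        # enable the AOF
r.config_set("appendfsync", "everysec")  # fsync once per second: <1 s RPO
r.config_set("save", "900 1 300 100")    # RDB snapshot schedule
```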
API Design
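The client-facing API is the standard Redis command set rather than a bespoke HTTP layer; cluster-aware clients expose it directly. A brief sketch of the core operations from the requirements (the hostname is a placeholder):

```python
from redis.cluster import RedisCluster

rc = RedisCluster(host="cache-node-1.internal", port=6379)

rc.set("session:abc", "user-123", ex=1800)             # string with 30-min TTL
rc.get("session:abc")
rc.hset("user:{123}:profile", mapping={"name": "Ada", "plan": "pro"})
rc.zadd("leaderboard:{g1}", {"ada": 4200, "bob": 3100})
rc.expire("leaderboard:{g1}", 3600)

# Multi-key commands must target a single slot, hence the shared {hash tag}:
rc.zrange("leaderboard:{g1}", 0, 9, desc=True, withscores=True)
```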
Scaling & Bottlenecks
Memory is the primary scaling constraint. Redis executes commands on a single thread (network I/O is multi-threaded in Redis 6+), so one instance sustains roughly 100,000–200,000 simple ops/sec. For write-heavy workloads, vertical scaling (more cores serving threaded I/O) helps; for memory-heavy workloads, horizontal scaling (more cluster nodes) is necessary. Key expiration uses a lazy + active strategy: lazy expiry (check on access) handles the common case; active expiry (sample 20 random keys with TTLs every 100 ms, delete the expired ones, and repeat if more than 25% of the sample was expired) prevents memory bloat from keys that are never accessed again.
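A toy model of the active-expiry loop (an in-process dict stands in for the keyspace; the sampling and 25% repeat rule mirror Redis's behavior, the rest is simplification):

```python
import random
import time

def active_expire_cycle(ttl_index: dict[str, float], sample_size: int = 20) -> None:
    """Sample keys that carry a TTL, delete the expired ones, and repeat the
    sweep while more than 25% of the sample turned out to be expired."""
    while ttl_index:
        sample = random.sample(list(ttl_index), min(sample_size, len(ttl_index)))
        expired = [k for k in sample if ttl_index[k] <= time.time()]
        for k in expired:
            del ttl_index[k]
        if len(expired) * 4 <= len(sample):   # <=25% expired: stop this cycle
            break
```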
Network bandwidth can be a bottleneck for large-value operations (GET on 1 MB values). A cluster node with a 10 Gbps NIC can deliver ~1.25 GB/s, which saturates at roughly 1,200 concurrent 1 MB GET operations/sec. For large objects, consider client-side compression (snappy/zstd) or chunking. Redis Cluster's gossip protocol generates message volume that grows quadratically with cluster size; beyond 100–200 nodes, gossip overhead becomes significant. Structuring the deployment as multiple smaller clusters (each handling a subset of the key space) with cross-cluster routing in the application layer scales beyond this limit.
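The corrected arithmetic:

```python
nic_bytes_per_sec = 10 * 10**9 / 8          # 10 Gbps ≈ 1.25 GB/s
value_bytes = 1 * 1024**2                   # 1 MB values
print(nic_bytes_per_sec / value_bytes)      # ≈ 1192 GETs/sec saturates the NIC
```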
Key Trade-offs
- Async replication vs. durability: async replication gives sub-ms write latency but risks losing the last 1–2 seconds of writes on primary failure; quasi-synchronous replication (WAIT 1 100) adds ~1 ms per write but shrinks that window to near zero
- Memory vs. persistence: Maximizing cache size requires using available RAM for data; RDB/AOF persistence adds disk I/O overhead; purely in-memory (no persistence) is fastest but requires warm-up after restart
- LRU vs. LFU eviction: LRU is simple and predictable but can evict a genuinely frequent key after a brief idle period; naive LFU has the opposite problem (a key that was hot an hour ago but is now cold keeps a high counter), which Redis's LFU frequency decay corrects
- Cluster sharding vs. application-level sharding: Redis Cluster provides transparent sharding via cluster-aware clients; application-level sharding (consistent hash in app code hitting multiple standalone Redis instances) avoids cross-slot operation limitations but couples sharding logic to application code