System Design: Load Balancer

Design a software load balancer that distributes traffic across backend servers using algorithms like consistent hashing, least connections, and weighted round-robin, with health checking and session persistence.

Requirements

Functional Requirements:

  • Distribute incoming TCP/HTTP requests across a pool of backend servers
  • Support multiple load balancing algorithms: round-robin, weighted round-robin, least connections, IP hash, consistent hashing
  • Health check backends and remove unhealthy instances from rotation
  • Support session persistence (sticky sessions): route requests from the same client to the same backend
  • SSL/TLS termination: decrypt HTTPS at the load balancer, forward HTTP to backends
  • Layer 7 routing: route based on HTTP host, path, headers, or cookies to different backend pools

Non-Functional Requirements:

  • Handle 1 million concurrent connections and 500,000 new connections/sec
  • Sub-millisecond routing decision latency
  • 99.999% availability: load balancer failure must not cause service disruption
  • Zero-downtime backend pool changes: add/remove backends without dropping connections

Scale Estimation

1M concurrent connections × ~1 KB of state per connection = ~1 GB of connection state in memory. At 500,000 new connections/sec, the LB must complete TLS handshakes at that rate — TLS 1.3 with session resumption (0-RTT) reduces handshake cost. For L7 routing, HTTP header parsing adds ~0.1 ms per request. With a 10 Gbps NIC, a single LB node handles ~800,000 HTTP requests/sec (assuming a 1.5 KB average request). A cluster of 4 LB nodes therefore provides ~3.2M RPS of raw capacity, or roughly 1.6M RPS of usable capacity with 2x redundancy.
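
As a rough sanity check, the arithmetic above can be reproduced in a few lines of Go; the per-connection state size and average request size are the assumptions stated in the estimate, not measurements:

```go
package main

import "fmt"

func main() {
	// Figures taken from the estimate above; all of them are assumptions.
	const (
		concurrentConns   = 1_000_000
		statePerConnBytes = 1_024          // ~1 KB of state per connection
		nicBitsPerSec     = 10_000_000_000 // one 10 Gbps NIC
		avgRequestBytes   = 1_500          // ~1.5 KB average HTTP request
	)

	connStateGB := float64(concurrentConns*statePerConnBytes) / (1 << 30)
	reqsPerSec := float64(nicBitsPerSec) / 8 / avgRequestBytes

	fmt.Printf("connection state: ~%.1f GB\n", connStateGB)             // ~1.0 GB
	fmt.Printf("per-node throughput: ~%.0f req/s\n", reqsPerSec)        // ~833,000 req/s
	fmt.Printf("4-node raw capacity: ~%.1fM req/s\n", 4*reqsPerSec/1e6) // ~3.3M req/s
}
```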

High-Level Architecture

For high availability, load balancers are deployed active-active behind DNS round-robin or an anycast IP. ECMP (Equal-Cost Multi-Path) routing at the network layer distributes flows across LB instances — each LB handles a subset of traffic. This avoids the failover delay of an active-passive pair (VRRP/HSRP failover takes 1-3 seconds) in favor of always-on distribution.

The LB has two layers: L4 (transport) and L7 (application). L4 handles raw TCP/UDP connections using kernel bypass (DPDK or XDP) for maximum throughput — the kernel network stack is bypassed, and packets are processed directly in userspace. L7 handles HTTP/HTTPS, terminates TLS, parses headers, and makes routing decisions. L7 is slower than L4 (header parsing overhead) but enables content-based routing and request manipulation.
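
To make the L7 side concrete, here is a minimal sketch of content-based pool selection by host and path prefix. The rule shape and pool names are illustrative assumptions, not part of the design above:

```go
package main

import (
	"fmt"
	"net/http"
	"strings"
)

// RouteRule matches a request by host and path prefix and names a backend pool.
// Field names and pool names are illustrative assumptions.
type RouteRule struct {
	Host       string // exact host match; empty matches any host
	PathPrefix string // prefix match on the request path
	Pool       string // backend pool to forward to
}

type L7Router struct {
	rules       []RouteRule
	defaultPool string
}

// Select returns the backend pool for a request: first matching rule wins.
func (r *L7Router) Select(req *http.Request) string {
	for _, rule := range r.rules {
		if rule.Host != "" && rule.Host != req.Host {
			continue
		}
		if !strings.HasPrefix(req.URL.Path, rule.PathPrefix) {
			continue
		}
		return rule.Pool
	}
	return r.defaultPool
}

func main() {
	router := &L7Router{
		rules: []RouteRule{
			{Host: "api.example.com", PathPrefix: "/v1/", Pool: "api-pool"},
			{PathPrefix: "/static/", Pool: "cdn-pool"},
		},
		defaultPool: "web-pool",
	}
	req, _ := http.NewRequest("GET", "http://api.example.com/v1/users", nil)
	fmt.Println(router.Select(req)) // api-pool
}
```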

Backend health is monitored by active health checks (the LB sends periodic probe requests to each backend) and passive health checks (monitoring real traffic error rates — if a backend returns 5xx for >10% of requests in a 10-second window, it's considered unhealthy). Passive health checks detect failures faster than active probes (no waiting for the next probe interval).
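
A sketch of how the passive check's error-rate window might be tracked: ten one-second buckets approximating the 10-second window. The thresholds mirror the figures above; the bucket structure itself is an assumption:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

type bucket struct {
	sec    int64 // the second this bucket currently represents
	total  int
	errors int
}

// errorWindow approximates "more than 10% 5xx responses in the last 10 seconds"
// using one bucket per second.
type errorWindow struct {
	mu      sync.Mutex
	buckets [10]bucket
}

func (w *errorWindow) record(is5xx bool) {
	w.mu.Lock()
	defer w.mu.Unlock()
	now := time.Now().Unix()
	b := &w.buckets[now%int64(len(w.buckets))]
	if b.sec != now { // bucket holds a stale second: reset before reuse
		*b = bucket{sec: now}
	}
	b.total++
	if is5xx {
		b.errors++
	}
}

// unhealthy reports whether the 5xx ratio over the window exceeds 10%.
func (w *errorWindow) unhealthy() bool {
	w.mu.Lock()
	defer w.mu.Unlock()
	now := time.Now().Unix()
	var total, errors int
	for _, b := range w.buckets {
		if now-b.sec < int64(len(w.buckets)) { // only count buckets inside the window
			total += b.total
			errors += b.errors
		}
	}
	return total > 0 && float64(errors)/float64(total) > 0.10
}

func main() {
	var w errorWindow
	for i := 0; i < 100; i++ {
		w.record(i%5 == 0) // simulate 20% of requests returning 5xx
	}
	fmt.Println("unhealthy:", w.unhealthy()) // true: error rate above the 10% threshold
}
```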

Core Components

Connection Dispatcher

For L4 load balancing, the connection dispatcher uses a flow table: a hash map of (src_ip, src_port, dst_ip, dst_port) → selected_backend. The first packet of a new connection triggers a backend selection (using the configured algorithm); subsequent packets from the same flow are routed to the same backend by table lookup — O(1) per packet. The flow table has a bounded size (evict entries after connection close or idle timeout). For consistent hashing, the virtual node ring maps the hash of (client_ip + service_key) to a backend — adding or removing a backend only affects 1/N of connections (where N is the backend count).
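
A minimal sketch of the consistent-hashing piece of the dispatcher: a sorted ring of virtual nodes that maps a flow key to a backend. The virtual-node count (100) and hash function (FNV-1a) are assumptions; the flow table itself is not shown:

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
	"strconv"
)

// Ring is a consistent-hash ring with virtual nodes. More virtual nodes per
// backend smooth the key distribution across backends.
type Ring struct {
	vnodes   []uint32          // sorted hashes of virtual nodes
	owners   map[uint32]string // vnode hash -> backend address
	replicas int               // virtual nodes per backend
}

func hashKey(s string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(s))
	return h.Sum32()
}

func NewRing(replicas int) *Ring {
	return &Ring{owners: make(map[uint32]string), replicas: replicas}
}

func (r *Ring) Add(backend string) {
	for i := 0; i < r.replicas; i++ {
		h := hashKey(backend + "#" + strconv.Itoa(i))
		r.vnodes = append(r.vnodes, h)
		r.owners[h] = backend
	}
	sort.Slice(r.vnodes, func(i, j int) bool { return r.vnodes[i] < r.vnodes[j] })
}

// Lookup maps a flow key (e.g. client IP + service key) to a backend:
// the first virtual node clockwise from the key's hash owns it.
func (r *Ring) Lookup(key string) string {
	if len(r.vnodes) == 0 {
		return ""
	}
	h := hashKey(key)
	i := sort.Search(len(r.vnodes), func(i int) bool { return r.vnodes[i] >= h })
	if i == len(r.vnodes) { // wrap around the ring
		i = 0
	}
	return r.owners[r.vnodes[i]]
}

func main() {
	ring := NewRing(100)
	for _, b := range []string{"10.0.0.1", "10.0.0.2", "10.0.0.3"} {
		ring.Add(b)
	}
	// The same key always lands on the same backend while the backend set is unchanged.
	fmt.Println(ring.Lookup("203.0.113.7:svc-a"))
}
```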

Least Connections Algorithm

Each backend maintains an atomic counter of active connections. When a new connection arrives, the LB selects the backend with the fewest active connections. In a distributed LB cluster, each node maintains its own counter (not globally shared) — a backend might appear to have 100 connections from LB-1's perspective and 50 from LB-2's perspective. Global coordination (via shared state in Redis) is too slow (adds 1ms latency per connection). The approximation of per-node counters is acceptable: over time, the aggregate load balances out.
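
A per-node least-connections selection might look like the following sketch, with one atomic counter per backend and no cross-node coordination; field names are illustrative:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// Backend tracks active connections with an atomic counter local to this LB node.
type Backend struct {
	Addr   string
	active atomic.Int64
}

// pickLeastConnections returns the backend with the fewest active connections
// as seen by this node (counters are not shared across LB nodes).
func pickLeastConnections(pool []*Backend) *Backend {
	var best *Backend
	for _, b := range pool {
		if best == nil || b.active.Load() < best.active.Load() {
			best = b
		}
	}
	return best
}

func main() {
	pool := []*Backend{{Addr: "10.0.0.1:80"}, {Addr: "10.0.0.2:80"}, {Addr: "10.0.0.3:80"}}
	pool[0].active.Store(12)
	pool[1].active.Store(3)
	pool[2].active.Store(7)

	b := pickLeastConnections(pool)
	b.active.Add(1) // connection accepted: increment on this node only
	fmt.Println("selected", b.Addr)
	b.active.Add(-1) // decrement when the connection closes
}
```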

Health Check Engine

Active health checks run in background goroutines, one per backend. HTTP health checks: send GET /health every 5 seconds with a 2-second timeout; 2 consecutive failures mark unhealthy; 3 consecutive successes after recovery mark healthy again (hysteresis prevents flapping). DNS-based backends (upstream clusters with hostnames) re-resolve DNS every 30 seconds to pick up new IP addresses. When a backend is marked unhealthy, the LB immediately stops sending new connections — in-flight requests to that backend are allowed to complete (connection draining) up to a configurable timeout (30 seconds).
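
A sketch of one such per-backend goroutine, using the probe interval, timeout, and hysteresis thresholds described above; the /health URL layout and the code structure are assumptions:

```go
package main

import (
	"fmt"
	"net/http"
	"sync/atomic"
	"time"
)

// healthChecker probes one backend's health endpoint and applies hysteresis:
// 2 consecutive failures mark it unhealthy, 3 consecutive successes mark it healthy.
type healthChecker struct {
	url     string      // e.g. http://10.0.0.1:8080/health (assumed layout)
	healthy atomic.Bool // read by the dispatcher when selecting backends
}

func (hc *healthChecker) run() {
	client := &http.Client{Timeout: 2 * time.Second}
	hc.healthy.Store(true)
	fails, successes := 0, 0

	ticker := time.NewTicker(5 * time.Second)
	defer ticker.Stop()
	for range ticker.C {
		resp, err := client.Get(hc.url)
		ok := err == nil && resp.StatusCode == http.StatusOK
		if resp != nil {
			resp.Body.Close()
		}
		if ok {
			successes, fails = successes+1, 0
			if !hc.healthy.Load() && successes >= 3 {
				hc.healthy.Store(true)
				fmt.Println(hc.url, "back in rotation")
			}
		} else {
			fails, successes = fails+1, 0
			if hc.healthy.Load() && fails >= 2 {
				hc.healthy.Store(false)
				fmt.Println(hc.url, "marked unhealthy")
			}
		}
	}
}

func main() {
	// Hypothetical backend list; one checker goroutine per backend as described above.
	for _, u := range []string{"http://10.0.0.1:8080/health", "http://10.0.0.2:8080/health"} {
		hc := &healthChecker{url: u}
		go hc.run()
	}
	select {} // keep the process alive
}
```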

Database Design

The load balancer is largely stateless. Configuration (backend pools, routing rules, SSL certificates, health check parameters) is stored externally in a configuration store (etcd or Consul), loaded at startup, and kept current via a watch. The only runtime state is the flow table (in-memory, per-node) and the connection counters. Sticky-session state (client → backend mapping) is stored in a shared Redis cluster if session persistence must survive LB node failure; for best-effort stickiness (lost on LB restart), local memory suffices.
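
A sketch of the startup load plus watch against etcd using the clientv3 API; the key prefix and the pool document shape are assumptions:

```go
package main

import (
	"context"
	"encoding/json"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// BackendPool is an assumed shape for pool configuration stored in etcd.
type BackendPool struct {
	Name     string   `json:"name"`
	Backends []string `json:"backends"`
}

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"localhost:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	ctx := context.Background()
	prefix := "/lb/pools/" // assumed key layout

	// Initial load on startup.
	resp, err := cli.Get(ctx, prefix, clientv3.WithPrefix())
	if err != nil {
		log.Fatal(err)
	}
	for _, kv := range resp.Kvs {
		var pool BackendPool
		if err := json.Unmarshal(kv.Value, &pool); err == nil {
			log.Printf("loaded pool %s with %d backends", pool.Name, len(pool.Backends))
		}
	}

	// Watch for changes and apply them without restarting.
	for wresp := range cli.Watch(ctx, prefix, clientv3.WithPrefix()) {
		for _, ev := range wresp.Events {
			log.Printf("config change: %s %s", ev.Type, ev.Kv.Key)
			// re-parse the value and swap the in-memory pool here
		}
	}
}
```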

SSL certificates are stored in the configuration store with automatic rotation — the LB watches for certificate updates and reloads them without restarting (hot reload). Certificate storage must be encrypted at rest; integration with a secrets manager (Vault) provides automatic rotation when certs approach expiry.
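
In Go, one common way to hot-reload certificates is the tls.Config GetCertificate callback backed by an atomically swapped pointer. This is a sketch of that pattern, assuming the configuration watcher calls reload on each rotation:

```go
package main

import (
	"crypto/tls"
	"log"
	"net/http"
	"sync/atomic"
)

// certStore holds the current certificate; the config watcher swaps it
// atomically whenever the configuration store delivers a rotated cert.
type certStore struct {
	current atomic.Pointer[tls.Certificate]
}

func (s *certStore) reload(certPEM, keyPEM []byte) error {
	cert, err := tls.X509KeyPair(certPEM, keyPEM)
	if err != nil {
		return err
	}
	s.current.Store(&cert) // takes effect for all subsequent handshakes
	return nil
}

func main() {
	store := &certStore{}
	// Initial load would come from the configuration store (assumed):
	// _ = store.reload(certPEM, keyPEM)

	server := &http.Server{
		Addr: ":443",
		TLSConfig: &tls.Config{
			// Called on every TLS handshake, so a swapped cert is picked up immediately.
			GetCertificate: func(*tls.ClientHelloInfo) (*tls.Certificate, error) {
				return store.current.Load(), nil
			},
		},
	}
	log.Fatal(server.ListenAndServeTLS("", "")) // cert is supplied by GetCertificate
}
```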

API Design
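
The data plane exposes nothing beyond proxied traffic, so the API surface here is a control-plane API for managing pools and draining backends. One plausible shape, sketched with Go 1.22's pattern-based routing; the endpoint paths, payloads, and admin port are assumptions:

```go
package main

import (
	"encoding/json"
	"log"
	"net/http"
)

// A hypothetical control-plane API served on a separate admin port.
func main() {
	mux := http.NewServeMux()

	// List backend pools (static response here; a real handler would read live state).
	mux.HandleFunc("GET /v1/pools", func(w http.ResponseWriter, r *http.Request) {
		json.NewEncoder(w).Encode(map[string]any{"pools": []string{"web-pool", "api-pool"}})
	})

	// Add a backend to a pool.
	mux.HandleFunc("POST /v1/pools/{pool}/backends", func(w http.ResponseWriter, r *http.Request) {
		var body struct {
			Addr   string `json:"addr"`
			Weight int    `json:"weight"`
		}
		if err := json.NewDecoder(r.Body).Decode(&body); err != nil {
			http.Error(w, err.Error(), http.StatusBadRequest)
			return
		}
		log.Printf("add %s (weight %d) to pool %s", body.Addr, body.Weight, r.PathValue("pool"))
		w.WriteHeader(http.StatusCreated)
	})

	// Drain and remove a backend (in-flight requests finish per the draining timeout).
	mux.HandleFunc("DELETE /v1/pools/{pool}/backends/{addr}", func(w http.ResponseWriter, r *http.Request) {
		log.Printf("draining backend %s in pool %s", r.PathValue("addr"), r.PathValue("pool"))
		w.WriteHeader(http.StatusAccepted)
	})

	log.Fatal(http.ListenAndServe("127.0.0.1:9000", mux)) // admin port, not exposed publicly
}
```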

Scaling & Bottlenecks

The LB's connection table size limits the concurrent connection count. At 1M connections, each entry is ~200 bytes (src/dst IP+port, backend, state) = 200 MB — manageable. Beyond ~10M connections, the table's memory footprint and cache pressure become significant; the mitigation is flow-based ECMP at the router level to shard connections across more LB nodes. TLS handshake CPU is the other bottleneck — at 500K new connections/sec with full handshakes, a modern CPU core handles ~50K TLS handshakes/sec, requiring 10 cores just for TLS. Session resumption (TLS 1.3 session tickets or PSK) reduces this dramatically for returning clients.

Bandwidth is the ultimate physical limit: 10 Gbps per NIC × 4 NICs per LB node = 40 Gbps per node. A 100 Gbps service needs 3 LB nodes just to cover the raw load (120 Gbps) and 5 nodes for 2x headroom. SmartNICs (NVIDIA BlueField, Pensando) offload L4 load balancing entirely to the NIC, freeing CPU for L7 processing.

Key Trade-offs

  • L4 vs. L7 load balancing: L4 is faster (no HTTP parsing, kernel-bypass possible) but limited to IP/port routing; L7 enables content-based routing, request manipulation, and better observability at higher latency cost
  • Consistent hashing vs. round-robin: Consistent hashing minimizes cache invalidation when backends are added/removed (only 1/N connections reassigned) but can create hot spots if the hash function isn't well-distributed; round-robin is simple and avoids hot spots but disrupts all cache affinity on backend changes
  • Active-active vs. active-passive HA: Active-active uses all hardware at full capacity and has no failover delay; active-passive wastes 50% capacity but simplifies state synchronization (no split-brain)
  • SSL termination at LB vs. end-to-end TLS: Terminating at the LB enables L7 inspection and reduces backend CPU load, but the LB-to-backend hop is unencrypted (requires trusted internal network or mTLS for the backend leg)
