Load Balancing Explained: Distributing Traffic Across Servers
How load balancing works — algorithms, health checks, Layer 4 vs Layer 7, sticky sessions, and how Netflix and Google distribute billions of requests.
Load Balancing
Load balancing is the process of distributing incoming network traffic across multiple servers to ensure no single server is overwhelmed, improving availability, throughput, and response times.
What It Really Means
Imagine a single web server handling all your traffic. At 100 requests per second it is fine. At 10,000, it buckles. At 100,000, it crashes. You cannot keep buying bigger servers forever — there is always a ceiling. The alternative is horizontal scaling: run many servers and spread traffic across them.
A load balancer sits between clients and your server fleet. It receives every incoming request and decides which backend server should handle it. If a server goes down, the load balancer stops sending traffic to it. If you add new servers, the load balancer starts including them. Clients never know how many servers are behind the scenes — they talk to one address, and the load balancer does the rest.
Load balancing operates at different layers of the network stack, uses various algorithms to make routing decisions, and is often the first line of defense for both scalability and reliability. Every large-scale system — from Google Search to Netflix streaming — depends on multiple layers of load balancing.
How It Works in Practice
Layer 4 vs Layer 7 Load Balancing
Layer 4 (Transport Layer) load balancers route based on IP address and TCP port. They do not inspect the HTTP request content. This makes them extremely fast — they can handle millions of connections per second.
Google's Maglev is a Layer 4 load balancer that uses consistent hashing to distribute packets across backend servers. It runs on commodity hardware and handles Google's entire frontend traffic.
Layer 7 (Application Layer) load balancers inspect HTTP headers, URLs, cookies, and request bodies. They can make smarter routing decisions: send API traffic to one server pool, static assets to another, and WebSocket connections to a third.
Netflix uses Layer 7 load balancing with Zuul to route requests based on URL patterns, user segments, and canary deployment rules. A request to /api/profiles goes to the profile service fleet, while /api/recommendations goes to a completely different set of servers.
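The path-based routing described above can be sketched in a few lines. This is a minimal illustration, not Zuul's actual API: the pool addresses, the ROUTES table, and the select_pool function are all hypothetical names chosen for the example.

```python
from typing import Dict, List

# Hypothetical pools keyed by URL path prefix; the addresses are placeholders.
ROUTES: Dict[str, List[str]] = {
    "/api/profiles": ["profiles-1:8080", "profiles-2:8080"],
    "/api/recommendations": ["recs-1:8080", "recs-2:8080"],
}
DEFAULT_POOL: List[str] = ["web-1:8080"]


def select_pool(path: str) -> List[str]:
    """Return the pool whose prefix is the longest match for the request path."""
    best_prefix = ""
    for prefix in ROUTES:
        # Match the prefix exactly or on a path-segment boundary.
        matches = path == prefix or path.startswith(prefix + "/")
        if matches and len(prefix) > len(best_prefix):
            best_prefix = prefix
    return ROUTES[best_prefix] if best_prefix else DEFAULT_POOL
```

A Layer 4 balancer could not make this decision, because the URL path is only visible once the HTTP request has been parsed.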
Load Balancing Algorithms
Round Robin: Requests go to servers in sequential order (1, 2, 3, 1, 2, 3...). Simple and fair when all servers have equal capacity and all requests have equal cost.
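As a minimal sketch of the rotation (server names are placeholders):

```python
from itertools import cycle
from typing import Iterator, List


def round_robin(servers: List[str]) -> Iterator[str]:
    """Yield servers in a fixed, endlessly repeating order."""
    return cycle(servers)


picker = round_robin(["server-1", "server-2", "server-3"])
# Six requests visit each server twice, in order.
first_six = [next(picker) for _ in range(6)]
```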
Weighted Round Robin: Servers with more capacity get proportionally more traffic. For example, if weights follow core count, a 16-core server receives twice the traffic of an 8-core server.
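One common way to implement this is the "smooth" variant sketched below, which interleaves picks instead of sending bursts to the heaviest server. The weights and server names are illustrative assumptions.

```python
from typing import Dict, List


def smooth_weighted_round_robin(weights: Dict[str, int], n: int) -> List[str]:
    """Return n picks where each server appears in proportion to its weight."""
    current = {server: 0 for server in weights}
    total = sum(weights.values())
    picks: List[str] = []
    for _ in range(n):
        # Every server earns credit equal to its weight each round.
        for server, weight in weights.items():
            current[server] += weight
        # The server with the most credit is chosen and pays back the total.
        chosen = max(current, key=current.get)
        current[chosen] -= total
        picks.append(chosen)
    return picks


# Weight 2 vs weight 1 mirrors the 16-core vs 8-core example.
picks = smooth_weighted_round_robin({"big-16-core": 2, "small-8-core": 1}, 6)
```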
Least Connections: Route to the server with the fewest active connections. Works well when request processing times vary significantly.
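A minimal in-memory sketch, assuming the balancer is told when each connection opens and closes (the class and method names are hypothetical):

```python
from typing import Dict, List


class LeastConnections:
    """Track active connections per server and pick the least busy one."""

    def __init__(self, servers: List[str]) -> None:
        self.active: Dict[str, int] = {server: 0 for server in servers}

    def acquire(self) -> str:
        # Ties go to the server listed first.
        server = min(self.active, key=self.active.get)
        self.active[server] += 1
        return server

    def release(self, server: str) -> None:
        # Call when the connection to this server closes.
        self.active[server] -= 1
```

A slow request keeps its connection count raised for longer, so the server handling it naturally receives fewer new requests.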
Consistent Hashing: Hash the request (e.g., by user ID) to deterministically route to the same server. Used when you want session affinity or cache locality without sticky sessions.
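The sketch below shows one way to build a hash ring with virtual nodes. It is an illustration under assumptions, not Maglev's algorithm: the HashRing class, the choice of SHA-256, and the 100 virtual nodes per server are all picked for the example.

```python
import bisect
import hashlib
from typing import List, Tuple


def _hash(key: str) -> int:
    """Map a string to a 64-bit position on the ring."""
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")


class HashRing:
    def __init__(self, servers: List[str], vnodes: int = 100) -> None:
        # Each server is placed at many points so load spreads evenly.
        self._ring: List[Tuple[int, str]] = sorted(
            (_hash(f"{server}#{i}"), server)
            for server in servers
            for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def lookup(self, key: str) -> str:
        """Return the server owning the first ring point after the key."""
        idx = bisect.bisect_right(self._points, _hash(key)) % len(self._ring)
        return self._ring[idx][1]
```

With this layout, removing a server only reassigns the keys that server owned; keys mapped to the remaining servers stay where they were.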
Random with Two Choices (Power of Two): Pick two random servers and route to the one with fewer connections. Surprisingly effective — provides near-optimal load distribution with minimal coordination.
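A minimal sketch, assuming the balancer keeps a map of active connection counts and has at least two servers (the function name and parameters are hypothetical):

```python
import random
from typing import Dict, Optional


def pick_two_choices(active: Dict[str, int],
                     rng: Optional[random.Random] = None) -> str:
    """Sample two distinct servers and return the one with fewer connections."""
    rng = rng or random.Random()
    first, second = rng.sample(list(active), 2)
    return first if active[first] <= active[second] else second
```

Only two counters are read per request, which is why the approach needs so little coordination compared with scanning every server.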
Health Checks
A load balancer must know which servers are healthy. It periodically sends health check probes:
- TCP check: Can I open a connection to port 8080? (basic liveness)
- HTTP check: Does GET /health return 200? (application-level health)
- Deep health check: Does the server respond within 100ms and return valid data? (thorough but expensive)
Amazon ALB performs health checks every 30 seconds by default. If a target fails 2 consecutive checks, it is marked unhealthy and removed from rotation. After 5 consecutive successes, it is added back. All three values are configurable per target group.
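The consecutive-threshold logic can be sketched as a small state tracker. The TargetHealth class is hypothetical, and the thresholds in the usage lines are illustrative values rather than any provider's defaults.

```python
from dataclasses import dataclass


@dataclass
class TargetHealth:
    unhealthy_threshold: int  # consecutive failures before removal
    healthy_threshold: int    # consecutive successes before re-adding
    healthy: bool = True
    streak: int = 0           # consecutive probes contradicting current state

    def record(self, probe_ok: bool) -> bool:
        """Record one probe result and return the resulting health state."""
        if probe_ok == self.healthy:
            # The probe agrees with the current state, so reset the streak.
            self.streak = 0
        else:
            self.streak += 1
            needed = self.healthy_threshold if probe_ok else self.unhealthy_threshold
            if self.streak >= needed:
                self.healthy = probe_ok
                self.streak = 0
        return self.healthy


target = TargetHealth(unhealthy_threshold=2, healthy_threshold=3)
states = [target.record(ok) for ok in (False, False, True, True, True)]
```

Requiring several consecutive results in either direction keeps a single dropped probe from flapping a server in and out of rotation.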
Implementation
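As one possible starting point, the behavior described earlier (skip servers that are down, include servers that are added) can be combined with round robin in a small in-memory sketch. This is not a production balancer, and the BackendPool class and its methods are hypothetical.

```python
from typing import Dict, List


class BackendPool:
    """Round robin over the servers currently marked healthy."""

    def __init__(self, servers: List[str]) -> None:
        self._healthy: Dict[str, bool] = {server: True for server in servers}
        self._next = 0

    def add(self, server: str) -> None:
        self._healthy[server] = True

    def mark(self, server: str, healthy: bool) -> None:
        # Would be driven by health check results in a real system.
        self._healthy[server] = healthy

    def choose(self) -> str:
        servers = list(self._healthy)
        # Try each server at most once, skipping unhealthy ones.
        for _ in range(len(servers)):
            candidate = servers[self._next % len(servers)]
            self._next += 1
            if self._healthy[candidate]:
                return candidate
        raise RuntimeError("no healthy backends available")
```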
Trade-offs
Advantages
- Horizontal scalability: Add servers to handle more traffic without upgrading hardware
- High availability: Automatic failover when servers go down
- Zero-downtime deployments: Drain traffic from servers being updated
- SSL termination: Offload TLS handshake overhead from backend servers
Disadvantages
- Single point of failure: The load balancer itself can fail (solved with redundant LBs)
- Added latency: The extra network hop adds a small delay to every request, typically around 0.5-2ms
- Session state complexity: Stateful applications need sticky sessions or external session stores
- Cost: Managed load balancers (ALB, NLB) charge per hour and per request
Sticky Sessions vs Stateless Design
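Sticky sessions pin a user to the same server using cookies or IP hashing. This skews load distribution, because long-lived or heavy users pile up on particular servers, and it weakens failover, because a user's session state is lost when their pinned server goes down. The better approach is stateless servers with shared session storage (Redis or a database), so that any server can handle any request.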
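The stateless pattern can be sketched as follows. A plain dict stands in for the external store (an assumption made so the example runs without dependencies), and the AppServer class and session ID are hypothetical.

```python
from typing import Dict

# Stand-in for an external store such as Redis or a database.
SESSION_STORE: Dict[str, Dict[str, str]] = {}


class AppServer:
    """An application server that keeps no session state of its own."""

    def __init__(self, name: str) -> None:
        self.name = name

    def login(self, session_id: str, user: str) -> None:
        # Session data is written to the shared store, not to this server.
        SESSION_STORE[session_id] = {"user": user}

    def whoami(self, session_id: str) -> str:
        session = SESSION_STORE.get(session_id)
        if session is None:
            return f"{self.name}: not logged in"
        return f"{self.name}: hello {session['user']}"


app_1, app_2 = AppServer("app-1"), AppServer("app-2")
app_1.login("sess-123", "alice")
# A different server can serve the follow-up request.
reply = app_2.whoami("sess-123")
```

Because neither server holds the session, the load balancer is free to use any algorithm, and losing app-1 does not log the user out.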
Common Misconceptions
- "Load balancing is just round robin" — Round robin ignores server load, request complexity, and server capacity. Production systems typically use least-connections or weighted algorithms.
- "One load balancer is enough" — A single load balancer is a single point of failure. Production deployments use pairs (active-passive or active-active) with DNS failover.
- "Load balancers only distribute HTTP traffic" — Layer 4 load balancers handle any TCP/UDP traffic: databases, message queues, game servers, DNS.
- "More servers behind a load balancer always improves performance" — If your bottleneck is the database, adding more application servers just moves the contention point. Profile before scaling.
- "Cloud load balancers handle unlimited traffic" — AWS ALB can handle most workloads, but sudden traffic spikes require pre-warming. Contact AWS support before expected surge events.
How This Appears in Interviews
Load balancing comes up in nearly every system design interview:
- "How would you design a URL shortener that handles 10,000 requests per second?" — Multiple application servers behind a load balancer. Discuss algorithm choice and health checks.
- "Your application servers are unevenly loaded" — Review the load balancing algorithm. Round robin with varying request costs leads to imbalance. Switch to least-connections.
- "How do you deploy without downtime?" — Rolling deployment with the load balancer draining connections from servers being updated.
- "How does Google handle billions of requests daily?" — Multiple layers: DNS-based geographic routing, Layer 4 (Maglev) for packet distribution, Layer 7 (GFE) for HTTP routing.
See our System Design Interview Guide for end-to-end walkthroughs.
Related Concepts
- Consistent Hashing — the algorithm behind many load balancing strategies
- Heartbeat Mechanism — how load balancers detect server failures
- Partition Tolerance — what happens when load balancers lose contact with backends
- Interview Questions on System Design
- System Design Interview Guide