How load balancing works — algorithms, health checks, Layer 4 vs Layer 7, sticky sessions, and how Netflix and Google distribute billions of requests.

Load Balancing

Load balancing is the process of distributing incoming network traffic across multiple servers to ensure no single server is overwhelmed, improving availability, throughput, and response times.

What It Really Means

Imagine a single web server handling all your traffic. At 100 requests per second it is fine. At 10,000, it buckles. At 100,000, it crashes. You cannot keep buying bigger servers forever — there is always a ceiling. The alternative is horizontal scaling: run many servers and spread traffic across them.

A load balancer sits between clients and your server fleet. It receives every incoming request and decides which backend server should handle it. If a server goes down, the load balancer stops sending traffic to it. If you add new servers, the load balancer starts including them. Clients never know how many servers are behind the scenes — they talk to one address, and the load balancer does the rest.

Load balancing operates at different layers of the network stack, uses various algorithms to make routing decisions, and is often the first line of defense for both scalability and reliability. Every large-scale system — from Google Search to Netflix streaming — depends on multiple layers of load balancing.

How It Works in Practice

Layer 4 vs Layer 7 Load Balancing

Layer 4 (Transport Layer) load balancers route based on IP address and TCP or UDP port. They do not inspect the HTTP request content. Because they do so little work per packet, they are extremely fast and can handle millions of connections per second.

Google's Maglev is a Layer 4 load balancer that uses consistent hashing to distribute packets across backend servers. It runs on commodity Linux servers and has served Google's frontend traffic since 2008.

Layer 7 (Application Layer) load balancers inspect HTTP headers, URLs, cookies, and request bodies. They can make smarter routing decisions: send API traffic to one server pool, static assets to another, and WebSocket connections to a third.

Netflix uses Layer 7 load balancing with Zuul to route requests based on URL patterns, user segments, and canary deployment rules. A request to /api/profiles goes to the profile service fleet, while /api/recommendations goes to a completely different set of servers.
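The path-based routing described above can be sketched as a longest-prefix lookup. This is a minimal illustration, not how Zuul is configured: the two URL prefixes come from the example, while the pool addresses, the catch-all pool, and the function name are assumptions.

```python
# Hypothetical route table: URL prefix -> pool of backend addresses.
ROUTES = {
    "/api/profiles": ["10.0.1.1:8080", "10.0.1.2:8080"],
    "/api/recommendations": ["10.0.2.1:8080", "10.0.2.2:8080"],
}
DEFAULT_POOL = ["10.0.9.1:8080"]  # assumed catch-all pool


def select_pool(path):
    """Return the pool whose prefix is the longest match for the path."""
    best = ""
    for prefix in ROUTES:
        # Match the prefix exactly or on a path-segment boundary.
        matches = path == prefix or path.startswith(prefix + "/")
        if matches and len(prefix) > len(best):
            best = prefix
    return ROUTES.get(best, DEFAULT_POOL)


profile_pool = select_pool("/api/profiles/42")
static_pool = select_pool("/static/app.js")
```

A Layer 4 balancer could not make this decision, because the URL is only visible after the HTTP request has been parsed.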

Load Balancing Algorithms

Round Robin: Requests go to servers in sequential order (1, 2, 3, 1, 2, 3...). Simple and fair when all servers have equal capacity and all requests have equal cost.

Weighted Round Robin: Servers with more capacity get proportionally more traffic. With weights of 2 and 1, for example, a 16-core server receives twice as many requests as an 8-core server.
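A minimal sketch of round robin and its weighted variant, assuming hypothetical server names and integer weights. The weighted version simply repeats each server in the rotation, which sends requests in short bursts; production balancers often interleave the picks more smoothly.

```python
import itertools


def round_robin(servers):
    """Yield servers in a fixed repeating order."""
    return itertools.cycle(servers)


def weighted_round_robin(weights):
    """Yield each server `weight` times per cycle.

    `weights` maps server name -> positive integer weight.
    """
    expanded = [name for name, w in weights.items() for _ in range(w)]
    return itertools.cycle(expanded)


rr = round_robin(["s1", "s2", "s3"])
first_six = [next(rr) for _ in range(6)]

# Assumed weights: the 16-core server gets weight 2, the 8-core server weight 1.
wrr = weighted_round_robin({"big-16core": 2, "small-8core": 1})
first_cycle = [next(wrr) for _ in range(3)]
```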

Least Connections: Route to the server with the fewest active connections. Works well when request processing times vary significantly.
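Least connections only needs a per-server counter of in-flight requests. The sketch below assumes a single-process balancer and hypothetical server names; a real balancer would also need locking or per-worker counters.

```python
class LeastConnections:
    """Track in-flight requests per server and pick the least busy one."""

    def __init__(self, servers):
        self.active = {s: 0 for s in servers}

    def acquire(self):
        # min() returns the first minimum, so ties go to the earlier server.
        server = min(self.active, key=self.active.get)
        self.active[server] += 1
        return server

    def release(self, server):
        self.active[server] -= 1


lb = LeastConnections(["s1", "s2"])
a = lb.acquire()   # s1
b = lb.acquire()   # s2
lb.release(b)      # the request on s2 finishes quickly
c = lb.acquire()   # s2 again, because s1 is still busy
```

Round robin would have sent the third request to s1 even though it is still working on the first one.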

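Consistent Hashing: Hash a request attribute (e.g., user ID) so that the same key deterministically routes to the same server, and so that adding or removing a server remaps only a small fraction of keys. Used when you want session affinity or cache locality without cookie-based sticky sessions.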
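A minimal hash-ring sketch of consistent hashing, assuming SHA-256 as the hash function and 100 virtual nodes per server (both arbitrary choices for illustration). It is not Maglev's algorithm, which uses its own lookup-table construction.

```python
import bisect
import hashlib


def _hash(key):
    # SHA-256 is used only for an even spread of values, not for security.
    return int(hashlib.sha256(key.encode()).hexdigest(), 16)


class HashRing:
    def __init__(self, servers, vnodes=100):
        # Each server is placed at `vnodes` points to even out the distribution.
        self._ring = sorted(
            (_hash(f"{server}#{i}"), server)
            for server in servers
            for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def lookup(self, key):
        # First ring point clockwise from the key's hash, wrapping at the end.
        idx = bisect.bisect(self._points, _hash(key)) % len(self._ring)
        return self._ring[idx][1]


keys = [f"user-{i}" for i in range(1000)]
before = HashRing(["s1", "s2", "s3"])
after = HashRing(["s1", "s2", "s3", "s4"])
moved = [k for k in keys if before.lookup(k) != after.lookup(k)]
```

Every key in `moved` now maps to the new server `s4`; no key is shuffled between the three existing servers, which is what preserves cache locality when the fleet grows.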

Random with Two Choices (Power of Two): Pick two random servers and route to the one with fewer connections. Surprisingly effective — provides near-optimal load distribution with minimal coordination.
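The two-choices rule is short enough to sketch directly. The simulation below is an assumption-laden toy: ten hypothetical servers, a fixed random seed, and connections that never close, so the counters only grow.

```python
import random


def pick_two_choices(active, rng=random):
    """Sample two distinct servers and return the one with fewer connections."""
    first, second = rng.sample(list(active), 2)
    return first if active[first] <= active[second] else second


rng = random.Random(42)  # fixed seed so the run is repeatable
active = {f"s{i}": 0 for i in range(10)}
for _ in range(10_000):
    active[pick_two_choices(active, rng)] += 1  # connections never close here

# Difference between the busiest and the least busy server.
spread = max(active.values()) - min(active.values())
```

The balancer reads only two counters per request rather than scanning the whole fleet, yet `spread` stays small compared with purely random assignment.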

Health Checks

A load balancer must know which servers are healthy. It periodically sends health check probes:

TCP check: Can I open a connection to port 8080? (basic liveness)
HTTP check: Does GET /health return 200? (application-level health)
Deep health check: Does the server respond within 100ms and return valid data? (thorough but expensive)

Amazon ALB performs health checks every 30 seconds by default. If a target fails 2 consecutive checks, it is marked unhealthy and removed from rotation. After 5 consecutive successes (the default healthy threshold), it is added back.
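The consecutive-threshold behavior described above can be sketched as a small state machine. The class name and the threshold values used here are illustrative assumptions, not any vendor's API or defaults.

```python
class TargetHealth:
    """Flip state only after enough consecutive contradicting probe results."""

    def __init__(self, unhealthy_threshold, healthy_threshold):
        self.unhealthy_threshold = unhealthy_threshold
        self.healthy_threshold = healthy_threshold
        self.healthy = True
        self._streak = 0  # consecutive results that contradict the current state

    def record(self, check_passed):
        """Feed one probe result and return the (possibly updated) state."""
        if check_passed == self.healthy:
            self._streak = 0  # result agrees with the current state
            return self.healthy
        self._streak += 1
        needed = self.unhealthy_threshold if self.healthy else self.healthy_threshold
        if self._streak >= needed:
            self.healthy = not self.healthy
            self._streak = 0
        return self.healthy


# Illustrative thresholds: 2 failures to mark down, 3 successes to mark up.
target = TargetHealth(unhealthy_threshold=2, healthy_threshold=3)
probes = [False, True, False, False, True, True, True]
states = [target.record(result) for result in probes]
```

Requiring a streak in both directions keeps a single dropped probe from ejecting a healthy server, and keeps a flapping server from rejoining too early.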

Implementation
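One way to combine backend selection with health state is shown below as a minimal in-memory sketch. The class, its method names, and the backend addresses are hypothetical, and no real network I/O is performed.

```python
import itertools


class LoadBalancer:
    """Round-robin over backends, skipping any that are marked down."""

    def __init__(self, backends):
        self.backends = list(backends)
        self.down = set()
        self._cursor = itertools.cycle(self.backends)

    def mark_down(self, backend):
        self.down.add(backend)

    def mark_up(self, backend):
        self.down.discard(backend)

    def next_backend(self):
        # Try each backend at most once per call before giving up.
        for _ in range(len(self.backends)):
            candidate = next(self._cursor)
            if candidate not in self.down:
                return candidate
        raise RuntimeError("no healthy backends available")


lb = LoadBalancer(["10.0.0.1:8080", "10.0.0.2:8080", "10.0.0.3:8080"])
lb.mark_down("10.0.0.2:8080")  # e.g. after failed health checks
picks = [lb.next_backend() for _ in range(4)]
```

In a real deployment, `mark_down` and `mark_up` would be driven by a health checker, and `next_backend` would be called once per incoming request.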


Trade-offs

Advantages

Horizontal scalability: Add servers to handle more traffic without upgrading hardware
High availability: Automatic failover when servers go down
Zero-downtime deployments: Drain traffic from servers being updated
SSL termination: Offload TLS handshake overhead from backend servers

Disadvantages

Single point of failure: The load balancer itself can fail (mitigated with redundant LBs)
Added latency: The extra network hop typically adds around 0.5-2 ms per request
Session state complexity: Stateful applications need sticky sessions or external session stores
Cost: Managed load balancers (ALB, NLB) charge per hour plus a usage-based fee (AWS bills usage in load balancer capacity units)

Sticky Sessions vs Stateless Design

Sticky sessions pin a user to the same server using cookies or IP hashing. This skews load distribution (heavy users pile up on one server) and weakens failover (when that server dies, the sessions stored on it are lost). The better approach is usually stateless servers with shared session storage (Redis, database), so that any server can handle any request.
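The stateless approach can be sketched with two application servers sharing one session store. Here an in-memory dictionary stands in for Redis or a database; the class and method names are hypothetical.

```python
import uuid


class SessionStore:
    """In-memory stand-in for an external store such as Redis."""

    def __init__(self):
        self._data = {}

    def create(self, payload):
        session_id = uuid.uuid4().hex
        self._data[session_id] = dict(payload)
        return session_id

    def get(self, session_id):
        return self._data.get(session_id)


class AppServer:
    def __init__(self, name, store):
        self.name = name
        self.store = store  # shared; the server keeps no session state itself

    def login(self, user):
        return self.store.create({"user": user})

    def whoami(self, session_id):
        session = self.store.get(session_id)
        return session["user"] if session else None


store = SessionStore()
server_a = AppServer("a", store)
server_b = AppServer("b", store)

sid = server_a.login("alice")   # first request lands on server a
user = server_b.whoami(sid)     # next request lands on server b and still works
```

Because neither server holds the session, the load balancer is free to use any algorithm, and losing one server does not log anyone out.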

Common Misconceptions

"Load balancing is just round robin" — Round robin ignores server load, request complexity, and server capacity. Production systems typically use least-connections or weighted algorithms.
"One load balancer is enough" — A single load balancer is a single point of failure. Production deployments use pairs (active-passive or active-active) with DNS failover.
"Load balancers only distribute HTTP traffic" — Layer 4 load balancers handle any TCP/UDP traffic: databases, message queues, game servers, DNS.
"More servers behind a load balancer always improves performance" — If your bottleneck is the database, adding more application servers just moves the contention point. Profile before scaling.
"Cloud load balancers handle unlimited traffic" — AWS ALB scales automatically, but it scales gradually, so a sudden traffic spike can outpace it. For an expected surge event, contact AWS support in advance to have the load balancer pre-warmed.

How This Appears in Interviews

Load balancing comes up in nearly every system design interview:

"How would you design a URL shortener that handles 10,000 requests per second?" — Multiple application servers behind a load balancer. Discuss algorithm choice and health checks.
"Your application servers are unevenly loaded" — Review the load balancing algorithm. Round robin with varying request costs leads to imbalance. Switch to least-connections.
"How do you deploy without downtime?" — Rolling deployment with the load balancer draining connections from servers being updated.
"How does Google handle billions of requests daily?" — Multiple layers: DNS-based geographic routing, Layer 4 (Maglev) for packet distribution, Layer 7 (GFE) for HTTP routing.
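For the zero-downtime deployment answer, connection draining can be sketched as follows. This is a toy model under stated assumptions: a single-process pool, hypothetical server names, and least-connections selection among servers that are not draining.

```python
class DrainingPool:
    """Stop new requests to a draining server, but let in-flight ones finish."""

    def __init__(self, servers):
        self.active = {s: 0 for s in servers}
        self.draining = set()

    def acquire(self):
        eligible = [s for s in self.active if s not in self.draining]
        if not eligible:
            raise RuntimeError("no servers accepting new requests")
        server = min(eligible, key=self.active.get)
        self.active[server] += 1
        return server

    def release(self, server):
        self.active[server] -= 1

    def start_drain(self, server):
        self.draining.add(server)

    def safe_to_update(self, server):
        # Safe only once the server is draining and has no in-flight requests.
        return server in self.draining and self.active[server] == 0


pool = DrainingPool(["s1", "s2"])
in_flight = pool.acquire()                          # lands on s1
pool.start_drain("s1")
during_drain = [pool.acquire() for _ in range(3)]   # all go to s2
blocked = pool.safe_to_update("s1")                 # one request still running
pool.release(in_flight)
ready = pool.safe_to_update("s1")                   # now safe to update s1
```

A rolling deployment repeats this for each server in turn: drain, wait until it is safe, update, then put the server back in rotation.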

See our System Design Interview Guide for end-to-end walkthroughs.

Related Concepts

Consistent Hashing — the algorithm behind many load balancing strategies
Heartbeat Mechanism — how load balancers detect server failures
Partition Tolerance — what happens when load balancers lose contact with backends
Interview Questions on System Design
System Design Interview Guide

