
Load Balancing Interview Questions for Senior Engineers (2026)

15 real load balancing interview questions asked at top tech companies, with detailed answer frameworks covering algorithms, health checks, global traffic management, and production trade-offs.

20 min read · Updated Apr 20, 2026
Tags: interview-questions, load-balancing, senior-engineer, distributed-systems, infrastructure

Why Load Balancing Matters in Senior Engineering Interviews

Load balancing is one of those topics that separates senior engineers from mid-level candidates in system design interviews. Every distributed system at scale relies on load balancing, yet most candidates only scratch the surface with a generic mention of "put a load balancer in front of the servers." At companies like Google, Amazon, Netflix, and Meta, interviewers expect you to reason deeply about load balancing algorithms, failure detection, connection draining, layer 4 vs layer 7 trade-offs, and global traffic management.

The reason load balancing appears so frequently is that it intersects with nearly every other system design topic. Whether you are designing a URL shortener, a ride-sharing platform like Uber, or a streaming service like Netflix, load balancing decisions directly impact latency, availability, and cost. A senior engineer must understand not just what load balancing does but how different strategies affect tail latency, how health checks interact with deployment strategies, and how to handle cascading failures caused by misconfigured load balancers.

For a deeper technical exploration, see how load balancing works. For broader context on distributed systems interviews, check the system design interview guide and explore the learning paths tailored to senior engineers.

1. Explain the difference between Layer 4 and Layer 7 load balancing. When would you choose one over the other?

What the interviewer is really asking: Do you understand the OSI model deeply enough to reason about performance, functionality, and cost trade-offs between transport-level and application-level load balancing?

Answer framework:

Layer 4 (L4) load balancers operate at the transport layer, making routing decisions based on IP addresses, TCP/UDP ports, and connection metadata without inspecting the actual payload. They forward raw TCP or UDP packets. Layer 7 (L7) load balancers operate at the application layer, inspecting HTTP headers, cookies, URL paths, and even request bodies to make intelligent routing decisions.

L4 load balancers are significantly faster because they do minimal processing per packet. They can handle millions of concurrent connections with low CPU overhead. AWS Network Load Balancer, for example, is designed to handle millions of requests per second while adding well under a millisecond of latency. They are ideal for non-HTTP protocols, TCP passthrough scenarios, and extremely high-throughput workloads where you do not need content-based routing.

L7 load balancers add latency (typically 1-5ms) because they must fully parse the HTTP request. However, they enable powerful features: path-based routing (send /api/* to backend services, /static/* to a CDN origin), header-based routing (route by API version or tenant ID), cookie-based session affinity, request/response transformation, and SSL termination with HTTP/2 multiplexing. AWS Application Load Balancer and NGINX operate at this layer.
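To make path- and header-based routing concrete, here is a minimal sketch of the per-request decision an L7 proxy makes. The pool names, path prefixes, and the x-api-version header are illustrative placeholders, not any particular proxy's configuration format.

  # Minimal sketch of L7 routing: header rules first, then path prefixes.
  ROUTES = [
      ("/api/", "api-backend-pool"),
      ("/static/", "cdn-origin-pool"),
  ]
  DEFAULT_POOL = "web-backend-pool"

  def select_pool(path: str, headers: dict) -> str:
      # Header-based routing takes priority, e.g. routing by API version.
      if headers.get("x-api-version") == "2":
          return "api-v2-backend-pool"
      for prefix, pool in ROUTES:
          if path.startswith(prefix):
              return pool
      return DEFAULT_POOL

  print(select_pool("/api/users/42", {}))     # api-backend-pool
  print(select_pool("/static/logo.png", {}))  # cdn-origin-pool

None of this logic is possible at L4, which never sees the path or headers.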

In practice, most modern architectures use both. L4 at the edge for raw throughput and DDoS mitigation, L7 internally for intelligent routing between microservices. At Netflix, Zuul operates as an L7 gateway while lower-level load balancing happens at L4. At Google, Maglev provides L4 load balancing at the network edge, while Envoy sidecars handle L7 routing within the service mesh.

A common mistake is defaulting to L7 for everything. If you have a gRPC service doing inter-service communication at 100K requests per second, L4 with client-side load balancing might be more efficient than routing through an L7 proxy. Another mistake is forgetting that L4 cannot do SSL termination in the traditional sense since it forwards encrypted packets directly.

Follow-up questions:

  • How does TLS termination differ between L4 and L7 load balancers?
  • Can you use L4 load balancing with HTTP/2 multiplexed connections effectively?
  • What happens to WebSocket connections at each layer?

2. Compare round-robin, least-connections, and weighted load balancing algorithms. What are the failure modes of each?

What the interviewer is really asking: Can you reason about algorithm behavior under real-world conditions like heterogeneous servers, slow requests, and varying payload sizes?

Answer framework:

Round-robin distributes requests sequentially across servers. It is dead simple and works well when all servers are identical and all requests take roughly the same time. The failure mode appears when requests have vastly different processing times. If one in ten requests takes 30 seconds (a report generation query, for example), round-robin can accidentally stack several slow requests on the same server, causing it to become overloaded while other servers sit idle. This is the "unlucky server" problem.

Least-connections routes each new request to the server with the fewest active connections. This naturally adapts to heterogeneous workloads because a slow server accumulates connections and stops receiving new ones. However, it has its own failure mode: during a rolling deployment, a freshly restarted server has zero connections and gets flooded with requests before it is fully warmed up (JVM class loading, cache population, connection pool initialization). This can cause the new instance to crash, triggering a cascading restart cycle.

Weighted algorithms (weighted round-robin or weighted least-connections) assign a weight to each server reflecting its capacity. A server with weight 3 gets three times the traffic of a server with weight 1. This handles heterogeneous hardware well. The failure mode is stale weights: if a server's capacity degrades (memory leak, noisy neighbor on shared infrastructure) but its weight remains static, it receives more traffic than it can handle. This is why dynamic weight adjustment based on real-time metrics is critical in production.

More advanced algorithms exist. Power-of-two-random-choices (P2C) picks two random servers and sends the request to the one with fewer active connections. This achieves near-optimal distribution with O(1) decision time and avoids the synchronization overhead of true least-connections in distributed load balancer deployments. Envoy's least-request policy is built on P2C, and the technique is widely used in large-scale systems such as Uber's. Peak EWMA (Exponentially Weighted Moving Average) tracks the latency of each server and favors faster ones, accounting for both load and inherent server speed differences.
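As a rough illustration, a single P2C selection step can be sketched in a few lines of Python; the backend addresses and in-flight counts are placeholders.

  import random

  # Sketch of power-of-two-random-choices: sample two backends at random
  # and route to the one with fewer in-flight requests.
  backends = {"10.0.0.1": 3, "10.0.0.2": 17, "10.0.0.3": 5}  # active requests

  def pick_backend(active: dict) -> str:
      a, b = random.sample(list(active), 2)
      return a if active[a] <= active[b] else b

  chosen = pick_backend(backends)
  backends[chosen] += 1  # count the new in-flight request

Because only two counters are read per decision, the cost stays constant even when the balancer tracks thousands of backends.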

At Amazon, internal load balancers use a combination of least-outstanding-requests with health-adjusted weights, where a server's weight is dynamically reduced if its error rate increases or latency spikes. Understanding these real-world adaptations demonstrates senior-level depth.

Follow-up questions:

  • How would you implement load balancing for servers with different CPU and memory configurations?
  • What is the thundering herd problem in the context of least-connections after a server restart?
  • How does the choice of algorithm change for long-lived connections vs short-lived requests?

3. How do you design health checks for a load balancer? What is the difference between active and passive health checks?

What the interviewer is really asking: Do you understand the nuances of failure detection, the trade-offs between speed and accuracy, and how health checks interact with deployments and cascading failures?

Answer framework:

Active health checks are periodic probes sent by the load balancer to each backend server. A typical configuration pings a /healthz endpoint every 10 seconds with a 5-second timeout. After 3 consecutive failures, the server is marked unhealthy and removed from the rotation. After 2 consecutive successes, it is restored. This is the most common approach used by AWS ALB, NGINX, and HAProxy.

The critical design decision is what the health check endpoint actually validates. A shallow health check (returning 200 OK immediately) only confirms the process is running. A deep health check verifies database connectivity, cache connectivity, and downstream dependency availability. The trade-off: deep checks catch more failure modes but can cause a healthy server to be marked unhealthy because an unrelated dependency is slow. This is especially dangerous during network partitions (see the CAP theorem), when a single shared dependency problem can fail every server's deep check at once and empty the entire pool.

Passive health checks (also called outlier detection) monitor actual traffic to detect failures. If a server returns 5 consecutive 5xx errors or exceeds a latency threshold on real requests, it is ejected. Envoy's outlier detection uses this approach. The advantage is zero additional network traffic and detection of failures that synthetic health checks miss (for example, a server that responds to health checks but returns errors for certain query patterns). The disadvantage is that real users experience the failures before detection kicks in.

The best production systems use both. Active health checks as a baseline with a 10-30 second interval, and passive health checks for rapid detection of runtime failures. Configure aggressive passive detection (eject after 3 consecutive 5xx errors) with conservative active restoration (require 5 consecutive successes before re-adding).
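A sketch of the per-backend state machine implied by that configuration is below; the thresholds mirror the numbers above (3 probe failures to eject, 5 successes to restore, 3 consecutive 5xx responses for passive ejection) and should be tuned per system.

  # Per-backend health state combining active probes and passive detection.
  class BackendHealth:
      def __init__(self):
          self.healthy = True
          self.probe_failures = 0
          self.probe_successes = 0
          self.consecutive_5xx = 0

      def on_probe(self, ok: bool):
          # Active health check result from the periodic /healthz probe.
          if ok:
              self.probe_failures = 0
              self.probe_successes += 1
              if not self.healthy and self.probe_successes >= 5:
                  self.healthy = True
          else:
              self.probe_successes = 0
              self.probe_failures += 1
              if self.probe_failures >= 3:
                  self.healthy = False

      def on_response(self, status: int):
          # Passive (outlier) detection based on real traffic.
          self.consecutive_5xx = self.consecutive_5xx + 1 if status >= 500 else 0
          if self.consecutive_5xx >= 3:
              self.healthy = False
              self.probe_successes = 0  # require active probes to restore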

Critical pitfalls: setting health check intervals too aggressively (every 1 second across 1000 backends means 1000 health check requests per second from the load balancer alone). Forgetting to implement graceful shutdown where a server returns unhealthy from its health check before it stops accepting connections, allowing in-flight requests to complete. Not implementing circuit breaking when too many backends are unhealthy, which can cause cascading failures as surviving servers get overwhelmed. For related patterns, see how consistent hashing helps redistribute load during partial failures.

Follow-up questions:

  • How do you prevent a flapping health check from destabilizing the pool?
  • What should happen when 80% of backends fail health checks simultaneously?
  • How do health checks interact with blue-green deployments?

4. Explain consistent hashing and its role in load balancing. How does it handle node failures?

What the interviewer is really asking: Can you explain this fundamental distributed systems concept clearly, discuss virtual nodes, and articulate why consistent hashing matters for stateful services?

Answer framework:

Consistent hashing arranges both servers and request keys on a circular hash ring (0 to 2^32 - 1). Each request key is hashed and assigned to the first server encountered clockwise on the ring. When a server is removed, only the keys that mapped to that server are redistributed to the next server on the ring. With N servers, only K/N keys are remapped on average (where K is the total number of keys), compared to traditional modulo hashing where nearly all keys are remapped.

The naive implementation has a serious problem: with only a few physical servers, the hash space is divided unevenly, leading to hot spots. The solution is virtual nodes (vnodes). Each physical server is mapped to 100-200 points on the ring, which provides a statistically even distribution; Amazon's Dynamo paper popularized this technique. The trade-off is increased memory for the ring metadata and slightly slower lookups, but this is negligible in practice.
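A minimal hash ring with virtual nodes can be sketched as follows; MD5 and 100 vnodes per server are illustrative choices rather than any specific system's parameters.

  import bisect
  import hashlib

  def _hash(key: str) -> int:
      # Map any string onto the 0..2^32-1 ring.
      return int(hashlib.md5(key.encode()).hexdigest(), 16) % (2**32)

  class HashRing:
      def __init__(self, servers, vnodes=100):
          # Each server contributes `vnodes` points, spreading it around the ring.
          self.ring = sorted((_hash(f"{s}#{i}"), s)
                             for s in servers for i in range(vnodes))
          self.points = [p for p, _ in self.ring]

      def lookup(self, key: str) -> str:
          # First point clockwise from the key's hash, wrapping past the end.
          idx = bisect.bisect(self.points, _hash(key)) % len(self.ring)
          return self.ring[idx][1]

  ring = HashRing(["cache-a", "cache-b", "cache-c"])
  print(ring.lookup("user:42"))  # the same key always maps to the same node

Removing a server means deleting its points from the ring; only keys that fell on those points move to the next node clockwise.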

Consistent hashing is essential for stateful load balancing scenarios: session affinity (route a user to the same backend to use in-memory session state), distributed caches (ensure requests for the same cache key always hit the same cache node to maximize hit rates), and sharded databases (route queries for a given shard key to the correct database instance).

When a node fails, its keys smoothly migrate to adjacent nodes on the ring. When a new node is added, it takes a proportional share of keys from its neighbors. This minimal disruption is critical for systems like distributed caches where remapping keys causes cache misses that can cascade into database overload (a thundering herd or cold cache storm).

For systems with heterogeneous server capacities, assign more virtual nodes to more powerful servers. A server with 2x the memory gets 2x the virtual nodes and therefore handles 2x the traffic. This integrates naturally with weighted load balancing.

At Google, consistent hashing is used in the Maglev load balancer with a custom hash table that provides both consistent hashing properties and connection-tracking. Understanding this level of implementation detail demonstrates real production experience.

Follow-up questions:

  • How does bounded-load consistent hashing (from Google's Consistent Hashing with Bounded Loads paper) prevent hot spots?
  • What happens to consistent hashing performance with very high churn (servers joining and leaving rapidly)?
  • How would you implement consistent hashing for a service that needs to rebalance data when nodes are added?

5. How does a global load balancer (GSLB) work? Describe the architecture for routing traffic across multiple data centers.

What the interviewer is really asking: Do you understand GeoDNS, anycast, BGP, and the latency/availability trade-offs of multi-region architectures?

Answer framework:

Global Server Load Balancing distributes traffic across geographically dispersed data centers. There are three primary mechanisms.

DNS-based GSLB (GeoDNS): the authoritative DNS server examines the client's resolver IP address, maps it to a geographic region using a GeoIP database, and returns the IP address of the nearest data center. The client then connects directly to that data center. The advantage is simplicity and wide compatibility. The disadvantages are DNS caching (clients and resolvers cache DNS responses, so failover is slow, typically 30-300 seconds depending on TTL), resolver location may not match user location (corporate proxies, public resolvers like Google's 8.8.8.8), and coarse-grained routing (per data center, not per server). AWS Route 53, Cloudflare, and Akamai all provide GeoDNS.

Anycast routing: the same IP address is announced via BGP from multiple data center locations. The internet's routing infrastructure naturally directs packets to the nearest announcement point. This provides instant failover (BGP reconverges in seconds when a location goes down), works for UDP naturally, and is the foundation of most CDN architectures. The challenge is TCP: if a BGP route changes mid-connection, packets may be routed to a different data center that has no knowledge of the TCP state, causing the connection to break. Solutions include ECMP (Equal-Cost Multi-Path) pinning and connection tracking. Cloudflare serves all traffic via anycast.

Application-level GSLB: a centralized controller monitors all data centers and directs traffic based on real-time metrics (server health, current load, estimated latency). This provides the most intelligent routing but adds a dependency on the controller's availability. Often combined with DNS-based GSLB where the controller updates DNS records dynamically.

For multi-region architecture, design for active-active (all regions serve traffic simultaneously) rather than active-passive (one region serves, others are standby). Active-active provides better latency and resource utilization. The challenge is data consistency across regions, which connects to eventual consistency and the CAP theorem. For Netflix's architecture, active-active across three AWS regions ensures any single region can fail without user impact.

A common interview mistake is ignoring the data layer. GSLB routes requests to the nearest data center, but if the data center does not have the user's data (because the user's primary region is elsewhere), the request must be proxied or redirected, adding latency. This is why data locality and user-region affinity are critical.

Follow-up questions:

  • How would you handle failover when an entire data center goes offline?
  • What is the impact of DNS TTL on failover time, and how do you balance TTL against DNS query volume?
  • How do you route traffic for a user who is physically in Europe but whose data resides in US-East?

6. What is connection draining and why is it critical during deployments?

What the interviewer is really asking: Do you think about the operational details that prevent user-facing errors during routine maintenance and deployments?

Answer framework:

Connection draining (also called graceful shutdown or deregistration delay) is the process of allowing existing in-flight requests to complete while stopping the assignment of new requests to a server that is being removed from the pool. Without connection draining, removing a server causes all active requests on that server to fail immediately, resulting in user-visible errors.

The implementation works in phases. First, the server signals it is shutting down (via health check returning unhealthy, or explicit API call to the load balancer). Second, the load balancer stops sending new requests to that server. Third, existing connections are allowed to continue until they complete or a configurable timeout expires (typically 30-300 seconds). Fourth, after the timeout, any remaining connections are forcibly terminated.
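A simplified server-side sketch of those phases, using a 30-second timeout from the typical range above (the in_flight counter would be maintained by the request handlers):

  import signal
  import threading
  import time

  DRAIN_TIMEOUT = 30.0          # illustrative drain window in seconds
  draining = threading.Event()
  in_flight = 0                 # incremented/decremented by request handlers
  lock = threading.Lock()

  def healthz() -> int:
      # Phase 1-2: start failing health checks so the LB stops sending traffic.
      return 503 if draining.is_set() else 200

  def handle_sigterm(signum, frame):
      draining.set()            # Phase 1: signal shutdown
      deadline = time.time() + DRAIN_TIMEOUT
      while time.time() < deadline:
          with lock:
              if in_flight == 0:  # Phase 3: in-flight requests finished
                  break
          time.sleep(0.5)
      raise SystemExit(0)       # Phase 4: force exit after the timeout

  signal.signal(signal.SIGTERM, handle_sigterm)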

This is critical for several deployment scenarios. During rolling deployments, you are replacing instances one by one. Without draining, each replacement causes a burst of errors for users whose requests were on the old instance. For long-lived connections like WebSockets (used in real-time chat or live dashboards), abrupt termination disconnects users and forces re-establishment. For database connections, abruptly killing a connection during a transaction can leave the database in an inconsistent state.

The draining timeout must be tuned carefully. Too short (5 seconds) and long-running requests like file uploads or report generation are terminated. Too long (10 minutes) and deployments take forever across a fleet of hundreds of servers (if you have 200 servers and drain for 5 minutes each with 10 servers draining in parallel, that is 100 minutes for a deployment). At Amazon, most services use a 30-second drain timeout with special handling for known long-running operations.

Advanced considerations: during connection draining, the server should stop initiating new outbound work (background jobs, async processing) while completing in-flight work. Kubernetes implements this via the preStop hook and terminationGracePeriodSeconds. In AWS, target group deregistration delay controls the drain timeout.

Common mistake: not testing your draining behavior under load. Many teams discover their drain timeout is too short only during peak traffic when requests take longer than usual. Related: learn about how load balancing works in production environments.

Follow-up questions:

  • How do you handle connection draining for gRPC streams that may last hours?
  • What happens if a server crashes instead of shutting down gracefully?
  • How does connection draining interact with auto-scaling events?

7. How would you implement session affinity (sticky sessions)? What are the trade-offs?

What the interviewer is really asking: Do you understand why sticky sessions are often an anti-pattern, and can you articulate when they are genuinely necessary versus when they indicate a design flaw?

Answer framework:

Session affinity ensures that all requests from a given user are routed to the same backend server. Implementation approaches include cookie-based affinity (the load balancer sets a cookie like SERVERID=backend-3, and subsequent requests include this cookie), IP-based affinity (hash the client IP to a consistent server), and header-based affinity (route based on a custom header like X-Session-ID).

Cookie-based is the most reliable because IP-based breaks when users are behind NAT (all users from a corporate network share one IP) or change networks (mobile users switching from WiFi to cellular). Consistent hashing on session ID provides the best balance of stickiness and even distribution.

The trade-offs are significant. First, uneven load distribution: some users generate far more traffic than others. A sticky session means a heavy user is always on the same server, potentially overloading it while others are idle. Second, reduced availability: if a server goes down, all users stuck to it must re-establish their sessions elsewhere, causing a burst of session re-creation. Third, deployment complications: during rolling deployments, you cannot drain users off a server quickly because their sessions are pinned to it. Fourth, horizontal scaling becomes harder because you cannot freely redistribute users.

The senior-level insight is that session affinity is often a code smell. It usually means the application stores session state in server memory instead of an external store like Redis or a database. The better architecture is stateless servers with externalized session state. This lets the load balancer freely distribute requests, simplifies scaling, and eliminates the session-loss problem on server failure.

However, there are legitimate uses. WebSocket connections are inherently stateful and must maintain affinity for the connection duration. Some workloads benefit from local caching where routing the same user to the same server improves cache hit rates (for example, a user's recently accessed data is in the server's local cache). In-memory computation like real-time analytics aggregation sometimes requires affinity.

At Google, the Maglev load balancer maintains connection affinity while supporting consistent hashing, allowing backend changes without disrupting existing connections. For further reading, see the distributed systems guide.

Follow-up questions:

  • How do you handle sticky sessions when auto-scaling adds or removes servers?
  • What happens to sticky sessions during a data center failover?
  • How would you migrate from sticky sessions to stateless architecture without downtime?

8. Describe how you would handle a cascading failure caused by a load balancer misconfiguration.

What the interviewer is really asking: Have you experienced real production incidents? Can you reason about failure modes, circuit breaking, and graceful degradation?

Answer framework:

A classic cascading failure scenario: your service has 10 backends, each handling 1,000 QPS (10,000 QPS total, at 70% capacity). One server fails a health check and is removed. The remaining 9 servers now receive 1,111 QPS each, pushing them to 78% capacity. Under higher load, response times increase, causing more health check timeouts. Two more servers are removed. The remaining 7 now receive 1,428 QPS each, approaching 100% capacity. They start failing. Within minutes, all servers are marked unhealthy and the service is down.

The root causes: health check thresholds were too aggressive (removing servers too quickly), there was no load shedding mechanism (servers accepted all traffic until they crashed), and no minimum healthy threshold (the load balancer happily routed all traffic to fewer and fewer servers).

Prevention strategies. First, configure a panic threshold: if more than 50% of backends are unhealthy, route traffic to all backends including unhealthy ones. This is better than concentrating load on the few remaining healthy servers. Envoy implements this as the healthy panic threshold in its load balancer. Second, implement load shedding at the server level: when CPU exceeds 85% or the request queue depth exceeds a threshold, start rejecting new requests with 503 status codes. This prevents a server from being crushed by traffic it cannot handle. Third, use circuit breakers between services: if a downstream dependency is failing, fail fast instead of tying up connections waiting for timeouts. This prevents slow dependencies from consuming your connection pool and causing your service to become unavailable.

Fourth, implement retry budgets rather than fixed retry counts. If 20% of requests to a backend are retries, stop retrying. Without this, retries amplify load during failures. Each failed request generates 3 retries, tripling the load on an already struggling service. Fifth, pre-provision headroom: run at 50-60% capacity so you can absorb the load from failed instances without cascading. This costs more but is essential for critical services.
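A retry budget can be sketched as a simple counter pair; the 20% ratio matches the figure above, and the lack of a sliding window is a deliberate simplification.

  # Sketch of a retry budget: retries are allowed only while they stay
  # under a fixed fraction of total requests.
  class RetryBudget:
      def __init__(self, ratio=0.2):
          self.ratio = ratio
          self.requests = 0
          self.retries = 0

      def record_request(self):
          self.requests += 1

      def can_retry(self) -> bool:
          if self.retries < self.ratio * max(self.requests, 1):
              self.retries += 1
              return True
          return False  # budget exhausted: fail fast instead of amplifying load

  budget = RetryBudget()
  for _ in range(100):
      budget.record_request()
  print(budget.can_retry())  # True while retries remain under 20% of traffic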

At Amazon, the 2020 Kinesis outage involved a failure in the front-end fleet that cascaded across the whole service, and the public post-mortem called out reducing blast radius through cell-based architecture as a key remediation. Understanding real incidents like this demonstrates senior-level experience.

Follow-up questions:

  • How do you distinguish between a server that is unhealthy and a server that is overloaded?
  • How would you implement automatic rollback of a bad deployment that causes cascading failures?
  • What metrics would you monitor to detect the early stages of a cascading failure?

9. How does load balancing work in a microservices architecture with service meshes?

What the interviewer is really asking: Do you understand modern infrastructure patterns like sidecar proxies, client-side load balancing, and service discovery?

Answer framework:

In a microservices architecture, load balancing shifts from centralized hardware/software load balancers to distributed, client-side load balancing. There are three common patterns.

Client-side load balancing: the calling service discovers available instances via a service registry (Consul, etcd, Kubernetes DNS) and makes the routing decision locally. gRPC's built-in load balancing uses this approach. The advantage is eliminating a network hop through a central load balancer. The disadvantage is that load balancing logic is embedded in every service, making updates difficult.

Service mesh with sidecar proxies: each service instance has a sidecar proxy (Envoy in Istio, Linkerd's proxy) that intercepts all network traffic. The sidecar handles load balancing, retries, circuit breaking, mTLS, and observability transparently. The application code is unaware of these concerns. At Lyft, where Envoy originated, essentially all service-to-service traffic flows through Envoy proxies. The advantages are language-agnostic load balancing (the same proxy handles Go, Java, Python services), centralized policy management, and consistent observability. The disadvantage is added latency (each hop adds ~1ms) and operational complexity of managing the mesh.

Server-side load balancing with a central proxy: traditional approach where all traffic routes through a central load balancer like NGINX, HAProxy, or a cloud load balancer. Simpler to operate but creates a potential bottleneck and single point of failure.

In a service mesh, load balancing algorithms are configured per-service. Critical services might use P2C with real-time latency tracking, while batch processing services use simple round-robin. The control plane (Istio's istiod, for example) pushes load balancing configuration to all sidecar proxies.

Service discovery is the foundation of all these patterns. In Kubernetes, services are discovered via DNS (service-name.namespace.svc.cluster.local) or the Kubernetes API. The load balancer or sidecar watches for endpoint changes and updates its routing table in real-time. For Kafka-based event-driven architectures, load balancing happens through consumer groups and partition assignment rather than traditional request routing. See Kafka vs RabbitMQ for more on message-based load distribution.

Follow-up questions:

  • How does service mesh load balancing handle cross-cluster routing?
  • What is the performance overhead of a sidecar proxy?
  • How do you implement locality-aware load balancing to prefer same-zone backends?

10. How would you load balance gRPC traffic differently from HTTP/1.1 traffic?

What the interviewer is really asking: Do you understand HTTP/2 multiplexing and why traditional load balancing breaks with persistent multiplexed connections?

Answer framework:

HTTP/1.1 typically carries one request at a time per connection (keep-alive lets the connection be reused for sequential requests). A standard L4 load balancer distributes connections across backends, and since each connection carries roughly one request at a time, load is distributed evenly.

gRPC uses HTTP/2, which multiplexes many concurrent requests over a single long-lived TCP connection. If an L4 load balancer distributes TCP connections, and a client opens one connection to a backend, all of that client's requests go to the same backend regardless of load. With 10 backends and 10 client connections, you might have all connections to 3 backends and 7 backends sitting idle.

There are several solutions. L7 load balancing: an L7 load balancer (Envoy, NGINX with gRPC module) that understands HTTP/2 can distribute individual gRPC requests across backends even within a single client connection. The load balancer terminates the client's HTTP/2 connection and opens separate connections to backends. This is the most transparent solution but adds latency.

Client-side load balancing: gRPC has built-in support for client-side load balancing. The client resolves the service name to multiple backend addresses (via DNS, service registry, or a custom resolver), opens connections to multiple backends, and distributes requests across them. gRPC supports round-robin and pick-first (failover) policies out of the box, and custom policies like weighted round-robin or P2C can be implemented.

Lookaside load balancing (external load balancer): the client queries an external load balancing service (like Google's gRPC-LB or Envoy's xDS API) for backend assignments. The external service has a global view of backend health and load, providing better decisions than the client can make alone. This is used at Google internally for most gRPC services.

The headless service pattern in Kubernetes is common for gRPC: instead of using a ClusterIP service (which creates a single virtual IP that acts as an L4 load balancer), use a headless service that returns all pod IPs via DNS. The gRPC client then opens connections to all pods and performs client-side load balancing.
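With gRPC's Python client, for example, client-side round-robin against a headless service can be requested through the channel's service config; the service name below is a placeholder, and the option shown is one common way to select the policy.

  import json
  import grpc

  # Ask the DNS resolver for all pod IPs behind a (hypothetical) headless
  # service and round-robin RPCs across them from the client.
  service_config = json.dumps({"loadBalancingConfig": [{"round_robin": {}}]})

  channel = grpc.insecure_channel(
      "dns:///orders-svc.default.svc.cluster.local:50051",
      options=[("grpc.service_config", service_config)],
  )
  # Stubs created on this channel spread RPCs across all resolved pods.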

Common mistake: using a standard Kubernetes Service (ClusterIP) with gRPC and wondering why load is uneven. The ClusterIP is an L4 construct that distributes connections, not requests. You need either headless services with client-side LB or an L7 load balancer like Istio/Envoy.

Follow-up questions:

  • How do you handle connection-level flow control in HTTP/2 when load balancing?
  • What happens to in-flight gRPC requests when a backend is removed?
  • How would you implement graceful connection migration for long-lived gRPC streams?

11. What is the difference between a reverse proxy and a load balancer? When do you need both?

What the interviewer is really asking: Can you distinguish between related but distinct concepts, and do you understand the security and caching roles that a reverse proxy provides beyond load balancing?

Answer framework:

A reverse proxy sits in front of backend servers and forwards client requests to them. A load balancer distributes traffic across multiple backend servers. In practice, most reverse proxies include load balancing capabilities, and most load balancers act as reverse proxies, which is why the concepts are often confused.

The key distinction is in their primary purpose. A reverse proxy's primary functions include: hiding backend server topology from clients (clients only see the proxy's IP), SSL/TLS termination, response caching, request/response compression, rate limiting, and Web Application Firewall (WAF) integration. A load balancer's primary function is distributing traffic for scalability and availability.

You need both (or a single component that performs both roles) in most production architectures. NGINX, HAProxy, and Envoy all serve as both reverse proxy and load balancer. Cloud services often separate them: AWS has CloudFront (reverse proxy/CDN) in front of ALB (load balancer) in front of target groups.

A common architecture at scale: client requests hit a CDN edge (reverse proxy with caching), then a global load balancer (GSLB) routes to the nearest regional data center, then a regional reverse proxy handles SSL termination and rate limiting, then a local load balancer distributes to backend servers. Each layer serves a distinct purpose.

In a microservices architecture, an API gateway (like Kong, Ambassador, or AWS API Gateway) combines reverse proxy and load balancing with API-specific features: authentication, request transformation, rate limiting per API key, and request routing based on path or headers. For REST vs GraphQL architectures, the API gateway plays different roles in routing and aggregating requests.

The senior-level nuance is understanding when to collapse these layers and when to keep them separate. For a small service with 5 backends, a single NGINX instance handling both reverse proxy and load balancing is appropriate. For a global service like Netflix with millions of concurrent users, each layer is a separate specialized system.

Follow-up questions:

  • How does a reverse proxy improve security compared to exposing backends directly?
  • When would you use a separate caching layer vs the reverse proxy's built-in cache?
  • How do you handle request tracing through multiple proxy layers?

12. How do you load balance database traffic? What are the challenges compared to stateless application servers?

What the interviewer is really asking: Do you understand the fundamental difference between load balancing stateless compute and stateful storage, and can you handle read replicas, write routing, and connection pooling?

Answer framework:

Database load balancing is fundamentally different from application load balancing because databases are stateful. You cannot simply round-robin queries to any database server. You must distinguish between reads and writes, handle replication lag, and manage connection limits.

For read/write splitting, the most common pattern is to route all writes to the primary database and distribute reads across read replicas. This requires query classification at the proxy layer. Tools like ProxySQL (for MySQL) and Pgpool-II (for PostgreSQL) provide this capability. The challenge is replication lag: after a write to the primary, the data might not be available on replicas for milliseconds to seconds. If a user writes and then immediately reads (read-your-writes consistency), routing the read to a replica might return stale data. Solutions include: route reads to the primary for a short window after a write (1-2 seconds), use semi-synchronous replication (primary waits for at least one replica to confirm the write), or track the replication position and route reads to replicas that have caught up.
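The read-your-writes window can be sketched at the application layer like this; the 2-second window and the connection labels are illustrative.

  import time

  READ_YOUR_WRITES_WINDOW = 2.0  # seconds; tune to observed replication lag
  last_write_at = {}             # user_id -> timestamp of that user's last write

  def route(user_id: str, is_write: bool) -> str:
      now = time.time()
      if is_write:
          last_write_at[user_id] = now
          return "primary"
      if now - last_write_at.get(user_id, 0) < READ_YOUR_WRITES_WINDOW:
          return "primary"       # replica may still be behind; avoid a stale read
      return "replica"

  print(route("u1", is_write=True))    # primary
  print(route("u1", is_write=False))   # primary (still inside the window)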

Connection pooling is critical because database connections are expensive (PostgreSQL forks a process per connection, MySQL creates a thread). Application servers might have 100 instances, each wanting 10 connections, requiring 1,000 database connections. A connection pooler like PgBouncer can multiplex 1,000 application connections onto 100 database connections. At the load balancer level, ensure connection limits per backend to prevent a traffic surge from overwhelming the database.

For sharded databases, load balancing is really routing: the load balancer must inspect the query (or a routing key in the request) to determine which shard holds the data. This is typically application-level routing using consistent hashing rather than traditional load balancing.

At AWS, Amazon RDS Proxy provides connection pooling and fast failover for Aurora, and Aurora's reader endpoint distributes read traffic across replicas. Understanding these managed service implementations shows you can leverage existing solutions rather than building from scratch. For context on data consistency challenges, see the CAP theorem.

Follow-up questions:

  • How do you handle a failover from primary to replica with minimal downtime?
  • What happens to in-flight transactions during a database failover?
  • How would you implement cross-shard queries with database load balancing?

13. Explain the concept of load shedding. How do you implement it without degrading user experience?

What the interviewer is really asking: Can you design systems that fail gracefully under extreme load, prioritizing important traffic over less critical work?

Answer framework:

Load shedding is the practice of intentionally dropping requests when a system is overloaded, rather than attempting to process all requests and failing at all of them due to resource exhaustion. The analogy is from power grids: during peak demand, utilities intentionally cut power to less critical areas to prevent a total grid failure.

Implementation approaches. First, queue-based shedding: set a maximum request queue depth. When the queue is full, reject new requests immediately with a 503 status code and a Retry-After header. This prevents requests from waiting in queue for so long that the client has already timed out, wasting server resources processing a request that nobody is waiting for.

Second, priority-based shedding: classify requests by priority. Payment processing is critical (never shed), search queries are important (shed under extreme load), recommendation refreshes are optional (shed early). Use request headers or API key metadata to assign priority. When load exceeds capacity, start shedding from the lowest priority up.

Third, adaptive shedding based on latency: monitor the server's P99 latency in real-time. When P99 exceeds an SLO threshold (say, 500ms), start rejecting a percentage of requests. Increase the rejection rate as latency rises. This creates a feedback loop that stabilizes latency at the target. The CoDel (Controlled Delay) algorithm, originally designed for network queue management and since adapted for server request queues, follows this approach.
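A sketch of the feedback loop, using the 500ms target from above and a simple linear ramp (both illustrative):

  import random

  P99_TARGET_MS = 500.0
  MAX_REJECT_FRACTION = 0.9

  def should_shed(current_p99_ms: float, priority: str) -> bool:
      # Never shed critical traffic; ramp up rejection as p99 exceeds the SLO.
      if priority == "critical":
          return False
      overload = (current_p99_ms - P99_TARGET_MS) / P99_TARGET_MS
      reject_fraction = min(max(overload, 0.0), MAX_REJECT_FRACTION)
      return random.random() < reject_fraction

  # At p99 = 750ms, roughly half of non-critical requests are rejected
  # with a 503 and a Retry-After header.
  print(should_shed(750.0, "optional"))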

Fourth, client-cooperative shedding: the server returns a 503 with Retry-After header, and well-behaved clients implement exponential backoff with jitter. The load balancer can assist by returning 503 responses at the proxy layer before requests even reach the backend, reducing backend load.

The key to maintaining user experience during shedding is graceful degradation: instead of returning errors, return degraded responses. A product page could show cached prices instead of real-time prices, skip the personalized recommendation section, and use a cached product description. The user sees a slightly stale page instead of an error. This approach is used extensively at Netflix and Amazon, as described in the distributed systems guide.

Common mistake: implementing load shedding without observability. You must have dashboards and alerts showing shed rates by priority, so you can detect when shedding is occurring and diagnose the root cause.

Follow-up questions:

  • How do you prevent load shedding from disproportionately affecting certain users?
  • How do you test load shedding behavior without impacting production traffic?
  • What is the difference between load shedding and rate limiting?

14. How do you test a load balancer configuration before deploying to production?

What the interviewer is really asking: Do you follow rigorous operational practices, including testing infrastructure changes with the same discipline as code changes?

Answer framework:

Load balancer misconfigurations are one of the most common causes of production outages. Testing must be systematic and multi-layered.

First, configuration validation and linting: before any deployment, validate the configuration syntax. NGINX has nginx -t, HAProxy has haproxy -c, and infrastructure-as-code tools like Terraform have terraform plan. For cloud load balancers, use IaC with plan/preview stages. Integrate this into CI/CD so invalid configurations never reach production.

Second, integration testing in a staging environment: deploy the configuration to a staging environment that mirrors production topology. Run synthetic traffic that exercises all routing rules: different URL paths (for path-based routing), different headers (for header-based routing), different client IPs (for geo-routing). Verify that requests land on the correct backend groups. Check that health check endpoints are accessible and return the expected status codes.

Third, load testing: use tools like k6, Locust, or Gatling to generate realistic traffic patterns at production-like volumes. Measure the load balancer's impact on latency (P50, P99, P99.9), verify even distribution across backends, test behavior when backends are slower than normal, and verify connection draining during simulated deployments.
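As an example, a small Locust scenario that exercises two routing rules behind the load balancer might look like this (paths and wait times are illustrative):

  from locust import HttpUser, task, between

  class ApiUser(HttpUser):
      wait_time = between(0.01, 1.0)  # think time between requests

      @task(3)
      def api_call(self):
          self.client.get("/api/items")      # should land on the API pool

      @task(1)
      def static_asset(self):
          self.client.get("/static/app.js")  # should land on the static pool

Point the run at the staging load balancer and compare per-backend request counts and latency percentiles against the expected distribution.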

Fourth, chaos testing: intentionally fail backends and verify the load balancer routes around them correctly. Kill a backend and measure time-to-detection (how long until the health check marks it unhealthy) and time-to-recovery (how long until the load balancer detects the backend is back). Test what happens when more than half the backends fail simultaneously (panic threshold behavior).

Fifth, canary deployment of the configuration: apply the new configuration to a small percentage of traffic (1-5%) using a separate load balancer or traffic splitting. Compare error rates and latency between the canary and the existing configuration. Promote to full traffic only if metrics are within acceptable bounds.

Sixth, automated rollback: if post-deployment metrics (error rate, latency, request volume) deviate beyond thresholds within the first 15 minutes, automatically roll back to the previous configuration. This requires versioned configurations and automated deployment tooling.

At Amazon, load balancer configuration changes go through the same review and deployment pipeline as code changes, including mandatory testing in a gamma (pre-production) environment.

Follow-up questions:

  • How do you simulate geographic routing in a test environment?
  • What metrics would trigger an automatic rollback of a load balancer configuration?
  • How do you handle configuration drift between load balancer instances?

15. Design a load balancing solution for a globally distributed API serving 10 million requests per second.

What the interviewer is really asking: Can you synthesize all load balancing concepts into a coherent, production-grade architecture at extreme scale?

Answer framework:

At 10M RPS globally, you need a multi-tier architecture with redundancy at every layer.

Tier 1: DNS and Anycast. Use anycast to advertise the same IP addresses from PoPs in all major regions (North America, Europe, Asia-Pacific, South America). Anycast provides automatic geo-routing via BGP and DDoS resilience since attack traffic is distributed across all PoPs. Complement with GeoDNS for clients that need region-specific responses (data residency requirements). Set DNS TTL to 60 seconds for reasonable failover speed.

Tier 2: Edge PoPs with L4 load balancing. At each PoP, deploy L4 load balancers (DPDK-based software load balancers or dedicated hardware) using Direct Server Return (DSR) for maximum throughput. DSR means return traffic goes directly from the backend to the client, bypassing the load balancer. This is critical at 10M RPS because the load balancer only handles incoming packets, not response traffic (which is 10-100x larger by bytes). Use ECMP (Equal-Cost Multi-Path) across multiple L4 instances for horizontal scaling. Maglev-style consistent hashing ensures connection affinity across the ECMP group.

Tier 3: L7 load balancing and API gateway. Behind the L4 tier, deploy L7 proxies (Envoy) that handle TLS termination, HTTP/2, authentication, rate limiting, and path-based routing. Scale to hundreds of instances per PoP. Use P2C or weighted least-connections for backend selection.

Tier 4: Backend service clusters. Deploy backend services across multiple availability zones within each region. Use service mesh (Istio/Envoy sidecars) for inter-service load balancing with locality-aware routing (prefer same-zone backends, fall back to cross-zone). For stateful services, use consistent hashing to minimize redistribution during scaling events.

Cross-cutting concerns: centralized observability with per-tier metrics (connections, RPS, latency histograms, error rates); automated failover that shifts traffic to adjacent PoPs via DNS weight adjustment when a PoP's error rate exceeds 5%; and capacity planning based on per-PoP headroom, targeting 60% utilization so the fleet can absorb regional failover traffic.

The cost architecture matters at this scale. 10M RPS through a cloud load balancer like AWS ALB at approximately $0.008 per LCU-hour would cost hundreds of thousands per month. At this scale, software load balancers on bare metal or reserved instances are significantly cheaper. Companies like Netflix, Google, and Amazon all run custom load balancing infrastructure at this scale. For more on these architectures, see the system design interview guide and pricing considerations.

Follow-up questions:

  • How would you handle a DDoS attack targeting 100M RPS at this architecture?
  • How do you roll out load balancer software updates across all PoPs without downtime?
  • What is the failure domain of this architecture and what is the blast radius of a single component failure?

Common Mistakes in Load Balancing Interviews

  1. Treating load balancing as a solved, simple problem. Saying "just add an ELB" without discussing algorithm selection, health check tuning, connection draining, or failure modes signals mid-level thinking. Senior engineers understand that load balancer configuration is an ongoing operational concern, not a one-time setup.

  2. Ignoring the stateful vs stateless distinction. Applying the same load balancing strategy to stateless web servers and stateful database connections is a fundamental error. Database load balancing requires read/write awareness, replication lag handling, and connection pooling.

  3. Not considering tail latency. Load balancing algorithms that optimize for average latency can produce terrible P99 latency. A senior candidate should discuss how algorithm choice affects the latency distribution, not just the average.

  4. Forgetting about load balancer failures. The load balancer itself is a single point of failure unless you deploy it redundantly. Discuss active-passive pairs, DNS failover between load balancers, and what happens when the load balancer is the bottleneck.

  5. Over-relying on sticky sessions. Proposing sticky sessions as a first solution instead of stateless architecture with externalized state indicates a lack of experience with modern distributed systems. Sticky sessions should be a last resort, not a default.

How to Prepare for Load Balancing Interview Questions

Build hands-on experience with load balancing tools. Set up NGINX or HAProxy locally and experiment with different algorithms, health check configurations, and failure scenarios. Deploy a simple service behind a Kubernetes Ingress and observe how traffic distribution changes as you scale pods up and down.

Study real-world architectures. Read the Maglev paper from Google, Netflix's Zuul architecture blog posts, and Envoy's documentation on load balancing policies. Understanding how these systems solve load balancing at scale gives you concrete examples to reference in interviews.

Practice failure scenario reasoning. For each load balancing concept, ask yourself: what happens when this fails? What is the blast radius? How do I detect and recover? This failure-mode thinking is what distinguishes senior candidates.

For a comprehensive preparation plan, see the system design interview guide, explore distributed systems concepts, and use the learning paths to build your knowledge systematically.
