
Global Load Balancing

Akhil Sharma
20 min

Global Load Balancing (GSLB) - Routing Traffic Across the Planet

Audience: infrastructure engineers and network architects managing global traffic distribution.

This article assumes:

  • Your users are distributed globally across continents and time zones.
  • Network conditions vary: latency, packet loss, routing policies differ by geography.
  • Load balancers within a region aren't enough - you need cross-region intelligence.
  • DNS alone is too slow and too dumb for modern traffic management.

Challenge: Your "global" load balancer isn't actually global

Scenario

You deploy to three AWS regions: US, EU, and Asia.

You set up Route 53 with latency-based routing. "We're globally load balanced!" you announce.

Then reality hits:

  • Asian users get routed to the US during an AWS routing table glitch
  • The EU region is overloaded, but DNS keeps sending traffic there (cached answers stay valid for the 60-second TTL)
  • A DDoS attack on the US region affects all users (DNS routing doesn't detect application-layer attacks)
  • Your "global" setup failed the first real test

Interactive question (pause and think)

What's the difference between:

  1. DNS-based routing (Route 53, Cloudflare DNS)
  2. Anycast-based routing (BGP announcements)
  3. True Global Server Load Balancing (GSLB)

Take 10 seconds.

Progressive reveal (question -> think -> answer)

Answer: Increasing levels of intelligence and control.

  1. DNS routing: Simple, slow to change (answers are cached until the TTL expires), resolver- and client-controlled
  2. Anycast: Fast, automatic, but no application awareness
  3. GSLB: Intelligent, application-aware, dynamic routing decisions

Real-world analogy (airport routing)

Imagine routing passengers to airports:

DNS routing: At ticket purchase time, you assign passengers to the "nearest airport" based on their home address. If that airport closes, they're stuck with an outdated ticket.

Anycast: Passengers head to any flight with the same flight number. The airline network automatically routes them to an airport that can serve them.

GSLB: An intelligent routing system considers airport capacity, weather, security delays, and even passenger preferences, and reroutes dynamically in real time.

Key insight box

GSLB is the control plane for global traffic distribution, making routing decisions based on real-time health, capacity, and performance.

Challenge question

If GSLB is so much better, why does anyone still use basic DNS routing?


Mental model - GSLB as traffic orchestrator

Scenario

You have three data centers: US-West, US-East, and EU.

Traditional load balancing (within region):

  • Client → Regional LB → Pick backend server

Global load balancing (across regions):

  • Client → GSLB → Pick region → Regional LB → Pick server

Interactive question (pause and think)

What information does GSLB need to make good routing decisions?

  A. Server CPU/memory utilization
  B. Network latency to each region
  C. Regional health and capacity
  D. All of the above, plus business logic

Progressive reveal

Answer: D.

GSLB is a decision engine that considers:

  • Client location (geography, network)
  • Endpoint health (is region up? degraded?)
  • Capacity (can region handle more load?)
  • Performance (latency, throughput)
  • Business rules (compliance, cost, SLAs)
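
The considerations above can be sketched as a small decision function. This is a minimal illustration, not a production design: the `Region` fields, the `pick_region` name, and the 0.9 utilization cap are all assumptions made for the example. Client location is represented indirectly, as latency measured from the client's network.

```python
from dataclasses import dataclass


@dataclass
class Region:
    name: str
    healthy: bool        # endpoint health: is the region up?
    utilization: float   # capacity: 0.0 (idle) to 1.0 (full)
    latency_ms: float    # performance, measured from the client's network
    allowed: bool        # business rule, e.g. compliance permits this client


def pick_region(regions, max_utilization=0.9):
    """Apply hard constraints first, then pick the lowest-latency survivor."""
    candidates = [
        r for r in regions
        if r.healthy and r.allowed and r.utilization < max_utilization
    ]
    if not candidates:
        return None  # nothing is eligible; the caller decides how to degrade
    return min(candidates, key=lambda r: r.latency_ms).name
```

Health, capacity, and business rules act as filters here, and latency only breaks ties among regions that pass them. Other orderings are possible.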

Mental model

Think of GSLB as:

  • Air traffic control for network packets
  • Smart router with real-time world view
  • Policy engine enforcing business rules

The goal: Route each request to the "best" endpoint at that moment.

Real-world parallel (emergency dispatch)

When you call 911, the dispatcher doesn't just send the "nearest" ambulance. They consider:

  • Which ambulances are available (capacity)
  • Which are already en route to other emergencies (load)
  • Which hospital has the right facilities (capability)
  • Traffic conditions (latency)

GSLB does the same for network traffic.

Key insight box

GSLB moves routing intelligence from static configuration (DNS) to dynamic decision-making based on real-time telemetry.

Challenge question

Should GSLB routing decisions prioritize latency, cost, or reliability? Or does it depend on the request type?


Understanding GSLB architectures - DNS vs proxy vs hybrid

Scenario

Your team debates: "Should our GSLB be DNS-based, proxy-based, or hybrid?"

Different architectures, different trade-offs.

GSLB architecture patterns

Pattern 1: DNS-based GSLB


Pattern 2: Proxy-based GSLB (Reverse proxy)


Pattern 3: Hybrid (DNS + Proxy)


Visual comparison


Interactive question

You're building a global API serving mobile apps. Which GSLB pattern?

  A. DNS-based (simple, cheap)
  B. Proxy-based (full control)
  C. Hybrid (best of both)

Progressive reveal

Answer: Start with A (DNS-based), migrate to C (hybrid) as you scale.

Why? You control the mobile client, so you can add retries and fallback endpoints that tolerate stale DNS answers, which is harder to do in a browser. Start simple, add complexity only when needed.

Key insight box

GSLB architecture is not one-size-fits-all. Start with DNS, add proxy layers as requirements demand per-request intelligence.

Challenge question

Can you implement GSLB without any special infrastructure, using only standard DNS and load balancers?


GSLB health checks and failover - the detection problem

Scenario

Your EU region is degraded (high latency, elevated errors). When should GSLB stop routing traffic there?

Health check strategies

Strategy 1: Binary health checks (up/down)
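
A minimal sketch of a binary check, assuming the common approach of requiring several consecutive probe results before flipping state. The class name and both thresholds are hypothetical values chosen for illustration.

```python
class BinaryHealthCheck:
    """Tracks up/down state from a stream of probe results."""

    def __init__(self, fail_threshold=3, pass_threshold=2):
        self.fail_threshold = fail_threshold  # failures needed to mark down
        self.pass_threshold = pass_threshold  # successes needed to mark up
        self.up = True
        self._streak = 0  # consecutive results that contradict current state

    def record(self, probe_ok: bool) -> bool:
        if probe_ok == self.up:
            self._streak = 0  # result agrees with current state
        else:
            self._streak += 1
            needed = self.fail_threshold if self.up else self.pass_threshold
            if self._streak >= needed:
                self.up = not self.up
                self._streak = 0
        return self.up
```

A single failed probe does not take the region out of rotation. Only a run of `fail_threshold` failures does.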


Strategy 2: Weighted health (capacity-aware)
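
One way to make health capacity-aware is to weight each region by its spare capacity multiplied by a health score. The sketch below assumes both inputs are already normalized to the range 0 to 1. The function name and this formula are illustrative choices, not a standard.

```python
def capacity_weights(utilization, health):
    """Return traffic weights that sum to 1.0 (or all zeros if nothing is usable).

    utilization: region -> current load as a fraction of capacity (0..1)
    health:      region -> health score (0 = down, 1 = fully healthy)
    """
    raw = {
        region: max(0.0, 1.0 - load) * health.get(region, 0.0)
        for region, load in utilization.items()
    }
    total = sum(raw.values())
    if total == 0:
        return {region: 0.0 for region in raw}
    return {region: value / total for region, value in raw.items()}
```

A region at full utilization gets weight zero even though a binary check would still report it as up.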


Strategy 3: Active vs passive health checks
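
Active checks send synthetic probes. Passive checks watch the outcomes of real user requests. The passive side can be sketched as a sliding window over recent responses. The window size, the error-rate limit, and the rule that only 5xx responses count as errors are assumptions made for this example.

```python
from collections import deque


class PassiveHealth:
    """Derives health from the outcomes of real requests."""

    def __init__(self, window=100, max_error_rate=0.05, min_samples=20):
        self.outcomes = deque(maxlen=window)  # True means the request failed
        self.max_error_rate = max_error_rate
        self.min_samples = min_samples

    def observe(self, status_code: int) -> None:
        self.outcomes.append(status_code >= 500)

    def error_rate(self) -> float:
        if not self.outcomes:
            return 0.0
        return sum(self.outcomes) / len(self.outcomes)

    def healthy(self) -> bool:
        if len(self.outcomes) < self.min_samples:
            return True  # too little real traffic to judge; rely on active checks
        return self.error_rate() <= self.max_error_rate
```

The `min_samples` guard is why the two kinds of check complement each other: a region receiving no traffic produces no passive signal, so only an active probe can say whether it is safe to send traffic there.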


Failover decision logic
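
A hedged sketch of one possible decision function, written in Python. It assumes three states (healthy, degraded, unhealthy) driven by an error rate, with hysteresis so that a region hovering around a threshold does not flap. The thresholds and the `margin` value are hypothetical.

```python
ORDER = ["healthy", "degraded", "unhealthy"]


def next_state(current, error_rate, degrade_at=0.05, fail_at=0.20, margin=0.02):
    """Return the region's next state given its current state and error rate.

    Moving to a worse state uses the raw thresholds. Moving to a better state
    requires the error rate to fall `margin` below the threshold (hysteresis).
    """
    if error_rate >= fail_at:
        target = "unhealthy"
    elif error_rate >= degrade_at:
        target = "degraded"
    else:
        target = "healthy"

    if ORDER.index(target) >= ORDER.index(current):
        return target  # same or worse: act immediately

    # Recovery path: only improve when clearly below the threshold.
    if current == "unhealthy" and error_rate >= fail_at - margin:
        return "unhealthy"
    if error_rate >= degrade_at - margin:
        return "degraded"
    return "healthy"
```

With these defaults, a degraded region at a 4% error rate stays degraded, and only recovers once the rate drops below 3%.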


Avoiding thundering herd during failover
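
Two common mitigations are shifting traffic in steps rather than all at once, and spreading the displaced load in proportion to each surviving region's spare capacity. The sketch below illustrates both. The function names and the requests-per-second units are assumptions.

```python
def ramp_schedule(steps):
    """Fraction of the failed region's traffic moved after each step."""
    if steps < 1:
        raise ValueError("steps must be at least 1")
    return [(i + 1) / steps for i in range(steps)]


def split_by_headroom(load_rps, headroom_rps):
    """Spread displaced load across survivors in proportion to spare capacity.

    Never assigns more than the survivors can absorb. Any excess must be
    shed or queued by the caller.
    """
    total = sum(headroom_rps.values())
    if total <= 0:
        return {region: 0.0 for region in headroom_rps}
    movable = min(load_rps, total)
    return {region: movable * h / total for region, h in headroom_rps.items()}
```

For example, `split_by_headroom(900, {"us-east": 600, "us-west": 300})` sends 600 rps to `us-east` and 300 rps to `us-west` instead of pushing everything to a single region.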


Key insight box

GSLB health checks are not just "is it up?" but "can it handle more load, and how well is it performing?"

Challenge question

Your health check shows the region is healthy, but real user requests are failing. What went wrong?


GSLB routing policies - latency vs cost vs compliance

Scenario

You have three regions: US ($), EU ($$), Asia ($$$).

User in Japan makes a request. Where should GSLB route it?

Think about it

Factors to consider:

  • Latency: Asia is nearest (50 ms), US is 150 ms, EU is 200 ms
  • Cost: Asia is 3x more expensive than US
  • Compliance: User data might need to stay in specific region
  • Capacity: Asia might be at 90% capacity, US at 50%

Interactive question (pause and think)

Which routing policy?

  A. Always route to nearest region (minimize latency)
  B. Always route to cheapest region (minimize cost)
  C. Balance latency and cost (weighted decision)
  D. Different policy per request type (API vs static assets)

Progressive reveal

Answer: Usually C or D.

One-size-fits-all routing is almost always wrong.

GSLB routing policies catalog

Policy 1: Latency-based (performance-first)


Policy 2: Cost-based (spend-first)


Policy 3: Geographic (compliance-first)
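
A compliance-first policy can be sketched as a filter that runs before any other scoring. The `RESIDENCY` table below is hypothetical example data, not legal guidance. It maps a client country to the set of regions allowed to serve it.

```python
# Hypothetical residency rules: country code -> regions allowed to serve it.
RESIDENCY = {
    "DE": {"eu"},
    "FR": {"eu"},
}


def eligible_regions(client_country, regions, residency=RESIDENCY):
    """Return the regions this client may be routed to, preserving order."""
    allowed = residency.get(client_country)
    if allowed is None:
        return list(regions)  # no rule for this country: any region is fine
    # Fail closed: if no allowed region is available, return an empty list
    # rather than silently routing out of region.
    return [r for r in regions if r in allowed]
```

The fail-closed behavior is a design choice: for a compliance rule, serving an error is usually preferable to serving from a forbidden region.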


Policy 4: Weighted (multi-objective)
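
A weighted policy can be sketched as a scoring function over normalized latency, cost, and load, where the lowest score wins. The example reuses the figures from the Japan scenario above. The EU utilization (60%) is not given there and is assumed here, and the weights are illustrative.

```python
# Figures from the scenario; EU utilization is an assumption.
REGIONS = {
    "asia": {"latency_ms": 50, "cost": 3, "utilization": 0.9},
    "us": {"latency_ms": 150, "cost": 1, "utilization": 0.5},
    "eu": {"latency_ms": 200, "cost": 2, "utilization": 0.6},
}


def score(region, weights, max_latency, max_cost):
    """Lower is better. Each term is normalized to roughly 0..1."""
    w_latency, w_cost, w_load = weights
    return (
        w_latency * region["latency_ms"] / max_latency
        + w_cost * region["cost"] / max_cost
        + w_load * region["utilization"]
    )


def best_region(regions, weights):
    max_latency = max(r["latency_ms"] for r in regions.values())
    max_cost = max(r["cost"] for r in regions.values())
    return min(
        regions,
        key=lambda name: score(regions[name], weights, max_latency, max_cost),
    )
```

With latency-heavy weights `(0.6, 0.2, 0.2)` the user in Japan is routed to `asia`. With cost-heavy weights `(0.2, 0.6, 0.2)` the same request goes to `us`.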


Policy 5: Request-aware (path-based)
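
Path-based routing needs a proxy layer that can see the request. The policy lookup itself can be as simple as a longest-prefix match. The prefixes and policy names below are hypothetical.

```python
# Hypothetical mapping from URL path prefix to routing policy name.
POLICY_BY_PREFIX = {
    "/api/": "latency",     # interactive calls: performance first
    "/static/": "weighted",  # assets: balance latency and cost
    "/batch/": "cost",      # background jobs: cheapest region
}

DEFAULT_POLICY = "weighted"


def policy_for(path, table=POLICY_BY_PREFIX, default=DEFAULT_POLICY):
    """Return the policy for the longest matching prefix, or the default."""
    matches = [prefix for prefix in table if path.startswith(prefix)]
    if not matches:
        return default
    return table[max(matches, key=len)]
```

For example, `policy_for("/api/v1/users")` returns `"latency"`, while an unmatched path such as `/health` falls back to the default.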


Real-world example: Stripe's routing


Key insight box

The best GSLB routing policy depends on request type. API calls, static assets, and batch jobs should use different policies.

Challenge question

Can you dynamically adjust routing weights based on time of day (e.g., route to cheaper regions during low-traffic hours)?


Anycast and BGP - GSLB's secret weapon

Scenario

You want instant failover (no DNS caching) and you want users to reach "nearest" region automatically.

Enter: Anycast with BGP.

What is Anycast?
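
With anycast, several sites advertise the same IP prefix and the network delivers each packet to whichever site its routing prefers. The toy model below collapses BGP path selection into a single per-site "cost" number. That is a deliberate simplification: real BGP decisions depend on policy and AS-path attributes, not on latency or distance. The site names are made up.

```python
def anycast_destination(path_cost, announcing):
    """Pick the site a client's packets reach.

    path_cost:  site -> routing cost as seen from the client's network
    announcing: set of sites currently advertising the shared prefix
    """
    reachable = {
        site: cost for site, cost in path_cost.items() if site in announcing
    }
    if not reachable:
        return None  # nobody announces the prefix: the address is unreachable
    return min(reachable, key=reachable.get)
```

Failover falls out of the model: when a site withdraws its announcement, the next-best site takes over without any DNS change. In practice this takes as long as the routing change needs to propagate.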


How Anycast works with GSLB


Anycast implementation


Anycast + GSLB hybrid


Anycast limitations


Key insight box

Anycast sidesteps DNS caching, so failover happens as fast as BGP converges instead of waiting for TTLs to expire. But it requires BGP expertise and works best for stateless protocols.

Challenge question

Can you use anycast for a database connection (stateful TCP)? What breaks?


GSLB observability - can you see what it's deciding?

Scenario

Your GSLB routes traffic globally. Users in EU complain about slow response times.

Questions:

  • Are they being routed to EU region or somewhere else?
  • Is EU region healthy or degraded?
  • Is GSLB making the right decision?

Without observability, you're blind.

GSLB metrics to track

Routing decision metrics:
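
As a minimal sketch, routing decisions can be counted by client region, chosen region, and the reason for the choice. The class and label names are assumptions. A real deployment would export these counters to its metrics system rather than hold them in memory.

```python
from collections import Counter


class RoutingMetrics:
    """In-memory counters for GSLB routing decisions."""

    def __init__(self):
        self.decisions = Counter()

    def record(self, client_region, routed_region, reason):
        self.decisions[(client_region, routed_region, reason)] += 1

    def cross_region_ratio(self, client_region):
        """Share of this client region's requests served somewhere else."""
        total = sum(
            n for (c, _, _), n in self.decisions.items() if c == client_region
        )
        away = sum(
            n
            for (c, routed, _), n in self.decisions.items()
            if c == client_region and routed != client_region
        )
        return away / total if total else 0.0
```

A rising `cross_region_ratio("eu")` answers the first question in the scenario directly: EU users are being served from outside the EU, and the `reason` label says why.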


Client perspective metrics:


Failover metrics:


GSLB debugging dashboard


Distributed tracing for GSLB


Key insight box

GSLB without observability is a black box. Instrument every routing decision so you can debug user complaints and optimize over time.

Challenge question

User reports: "I'm in London but your API is slow." How do you use GSLB metrics to debug where they're actually being routed?


Final synthesis - Design your GSLB strategy

Synthesis challenge

You're the infrastructure lead for a global video streaming platform.

Requirements:

  • 100 million users across North America (45%), Europe (30%), Asia (20%), South America (5%)
  • Video content delivery (high bandwidth)
  • Real-time playback start (< 2 second buffering)
  • Cost-sensitive (bandwidth is expensive)
  • Compliance: EU content must stay in EU for licensing

Constraints:

  • CDN costs: $0.05/GB in US, $0.08/GB in EU, $0.12/GB in Asia
  • Origin servers: US-East, EU-West, Asia-Pacific
  • Budget: Must reduce bandwidth costs by 20%
  • Current: No GSLB, users manually select region

Your tasks (pause and think)

  1. Choose GSLB architecture (DNS, proxy, hybrid?)
  2. Define routing policies (latency vs cost?)
  3. Design health check strategy
  4. Plan failover approach
  5. Implement cost optimization
  6. Define observability requirements

Write down your design.

Progressive reveal (one possible solution)

1. GSLB architecture:


2. Routing policies:


3. Health check strategy:


4. Failover approach:


5. Cost optimization:


6. Observability requirements:


Key insight box

GSLB for video streaming is about balancing three constraints: latency (user experience), cost (bandwidth), and compliance (licensing).

Final challenge question

During a live sports event, traffic to EU region spikes 10x. Your GSLB routes overflow traffic to US region. EU users complain about licensing errors ("content not available in your region"). How do you fix the routing policy to handle spikes without violating licensing?


Appendix: Quick checklist (printable)

GSLB architecture selection:

  • Define requirements (latency, cost, compliance, observability)
  • Choose architecture (DNS, proxy, or hybrid)
  • Consider anycast for fast failover (if BGP available)
  • Plan for CDN integration (if content delivery)
  • Estimate costs (infrastructure, bandwidth, operational)

Routing policy design:

  • Define policies per request type (API, video, static)
  • Balance latency, cost, capacity in scoring function
  • Implement geographic routing for compliance
  • Add time-based routing for cost optimization
  • Support user-tier differentiation (free vs paid)

Health checking:

  • Implement active synthetic checks
  • Add passive real-user monitoring
  • Define healthy/degraded/unhealthy thresholds
  • Set check frequency (balance timeliness vs load)
  • Create escalation triggers (when to page humans)

Failover planning:

  • Define failure detection criteria
  • Implement gradual failover (not instant)
  • Add hysteresis to prevent flapping
  • Plan failback strategy (gradual restoration)
  • Test failover monthly (Game Days)

Observability:

  • Track routing decisions (where is traffic going?)
  • Monitor health scores per region
  • Measure cost per region (bandwidth, compute)
  • Implement distributed tracing (end-to-end visibility)
  • Create dashboards for ops team

Cost optimization:

  • Leverage CDN caching (reduce origin bandwidth)
  • Route to cheaper regions when latency allows
  • Implement time-based cost optimization
  • Monitor spend per region in real-time
  • Set budget alerts

Red flags (reassess GSLB):

  • Failover takes > 60 seconds (DNS TTL too high, or no anycast)
  • Users routed to wrong region frequently (policy broken)
  • Bandwidth costs increased after GSLB (inefficient routing)
  • Regional health checks unreliable (synthetic != real traffic)
  • GSLB is a black box (no observability)

Key Takeaways

  1. Global load balancing routes users to the optimal datacenter based on geography, health, and load — using DNS, anycast, or a combination
  2. GeoDNS returns different IP addresses based on the client's location — routing European users to Frankfurt and US users to Virginia
  3. Anycast advertises the same IP from multiple locations — network routing automatically directs packets to the nearest one
  4. Health checks remove unhealthy regions from DNS rotation — automatic failover when a datacenter goes down
  5. DNS TTL affects failover speed — lower TTLs enable faster failover but increase DNS query volume