System Design: Blue-Green Deployment System
Design a blue-green deployment system that enables zero-downtime production releases by maintaining two identical production environments and switching traffic atomically between them.
Requirements
Functional Requirements:
- Maintain two production environments (blue and green) with identical infrastructure
- Deploy new version to the idle environment while production traffic serves the active one
- Switch traffic atomically from old to new version with zero dropped requests
- Rollback instantly (switch traffic back) if the new version shows issues
- Run smoke tests and integration tests against the idle environment before switching
- Handle database schema changes that must be compatible with both versions simultaneously
Non-Functional Requirements:
- Traffic cutover completes in under 30 seconds
- Zero user-visible errors during cutover (no 5xx, no dropped connections)
- Rollback time under 60 seconds
- Each environment consumes ~100% of normal production resources, so total compute cost is 2x while both environments are running during a deployment
Scale Estimation
A production system serving 100,000 requests/sec. During cutover, all 100,000 RPS must transition from blue to green without interruption. In-flight requests at the moment of cutover: 100,000 RPS × 100ms average response time = 10,000 concurrent requests. These must complete on the blue environment even after traffic is routed to green (connection draining). Blue environment must stay alive for 60 seconds post-cutover to drain in-flight requests. Two environments = 2× compute cost during deployments.
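The in-flight estimate above is an application of Little's Law (concurrent requests = arrival rate × mean response time); a quick sanity check using the figures from this section:

```python
# Little's Law: in-flight requests L = arrival rate (lambda) x mean response time (W).
RPS = 100_000          # requests per second during cutover (from the estimate above)
MEAN_LATENCY_MS = 100  # average response time

# Number of requests in flight at the instant of cutover; these must
# complete on the blue environment after traffic is routed to green.
in_flight = RPS * MEAN_LATENCY_MS // 1000  # = 10,000 concurrent requests
print(in_flight)
```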
High-Level Architecture
The load balancer is the traffic switch. Blue and green environments sit behind the same load balancer (or DNS entry), with one environment receiving 100% of traffic at any time. Cutover is a load balancer configuration update: change the upstream backend pool from the blue cluster to the green cluster. Modern load balancers (AWS ALB, GCP LB, Nginx, Envoy) apply configuration changes without dropping existing connections — in-flight requests on blue connections complete normally; new connections go to green.
DNS-based cutover (change the DNS A record from blue IP to green IP) is an alternative but has a propagation delay of 30 seconds to 5 minutes due to DNS TTL. For zero-downtime cutover with second-level precision, load-balancer-based switching is required. DNS-based cutover is used for coarser-grained environments (staging vs. production) where propagation delay is acceptable.
The deployment workflow: (1) deploy new code to idle environment (e.g., green); (2) run health checks and smoke tests against green behind a test load balancer entry; (3) gradually shift traffic (0% → 10% → 50% → 100%) or atomically switch (0% → 100%); (4) monitor error rates and latency on green; (5) drain blue (wait for in-flight requests to complete); (6) keep blue warm for rollback for 30 minutes; (7) decommission old blue (or repurpose as next deploy target).
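The seven workflow steps above can be sketched as an ordered state machine; the step names here are illustrative, not a real orchestrator's API:

```python
# Minimal sketch of the blue-green deployment workflow as an ordered sequence
# of states. Names are illustrative assumptions, not a specific product's API.
WORKFLOW = [
    "deploy_to_idle",      # (1) push the new version to the idle (green) environment
    "smoke_test_idle",     # (2) health checks + smoke tests behind a test LB entry
    "shift_traffic",       # (3) gradual (0->10->50->100%) or atomic (0->100%) cutover
    "monitor_new_active",  # (4) watch error rates and latency on green
    "drain_old_active",    # (5) let in-flight requests on blue complete
    "keep_warm",           # (6) hold blue for ~30 minutes for instant rollback
    "decommission",        # (7) tear down or repurpose blue as the next deploy target
]

def next_step(current):
    """Return the step that follows `current`, or None at the end."""
    i = WORKFLOW.index(current)
    return WORKFLOW[i + 1] if i + 1 < len(WORKFLOW) else None
```

An orchestrator advances one step at a time and only moves forward when the current step's checks pass; any failure before "shift_traffic" aborts with no user impact.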
Core Components
Traffic Switch
The traffic switch is a load balancer target group configuration. In Kubernetes: update the Service selector to point to green pods (change version: blue to version: green in the selector — all new connections route to green pods immediately; existing connections to blue pods are drained). In AWS: update the ALB listener rule's target group from the blue target group to the green target group via the ALB API. The switch is atomic from the load balancer's perspective — the configuration update is applied and the next accepted connection goes to green. In-flight requests on existing blue connections complete uninterrupted.
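The routing behavior described above can be modeled as an atomic pointer swap: connections are pinned to a pool at accept time, and the flip changes only where subsequent accepts go. A minimal in-memory simulation (not a real load balancer, just the switching semantics):

```python
import threading

class TrafficSwitch:
    """Simulates the load balancer's atomic backend-pool swap: every
    connection accepted after flip() routes to the new pool; connections
    routed earlier keep the pool they were assigned at accept time."""

    def __init__(self, active):
        self._active = active
        self._lock = threading.Lock()  # makes the swap atomic w.r.t. routing

    def route_new_connection(self):
        with self._lock:
            return self._active  # pool chosen at accept time, then pinned

    def flip(self, new_active):
        with self._lock:
            self._active = new_active  # single atomic configuration update

switch = TrafficSwitch("blue")
before = switch.route_new_connection()  # pinned to blue; drains normally
switch.flip("green")
after = switch.route_new_connection()   # all new connections go to green
print(before, after)
```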
Connection Draining
After the traffic switch, blue instances stop receiving new connections but must complete in-flight requests. Connection draining configuration: set a deregistration delay on the blue target group (AWS: 60-300 seconds; in Nginx, remove the server from the upstream block and reload, which lets the old workers finish their in-flight requests before exiting). During this window, the load balancer marks blue instances as draining (no new connections; existing connections are kept alive until the request completes). After the drain timeout, blue instances are deregistered. Services must handle SIGTERM by refusing new connections while allowing in-flight requests to complete: a graceful shutdown handler with a 30-second timeout.
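A sketch of the graceful shutdown handler described above, assuming a service that tracks its own in-flight request count (the counter and timeout are illustrative):

```python
import signal
import threading
import time

in_flight = 0                      # incremented/decremented by request handlers
in_flight_lock = threading.Lock()
shutting_down = threading.Event()  # once set, the accept loop refuses new work

def handle_sigterm(signum, frame):
    """On SIGTERM: stop accepting new connections; let in-flight work finish."""
    shutting_down.set()

signal.signal(signal.SIGTERM, handle_sigterm)

def drain(timeout_s=30.0):
    """Wait up to timeout_s for in-flight requests to finish.
    Returns True if fully drained, False if the deadline expired."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        with in_flight_lock:
            if in_flight == 0:
                return True
        time.sleep(0.05)
    return False
```

The process exits only after drain() returns (or the deadline forces it), which is what lets the load balancer's deregistration delay and the service's shutdown cooperate.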
Database Migration Strategy
The hardest problem in blue-green deployment is database schema changes. Both blue (still running during green's deployment) and green must work with the same database simultaneously; an incompatible schema change will break one of them. The expand-contract pattern deploys schema changes in three phases. Expand: add new columns/tables without removing old ones, so both blue (old code) and green (new code) work. Deploy: roll out green (green uses the new columns, blue still uses the old). Contract: remove old columns/tables once blue is fully drained. This requires every migration to be backward-compatible: never drop a column immediately, never rename in place (instead add the new column, migrate the data, then remove the old one).
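An illustrative expand-contract example, using a hypothetical rename of users.email to users.email_address (the table and column names are assumptions, not from the source); the key property is that the expand phase never removes anything the old code still reads:

```python
# Expand phase: runs BEFORE green deploys. Old code ignores the new column.
EXPAND = """
ALTER TABLE users ADD COLUMN email_address TEXT;
UPDATE users SET email_address = email;  -- backfill so green can read it
"""

# Contract phase: runs only AFTER blue has fully drained and rollback
# is no longer needed, because it breaks the old code.
CONTRACT = """
ALTER TABLE users DROP COLUMN email;
"""

def is_backward_compatible(migration_sql):
    """A cheap CI-style lint: flag statements that break the running version."""
    forbidden = ("DROP COLUMN", "DROP TABLE", "RENAME")
    sql = migration_sql.upper()
    return not any(word in sql for word in forbidden)
```

A real linter (such as the Squawk tool mentioned below) does much more, but even this toy check captures the rule: destructive statements belong only in the contract phase.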
Database Design
Deployment state is stored in a deployment database: deployments (id, service, version, environment: blue/green, status: deploying/ready/active/draining/standby, deployed_at, activated_at) and traffic_rules (service, active_environment, cutover_percent). The deployment orchestrator reads and writes these tables to coordinate the deployment workflow.
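The two tables above, sketched as DDL (SQLite is used here purely for illustration; the service name is a hypothetical example):

```python
import sqlite3

# Schema for the deployment-state tables described above.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE deployments (
    id           INTEGER PRIMARY KEY,
    service      TEXT NOT NULL,
    version      TEXT NOT NULL,
    environment  TEXT CHECK (environment IN ('blue', 'green')),
    status       TEXT CHECK (status IN
                   ('deploying', 'ready', 'active', 'draining', 'standby')),
    deployed_at  TEXT,
    activated_at TEXT
);
CREATE TABLE traffic_rules (
    service            TEXT PRIMARY KEY,
    active_environment TEXT CHECK (active_environment IN ('blue', 'green')),
    cutover_percent    INTEGER DEFAULT 100
);
""")

# The orchestrator activating green for a (hypothetical) checkout service:
conn.execute("INSERT INTO traffic_rules VALUES ('checkout', 'blue', 100)")
conn.execute("UPDATE traffic_rules SET active_environment = 'green' "
             "WHERE service = 'checkout'")
active = conn.execute("SELECT active_environment FROM traffic_rules "
                      "WHERE service = 'checkout'").fetchone()[0]
print(active)
```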
Environment configuration (which blue/green slot maps to which cluster endpoint) is stored in a configuration system (Consul/etcd) and consumed by the load balancer's configuration management layer. When the deployment orchestrator activates green, it updates the config store, which triggers a load balancer configuration reload via watch notification.
API Design
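A plausible control-plane API for the deployment orchestrator, covering the deploy, test, cutover, and rollback operations from the workflow above; the paths and payload shapes here are illustrative assumptions, not a specific product's API:

```python
# Illustrative REST surface for the deployment orchestrator.
# Endpoints and semantics are assumptions for the sake of the sketch.
API = {
    "POST /services/{service}/deployments":
        "deploy a version to the idle environment",
    "POST /services/{service}/deployments/{id}/test":
        "run smoke/integration tests against the idle environment",
    "POST /services/{service}/cutover":
        "switch traffic to the tested environment (body: percent, default 100)",
    "POST /services/{service}/rollback":
        "switch traffic back to the previous environment",
    "GET /services/{service}/status":
        "active environment, versions in each slot, drain state",
}

for endpoint, purpose in API.items():
    print(f"{endpoint:50} {purpose}")
```

Cutover and rollback are deliberately separate endpoints so that rollback needs no arguments: it always flips back to the environment recorded as previously active.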
Scaling & Bottlenecks
The 2× resource cost during deployment is the primary scaling concern. For large services (1,000 application servers), doubling for every deployment is expensive. One mitigation is a pre-warmed standby pool: keep green at 50% capacity and scale out to 100% only when deploying, which reduces idle cost at the expense of slower scale-out during deployment. Cloud auto-scaling (spin up green on demand for each deployment, tear down blue after draining) cuts idle cost to near zero but adds 3-5 minutes of scale-out time before cutover.
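The steady-state cost of the three provisioning strategies above, in units of one full production environment (the numbers are illustrative):

```python
# Idle (between-deployments) compute cost for each provisioning strategy,
# measured in multiples of a single full production environment.
FULL = 1.0

strategies = {
    "always-on blue-green": FULL + FULL,        # both environments always hot
    "50% warm standby":     FULL + 0.5 * FULL,  # scale standby up only to deploy
    "on-demand green":      FULL + 0.0,         # provision per deploy (3-5 min)
}

for name, cost in strategies.items():
    print(f"{name:22} {cost:.1f}x steady-state compute")
```

The ordering of idle cost is the inverse of the ordering of rollback speed, which is the trade-off the next section makes explicit.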
Database schema migrations block zero-downtime deployment if the expand-contract pattern is not followed. A team discipline around backward-compatible migrations is essential — this is often the hardest organizational challenge. Automated migration linting tools (Squawk for PostgreSQL) can detect migrations that would break backward compatibility, enforcing the discipline in CI.
Key Trade-offs
- Atomic cutover vs. gradual traffic shift: Atomic switch (0% → 100%) is simple and consistent; gradual shift (canary: 0% → 1% → 10% → 100%) reduces blast radius but requires the new version to handle partial traffic and exposes both versions simultaneously
- Blue-green vs. rolling deployment: Blue-green always has a clean cutover point and trivial rollback; rolling deployments (replace pods one-by-one) use half the resources but have a messy rollback (re-deploy old version) and a window where both versions handle traffic
- 2× resource cost vs. on-demand provisioning: Keeping blue warm enables instant rollback; on-demand provisioning saves 50% cost but makes rollback slow (re-provisioning the old environment takes minutes)
- Feature flags vs. blue-green: Feature flags allow toggling features without a deployment and avoid the 2× resource cost, but require flag logic in application code; blue-green is infrastructure-level and works for any code change without application modification