System Design: Canary Deployment System

Design a canary deployment system that gradually shifts traffic to new service versions, automatically monitors error rates and latency, and rolls back if statistical analysis detects regressions.

Updated Jan 15, 2025

Requirements

Functional Requirements:

  • Route a configurable percentage of traffic to the new (canary) version while the rest goes to stable
  • Gradually increase canary traffic: 1% → 5% → 10% → 25% → 50% → 100%
  • Automatically promote canary to stable if metrics are healthy for a configurable duration at each step
  • Automatically roll back the canary (0% traffic) if error rate or latency regressions are detected
  • Support header-based overrides: force a specific user to canary (for internal testing)
  • Provide a dashboard showing canary vs. stable metric comparison in real time

Non-Functional Requirements:

  • Traffic split changes take effect within 10 seconds of percentage update
  • Automatic rollback triggers within 60 seconds of detecting a regression
  • Statistical significance: don't roll back on noise — use a Mann-Whitney U test or z-test to confirm regressions
  • Support 10,000 canary deployments/day across all services in the fleet

Scale Estimation

1,000 services × 10 canary deployments/day/service = 10,000 canary deployments/day across the fleet. If each deployment takes ~2 hours to fully promote and starts are spread across the day, expected concurrency is 10,000 × (2/24) ≈ 800 active canary deployments at any given time. Each canary has 2 versions receiving traffic — the load balancer must track 2 backend pools per canary service. Metric comparison: for each canary, compare error rates and latency between canary and stable every 60 seconds. With ~800 active canaries × 10 metrics each: ~8,000 metric comparisons/minute — trivially handled.
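
As a quick sanity check, the sizing above in a few lines of Python (the 2-hour promotion time and 10 metrics per canary are the assumptions already stated):

```python
# Back-of-envelope sizing for the canary fleet.
deploys_per_day = 1_000 * 10                     # services x canaries/day/service
promote_hours = 2                                # assumed time to fully promote
active = deploys_per_day * promote_hours / 24    # ~833 concurrent canaries
metrics_per_canary = 10
comparisons_per_min = active * metrics_per_canary  # ~8,300 comparisons/minute
print(f"{active:.0f} active canaries, {comparisons_per_min:.0f} comparisons/min")
```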

High-Level Architecture

The canary system integrates with the load balancer (for traffic splitting) and the metrics system (for regression detection). Traffic splitting at the load balancer: maintain two backend pools per service (stable pool, canary pool) with weighted routing. Envoy's weighted cluster routing: {stable: weight=99, canary: weight=1} routes 1% of requests to canary. Weight adjustments are made by updating the Envoy control plane configuration, which propagates to all Envoy instances via the xDS API within seconds.
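
For concreteness, a sketch of the weighted_clusters route fragment the control plane would push, written here as the dict/JSON form of Envoy's route configuration (the cluster names are illustrative):

```python
# Sketch of an Envoy route using weighted_clusters, in the JSON/dict form
# of the xDS route configuration. Cluster names are illustrative.
weighted_route = {
    "match": {"prefix": "/"},
    "route": {
        "weighted_clusters": {
            "clusters": [
                {"name": "checkout-stable", "weight": 99},  # 99% of traffic
                {"name": "checkout-canary", "weight": 1},   # 1% to canary
            ]
        }
    },
}
```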

For sticky canary (ensuring a user always hits canary or always hits stable within a session): use a cookie (X-Canary-Version: true) set by the canary traffic router. Once assigned to canary, the cookie causes all subsequent requests from that client to be routed to canary. This improves the user experience (no mid-session version switches) and enables fair A/B comparison (users are fully exposed to one version).
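
A minimal sketch of the sticky-assignment logic, assuming the cookie name from above and a router that can read and set cookies (Envoy can also do this natively with session affinity):

```python
import random

CANARY_COOKIE = "X-Canary-Version"  # cookie name assumed from the text

def assign_backend(request_cookies: dict, canary_pct: float) -> tuple[str, dict]:
    """Return (backend, cookies_to_set). An existing assignment is honored,
    so a client stays on one version for the whole session."""
    if CANARY_COOKIE in request_cookies:
        sticky = request_cookies[CANARY_COOKIE] == "true"
        return ("canary" if sticky else "stable"), {}
    # First request: draw once against the current percentage, then pin it.
    sticky = random.random() * 100 < canary_pct
    return ("canary" if sticky else "stable"), {CANARY_COOKIE: str(sticky).lower()}
```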

The analysis engine runs continuously during canary deployment. It fetches metrics for both stable and canary backends, applies statistical tests to determine if the difference is significant, and makes promotion/rollback decisions. The analysis engine is decoupled from the traffic controller — it outputs decisions (promote, hold, rollback) that the traffic controller acts on.
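
The interface between the two can be as thin as a verdict enum that the traffic controller consumes; a sketch with stubbed metric fetching (all names here are illustrative):

```python
import enum
import time

class Verdict(enum.Enum):
    PASS = "pass"          # healthy at this step: controller may advance
    HOLD = "hold"          # not enough data yet: keep current weights
    ROLLBACK = "rollback"  # regression confirmed: cut canary to 0%

def fetch_metrics(deployment_id: str) -> tuple[dict, dict]:
    # Placeholder: in reality, query the metrics store for the canary and
    # stable backend pools over the last evaluation window.
    return {"errors": 3, "requests": 10_000}, {"errors": 200, "requests": 990_000}

def evaluate(canary: dict, stable: dict) -> Verdict:
    # Placeholder: the statistical tests described below go here.
    if canary["requests"] < 1_000:
        return Verdict.HOLD
    return Verdict.PASS

def analysis_loop(deployment_id: str, interval_s: int = 60):
    """Emit a verdict every evaluation cycle; the traffic controller
    consumes these and owns the actual weight changes."""
    while True:
        canary, stable = fetch_metrics(deployment_id)
        yield evaluate(canary, stable)
        time.sleep(interval_s)
```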

Core Components

Traffic Controller

The traffic controller manages the canary lifecycle state machine: NOT_STARTED → RUNNING (1%) → RUNNING (5%) → ... → RUNNING (100%) → PROMOTED, with ROLLED_BACK reachable from any RUNNING step. On each step transition, it updates the load balancer weights via the control plane API. Promotion schedule: after step_duration (e.g., 10 minutes) at each traffic percentage, if the analysis engine reports PASS, advance to the next percentage step. If the analysis engine reports FAIL at any step, immediately set the canary weight to 0% and mark the deployment as ROLLED_BACK. Manual override: operators can pause advancement (keep the canary at a percentage indefinitely for investigation) or force rollback.
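
A sketch of that lifecycle as code; set_weight stands in for the control plane API call, and the verdict strings mirror the PASS/FAIL/INCONCLUSIVE results used elsewhere in this design:

```python
import enum

STEPS = [1, 5, 10, 25, 50, 100]  # traffic percentages from the schedule

class Status(enum.Enum):
    NOT_STARTED = "not_started"
    RUNNING = "running"
    PROMOTED = "promoted"
    ROLLED_BACK = "rolled_back"

class TrafficController:
    """Canary lifecycle state machine (sketch)."""

    def __init__(self, set_weight):
        self.set_weight = set_weight  # callback: percentage -> None
        self.status = Status.NOT_STARTED
        self.step = -1
        self.paused = False  # manual override: hold at current step

    def start(self):
        self.status = Status.RUNNING
        self._advance()

    def on_verdict(self, verdict: str):
        """Called after step_duration with the analysis engine's verdict."""
        if self.status is not Status.RUNNING or self.paused:
            return
        if verdict == "FAIL":
            self.set_weight(0)  # immediate rollback: canary gets no traffic
            self.status = Status.ROLLED_BACK
        elif verdict == "PASS":
            self._advance()
        # "INCONCLUSIVE": hold at the current percentage and re-evaluate

    def _advance(self):
        self.step += 1
        pct = STEPS[self.step]
        self.set_weight(pct)
        if pct == 100:
            self.status = Status.PROMOTED

# Usage: tc = TrafficController(set_weight=lambda pct: print(f"-> {pct}%"))
#        tc.start(); tc.on_verdict("PASS")  # advances 1% -> 5%
```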

Statistical Analysis Engine

The analysis engine compares canary metrics against two baselines: (1) the stable version's current metrics (direct comparison) and (2) a historical baseline (same time window from previous days — controls for time-of-day effects). Regression detection uses hypothesis testing. For error rates (counts of binary outcomes): a z-test for proportions compares the canary error rate to the stable error rate. If the difference is statistically significant (p < 0.05) AND practically significant (error rate increased by more than 0.1 percentage points), trigger rollback. For latency percentiles: the Mann-Whitney U test on request duration samples is distribution-free — it doesn't assume a normal distribution, making it suitable for real-world latency distributions (which often have long tails). A Bayesian approach (computing the probability that the canary is worse than stable) is an alternative that avoids p-value misinterpretation.
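
Both tests are a few lines with scipy; the thresholds mirror the numbers above, and the function names are illustrative:

```python
import math
from scipy import stats

def error_rate_regressed(canary_errors: int, canary_total: int,
                         stable_errors: int, stable_total: int,
                         alpha: float = 0.05, min_delta: float = 0.001) -> bool:
    """One-sided z-test for proportions: is the canary error rate higher?
    Requires both statistical significance (p < alpha) and practical
    significance (difference > 0.1 percentage points, i.e. min_delta)."""
    p_c = canary_errors / canary_total
    p_s = stable_errors / stable_total
    pooled = (canary_errors + stable_errors) / (canary_total + stable_total)
    se = math.sqrt(pooled * (1 - pooled) * (1 / canary_total + 1 / stable_total))
    if se == 0:
        return False
    z = (p_c - p_s) / se
    p_value = stats.norm.sf(z)  # P(Z >= z), one-sided
    return p_value < alpha and (p_c - p_s) > min_delta

def latency_regressed(canary_ms: list[float], stable_ms: list[float],
                      alpha: float = 0.05) -> bool:
    """Mann-Whitney U test: are canary latencies stochastically greater
    than stable's? Distribution-free, so long tails are fine."""
    _, p_value = stats.mannwhitneyu(canary_ms, stable_ms, alternative="greater")
    return p_value < alpha
```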

Header-Based Override Routing

Internal testers and stakeholders can force-route their traffic to the canary regardless of the current canary percentage. Implementation: a request header X-Force-Canary: true (set by the developer in their browser via a DevTools extension or Postman) bypasses the weighted routing and always routes to the canary backend pool. The API gateway checks for this header before applying percentage-based routing. Access to set this header is restricted to internal users (verified via JWT claim is_internal: true) to prevent external users from gaming the canary routing.
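
A sketch of the gateway-side check, assuming JWT signature verification already happened upstream and the claims arrive as a plain dict:

```python
import random

def choose_pool(headers: dict, jwt_claims: dict, canary_pct: float) -> str:
    """Gateway routing order: verified internal override first, then
    percentage-based weighted routing. Header and claim names follow
    the text above."""
    if headers.get("X-Force-Canary") == "true" and jwt_claims.get("is_internal"):
        return "canary"
    return "canary" if random.random() * 100 < canary_pct else "stable"
```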

Database Design

PostgreSQL stores canary deployment metadata: canary_deployments (id, service, stable_version, canary_version, status, current_percentage, created_at, promoted_at, rolled_back_at, rollback_reason), canary_steps (deployment_id, percentage, started_at, completed_at, analysis_result: PASS/FAIL/INCONCLUSIVE, metrics_snapshot JSONB). The metrics_snapshot captures the key metrics at each step for post-deployment analysis — useful for understanding the performance characteristics of the new version even when it's fully promoted.

Canary routing configuration (current weights, backend endpoints) is stored in the configuration system (Consul or etcd), not in PostgreSQL. The configuration system is the source of truth for the load balancer's traffic split — PostgreSQL is only for operational metadata and history.
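
With Consul, for example, the traffic controller could publish weights through the KV HTTP API; the key layout below is an assumption:

```python
import json
import requests

def publish_weights(service: str, stable: int, canary: int,
                    consul: str = "http://localhost:8500") -> None:
    """Write the current traffic split to Consul's KV store, which the
    load balancer control plane watches. Key layout is illustrative."""
    key = f"canary/{service}/weights"
    body = json.dumps({"stable": stable, "canary": canary})
    resp = requests.put(f"{consul}/v1/kv/{key}", data=body)
    resp.raise_for_status()
```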

API Design
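
A plausible minimal control surface, matching the lifecycle operations above; all paths and fields are illustrative rather than a fixed contract:

```
POST   /api/v1/canaries                start a canary (service, canary_version, step schedule)
GET    /api/v1/canaries/{id}           current status, percentage, step history
POST   /api/v1/canaries/{id}/pause     manual override: hold at current percentage
POST   /api/v1/canaries/{id}/resume    resume automatic advancement
POST   /api/v1/canaries/{id}/rollback  force immediate rollback to 0%
GET    /api/v1/canaries/{id}/analysis  latest analysis verdicts and metric snapshots
```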

Scaling & Bottlenecks

At ~800 concurrent canary deployments, the analysis engine issues 800 canaries × 10 metrics per 60-second evaluation window ≈ 130 metric queries/sec against Prometheus or ClickHouse. This is modest but can spike if all canaries start simultaneously (e.g., after a fleet-wide deploy event). Mitigation: jitter the analysis evaluation cycles (spread evaluations randomly within each 60-second window) to avoid synchronized query bursts.
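
One simple way to get stable, coordination-free jitter is a hash of the deployment ID; a sketch:

```python
import hashlib

def jittered_offset(deployment_id: str, window_s: int = 60) -> int:
    """Deterministic offset in [0, window_s): each canary evaluates at a
    fixed but evenly distributed point within every 60-second window."""
    digest = hashlib.sha256(deployment_id.encode()).digest()
    return int.from_bytes(digest[:4], "big") % window_s
```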

Load balancer weight updates for ~800 concurrent canaries require frequent control plane updates: each canary advances at most once per step_duration (10 minutes), so the fleet generates ~80 weight updates/minute ≈ 1.3 updates/sec to the load balancer control plane. Envoy's xDS protocol handles incremental updates efficiently — each update only sends the changed route configuration, not the full state.

Key Trade-offs

  • Canary vs. blue-green: Canary is slower (hours for full rollout) but safer — statistical analysis with real traffic catches regressions before full rollout; blue-green is faster (instant cutover) but has a higher blast radius if the new version has bugs
  • Automatic vs. manual promotion: Automatic promotion based on metrics is efficient but may promote a subtly degraded version if the regression is below the detection threshold; manual promotion gives teams full control but requires human attention for every step
  • Statistical significance thresholds: Strict thresholds (p < 0.01) reduce false rollbacks but delay detecting real regressions; loose thresholds (p < 0.1) catch regressions faster but may roll back good versions during noisy traffic periods
  • Sticky vs. non-sticky canary routing: Sticky routing (user always sees same version) provides consistent user experience and better A/B data quality; non-sticky routing is simpler but means a user might see different behavior across requests during the canary period
