
System Design: A/B Testing Platform

Design a large-scale A/B testing platform that enables controlled experiments across product features and ML models with statistical rigor, real-time metric tracking, and automated significance detection. Covers assignment, metric collection, and analysis pipelines.

Requirements

Functional Requirements:

  • Create experiments with multiple variants, traffic allocation percentages, and targeting rules
  • Consistently assign users to variants: same user always gets the same variant (sticky assignment)
  • Collect experiment metrics: primary goals, guardrail metrics, and diagnostic metrics
  • Compute statistical significance automatically using frequentist and Bayesian methods
  • Support mutual exclusion: users enrolled in Experiment A are excluded from Experiment B in the same layer
  • Provide experiment results dashboard with confidence intervals, p-values, and sample size progress

Non-Functional Requirements:

  • Assignment latency under 2ms (added to every user request in the critical path)
  • Support 10,000 concurrent experiments across all products
  • Process 1 billion metric events per day for analysis
  • Statistical results updated every 15 minutes with latest metric data
  • 99.99% availability for the assignment service

Scale Estimation

With 100 million daily active users each making 10 requests per day, assignment is called 1 billion times/day ≈ 11,500 requests/second on average. Each call returns variant assignments for all active experiments the user is enrolled in. Although 10,000 experiments may be active, a typical user is enrolled in only about 5 at once, so the assignment response payload stays small (~200 bytes). Metric events: 1 billion/day ≈ 11,500 events/second at a 500-byte average = 5.75 MB/s ingest.
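
A quick back-of-envelope check of these figures, as plain Python using only the numbers stated above:

```python
# Back-of-envelope check of the scale estimates above.
DAU = 100_000_000            # daily active users
REQS_PER_USER = 10           # assignment calls per user per day
SECONDS_PER_DAY = 86_400

calls_per_day = DAU * REQS_PER_USER                    # 1 billion
avg_qps = calls_per_day / SECONDS_PER_DAY
print(f"assignment QPS (average): {avg_qps:,.0f}")     # ~11,574

EVENTS_PER_DAY = 1_000_000_000
AVG_EVENT_BYTES = 500
ingest_mb_s = EVENTS_PER_DAY / SECONDS_PER_DAY * AVG_EVENT_BYTES / 1e6
print(f"metric ingest: {ingest_mb_s:.2f} MB/s")        # ~5.79 MB/s
```

These are daily averages; capacity planning would provision for peak load above them.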

High-Level Architecture

The platform has three subsystems: Assignment Service, Event Collection, and Analysis Engine. The Assignment Service handles the critical path: for every user request, it determines which experiments the user is eligible for, performs consistent hashing to assign them to a variant, and returns the assignment map. Assignments are cached in the client (browser/mobile) and server-side (CDN edge) to minimize latency.

Event Collection receives metric events (page views, clicks, purchases, errors) with experiment metadata (user_id, experiment_id, variant_id) and routes them to Kafka. A Flink job aggregates metrics by (experiment_id, variant_id, metric_name) in real-time, updating running statistics in a metrics store. The Analysis Engine runs statistical tests on a schedule (every 15 minutes) and publishes results to the experiment dashboard.

Experiment configuration is stored in a low-latency config store (etcd or DynamoDB) with a local SDK cache refreshed every 30 seconds. The SDK evaluates targeting rules (user attributes, device type, region) client-side to determine eligibility, then uses consistent hashing (MurmurHash on user_id + experiment_id) to assign the variant. This design requires no network call for assignment after the initial config fetch.

Core Components

Assignment Service & SDK

The assignment SDK runs in every application server. On initialization, it fetches the active experiment configuration (JSON blob listing all experiments, targeting rules, and traffic allocations) from a CDN-cached endpoint. For each user request, it: (1) filters experiments by targeting rules, (2) hashes (user_id + experiment_id) mod 10,000 to get a bucket number, (3) maps bucket to variant based on traffic allocation ranges, (4) logs the assignment event to a local buffer flushed to Kafka every second. The SDK is thread-safe and adds under 100 microseconds of latency per request.
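
A minimal sketch of the bucketing logic in steps (2) and (3), assuming the mmh3 MurmurHash package; the config shape shown in the docstring is illustrative, not the platform's actual schema:

```python
import mmh3  # MurmurHash3 bindings; assumed dependency

NUM_BUCKETS = 10_000

def assign_variant(user_id: str, experiment: dict) -> str | None:
    """Deterministically map a user to a variant.

    `experiment` uses a hypothetical shape, e.g.
    {"id": "exp_42", "allocations": [("control", 5000), ("treatment", 5000)]}
    where each allocation is (variant_name, bucket_count) out of
    NUM_BUCKETS; unallocated buckets mean the user is not enrolled.
    """
    # Hashing user_id + experiment_id makes assignment sticky per user
    # and independent across experiments.
    bucket = mmh3.hash(user_id + experiment["id"], signed=False) % NUM_BUCKETS

    # Map the bucket onto contiguous traffic-allocation ranges.
    lower = 0
    for variant, size in experiment["allocations"]:
        if lower <= bucket < lower + size:
            return variant
        lower += size
    return None  # bucket outside allocated traffic: not enrolled
```

Because the hash is deterministic, stickiness needs no stored state; the Redis assignment cache described under Database Design can then serve purely as an optimization.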

Metric Aggregation Pipeline

Kafka receives metric events tagged with experiment and variant context. A Flink job maintains per-(experiment, variant, metric) running aggregates: sum, count, sum_of_squares (for variance calculation), and percentile sketches (T-Digest for latency metrics). These aggregates are checkpointed to S3 every 5 minutes and the latest values written to a ClickHouse table for the Analysis Engine to query.
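
The aggregates themselves are just sufficient statistics, so the state per (experiment, variant, metric) cell is tiny. A plain-Python sketch of what the Flink job keeps in keyed state (the real job would hold this in the RocksDB backend and add a T-Digest for percentiles):

```python
from dataclasses import dataclass

@dataclass
class RunningAggregate:
    """Sufficient statistics for one (experiment, variant, metric) cell."""
    count: int = 0
    total: float = 0.0
    sum_sq: float = 0.0

    def add(self, value: float) -> None:
        # O(1) update per event; mergeable across partitions/checkpoints.
        self.count += 1
        self.total += value
        self.sum_sq += value * value

    @property
    def mean(self) -> float:
        return self.total / self.count

    @property
    def variance(self) -> float:
        # Sample variance recovered from the sums:
        # Var = (sum_sq - n * mean^2) / (n - 1)
        return (self.sum_sq - self.count * self.mean ** 2) / (self.count - 1)
```

One caveat: the sum-of-squares formula can lose precision for very large counts with small variance; Welford's online algorithm is the numerically stable alternative.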

Statistical Analysis Engine

Every 15 minutes, the Analysis Engine fetches the latest metric aggregates for all running experiments and computes: (1) frequentist two-sample t-test with Welch's correction for unequal variance, producing p-values and 95% confidence intervals; (2) Bayesian posterior update giving the probability that the treatment effect is positive; (3) sequential testing correction (e-values or alpha-spending functions) to prevent inflated false positive rates from peeking at results before the experiment completes. Results are published to the dashboard and trigger Slack notifications when significance is reached.
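
The stored aggregates are all the t-test needs; no raw events are re-read. A sketch of the Welch computation from per-variant (mean, variance, count), using only the standard library (at experiment-scale sample sizes the normal approximation to Welch's t is effectively exact; an exact small-sample p-value would use SciPy's t-distribution instead):

```python
from math import sqrt
from statistics import NormalDist

def welch_test(mean_a: float, var_a: float, n_a: int,
               mean_b: float, var_b: float, n_b: int):
    """Two-sample Welch test from per-variant aggregates.

    Returns (effect, z, two-sided p-value, 95% confidence interval).
    """
    effect = mean_b - mean_a
    se = sqrt(var_a / n_a + var_b / n_b)       # Welch standard error
    z = effect / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    ci_95 = (effect - 1.96 * se, effect + 1.96 * se)
    return effect, z, p_value, ci_95
```

The raw p-value above is exactly what the sequential correction protects: publishing it every 15 minutes without e-values or alpha-spending would inflate the false positive rate through repeated peeking.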

Database Design

  • PostgreSQL for experiment metadata: experiments (id, name, owner, status, start_date, end_date, primary_metric, guardrail_metrics JSON, layer_id); variants (id, experiment_id, name, traffic_percent, config_json); targeting_rules (id, experiment_id, rule_expression).
  • ClickHouse for metric aggregates: (experiment_id, variant_id, metric_name, window_start DateTime, sum Float64, count UInt64, sum_sq Float64), partitioned by window_start.
  • Redis for the assignment cache: key assign:{user_id}:{experiment_id} → variant_id, with TTL = experiment end date.

API Design

  • GET /sdk/config — Return all active experiment configurations for SDK initialization (CDN-cached, refreshed every 30 seconds).
  • POST /metrics/events — Batch ingest of metric events with experiment context; accepts up to 1,000 events per call.
  • GET /experiments/{experiment_id}/results — Return statistical results for all variants: means, confidence intervals, p-values, and significance status.
  • POST /experiments/{experiment_id}/stop — Stop an experiment and lock variant assignments (stop new enrollments but continue metric collection for 24 hours for lagged conversions).
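
A sketch of a client batching events into POST /metrics/events, assuming the requests package; the payload field names are illustrative rather than a confirmed schema:

```python
import requests  # assumed HTTP client dependency

# Hypothetical event shape; field names are illustrative.
events = [
    {
        "user_id": "u_123",
        "experiment_id": "exp_42",
        "variant_id": "treatment",
        "metric_name": "purchase_value",
        "value": 29.99,
        "timestamp": "2025-01-15T12:00:00Z",
    },
]

# The endpoint accepts up to 1,000 events per call, so send in chunks.
for i in range(0, len(events), 1000):
    resp = requests.post(
        "https://experiments.example.com/metrics/events",  # placeholder host
        json={"events": events[i : i + 1000]},
        timeout=5,
    )
    resp.raise_for_status()
```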

Scaling & Bottlenecks

The assignment path's main bottleneck is experiment config distribution at ~11,500 requests/second. Serving the config as CDN-edge-cached JSON eliminates a centralized assignment server entirely: the SDK runs the assignment logic locally after fetching the config. Config size grows with the number of experiments; gzip-compressing the JSON blob (roughly 10x compression) and including only the experiments relevant to the user's platform reduces the SDK config from 1 MB to under 50 KB.
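
A sketch of both size reductions, assuming a hypothetical per-experiment platforms field in the config:

```python
import gzip
import json

def build_sdk_config(experiments: list[dict], platform: str) -> bytes:
    """Filter the config to one platform, then gzip it for the CDN edge.

    `platforms` is a hypothetical config field, e.g.
    {"id": "exp_42", "platforms": ["ios", "web"], ...}.
    """
    relevant = [e for e in experiments if platform in e.get("platforms", [])]
    raw = json.dumps(relevant, separators=(",", ":")).encode("utf-8")
    return gzip.compress(raw)  # repetitive JSON typically compresses ~10x
```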

Metric event processing at 1 billion events/day requires Kafka partitioned by (experiment_id, variant_id) so that each aggregation cell's state stays co-located. With 100 Kafka partitions consumed by 100 Flink subtasks (Kafka permits at most one consumer per partition within a group), each subtask can sustain roughly 10,000 events/second of stateful aggregation against the RocksDB state backend, which leaves comfortable headroom over the ~11,500 events/second average load. Metric storage compaction: downsampling 1-minute aggregates to 1-hour aggregates after 30 days reduces long-term storage by 60x.
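
A sketch of the producer side of that partitioning, assuming the kafka-python package; keying each event by experiment and variant routes every event for one aggregation cell to the same partition, and therefore to the same Flink subtask:

```python
import json
from kafka import KafkaProducer  # assumed dependency (kafka-python)

producer = KafkaProducer(
    bootstrap_servers=["kafka:9092"],           # placeholder brokers
    key_serializer=lambda k: k.encode("utf-8"),
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def publish_metric_event(event: dict) -> None:
    # The default partitioner hashes the key, so a fixed
    # (experiment_id, variant_id) key pins the cell to one partition.
    key = f'{event["experiment_id"]}:{event["variant_id"]}'
    producer.send("metric-events", key=key, value=event)
```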

Key Trade-offs

  • Client-side vs. server-side assignment: Client-side assignment (SDK) eliminates network latency but creates assignment consistency challenges for multi-device users; server-side assignment is consistent across devices but adds 1–5ms latency on every request.
  • Frequentist vs. Bayesian analysis: Frequentist p-values are widely understood but require pre-specified sample sizes to control false positives; Bayesian methods enable continuous monitoring without inflated Type I error rates but are harder to communicate to non-statisticians.
  • Per-experiment vs. global holdout: Per-experiment holdouts measure the effect of individual features; a global holdout (10% of users excluded from all experiments) measures the combined effect of all experiments running simultaneously.
  • Strict mutual exclusion vs. overlapping experiments: Strict mutual exclusion eliminates interaction effects but limits the number of simultaneous experiments; overlapping experiments maximize throughput but risk interaction effects that confound results.
