System Design: A/B Testing Service
System design of an A/B testing service covering experiment configuration, deterministic user bucketing, feature flag evaluation, statistical analysis, and real-time experiment monitoring.
Requirements
Functional Requirements:
- Create experiments with multiple variants (control + 1-N treatments) and configurable traffic allocation percentages
- Deterministic user bucketing: the same user always sees the same variant (consistent experience across sessions and devices)
- Feature flag evaluation SDK: client-side and server-side SDKs return the assigned variant for a given user and experiment
- Metrics tracking: associate user actions (conversions, revenue, engagement) with experiment variants
- Statistical analysis: compute p-values, confidence intervals, and effect sizes, with sequential testing support for early stopping
- Mutual exclusion and layered experiments: ensure certain experiments do not interfere with each other
Non-Functional Requirements:
- Evaluate 500K feature flag requests/sec with P99 latency under 5ms
- Support 10,000 concurrent experiments across 1,000 customers
- 99.99% availability for flag evaluation (feature flags are in the critical request path)
- Zero flicker: users must never see variant switching on page load
- Experiment results available in near-real-time (within 5 minutes of event occurrence)
Scale Estimation
- Flag evaluations: 500K/sec × 3 experiments per evaluation = 1.5M experiment evaluations/sec
- Config fetches: with local SDK caching (config cached for 60 seconds) and roughly one active SDK instance per request/sec, the backend sees about 500K / 60 ≈ 8,333 config fetches/sec
- Metric events: 500K evaluations/sec × 0.1 events per evaluation (10% of pageviews produce a conversion event) = 50K metric events/sec
- Experiment configurations: 10K experiments × 5KB each = 50MB (easily fits in memory)
- Metric data: 50K events/sec × 200 bytes = 10MB/sec ≈ 864GB/day
- User assignment cache: 100M users × 10 experiments × 20 bytes = 20GB
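The same estimates as a quick, runnable sanity check; every input below is one of the assumptions listed above, not a measured value:

```python
# Back-of-envelope sanity check of the estimates above; all inputs are assumptions.
evals_per_sec = 500_000
experiment_evals_per_sec = evals_per_sec * 3              # 1.5M experiment evaluations/sec

# Roughly one active SDK instance per request/sec, each polling the config every 60s.
config_fetches_per_sec = evals_per_sec / 60               # ~8,333 config fetches/sec

events_per_sec = evals_per_sec * 0.1                      # 50K metric events/sec
metric_gb_per_day = events_per_sec * 200 * 86_400 / 1e9   # ~864 GB/day of raw events

config_mb = 10_000 * 5 / 1_000                            # 50 MB of experiment config
assignment_cache_gb = 100e6 * 10 * 20 / 1e9               # 20 GB of user assignments
```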
High-Level Architecture
The system consists of three layers: the Assignment Layer, the Event Tracking Layer, and the Analysis Layer. The Assignment Layer handles flag evaluation. Client SDKs (JavaScript, iOS, Android) and server-side SDKs (Python, Java, Go, Node.js) embed a lightweight evaluation engine. On initialization, the SDK fetches the experiment configuration (all active experiments, their variants, traffic allocations, and targeting rules) from the Config Service and caches it locally. On each evaluation request, the SDK computes the variant assignment locally using a deterministic hash: hash(experiment_salt + user_id) mod 10000, mapped to traffic allocation buckets. This local evaluation provides sub-millisecond latency with zero network calls.
The Config Service serves experiment configurations. When an experimenter creates or modifies an experiment via the dashboard, the configuration is saved to PostgreSQL and published to a CDN (CloudFront with 30-second TTL). SDKs poll the CDN for config updates every 60 seconds. For server-side SDKs, a streaming connection (SSE or gRPC stream) delivers config updates in real-time (sub-second propagation). The config payload includes experiment metadata, variant definitions, targeting rules (user segment, country, platform), and mutual exclusion group assignments.
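For illustration, a single experiment entry in that config payload might look roughly like the dict below; the field names and values are assumptions for this sketch, not the actual wire format:

```python
# Illustrative shape of one experiment entry in the config payload (field names assumed).
example_experiment = {
    "experiment_id": "exp_checkout_cta",
    "salt": "a1b2c3d4",                                # per-experiment salt for variant hashing
    "status": "running",
    "traffic_pct": 10,
    "bucket_range": [0, 999],                          # Layer 1 buckets owned by this experiment
    "mutual_exclusion_group": "checkout",              # None if the experiment stands alone
    "targeting_rules": {"country": ["US", "CA"], "platform": ["ios", "android"]},
    "variants": [
        {"variant_id": "control",   "weight": 50, "payload": {"cta_text": "Buy now"}},
        {"variant_id": "treatment", "weight": 50, "payload": {"cta_text": "Get it today"}},
    ],
}
```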
The Event Tracking Layer collects metric events from SDKs. Each metric event includes: user_id, experiment_id, variant_id, event_type (impression, conversion, revenue), event_value (for revenue events), and timestamp. Events are sent to an Ingestion API (batched, every 5 seconds or 100 events) and published to Kafka. A Flink stream processor aggregates events into experiment-level metrics in near-real-time.
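A minimal sketch of client-side event batching under the thresholds above (flush every 5 seconds or 100 events); the endpoint and field names are placeholders, and a production SDK would add retries and backpressure:

```python
import json
import threading
import time
import urllib.request

INGEST_URL = "https://ingest.example.com/api/v1/events/batch"  # placeholder endpoint

class EventBuffer:
    """Buffers metric events and flushes every 5 seconds or 100 events, whichever comes first."""

    def __init__(self, flush_interval=5.0, max_batch=100):
        self._events, self._lock = [], threading.Lock()
        self._flush_interval, self._max_batch = flush_interval, max_batch
        threading.Thread(target=self._loop, daemon=True).start()

    def track(self, user_id, experiment_id, variant_id, event_type, event_value=None):
        event = {
            "user_id": user_id, "experiment_id": experiment_id, "variant_id": variant_id,
            "event_type": event_type, "event_value": event_value, "timestamp": time.time(),
        }
        with self._lock:
            self._events.append(event)
            if len(self._events) >= self._max_batch:
                self._flush()

    def _flush(self):
        if not self._events:
            return
        batch, self._events = self._events, []
        req = urllib.request.Request(
            INGEST_URL, data=json.dumps(batch).encode(),
            headers={"Content-Type": "application/json"}, method="POST",
        )
        urllib.request.urlopen(req)  # fire-and-forget in this sketch; a real SDK retries on failure

    def _loop(self):
        while True:
            time.sleep(self._flush_interval)
            with self._lock:
                self._flush()
```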
Core Components
Deterministic Bucketing Algorithm
User assignment uses a two-layer hashing scheme. Layer 1 (traffic allocation): hash(experiment_id + user_id) mod 10000 determines whether the user falls inside the experiment's traffic allocation (e.g., if traffic = 10%, users with hash values 0-999 are included). Layer 2 (variant assignment): hash(experiment_salt + user_id) mod 10000 assigns the user to a specific variant within the experiment. Using a different salt for each experiment ensures that a user's assignment in one experiment is independent of their assignment in another (avoiding systematic correlation). The hash function is MurmurHash3, chosen for its uniform distribution and speed (sub-microsecond). Mutual exclusion groups share Layer 1 hash space: experiments in the same group compute Layer 1 with the group id as the hash key (rather than their own experiment_id), so a given user lands on the same point in the 0-9999 range for every experiment in the group; the group's experiments then claim non-overlapping segments of that range, ensuring no user is in more than one experiment from the group.
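A minimal sketch of both layers under the assumptions above, using the mmh3 MurmurHash3 binding (pip install mmh3); the salt convention, field names, and the use of the group id as the Layer 1 key are illustrative choices, not the service's actual implementation:

```python
import mmh3  # MurmurHash3 bindings; assumed here as the hash implementation

BUCKETS = 10_000

def _bucket(key: str, user_id: str) -> int:
    # Python's % always returns a non-negative value for a positive modulus,
    # so the signed 32-bit hash maps cleanly onto 0..9999.
    return mmh3.hash(f"{key}:{user_id}") % BUCKETS

def assign_variant(user_id, experiment):
    """Returns the variant_id for a user, or None if the user is not enrolled."""
    # Layer 1: traffic allocation. Experiments in a mutual exclusion group hash with
    # the group id so every experiment in the group sees the same Layer 1 bucket for
    # a given user, and the group's experiments claim non-overlapping bucket ranges.
    layer1_key = experiment.get("mutual_exclusion_group") or experiment["experiment_id"]
    layer1 = _bucket(layer1_key, user_id)
    lo, hi = experiment["bucket_range"]        # e.g. [0, 999] for a 10% slice
    if not (lo <= layer1 <= hi):
        return None

    # Layer 2: variant assignment, hashed with the per-experiment salt so assignments
    # across experiments are statistically independent.
    layer2 = _bucket(experiment["salt"], user_id)
    cumulative = 0
    total_weight = sum(v["weight"] for v in experiment["variants"])
    for variant in experiment["variants"]:
        cumulative += variant["weight"] * BUCKETS // total_weight
        if layer2 < cumulative:
            return variant["variant_id"]
    return experiment["variants"][-1]["variant_id"]  # guard against rounding loss
```

Note that reassignment only happens if the salt changes, which is why the salt is fixed when the experiment is created rather than derived from mutable fields.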
Statistical Analysis Engine
The analysis engine computes experiment results using sequential testing (Group Sequential Test design) to enable valid early stopping without inflating false positive rates. For each experiment, the engine computes: conversion rate per variant, absolute and relative lift (treatment vs control), confidence interval (95% default), p-value using a two-sided Z-test (for proportions) or Welch's t-test (for continuous metrics), and sample size adequacy. Sequential testing uses O'Brien-Fleming spending function to allocate alpha across interim analyses: early looks require stronger evidence, while the final analysis uses the full alpha budget. The engine runs on each metric event batch (every 5 minutes), updating experiment results in Redis for dashboard consumption. A guardrail metrics system monitors for degradation in key business metrics (e.g., revenue, latency) and auto-pauses experiments that cause significant regression.
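As a sketch of the fixed-horizon building block, the two-sided two-proportion Z-test below uses only the standard library; the sequential adjustment (comparing the Z statistic against an O'Brien-Fleming boundary instead of a fixed critical value) would sit on top of this and is omitted here:

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_ztest(conv_c, n_c, conv_t, n_t, alpha=0.05):
    """Two-sided Z-test for the conversion rates of control vs treatment.

    conv_c/n_c: conversions and impressions for control; conv_t/n_t: for treatment.
    Returns (absolute_lift, p_value, confidence interval for the lift).
    """
    p_c, p_t = conv_c / n_c, conv_t / n_t
    lift = p_t - p_c

    # Pooled standard error under H0 (equal rates) for the test statistic.
    p_pool = (conv_c + conv_t) / (n_c + n_t)
    se_pooled = sqrt(p_pool * (1 - p_pool) * (1 / n_c + 1 / n_t))
    z = lift / se_pooled
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))

    # Unpooled standard error for the confidence interval around the lift.
    se = sqrt(p_c * (1 - p_c) / n_c + p_t * (1 - p_t) / n_t)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    ci = (lift - z_crit * se, lift + z_crit * se)
    return lift, p_value, ci

# Example: 5.0% vs 5.6% conversion on 100K impressions per arm.
lift, p, ci = two_proportion_ztest(5_000, 100_000, 5_600, 100_000)
```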
Feature Flag SDK Architecture
The SDK is designed for minimal latency and maximum reliability. On initialization, the SDK fetches experiment config from the CDN/Config Service and caches it in memory. Evaluations are 100% local (no network call): the SDK evaluates targeting rules (user attributes match experiment criteria), computes the deterministic bucket assignment, and returns the variant. If the SDK cannot reach the Config Service (network failure), it falls back to a locally cached config (persisted to disk/localStorage). If no config exists (first-ever load), default values are returned. The SDK batches impression events (user saw variant X) and sends them asynchronously, ensuring flag evaluation never blocks the application's critical path. Anti-flicker measures: server-side rendering evaluates flags before HTML generation; client-side SDKs hide content until flag evaluation completes (via a CSS class toggle).
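A condensed sketch of that evaluation path with the fallback order described above (fresh config, then persisted config, then a hard-coded default); it reuses assign_variant from the bucketing sketch earlier, and all names are illustrative:

```python
import json
import os

def targeting_matches(experiment, user_attrs):
    """Trivial stand-in for targeting rule evaluation: every rule key must match."""
    rules = experiment.get("targeting_rules", {})
    return all(user_attrs.get(key) in allowed for key, allowed in rules.items())

class FlagClient:
    """Evaluates flags locally so evaluation never blocks on the network."""

    def __init__(self, fetch_config, cache_path="/tmp/experiments.json"):
        self._cache_path = cache_path
        try:
            self._config = fetch_config()                  # dict keyed by experiment_id
            with open(cache_path, "w") as f:
                json.dump(self._config, f)                 # persist for the next cold start
        except Exception:
            self._config = self._load_persisted()          # network failure: last known config

    def _load_persisted(self):
        if os.path.exists(self._cache_path):
            with open(self._cache_path) as f:
                return json.load(f)
        return None                                        # first-ever load: no config available

    def variant(self, user_id, user_attrs, experiment_id, default="control"):
        if self._config is None:
            return default                                 # no config at all: serve defaults
        experiment = self._config.get(experiment_id)
        if experiment is None or not targeting_matches(experiment, user_attrs):
            return default
        # assign_variant is the deterministic bucketing sketch from the section above.
        return assign_variant(user_id, experiment) or default
```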
Database Design
PostgreSQL stores experiment configuration in three tables:
- experiments (experiment_id UUID PK, customer_id, name, description, status ENUM(draft, running, paused, completed), traffic_pct INT, targeting_rules JSONB, mutual_exclusion_group_id nullable, created_at, started_at, stopped_at)
- variants (variant_id, experiment_id, name, weight INT, is_control BOOLEAN, payload JSONB)
- metrics (metric_id, experiment_id, name, event_type, aggregation ENUM(count, sum, mean), minimum_sample_size INT)
Experiment results are stored in Redis for real-time dashboard access: experiment_results:{experiment_id} (hash with fields per variant: impressions, conversions, revenue_sum, revenue_sum_sq, p_value, confidence_interval, lift). Raw metric events are stored in ClickHouse (a columnar analytics database): events (timestamp, experiment_id, variant_id, user_id, event_type, event_value) partitioned by date. ClickHouse enables fast analytical queries for ad-hoc analysis (e.g., "conversion rate for variant B among iOS users in the US"). The user assignment cache (for debugging "what variant is user X in?") uses Redis: user_assignments:{user_id} → {experiment_id: variant_id}.
API Design
- POST /api/v1/experiments — Create an experiment; body contains name, variants, traffic_pct, targeting_rules, metrics; returns experiment_id
- GET /api/v1/config?customer_id={id} — Fetch all active experiment configs for a customer; served from CDN with 30-second cache
- POST /api/v1/events/batch — Submit metric events; body contains an array of {user_id, experiment_id, variant_id, event_type, event_value, timestamp}
- GET /api/v1/experiments/{experiment_id}/results — Fetch current experiment results with statistical analysis
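For illustration, creating an experiment through the first endpoint might look like the call below; the host, auth header, and exact field names follow the schema and API sketch above and are assumptions, not a documented contract:

```python
import json
import urllib.request

body = {
    "name": "checkout-cta-copy",
    "traffic_pct": 10,
    "targeting_rules": {"country": ["US"], "platform": ["web"]},
    "variants": [
        {"name": "control",   "weight": 50, "is_control": True,  "payload": {"cta_text": "Buy now"}},
        {"name": "treatment", "weight": 50, "is_control": False, "payload": {"cta_text": "Get it today"}},
    ],
    "metrics": [{"name": "checkout_conversion", "event_type": "conversion", "aggregation": "count"}],
}
req = urllib.request.Request(
    "https://api.example.com/api/v1/experiments",      # placeholder host
    data=json.dumps(body).encode(),
    headers={"Content-Type": "application/json", "Authorization": "Bearer <api-key>"},
    method="POST",
)
experiment_id = json.load(urllib.request.urlopen(req))["experiment_id"]
```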
Scaling & Bottlenecks
The flag evaluation path has effectively zero backend load because all evaluations happen locally in the SDK using cached config. The Config Service CDN serves 8,333 config fetches/sec — trivially handled by CloudFront. The bottleneck shifts to the event tracking pipeline: 50K metric events/sec ingested into Kafka and processed by Flink. ClickHouse handles the write throughput (50K inserts/sec is well within its capacity) and provides sub-second analytical queries for experiment results.
The statistical analysis computation is CPU-intensive for experiments with many metrics and segments. Each experiment with 10 metrics and 5 segments requires 50 statistical tests per analysis cycle. With 10K concurrent experiments and 5-minute analysis cycles, the system performs 500K statistical tests per cycle. Distributing across 10 analysis workers (each handling 1K experiments) keeps the cycle time under 30 seconds. Segment-level analysis (breaking down results by country, platform, user cohort) multiplies the computation; pre-defined segments are analyzed automatically, while ad-hoc segments trigger on-demand ClickHouse queries.
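One way to spread that load, sketched below, is to shard experiments to analysis workers with a stable hash of experiment_id; the partitioning scheme is an assumption here, and the arithmetic just restates the numbers above:

```python
import hashlib

NUM_WORKERS = 10

def worker_for(experiment_id: str) -> int:
    """Stable shard assignment so each experiment is always analyzed by the same worker."""
    digest = hashlib.sha1(experiment_id.encode()).hexdigest()
    return int(digest, 16) % NUM_WORKERS

# Load check: 10K experiments * 10 metrics * 5 segments = 500K tests per 5-minute cycle,
# ~50K tests per worker, so each worker needs ~1.7K tests/sec to finish a cycle in 30s.
tests_per_cycle = 10_000 * 10 * 5
tests_per_worker = tests_per_cycle // NUM_WORKERS
required_rate = tests_per_worker / 30       # tests/sec per worker for a 30-second cycle
```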
Key Trade-offs
- Client-side evaluation vs server-side evaluation: Client-side evaluation provides sub-millisecond latency and zero backend dependency, but requires shipping experiment configs to the client (potential information leak of experiment names/variants) — config obfuscation and targeting rule evaluation server-side mitigate this
- Sequential testing vs fixed-horizon testing: Sequential testing allows valid early stopping (saving time and traffic), but requires a more complex statistical framework and produces slightly wider confidence intervals at the final analysis — the time savings justify the complexity
- MurmurHash3 vs cryptographic hash for bucketing: MurmurHash3 is 100x faster with excellent uniformity for bucketing purposes; cryptographic hashes would be overkill and add unnecessary latency — the determinism and uniformity properties are sufficient
- CDN-cached config vs WebSocket real-time push: CDN caching adds up to 30 seconds of delay for config changes but provides extreme reliability and scalability (no persistent connections) — real-time push via SSE is offered as an option for server-side SDKs where connection management is simpler