System Design: Feature Flag System
Design a feature flag system like LaunchDarkly that supports percentage rollouts, user targeting, A/B testing, and sub-millisecond flag evaluation for high-traffic applications.
Requirements
Functional Requirements:
- Create and manage feature flags with on/off toggles and multi-variant support
- Target flags to specific users, segments, or percentage rollouts
- Support gradual rollouts: enable for X% of users, ramping up over time
- Evaluate flags server-side and client-side (browser/mobile SDK)
- Track flag evaluation events for analytics and experimentation
- Emergency kill switch: disable a flag globally in under 1 second
Non-Functional Requirements:
- Flag evaluation latency under 1ms (local evaluation, not a network call)
- 99.999% availability — flags must evaluate even if the flag management backend is down
- Support 100 billion flag evaluations per day across all customers
- Flag configuration changes propagate to all SDKs within 200ms globally
Scale Estimation
100 billion evaluations/day = 1.16 million evaluations/sec. Each evaluation is a local in-memory operation in the SDK — no network call. The bottleneck is configuration streaming: when a flag changes, the update must reach all connected SDK instances within 200ms. With 10 million active SDK instances globally, the streaming fan-out requires careful architecture. Flag configs are small (~1 KB per flag, ~10,000 flags per customer = 10 MB total config per customer). Event tracking: 1.16M evaluations/sec × 200 bytes/event = 232 MB/sec of event data.
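A quick back-of-the-envelope check of these figures (the 200-byte event size and 1 KB flag size are the estimates used above):

```python
# Sanity-check the scale estimates above.
EVALS_PER_DAY = 100e9
SECONDS_PER_DAY = 86_400

evals_per_sec = EVALS_PER_DAY / SECONDS_PER_DAY            # ~1.16M evaluations/sec
event_bytes = 200                                           # assumed size of one tracking event
event_throughput_mb = evals_per_sec * event_bytes / 1e6     # ~232 MB/sec of event data

flags_per_customer = 10_000
bytes_per_flag = 1_000                                      # ~1 KB per flag config
config_per_customer_mb = flags_per_customer * bytes_per_flag / 1e6  # ~10 MB

print(f"{evals_per_sec:,.0f} evals/sec, {event_throughput_mb:.0f} MB/sec events, "
      f"{config_per_customer_mb:.0f} MB config per customer")
```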
High-Level Architecture
The system has two planes: a management plane (flag CRUD, targeting rules, audit logs) and a data plane (config delivery and event ingestion). The management plane is a standard web application backed by PostgreSQL — correctness matters more than raw throughput here. The data plane handles the scale-sensitive operations.
SDKs maintain a local in-memory replica of all flag configurations for their environment. On startup, they fetch the full config snapshot via REST, then subscribe to a streaming connection (Server-Sent Events or WebSocket) for incremental updates. All flag evaluations happen locally against this snapshot — zero network latency. The streaming layer is a pub/sub system (Redis Streams or a dedicated streaming service) that fans out config changes to all connected SDK instances.
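A minimal sketch of that bootstrap flow, assuming hypothetical `/sdk/config` (full snapshot) and `/sdk/stream` (SSE delta) endpoints and the `requests` library as the HTTP client; a production SDK would add retries, backoff, and payload validation:

```python
import json
import threading
import requests  # assumed HTTP client; any client with streaming support works

class FlagStore:
    """In-memory replica of all flag configs for one environment."""
    def __init__(self):
        self._flags = {}
        self._lock = threading.Lock()

    def replace_all(self, flags: dict):
        with self._lock:
            self._flags = dict(flags)

    def apply_delta(self, flag_key: str, config: dict):
        # Deltas carry only the changed flag; apply atomically under the lock.
        with self._lock:
            self._flags[flag_key] = config

    def get(self, flag_key: str):
        with self._lock:
            return self._flags.get(flag_key)

def bootstrap(base_url: str, sdk_key: str, store: FlagStore):
    headers = {"Authorization": sdk_key}
    # 1. Fetch the full config snapshot via REST (hypothetical /sdk/config endpoint).
    snapshot = requests.get(f"{base_url}/sdk/config", headers=headers).json()
    store.replace_all(snapshot["flags"])

    # 2. Subscribe to the SSE stream for incremental updates (hypothetical /sdk/stream).
    def listen():
        with requests.get(f"{base_url}/sdk/stream", headers=headers, stream=True) as resp:
            for line in resp.iter_lines():
                if line.startswith(b"data:"):
                    delta = json.loads(line[len(b"data:"):])
                    store.apply_delta(delta["key"], delta["config"])

    threading.Thread(target=listen, daemon=True).start()
```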
For mobile/browser SDKs where persistent connections are impractical, polling with ETags is used: clients poll every 30 seconds, but the response is a 304 Not Modified if nothing changed, minimizing bandwidth. A CDN caches the full config payload with short TTLs (30 seconds) to absorb polling traffic.
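The polling path reduces to a conditional GET loop. A sketch, reusing the FlagStore from the previous snippet and the same hypothetical `/sdk/config` endpoint:

```python
import time
import requests  # assumed HTTP client

def poll_config(base_url: str, sdk_key: str, store, interval_s: int = 30):
    etag = None
    while True:
        headers = {"Authorization": sdk_key}
        if etag:
            headers["If-None-Match"] = etag   # "has anything changed since this version?"
        resp = requests.get(f"{base_url}/sdk/config", headers=headers)
        if resp.status_code == 200:
            store.replace_all(resp.json()["flags"])
            etag = resp.headers.get("ETag")
        # On 304 Not Modified there is nothing to do; the local snapshot is current.
        time.sleep(interval_s)
```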
Core Components
Flag Evaluation Engine
The evaluation engine runs entirely in-process within the application SDK. It accepts a flag key plus an evaluation context (user ID, attributes like country, plan, email) and returns the flag's value. Evaluation logic: (1) check whether the flag is enabled — if not, return the off variation immediately (this is also the kill-switch path); (2) evaluate targeting rules in order — if the user matches a rule's conditions, return that rule's variation; (3) if no rule matches, apply the percentage rollout using consistent hashing (MurmurHash of user_id + flag_key, taken mod 100) to determine which bucket the user falls into; (4) otherwise return the default variation. Consistent hashing ensures the same user always gets the same variation across evaluations and SDK instances.
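A condensed sketch of that evaluation order. SHA-256 stands in for MurmurHash to keep the example dependency-free, and the flag/rule shapes are illustrative, not a canonical schema:

```python
import hashlib

def bucket(user_id: str, flag_key: str) -> int:
    """Deterministic 0-99 bucket; the same user always lands in the same bucket for a flag."""
    digest = hashlib.sha256(f"{user_id}:{flag_key}".encode()).hexdigest()
    return int(digest, 16) % 100

def evaluate(flag: dict, context: dict):
    # (1) Disabled flag / kill switch: return the off variation.
    if not flag["enabled"]:
        return flag["variations"][flag["off_variation"]]

    # (2) Targeting rules, evaluated in order; first match wins.
    #     (Equality-only matching here; richer operators are sketched in the next section.)
    for rule in flag.get("targeting_rules", []):
        if all(context.get(attr) == value for attr, value in rule["conditions"].items()):
            return flag["variations"][rule["variation"]]

    # (3) Percentage rollout via consistent hashing of user_id + flag_key.
    rollout = flag.get("percentage_rollout")
    if rollout is not None and bucket(context["user_id"], flag["key"]) < rollout:
        return flag["variations"][flag["rollout_variation"]]

    # (4) Fall back to the default variation.
    return flag["variations"][flag["default_variation"]]

# Illustrative usage: a 25% rollout with a US-only targeting rule.
flag = {
    "key": "new-checkout", "enabled": True,
    "variations": {"on": True, "off": False},
    "targeting_rules": [{"conditions": {"country": "US"}, "variation": "on"}],
    "percentage_rollout": 25, "rollout_variation": "on",
    "default_variation": "off", "off_variation": "off",
}
evaluate(flag, {"user_id": "u-123", "country": "DE"})  # on for ~25% of users, off otherwise
```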
Targeting Rules Engine
Rules are evaluated as an ordered list of conditions. Each condition compares a context attribute (e.g., user.country == "US") or membership in a segment (a pre-computed list of user IDs or attribute-based rules). Segments with millions of users are represented as Bloom filters in the SDK to avoid shipping full user lists. Rule conditions support operators: string match, regex, semver comparison (for app version targeting), numeric range, and date comparison.
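One way such conditions could be dispatched on operator type — a sketch with illustrative operator names and a naive tuple-based semver compare, not any particular vendor's rule format:

```python
import re
from datetime import datetime

def semver(v: str):
    return tuple(int(p) for p in v.split("."))

# Operator name -> predicate(attribute_value, condition_value). Names are illustrative.
OPERATORS = {
    "eq":         lambda a, b: a == b,
    "regex":      lambda a, b: re.search(b, a) is not None,
    "semver_gte": lambda a, b: semver(a) >= semver(b),
    "in_range":   lambda a, b: b[0] <= a <= b[1],
    "before":     lambda a, b: datetime.fromisoformat(a) < datetime.fromisoformat(b),
}

def condition_matches(condition: dict, context: dict) -> bool:
    value = context.get(condition["attribute"])
    if value is None:
        return False
    return OPERATORS[condition["op"]](value, condition["value"])

# Example: target users in the US on app version >= 2.3.0.
rule = [
    {"attribute": "country", "op": "eq", "value": "US"},
    {"attribute": "app_version", "op": "semver_gte", "value": "2.3.0"},
]
ctx = {"country": "US", "app_version": "2.4.1"}
print(all(condition_matches(c, ctx) for c in rule))  # True
```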
Streaming Delivery Service
When a flag is updated in the management plane, a change event is published to a message queue (Kafka). Streaming workers consume the queue, look up which SDK instances are subscribed to that environment, and push the delta (just the changed flag, not the full config). SDK instances apply the delta to their local snapshot atomically. The streaming service is horizontally scaled — each worker handles a subset of customer environments. Connection state (which SDK instances are connected to which worker) is tracked in Redis with TTL-based expiry for disconnected clients.
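A simplified relay-worker loop under these assumptions: the kafka-python client as the consumer, an in-memory registry of live connections keyed by environment (standing in for the Redis-backed connection state), and connection objects assumed to expose a send(text) method over SSE/WebSocket:

```python
import json
from collections import defaultdict
from kafka import KafkaConsumer  # assumed client library (kafka-python)

# environment_id -> set of live SDK connections handled by *this* worker.
# Each connection object is assumed to expose send(text) over SSE/WebSocket.
connections = defaultdict(set)

def run_relay_worker():
    consumer = KafkaConsumer(
        "flag-changes",
        bootstrap_servers="kafka:9092",
        value_deserializer=lambda raw: json.loads(raw.decode()),
    )
    for message in consumer:
        change = message.value  # e.g. {"environment_id": ..., "key": ..., "config": ...}
        delta = json.dumps({"key": change["key"], "config": change["config"]})
        # Fan the delta out only to SDKs subscribed to this environment.
        for conn in list(connections[change["environment_id"]]):
            try:
                conn.send(delta)
            except ConnectionError:
                connections[change["environment_id"]].discard(conn)
```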
Database Design
PostgreSQL stores the canonical flag configuration: flags (id, key, name, environment_id, variations JSONB, targeting_rules JSONB, percentage_rollout, enabled), environments (id, project_id, name, sdk_key), segments (id, name, rules JSONB). The JSONB columns store targeting rule trees — flexible enough for complex conditions without schema migrations.
Evaluation events are not written to PostgreSQL. They're buffered in SDKs (flushed every 30 seconds or 1000 events), sent to an ingestion API, written to Kafka, and consumed by a stream processor (Flink) that computes aggregates (impression counts, conversion rates per variation) stored in ClickHouse for analytics queries.
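A sketch of the SDK-side event buffer with the flush policy described above (every 30 seconds or 1,000 events, whichever comes first); the `/events` ingestion endpoint is illustrative:

```python
import threading
import time
import requests  # assumed HTTP client

class EventBuffer:
    """Buffers evaluation events in the SDK and flushes them to the ingestion API in batches."""
    def __init__(self, ingest_url: str, max_events: int = 1000, flush_interval_s: int = 30):
        self.ingest_url = ingest_url
        self.max_events = max_events
        self.flush_interval_s = flush_interval_s
        self._events = []
        self._lock = threading.Lock()
        threading.Thread(target=self._flush_loop, daemon=True).start()

    def record(self, flag_key: str, variation: str, user_id: str):
        with self._lock:
            self._events.append({
                "flag": flag_key, "variation": variation,
                "user": user_id, "ts": time.time(),
            })
            full = len(self._events) >= self.max_events
        if full:
            self.flush()

    def flush(self):
        with self._lock:
            batch, self._events = self._events, []
        if batch:
            # Fire-and-forget; the ingestion API acks as soon as it has written to Kafka.
            requests.post(self.ingest_url, json=batch, timeout=2)

    def _flush_loop(self):
        while True:
            time.sleep(self.flush_interval_s)
            self.flush()
```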
API Design
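The two planes suggest a small REST surface plus the streaming endpoint. A representative set of endpoints, consistent with the components above (paths are illustrative, not canonical):
- POST /api/v1/projects/{project}/flags — create a flag; PATCH /api/v1/projects/{project}/flags/{key} — update targeting rules, rollout percentage, or the enabled toggle (the kill switch is a PATCH setting enabled=false)
- GET /api/v1/projects/{project}/environments/{env}/flags — management-plane read of all flags in an environment
- GET /sdk/config — full config snapshot for one environment, authenticated by SDK key, ETag-enabled for polling clients
- GET /sdk/stream — SSE stream of flag deltas for connected server SDKs
- POST /events — batched evaluation events flushed by SDKs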
Scaling & Bottlenecks
The streaming fan-out is the hardest scaling problem. A single flag change in a large customer account triggers updates to millions of SDK instances. A naive broadcast overwhelms the streaming layer. Solution: hierarchical fan-out — a small number of relay nodes subscribe to Kafka and each maintain connections to a subset of SDK instances. This creates a two-level tree: Kafka → relay nodes → SDK instances. Relay nodes are stateless (connection state in Redis) and horizontally scalable.
Event ingestion is the other bottleneck. SDKs batch events and send them to a fleet of stateless ingestion servers that write to Kafka. The ingestion path is async — events are acknowledged immediately upon Kafka write, before any processing. This decouples ingestion throughput from analytics processing speed.
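A sketch of one stateless ingestion handler under these assumptions: Flask for the HTTP layer and kafka-python for the producer. The batch is acknowledged as soon as Kafka has accepted the write, before any analytics processing runs:

```python
import json
from flask import Flask, request, jsonify  # assumed web framework
from kafka import KafkaProducer            # assumed client library (kafka-python)

app = Flask(__name__)
producer = KafkaProducer(
    bootstrap_servers="kafka:9092",
    value_serializer=lambda event: json.dumps(event).encode(),
)

@app.post("/events")
def ingest_events():
    # Each request body is a batch of evaluation events flushed by an SDK.
    for event in request.get_json():
        producer.send("evaluation-events", value=event)
    producer.flush()  # block only until Kafka has the batch, then ack
    return jsonify({"accepted": True}), 202
```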
Key Trade-offs
- Local evaluation vs. remote evaluation: Local evaluation gives sub-millisecond latency and works offline, but requires shipping all flag configs to the SDK — a privacy concern if configs contain user segment data
- SSE vs. WebSocket for streaming: SSE is simpler (HTTP, auto-reconnect, works through proxies) but unidirectional; WebSocket is bidirectional but adds complexity; SSE is preferred for config delivery
- Percentage rollout consistency: Using user_id + flag_key as the hash seed ensures sticky bucketing (same user always gets same variant) but means changing a flag key resets all user assignments
- Evaluation event sampling: At 1M+ evaluations/sec, logging every event is expensive; sampling (log 1% of evaluations) reduces cost but introduces statistical uncertainty in experiment results