System Design: Waitlist System (High-Concurrency)
Design a high-concurrency waitlist system for events, products, or services where demand exceeds supply. Covers fair queue ordering, atomic position management, promotion workflows, and anti-bot measures with idempotent processing.
Requirements
Functional Requirements:
- Users join a waitlist for a resource (event ticket, product drop, appointment slot)
- Assign a fair queue position — first-come-first-served with anti-gaming measures
- When a slot opens, automatically promote the front of the queue with a time-limited offer (15 minutes to claim)
- If the promoted user doesn't claim within the time limit, promote the next in line
- Show users their current position and estimated wait time
- Support multiple waitlist types: event-specific, product SKU-specific, and appointment-type
Non-Functional Requirements:
- Handle 500k simultaneous join attempts at waitlist open
- Queue position assignment must be atomic — no two users can receive the same position
- Position promotion must be exactly-once — a slot cannot be offered to two users simultaneously
- Estimated wait time accuracy within ±10 positions
- 99.9% uptime; users waiting expect reliable status updates
Scale Estimation
For a major product drop (limited sneaker release): 500k users attempting to join in the first 60 seconds ≈ 8,333 joins/second. Available inventory: 10,000 units. With a 15-minute claim window and ~30% abandonment, each unit takes on average 1/0.7 ≈ 1.43 offers to claim, so roughly 14,300 offers in total; every abandoned offer adds up to 15 minutes of delay for that slot, stretching the queue drain over several hours. Position polling: 500k users × 1 poll/minute ≈ 8,333 reads/second sustained. A WebSocket alternative reduces this to a one-time subscribe plus server push, at the cost of ~500k active connections.
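The arithmetic above can be checked directly (all inputs are the assumptions stated in the estimate, not measured values):

```python
# Back-of-envelope numbers for the sneaker-drop scenario.
joiners, window_s = 500_000, 60
inventory, abandon_rate = 10_000, 0.30

join_rate = joiners / window_s                  # joins/second at waitlist open
offers_total = inventory / (1 - abandon_rate)   # expected offers to fill all units
poll_rate = joiners / 60                        # 1 poll per user per minute

print(round(join_rate))      # joins/s
print(round(offers_total))   # total offers including re-offers
print(round(poll_rate))      # position reads/s
```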
High-Level Architecture
The system is composed of a Queue Service, a Promotion Engine, and a Status Service. The Queue Service handles join operations atomically using Redis. Queue position assignment uses Redis ZADD with a score that combines a timestamp (for fair ordering) and a small random jitter (so that requests arriving within the same millisecond are tie-broken randomly rather than by network proximity to the server). A Lua script ensures the join is atomic: check whether the user is already in the queue; if not, ZADD with the computed score; return the assigned position.
The Promotion Engine monitors inventory availability events from Kafka (cancellations, new releases, appointment slot openings). On receiving a slot-opened event, it uses a distributed lock (Redis SETNX) to claim promotion rights, reads the front of the queue (ZRANGE WITHSCORES limit 1), creates a time-limited offer record, and sends the user a notification. The lock prevents two Promotion Engine instances from promoting the same slot simultaneously — exactly-once delivery is critical.
The Status Service answers position queries. For users in positions 1-100 (close to the front), it provides real-time position updates via WebSocket. For users in positions 100+, it provides polling-based status with cached position data refreshed every 30 seconds. This tiered approach avoids WebSocket connection overhead for users who won't be promoted soon while giving front-of-queue users responsive updates.
Core Components
Queue Service (Redis-backed)
The core data structure is a Redis Sorted Set: waitlist:{resource_id} → members are user_id, scores are timestamp_ms * 1000 + jitter, where jitter is a random integer in [0, 999]. Multiplying the millisecond timestamp by 1000 leaves room for the jitter without disturbing cross-millisecond ordering. Join is a Lua script: if ZSCORE(key, userId) is not nil, return the existing position (idempotent re-join); otherwise compute score = now_ms * 1000 + jitter, ZADD(key, score, userId), and return ZRANK(key, userId) + 1. Position is 1-indexed. The set is backed by a PostgreSQL waitlist_members table as the durable store; Redis is rebuilt from Postgres on cold start. A batch sync from Redis to Postgres runs every 30 seconds.
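A minimal in-memory sketch of this join logic, with pure Python standing in for the Redis sorted set and Lua script (the class and method names are illustrative, not a real client API):

```python
import bisect
import random
import time

class Waitlist:
    """In-memory model of the join path: score = ms_timestamp * 1000 + jitter,
    where jitter in [0, 999] randomizes ties within the same millisecond."""

    def __init__(self):
        self.scores = {}    # user_id -> score (mirrors ZSCORE)
        self.ordered = []   # sorted list of (score, user_id) (mirrors the zset)

    def join(self, user_id, now_ms=None):
        if user_id in self.scores:                # idempotent: already in queue
            return self.position(user_id)
        now_ms = int(time.time() * 1000) if now_ms is None else now_ms
        score = now_ms * 1000 + random.randint(0, 999)
        self.scores[user_id] = score
        bisect.insort(self.ordered, (score, user_id))
        return self.position(user_id)

    def position(self, user_id):
        # 1-indexed, mirrors ZRANK + 1. Linear scan here; Redis does O(log n).
        return self.ordered.index((self.scores[user_id], user_id)) + 1
```

Passing an explicit now_ms makes ordering deterministic in tests; in production the score comes from the server clock inside the atomic script, never from the client.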
Promotion Engine
Consumes slot-availability events from Kafka. Each event includes resource_id and slot_id. The engine: acquires a Redis lock promote_lock:{resource_id}:{slot_id} with 30-second TTL → reads the front user from the sorted set → creates an offer record in PostgreSQL: {offer_id, resource_id, slot_id, user_id, offered_at, expires_at, status: PENDING} → publishes a notification event → atomically moves the user from the main sorted set to a holding set pending claim confirmation. If the offer expires (TTL job checks expires_at), the user is removed from holding, the slot goes back to available, and the Promotion Engine is triggered again.
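The lock-then-offer sequence can be sketched as follows, with in-memory dicts standing in for the Redis SETNX lock and the Postgres offers table (all names are illustrative, not the real service's API):

```python
import time
import uuid
from collections import deque

class PromotionEngine:
    """Sketch of the promote-on-slot-open flow with a TTL lock per slot."""

    def __init__(self, queue, claim_window_s=900):
        self.queue = queue       # deque of waiting user_ids, front first
        self.holding = []        # users moved out pending claim confirmation
        self.locks = {}          # lock_key -> expiry epoch (SETNX + TTL stand-in)
        self.offers = {}         # offer_id -> offer record (offers table stand-in)
        self.claim_window_s = claim_window_s

    def _try_lock(self, key, ttl_s=30, now=None):
        now = time.time() if now is None else now
        if self.locks.get(key, 0) > now:
            return False         # another engine instance holds the lock
        self.locks[key] = now + ttl_s
        return True

    def on_slot_opened(self, resource_id, slot_id, now=None):
        now = time.time() if now is None else now
        if not self._try_lock(f"promote_lock:{resource_id}:{slot_id}", now=now):
            return None          # someone else is already promoting this slot
        if not self.queue:
            return None
        user_id = self.queue.popleft()   # front of the queue
        self.holding.append(user_id)     # held pending claim
        offer = {"offer_id": str(uuid.uuid4()), "resource_id": resource_id,
                 "slot_id": slot_id, "user_id": user_id, "offered_at": now,
                 "expires_at": now + self.claim_window_s, "status": "PENDING"}
        self.offers[offer["offer_id"]] = offer
        return offer
```

A duplicate slot-opened event for the same slot within the lock TTL returns None instead of producing a second offer, which is the exactly-once property the lock exists to provide.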
Notification & Status Service
Sends offer notifications via push (APNs/FCM), SMS, and email simultaneously; the first channel to deliver is sufficient, but all are sent for redundancy. For position updates, a WebSocket gateway maintains connections for front-of-queue users. A Redis Pub/Sub channel, position_updates:{resource_id}, carries position delta events whenever someone ahead in line claims or abandons; the Status Service subscribes and pushes updates to connected users. Estimated wait time is calculated as (user_position - 1) × average_claim_time_minutes, where the rolling average is computed over the last 100 promotions.
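The wait-time estimate above is a rolling average, which can be sketched with a bounded deque (the window size of 100 comes from the text; the class and method names are illustrative):

```python
from collections import deque

class WaitEstimator:
    """Rolling-average wait estimate over the last N promotions."""

    def __init__(self, window=100):
        self.claim_times = deque(maxlen=window)  # minutes per promotion cycle

    def record_promotion(self, claim_minutes):
        self.claim_times.append(claim_minutes)   # oldest entry drops off at maxlen

    def estimate(self, position):
        if not self.claim_times:
            return None                          # no promotions observed yet
        avg = sum(self.claim_times) / len(self.claim_times)
        return (position - 1) * avg              # formula from the text
```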
Database Design
Waitlists: waitlists (waitlist_id UUID, resource_id, resource_type ENUM, status ENUM(open, paused, closed), created_at). Members: waitlist_members (member_id, waitlist_id, user_id, position, joined_at, status ENUM(waiting, offered, claimed, abandoned, removed)). Offers: offers (offer_id, waitlist_id, slot_id, user_id, offered_at, expires_at, claimed_at, status ENUM(pending, claimed, expired)). The offers table has a partial unique index on slot_id WHERE status = 'pending', preventing double-offers at the database level as a backstop behind the Redis lock.
For analytics, a waitlist_events append-only table records every state transition with timestamp. This powers dashboards showing conversion rates (join-to-claim), abandonment rates by position bucket, and average wait times. These metrics feed the estimated wait time algorithm and help operators tune promotion timing.
API Design
POST /api/v1/waitlists/{waitlistId}/join — idempotent join; returns {position, estimated_wait_minutes, member_id}.
GET /api/v1/waitlists/{waitlistId}/position — current position and status for authenticated user.
POST /api/v1/offers/{offerId}/claim — user claims a promoted offer; must be within expires_at; idempotent on offer_id.
DELETE /api/v1/waitlists/{waitlistId}/leave — user voluntarily leaves the waitlist.
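The claim endpoint's semantics (idempotent on offer_id, rejected after expires_at) can be sketched as a single handler function; the dict of offers stands in for the offers table, and the status strings mirror the schema's ENUM:

```python
import time

def claim_offer(offers, offer_id, now=None):
    """Sketch of POST /api/v1/offers/{offerId}/claim semantics."""
    now = time.time() if now is None else now
    offer = offers.get(offer_id)
    if offer is None:
        return ("not_found", None)
    if offer["status"] == "claimed":
        return ("ok", offer)          # idempotent replay: same successful result
    if offer["status"] != "pending" or now > offer["expires_at"]:
        return ("expired", offer)     # past the claim window or already expired
    offer["status"] = "claimed"
    offer["claimed_at"] = now
    return ("ok", offer)
```

Returning the same success response on replay means a client that times out and retries never sees a spurious failure for a claim that actually went through.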
Scaling & Bottlenecks
The join burst at waitlist open is the primary bottleneck. Redis ZADD operations handle 100k+/second on a single node; a Redis Cluster with 8 shards supports 800k+ ops/second. However, sharding a sorted set across Redis Cluster nodes breaks ZRANGE operations that need global ordering. The solution is to use a single Redis Sorted Set per resource (which keeps one waitlist on one shard — acceptable for most use cases) or to use a global sequence number service (Redis INCR on a dedicated counter node) and distribute the full sorted set to multiple read replicas for position queries.
For extremely high-volume waitlists (500k+ users), position accuracy degrades when many users join simultaneously. A batch join mechanism absorbs bursts into a Redis List (LPUSH, O(1)); a background worker drains from the opposite end (BRPOP), preserving arrival order, and assigns positions in the sorted set as it goes. This smooths the join spike into a controllable stream. Users receive a "you're in the queue, position TBD" response immediately and get their position notification within a few seconds.
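A minimal sketch of that buffer-then-drain pattern, with a deque standing in for the Redis List (append models LPUSH at one end, popleft models BRPOP at the other; names are illustrative):

```python
from collections import deque

class JoinBuffer:
    """Burst absorber: O(1) enqueue, worker drains in arrival order."""

    def __init__(self):
        self.pending = deque()   # arrival-ordered buffer (Redis List stand-in)
        self.positions = {}      # user_id -> assigned position (sorted set stand-in)
        self.next_position = 1

    def enqueue(self, user_id):
        self.pending.append(user_id)   # caller gets "you're in, position TBD"
        return "queued"

    def drain(self, batch_size=1000):
        """Background worker: assign positions for up to batch_size entries."""
        processed = 0
        while self.pending and processed < batch_size:
            user_id = self.pending.popleft()
            if user_id not in self.positions:   # idempotent across retries
                self.positions[user_id] = self.next_position
                self.next_position += 1
            processed += 1
        return processed
```

The batch_size cap is what turns an 8,000+/second spike into a stream the position-assignment path can absorb at its own pace.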
Key Trade-offs
- Single sorted set vs. distributed queue: A single Redis sorted set maintains perfect global ordering but limits sharding; distributed queues scale better but require a global ordering mechanism to prevent fairness violations.
- Push vs. poll for position updates: WebSocket push is bandwidth-efficient and low-latency for front-of-queue users but requires stateful connection management; polling scales to arbitrary user counts but creates unnecessary load when positions haven't changed.
- 15-minute claim window vs. instant claim: Longer claim windows give users time to respond but strand inventory for longer on each abandonment; shorter windows speed up the queue but frustrate users who miss notifications.
- Exact position vs. bucket position: Showing an exact position ("You are #4,521") can be gamed and creates high-frequency polling; bucket positions ("Top 5% of queue") are more private, reduce polling motivation, and are accurate enough for user planning.