
System Design: Waitlist System (High-Concurrency)

Design a high-concurrency waitlist system for events, products, or services where demand exceeds supply. Covers fair queue ordering, atomic position management, promotion workflows, and anti-bot measures with idempotent processing.

12 min read · Updated Jan 15, 2025
Tags: system-design, ticketing, waitlist, queue, high-concurrency, idempotency, redis

Requirements

Functional Requirements:

  • Users join a waitlist for a resource (event ticket, product drop, appointment slot)
  • Assign a fair queue position — first-come-first-served with anti-gaming measures
  • When a slot opens, automatically promote the front of the queue with a time-limited offer (15 minutes to claim)
  • If the promoted user doesn't claim within the time limit, promote the next in line
  • Show users their current position and estimated wait time
  • Support multiple waitlist types: event-specific, product SKU-specific, and appointment-type

Non-Functional Requirements:

  • Handle 500k simultaneous join attempts at waitlist open
  • Queue position assignment must be atomic — no two users can receive the same position
  • Position promotion must be exactly-once — a slot cannot be offered to two users simultaneously
  • Estimated wait time accuracy within ±10 positions
  • 99.9% uptime; users waiting expect reliable status updates

Scale Estimation

For a major product drop (limited sneaker release): 500k users attempting to join in the first 60 seconds = ~8,333 joins/second. Available inventory: 10,000 units. Promotion volume depends on abandonment: if 30% of 15-minute offers expire unclaimed, each unit needs ~1/0.7 ≈ 1.43 offers on average, or roughly 14,300 promotion events in total, spread over several hours as expired offers cascade to the next user in line. Position polling: 500k users × 1 poll/minute = ~8,333 reads/second sustained. A WebSocket alternative reduces this to a one-time subscribe plus server push, at the cost of ~500k active connections.
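The back-of-envelope arithmetic above can be checked directly. The constants are the article's assumptions, not measurements:

```python
# Scale-estimation arithmetic for the product-drop scenario.
JOINERS = 500_000        # users joining in the first minute
WINDOW_S = 60            # length of the join burst, seconds
INVENTORY = 10_000       # units available
ABANDON_RATE = 0.30      # fraction of offers that expire unclaimed

joins_per_sec = JOINERS / WINDOW_S                 # join burst rate
# Each unit needs 1 / (1 - abandon_rate) offers on average before
# someone claims it, so total promotion events exceed inventory.
expected_offers = INVENTORY / (1 - ABANDON_RATE)
polls_per_sec = JOINERS / 60                       # 1 poll/min per user
```

The key takeaway is that abandonment inflates promotion volume by ~43%, which matters when sizing the Promotion Engine and notification pipeline.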

High-Level Architecture

The system is composed of a Queue Service, a Promotion Engine, and a Status Service. The Queue Service handles join operations atomically using Redis. Queue position assignment uses Redis ZADD with a score that combines a millisecond timestamp (for fair ordering) with a small random jitter, so that requests arriving within the same millisecond are tie-broken randomly rather than by network proximity to the server. A Lua script makes the join atomic: check whether the user is already in the queue; if not, ZADD with the computed score; return the assigned position.

The Promotion Engine monitors inventory availability events from Kafka (cancellations, new releases, appointment slot openings). On receiving a slot-opened event, it uses a distributed lock (Redis SETNX) to claim promotion rights, reads the front of the queue (ZRANGE WITHSCORES limit 1), creates a time-limited offer record, and sends the user a notification. The lock prevents two Promotion Engine instances from promoting the same slot simultaneously — exactly-once delivery is critical.

The Status Service answers position queries. For users in positions 1-100 (close to the front), it provides real-time position updates via WebSocket. For users in positions 100+, it provides polling-based status with cached position data refreshed every 30 seconds. This tiered approach avoids WebSocket connection overhead for users who won't be promoted soon while giving front-of-queue users responsive updates.

Core Components

Queue Service (Redis-backed)

The core data structure is a Redis Sorted Set: waitlist:{resource_id} → members are user_id; the score is timestamp_ms × 1000 + jitter, where jitter is a random integer in 0–999, so ties within a millisecond are broken without spilling into the next millisecond. Join is a Lua script: if ZSCORE(key, userId) ~= nil then return existing position end; score = currentTimeMs() * 1000 + math.random(0, 999); ZADD(key, score, userId); return ZRANK(key, userId) + 1. ZRANK is 0-indexed, so positions are 1-indexed. The sorted set is backed by a PostgreSQL waitlist_members table as the durable store; Redis is rebuilt from Postgres on cold start, and a batch sync from Redis to Postgres runs every 30 seconds.
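The join script's logic can be sketched in-memory. This is a simulation, not the production path: `sorted_set` stands in for the Redis ZSET, and the whole function would run as a single Lua script to stay atomic:

```python
import random

# user_id -> score; stand-in for the Redis Sorted Set waitlist:{resource_id}
sorted_set = {}

def join(user_id: str, now_ms: int) -> int:
    """Idempotent join: returns the user's 1-indexed position."""
    if user_id not in sorted_set:                      # ZSCORE check
        jitter = random.randrange(1000)                # 0..999 tie-breaker
        sorted_set[user_id] = now_ms * 1000 + jitter   # ZADD
    # ZRANK equivalent: count members with a strictly smaller score.
    rank = sum(1 for s in sorted_set.values() if s < sorted_set[user_id])
    return rank + 1
```

Re-joining returns the existing position rather than re-scoring the user, which is what makes retries safe during the join burst.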

Promotion Engine

Consumes slot-availability events from Kafka. Each event includes resource_id and slot_id. The engine: acquires a Redis lock promote_lock:{resource_id}:{slot_id} with 30-second TTL → reads the front user from the sorted set → creates an offer record in PostgreSQL: {offer_id, resource_id, slot_id, user_id, offered_at, expires_at, status: PENDING} → publishes a notification event → atomically moves the user from the main sorted set to a holding set pending claim confirmation. If the offer expires (TTL job checks expires_at), the user is removed from holding, the slot goes back to available, and the Promotion Engine is triggered again.
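The lock-then-promote sequence can be sketched as follows. The dict-based lock is a stand-in for Redis SET NX with a TTL; the queue and offer store are simplified to Python lists, and all names here are illustrative:

```python
import time

locks = {}  # lock_key -> expiry time; stand-in for Redis SETNX + TTL

def acquire_lock(key: str, ttl_s: float = 30.0) -> bool:
    """SETNX-style: succeeds only if no unexpired lock exists for key."""
    now = time.monotonic()
    if key in locks and locks[key] > now:
        return False
    locks[key] = now + ttl_s
    return True

def promote(resource_id: str, slot_id: str, queue: list, offers: list) -> bool:
    """Offer the slot to the front of the queue; at most one engine wins."""
    if not acquire_lock(f"promote_lock:{resource_id}:{slot_id}"):
        return False               # another Promotion Engine instance won
    if not queue:
        return False               # nobody waiting
    user = queue.pop(0)            # front of the sorted set -> holding set
    offers.append({"slot_id": slot_id, "user_id": user, "status": "PENDING"})
    return True
```

A second engine instance racing on the same slot fails the lock acquisition and backs off, which is what gives the "cannot be offered to two users" guarantee at the application layer.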

Notification & Status Service

Sends offer notifications via push (APNs/FCM), SMS, and email simultaneously — the first channel to deliver is sufficient, but all are sent for redundancy. For position updates, a WebSocket gateway maintains connections for front-of-queue users. Redis Pub/Sub channels position_updates:{resource_id} receive position delta events whenever someone ahead in line claims or abandons. The Status Service subscribes and pushes updates to connected users. Estimated wait time is calculated as: (user_position - 1) × average_claim_time_minutes, where the rolling average is computed over the last 100 promotions.
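The wait-time estimate described above is a rolling average over the last 100 promotions; a minimal sketch (function names are illustrative):

```python
from collections import deque

# Minutes each of the last 100 promotions took to resolve (claim or expire).
claim_times = deque(maxlen=100)

def record_promotion(minutes: float) -> None:
    """Feed one completed promotion into the rolling window."""
    claim_times.append(minutes)

def estimated_wait_minutes(position: int) -> float:
    """(position - 1) x rolling average claim time, per the formula above."""
    if not claim_times:
        return 0.0
    avg = sum(claim_times) / len(claim_times)
    return (position - 1) * avg
```

Because the window is capped at 100 entries, the estimate adapts as abandonment patterns change mid-event rather than averaging over the whole history.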

Database Design

Waitlists: waitlists (waitlist_id UUID, resource_id, resource_type ENUM, status ENUM(open, paused, closed), created_at). Members: waitlist_members (member_id, waitlist_id, user_id, position, joined_at, status ENUM(waiting, offered, claimed, abandoned, removed)). Offers: offers (offer_id, waitlist_id, slot_id, user_id, offered_at, expires_at, claimed_at, status ENUM(pending, claimed, expired)). The offers table enforces at most one pending offer per slot with a partial unique index on slot_id where status = 'pending', a database-level backstop behind the Redis lock.

For analytics, a waitlist_events append-only table records every state transition with timestamp. This powers dashboards showing conversion rates (join-to-claim), abandonment rates by position bucket, and average wait times. These metrics feed the estimated wait time algorithm and help operators tune promotion timing.

API Design

POST /api/v1/waitlists/{waitlistId}/join — idempotent join; returns {position, estimated_wait_minutes, member_id}.

GET /api/v1/waitlists/{waitlistId}/position — current position and status for authenticated user.

POST /api/v1/offers/{offerId}/claim — user claims a promoted offer; must be within expires_at; idempotent on offer_id.

DELETE /api/v1/waitlists/{waitlistId}/leave — user voluntarily leaves the waitlist.
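The claim endpoint's idempotency can be sketched as a handler over an in-memory offer store (a stand-in for the offers table; field names follow the schema above):

```python
# offer_id -> {"expires_at": epoch_seconds, "status": ...}
offer_store = {}

def claim(offer_id: str, now: float) -> dict:
    """Idempotent claim: replaying a successful claim returns the same result."""
    offer = offer_store[offer_id]
    if offer["status"] == "claimed":
        return {"ok": True, "status": "claimed"}   # idempotent replay
    if now > offer["expires_at"]:
        offer["status"] = "expired"                # past the claim window
        return {"ok": False, "status": "expired"}
    offer["status"] = "claimed"
    return {"ok": True, "status": "claimed"}
```

Keying idempotency on offer_id (rather than user_id) means a retried claim after a network timeout cannot double-consume inventory.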

Scaling & Bottlenecks

The join burst at waitlist open is the primary bottleneck. Redis ZADD operations handle 100k+/second on a single node; a Redis Cluster with 8 shards supports 800k+ ops/second. However, sharding a sorted set across Redis Cluster nodes breaks ZRANGE operations that need global ordering. The solution is to use a single Redis Sorted Set per resource (which keeps one waitlist on one shard — acceptable for most use cases) or to use a global sequence number service (Redis INCR on a dedicated counter node) and distribute the full sorted set to multiple read replicas for position queries.

For extremely high-volume waitlists (500k+ users), position accuracy degrades when many users join simultaneously. A batch join mechanism accepts bursts into a Redis List (LPUSH, O(1)), and a background worker processes the list in order, assigning positions from the sorted set. This smooths the join spike into a controllable stream. Users receive a "you're in the queue, position TBD" response immediately and get their position notification within a few seconds.
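The batch-join smoothing above can be sketched with a deque standing in for the Redis List and a counter standing in for the position assigner (names are illustrative):

```python
from collections import deque

intake = deque()       # Redis List stand-in: appended at join time, O(1)
positions = {}         # user_id -> assigned 1-indexed position
_next_position = 1

def burst_join(user_id: str) -> str:
    """Accept the join instantly; the position is assigned later."""
    intake.append(user_id)
    return "queued, position TBD"

def drain(batch_size: int = 1000) -> None:
    """Background worker: assign positions in arrival order."""
    global _next_position
    for _ in range(min(batch_size, len(intake))):
        user = intake.popleft()
        if user not in positions:       # idempotent across worker retries
            positions[user] = _next_position
            _next_position += 1
```

The user-visible contract changes slightly: the join response acknowledges entry, and the concrete position arrives moments later via notification.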

Key Trade-offs

  • Single sorted set vs. distributed queue: A single Redis sorted set maintains perfect global ordering but limits sharding; distributed queues scale better but require a global ordering mechanism to prevent fairness violations.
  • Push vs. poll for position updates: WebSocket push is bandwidth-efficient and low-latency for front-of-queue users but requires stateful connection management; polling scales to arbitrary user counts but creates unnecessary load when positions haven't changed.
  • 15-minute claim window vs. instant claim: Longer claim windows give users time to respond but strand inventory for longer on each abandonment; shorter windows speed up the queue but frustrate users who miss notifications.
  • Exact position vs. bucket position: Showing an exact position ("You are #4,521") can be gamed and creates high-frequency polling; bucket positions ("Top 5% of queue") are more private, reduce polling motivation, and are accurate enough for user planning.
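The bucket-position trade-off can be made concrete with a small mapping from exact position to a coarse label (cutoffs here are arbitrary examples, not a recommendation):

```python
def bucket_label(position: int, queue_length: int) -> str:
    """Coarse, less gameable alternative to showing an exact position."""
    pct = position / queue_length * 100
    for cutoff, label in ((5, "Top 5%"), (25, "Top 25%"), (50, "Top 50%")):
        if pct <= cutoff:
            return label
    return "Bottom half"
```

Because the label only changes at bucket boundaries, clients have far less reason to poll aggressively.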
