
System Design: Waitlist System (High-Concurrency)

Design a high-concurrency waitlist system for events, products, or services where demand exceeds supply. Covers fair queue ordering, atomic position management, promotion workflows, and anti-bot measures with idempotent processing.

12 min read · Updated Jan 15, 2025
Tags: system-design, ticketing, waitlist, queue, high-concurrency, idempotency, redis

Requirements

Functional Requirements:

  • Users join a waitlist for a resource (event ticket, product drop, appointment slot)
  • Assign a fair queue position — first-come-first-served with anti-gaming measures
  • When a slot opens, automatically promote the front of the queue with a time-limited offer (15 minutes to claim)
  • If the promoted user doesn't claim within the time limit, promote the next in line
  • Show users their current position and estimated wait time
  • Support multiple waitlist types: event-specific, product SKU-specific, and appointment-type

Non-Functional Requirements:

  • Handle 500k simultaneous join attempts at waitlist open
  • Queue position assignment must be atomic — no two users can receive the same position
  • Position promotion must be exactly-once — a slot cannot be offered to two users simultaneously
  • Estimated wait time accuracy within ±10 positions
  • 99.9% uptime; users waiting expect reliable status updates

Scale Estimation

For a major product drop (limited sneaker release): 500k users attempting to join in the first 60 seconds = ~8,333 joins/second. Available inventory: 10,000 units. Promotion volume depends on abandonment: if 30% of 15-minute offers expire unclaimed, each unit needs ~1/0.7 ≈ 1.43 offers on average, or roughly 14,300 promotion events in total, spread over several hours as expired offers cascade to the next user in line. Position polling: 500k users × 1 poll/minute = ~8,333 reads/second sustained. A WebSocket alternative reduces this to a one-time subscribe plus server push, at the cost of ~500k active connections.
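The back-of-envelope arithmetic above can be checked directly. The constants are the article's assumptions, not measurements:

```python
# Scale-estimation arithmetic for the product-drop scenario.
JOINERS = 500_000        # users joining in the first minute
WINDOW_S = 60            # length of the join burst, seconds
INVENTORY = 10_000       # units available
ABANDON_RATE = 0.30      # fraction of offers that expire unclaimed

joins_per_sec = JOINERS / WINDOW_S                 # join burst rate
# Each unit needs 1 / (1 - abandon_rate) offers on average before
# someone claims it, so total promotion events exceed inventory.
expected_offers = INVENTORY / (1 - ABANDON_RATE)
polls_per_sec = JOINERS / 60                       # 1 poll/min per user
```

The key takeaway is that abandonment inflates promotion volume by ~43%, which matters when sizing the Promotion Engine and notification pipeline.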

High-Level Architecture

The system is composed of a Queue Service, a Promotion Engine, and a Status Service. The Queue Service handles join operations atomically using Redis. Queue position assignment uses Redis ZADD with a score that combines a millisecond timestamp (for fair ordering) with a small random jitter, so that requests arriving within the same millisecond are tie-broken randomly rather than by network proximity to the server. A Lua script makes the join atomic: check whether the user is already in the queue; if not, ZADD with the computed score; return the assigned position.

The Promotion Engine monitors inventory availability events from Kafka (cancellations, new releases, appointment slot openings). On receiving a slot-opened event, it uses a distributed lock (Redis SETNX) to claim promotion rights, reads the front of the queue (ZRANGE WITHSCORES limit 1), creates a time-limited offer record, and sends the user a notification. The lock prevents two Promotion Engine instances from promoting the same slot simultaneously — exactly-once delivery is critical.

The Status Service answers position queries. For users in positions 1-100 (close to the front), it provides real-time position updates via WebSocket. For users in positions 100+, it provides polling-based status with cached position data refreshed every 30 seconds. This tiered approach avoids WebSocket connection overhead for users who won't be promoted soon while giving front-of-queue users responsive updates.

Core Components

Queue Service (Redis-backed)

The core data structure is a Redis Sorted Set: waitlist:{resource_id} → members are user_id; the score is timestamp_ms × 1000 + jitter, where jitter is a random integer in 0–999, so ties within a millisecond are broken without spilling into the next millisecond. Join is a Lua script: if ZSCORE(key, userId) ~= nil then return existing position end; score = currentTimeMs() * 1000 + math.random(0, 999); ZADD(key, score, userId); return ZRANK(key, userId) + 1. ZRANK is 0-indexed, so positions are 1-indexed. The sorted set is backed by a PostgreSQL waitlist_members table as the durable store; Redis is rebuilt from Postgres on cold start, and a batch sync from Redis to Postgres runs every 30 seconds.
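The join script's logic can be sketched in-memory. This is a simulation, not the production path: `sorted_set` stands in for the Redis ZSET, and the whole function would run as a single Lua script to stay atomic:

```python
import random

# user_id -> score; stand-in for the Redis Sorted Set waitlist:{resource_id}
sorted_set = {}

def join(user_id: str, now_ms: int) -> int:
    """Idempotent join: returns the user's 1-indexed position."""
    if user_id not in sorted_set:                      # ZSCORE check
        jitter = random.randrange(1000)                # 0..999 tie-breaker
        sorted_set[user_id] = now_ms * 1000 + jitter   # ZADD
    # ZRANK equivalent: count members with a strictly smaller score.
    rank = sum(1 for s in sorted_set.values() if s < sorted_set[user_id])
    return rank + 1
```

Re-joining returns the existing position rather than re-scoring the user, which is what makes retries safe during the join burst.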

Promotion Engine

Consumes slot-availability events from Kafka. Each event includes resource_id and slot_id. The engine: acquires a Redis lock promote_lock:{resource_id}:{slot_id} with 30-second TTL → reads the front user from the sorted set → creates an offer record in PostgreSQL: {offer_id, resource_id, slot_id, user_id, offered_at, expires_at, status: PENDING} → publishes a notification event → atomically moves the user from the main sorted set to a holding set pending claim confirmation. If the offer expires (TTL job checks expires_at), the user is removed from holding, the slot goes back to available, and the Promotion Engine is triggered again.
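The lock-then-promote sequence can be sketched as follows. The dict-based lock is a stand-in for Redis SET NX with a TTL; the queue and offer store are simplified to Python lists, and all names here are illustrative:

```python
import time

locks = {}  # lock_key -> expiry time; stand-in for Redis SETNX + TTL

def acquire_lock(key: str, ttl_s: float = 30.0) -> bool:
    """SETNX-style: succeeds only if no unexpired lock exists for key."""
    now = time.monotonic()
    if key in locks and locks[key] > now:
        return False
    locks[key] = now + ttl_s
    return True

def promote(resource_id: str, slot_id: str, queue: list, offers: list) -> bool:
    """Offer the slot to the front of the queue; at most one engine wins."""
    if not acquire_lock(f"promote_lock:{resource_id}:{slot_id}"):
        return False               # another Promotion Engine instance won
    if not queue:
        return False               # nobody waiting
    user = queue.pop(0)            # front of the sorted set -> holding set
    offers.append({"slot_id": slot_id, "user_id": user, "status": "PENDING"})
    return True
```

A second engine instance racing on the same slot fails the lock acquisition and backs off, which is what gives the "cannot be offered to two users" guarantee at the application layer.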

Notification & Status Service

Sends offer notifications via push (APNs/FCM), SMS, and email simultaneously — the first channel to deliver is sufficient, but all are sent for redundancy. For position updates, a WebSocket gateway maintains connections for front-of-queue users. Redis Pub/Sub channels position_updates:{resource_id} receive position delta events whenever someone ahead in line claims or abandons. The Status Service subscribes and pushes updates to connected users. Estimated wait time is calculated as: (user_position - 1) × average_claim_time_minutes, where the rolling average is computed over the last 100 promotions.
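The wait-time estimate described above is a rolling average over the last 100 promotions; a minimal sketch (function names are illustrative):

```python
from collections import deque

# Minutes each of the last 100 promotions took to resolve (claim or expire).
claim_times = deque(maxlen=100)

def record_promotion(minutes: float) -> None:
    """Feed one completed promotion into the rolling window."""
    claim_times.append(minutes)

def estimated_wait_minutes(position: int) -> float:
    """(position - 1) x rolling average claim time, per the formula above."""
    if not claim_times:
        return 0.0
    avg = sum(claim_times) / len(claim_times)
    return (position - 1) * avg
```

Because the window is capped at 100 entries, the estimate adapts as abandonment patterns change mid-event rather than averaging over the whole history.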

Database Design

Waitlists: waitlists (waitlist_id UUID, resource_id, resource_type ENUM, status ENUM(open, paused, closed), created_at). Members: waitlist_members (member_id, waitlist_id, user_id, position, joined_at, status ENUM(waiting, offered, claimed, abandoned, removed)). Offers: offers (offer_id, waitlist_id, slot_id, user_id, offered_at, expires_at, claimed_at, status ENUM(pending, claimed, expired)). The offers table enforces at most one pending offer per slot with a partial unique index on slot_id where status = 'pending', a database-level backstop behind the Redis lock.

For analytics, a waitlist_events append-only table records every state transition with timestamp. This powers dashboards showing conversion rates (join-to-claim), abandonment rates by position bucket, and average wait times. These metrics feed the estimated wait time algorithm and help operators tune promotion timing.

API Design

POST /api/v1/waitlists/{waitlistId}/join — idempotent join; returns {position, estimated_wait_minutes, member_id}.

GET /api/v1/waitlists/{waitlistId}/position — current position and status for authenticated user.

POST /api/v1/offers/{offerId}/claim — user claims a promoted offer; must be within expires_at; idempotent on offer_id.

DELETE /api/v1/waitlists/{waitlistId}/leave — user voluntarily leaves the waitlist.
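The claim endpoint's idempotency can be sketched as a handler over an in-memory offer store (a stand-in for the offers table; field names follow the schema above):

```python
# offer_id -> {"expires_at": epoch_seconds, "status": ...}
offer_store = {}

def claim(offer_id: str, now: float) -> dict:
    """Idempotent claim: replaying a successful claim returns the same result."""
    offer = offer_store[offer_id]
    if offer["status"] == "claimed":
        return {"ok": True, "status": "claimed"}   # idempotent replay
    if now > offer["expires_at"]:
        offer["status"] = "expired"                # past the claim window
        return {"ok": False, "status": "expired"}
    offer["status"] = "claimed"
    return {"ok": True, "status": "claimed"}
```

Keying idempotency on offer_id (rather than user_id) means a retried claim after a network timeout cannot double-consume inventory.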

Scaling & Bottlenecks

The join burst at waitlist open is the primary bottleneck. Redis ZADD operations handle 100k+/second on a single node; a Redis Cluster with 8 shards supports 800k+ ops/second. However, sharding a sorted set across Redis Cluster nodes breaks ZRANGE operations that need global ordering. The solution is to use a single Redis Sorted Set per resource (which keeps one waitlist on one shard — acceptable for most use cases) or to use a global sequence number service (Redis INCR on a dedicated counter node) and distribute the full sorted set to multiple read replicas for position queries.

For extremely high-volume waitlists (500k+ users), position accuracy degrades when many users join simultaneously. A batch join mechanism accepts bursts into a Redis List (LPUSH, O(1)), and a background worker processes the list in order, assigning positions from the sorted set. This smooths the join spike into a controllable stream. Users receive a "you're in the queue, position TBD" response immediately and get their position notification within a few seconds.
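The batch-join smoothing above can be sketched with a deque standing in for the Redis List and a counter standing in for the position assigner (names are illustrative):

```python
from collections import deque

intake = deque()       # Redis List stand-in: appended at join time, O(1)
positions = {}         # user_id -> assigned 1-indexed position
_next_position = 1

def burst_join(user_id: str) -> str:
    """Accept the join instantly; the position is assigned later."""
    intake.append(user_id)
    return "queued, position TBD"

def drain(batch_size: int = 1000) -> None:
    """Background worker: assign positions in arrival order."""
    global _next_position
    for _ in range(min(batch_size, len(intake))):
        user = intake.popleft()
        if user not in positions:       # idempotent across worker retries
            positions[user] = _next_position
            _next_position += 1
```

The user-visible contract changes slightly: the join response acknowledges entry, and the concrete position arrives moments later via notification.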

Key Trade-offs

  • Single sorted set vs. distributed queue: A single Redis sorted set maintains perfect global ordering but limits sharding; distributed queues scale better but require a global ordering mechanism to prevent fairness violations.
  • Push vs. poll for position updates: WebSocket push is bandwidth-efficient and low-latency for front-of-queue users but requires stateful connection management; polling scales to arbitrary user counts but creates unnecessary load when positions haven't changed.
  • 15-minute claim window vs. instant claim: Longer claim windows give users time to respond but strand inventory for longer on each abandonment; shorter windows speed up the queue but frustrate users who miss notifications.
  • Exact position vs. bucket position: Showing an exact position ("You are #4,521") can be gamed and creates high-frequency polling; bucket positions ("Top 5% of queue") are more private, reduce polling motivation, and are accurate enough for user planning.
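The bucket-position trade-off can be made concrete with a small mapping from exact position to a coarse label (cutoffs here are arbitrary examples, not a recommendation):

```python
def bucket_label(position: int, queue_length: int) -> str:
    """Coarse, less gameable alternative to showing an exact position."""
    pct = position / queue_length * 100
    for cutoff, label in ((5, "Top 5%"), (25, "Top 25%"), (50, "Top 50%")):
        if pct <= cutoff:
            return label
    return "Bottom half"
```

Because the label only changes at bucket boundaries, clients have far less reason to poll aggressively.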
