System Design: Multiplayer Matchmaking System

Requirements

Functional Requirements:

Match players into game sessions based on skill rating (MMR/Elo), latency (ping to server region), game mode, and player preferences
Support group queue: parties of 2-5 players queue together and match as a unit
Dynamic skill expansion: widen acceptable skill range as wait time increases (quality-time trade-off)
Backfill: replace players who disconnect before match start with suitable players from queue
Leaver penalty: track and penalize players who frequently abandon queued matches
Ranked and unranked modes with different matching algorithms

Non-Functional Requirements:

Median match wait time under 45 seconds during peak hours
Match skill imbalance (average team MMR difference) under 100 points for 90% of matches
System must handle 500,000 concurrent players in queue
Matchmaking decisions must be made and communicated within 2 seconds of a viable match forming
Regional servers: separate matching pools per region (NA, EU, Asia) with cross-region fallback

Scale Estimation

500k concurrent players in queue globally across 5 regions = 100k per region on average. For a 5v5 game, we need to form groups of 10 balanced players. If median wait time is 45 seconds and we have 100k players in queue, throughput is 100k / 45 seconds × 10 players per match = ~2,200 matches formed/second per region. The matchmaker must evaluate match quality for candidate groups — a naive O(N²) comparison of all player pairs is infeasible at 100k players. The matchmaker uses a sorted structure (by MMR) to limit comparison to players within the current skill window.

High-Level Architecture

The matchmaking system is a set of regional matching services, each managing its own queue pool. Players enter a queue via the queue API (with their MMR, latency to regional servers, and preferences). The queue API adds the player to a Redis Sorted Set (score = MMR) within their region's queue and publishes a queue-entry event to the regional matching engine.

The matching engine is a stateful loop running every 500ms. Each iteration: (1) read all players from the Redis queue sorted by MMR, (2) apply a sliding window algorithm to find groups of 10 players within the current skill tolerance (starting at ±50 MMR, expanding by 10 MMR every 10 seconds of wait), (3) for each candidate group, verify latency compatibility (all players must have sub-100ms ping to the assigned server region), (4) lock the matched players (atomic Redis multi-key lock), (5) allocate a game server session (via the game server allocation API), (6) notify all matched players of their session details via WebSocket push. If a player doesn't acknowledge within 10 seconds, the match is cancelled and non-accepting players are penalized.

Group queuing (parties) is handled by treating the party as a single queue entity with the average MMR of the party members and the highest individual latency. The matchmaker fills the opposing team with 5 individual players or a matching party, applying an MMR adjustment to account for party coordination bonuses.

Core Components

Queue Manager

The queue manager maintains per-region Redis Sorted Sets for each game mode and rank tier. Players are inserted with a composite key encoding their MMR and queue entry timestamp. The entry_time is stored as a Redis hash field alongside the player's full queue entry (MMR, latency map, preferences, party_id if applicable). The queue manager handles: queue entry validation (not already queued, no active ban), duplicate detection (reconnecting client re-queuing), and queue removal (player cancels, disconnects, or is matched). Queue state is authoritative in Redis with a 10-minute TTL per entry to auto-remove stale connections.

Match Quality Evaluator

The match quality evaluator runs within the matching loop to score candidate matches before committing. Quality score components: (1) MMR balance score — 1 - (|team_A_avg_mmr - team_B_avg_mmr| / max_tolerance), (2) intra-team MMR variance score — lower variance = more consistent team, (3) latency score — average server latency across all players, (4) wait time fairness — players who have waited longest get priority. The composite quality score (weighted sum) must exceed a threshold before a match is formed. This prevents the system from forming technically valid but low-quality matches when slightly better matches are 10 seconds away.

Skill Rating (MMR) Service

Post-match, the MMR service computes Elo-style rating updates based on match outcome and pre-match win probability. Win probability is calculated as: P(A wins) = 1 / (1 + 10^((MMR_B - MMR_A) / 400)). The K-factor (how much MMR changes per match) scales with match count — new players have K=32 for faster calibration, stabilizing at K=16 after 30 matches. The service updates MMR in PostgreSQL (authoritative) and pushes the new value to Redis (for the queue). It also maintains separate ranked and casual MMR tracks — ranked MMR is the matchmaking basis; casual MMR converges toward ranked but updates more slowly.

Database Design

Redis: queue:{region}:{game_mode}:{tier} (Sorted Set: player_id → MMR), queue:player:{player_id} (hash: MMR, latency_map_json, party_id, queued_at, expanded_range), match_lock:{match_id} (string with TTL: locks players during match acceptance phase). PostgreSQL: player_mmr (player_id, game_mode, mmr, uncertainty, match_count, updated_at), matches (match_id, region, game_mode, started_at, team_a_players[], team_b_players[], team_a_mmr_avg, team_b_mmr_avg, outcome), queue_metrics (hourly aggregates: avg_wait_time, avg_mmr_diff, match_count per region/mode), leaver_penalty (player_id, penalty_count, last_penalty_at, ban_until). ClickHouse: match_quality_events (match_id, quality_score components) for matchmaker tuning analytics.

API Design

POST /queue/join — body: {player_id, game_mode, region_preferences: ["NA","EU"]}, validates, adds to queue, returns {queue_id, estimated_wait_seconds}; WebSocket connection maintained for match notification
DELETE /queue/leave — body: {player_id}, removes from queue; applies leaver penalty if match was already in acceptance phase
WebSocket /queue/notifications?player_id={id} — server pushes: MatchFound{session_id, server_endpoint, team_assignment}, MatchCancelled{reason}, WaitTimeUpdate{estimated_seconds}
POST /queue/accept — body: {match_id, player_id}, registers acceptance; after all 10 players accept, game server is provisioned
GET /queue/status?region={r}&game_mode={m} — public endpoint: returns current queue size and estimated wait time by tier

Scaling & Bottlenecks

The matching loop's O(N) scan of the queue sorted set (reading 100k players per iteration every 500ms) is expensive at 200k reads/500ms = 400k Redis ops/second per region — within Redis limits but approaching single-instance capacity. Optimize with tier-based segmentation: divide the MMR range into 10 tiers (Bronze through Diamond) and run independent matching loops per tier. Each tier has ~10k players, reducing scan size by 10× and enabling parallel loop execution.

Match acceptance (10 players must confirm within 10 seconds) creates a thundering herd on the game server allocation service when many matches form simultaneously. Buffer allocation requests in a queue (SQS) to prevent spikes from overwhelming the Agones allocation API. If a player doesn't accept within 10 seconds, immediately return the other 9 players to the front of their queue (priority re-queue), not the back — they should get the next available match without losing their wait time.

Key Trade-offs

Skill balance vs. wait time: Strict skill matching minimizes MMR imbalance but causes long queues for players with extreme ratings (top 1% or bottom 1%); expanding search range over time is a continuous trade-off. Publishing the expansion policy to players ("expanding search in 30s") manages expectations.
Regional vs. global pools: Regional pools minimize latency but create thin queues for niche game modes in small regions; cross-region fallback (accepting 20ms extra latency for faster matching) is a pragmatic solution for small player populations.
Deterministic vs. approximate matching: Solving the optimal matching problem (minimize total MMR imbalance across all concurrent matches) is NP-hard; the sliding window greedy approach provides near-optimal results in O(N log N) per iteration.
Party MMR averaging vs. individual matching: Averaging party MMR means a high-MMR player can "carry" low-MMR friends into higher-tier matches; applying an MMR premium to parties (add 50 to team MMR for each party member advantage) reduces this but penalizes legitimate friend groups.