System Design: Discord

Requirements

Functional Requirements:

Users join servers (guilds) with text channels, voice channels, and forums
Real-time text messaging with embeds, reactions, and attachments
Low-latency voice and video chat in channels (up to hundreds of participants)
Role-based access control with granular channel permissions
Rich presence (game activity, streaming status, custom status)
Bot ecosystem with programmable interactions

Non-Functional Requirements:

200 million MAU, 25 million concurrent users at peak
Text message delivery under 100ms; voice latency under 50ms
99.99% availability; messages must never be lost
Support for servers with 1M+ members (e.g., official game servers)
Voice quality comparable to native gaming voice chat

Scale Estimation

With 200M MAU and high engagement, Discord processes approximately 4 billion messages per day, or roughly 46,000 messages per second on average (peak rates run higher). At an average message size of 150 bytes, that is ~600GB of raw text per day before replication and indexes. Voice: 5M concurrent voice connections at peak, each streaming 50kbps of Opus audio, gives 250Gbps of aggregate inbound voice bandwidth; outbound traffic is higher because an SFU forwards each stream to multiple listeners. The system manages 19 million active servers, with the largest having over 1M members. Presence updates for 25M concurrent users at 30-second intervals generate ~830K presence writes per second.
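The arithmetic behind these figures can be checked with a short script. This is only a back-of-envelope sketch: every input is taken from the estimate above, the variable names are illustrative, and averages are assumed throughout (no peak-to-average factor).

```python
# Back-of-envelope check of the scale estimate. All inputs come from the text;
# the variable names are illustrative.
SECONDS_PER_DAY = 24 * 60 * 60

messages_per_day = 4_000_000_000
avg_message_bytes = 150
concurrent_voice_connections = 5_000_000
opus_bits_per_second = 50_000
concurrent_users = 25_000_000
presence_interval_seconds = 30

# Average message rate over a full day.
messages_per_second = messages_per_day / SECONDS_PER_DAY

# Raw text volume per day in gigabytes (10^9 bytes), before replication.
text_gb_per_day = messages_per_day * avg_message_bytes / 1e9

# Aggregate inbound voice bandwidth in gigabits per second.
voice_gbps = concurrent_voice_connections * opus_bits_per_second / 1e9

# One presence heartbeat per user per interval.
presence_writes_per_second = concurrent_users / presence_interval_seconds

print(f"{messages_per_second:,.0f} messages/s")
print(f"{text_gb_per_day:,.0f} GB of text per day")
print(f"{voice_gbps:,.0f} Gbps inbound voice")
print(f"{presence_writes_per_second:,.0f} presence writes/s")
```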

High-Level Architecture

Discord's architecture separates three real-time planes: text messaging, voice/video, and presence. The text messaging layer uses a Gateway WebSocket that each client connects to. The Gateway is stateful: it holds the user's session, tracks which guilds and channels they have open, and pushes real-time events (new messages, typing indicators, reactions). Behind the Gateway, a Message Service handles persistence to ScyllaDB (Discord famously migrated message storage from MongoDB to Cassandra and then to ScyllaDB) and emits events to a pub/sub layer for fan-out.

The voice layer uses a Selective Forwarding Unit (SFU) architecture. Unlike peer-to-peer (which doesn't scale beyond ~5 participants) or an MCU (which is CPU-expensive because it decodes and mixes every stream), the SFU receives each participant's audio/video stream and selectively forwards it to other participants without decoding. Discord runs dedicated Voice Servers in multiple regions; when a user joins a voice channel, the backend assigns them to the nearest Voice Server. Each Voice Server handles up to 1,000 concurrent voice channels using a custom C++ media relay.

Presence and guild membership are managed by a distributed Guild Service that maintains in-memory state for all active guilds. Each guild is assigned to a Guild Process (conceptually similar to an Erlang actor) that tracks member online status, permissions, and channel state. Guild Processes are distributed across a fleet of servers using consistent hashing on guild_id.
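A minimal sketch of consistent hashing on guild_id follows. It assumes a hash ring with virtual nodes and SHA-256 as the hash function; the class name, node names, and vnode count are hypothetical choices for illustration, not details of Discord's implementation.

```python
import bisect
import hashlib


class HashRing:
    """Toy consistent-hash ring that maps guild IDs to Guild Service nodes."""

    def __init__(self, nodes, vnodes=64):
        # Each physical node is placed on the ring at `vnodes` positions
        # so that load spreads evenly and rebalancing moves few keys.
        self._ring = sorted(
            (self._hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    @staticmethod
    def _hash(key: str) -> int:
        # First 8 bytes of SHA-256 give a stable 64-bit ring position.
        return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")

    def node_for(self, guild_id: int) -> str:
        # Walk clockwise to the first virtual node past the key's position,
        # wrapping around at the end of the ring.
        position = self._hash(str(guild_id))
        index = bisect.bisect_right(self._points, position) % len(self._points)
        return self._ring[index][1]


ring = HashRing(["guild-node-a", "guild-node-b", "guild-node-c"])
print(ring.node_for(175928847299117063))
```

The useful property is that removing a node only remaps the guilds that lived on it; guilds owned by the surviving nodes keep their placement, so most Guild Processes do not have to move.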

Core Components

Gateway (WebSocket)

The Gateway is Discord's real-time event delivery system. Each client opens a single WebSocket to a Gateway node, which maintains the session state (user ID, active guilds, open channels). The Gateway subscribes to events for the user's guilds via an internal pub/sub system (Redis Pub/Sub originally, now a custom solution called Shard Orchestrator). Events are filtered per user: a user only receives messages for channels they have permission to view. The Gateway fleet handles 25M concurrent connections. Discord has publicly described building this layer in Elixir on the Erlang VM, whose lightweight processes suit holding millions of mostly idle connections.

Voice Server (SFU)

Voice Servers implement WebRTC-compatible media relay using a custom C++ engine. Each server handles ~2,000 concurrent participants across multiple voice channels. Audio is encoded using Opus at a nominal 64kbps (the scale estimate uses 50kbps as a working average); video uses VP8/H.264 with simulcast (multiple quality layers). The SFU selectively forwards streams based on the client's available bandwidth and active speaker detection: only the top 3 speakers' video streams are forwarded at full quality, reducing downstream video bandwidth by roughly 80% in large channels. DTLS-SRTP provides encryption for all media streams.
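The forwarding decision described above can be sketched as a per-receiver layer selection. This is an illustrative model only: the Participant type, the audio_level field, and the "high"/"low" layer names are assumptions, and a real SFU would also factor in each receiver's bandwidth estimate.

```python
from dataclasses import dataclass


@dataclass
class Participant:
    user_id: int
    audio_level: float  # hypothetical smoothed loudness in the range 0.0..1.0


def select_video_layers(participants, receiver_id, top_n=3):
    """Choose which simulcast layer of each sender to forward to one receiver.

    The loudest `top_n` senders get the high-quality layer; everyone else is
    forwarded at the low-quality layer. The receiver's own stream is skipped.
    """
    others = [p for p in participants if p.user_id != receiver_id]
    ranked = sorted(others, key=lambda p: p.audio_level, reverse=True)
    top_speakers = {p.user_id for p in ranked[:top_n]}
    return {
        p.user_id: ("high" if p.user_id in top_speakers else "low")
        for p in others
    }


channel = [
    Participant(1, 0.9),
    Participant(2, 0.1),
    Participant(3, 0.5),
    Participant(4, 0.7),
    Participant(5, 0.3),
]
print(select_video_layers(channel, receiver_id=1))
```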

Guild State Service

Each guild (server) has a dedicated logical process that maintains its state: member list, channel tree, role hierarchy, and permission overrides. For small guilds (<10K members), the full member list is held in memory. For large guilds (>100K members), only online members and recent participants are tracked in memory; the full member list is fetched from Cassandra on demand. Guild processes communicate via an internal RPC layer, so that cross-guild operations, such as propagating a user profile update, can reach every guild the user belongs to.

Database Design

Message storage uses ScyllaDB (a Cassandra-compatible database written in C++) with partition key (channel_id, bucket), where bucket is a time-based partition: messages are bucketed into 10-day windows so that a busy channel does not grow a single unbounded, hot partition. The clustering key is message_id (Snowflake format). Columns include author_id, content, embeds (JSON), attachments (JSON array of CDN URLs), reactions (map of emoji to user_id sets), and flags. This schema enables efficient range queries for channel history: SELECT * FROM messages WHERE channel_id = ? AND bucket = ? ORDER BY message_id DESC LIMIT 50. Reading further back than the current window means querying older buckets one at a time.
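Because a Snowflake ID embeds its creation time, the bucket can be derived from the message_id alone. The sketch below assumes the publicly documented Discord Snowflake layout (milliseconds since the Discord epoch in the bits above bit 22) and 10-day buckets counted from that epoch; the function names are illustrative.

```python
DISCORD_EPOCH_MS = 1_420_070_400_000      # 2015-01-01T00:00:00Z in Unix milliseconds
BUCKET_MS = 10 * 24 * 60 * 60 * 1000      # 10-day window, as in the schema above


def snowflake_timestamp_ms(snowflake: int) -> int:
    """Unix timestamp in milliseconds encoded in a Snowflake ID."""
    return (snowflake >> 22) + DISCORD_EPOCH_MS


def bucket_for(snowflake: int) -> int:
    """Index of the 10-day bucket, counted from the Discord epoch."""
    return (snowflake >> 22) // BUCKET_MS


def buckets_newest_first(before_id: int, oldest_id: int) -> list:
    """Buckets a history read may need to visit, newest first."""
    return list(range(bucket_for(before_id), bucket_for(oldest_id) - 1, -1))


message_id = 175928847299117063
print(snowflake_timestamp_ms(message_id), bucket_for(message_id))
```

A history read would start at the bucket of the `before` cursor and step to older buckets only until it has collected enough rows, which keeps most reads inside one partition.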

Guild metadata (guild_id, name, owner_id, icon_url, features, member_count) is stored in PostgreSQL. Role and permission data lives alongside guild metadata with a separate roles table (role_id, guild_id, permissions_bitfield, position). User data (user_id, username, discriminator, avatar_hash, flags) is stored in a separate PostgreSQL cluster sharded by user_id. Relationships (friends, blocks) use a Cassandra table with partition key user_id and clustering key target_user_id.
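The permissions_bitfield column lends itself to cheap bitwise checks. The following is a simplified sketch: the three flag values match Discord's published permission bits, but the function and the (allow, deny) overwrite tuples are assumptions for illustration, and the real resolution order applies overwrites in layers (@everyone, then roles, then the member).

```python
VIEW_CHANNEL = 1 << 10
SEND_MESSAGES = 1 << 11
ADMINISTRATOR = 1 << 3
ALL_PERMISSIONS = (1 << 64) - 1


def effective_permissions(role_bitfields, overwrites=()):
    """Combine role bitfields, then apply channel overwrites.

    `overwrites` is an iterable of (allow_bits, deny_bits) pairs.
    Simplification: all denies are applied first, then all allows.
    """
    base = 0
    for bits in role_bitfields:
        base |= bits                      # a member holds the union of their roles

    if base & ADMINISTRATOR:
        return ALL_PERMISSIONS            # administrators bypass channel overwrites

    allow = deny = 0
    for allow_bits, deny_bits in overwrites:
        allow |= allow_bits
        deny |= deny_bits
    return (base & ~deny) | allow


perms = effective_permissions(
    [VIEW_CHANNEL, SEND_MESSAGES],
    overwrites=[(0, SEND_MESSAGES)],      # read-only channel
)
print(bool(perms & VIEW_CHANNEL), bool(perms & SEND_MESSAGES))
```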

API Design

POST /api/v10/channels/{channel_id}/messages — Send a message: {content, embeds?, attachments?, message_reference?}; returns message object with Snowflake ID
GET /api/v10/channels/{channel_id}/messages?before={id}&limit=50 — Fetch channel history with cursor-based pagination
POST /api/v10/channels/{channel_id}/call — Join voice channel; returns Voice Server endpoint and session credentials
PUT /api/v10/guilds/{guild_id}/members/{user_id}/roles/{role_id} — Assign a role to a member; checked against permission hierarchy
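The cursor-based pagination of the history endpoint can be modeled in memory. This sketch stands in for the server side with a plain list of message dicts; fetch_history and iterate_history are hypothetical helpers, not part of any client library.

```python
def fetch_history(messages, before=None, limit=50):
    """Return up to `limit` messages with id < `before`, newest first.

    Mirrors GET .../messages?before={id}&limit=50 against an in-memory list.
    """
    eligible = [m for m in messages if before is None or m["id"] < before]
    eligible.sort(key=lambda m: m["id"], reverse=True)
    return eligible[:limit]


def iterate_history(messages, limit=50):
    """Walk the whole channel page by page, using the oldest id as the cursor."""
    cursor = None
    while True:
        page = fetch_history(messages, before=cursor, limit=limit)
        if not page:
            return
        yield page
        cursor = page[-1]["id"]           # the next page starts strictly before this id


channel = [{"id": i, "content": f"message {i}"} for i in range(1, 121)]
for page in iterate_history(channel):
    print(len(page), page[0]["id"], page[-1]["id"])
```

Using the message ID as the cursor, rather than an offset, keeps pages stable while new messages arrive, since newer IDs never fall before an existing cursor.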

Scaling & Bottlenecks

The biggest challenge is large guilds with 1M+ members. When a message is sent in a popular channel, naive fan-out to every member produces a burst of writes across the Gateway layer for each message. Discord solves this using lazy delivery: only members who currently have the channel open receive a real-time push; others see the unread indicator on their next channel list sync. This reduces fan-out from O(guild_size) to O(active_viewers), which for the largest guilds can be 3-4 orders of magnitude smaller.
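A minimal sketch of lazy delivery, under the assumption that each channel tracks its current viewers and its latest message ID while each user tracks their own last-read ID. The ChannelState class and its method names are hypothetical.

```python
class ChannelState:
    """Per-channel state kept by the guild process in this toy model."""

    def __init__(self):
        self.last_message_id = 0
        self.viewers = set()              # users who currently have the channel open

    def open(self, user_id):
        self.viewers.add(user_id)

    def close(self, user_id):
        self.viewers.discard(user_id)

    def publish(self, message_id):
        """Record the message and return only the users to push to.

        The cost is proportional to the number of active viewers,
        not to the size of the guild.
        """
        self.last_message_id = max(self.last_message_id, message_id)
        return set(self.viewers)

    def is_unread_for(self, last_read_id):
        """Evaluated lazily, when a non-viewer next syncs their channel list."""
        return self.last_message_id > last_read_id


channel = ChannelState()
channel.open(101)
channel.open(102)
print(channel.publish(message_id=5000))   # only the two active viewers
print(channel.is_unread_for(last_read_id=4000))
```

Nothing is written per absent member at publish time; the unread state falls out of comparing two IDs during the next sync.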

Voice Server scaling requires regional awareness. Discord operates Voice Servers in 15+ regions; users are routed to the nearest region, but when a voice channel spans multiple regions, a relay bridge connects the regional SFUs. This adds ~30ms latency for cross-region participants but avoids routing all traffic through a single origin.

ScyllaDB scaling follows a ring topology with automatic rebalancing; Discord runs separate ScyllaDB clusters per message volume tier to isolate noisy workloads.

Key Trade-offs

SFU over MCU for voice: SFU avoids the CPU cost of mixing audio/video streams server-side, but pushes bandwidth and decoding costs to clients — works because gaming PCs and modern phones can handle it
ScyllaDB over PostgreSQL for messages: ScyllaDB's write throughput and time-series partitioning handle message volume better than PostgreSQL, but queries are less flexible (no JOINs, no full-text search — search is a separate Elasticsearch layer)
Lazy delivery for large guilds: Only pushing to active viewers reduces fan-out dramatically, but means notification counts may be slightly delayed for inactive channel viewers
Snowflake IDs over UUIDs: Snowflake IDs encode a timestamp for time-ordered queries and are 64-bit (vs UUID's 128-bit), halving the key size in indexes, but each generator needs a unique worker ID and a reasonably accurate clock, which takes some coordination that random UUIDs avoid
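To make the Snowflake trade-off concrete, here is a minimal generator sketch. It assumes the publicly documented Discord layout (42-bit millisecond timestamp since the Discord epoch, 5-bit worker ID, 5-bit process ID, 12-bit sequence); the class name and the injectable clock are illustrative, and the sketch is not thread-safe.

```python
import time

DISCORD_EPOCH_MS = 1_420_070_400_000      # 2015-01-01T00:00:00Z in Unix milliseconds


class SnowflakeGenerator:
    """Generates time-ordered 64-bit IDs on a single worker."""

    def __init__(self, worker_id, process_id, clock=None):
        if not (0 <= worker_id < 32 and 0 <= process_id < 32):
            raise ValueError("worker_id and process_id must fit in 5 bits")
        self._worker_id = worker_id
        self._process_id = process_id
        # The clock is injectable (an assumption of this sketch) so it can be tested.
        self._clock = clock or (lambda: int(time.time() * 1000))
        self._last_ms = -1
        self._sequence = 0

    def next_id(self):
        now = self._clock()
        if now < self._last_ms:
            raise RuntimeError("clock moved backwards; refusing to issue IDs")
        if now == self._last_ms:
            self._sequence = (self._sequence + 1) & 0xFFF
            if self._sequence == 0:
                # 4096 IDs already issued this millisecond: wait for the next one.
                while now <= self._last_ms:
                    now = self._clock()
        else:
            self._sequence = 0
        self._last_ms = now
        return (
            ((now - DISCORD_EPOCH_MS) << 22)
            | (self._worker_id << 17)
            | (self._process_id << 12)
            | self._sequence
        )


generator = SnowflakeGenerator(worker_id=1, process_id=2)
first, second = generator.next_id(), generator.next_id()
print(first, second, first < second)
```

Each worker generates IDs locally without a network round trip; the only shared state is the assignment of distinct worker and process IDs.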