
System Design: WhatsApp

Comprehensive system design of WhatsApp covering end-to-end encryption via Signal Protocol, message delivery guarantees, group chat fan-out, and media transfer at 2 billion user scale.

18 min read · Updated Jan 15, 2025
Tags: system-design, whatsapp, messaging, end-to-end-encryption

Requirements

Functional Requirements:

  • Users can send and receive text messages, images, videos, and documents in real time
  • Support for one-on-one and group conversations (up to 1024 members)
  • End-to-end encryption for all messages using the Signal Protocol
  • Message delivery status indicators (sent, delivered, read)
  • Voice and video calling (one-on-one and group)
  • Offline message queuing with delivery upon reconnection

Non-Functional Requirements:

  • 2 billion MAU, 100 billion messages per day
  • Message delivery latency under 200ms for online recipients
  • 99.99% availability with zero message loss
  • Messages must not be readable by the server (true E2E encryption)
  • Support for low-bandwidth and unreliable network conditions

Scale Estimation

100 billion messages per day translates to roughly 1.16 million messages per second. With an average message size of 100 bytes for text, that is approximately 10TB of text data per day (about 115MB per second). Media messages (images, videos, documents) average 500KB each and constitute about 20% of all messages, adding roughly 10PB of media storage per day. With 2 billion users and an average of 50 contacts each, the contact graph contains 100 billion edges. Connection state management must track 500M+ concurrent WebSocket connections across the server fleet.
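The estimates above can be reproduced with a few lines of arithmetic. This is a back-of-the-envelope sketch using only the figures stated in this section; the variable names are illustrative, and decimal units (1 KB = 1,000 bytes) are assumed.

```python
SECONDS_PER_DAY = 24 * 60 * 60            # 86,400

messages_per_day = 100 * 10**9            # 100 billion messages per day
avg_text_bytes = 100                      # average text message size
avg_media_bytes = 500 * 10**3             # 500 KB per media message
media_messages_per_day = messages_per_day // 5   # 20% of messages carry media
users = 2 * 10**9
avg_contacts = 50

messages_per_second = messages_per_day / SECONDS_PER_DAY
text_bytes_per_day = messages_per_day * avg_text_bytes
text_bytes_per_second = text_bytes_per_day / SECONDS_PER_DAY
media_bytes_per_day = media_messages_per_day * avg_media_bytes
contact_edges = users * avg_contacts

print(f"messages/s:      {messages_per_second / 1e6:.2f} million")
print(f"text per day:    {text_bytes_per_day / 1e12:.0f} TB")
print(f"text per second: {text_bytes_per_second / 1e6:.0f} MB")
print(f"media per day:   {media_bytes_per_day / 1e15:.0f} PB")
print(f"contact edges:   {contact_edges / 1e9:.0f} billion")
```

Media dominates storage by roughly three orders of magnitude, which is why the design routes it through a separate object store rather than the chat path.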

High-Level Architecture

WhatsApp's architecture is built on a connection-oriented model. Each client maintains a persistent encrypted connection (originally XMPP-based, now a custom binary protocol over TCP/TLS) to a Chat Server in the nearest datacenter. The Chat Server fleet is organized by user hash ranges — each user is assigned to a specific Chat Server based on consistent hashing of their phone number. When User A sends a message to User B, the message is encrypted client-side using the Signal Protocol (Double Ratchet algorithm with X3DH key agreement), transmitted to A's Chat Server, routed to B's Chat Server via an internal message bus, and pushed to B's device if online.
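As a minimal sketch of the assignment step, the snippet below maps a phone number to a Chat Server with a consistent-hash ring. The class name, the server names, the use of SHA-256, and the virtual-node count are all assumptions for illustration; the source only states that consistent hashing of the phone number is used.

```python
import bisect
import hashlib


def _ring_position(label: str) -> int:
    """Map a string to a 64-bit position on the ring (hash choice is an assumption)."""
    return int.from_bytes(hashlib.sha256(label.encode("utf-8")).digest()[:8], "big")


class HashRing:
    """Consistent-hash ring with virtual nodes for smoother load distribution."""

    def __init__(self, servers, vnodes=100):
        ring = sorted(
            (_ring_position(f"{server}#{v}"), server)
            for server in servers
            for v in range(vnodes)
        )
        self._points = [point for point, _ in ring]
        self._owners = [server for _, server in ring]

    def server_for(self, phone_number: str) -> str:
        """Return the first server clockwise from the phone number's position."""
        idx = bisect.bisect_right(self._points, _ring_position(phone_number))
        return self._owners[idx % len(self._owners)]  # wrap around the ring


ring = HashRing(["chat-a", "chat-b", "chat-c"])
sender_server = ring.server_for("+15550000001")
recipient_server = ring.server_for("+15550000002")
```

With both lookups in hand, the sender's server knows where to forward the ciphertext; it never needs to inspect the payload to route it.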

If User B is offline, the message is stored in a transient message queue (Mnesia or a custom store) and delivered when B reconnects. WhatsApp famously ran on Erlang/OTP for its Chat Servers, leveraging Erlang's lightweight process model to handle millions of concurrent connections per node — each connection is a separate Erlang process consuming only ~2KB of memory. Media files are uploaded separately to an Object Storage service, and only the encrypted media URL and decryption key are sent as the message payload.

The key exchange infrastructure uses a Key Distribution Service that stores public identity keys and signed pre-keys for each user. When User A wants to message User B for the first time, A fetches B's pre-key bundle from the Key Distribution Service and establishes a session using X3DH (Extended Triple Diffie-Hellman). Subsequent messages use the Double Ratchet algorithm, which provides forward secrecy and post-compromise security: each message key is derived one-way from a ratcheting chain, so compromising a single message key exposes neither past nor future messages.
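To make the ratcheting idea concrete, here is a sketch of only the symmetric-key half of the Double Ratchet: a KDF chain that derives one message key per step. It follows the HMAC-SHA256 construction with single-byte constants that the Double Ratchet specification recommends, but it omits the Diffie-Hellman ratchet, X3DH, and message encryption entirely, and the function name is illustrative.

```python
import hashlib
import hmac


def kdf_chain_step(chain_key: bytes) -> tuple[bytes, bytes]:
    """Advance the chain one step and return (next_chain_key, message_key).

    Both outputs come from HMAC-SHA256 keyed with the current chain key,
    using distinct constant inputs so neither output reveals the other.
    """
    message_key = hmac.new(chain_key, b"\x01", hashlib.sha256).digest()
    next_chain_key = hmac.new(chain_key, b"\x02", hashlib.sha256).digest()
    return next_chain_key, message_key


# Placeholder starting key; in a real session it would come from X3DH
# and the Diffie-Hellman ratchet, not from a constant.
chain_key = bytes(32)
message_keys = []
for _ in range(3):
    chain_key, message_key = kdf_chain_step(chain_key)
    message_keys.append(message_key)
```

Because HMAC is one-way, holding `message_keys[1]` gives no way to compute `message_keys[0]`, `message_keys[2]`, or any chain key, which is the property the paragraph above describes.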

Core Components

Chat Server (Erlang/OTP)

Each Chat Server manages hundreds of thousands to a few million persistent connections. The server maintains an in-memory routing table mapping user IDs to connection PIDs. Message routing between Chat Servers uses an internal RPC layer built on the Erlang distribution protocol. The server handles message serialization, compression (using zlib for text), connection keepalive, and presence updates. Hot-standby failover ensures that if a Chat Server crashes, its user connections are redistributed within seconds.

Message Queue & Offline Storage

Messages destined for offline users are written to a per-user queue in Mnesia (Erlang's distributed database) with a configurable TTL of 30 days. Upon reconnection, the user's Chat Server drains the queue in order, delivering messages with original timestamps. For group messages, the queue stores one copy per offline member. Queue storage is replicated across two datacenters for durability. Once delivered and acknowledged by the client, messages are deleted from the server — WhatsApp does not retain messages post-delivery.
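The queue lifecycle described above (enqueue while offline, drain in order on reconnect, delete on acknowledgement, expire after 30 days) can be sketched in memory as follows. This is an illustration of the semantics only, not of Mnesia or of replication; the class and field names are hypothetical.

```python
from dataclasses import dataclass, field

TTL_SECONDS = 30 * 24 * 60 * 60  # 30-day TTL from the text


@dataclass(order=True)
class QueuedMessage:
    timestamp: float                                # original send time; the only sort key
    message_id: str = field(compare=False)
    encrypted_payload: bytes = field(compare=False)


class OfflineStore:
    def __init__(self):
        self._queues: dict[str, list[QueuedMessage]] = {}

    def enqueue(self, recipient: str, message: QueuedMessage) -> None:
        self._queues.setdefault(recipient, []).append(message)

    def drain(self, recipient: str, now: float) -> list[QueuedMessage]:
        """Drop expired messages and return the rest oldest-first.

        Messages stay queued until acknowledged, so a dropped connection
        mid-drain does not lose anything.
        """
        live = [m for m in self._queues.get(recipient, [])
                if now - m.timestamp < TTL_SECONDS]
        self._queues[recipient] = live
        return sorted(live)

    def ack(self, recipient: str, message_id: str) -> None:
        """Delete a message once the client confirms receipt."""
        queue = self._queues.get(recipient, [])
        self._queues[recipient] = [m for m in queue if m.message_id != message_id]


store = OfflineStore()
store.enqueue("+15550000002", QueuedMessage(200.0, "m2", b"second"))
store.enqueue("+15550000002", QueuedMessage(100.0, "m1", b"first"))
pending = store.drain("+15550000002", now=300.0)   # m1 first, then m2
store.ack("+15550000002", "m1")                    # m1 is removed from the server
```

Separating `drain` from `ack` is the design choice that matters here: deletion is tied to the client's acknowledgement rather than to the send attempt.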

Signal Protocol Key Infrastructure

The Key Distribution Service stores three key types per user: a long-term Identity Key (Curve25519), a medium-term Signed Pre-Key (rotated weekly), and a set of one-time Pre-Keys (consumed on first contact). The service is backed by a Cassandra cluster partitioned by phone number hash. Key verification uses the Safety Number mechanism — a 60-digit number derived from both users' identity keys that users can compare out-of-band. Group messaging uses Sender Keys: the sender generates a symmetric chain key shared with all group members via pairwise Signal sessions.
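The three key types and the "consumed on first contact" rule can be illustrated with a small in-memory model. Everything here is a hypothetical sketch: keys are opaque random bytes rather than real Curve25519 keys, the signature on the Signed Pre-Key is omitted, and the class and method names are not taken from any real service.

```python
import os
from dataclasses import dataclass
from typing import Optional


@dataclass
class PreKeyBundle:
    identity_key: bytes
    signed_pre_key: bytes
    one_time_pre_key: Optional[bytes]  # None once the pool is exhausted


class KeyDistributionService:
    def __init__(self):
        self._identity: dict[str, bytes] = {}
        self._signed: dict[str, bytes] = {}
        self._one_time: dict[str, list[bytes]] = {}

    def register(self, phone, identity_key, signed_pre_key, one_time_pre_keys):
        self._identity[phone] = identity_key
        self._signed[phone] = signed_pre_key
        self._one_time[phone] = list(one_time_pre_keys)

    def rotate_signed_pre_key(self, phone, new_signed_pre_key):
        """Replace the medium-term key; the identity key is left untouched."""
        self._signed[phone] = new_signed_pre_key

    def fetch_bundle(self, phone) -> PreKeyBundle:
        """Hand out a bundle, consuming one one-time pre-key if any remain."""
        pool = self._one_time[phone]
        one_time = pool.pop() if pool else None
        return PreKeyBundle(self._identity[phone], self._signed[phone], one_time)


kds = KeyDistributionService()
kds.register(
    "+15550000002",
    identity_key=os.urandom(32),      # placeholder bytes, not a real public key
    signed_pre_key=os.urandom(32),
    one_time_pre_keys=[os.urandom(32) for _ in range(2)],
)
first = kds.fetch_bundle("+15550000002")
second = kds.fetch_bundle("+15550000002")
third = kds.fetch_bundle("+15550000002")   # pool is empty by now
```

X3DH permits a bundle without a one-time pre-key, which is why the sketch returns `None` instead of failing when the pool runs dry; clients are expected to replenish the pool.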

Database Design

User profiles (phone number, display name, avatar URL, last seen timestamp, public keys) are stored in a Cassandra cluster partitioned by phone number. The contact graph is stored as an adjacency list in Cassandra with partition key user_phone and clustering key contact_phone. Group metadata (group_id, name, members list, admin list, created_at) lives in a separate Cassandra table. Media files are stored in an S3-compatible object store with content-addressed keys (SHA-256 hash of the encrypted blob) enabling deduplication.
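Content addressing is simple to sketch: the storage key is the SHA-256 digest of the encrypted blob, so uploading identical ciphertext twice stores it once. The `MediaStore` class below is a hypothetical in-memory stand-in for the S3-compatible store.

```python
import hashlib


class MediaStore:
    """In-memory stand-in for an object store with content-addressed keys."""

    def __init__(self):
        self._blobs: dict[str, bytes] = {}

    def put(self, encrypted_blob: bytes) -> str:
        """Store the blob under its SHA-256 hex digest and return that key."""
        key = hashlib.sha256(encrypted_blob).hexdigest()
        self._blobs.setdefault(key, encrypted_blob)  # no-op if already present
        return key

    def get(self, key: str) -> bytes:
        return self._blobs[key]

    def __len__(self) -> int:
        return len(self._blobs)


media = MediaStore()
blob = b"\x00\x01encrypted-bytes-placeholder"
key_a = media.put(blob)
key_b = media.put(blob)          # same ciphertext, same key, stored once
key_c = media.put(blob + b"!")   # different ciphertext, different key
```

Note that deduplication only triggers when the ciphertext is byte-identical, for example when the same encrypted blob is re-referenced rather than re-encrypted under a fresh key.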

The offline message store uses a wide-column model: partition key recipient_phone, clustering key message_timestamp, columns for sender_phone, encrypted_payload, message_type, and media_url. This allows efficient range scans to retrieve all pending messages for a reconnecting user in chronological order. The store is configured with a TTL of 30 days — undelivered messages expire automatically.

API Design

  • SEND message — Binary protocol frame: {recipient_phone, encrypted_payload, message_type, media_ref, timestamp} — routed through Chat Server
  • ACK delivery — Client acknowledges receipt: {message_id, status: DELIVERED|READ} — triggers sender notification
  • UPLOAD /media — HTTPS multipart upload of encrypted media blob; returns content-addressed URL
  • GET /keys/{phone_number} — Fetch pre-key bundle for initiating a new Signal Protocol session
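To illustrate what a SEND frame might look like on the wire, the sketch below packs the five listed fields into a length-prefixed binary layout and parses them back. The field order, the widths, and the 16-byte header are assumptions made for this example; the source does not specify the actual wire format.

```python
import struct
from dataclasses import dataclass


@dataclass(frozen=True)
class SendFrame:
    recipient_phone: str
    encrypted_payload: bytes
    message_type: int      # e.g. 1 = text, 2 = media (hypothetical codes)
    media_ref: str         # empty string when there is no media
    timestamp_ms: int


# Assumed header, network byte order, no padding (16 bytes total):
# message_type (1) | timestamp_ms (8) | len(phone) (1) | len(media_ref) (2) | len(payload) (4)
_HEADER = struct.Struct("!BQBHI")


def encode(frame: SendFrame) -> bytes:
    phone = frame.recipient_phone.encode("utf-8")
    media = frame.media_ref.encode("utf-8")
    header = _HEADER.pack(frame.message_type, frame.timestamp_ms,
                          len(phone), len(media), len(frame.encrypted_payload))
    return header + phone + media + frame.encrypted_payload


def decode(data: bytes) -> SendFrame:
    message_type, timestamp_ms, n_phone, n_media, n_payload = _HEADER.unpack_from(data, 0)
    offset = _HEADER.size
    phone = data[offset:offset + n_phone].decode("utf-8")
    offset += n_phone
    media = data[offset:offset + n_media].decode("utf-8")
    offset += n_media
    payload = data[offset:offset + n_payload]
    return SendFrame(phone, payload, message_type, media, timestamp_ms)


frame = SendFrame("+15550000002", b"\x9a\x01ciphertext", 1, "", 1736899200000)
wire = encode(frame)
```

A fixed header plus length prefixes keeps per-message overhead to a few bytes, which is the kind of saving that matters on constrained mobile links.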

Scaling & Bottlenecks

The primary scaling challenge is managing 500M+ concurrent TCP connections. WhatsApp addresses this with Erlang's actor model — each connection is an isolated lightweight process, and a single server can handle 2-3 million concurrent connections. The fleet is horizontally scaled using consistent hashing; adding new servers requires rebalancing only a fraction of connections. DNS-based routing directs users to the nearest datacenter, and connection migration between servers during rebalancing uses a handoff protocol where the old server forwards buffered messages to the new server.
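The claim that adding a server rebalances only a fraction of connections can be checked with a small experiment. The sketch below builds a consistent-hash ring for 10 servers and again for 11, then counts how many of 10,000 synthetic users change owner. Server names, phone numbers, the SHA-256 hash, and the virtual-node count are all assumptions for illustration.

```python
import bisect
import hashlib


def _point(label: str) -> int:
    return int.from_bytes(hashlib.sha256(label.encode("utf-8")).digest()[:8], "big")


def build_ring(servers, vnodes=100):
    """Return parallel lists: sorted ring positions and the server owning each."""
    ring = sorted((_point(f"{s}#{v}"), s) for s in servers for v in range(vnodes))
    return [p for p, _ in ring], [s for _, s in ring]


def owner(ring, key: str) -> str:
    points, owners = ring
    return owners[bisect.bisect_right(points, _point(key)) % len(points)]


users = [f"+1555{n:07d}" for n in range(10_000)]
old_ring = build_ring([f"chat-{i}" for i in range(10)])
new_ring = build_ring([f"chat-{i}" for i in range(11)])   # chat-10 joins

moved = [u for u in users if owner(old_ring, u) != owner(new_ring, u)]
moved_fraction = len(moved) / len(users)
print(f"{moved_fraction:.1%} of users changed servers")
```

Only users whose ring position now falls just before one of the new server's points move, and they all move to the new server, so the expected share is about 1/11 rather than the near-total reshuffle that `hash(phone) % server_count` would cause.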

Group messaging creates write amplification: a message to a 1024-member group must be delivered to up to 1023 recipients. WhatsApp reduces the encryption cost of this fan-out with the Sender Keys optimization: the sender encrypts the message once with a symmetric Sender Key, the server fans the same ciphertext out, and each group member decrypts locally. The Sender Key is distributed via pairwise Signal Protocol sessions when a member joins the group. Delivery notifications (sent/delivered/read) still require per-member routing but use batched delivery to reduce overhead.
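The difference between pairwise encryption and Sender Keys can be sketched by counting ciphertexts. The cipher below is a toy (an HMAC-SHA256 counter-mode keystream XORed with the plaintext) standing in for a real authenticated cipher; it must not be used for actual encryption, and all names are illustrative. Key ratcheting and the pairwise distribution of the Sender Key are omitted.

```python
import hashlib
import hmac
import os


def toy_cipher(key: bytes, data: bytes) -> bytes:
    """XOR data with an HMAC-SHA256 counter-mode keystream.

    Toy stand-in for a real AEAD. Applying it twice with the same key
    returns the original data, so it serves as both encrypt and decrypt.
    """
    stream = b""
    counter = 0
    while len(stream) < len(data):
        stream += hmac.new(key, counter.to_bytes(4, "big"), hashlib.sha256).digest()
        counter += 1
    return bytes(d ^ s for d, s in zip(data, stream))


group_size = 1024
recipients = [f"member-{i}" for i in range(group_size - 1)]  # everyone but the sender
message = b"meeting moved to 3pm"

# Pairwise approach: one encryption per recipient, one distinct ciphertext each.
pairwise_keys = {member: os.urandom(32) for member in recipients}
pairwise_ciphertexts = {m: toy_cipher(k, message) for m, k in pairwise_keys.items()}

# Sender Keys approach: one encryption; the server delivers the same blob to everyone.
sender_key = os.urandom(32)   # assumed already shared over pairwise sessions
group_ciphertext = toy_cipher(sender_key, message)
deliveries = {member: group_ciphertext for member in recipients}

# Any member holding the sender key decrypts locally.
decrypted = toy_cipher(sender_key, deliveries["member-0"])
```

The number of deliveries is the same in both approaches; what Sender Keys saves is the sender's encryption work and upload size, which drop from one ciphertext per recipient to a single ciphertext.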

Key Trade-offs

  • Erlang/OTP over traditional server frameworks: Erlang's lightweight processes and fault-tolerant supervisor trees enable millions of connections per node, but the ecosystem has a smaller talent pool and fewer libraries than JVM or Go alternatives
  • Delete-after-delivery vs. server-side storage: Not retaining messages after delivery eliminates server-side storage costs and privacy risks, but means multi-device sync requires the primary device to be online
  • Signal Protocol with Sender Keys for groups: Sender Keys reduce per-message group encryption overhead from O(n) encryptions to O(1), but require re-keying whenever a member leaves (so the departed member cannot decrypt later messages), which is expensive for large groups
  • Custom binary protocol over standard XMPP: The custom protocol reduces bandwidth by 40% compared to XML-based XMPP, critical for users on 2G/3G networks, but increases client implementation complexity
