
System Design: Real-Time Chat System

End-to-end system design of a real-time chat system covering WebSocket management, message ordering guarantees, typing indicators, and horizontal scaling strategies.

17 min read · Updated Jan 15, 2025
Tags: system-design, real-time-chat, websocket, messaging

Requirements

Functional Requirements:

  • Users can send and receive text messages in real time in one-on-one and group conversations
  • Typing indicators visible to conversation participants
  • Message delivery and read receipts
  • Conversation history with pagination
  • User online/offline presence status
  • Support for media messages (images, files)

Non-Functional Requirements:

  • Support 10 million concurrent connected users
  • Message delivery latency under 100ms for online users
  • 99.99% availability with at-least-once delivery guarantee
  • Messages must be persisted durably and never lost
  • Horizontal scalability — adding servers increases capacity linearly

Scale Estimation

With 10M concurrent connected users and roughly 30% active at any moment, the 3M active users sending an average of 5 messages per minute generate about 15 million messages per minute — roughly 250,000 messages per second at peak. At an average payload of 200 bytes, that is ~50MB/sec of message throughput. WebSocket connections: 10M persistent connections at ~10KB memory overhead each amount to ~100GB of connection state across the fleet. Sustained at peak rate, text storage alone would grow at roughly 4TB per day; even allowing for realistic diurnal dips, plan for terabytes of new message data daily. Presence updates at 30-second heartbeat intervals produce ~333K heartbeats per second.
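The estimates above reduce to a few lines of arithmetic; a quick back-of-envelope check (all constants taken from the assumptions in this section):

```python
# Back-of-envelope check of the scale estimates above.
CONCURRENT_USERS = 10_000_000
ACTIVE_FRACTION = 0.30
MSGS_PER_ACTIVE_USER_PER_MIN = 5
AVG_MSG_BYTES = 200
HEARTBEAT_INTERVAL_S = 30
CONN_OVERHEAD_BYTES = 10 * 1024

active_users = int(CONCURRENT_USERS * ACTIVE_FRACTION)              # 3M active
msgs_per_sec = active_users * MSGS_PER_ACTIVE_USER_PER_MIN // 60    # peak msgs/s
throughput_mb_s = msgs_per_sec * AVG_MSG_BYTES / 1_000_000          # MB/s
conn_state_gb = CONCURRENT_USERS * CONN_OVERHEAD_BYTES / 1024**3    # ~95 GiB
heartbeats_per_sec = CONCURRENT_USERS // HEARTBEAT_INTERVAL_S       # ~333K/s
storage_tb_per_day = msgs_per_sec * AVG_MSG_BYTES * 86_400 / 1e12   # ~4.3 TB/day
```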

High-Level Architecture

The system consists of four primary layers: connection, routing, persistence, and presence. The Connection Layer comprises a fleet of WebSocket Gateway servers that terminate client connections. Each Gateway server handles up to 500K concurrent WebSocket connections using an event-driven architecture (e.g., Netty for JVM, or a Go-based server using epoll). Clients connect via wss:// with JWT-based authentication on the initial HTTP upgrade handshake.

The Routing Layer determines how messages flow between connected users. When User A sends a message to User B, the message hits A's Gateway server, which publishes it to a Message Broker (Redis Pub/Sub or Kafka). B's Gateway server, subscribed to B's channel, receives the message and pushes it over B's WebSocket. A Session Registry (backed by Redis) maps user_id → gateway_server_id, enabling the routing layer to know which Gateway holds each user's connection. If the recipient is offline, the message is persisted and queued for later delivery.

The Persistence Layer writes every message to a durable store (Cassandra or PostgreSQL) before acknowledging delivery. This ensures that even if the recipient is offline or the Gateway crashes, the message survives. The Presence Layer tracks online/offline status using heartbeat signals — clients send periodic heartbeats (every 30 seconds), and absence of a heartbeat for 90 seconds marks the user as offline.

Core Components

WebSocket Gateway

The Gateway is the entry point for all real-time communication. It handles the WebSocket lifecycle: connection upgrade, authentication, heartbeat monitoring, and graceful disconnection. Each Gateway node maintains an in-memory map of connection_id → user_id and registers this mapping in the Session Registry (Redis hash: user:{user_id} → {gateway_host, connection_id}). On receiving a message from a client, the Gateway validates the payload, assigns a server-side timestamp, and forwards it to the Message Service via gRPC.
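The Gateway's bookkeeping — the in-memory connection map plus the Session Registry entry — looks roughly like this (a sketch with a dict standing in for the Redis hash; class and field names are illustrative):

```python
# Sketch of a Gateway node's connection bookkeeping.
import time

class Gateway:
    def __init__(self, host, registry):
        self.host = host
        self.registry = registry   # stands in for the Redis session registry
        self.connections = {}      # connection_id -> user_id (in-memory map)

    def on_connect(self, connection_id, user_id):
        self.connections[connection_id] = user_id
        # Redis hash in production: user:{user_id} -> {gateway_host, connection_id}
        self.registry[f"user:{user_id}"] = {
            "gateway_host": self.host,
            "connection_id": connection_id,
        }

    def on_client_message(self, connection_id, payload):
        if not payload.get("content"):
            raise ValueError("empty message payload")
        payload["sender_id"] = self.connections[connection_id]
        payload["server_ts"] = time.time()  # server-side timestamp, then forward via gRPC
        return payload

registry = {}
gw = Gateway("gw-1.internal", registry)
gw.on_connect("c-42", "alice")
msg = gw.on_client_message("c-42", {"conversation_id": "conv-9", "content": "hi"})
```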

Message Service

The Message Service is the core business logic layer. It receives messages from Gateway nodes, generates a globally unique message_id (Snowflake ID), validates permissions (is the sender a member of this conversation?), writes the message to Cassandra, and emits a Kafka event on the messages topic partitioned by conversation_id. A Fan-out Consumer reads from Kafka, looks up conversation membership, and publishes to each recipient's Redis Pub/Sub channel. The Message Service also handles message edits and deletions with versioning.
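Snowflake IDs matter here because they are time-ordered, which is what makes the DESC clustering in Cassandra return newest-first pages. A minimal sketch of the classic 41/10/12-bit layout (epoch value is illustrative; a production generator must also handle sequence exhaustion and clock skew):

```python
# Sketch of Snowflake-style IDs: 41 bits of ms since a custom epoch,
# 10 bits of worker id, 12 bits of per-millisecond sequence.
import time

EPOCH_MS = 1_600_000_000_000  # custom epoch (illustrative)

class Snowflake:
    def __init__(self, worker_id):
        assert 0 <= worker_id < 1024
        self.worker_id = worker_id
        self.last_ms = -1
        self.seq = 0

    def next_id(self):
        now = int(time.time() * 1000)
        if now == self.last_ms:
            self.seq = (self.seq + 1) & 0xFFF  # 12-bit sequence within one ms
        else:
            self.last_ms, self.seq = now, 0
        return ((now - EPOCH_MS) << 22) | (self.worker_id << 12) | self.seq

gen = Snowflake(worker_id=7)
a, b = gen.next_id(), gen.next_id()
# IDs are monotonically increasing, so they double as a message ordering key.
```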

Presence Service

The Presence Service maintains user online/offline state. It receives heartbeats from Gateway nodes (batched every 5 seconds per Gateway), updates a Redis key per user with a TTL of 90 seconds (presence:{user_id} → {status, last_seen}). When a key expires, a keyspace notification triggers an offline event published to the user's contacts. To avoid thundering herd presence updates when a Gateway restarts, presence changes are debounced — a user must be offline for 30 seconds before an offline notification is broadcast.
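The TTL-plus-debounce logic above can be sketched with explicit timestamps (a dict of expiry times stands in for Redis keys with TTLs; times are injected rather than read from the clock so the behavior is easy to follow):

```python
# Sketch of presence tracking: 90s TTL keys plus a 30s offline debounce.
PRESENCE_TTL_S = 90
OFFLINE_DEBOUNCE_S = 30

class Presence:
    def __init__(self):
        self.expiry = {}         # user_id -> key expiry time (Redis TTL key)
        self.offline_since = {}  # user_id -> when we first saw the key expired

    def heartbeat(self, user_id, now):
        self.expiry[user_id] = now + PRESENCE_TTL_S
        self.offline_since.pop(user_id, None)

    def should_broadcast_offline(self, user_id, now):
        if self.expiry.get(user_id, 0) > now:
            return False  # key still alive: user is online
        since = self.offline_since.setdefault(user_id, now)
        return now - since >= OFFLINE_DEBOUNCE_S  # debounce gateway restarts

p = Presence()
p.heartbeat("alice", now=0)           # key expires at t=90
online_60 = p.should_broadcast_offline("alice", now=60)    # still online
early_91 = p.should_broadcast_offline("alice", now=91)     # expired, debouncing
late_125 = p.should_broadcast_offline("alice", now=125)    # offline > 30s: broadcast
```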

Database Design

Messages are stored in Cassandra with the following schema: partition key conversation_id, clustering key message_id (Snowflake, DESC). Columns: sender_id, content (text), message_type (enum: text, image, file, system), media_url, reply_to_id, edited_at, deleted (boolean), created_at. This schema allows efficient queries: fetch the latest N messages in a conversation with SELECT * FROM messages WHERE conversation_id = ? ORDER BY message_id DESC LIMIT 50.
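One possible CQL rendering of that schema (column names are taken from the text above; exact types are an assumption):

```cql
CREATE TABLE messages (
    conversation_id bigint,
    message_id      bigint,     -- Snowflake: time-ordered
    sender_id       bigint,
    content         text,
    message_type    text,       -- 'text' | 'image' | 'file' | 'system'
    media_url       text,
    reply_to_id     bigint,
    edited_at       timestamp,
    deleted         boolean,
    created_at      timestamp,
    PRIMARY KEY ((conversation_id), message_id)
) WITH CLUSTERING ORDER BY (message_id DESC);

-- Latest page of a conversation; clustering order makes DESC the default:
SELECT * FROM messages WHERE conversation_id = ? LIMIT 50;
```

Because the clustering order is already DESC, the pagination query from the text needs no explicit ORDER BY, and older pages are fetched with an additional `AND message_id < ?` bound.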

Conversation metadata is stored in PostgreSQL: conversations table (conversation_id, type: direct/group, name, created_at) and a conversation_members junction table (conversation_id, user_id, role, joined_at, last_read_message_id). PostgreSQL handles the relational queries well — listing a user's conversations sorted by last activity, checking membership permissions. A Redis cache (sorted set per user: conversations:{user_id} scored by last_message_timestamp) accelerates the conversation list endpoint.

API Design

  • WebSocket SEND {type: 'message', conversation_id, content, reply_to?} — Send a message; server responds with {type: 'message_ack', message_id, timestamp}
  • WebSocket SEND {type: 'typing', conversation_id} — Typing indicator; server broadcasts to other conversation members with 3-second debounce
  • GET /api/conversations/{id}/messages?before={message_id}&limit=50 — Fetch paginated history via REST (not WebSocket, to avoid head-of-line blocking)
  • GET /api/conversations?limit=20&after={cursor} — List user's conversations sorted by last activity
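The 3-second typing debounce mentioned above can also be applied on the client to avoid flooding the server with one frame per keystroke; a sketch of that client-side variant (the server-side debounce from the API description works the same way, keyed per sender):

```python
# Sketch: emit at most one 'typing' frame per conversation every 3 seconds.
TYPING_DEBOUNCE_S = 3.0

class TypingDebouncer:
    def __init__(self, send):
        self.send = send
        self.last_sent = {}  # conversation_id -> time of last typing frame

    def on_keystroke(self, conversation_id, now):
        if now - self.last_sent.get(conversation_id, float("-inf")) >= TYPING_DEBOUNCE_S:
            self.send({"type": "typing", "conversation_id": conversation_id})
            self.last_sent[conversation_id] = now

sent = []
d = TypingDebouncer(sent.append)
for t in (0.0, 0.5, 1.0, 3.2):
    d.on_keystroke("conv-9", now=t)
# Only the keystrokes at t=0.0 and t=3.2 produce frames.
```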

Scaling & Bottlenecks

The WebSocket Gateway layer is the primary scaling concern. Each server is limited by file descriptors and memory — with careful tuning (increasing ulimit, using connection-efficient runtimes), a single server can handle 500K-1M connections. Horizontal scaling is achieved by adding Gateway nodes behind a Layer 4 load balancer (HAProxy or AWS NLB) with sticky sessions based on a connection token. Gateway nodes are stateless (session state is in Redis), so any node can be replaced without message loss.

The fan-out for large group conversations (1000+ members) is the second bottleneck. Naive fan-out publishes to each member's Redis Pub/Sub channel individually. For large groups, a tiered approach works: the message is published to a group-level channel, and only Gateway nodes with active group members subscribe. This reduces Redis Pub/Sub messages from O(group_size) to O(gateway_nodes_with_members), typically a 100x reduction. Cassandra scaling is straightforward — add nodes to the ring and rebalance; the partition-per-conversation model distributes load naturally.
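The core of the tiered approach is collapsing per-member publishes into per-gateway publishes; a sketch, with a dict standing in for the Session Registry:

```python
# Sketch of tiered fan-out: publish once per gateway that hosts at least
# one online group member, instead of once per member.
def gateways_for_group(members, session_registry):
    """Distinct gateways holding a connection for any online group member."""
    return {session_registry[m] for m in members if m in session_registry}

session_registry = {"u1": "gw-1", "u2": "gw-1", "u3": "gw-2"}  # u4 is offline
members = ["u1", "u2", "u3", "u4"]

targets = gateways_for_group(members, session_registry)
# Naive fan-out: 4 publishes (one per member).
# Tiered fan-out: 2 publishes (one per gateway with members).
```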

Key Trade-offs

  • WebSocket over SSE or long-polling: WebSockets provide bidirectional real-time communication with lower overhead per message, but require more complex infrastructure (sticky LB, connection state tracking) compared to stateless HTTP
  • Cassandra for messages over PostgreSQL: Cassandra's write throughput and partition model are ideal for append-heavy message workloads, but querying across conversations (e.g., search) requires a separate search index
  • Redis Pub/Sub for fan-out over Kafka direct: Redis Pub/Sub is faster (sub-millisecond) but fire-and-forget — if a Gateway is down, the message is lost from the real-time path; Kafka provides durability but adds 5-10ms latency. The hybrid approach (persist to Kafka + real-time via Redis) provides both guarantees
  • At-least-once over exactly-once delivery: At-least-once delivery with client-side deduplication (using message_id) is simpler than distributed exactly-once semantics and acceptable for chat — duplicate messages are rare and easily filtered
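The client-side deduplication in the last trade-off is a few lines once every message carries a unique message_id; a sketch (a real client would bound the seen-set, e.g. with an LRU):

```python
# Sketch: drop redelivered duplicates by message_id (at-least-once delivery).
class DedupInbox:
    def __init__(self):
        self.seen = set()   # bounded LRU in a real client
        self.messages = []

    def on_message(self, msg):
        if msg["message_id"] in self.seen:
            return False    # duplicate redelivery: ignore
        self.seen.add(msg["message_id"])
        self.messages.append(msg)
        return True

inbox = DedupInbox()
inbox.on_message({"message_id": 101, "text": "hi"})
inbox.on_message({"message_id": 101, "text": "hi"})  # redelivered duplicate
```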
