
System Design: Group Messaging

System design for group messaging at scale, covering fan-out strategies, message ordering in distributed groups, membership management, and handling groups with thousands of members.

16 min read · Updated Jan 15, 2025
system-design · group-messaging · fan-out · messaging

Requirements

Functional Requirements:

  • Users can create groups with up to 100,000 members
  • Members send text, media, and reply messages visible to all group participants
  • Group admin controls: add/remove members, promote admins, set group metadata
  • Message ordering consistent across all members
  • Mention notifications (@user, @everyone)
  • Group message history accessible to all current members

Non-Functional Requirements:

  • Support 50 million active groups with an average of 20 members
  • Message delivery latency under 200ms for online members
  • 99.99% availability with zero message loss
  • Linear scalability with group size — a 100K-member group should not degrade system performance
  • Consistent message ordering within a group across all members

Scale Estimation

50 million active groups with an average of 20 members produce roughly 5 billion group messages per day (assuming each group generates 100 messages/day). That averages about 57,870 messages/sec; at an average message size of 200 bytes, text throughput is ~11.5 MB/sec. The fan-out challenge: each message must reach all group members. For average groups (20 members), fan-out is 20x, producing 100 billion delivery events per day. For large groups (100K members), a single message creates 100K delivery events. Membership storage: 50M groups × 20 members × 32 bytes per entry = ~32GB.
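
The arithmetic behind these figures, as a quick sketch; every input is an assumption stated above, not a measured value:

```python
# Back-of-envelope math for the estimates above.
GROUPS = 50_000_000             # active groups
AVG_MEMBERS = 20                # average members per group
MSGS_PER_GROUP_PER_DAY = 100    # assumed per-group message rate
AVG_MSG_BYTES = 200             # average text message size
MEMBER_ENTRY_BYTES = 32         # per membership entry

msgs_per_day = GROUPS * MSGS_PER_GROUP_PER_DAY                   # 5 billion
msgs_per_sec = msgs_per_day / 86_400                             # ~57,870
text_mb_per_sec = msgs_per_sec * AVG_MSG_BYTES / 1e6             # ~11.5 MB/sec
deliveries_per_day = msgs_per_day * AVG_MEMBERS                  # 100 billion
membership_gb = GROUPS * AVG_MEMBERS * MEMBER_ENTRY_BYTES / 1e9  # ~32 GB

print(f"{msgs_per_sec:,.0f} msg/s, {text_mb_per_sec:.1f} MB/s, "
      f"{deliveries_per_day:,} deliveries/day, {membership_gb:.0f} GB membership")
```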

High-Level Architecture

The architecture uses a shared message log model (inspired by Kafka's design) rather than per-recipient message storage. Each group has a logical message log — an ordered append-only sequence of messages identified by monotonically increasing sequence numbers. When a member sends a message, it is appended to the group's log with a sequence number assigned by a Group Sequencer Service. Members read from this shared log, each maintaining their own read pointer (last_read_seq).

The delivery path has two modes: push and pull. For online members, the system pushes new messages in real time via WebSocket connections. A Group Fan-out Service subscribes to new message events (via Kafka, partitioned by group_id) and publishes to each online member's WebSocket Gateway. For offline members, no fan-out occurs — when they reconnect, they pull missed messages by querying the shared log from their last_read_seq forward. This push-for-online, pull-for-offline hybrid dramatically reduces write amplification.
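
A minimal sketch of the pull path for a reconnecting member, assuming the read pointers live in the Redis hash described under Database Design and using a placeholder fetch_after helper for the shared-log range read:

```python
# Offline catch-up: read everything after the member's last_read_seq (illustrative).
import redis

r = redis.Redis(host="localhost", port=6379)

def catch_up(group_id: str, user_id: str, fetch_after):
    """Return the messages a member missed while offline.

    `fetch_after(group_id, after_seq, limit)` is a placeholder for the shared-log
    range read described in the Database Design section; it returns dicts with a
    `sequence_number` field.
    """
    last_read = int(r.hget(f"read_pointers:{group_id}", user_id) or 0)
    missed = []
    while True:
        batch = fetch_after(group_id, after_seq=last_read, limit=50)
        if not batch:
            break
        missed.extend(batch)
        last_read = batch[-1]["sequence_number"]
    # In practice the pointer advances only after the client acknowledges delivery;
    # done eagerly here to keep the sketch short.
    r.hset(f"read_pointers:{group_id}", user_id, last_read)
    return missed
```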

The Group Membership Service manages group state: member list, roles, join/leave events. It is backed by a PostgreSQL cluster sharded by group_id. When membership changes, the service emits events that update the Fan-out Service's subscription map and invalidate cached member lists. The service enforces permission checks — only admins can add members, non-members cannot send messages.

Core Components

Group Sequencer

The Group Sequencer assigns per-group sequence numbers that define a total order over a group's messages. Each group has a logical sequencer. For small groups, a Redis INCR operation on seq:{group_id} suffices. For high-throughput groups, a dedicated sequencer process (partitioned by group_id hash) assigns sequences from a pre-allocated range, avoiding contention. Messages receive sequence numbers in the order they reach the sequencer, so if a member sends message B after seeing message A, B always gets a higher sequence number; every member therefore observes the same order, and causal ordering within the group is preserved.
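
A sketch of the simple path, assuming the seq:{group_id} counter lives in Redis as described; the batched, pre-allocated-range variant for very active groups is sketched under Scaling & Bottlenecks:

```python
# Per-group sequence assignment with a single atomic Redis INCR (simple-path sketch).
import redis

r = redis.Redis(host="localhost", port=6379)

def next_sequence(group_id: str) -> int:
    # INCR is atomic, so concurrent senders in the same group always receive
    # distinct, strictly increasing sequence numbers.
    return r.incr(f"seq:{group_id}")
```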

Fan-out Service

The Fan-out Service bridges the shared message log and individual user connections. It consumes from the Kafka group-messages topic, looks up which group members are currently online (via the Session Registry in Redis), and publishes the message to each online member's Gateway node via Redis Pub/Sub. For groups larger than 10K members, the service uses a tiered approach: it publishes to a group-level pub/sub channel, and only Gateway nodes with connected group members subscribe to that channel, reducing fan-out to O(gateway_nodes) instead of O(members).
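
A sketch of the fan-out decision under these assumptions: the cached member set and the Session Registry layout are as assumed below, and the Kafka consumer loop and message serialization are omitted:

```python
# Tiered fan-out: per-user publishes for small groups, one group-level channel
# for large groups (illustrative; channel and key names are assumptions).
import json
import redis

r = redis.Redis(host="localhost", port=6379)
LARGE_GROUP_THRESHOLD = 10_000

def fan_out(message: dict) -> None:
    group_id = message["group_id"]
    member_count = r.scard(f"group:{group_id}")           # cached member set
    if member_count > LARGE_GROUP_THRESHOLD:
        # Tiered path: one publish per message; only Gateway nodes that host
        # members of this group are subscribed to the channel.
        r.publish(f"group-channel:{group_id}", json.dumps(message))
        return
    # Small-group path: publish to each online member's Gateway channel.
    for user_id in r.smembers(f"group:{group_id}"):
        gateway = r.hget("session_registry", user_id)      # None if offline
        if gateway:
            r.publish(f"gateway:{gateway.decode()}", json.dumps(message))
```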

Membership Service

The Membership Service is the source of truth for group composition. It stores member lists in PostgreSQL: (group_id, user_id, role, joined_at, invited_by). Adding a member to a large group triggers: (1) insert into the members table, (2) emit a Kafka event for the Fan-out Service to update subscriptions, (3) send the new member the last 50 messages from the shared log as initial history. Removing a member revokes access immediately by updating the membership table and disconnecting any active subscription. A Redis cache of group:{group_id} → Set<user_id> accelerates membership lookups for the fan-out path.
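
A sketch of the add-member flow described above; the database handle, Kafka producer, cache client, history fetcher, and topic name are placeholders, and the table layout follows the Database Design section:

```python
# Add-member flow: persist, notify fan-out, refresh cache, seed history (sketch).
import json

def add_member(db, producer, cache, fetch_recent, group_id, user_id, invited_by):
    # 1. Record membership in the source of truth (PostgreSQL).
    db.execute(
        "INSERT INTO group_members (group_id, user_id, role, joined_at, invited_by) "
        "VALUES (%s, %s, 'member', now(), %s)",
        (group_id, user_id, invited_by),
    )
    # 2. Tell the Fan-out Service to update its subscription map (topic name assumed).
    producer.send("group-membership-events", json.dumps(
        {"type": "member_added", "group_id": group_id, "user_id": user_id}
    ).encode())
    # 3. Keep the cached member set used on the fan-out path in sync.
    cache.sadd(f"group:{group_id}", user_id)
    # 4. Seed the new member with initial history (last 50 messages from the shared log).
    return fetch_recent(group_id, limit=50)
```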

Database Design

The shared message log is stored in Cassandra with partition key (group_id, bucket), where bucket is a time-based segment (e.g., one bucket per day for active groups, one per week for less active ones). Clustering key is sequence_number (ASC). Columns: sender_id, content, message_type, media_url, reply_to_seq, mentions (list of user_ids), created_at. This schema enables efficient range reads: SELECT * FROM group_messages WHERE group_id = ? AND bucket = ? AND sequence_number > ? LIMIT 50, which is exactly the query pattern for pulling missed messages.
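
A sketch of the table and the catch-up query using the DataStax Python driver; the keyspace name, column types, and sample values are assumptions, while the layout follows the schema above:

```python
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("messaging")   # keyspace name is illustrative

# Shared message log: partitioned by (group_id, bucket), clustered by sequence_number.
session.execute("""
    CREATE TABLE IF NOT EXISTS group_messages (
        group_id        bigint,
        bucket          int,
        sequence_number bigint,
        sender_id       bigint,
        content         text,
        message_type    text,
        media_url       text,
        reply_to_seq    bigint,
        mentions        list<bigint>,
        created_at      timestamp,
        PRIMARY KEY ((group_id, bucket), sequence_number)
    ) WITH CLUSTERING ORDER BY (sequence_number ASC)
""")

# Catch-up read: everything after the caller's last_read_seq within one bucket.
group_id, bucket, last_read_seq = 42, 20098, 1137        # sample values
rows = session.execute(
    "SELECT * FROM group_messages "
    "WHERE group_id = %s AND bucket = %s AND sequence_number > %s LIMIT 50",
    (group_id, bucket, last_read_seq),
)
```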

Group metadata lives in PostgreSQL: groups table (group_id, name, description, avatar_url, created_by, created_at, max_members, settings JSON) and group_members table (group_id, user_id, role ENUM(admin, member), joined_at). A composite index on (user_id, group_id) enables listing all groups a user belongs to. Read pointers are stored in Redis as a hash: read_pointers:{group_id} → {user_id: last_read_seq}, persisted to Cassandra asynchronously every 60 seconds.
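
A small sketch of the read-pointer path, assuming the hash layout above and a placeholder persist callback for the asynchronous Cassandra flush:

```python
# Read pointers: fast Redis writes on the hot path, periodic durable flush (sketch).
import redis

r = redis.Redis(host="localhost", port=6379)

def advance_read_pointer(group_id: str, user_id: str, seq: int) -> None:
    # Hot-path write: sub-millisecond Redis HSET, no synchronous persistence.
    r.hset(f"read_pointers:{group_id}", user_id, seq)

def flush_read_pointers(group_id: str, persist) -> None:
    # Run roughly every 60 seconds per active group; `persist` stands in for the
    # asynchronous Cassandra write, so a Redis failure loses at most one interval.
    pointers = r.hgetall(f"read_pointers:{group_id}")
    persist(group_id, {k.decode(): int(v) for k, v in pointers.items()})
```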

API Design

  • POST /api/groups/{group_id}/messages — Send message: {content, message_type, media_url?, reply_to_seq?, mentions?}; returns {sequence_number, timestamp}
  • GET /api/groups/{group_id}/messages?after_seq={n}&limit=50 — Pull messages after a sequence number for catch-up
  • POST /api/groups — Create group: {name, member_ids[], settings?}; returns group_id
  • PUT /api/groups/{group_id}/members/{user_id} — Add member (admin only); triggers initial history sync
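
An illustrative client call against the send-message endpoint above; the base URL, auth header, token, and ID values are placeholders:

```python
import requests

resp = requests.post(
    "https://chat.example.com/api/groups/12345/messages",
    headers={"Authorization": "Bearer <token>"},   # auth scheme is an assumption
    json={
        "content": "Standup in 5 minutes",
        "message_type": "text",
        "mentions": [67890],                       # user_id of an @-mentioned member
    },
)
resp.raise_for_status()
print(resp.json())   # e.g. {"sequence_number": 1138, "timestamp": "..."}
```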

Scaling & Bottlenecks

The biggest challenge is fan-out for large groups. A 100K-member group with 1000 messages per minute requires 100 million fan-out events per minute from a single group. The tiered pub/sub approach (publishing to group-level channels instead of per-user channels) reduces this to the number of distinct Gateway nodes serving group members — typically 50-100 nodes. Each Gateway node then locally distributes to its connected group members, keeping the centralized fan-out manageable.

The Group Sequencer can become a bottleneck for very active groups. A single Redis INCR handles ~100K operations/sec, sufficient for most groups. For extremely active groups (chat rooms with 10K+ messages/minute), the sequencer can batch sequence number allocation — pre-allocating ranges of 100 sequence numbers at a time to reduce Redis round-trips. Message storage in Cassandra scales with the time-bucketed partition model; the risk is hot partitions for very active groups in a short time window, mitigated by using smaller bucket granularity (hourly instead of daily) for high-traffic groups.
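
A sketch of the batched allocation described above, reusing the same Redis counter and the 100-number block size from the text. Gaps left by a crashed process that did not use its whole block are tolerable, since readers only need a monotonic order, not dense numbering:

```python
# Batched sequence allocation: one Redis round-trip reserves a block of numbers.
import redis

r = redis.Redis(host="localhost", port=6379)
BLOCK_SIZE = 100

class RangeSequencer:
    """Hands out sequence numbers for one group from a locally held range."""

    def __init__(self, group_id: str):
        self.group_id = group_id
        self.next_seq = 0
        self.range_end = 0   # exclusive upper bound of the reserved range

    def next(self) -> int:
        if self.next_seq >= self.range_end:
            # INCRBY atomically reserves [end - BLOCK_SIZE + 1, end] for this process.
            end = r.incrby(f"seq:{self.group_id}", BLOCK_SIZE)
            self.next_seq = end - BLOCK_SIZE + 1
            self.range_end = end + 1
        seq = self.next_seq
        self.next_seq += 1
        return seq
```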

Key Trade-offs

  • Shared log vs. per-recipient storage: The shared log reduces write amplification from O(N) to O(1) per message, enabling 100K-member groups, but requires each client to maintain and sync its own read pointer
  • Push for online, pull for offline: This hybrid avoids the cost of storing and delivering queued messages for offline users, but means offline users experience a burst of messages on reconnect that must be fetched synchronously
  • Tiered pub/sub for large groups: Publishing to group-level channels instead of per-user channels reduces centralized fan-out, but introduces a level of indirection that adds ~5ms latency and requires Gateway nodes to maintain group subscription state
  • Redis for read pointers over persistent storage: Redis provides sub-millisecond reads for checking a user's last read position, but read pointers could be lost on Redis failure — periodic Cassandra flush provides durability at the cost of potential re-delivery of a few messages
