System Design: Message Reactions
System design of a message reactions feature covering reaction aggregation, real-time fan-out, custom emoji support, and handling reaction storms on viral messages.
Requirements
Functional Requirements:
- Users can add/remove emoji reactions to any message in a conversation
- Display reaction counts grouped by emoji on each message
- Show who reacted with each emoji (on tap/hover)
- Support both standard Unicode emoji and custom workspace emoji
- Users can react with multiple different emoji on the same message
- Real-time reaction updates visible to all conversation participants
Non-Functional Requirements:
- Handle 500 million reactions per day
- Reaction count updates propagate to all viewers within 1 second
- Reaction add/remove operations complete in under 100ms
- Reaction data should add no more than 20% overhead to message storage
- Handle reaction storms: 10K+ reactions per second on a single viral message
Scale Estimation
With 500 million reactions per day, the system processes approximately 5,800 reactions per second sustained, with peaks of 50K/sec during viral moments. Each reaction record is compact: ~32 bytes (message_id + user_id + emoji_code + timestamp). Total storage: ~16GB/day for reaction records. However, the fan-out is the real cost: each reaction must be pushed to all participants in the conversation. For an average conversation of 5 participants, that is 2.5 billion fan-out events per day (~29K/sec). For a popular message in a 10K-member channel, a single reaction storm of 10K reactions requires 100 million fan-out events.
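These figures can be reproduced with a few lines of back-of-envelope arithmetic (all inputs are the numbers stated above):

```python
# Back-of-envelope numbers for the reactions system, using the figures above.
REACTIONS_PER_DAY = 500_000_000
SECONDS_PER_DAY = 86_400
RECORD_BYTES = 32          # message_id + user_id + emoji_code + timestamp
AVG_PARTICIPANTS = 5

sustained_rps = REACTIONS_PER_DAY / SECONDS_PER_DAY           # ~5,800/sec
storage_per_day_gb = REACTIONS_PER_DAY * RECORD_BYTES / 1e9   # ~16 GB/day
fanout_per_day = REACTIONS_PER_DAY * AVG_PARTICIPANTS         # 2.5B events
fanout_rps = fanout_per_day / SECONDS_PER_DAY                 # ~29K/sec

# A 10K-reaction storm in a 10K-member channel:
storm_fanout = 10_000 * 10_000                                # 100M fan-out events
```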
High-Level Architecture
The reactions system is a thin layer on top of the messaging infrastructure, comprising three components: a Reaction Service (write path), a Reaction Store (persistence), and a Fan-out Pipeline (real-time updates). When a user taps a reaction emoji, the client sends a toggle request to the Reaction Service. The service checks if the user already has this reaction on this message (idempotent add) or if they are removing it (toggle off). It updates the Reaction Store and emits an event to the fan-out pipeline.
The fan-out pipeline pushes reaction updates to all online conversation participants via their WebSocket connections. To avoid overwhelming clients during reaction storms, the pipeline implements batched updates: instead of pushing every individual reaction event, it aggregates reactions on the same message within a 500ms window and pushes a single update containing the current reaction counts. The client receives {message_id, reactions: {'👍': 142, '❤️': 89, '😂': 45}} rather than 276 individual events.
Custom emoji support requires a separate Emoji Registry that maps custom emoji codes to image URLs. When a client encounters a custom emoji code in a reaction, it fetches the image from a CDN-backed Emoji Service. Custom emoji are scoped to a workspace/server, so the registry is partitioned by workspace_id. Popular custom emoji images are aggressively cached at the CDN edge with immutable cache headers.
Core Components
Reaction Service
The Reaction Service handles add/remove operations. On an add request: (1) check if the user already has this reaction on this message (read from Redis set reactions:{message_id}:{emoji} → check membership of user_id); (2) if not, add to the set and increment the counter in reaction_counts:{message_id} hash; (3) write to Cassandra for persistence; (4) emit a Kafka event. On a remove request: reverse the operations. Both operations are idempotent — adding an existing reaction or removing a nonexistent one is a no-op. Rate limiting is applied per-user: max 10 reaction operations per second to prevent abuse.
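The add/remove flow can be sketched as follows. This is a minimal in-memory illustration, not the production path: the two dictionaries stand in for the Redis set reactions:{message_id}:{emoji} and the reaction_counts:{message_id} hash, and the Cassandra write and Kafka emit are left as comments.

```python
from collections import defaultdict

# In-memory stand-ins for the Redis structures named in the text; a real
# deployment would use SADD/SREM on reactions:{message_id}:{emoji} and
# HINCRBY on the reaction_counts:{message_id} hash.
reactors = defaultdict(set)                      # (message_id, emoji) -> {user_id}
counts = defaultdict(lambda: defaultdict(int))   # message_id -> {emoji: count}

def toggle_reaction(message_id, emoji, user_id):
    """Idempotent toggle: add if absent, remove if present; returns counts."""
    key = (message_id, emoji)
    if user_id in reactors[key]:
        reactors[key].remove(user_id)
        counts[message_id][emoji] -= 1
        if counts[message_id][emoji] == 0:
            del counts[message_id][emoji]
        # persist the delete to Cassandra, emit a 'reaction_removed' Kafka event
    else:
        reactors[key].add(user_id)
        counts[message_id][emoji] += 1
        # persist the insert to Cassandra, emit a 'reaction_added' Kafka event
    return dict(counts[message_id])
```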
Reaction Aggregator
The Aggregator processes the Kafka stream of reaction events and maintains materialized views of reaction counts. For each message, it maintains a Redis hash: reaction_counts:{message_id} → {emoji1: count, emoji2: count}. This hash serves the read path — when a client loads a message, the reaction counts are fetched from this hash in a single HGETALL command. The aggregator also handles the reactor list query: when a user taps on a reaction to see who reacted, the service reads from the Redis set reactions:{message_id}:{emoji} and returns the list of user_ids (paginated for popular reactions).
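The paginated reactor-list query might look like the sketch below. The cursor convention (the last user_id of the previous page) is an assumption, chosen because it maps naturally onto a sorted reactor list.

```python
import bisect

def reactor_page(user_ids, limit=20, cursor=None):
    """Cursor-based page over a reactor list. The cursor is the last user_id
    of the previous page (an assumption; in production it would map onto the
    storage layer's sort order)."""
    ids = sorted(user_ids)
    start = bisect.bisect_right(ids, cursor) if cursor is not None else 0
    page = ids[start:start + limit]
    next_cursor = page[-1] if len(page) == limit else None
    return page, next_cursor
```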
Fan-out Batcher
The Fan-out Batcher receives reaction events from Kafka and groups them by message_id within a 500ms tumbling window. At the end of each window, it emits a single aggregated update per message to the conversation's participants. The batcher looks up conversation membership from the cached membership service and publishes to each online participant's WebSocket Gateway. During reaction storms (detected when a single message accumulates >100 events in a window), the batcher switches to count-only mode: it pushes only the updated counts, not the individual reactor identities, reducing payload size by 95%.
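A minimal sketch of one batcher window, assuming events arrive as (message_id, emoji, user_id) tuples and that get_counts stands in for reading the reaction_counts:{message_id} hash:

```python
from collections import defaultdict

def batch_window(events, get_counts, storm_threshold=100):
    """Collapse one 500ms tumbling window of (message_id, emoji, user_id)
    events into a single update per message. get_counts(message_id) stands
    in for reading the current reaction_counts:{message_id} hash. Messages
    that exceed storm_threshold events in the window switch to count-only
    mode and omit reactor identities."""
    per_message = defaultdict(list)
    for message_id, emoji, user_id in events:
        per_message[message_id].append((emoji, user_id))

    updates = []
    for message_id, evs in per_message.items():
        update = {
            "type": "reaction_update",
            "message_id": message_id,
            "reactions": get_counts(message_id),  # current counts, not deltas
        }
        if len(evs) <= storm_threshold:  # normal mode: include who reacted
            update["reactors"] = evs
        updates.append(update)
    return updates
```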
Database Design
Reaction records are stored in Cassandra with partition key message_id and clustering key (emoji, user_id); the only regular column is created_at. This schema enables two query patterns: (1) get all reactions for a message: SELECT * FROM reactions WHERE message_id = ?, and (2) get all users who reacted with a specific emoji: SELECT user_id FROM reactions WHERE message_id = ? AND emoji = ?. Removals are standard Cassandra deletes, which necessarily write tombstones; to limit tombstone read amplification on heavily-reacted messages, the table uses aggressive compaction and a short gc_grace_seconds so tombstones are purged quickly.
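The table described above might be declared as follows (a sketch; the uuid and text column types are assumptions):

```sql
CREATE TABLE reactions (
    message_id  uuid,
    emoji       text,       -- Unicode emoji or custom emoji code
    user_id     uuid,
    created_at  timestamp,
    PRIMARY KEY ((message_id), emoji, user_id)
);
```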
The Redis layer maintains three data structures per reacted message: (1) a hash reaction_counts:{message_id} for O(1) count reads, (2) a set reactions:{message_id}:{emoji} for membership checks and reactor lists, and (3) a user-scoped set user_reactions:{user_id}:{conversation_id} tracking which messages the user has reacted to (for rendering the 'you reacted' indicator on the client). Redis TTLs match message retention: if messages expire after 90 days, reaction keys expire at the same time.
API Design
- PUT /api/messages/{message_id}/reactions/{emoji} — Toggle reaction: adds if not present, removes if already reacted; idempotent; returns updated counts
- GET /api/messages/{message_id}/reactions — Fetch all reaction counts: returns {reactions: [{'emoji': '👍', count: 142}, {'emoji': '❤️', count: 89}]}
- GET /api/messages/{message_id}/reactions/{emoji}/users?limit=20&cursor={id} — Fetch a paginated list of users who reacted with a specific emoji
- WebSocket EVENT {type: 'reaction_update', message_id, reactions: {'👍': 143, '❤️': 89}} — Batched real-time reaction count update
Scaling & Bottlenecks
Reaction storms on viral messages create a hot key problem in Redis. When 10K users react to the same message per second, the Redis key reaction_counts:{message_id} receives 10K increments/sec — within a single Redis node's capacity but approaching limits. For extreme cases (100K reactions/sec on a single message), a local counter aggregation approach is used: each Reaction Service instance maintains a local counter and flushes to Redis every 100ms, reducing Redis writes from 100K/sec to ~100/sec (one per service instance per flush). The counters converge within 200ms.
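The local counter aggregation can be sketched as a small buffer class. Here flush_fn stands in for the batched Redis write (e.g. a pipelined HINCRBY), and the 100ms flush timer is left to the caller:

```python
import threading
from collections import defaultdict

class LocalCounterBuffer:
    """Buffers reaction-count increments locally and flushes the aggregate
    periodically (every 100ms in the text), so a hot message's Redis key
    sees one batched write per service instance per flush instead of one
    write per reaction. flush_fn stands in for the Redis pipeline call."""

    def __init__(self, flush_fn):
        self._flush_fn = flush_fn
        self._lock = threading.Lock()
        self._pending = defaultdict(lambda: defaultdict(int))

    def incr(self, message_id, emoji, delta=1):
        with self._lock:
            self._pending[message_id][emoji] += delta

    def flush(self):
        # Swap the buffer under the lock, then write outside the lock.
        with self._lock:
            pending = self._pending
            self._pending = defaultdict(lambda: defaultdict(int))
        for message_id, deltas in pending.items():
            self._flush_fn(message_id, dict(deltas))  # one batched write per message
```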
The reactor list (who reacted) for popular messages can grow to millions of entries. Storing all reactor user_ids in a single Redis set is impractical for viral content. The solution is a two-tier approach: Redis stores the first 1,000 reactors per emoji (for quick lookups), and the full list is served from Cassandra with cursor-based pagination. The UI typically shows '142 people reacted' with the first few avatars — the Redis set covers this use case without hitting Cassandra.
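The two-tier read can be sketched as a simple dispatch, where redis_list is the cached 1,000-reactor prefix and cassandra_read is a stand-in for the paginated Cassandra query:

```python
REDIS_TIER_SIZE = 1_000  # first N reactors per emoji are cached in Redis

def read_reactors(offset, limit, redis_list, cassandra_read):
    """Serve shallow reactor pages from the Redis tier; fall through to
    Cassandra only when paging past the cached prefix. cassandra_read(offset,
    limit) stands in for the paginated Cassandra query."""
    if offset + limit <= min(len(redis_list), REDIS_TIER_SIZE):
        return redis_list[offset:offset + limit]   # served entirely from Redis
    return cassandra_read(offset, limit)           # deep page: hit Cassandra
```

Since the UI preview only needs the first few reactors, nearly all reads take the Redis branch.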
Key Trade-offs
- Batched fan-out over individual events: 500ms batching reduces fan-out events by 10-100x during reaction storms, but means reactions appear to update in bursts rather than individually — acceptable for a non-critical feature
- Redis + Cassandra dual storage: Redis provides sub-millisecond reads for counts and membership checks, Cassandra provides durability; the trade-off is dual-write complexity and potential brief inconsistency
- Toggle API over separate add/remove endpoints: A single toggle endpoint simplifies the client (one tap = one API call regardless of current state), but requires the server to read current state before writing — the Redis set membership check adds ~1ms
- Count-only mode during storms: Switching to count-only pushes during high-volume reactions reduces bandwidth by 95%, but means clients don't see real-time reactor identities during viral moments — they can fetch on demand