System Design: Online Presence Indicator

System design for an online presence indicator, covering heartbeat-based detection, pub/sub fan-out to contacts, presence at scale across millions of concurrent users, and consistency trade-offs.

Requirements

Functional Requirements:

  • Show real-time online/offline/away status for each user
  • Notify a user's contacts when their status changes
  • Support custom status messages (e.g., 'In a meeting')
  • Show a 'last seen' timestamp for offline users
  • Reflect the most recent device activity for multi-device users
  • Mark users as away after N minutes of inactivity, configurable per user

Non-Functional Requirements:

  • Support 50 million concurrent online users
  • Status change propagation to contacts within 10 seconds
  • Minimal false transitions (avoid flickering between online/offline)
  • Low bandwidth overhead: presence should consume <1% of total system bandwidth
  • Graceful degradation: presence inaccuracy is acceptable under high load

Scale Estimation

50 million concurrent users with a heartbeat interval of 30 seconds produce 1.67 million heartbeats per second. Each heartbeat is ~64 bytes (user_id + timestamp + device_id), yielding ~107MB/sec of heartbeat traffic. Status change events are less frequent: if 10% of users change status each minute, that is 5M changes/min ≈ 83K changes/sec. Each status change must be fanned out to the user's contacts — with an average of 100 contacts per user, that is 8.3 million fan-out events per second. Storage for presence state is modest: 50M users × 64 bytes = ~3.2GB, which fits easily in a single Redis cluster.
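
These figures are simple arithmetic over the stated assumptions; the short Python sketch below (illustrative only) just reproduces them.

    # Back-of-the-envelope estimates; all inputs are the assumptions stated above.
    CONCURRENT_USERS = 50_000_000
    HEARTBEAT_INTERVAL_S = 30
    HEARTBEAT_SIZE_BYTES = 64            # user_id + timestamp + device_id
    STATUS_CHANGES_PER_MIN_RATIO = 0.10  # 10% of users change status per minute
    AVG_CONTACTS = 100

    heartbeats_per_sec = CONCURRENT_USERS / HEARTBEAT_INTERVAL_S                   # ~1.67M/s
    heartbeat_mb_per_sec = heartbeats_per_sec * HEARTBEAT_SIZE_BYTES / 1e6         # ~107 MB/s
    status_changes_per_sec = CONCURRENT_USERS * STATUS_CHANGES_PER_MIN_RATIO / 60  # ~83K/s
    fanout_events_per_sec = status_changes_per_sec * AVG_CONTACTS                  # ~8.3M/s
    presence_state_gb = CONCURRENT_USERS * HEARTBEAT_SIZE_BYTES / 1e9              # ~3.2 GB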

High-Level Architecture

Presence detection uses a heartbeat-based approach. Each connected client sends a heartbeat to the Presence Service every 30 seconds. The Presence Service maintains a Redis key per user: presence:{user_id} → {status, last_heartbeat, device_id, custom_status} with a TTL of 90 seconds (3 missed heartbeats = offline). When a heartbeat arrives, the key is updated and the TTL is reset. When the TTL expires (no heartbeat for 90 seconds), Redis keyspace notifications trigger an offline event.
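
A minimal sketch of this flow, assuming a redis-py client; the key name and TTL come from the description above, while emit_offline_event is a hypothetical hook that would publish to the presence-changes topic. Expired-key notifications must be enabled on the Redis server (notify-keyspace-events including 'Ex').

    import time
    import redis  # assumed client library (redis-py); any Redis client works

    r = redis.Redis()
    PRESENCE_TTL_S = 90  # 3 missed 30-second heartbeats => offline

    def on_heartbeat(user_id: str, device_id: str) -> None:
        """Refresh the presence record and reset its TTL on every heartbeat."""
        key = f"presence:{user_id}"
        r.hset(key, mapping={
            "status": "online",
            "last_heartbeat": int(time.time()),
            "device_id": device_id,
        })
        r.expire(key, PRESENCE_TTL_S)

    def watch_expirations() -> None:
        """Offline detection: react to expired presence keys via keyspace notifications."""
        r.config_set("notify-keyspace-events", "Ex")  # enable expired-key events
        pubsub = r.pubsub()
        pubsub.psubscribe("__keyevent@*__:expired")
        for msg in pubsub.listen():
            if msg["type"] == "pmessage" and msg["data"].startswith(b"presence:"):
                user_id = msg["data"].decode().split(":", 1)[1]
                emit_offline_event(user_id)  # hypothetical: publish to presence-changes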

The fan-out challenge — notifying contacts of status changes — is handled by a Presence Fan-out Service. When a user's status changes (online → offline, offline → online, or custom status update), the fan-out service looks up the user's contact list from the Social Graph Service, filters for contacts who are currently online (no point pushing to offline users), and publishes the status change to each online contact's WebSocket Gateway. This filtering step reduces fan-out by ~70% since typically only 30% of contacts are online at any time.

To avoid the thundering herd problem (e.g., a server restart causing thousands of users to go offline simultaneously, triggering millions of fan-out events), the system uses debounced transitions: a user must be offline for 30 seconds before an offline notification is sent, and must be online for 5 seconds before an online notification is sent. This absorbs transient disconnections (network blips, server restarts) without producing false status changes.
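
A sketch of the debounce, using the 30-second and 5-second windows above; the timer-based scheduling and the publish callback are assumptions for illustration.

    import threading
    import time

    OFFLINE_DEBOUNCE_S = 30  # per the design above
    ONLINE_DEBOUNCE_S = 5

    class PresenceDebouncer:
        """Suppress transient flaps: broadcast a transition only if the user is
        still in the new state when the debounce window elapses."""

        def __init__(self, publish):
            self._publish = publish        # callable(user_id, status), hypothetical
            self._pending = {}             # user_id -> (status, observed_at)
            self._last_published = {}      # user_id -> last broadcast status
            self._lock = threading.Lock()

        def observe(self, user_id, new_status):
            delay = OFFLINE_DEBOUNCE_S if new_status == "offline" else ONLINE_DEBOUNCE_S
            with self._lock:
                self._pending[user_id] = (new_status, time.time())
            threading.Timer(delay, self._flush, args=(user_id, new_status)).start()

        def _flush(self, user_id, scheduled_status):
            with self._lock:
                current = self._pending.get(user_id)
                # Skip if a newer transition superseded this one within the window.
                if not current or current[0] != scheduled_status:
                    return
                del self._pending[user_id]
                # Skip if the user never visibly left this state (e.g. a quick blip).
                if self._last_published.get(user_id) == scheduled_status:
                    return
                self._last_published[user_id] = scheduled_status
            self._publish(user_id, scheduled_status)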

Core Components

Heartbeat Processor

The Heartbeat Processor receives heartbeats from WebSocket Gateways (batched: each Gateway sends a single RPC containing heartbeats for all its connected users every 5 seconds). The processor updates Redis in pipelined batches of HSET + EXPIRE commands. For multi-device users, the processor merges device states: if any device is active, the user is online; if all devices are idle, the user is away; if all devices have timed out, the user is offline. The processor maintains an in-memory bloom filter of recently-changed users to detect status transitions without reading from Redis on every heartbeat.
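
A sketch of the batched update path, under the assumption that the processor keeps a per-user device map in memory so the merge needs no Redis read; field names follow the Database Design section below.

    import json
    import time
    import redis  # assumed client library

    r = redis.Redis()
    PRESENCE_TTL_S = 90
    device_map = {}  # user_id -> {device_id: {"activity": ..., "last_active": ...}}
                     # kept in processor memory (an assumption for this sketch)

    def merge_status(devices, now):
        """Any active device => online; only idle devices => away; none alive => offline."""
        live = [d for d in devices.values() if now - d["last_active"] <= PRESENCE_TTL_S]
        if not live:
            return "offline"
        return "online" if any(d["activity"] == "active" for d in live) else "away"

    def process_batch(heartbeats):
        """heartbeats: [{user_id, device_id, activity}] -- one batched RPC from a Gateway."""
        now = time.time()
        pipe = r.pipeline()
        for hb in heartbeats:
            devices = device_map.setdefault(hb["user_id"], {})
            devices[hb["device_id"]] = {"activity": hb["activity"], "last_active": now}
            key = f"presence:{hb['user_id']}"
            pipe.hset(key, mapping={
                "status": merge_status(devices, now),
                "last_heartbeat": int(now),
                "device_states": json.dumps(devices),
            })
            pipe.expire(key, PRESENCE_TTL_S)
        pipe.execute()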

Presence Fan-out Service

The Fan-out Service subscribes to status change events (emitted by the Heartbeat Processor to a Kafka topic presence-changes). For each event, it: (1) fetches the user's contact list from a cached Social Graph (Redis set: contacts:{user_id}), (2) filters for online contacts by batch-checking presence:{contact_id} keys in Redis (using MGET), (3) publishes the status update to each online contact's WebSocket Gateway via the routing layer. For users with very large contact lists (>5000), the fan-out is rate-limited and spread over 10 seconds to avoid Gateway overload.
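
A sketch of steps (1)-(3) for one event; route_to_gateway is a hypothetical routing-layer call, and because the presence record is a hash (see Database Design), the online check here is a pipelined HGET of the status field rather than a literal MGET.

    import time
    import redis  # assumed client library

    r = redis.Redis(decode_responses=True)
    LARGE_CONTACT_LIST = 5000
    FANOUT_SPREAD_S = 10

    def fan_out(event):
        """Handle one presence-changes event: {"user_id", "status", "custom_status"}."""
        user_id = event["user_id"]
        # (1) Contact list from the cached social graph.
        contacts = list(r.smembers(f"contacts:{user_id}"))
        if not contacts:
            return
        # (2) Filter to contacts who are currently online (one pipelined round trip).
        pipe = r.pipeline()
        for c in contacts:
            pipe.hget(f"presence:{c}", "status")
        statuses = pipe.execute()
        online = [c for c, s in zip(contacts, statuses) if s == "online"]
        # (3) Push to each online contact's Gateway; very large lists are spread
        #     over ~10 seconds to avoid overloading any single Gateway node.
        delay = FANOUT_SPREAD_S / len(online) if len(contacts) > LARGE_CONTACT_LIST and online else 0
        for contact_id in online:
            route_to_gateway(contact_id, event)  # hypothetical routing-layer call
            if delay:
                time.sleep(delay)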

Presence Query Service

The Query Service handles on-demand presence lookups, such as when a user opens a contact list or a chat window. It batch-reads presence keys from Redis: MGET presence:{id1} presence:{id2} ... and returns the current status for each contact. For the initial load of a contact list (which may have hundreds of entries), the service reads all presence keys in a single Redis pipeline call, achieving sub-5ms latency for 200 contacts. The results include: status (online/away/offline), last_seen timestamp, and custom status message.
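
A sketch of the batch lookup; since the presence record is a hash, the single round trip is a pipelined HGETALL (the hash equivalent of the MGET described above).

    import redis  # assumed client library

    r = redis.Redis(decode_responses=True)

    def get_presence(user_ids):
        """Batch presence lookup for a contact list in one pipelined round trip."""
        pipe = r.pipeline()
        for uid in user_ids:
            pipe.hgetall(f"presence:{uid}")
        results = []
        for uid, fields in zip(user_ids, pipe.execute()):
            if fields:
                results.append({"user_id": uid,
                                "status": fields.get("status"),
                                "custom_status": fields.get("custom_status")})
            else:
                # Expired key => offline; last_seen is served from the persistent store.
                results.append({"user_id": uid, "status": "offline", "last_seen": None})
        return results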

Database Design

Presence state is ephemeral and stored entirely in Redis. The primary data structure is a Redis hash per user: presence:{user_id} with fields status (online/away/offline), last_heartbeat (Unix timestamp), device_states (JSON map of device_id → last_active), custom_status (string), with a TTL of 90 seconds that is reset on every heartbeat. A Redis Cluster with 10 shards (keys sharded by hashing user_id) handles the 50M concurrent user load, with each shard holding ~5M keys consuming ~320MB.

The last_seen timestamp is the only piece of presence data that requires persistent storage. When a user goes offline, the Heartbeat Processor writes last_seen:{user_id} → timestamp to a Cassandra table (partition key user_id, column last_seen_at). This is a simple key-value write, and since it only occurs on status transitions (not on every heartbeat), the write rate is manageable: 83K writes/sec for status changes. A Redis cache with a 5-minute TTL serves frequent last_seen lookups without hitting Cassandra.
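
A sketch of the transition-time write and the cached read path, assuming the DataStax Python driver and an illustrative keyspace/table (presence.last_seen, with last_seen_at stored as an epoch-seconds bigint).

    import time
    import redis                           # assumed client library
    from cassandra.cluster import Cluster  # assumed driver; schema names are illustrative

    r = redis.Redis(decode_responses=True)
    session = Cluster().connect("presence")
    CACHE_TTL_S = 300  # 5-minute read cache in front of Cassandra

    def record_offline(user_id):
        """Called only on the online -> offline transition, not on every heartbeat."""
        now = int(time.time())
        session.execute(
            "INSERT INTO last_seen (user_id, last_seen_at) VALUES (%s, %s)",
            (user_id, now))
        r.setex(f"last_seen:{user_id}", CACHE_TTL_S, now)

    def get_last_seen(user_id):
        cached = r.get(f"last_seen:{user_id}")
        if cached is not None:
            return int(cached)
        row = session.execute(
            "SELECT last_seen_at FROM last_seen WHERE user_id = %s", (user_id,)).one()
        if row:
            r.setex(f"last_seen:{user_id}", CACHE_TTL_S, row.last_seen_at)
            return row.last_seen_at
        return None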

API Design

  • WebSocket HEARTBEAT {device_id, activity_type: active|idle} — Sent by client every 30 seconds; processed in batches by the Gateway
  • GET /api/v1/presence?user_ids=id1,id2,id3 — Batch query presence for multiple users; returns [{user_id, status, last_seen, custom_status}] (see the example call after this list)
  • PUT /api/v1/presence/status — Set custom status: {text: 'In a meeting', emoji: '📅', expires_at?}
  • WebSocket EVENT {type: 'presence_change', user_id, status, custom_status} — Server-pushed presence update to subscribed contacts
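
For illustration, a client call against the batch presence endpoint (the host, user IDs, and response values below are made up).

    import requests  # illustrative HTTP client; auth is omitted

    resp = requests.get("https://api.example.com/api/v1/presence",
                        params={"user_ids": "u123,u456"})
    print(resp.json())
    # Expected JSON shape (values illustrative):
    # [{"user_id": "u123", "status": "online",  "last_seen": null, "custom_status": "In a meeting"},
    #  {"user_id": "u456", "status": "offline", "last_seen": 1736899200, "custom_status": null}]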

Scaling & Bottlenecks

The primary bottleneck is the fan-out for status changes. At 83K status changes per second with 100 contacts each (filtered to 30 online), the fan-out produces 2.5 million events per second to Gateway nodes. The solution is aggregation at the Gateway level: instead of sending individual presence events, the Gateway node receives a batch of presence updates every second and pushes them to connected clients in a single WebSocket frame. This reduces the number of system calls and network packets by 10-100x.
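
A sketch of that per-client aggregation on the Gateway; send_frame is an assumed hook into the WebSocket layer, and the batched event type name is illustrative.

    import asyncio
    import json
    from collections import defaultdict

    FLUSH_INTERVAL_S = 1.0  # one coalesced frame per client per second

    class GatewayPresenceBatcher:
        """Buffer presence updates per connected client and flush them as a single
        WebSocket frame instead of one frame per event."""

        def __init__(self):
            self._buffers = defaultdict(dict)  # client_id -> {user_id: latest update}

        def enqueue(self, client_id, update):
            # Later updates for the same user overwrite earlier ones within the window.
            self._buffers[client_id][update["user_id"]] = update

        async def run(self, send_frame):
            # send_frame(client_id, payload) is an assumed hook into the WebSocket layer.
            while True:
                await asyncio.sleep(FLUSH_INTERVAL_S)
                buffers, self._buffers = self._buffers, defaultdict(dict)
                for client_id, updates in buffers.items():
                    payload = json.dumps({"type": "presence_change_batch",
                                          "updates": list(updates.values())})
                    await send_frame(client_id, payload)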

Redis is the single point of contention for all presence reads and writes. With 1.67M heartbeat updates/sec and 2.5M presence reads/sec (fan-out filtering), the total Redis load is ~4M operations/sec. A 10-shard Redis Cluster can absorb this, though the headroom is thin: each shard sees ~400K ops/sec against a practical per-instance ceiling of roughly 500K ops/sec. For additional headroom, the Heartbeat Processor uses a local in-memory cache of presence state with a 5-second staleness window, reducing Redis reads by 80% — most heartbeats don't change status, so the cached state is sufficient.
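
A sketch of the processor-side cache under the 5-second staleness rule; the wrapper and its names are assumptions for illustration.

    import time

    STALENESS_WINDOW_S = 5  # per the design above

    class LocalPresenceCache:
        """Skip the Redis read when the cached status is fresh; most heartbeats do
        not change status, so this absorbs the bulk of read traffic."""

        def __init__(self, redis_read):
            self._redis_read = redis_read  # callable(user_id) -> status, hits Redis
            self._cache = {}               # user_id -> (status, fetched_at)

        def get_status(self, user_id):
            entry = self._cache.get(user_id)
            if entry and time.time() - entry[1] < STALENESS_WINDOW_S:
                return entry[0]            # fresh enough; no Redis round trip
            status = self._redis_read(user_id)
            self._cache[user_id] = (status, time.time())
            return status

        def invalidate(self, user_id):
            # Called when the processor itself writes a status transition.
            self._cache.pop(user_id, None)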

Key Trade-offs

  • Heartbeat-based over connection-based detection: Heartbeats detect true user activity (not just open connections), enabling meaningful away/idle states, but consume bandwidth and processing — the 30-second interval balances accuracy vs. overhead
  • Debounced transitions over immediate transitions: Waiting 30 seconds before broadcasting offline status eliminates false transitions from network blips, but means contacts see a stale 'online' status for up to 30 seconds after the user actually disconnects
  • Fan-out only to online contacts: Filtering fan-out to only online contacts reduces events by 70%, but means a user who comes online sees stale presence data until their contacts' next status change — mitigated by the on-demand Presence Query on app launch
  • Pub/sub fan-out over polling: Server-pushed presence updates provide real-time status changes, but require maintaining subscription state per user and handling fan-out storms for popular users — polling would be simpler but add 15-30 second latency
