System Design: Live Streaming Platform
System design of a generic live streaming platform covering RTMP ingestion, real-time transcoding, HLS packaging, viewer synchronization, and interactive features at scale.
Requirements
Functional Requirements:
- Broadcasters go live using RTMP/SRT from desktop (OBS) or mobile apps
- Viewers watch live streams with quality selection and adaptive bitrate
- Real-time interaction via chat, reactions, and polls
- Stream recording for on-demand replay (DVR functionality)
- Multi-host co-streaming with audio/video mixing
- Stream scheduling, notifications to followers, and stream metadata (title, category)
Non-Functional Requirements:
- Support 500K concurrent streams and 50 million concurrent viewers
- Glass-to-glass latency under 5 seconds for standard mode, under 1.5 seconds for ultra-low-latency
- 99.95% stream uptime; automatic ingest failover within 3 seconds
- Horizontal scaling: adding capacity should be linear with demand
- Support streams up to 24 hours continuously
Scale Estimation
500K concurrent streams at an average 5 Mbps ingest = 2.5 Tbps ingest bandwidth. Each stream is transcoded to 4 quality variants, roughly 10 Tbps of transcode output if every rendition averaged the ingest bitrate (an upper bound, since the lower renditions cost less). 50M concurrent viewers at an average 3 Mbps = 150 Tbps CDN egress. Chat: assume 10% of viewers send 1 message/min → 5M messages/min ≈ 83K messages/sec. DVR storage: 500K streams × 5 Mbps × 3600 sec/hr = 9 Pb ≈ 1.1 PB of recorded source content per hour. Individual streams average around 8 hours, but daily volume is set by concurrency: with 500K streams live at any time, that is roughly 27 PB/day of DVR recordings.
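A quick back-of-the-envelope check of these figures (a sketch in Python; all bitrates and ratios are the assumptions stated above):

```python
# Back-of-the-envelope check of the scale estimates above.
# All bitrates are the assumed averages from the text.
CONCURRENT_STREAMS = 500_000
INGEST_MBPS = 5                 # average broadcaster bitrate
RENDITIONS = 4
VIEWERS = 50_000_000
VIEWER_MBPS = 3                 # average delivered bitrate

ingest_tbps = CONCURRENT_STREAMS * INGEST_MBPS / 1e6
transcode_tbps = ingest_tbps * RENDITIONS          # upper bound: every rendition at ingest bitrate
egress_tbps = VIEWERS * VIEWER_MBPS / 1e6

chat_msgs_per_sec = VIEWERS * 0.10 / 60            # 10% of viewers, 1 msg/min each

dvr_pb_per_hour = CONCURRENT_STREAMS * INGEST_MBPS * 3600 / 8 / 1e9   # Mbit -> PB
dvr_pb_per_day = dvr_pb_per_hour * 24

print(f"ingest {ingest_tbps:.1f} Tbps, transcode <= {transcode_tbps:.0f} Tbps, "
      f"egress {egress_tbps:.0f} Tbps")
print(f"chat {chat_msgs_per_sec:,.0f} msg/s, "
      f"DVR {dvr_pb_per_hour:.2f} PB/h ({dvr_pb_per_day:.0f} PB/day)")
```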
High-Level Architecture
The platform architecture consists of four planes: Ingest, Transcode, Delivery, and Interaction. The Ingest Plane deploys edge ingest servers at 50+ PoPs. Each ingest server runs an RTMP/SRT listener that authenticates the broadcaster's stream key, validates the incoming bitrate/resolution, and establishes a persistent connection. The raw stream is forwarded to the nearest Transcode Cluster via an internal low-latency transport protocol (typically RIST or SRT over a dedicated backbone).
The Transcode Plane runs GPU-accelerated encoders (NVENC or Intel QSV) that produce HLS output in real-time. Each transcoder takes the raw stream and outputs 4 renditions (e.g., 1080p60, 720p30, 480p30, 360p30) as 2-second HLS segments with CMAF packaging. Segments are immediately pushed to the Delivery Plane — a CDN with edge PoPs that serve HLS manifests and segments to viewers. The manifest is a live-updating m3u8 playlist that the viewer's player polls every segment duration.
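To make the polling loop concrete, here is a minimal sketch of the viewer side, assuming 2-second segments and a hypothetical media-playlist URL; a real player (hls.js, AVPlayer) also handles ABR, buffering, and LL-HLS partial segments:

```python
import time
import requests

MANIFEST_URL = "https://cdn.example.com/live/stream123/720p30/media.m3u8"  # hypothetical URL
BASE_URL = MANIFEST_URL.rsplit("/", 1)[0]
SEGMENT_SECONDS = 2  # matches the 2-second segments described above

seen = set()
while True:
    playlist = requests.get(MANIFEST_URL, timeout=2).text
    # A live media playlist lists only the most recent segments; new entries
    # appear as the transcoder publishes them and old ones roll off.
    for line in playlist.splitlines():
        if line and not line.startswith("#") and line not in seen:
            seen.add(line)
            segment = requests.get(f"{BASE_URL}/{line}", timeout=2)
            with open(line, "wb") as f:          # a real player feeds these bytes to the decoder
                f.write(segment.content)
    time.sleep(SEGMENT_SECONDS)                  # poll roughly once per segment duration
```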
The Interaction Plane handles chat, reactions, polls, and viewer count. A WebSocket Gateway maintains persistent connections with all viewers. Messages flow through a Chat Service that applies moderation, then publishes to a partitioned message bus (Kafka or NATS) for fan-out to all connected viewers of that stream. A separate Analytics Pipeline processes viewership events in real-time for the streamer's dashboard.
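A rough sketch of the publish side of that chat path, assuming Kafka with the kafka-python client; the topic name and message fields are illustrative:

```python
import json
from kafka import KafkaProducer   # kafka-python client; NATS would look similar

producer = KafkaProducer(
    bootstrap_servers="kafka:9092",
    value_serializer=lambda v: json.dumps(v).encode(),
)

def publish_chat(stream_id: str, user_id: str, text: str) -> None:
    """Called by the Chat Service after the moderation pipeline approves a message."""
    message = {"stream_id": stream_id, "user_id": user_id, "text": text}
    # Keying by stream_id keeps a stream's messages on one partition, so every
    # WebSocket Gateway node consuming that partition can fan the message out,
    # in order, to its locally connected viewers of that stream.
    producer.send("chat-messages", key=stream_id.encode(), value=message)
```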
Core Components
Ingest Server Cluster
Ingest servers are the first point of contact for broadcasters. Each server handles up to 1,000 concurrent RTMP connections using a multi-threaded async I/O architecture (built on libuv or Tokio). On connection, the server validates the stream key against the Auth Service (cached locally with 5-minute TTL), performs codec negotiation (H.264 required, H.265 optional), and begins forwarding packets. If the ingest server detects packet loss exceeding 1% or the broadcaster disconnects, an automatic failover redirects the stream to a backup ingest server. The broadcaster's encoder (OBS) is configured with primary and backup ingest URLs for client-side failover.
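A minimal sketch of the stream-key check with the local 5-minute cache; the Auth Service client and its call are assumptions, not a real API:

```python
import time

AUTH_CACHE_TTL = 300  # seconds; the 5-minute local cache described above
_cache: dict[str, tuple[bool, float]] = {}   # stream_key -> (is_valid, cached_at)

def validate_stream_key(stream_key: str) -> bool:
    """Called when a broadcaster connects, before any packets are forwarded."""
    entry = _cache.get(stream_key)
    if entry and time.monotonic() - entry[1] < AUTH_CACHE_TTL:
        return entry[0]
    # Cache miss: ask the Auth Service (auth_service is a hypothetical client).
    is_valid = auth_service.check_stream_key(stream_key)
    _cache[stream_key] = (is_valid, time.monotonic())
    return is_valid
```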
Real-Time Transcoder
Transcoders run as Kubernetes pods with GPU resource requests. Each pod runs a custom transcoding daemon wrapping FFmpeg with hardware acceleration. The daemon receives raw H.264/H.265 NAL units over SRT, decodes to raw frames, and re-encodes to multiple bitrates simultaneously. The output is CMAF (Common Media Application Format) segments — fragmented MP4 with byte-range addressing — enabling low-latency HLS (LL-HLS) with partial segment delivery. The transcoder publishes completed segments to an internal segment store (Redis for metadata, S3 for segment data) and notifies the Delivery Plane.
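As a sketch of what such a daemon might hand to FFmpeg, here is a two-rendition version of the invocation; the SRT port, bitrates, and exact LL-HLS settings are assumptions:

```python
import subprocess

# Two-rendition sketch of the FFmpeg invocation the transcoding daemon wraps.
# Input arrives over SRT; output is fragmented-MP4 (CMAF-style) HLS with
# 2-second segments. The real ladder and LL-HLS partial-segment settings are omitted.
cmd = [
    "ffmpeg",
    "-i", "srt://0.0.0.0:9000?mode=listener",          # contribution feed (assumed port)
    "-filter_complex",
    "[0:v]split=2[v1][v2];[v1]scale=-2:720[v720];[v2]scale=-2:480[v480]",
    "-map", "[v720]", "-c:v:0", "h264_nvenc", "-b:v:0", "3000k",
    "-map", "0:a",    "-c:a:0", "aac",        "-b:a:0", "128k",
    "-map", "[v480]", "-c:v:1", "h264_nvenc", "-b:v:1", "1500k",
    "-map", "0:a",    "-c:a:1", "aac",        "-b:a:1", "96k",
    "-f", "hls",
    "-hls_time", "2",                                   # 2-second segments, as above
    "-hls_segment_type", "fmp4",                        # fragmented MP4 / CMAF packaging
    "-hls_flags", "independent_segments+delete_segments",
    "-hls_segment_filename", "media_%v_%05d.m4s",
    "-master_pl_name", "master.m3u8",
    "-var_stream_map", "v:0,a:0 v:1,a:1",
    "media_%v.m3u8",
]
subprocess.run(cmd, check=True)
```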
Interactive Features Engine
Beyond basic chat, the Interaction Plane supports polls, predictions, and reactions. Polls are created by the streamer via API, stored in DynamoDB, and vote tallies are maintained in Redis counters. Reactions (emoji overlays) use a sampling approach for large audiences: instead of sending every reaction to every viewer, the system samples reactions and sends aggregated counts (e.g., "500 heart reactions in the last second") to viewers, rendering a proportional animation client-side. This reduces message fan-out by 100x for popular streams.
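A sketch of the aggregation side of reaction sampling; broadcast() stands in for the WebSocket fan-out and is hypothetical:

```python
import asyncio
from collections import Counter

reaction_counts: Counter[str] = Counter()   # emoji -> count in the current window

def record_reaction(emoji: str) -> None:
    """Hot path: just bump an in-memory counter, no per-reaction fan-out."""
    reaction_counts[emoji] += 1

async def flush_reactions(stream_id: str) -> None:
    """Once per second, send one aggregated message instead of every reaction."""
    while True:
        await asyncio.sleep(1)
        if reaction_counts:
            snapshot = dict(reaction_counts)
            reaction_counts.clear()
            # e.g. {"heart": 500, "fire": 120} -> clients render a proportional animation
            await broadcast(stream_id, {"type": "reactions", "counts": snapshot})
```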
Database Design
Stream metadata (stream_id, broadcaster_id, title, category, started_at, status, ingest_server, thumbnail_url) is stored in PostgreSQL with a partial index on status='live' for efficient discovery queries. Viewer sessions (session_id, stream_id, user_id, joined_at, quality, device_type) are stored in Cassandra for high write throughput — each viewer join/leave generates a write. Chat messages for moderation audit are stored in Cassandra partitioned by stream_id and bucketed by 10-minute intervals.
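For illustration, the schema for the discovery path and the chat audit log might look like this (a sketch; names follow the fields listed above, and the exact index columns are an assumption):

```python
# Sketch of the PostgreSQL side of the schema described above. The partial
# index on status = 'live' keeps discovery queries from scanning the much
# larger set of ended streams.
STREAMS_DDL = """
CREATE TABLE streams (
    stream_id      UUID PRIMARY KEY,
    broadcaster_id UUID NOT NULL,
    title          TEXT,
    category       TEXT,
    started_at     TIMESTAMPTZ,
    status         TEXT NOT NULL,        -- 'live' | 'ended'
    ingest_server  TEXT,
    thumbnail_url  TEXT
);
CREATE INDEX streams_live_by_category
    ON streams (category, started_at)
    WHERE status = 'live';
"""

# Chat audit log in Cassandra: partition by stream and 10-minute bucket so a
# single long stream never grows one unbounded partition.
CHAT_AUDIT_CQL = """
CREATE TABLE chat_audit (
    stream_id  uuid,
    bucket     timestamp,      -- message time truncated to 10 minutes
    sent_at    timeuuid,
    user_id    uuid,
    message    text,
    PRIMARY KEY ((stream_id, bucket), sent_at)
);
"""
```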
Follower/subscriber relationships use a separate PostgreSQL table (broadcaster_id, follower_id, subscribed_at, tier). Notification delivery for go-live events reads from this table and fans out via a push notification service (Firebase Cloud Messaging for mobile, WebSocket for web). Revenue data (subscriptions, donations, ad impressions) uses a transactional PostgreSQL cluster with synchronous replication.
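A rough sketch of the go-live fan-out under these assumptions; db and push_queue are hypothetical stand-ins for the PostgreSQL connection and the notification queue:

```python
def notify_go_live(broadcaster_id: str, stream_id: str, title: str) -> None:
    """Fan out a go-live notification to followers in batches."""
    with db.cursor() as cur:   # db: hypothetical PostgreSQL connection
        cur.execute(
            "SELECT follower_id FROM followers WHERE broadcaster_id = %s",
            (broadcaster_id,),
        )
        for batch in iter(lambda: cur.fetchmany(10_000), []):
            follower_ids = [row[0] for row in batch]
            # Enqueue rather than push inline: a large broadcaster can have
            # millions of followers, so delivery happens asynchronously via
            # FCM (mobile) or WebSocket (web).
            push_queue.enqueue("go_live", follower_ids,
                               {"stream_id": stream_id, "title": title})
```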
API Design
- POST /api/v1/streams: Create a stream; body contains title and category; returns stream_key and ingest_url (rtmp://ingest.example.com/live/{stream_key})
- GET /api/v1/streams/{stream_id}/manifest.m3u8: Fetch the live HLS master playlist; the viewer's player uses this to start playback
- POST /api/v1/streams/{stream_id}/chat: Send a chat message; body contains the message text; processed through the moderation pipeline
- GET /api/v1/streams/discover?category={cat}&sort=viewers&limit=20: Browse live streams by category, sorted by viewer count
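For example, a broadcaster dashboard might call the create endpoint like this (a sketch; the auth header and any response fields beyond stream_key and ingest_url are assumptions):

```python
import requests

resp = requests.post(
    "https://api.example.com/api/v1/streams",
    headers={"Authorization": "Bearer <broadcaster-token>"},   # assumed auth scheme
    json={"title": "Friday speedrun", "category": "gaming"},
    timeout=5,
)
resp.raise_for_status()
stream = resp.json()
# stream["ingest_url"] (rtmp://ingest.example.com/live/{stream_key}) is what the
# broadcaster pastes into OBS; stream["stream_key"] must be kept secret.
print(stream["stream_key"], stream["ingest_url"])
```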
Scaling & Bottlenecks
The transcode tier is the primary bottleneck — each concurrent stream requires dedicated GPU compute. At 500K concurrent streams, this requires ~125K GPU instances (assuming 4 streams per GPU with NVENC). Cost optimization strategies include: tiered transcoding (popular streams get all 4 renditions; low-viewer streams get only 2 — source passthrough + one lower quality); shared transcoding for streams with identical input settings (e.g., all 1080p30 H.264 streams share encoder parameters); and spot/preemptible GPU instances for non-critical renditions with automatic fallback to source passthrough.
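A sketch of how the tiering decision could be expressed; the thresholds are illustrative, only the popular vs. low-viewer split comes from the text:

```python
def choose_renditions(viewer_count: int) -> list[str]:
    """Pick the transcode ladder for a stream based on current popularity.

    Thresholds are illustrative; the design above only says popular streams
    get the full ladder and low-viewer streams get passthrough plus one rendition.
    """
    if viewer_count >= 100:                      # popular: full ABR ladder
        return ["source", "720p30", "480p30", "360p30"]
    if viewer_count >= 5:                        # modest: passthrough + one fallback quality
        return ["source", "480p30"]
    return ["source"]                            # tiny audience: source passthrough only
```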
The CDN egress at 150 Tbps is the dominant cost. Unlike VOD CDN where content can be pre-positioned, live content is inherently cache-unfriendly — each segment is accessed only once during its 2-second window. The solution is aggressive multicast-like behavior at the edge: all viewers of the same stream in the same PoP receive the same cached segment. With 50M viewers across 500K streams, each stream averages 100 viewers, many from the same PoP — this achieves effective cache hit rates of 80%+ even for live content.
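A rough way to see why edge caching still pays off for live content (the per-stream viewer count is the average above; the spread across PoPs is an assumption):

```python
# Per 2-second segment of one stream: every viewer requests it from an edge,
# but each PoP fetches it from the origin only once, however many local
# viewers it has.
viewers_per_stream = 100          # 50M viewers / 500K streams
pops_with_viewers = 20            # assumed spread of those viewers across PoPs

requests_to_edge = viewers_per_stream
requests_to_origin = pops_with_viewers
hit_rate = 1 - requests_to_origin / requests_to_edge
print(f"effective edge cache hit rate ~ {hit_rate:.0%}")   # ~80%, matching the claim above
```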
Key Trade-offs
- HLS (LL-HLS) vs WebRTC for delivery: LL-HLS at 1.5-second latency is close to WebRTC but scales via CDN caching; WebRTC provides sub-second but requires per-viewer server-side connections — LL-HLS wins for audiences above 1,000
- GPU transcoding vs source passthrough: Transcoding enables ABR for all viewers but adds latency and cost; passthrough adds no transcoding latency or GPU cost but forces every viewer onto the broadcaster's bitrate. Tier the decision by stream popularity
- Reaction sampling vs full fan-out: Sending every reaction to every viewer is O(viewers × reactions/sec) which explodes for popular streams; sampling maintains the visual effect at 1% of the network cost
- DVR recording all streams vs opt-in: Recording every stream ensures no content is lost but requires roughly 27 PB/day of storage; opt-in reduces this by 90% but means streamers lose content they never enabled recording for. Default-on with 7-day auto-delete is the middle ground