System Design: Twitch Live Streaming
System design of Twitch covering real-time live video ingestion via RTMP, HLS transcoding, chat at scale, and low-latency delivery to millions of concurrent viewers.
Requirements
Functional Requirements:
- Streamers broadcast live video via RTMP/SRT from OBS or mobile apps
- Viewers watch live streams with sub-5-second latency (2-3 seconds with low-latency mode)
- Real-time chat alongside each stream with emotes and moderation
- Channel subscriptions, bits (virtual currency), and donations
- Clip creation and VOD (Video on Demand) recording of live streams
- Raid and host features to redirect viewers to another stream
Non-Functional Requirements:
- 30 million DAU, 7 million unique streamers per month, 2.5 million peak concurrent viewers
- End-to-end latency (streamer to viewer) under 3 seconds in low-latency mode
- Chat message delivery under 500ms to all viewers in a channel
- 99.95% availability; stream interruptions directly impact creator revenue
- Handle chat rooms with 500K+ concurrent participants
Scale Estimation
- 7M unique streamers per month, with ~100K concurrent streams at peak
- Ingest: 100K streams × 6 Mbps average = 600 Gbps of ingest bandwidth
- Transcode: each stream produces 4 quality variants → up to ~2.4 Tbps of transcode output (an upper bound; the lower renditions carry less than the 6 Mbps source bitrate)
- Delivery: 2.5M concurrent viewers × 4 Mbps average = 10 Tbps of CDN egress
- Chat: top channels have 500K+ viewers each sending ~1 message per minute → ~8,300 messages/sec for a single large channel, with global chat volume exceeding 1M messages/sec
- VOD storage: 100K concurrent streams × 6 hours average × 3 Mbps archived ≈ 810 TB/day
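A quick back-of-envelope check of these figures; this is a sketch whose inputs are only the rough averages assumed above.

```go
package main

import "fmt"

func main() {
	const (
		concurrentStreams = 100_000
		ingestMbps        = 6.0 // average source bitrate
		variants          = 4   // 160p, 480p, 720p, source
		viewers           = 2_500_000
		viewerMbps        = 4.0
		archiveMbps       = 3.0
		avgStreamHours    = 6.0
	)

	ingestGbps := concurrentStreams * ingestMbps / 1_000
	// Upper bound: assumes every rendition at the source bitrate.
	transcodeTbps := ingestGbps * variants / 1_000
	egressTbps := viewers * viewerMbps / 1_000_000

	// 3 Mbps -> bytes per stream-hour -> TB/day across all streams.
	bytesPerHour := archiveMbps / 8 * 3600 * 1e6
	vodTBPerDay := bytesPerHour * avgStreamHours * concurrentStreams / 1e12

	fmt.Printf("ingest: %.0f Gbps, transcode: %.1f Tbps (upper bound), egress: %.0f Tbps\n",
		ingestGbps, transcodeTbps, egressTbps)
	fmt.Printf("VOD archive: %.0f TB/day\n", vodTBPerDay)
}
```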
High-Level Architecture
Twitch's architecture has three real-time pipelines running in parallel. The Video Pipeline handles ingestion and delivery: streamers connect via RTMP to the nearest Ingest Server (deployed at edge PoPs globally). The Ingest Server authenticates the stream key, validates the incoming codec, and forwards the raw video to a Transcoding Cluster. Transcoders run FFmpeg to produce 4 quality variants (160p, 480p, 720p, source quality) packaged as HLS segments (2-second chunks for standard latency, 1-second for low-latency). Segments are pushed to a CDN origin (Twitch uses a combination of its own edge network and Fastly/CloudFront) and distributed to edge nodes.
The Chat Pipeline uses a custom WebSocket-based protocol. Viewers connect to a Chat Edge Server, which maintains persistent connections. When a user sends a message, it goes to a Chat Service that performs rate limiting, spam filtering (ML-based classifier), and moderation rule checks, then publishes to a channel-partitioned pub/sub tier (detailed in the Chat System section below). All Chat Edge Servers subscribed to that channel receive the message and fan it out to connected viewers. For channels with >100K viewers, a hierarchical fan-out tree is used to avoid overwhelming a single pub/sub partition.
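A minimal sketch of the edge fan-out step, using in-process Go channels to stand in for the pub/sub tier and the WebSocket connections; names such as `EdgeHub` and `Broadcast` are illustrative, not Twitch's actual code.

```go
package main

import (
	"fmt"
	"sync"
)

// EdgeHub is a hypothetical chat edge server holding the connections
// for viewers of one channel on this node.
type EdgeHub struct {
	mu      sync.RWMutex
	viewers map[string]chan string // viewer ID -> outbound message queue
}

func NewEdgeHub() *EdgeHub {
	return &EdgeHub{viewers: make(map[string]chan string)}
}

func (h *EdgeHub) Join(viewerID string) <-chan string {
	h.mu.Lock()
	defer h.mu.Unlock()
	ch := make(chan string, 64) // buffered so one slow client doesn't block others
	h.viewers[viewerID] = ch
	return ch
}

// Broadcast fans a message received from the pub/sub tier out to every
// viewer connected to this edge node; slow clients are skipped, not awaited.
func (h *EdgeHub) Broadcast(msg string) {
	h.mu.RLock()
	defer h.mu.RUnlock()
	for _, ch := range h.viewers {
		select {
		case ch <- msg:
		default: // client buffer full: drop rather than stall the fan-out
		}
	}
}

func main() {
	hub := NewEdgeHub()
	out := hub.Join("viewer-1")
	hub.Broadcast("PRIVMSG #channel :hello chat")
	fmt.Println(<-out)
}
```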
The Metadata Pipeline handles stream status, viewer counts, and channel data. A Stream Registry (backed by DynamoDB) tracks all active streams with their ingest server, transcode status, and viewer count. Viewer counts are maintained using a distributed counter service (HyperLogLog-based for approximate unique viewer counts, simple counters for concurrent viewers).
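A sketch of approximate unique-viewer counting with Redis HyperLogLog, assuming the go-redis client; the key layout is illustrative.

```go
package main

import (
	"context"
	"fmt"

	"github.com/redis/go-redis/v9"
)

func main() {
	ctx := context.Background()
	rdb := redis.NewClient(&redis.Options{Addr: "localhost:6379"})

	key := "stream:12345:unique_viewers" // hypothetical key layout

	// Each viewer heartbeat adds their ID; HyperLogLog stays a few KB
	// regardless of how many distinct viewers are added.
	if err := rdb.PFAdd(ctx, key, "user-42", "user-77", "user-42").Err(); err != nil {
		panic(err)
	}

	// PFCount returns an approximate cardinality (~0.81% standard error).
	uniques, err := rdb.PFCount(ctx, key).Result()
	if err != nil {
		panic(err)
	}
	fmt.Println("approx unique viewers:", uniques)
}
```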
Core Components
Ingest & Transcoding
Ingest servers accept RTMP connections and perform a protocol handshake to authenticate the stream key against the Auth Service. The raw RTMP stream is demuxed and forwarded over a reliable internal protocol to the nearest transcoding cluster. Transcoders run FFmpeg with hardware acceleration (NVENC on NVIDIA GPUs) for real-time encoding. Each stream is split into HLS segments: the encoder produces 2-second CMAF segments with low-latency HLS (LL-HLS) extensions for sub-3-second latency. Segments are written to a shared NFS tier and immediately registered with the CDN origin for distribution.
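A simplified transcode-worker sketch that launches one FFmpeg process per rendition; it assumes an ffmpeg build with the h264_nvenc encoder is installed, and the rendition ladder, ingest URL, and output paths are illustrative.

```go
package main

import (
	"fmt"
	"log"
	"os/exec"
)

type rendition struct {
	name    string
	height  int
	bitrate string
}

func main() {
	ladder := []rendition{
		{"160p", 160, "400k"},
		{"480p", 480, "1500k"},
		{"720p", 720, "3000k"},
	}

	input := "rtmp://ingest.local/live/stream_key" // hypothetical ingest URL

	for _, r := range ladder {
		// One ffmpeg process per rendition keeps the sketch simple; a real
		// transcoder would decode once and fan out to multiple encoders.
		cmd := exec.Command("ffmpeg",
			"-i", input,
			"-vf", fmt.Sprintf("scale=-2:%d", r.height),
			"-c:v", "h264_nvenc", // hardware encode on NVIDIA GPUs
			"-b:v", r.bitrate,
			"-c:a", "aac",
			"-f", "hls",
			"-hls_time", "2", // 2-second segments for standard latency
			fmt.Sprintf("/segments/%s/index.m3u8", r.name),
		)
		if err := cmd.Start(); err != nil {
			log.Fatalf("start %s: %v", r.name, err)
		}
	}
	// A real worker would Wait() on the processes and report health.
}
```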
Chat System
Twitch chat handles over 1 million messages per second globally. The architecture uses a layered fan-out model. The Chat Service (stateless Go microservices) receives messages, applies business logic (slow mode, subscriber-only mode, banned word filters), and publishes to a channel-partitioned Kafka topic. Chat Edge Servers consume from Kafka and push messages to connected WebSocket clients. For mega-channels (500K+ viewers), a fan-out tree is used: a root Chat Edge fans out to ~100 regional Chat Edge nodes, each serving ~5,000 viewers. This keeps per-node fan-out bounded at roughly O(sqrt(N)) rather than O(N).
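A sketch of the per-channel business-logic checks a Chat Service might run before publishing; the rule names, thresholds, and in-memory state are illustrative (a real service would keep rate-limit state in a shared store).

```go
package main

import (
	"errors"
	"strings"
	"time"
)

type Channel struct {
	SlowModeInterval time.Duration // 0 means slow mode is off
	SubscriberOnly   bool
	BannedWords      []string
}

type Message struct {
	UserID       string
	IsSubscriber bool
	Body         string
	SentAt       time.Time
}

// lastMessageAt would live in a shared store (e.g. Redis) in practice;
// an in-memory map keeps this sketch self-contained.
var lastMessageAt = map[string]time.Time{}

func Validate(ch Channel, m Message) error {
	if ch.SubscriberOnly && !m.IsSubscriber {
		return errors.New("subscriber-only mode")
	}
	if ch.SlowModeInterval > 0 {
		if last, ok := lastMessageAt[m.UserID]; ok && m.SentAt.Sub(last) < ch.SlowModeInterval {
			return errors.New("slow mode: sending too fast")
		}
	}
	lower := strings.ToLower(m.Body)
	for _, w := range ch.BannedWords {
		if strings.Contains(lower, w) {
			return errors.New("message contains a banned word")
		}
	}
	lastMessageAt[m.UserID] = m.SentAt
	return nil
}

func main() {
	ch := Channel{SlowModeInterval: 10 * time.Second, BannedWords: []string{"spamword"}}
	msg := Message{UserID: "user-1", Body: "hello chat", SentAt: time.Now()}
	if err := Validate(ch, msg); err == nil {
		// publish to the channel-partitioned Kafka topic here
	}
}
```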
Clip & VOD Service
When a viewer creates a clip, the Clip Service reads the last 60 seconds of HLS segments from the CDN edge cache, concatenates them, and stores the clip as a standalone MP4 in S3. VOD recording runs continuously for partner/affiliate streamers: a VOD Worker consumes HLS segments from the transcoding output and accumulates them into a growing recording via an S3 multipart upload (S3 objects cannot be appended in place). After the stream ends, the VOD is finalized with proper metadata and made available in the channel's video archive.
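A sketch of clip assembly: the last ~60 seconds of segments are listed in a concat manifest and remuxed into a standalone MP4 without re-encoding. It assumes ffmpeg's concat demuxer and MPEG-TS segments at the edge (CMAF/fMP4 segments would additionally need their init segment); all paths are illustrative.

```go
package main

import (
	"fmt"
	"log"
	"os"
	"os/exec"
)

func main() {
	// The most recent ~60s of 2-second segments pulled from the edge cache.
	segments := []string{"/cache/seg_0098.ts", "/cache/seg_0099.ts", "/cache/seg_0100.ts"}

	// Build an ffmpeg concat manifest listing the segments in order.
	list, err := os.CreateTemp("", "clip-*.txt")
	if err != nil {
		log.Fatal(err)
	}
	defer os.Remove(list.Name())
	for _, s := range segments {
		fmt.Fprintf(list, "file '%s'\n", s)
	}
	list.Close()

	// Remux (no re-encode) into a standalone MP4, then upload it to S3.
	cmd := exec.Command("ffmpeg",
		"-f", "concat", "-safe", "0",
		"-i", list.Name(),
		"-c", "copy",
		"/clips/clip_abc123.mp4",
	)
	if err := cmd.Run(); err != nil {
		log.Fatalf("ffmpeg concat: %v", err)
	}
}
```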
Database Design
Channel and user data is stored in PostgreSQL (RDS) sharded by user_id. The Streams table tracks active streams: stream_id, channel_id, ingest_server, started_at, title, game_id, viewer_count, status. Stream events (go-live, offline, title change) are published to a Kafka topic consumed by the notification service and analytics. Subscription and transaction data (bits, subs) use a separate PostgreSQL cluster with strict ACID guarantees — these are revenue-critical.
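A sketch of the Streams row and the go-live event published to Kafka; the field names follow the columns listed above, while the event envelope is an illustrative assumption.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// Stream mirrors one row of the Streams table described above.
type Stream struct {
	StreamID     string    `json:"stream_id"`
	ChannelID    string    `json:"channel_id"`
	IngestServer string    `json:"ingest_server"`
	StartedAt    time.Time `json:"started_at"`
	Title        string    `json:"title"`
	GameID       string    `json:"game_id"`
	ViewerCount  int       `json:"viewer_count"`
	Status       string    `json:"status"` // "live" | "offline"
}

// StreamEvent is a hypothetical shape for the go-live / offline / title-change
// events consumed by the notification service and analytics.
type StreamEvent struct {
	Type     string `json:"type"` // "go_live", "offline", "title_change"
	StreamID string `json:"stream_id"`
	Payload  Stream `json:"payload"`
}

func main() {
	ev := StreamEvent{Type: "go_live", StreamID: "s-123",
		Payload: Stream{StreamID: "s-123", ChannelID: "c-9", Status: "live", StartedAt: time.Now()}}
	b, _ := json.Marshal(ev)
	fmt.Println(string(b)) // this JSON would be the Kafka message value
}
```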
Chat messages are ephemeral by default — they are not persisted to a database during live delivery. For channels that opt into chat logs, messages are asynchronously written to a Cassandra cluster partitioned by channel_id and bucketed by hour. VOD metadata and clip data live in DynamoDB with stream_id as the partition key. The recommendation system (suggesting streams to watch) uses a feature store backed by Redis for real-time features (current viewer count, stream duration) and S3/Athena for historical features.
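A sketch of the opt-in chat-log write path showing the channel + hour-bucket partition key; the CQL schema is an illustrative assumption, not a documented Twitch table.

```go
package main

import (
	"fmt"
	"time"
)

// hourBucket derives the time bucket appended to the partition key so that a
// single channel's chat history is spread across one partition per hour.
func hourBucket(t time.Time) string {
	return t.UTC().Format("2006010215") // e.g. "2024061518"
}

func main() {
	channelID := "c-9"
	sentAt := time.Now()

	// Illustrative CQL: partition key = (channel_id, hour_bucket),
	// clustering key = sent_at so messages read back in time order.
	const insertCQL = `INSERT INTO chat_logs (channel_id, hour_bucket, sent_at, user_id, body)
	                   VALUES (?, ?, ?, ?, ?)`

	fmt.Println(insertCQL)
	fmt.Println("partition:", channelID, hourBucket(sentAt))
	// A real worker would execute this asynchronously via a Cassandra driver
	// (e.g. gocql), decoupled from the live delivery path.
}
```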
API Design
- POST /api/v1/streams/ingest — Initiate a stream; body contains stream_key, codec, bitrate; returns ingest_endpoint URL
- GET /api/v1/streams/{channel_name}/playlist.m3u8 — Fetch the HLS master playlist for a live stream
- WS /api/v1/chat/{channel_id} — WebSocket connection for real-time chat; supports PRIVMSG, JOIN, PART commands (IRC-inspired protocol)
- POST /api/v1/clips — Create a clip from the last 60 seconds; body contains channel_id, title; returns clip_id and URL
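As a usage sketch, a minimal client call to the clip endpoint might look like this; the host, auth header, and exact response shape are assumptions beyond what is listed above.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"log"
	"net/http"
)

func main() {
	body, _ := json.Marshal(map[string]string{
		"channel_id": "c-9",
		"title":      "Incredible clutch",
	})

	req, err := http.NewRequest(http.MethodPost,
		"https://api.example.com/api/v1/clips", bytes.NewReader(body))
	if err != nil {
		log.Fatal(err)
	}
	req.Header.Set("Content-Type", "application/json")
	req.Header.Set("Authorization", "Bearer <token>") // auth scheme assumed

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	var out struct {
		ClipID string `json:"clip_id"`
		URL    string `json:"url"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		log.Fatal(err)
	}
	fmt.Println(out.ClipID, out.URL)
}
```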
Scaling & Bottlenecks
The transcoding fleet is the primary compute bottleneck. Each concurrent stream requires a dedicated GPU transcode slot. At 100K concurrent streams × 4 quality variants = 400K concurrent encodes. Twitch uses a mix of on-premise GPU servers and cloud GPU instances (EC2 G4/G5) with auto-scaling. During peak hours (major esports events), reserved capacity handles baseline load while on-demand instances handle burst. A priority queue ensures partner streamers always get transcode capacity before non-partner streamers.
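A sketch of the partner-first scheduling idea using Go's container/heap; the tiers and fields are illustrative.

```go
package main

import (
	"container/heap"
	"fmt"
)

type transcodeJob struct {
	streamID string
	priority int // lower value = served first; e.g. 0 = partner, 1 = affiliate, 2 = other
}

type jobQueue []transcodeJob

func (q jobQueue) Len() int           { return len(q) }
func (q jobQueue) Less(i, j int) bool { return q[i].priority < q[j].priority }
func (q jobQueue) Swap(i, j int)      { q[i], q[j] = q[j], q[i] }
func (q *jobQueue) Push(x any)        { *q = append(*q, x.(transcodeJob)) }
func (q *jobQueue) Pop() any {
	old := *q
	n := len(old)
	job := old[n-1]
	*q = old[:n-1]
	return job
}

func main() {
	q := &jobQueue{}
	heap.Init(q)
	heap.Push(q, transcodeJob{"viewer-stream", 2})
	heap.Push(q, transcodeJob{"partner-stream", 0})
	heap.Push(q, transcodeJob{"affiliate-stream", 1})

	// When a GPU slot frees up, the partner stream is assigned first.
	for q.Len() > 0 {
		fmt.Println(heap.Pop(q).(transcodeJob).streamID)
	}
}
```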
Chat fan-out for mega-channels is the second major bottleneck. The hierarchical fan-out tree solves the O(N) problem, but maintaining WebSocket connections for millions of concurrent users requires a large fleet of Chat Edge servers. Each server handles ~50K concurrent WebSocket connections using epoll-based event loops (Go + custom networking). Connection draining during deployments uses a gradual reconnect protocol where clients are told to reconnect to a new server over a 5-minute window to avoid thundering herd.
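A sketch of the gradual drain: each connected client is told to reconnect at a random point within the 5-minute window, so a deploy never moves everyone at once; the RECONNECT message and connection IDs are illustrative.

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

const drainWindow = 5 * time.Minute

// scheduleDrain asks every connected client to reconnect at a random offset
// within the drain window, spreading the reconnect load on the new servers.
func scheduleDrain(connIDs []string, send func(connID, msg string)) {
	for _, id := range connIDs {
		delay := time.Duration(rand.Int63n(int64(drainWindow)))
		go func(id string, d time.Duration) {
			time.Sleep(d)
			send(id, "RECONNECT") // client closes and dials a fresh edge server
		}(id, delay)
	}
}

func main() {
	conns := []string{"conn-1", "conn-2", "conn-3"}
	scheduleDrain(conns, func(connID, msg string) {
		fmt.Printf("%s -> %s\n", msg, connID)
	})
	time.Sleep(drainWindow) // keep the demo alive until all reconnects fire
}
```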
Key Trade-offs
- HLS over WebRTC for viewer delivery: HLS with LL-HLS extensions provides 2-3 second latency vs WebRTC's sub-second, but HLS scales to millions of viewers via CDN caching while WebRTC requires per-viewer connections
- Ephemeral chat vs persistent storage: Not persisting chat messages by default saves enormous storage and write throughput; the trade-off is loss of chat history unless explicitly opted in
- GPU transcoding vs CPU: GPU encoding (NVENC) is 10x faster than CPU (x264) but produces slightly lower quality per bitrate — acceptable for live content where latency is prioritized over quality
- 2-second HLS segments vs smaller: Smaller segments reduce latency but increase CDN request rate and manifest size; 2 seconds is the sweet spot for standard latency, 1 second for low-latency mode with CMAF chunked transfer