System Design: YouTube
Complete system design of YouTube covering video upload, transcoding, the recommendation engine, and CDN delivery architecture, at a scale of 2.5 billion monthly users and a catalog of 800 million videos.
Requirements
Functional Requirements:
- Users can upload videos up to 12 hours long in various formats
- Videos are transcoded to multiple resolutions (144p to 4K) and codecs (H.264, VP9, AV1)
- Users watch videos with adaptive bitrate streaming (DASH/HLS)
- Personalized home feed and video recommendations via sidebar
- Full-text and semantic search over video titles, descriptions, and auto-generated captions
- Subscribers receive notifications when followed channels upload
Non-Functional Requirements:
- 2.5 billion MAU, 800M DAU; 500 hours of video uploaded per minute
- Video playback start latency under 2 seconds globally (p95)
- 99.99% availability; uploaded videos must never be lost
- Eventual consistency for view counts and recommendations; strong consistency for monetization data
- CDN must serve 1 billion hours of watch time per day
Scale Estimation
800M DAU watching an average of 75 minutes/day = 1 billion hours of video streamed daily. At an average bitrate of 5 Mbps, that is roughly 2.25 exabytes of egress per day, or about 208 Tbps sustained. Upload volume: 500 hours/min = 720K hours of video per day. Assuming an average raw upload of ~500MB per hour of footage, that is roughly 360TB of raw uploads/day. After transcoding to ~20 renditions (resolutions × codecs), processed output is roughly 8x the raw input, about 2.88PB/day. The total video catalog exceeds 800 million videos, requiring hundreds of exabytes of storage across Google's Colossus distributed filesystem.
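As a sanity check, the arithmetic in Python (all inputs are the rough estimates above, not measured figures):

```python
DAU = 800e6                      # daily active users
WATCH_MIN_PER_DAY = 75           # average minutes watched per user
AVG_BITRATE_MBPS = 5             # average delivered bitrate

hours_streamed = DAU * WATCH_MIN_PER_DAY / 60                # ~1e9 hours/day
egress_bits = hours_streamed * 3600 * AVG_BITRATE_MBPS * 1e6
print(f"egress/day: {egress_bits / 8 / 1e18:.2f} EB")        # ~2.25 EB
print(f"sustained:  {egress_bits / 86400 / 1e12:.0f} Tbps")  # ~208 Tbps

UPLOAD_HOURS_PER_MIN = 500
RAW_MB_PER_HOUR = 500            # assumed raw upload size per hour of video
upload_hours = UPLOAD_HOURS_PER_MIN * 60 * 24                # 720K hours/day
raw_tb = upload_hours * RAW_MB_PER_HOUR / 1e6                # ~360 TB/day
print(f"raw/day: {raw_tb:.0f} TB, transcoded: {raw_tb * 8 / 1e3:.2f} PB")
```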
High-Level Architecture
YouTube's architecture splits into three major data planes. The Ingestion Plane handles uploads: the client negotiates a resumable upload session with the Upload Service, which streams chunks to Google Cloud Storage (GCS). Once the upload completes, a Pub/Sub event triggers the Transcoding Pipeline — a massive fleet of machines running a custom transcoding engine (based on libvpx/libaom for VP9/AV1 and FFmpeg internals for H.264). Each video is split into segments and transcoded in parallel; a DAG scheduler orchestrates the dependency graph (e.g., thumbnail extraction depends on first-pass encode). Processed segments and manifests are written to Colossus and registered in Spanner (Google's globally distributed SQL database) as the source of truth for video metadata.
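The DAG scheduler itself is internal to Google; as a rough sketch, a per-video job graph with the dependency from the example above can be executed in topological order. Task names and edges here are illustrative assumptions:

```python
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

# Hypothetical per-video job graph: each task maps to its dependencies.
# Mirrors the example in the text: thumbnails depend on the first-pass encode.
JOB_GRAPH = {
    "probe":            set(),            # inspect container and codec
    "first_pass":       {"probe"},        # scene-complexity analysis
    "thumbnails":       {"first_pass"},
    "second_pass_h264": {"first_pass"},
    "second_pass_vp9":  {"first_pass"},
    "write_manifest":   {"second_pass_h264", "second_pass_vp9"},
}

def run_video_pipeline(video_id: str) -> None:
    ts = TopologicalSorter(JOB_GRAPH)
    for task in ts.static_order():        # dependency-respecting order
        print(f"[{video_id}] running {task}")
        # In the real pipeline each task fans out to Borg workers;
        # here we simply execute sequentially as a placeholder.

run_video_pipeline("example_video")
```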
The Serving Plane handles playback. When a user clicks a video, the Video Serving Service returns a DASH/HLS manifest with segment URLs pointing to Google's global CDN (Google Global Cache). The CDN has three tiers: Edge nodes in ISP data centers (Google Global Cache appliances), regional PoPs, and origin servers backed by Colossus. Adaptive bitrate logic on the client selects the appropriate rendition based on measured bandwidth. The Recommendation Plane runs a deep learning pipeline: candidate generation (retrieve ~1,000 videos from a corpus of 800M using embedding similarity via ANN), then ranking (a multi-objective deep neural network predicting watch time, click-through rate, and satisfaction).
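Client ABR heuristics vary by player; a minimal throughput-based selector, assuming a simple rendition ladder and a safety margin, might look like this:

```python
# Hypothetical rendition ladder: (label, bitrate in kbps).
LADDER = [("144p", 100), ("360p", 500), ("720p", 2500),
          ("1080p", 5000), ("2160p", 18000)]

def pick_rendition(measured_kbps: float, safety: float = 0.8) -> str:
    """Pick the highest rendition whose bitrate fits within a safety
    margin of the measured throughput; fall back to the lowest."""
    budget = measured_kbps * safety
    best = LADDER[0][0]
    for label, kbps in LADDER:
        if kbps <= budget:
            best = label
    return best

assert pick_rendition(4000) == "720p"    # 3200 kbps budget < 5000 kbps
assert pick_rendition(25000) == "2160p"  # enough headroom for 4K
```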
Core Components
Transcoding Pipeline
YouTube's transcoding system processes 500 hours of video per minute. Each video is split into 4-second chunks and distributed across a fleet of transcoding workers managed by Borg (Google's cluster manager). A two-pass encoding strategy is used: the first pass analyzes scene complexity to allocate bits optimally (content-aware encoding); the second pass produces the final output. For each video, roughly 20 renditions are generated: 7 resolutions (144p to 4K) × 3 codecs (H.264 for legacy, VP9 for modern browsers, AV1 for bandwidth savings). AV1 encoding is 10x slower than H.264, so it is applied only to popular videos post-upload via a priority queue.
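The actual scheduling policy is not public; a sketch of a popularity-ordered queue for deferred AV1 encodes (predicted-view numbers are invented) could look like this:

```python
import heapq

class AV1EncodeQueue:
    """Max-priority queue of videos awaiting AV1 re-encode, ordered by
    predicted views: higher views first, since CDN savings scale with
    watch time. heapq is a min-heap, so priorities are negated."""

    def __init__(self) -> None:
        self._heap: list[tuple[float, str]] = []

    def submit(self, video_id: str, predicted_views: float) -> None:
        heapq.heappush(self._heap, (-predicted_views, video_id))

    def next_job(self) -> str | None:
        if not self._heap:
            return None
        _, video_id = heapq.heappop(self._heap)
        return video_id

q = AV1EncodeQueue()
q.submit("viral_music_video", 1_200_000)
q.submit("small_vlog", 3_000)
assert q.next_job() == "viral_music_video"  # popular video encoded first
```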
Video Serving & CDN
Google Global Cache (GGC) appliances are deployed inside ISP networks worldwide. When a user requests a video segment, DNS resolution directs them to the nearest GGC node. Cache hit rates exceed 90% for popular content. Cache misses cascade to regional PoPs and then to origin (Colossus). Segment URLs are signed with time-limited tokens to prevent hotlinking. The manifest includes multiple CDN endpoints; the client implements failover — if a segment request times out after 2 seconds, it retries against an alternate edge.
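YouTube's token format is internal; an HMAC-signed, time-limited URL is a common way to implement this kind of protection. A minimal sketch, assuming a symmetric key shared between the signer and edge nodes:

```python
import hashlib, hmac, time
from urllib.parse import urlencode

SECRET = b"shared-secret-between-signer-and-edge"  # assumption: symmetric key

def sign_segment_url(path: str, ttl_s: int = 300) -> str:
    """Append an expiry timestamp and an HMAC over (path, expiry) so edge
    nodes can verify the URL statelessly, without a database lookup."""
    expires = int(time.time()) + ttl_s
    msg = f"{path}?expires={expires}".encode()
    sig = hmac.new(SECRET, msg, hashlib.sha256).hexdigest()
    return f"{path}?{urlencode({'expires': expires, 'sig': sig})}"

def verify(path: str, expires: int, sig: str) -> bool:
    msg = f"{path}?expires={expires}".encode()
    expected = hmac.new(SECRET, msg, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, sig) and time.time() < expires

url = sign_segment_url("/videos/abc123/seg_0042.m4s")
```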
Recommendation Engine
The recommendation system uses a two-stage architecture. Stage 1 (Candidate Generation): a deep neural network with video and user embeddings retrieves ~1,000 candidates from the corpus using approximate nearest neighbor search (ScaNN). Stage 2 (Ranking): a wide-and-deep model scores each candidate on predicted watch time, engagement probability, and user satisfaction (measured via surveys and long-term retention). Features include user watch history, search history, demographic info, video metadata, and freshness signals. The model is retrained daily on petabytes of engagement data using TensorFlow on TPU pods.
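The production models are proprietary; the two-stage shape can be sketched in NumPy, with brute-force dot-product retrieval standing in for ScaNN and an invented linear combination standing in for the ranking DNN:

```python
import numpy as np

rng = np.random.default_rng(0)
N_VIDEOS, DIM = 100_000, 64                # toy corpus; the real one is ~800M
video_emb = rng.standard_normal((N_VIDEOS, DIM)).astype(np.float32)

def recommend(user_emb: np.ndarray, k_candidates: int = 1000, k_final: int = 20):
    # Stage 1: candidate generation. Brute-force dot product here;
    # production would use approximate nearest neighbor search (ScaNN).
    scores = video_emb @ user_emb
    candidates = np.argpartition(scores, -k_candidates)[-k_candidates:]

    # Stage 2: ranking. Placeholder for the multi-objective DNN: combine
    # (fake) predicted watch time and CTR with hand-picked weights.
    pred_watch = rng.random(k_candidates)  # stand-in model outputs
    pred_ctr = rng.random(k_candidates)
    final = 0.7 * pred_watch + 0.3 * pred_ctr
    order = np.argsort(final)[::-1][:k_final]
    return candidates[order]

top20 = recommend(rng.standard_normal(DIM).astype(np.float32))
```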
Database Design
Video metadata is stored in Google Spanner, providing globally consistent reads and writes. The Videos table contains video_id (UUID), channel_id, title, description, upload_time, duration, status (processing/live/removed), and a repeated field for rendition URLs. A separate VideoStats table (sharded by video_id) stores view_count, like_count, and comment_count — these use a counter pattern with periodic aggregation from a Flume pipeline. Channel data and subscription relationships live in a Bigtable instance optimized for wide-column lookups (partition key: channel_id, column families: metadata, subscribers, uploads).
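A minimal sketch of the sharded-counter pattern; the shard count and the in-memory stand-ins for the stats rows and the Flume aggregation are assumptions:

```python
import random
from collections import defaultdict

N_SHARDS = 64  # spread hot counters across rows to avoid write hotspots

# Stand-in for the raw event stream that the Flume pipeline would consume.
view_events = ["vid_1"] * 500 + ["vid_2"] * 3

# Write path: each view increments one random shard of its counter.
shards: dict[tuple[str, int], int] = defaultdict(int)
for vid in view_events:
    shards[(vid, random.randrange(N_SHARDS))] += 1

# Periodic aggregation: sum shards into the user-visible view_count.
view_count: dict[str, int] = defaultdict(int)
for (vid, _), n in shards.items():
    view_count[vid] += n

assert view_count["vid_1"] == 500
```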
Comments use a separate Spanner table with video_id as the partition key and comment_id (Snowflake-style) as the sort key, enabling efficient range scans for paginated comment threads. A threaded comment has a parent_comment_id field for hierarchical replies. The search index is powered by a custom inverted index system (Google's internal search infrastructure) indexing titles, descriptions, auto-generated captions (from speech-to-text), and visual content labels (from Cloud Video Intelligence API).
API Design
- POST /upload/resumable - Initiate a resumable upload session; returns upload_uri for chunked PUT requests
- GET /api/v3/videos?id={video_id}&part=snippet,contentDetails,statistics - Fetch video metadata, duration, and view counts
- GET /api/v3/search?q={query}&maxResults=25&pageToken={token} - Search videos with pagination
- GET /api/v3/recommendations?user_id={id}&count=20 - Fetch personalized video recommendations for the home feed
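As a usage sketch against the first endpoint, a resumable upload client might look like the following; the host, request fields, and chunk size are assumptions beyond what the endpoint list specifies:

```python
import os
import requests  # third-party: pip install requests

API = "https://youtube.example.com"   # placeholder host
CHUNK = 8 * 1024 * 1024               # 8 MiB chunks (arbitrary choice)

def upload_video(path: str) -> None:
    # 1. Initiate a resumable session; the response carries upload_uri.
    session = requests.post(f"{API}/upload/resumable",
                            json={"filename": os.path.basename(path)}).json()
    upload_uri = session["upload_uri"]

    # 2. Stream the file in chunks; Content-Range lets the server
    #    resume from the last acknowledged byte after a failure.
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        offset = 0
        while chunk := f.read(CHUNK):
            end = offset + len(chunk) - 1
            requests.put(upload_uri, data=chunk, headers={
                "Content-Range": f"bytes {offset}-{end}/{size}",
            })
            offset = end + 1
```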
Scaling & Bottlenecks
The transcoding pipeline is the primary compute bottleneck. YouTube addresses this with priority-based scheduling: newly uploaded videos get a single H.264 rendition within minutes for immediate availability; additional renditions (VP9, AV1, higher resolutions) are queued based on predicted popularity. Videos that go viral get expedited AV1 encoding because the bandwidth savings at scale are enormous — AV1 delivers 30% better compression than VP9, saving petabytes of CDN egress daily.
CDN capacity planning is critical. YouTube pre-positions popular content to edge caches using a popularity prediction model (gradient boosting on early view velocity). Long-tail content (videos with <100 views/day) is served from regional origins to avoid polluting edge caches. The thundering herd problem during viral events (e.g., a music video premiere) is handled by request coalescing at the edge: multiple concurrent requests for the same segment are collapsed into a single origin fetch.
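Request coalescing (sometimes called singleflight) can be sketched with threads: the first request for a key becomes the leader and fetches from origin, while concurrent duplicates wait on its result. origin_fetch here is a stand-in for the real origin call:

```python
import threading

class Coalescer:
    """Collapse concurrent fetches for the same key into one origin call."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._inflight: dict[str, threading.Event] = {}
        self._results: dict[str, bytes] = {}   # doubles as a simple cache

    def get(self, key: str, origin_fetch) -> bytes:
        with self._lock:
            ev = self._inflight.get(key)
            leader = ev is None
            if leader:
                ev = threading.Event()
                self._inflight[key] = ev
        if leader:
            try:
                self._results[key] = origin_fetch(key)   # single origin hit
            finally:
                ev.set()
                with self._lock:
                    del self._inflight[key]
        else:
            ev.wait()                          # piggyback on leader's fetch
        return self._results[key]
```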
Key Trade-offs
- Multiple codecs (H.264 + VP9 + AV1) vs single codec: Supporting three codecs triples storage and encoding compute, but AV1 alone saves 30% bandwidth on supported devices — at YouTube's scale, CDN savings dwarf encoding costs
- Content-aware encoding vs fixed bitrate ladder: Two-pass encoding with per-scene bit allocation produces 20% better quality at the same bitrate but doubles encoding time — justified by the long tail of views
- Spanner over Bigtable for metadata: Spanner provides strong consistency and SQL semantics critical for monetization and copyright systems; the latency trade-off vs Bigtable is acceptable for metadata reads
- Push vs pull for subscription notifications: YouTube uses push (fan-out on write to a notification service) for channels with <1M subscribers; for mega-channels, batch notification delivery avoids write storms (a sketch of this split follows this list)
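A sketch of the subscriber fan-out split; the 1M threshold comes from the trade-off above, and every function and signature here is an assumption:

```python
FANOUT_THRESHOLD = 1_000_000  # switch point between push and batch delivery

def on_upload(channel_id: str, video_id: str,
              subscriber_count: int, enqueue, schedule_batch) -> None:
    if subscriber_count < FANOUT_THRESHOLD:
        # Fan-out on write: one notification job per subscriber.
        for user_id in iter_subscribers(channel_id):
            enqueue(user_id, video_id)
    else:
        # Mega-channel: defer to a batch job that delivers in waves,
        # avoiding a multi-million-row write storm at upload time.
        schedule_batch(channel_id, video_id)

def iter_subscribers(channel_id: str):
    """Placeholder: would range-scan the Bigtable subscribers family."""
    yield from ()
```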