System Design: Video Recommendation System

System design of a video recommendation engine covering candidate retrieval, deep learning ranking models, feature stores, and real-time personalization for billions of daily recommendations.

Requirements

Functional Requirements:

  • Generate personalized video recommendations for home feed, sidebar, and end-of-video screens
  • Support multiple recommendation surfaces (trending, because-you-watched, topic-based, new releases)
  • Incorporate real-time signals (just-watched videos, search queries) within seconds
  • Handle cold-start for new users and new videos
  • Explain recommendations ("Because you watched X", "Popular in your area")

Non-Functional Requirements:

  • Serve 500 million recommendation requests per day with p99 latency under 100ms
  • Model freshness: incorporate engagement signals within 30 seconds of user action
  • Support a catalog of 500 million videos with daily additions of 500K new videos
  • A/B testing infrastructure for continuous model experimentation
  • Recommendation quality metric: 10%+ improvement in watch time over popularity-based baseline

Scale Estimation

  • Requests: 500M/day ≈ 5,800 req/sec average, ~15K req/sec peak.
  • Candidate retrieval: ~1,000 candidates per request from a 500M-video catalog → the ANN index must sustain 15K queries/sec with recall >95%.
  • Feature store reads: ~200 features per candidate → 1,000 candidates × 200 features = 200K feature lookups per request → ~3 billion lookups/sec at peak.
  • Model inference: 1,000 candidates scored per request at 15K req/sec → 15M candidate scores/sec.
  • Training data: 10 billion engagement events/day at ~500 bytes each ≈ 5TB/day.
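
These figures are easy to sanity-check with a quick back-of-envelope script; the constants in the sketch below come straight from the requirements above.

```python
# Back-of-envelope check of the scale estimates; constants come from the requirements.
REQUESTS_PER_DAY = 500_000_000
PEAK_RPS = 15_000
CANDIDATES_PER_REQUEST = 1_000
FEATURES_PER_CANDIDATE = 200
EVENTS_PER_DAY = 10_000_000_000
BYTES_PER_EVENT = 500

avg_rps = REQUESTS_PER_DAY / 86_400                                    # ~5,800 req/sec
lookups_per_request = CANDIDATES_PER_REQUEST * FEATURES_PER_CANDIDATE  # 200K
peak_lookups_per_sec = lookups_per_request * PEAK_RPS                  # 3 billion/sec
peak_scores_per_sec = CANDIDATES_PER_REQUEST * PEAK_RPS                # 15M candidate scores/sec
training_tb_per_day = EVENTS_PER_DAY * BYTES_PER_EVENT / 1e12          # ~5 TB/day

print(f"{avg_rps:,.0f} req/sec avg, {peak_lookups_per_sec:,.0f} feature lookups/sec peak")
print(f"{peak_scores_per_sec:,.0f} candidate scores/sec peak, {training_tb_per_day:.0f} TB/day training data")
```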

High-Level Architecture

The recommendation system uses a multi-stage funnel architecture. Stage 1 (Candidate Generation): multiple retrieval models run in parallel — collaborative filtering via matrix factorization (ALS), content-based retrieval via video embedding similarity (using a two-tower neural network), and rule-based retrieval (trending, new releases, same-creator). Each retriever produces 200-500 candidates. A union-and-deduplicate step yields ~1,000 unique candidates.
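
A minimal sketch of the union-and-deduplicate step, assuming each retriever returns an ordered list of video IDs; the toy IDs and the helper name are illustrative, only the ~1,000-candidate cap comes from the design above.

```python
# Union-and-deduplicate across parallel retrievers, preserving each retriever's order.
from collections.abc import Iterable

def merge_candidates(retriever_outputs: Iterable[list[str]], limit: int = 1000) -> list[str]:
    """Union candidate video IDs from several retrievers, keeping first occurrences."""
    seen: set[str] = set()
    merged: list[str] = []
    for candidates in retriever_outputs:
        for video_id in candidates:
            if video_id not in seen:
                seen.add(video_id)
                merged.append(video_id)
                if len(merged) >= limit:
                    return merged
    return merged

# e.g. ALS retriever, two-tower retriever, trending retriever (toy IDs)
candidates = merge_candidates([["v1", "v2", "v3"], ["v2", "v4"], ["v5", "v1"]])
```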

Stage 2 (Ranking): the candidates are scored by a deep ranking model. The model is a multi-task deep neural network with shared lower layers and task-specific heads predicting: P(click), P(watch>50%), P(like), P(share), and P(not-interested). Features are pulled from the Feature Store in batch: user features (watch history embedding, demographic, device), video features (topic embedding, popularity, freshness, creator features), and context features (time of day, day of week, session depth). The model outputs a weighted composite score.
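
The weighted composite score can be as simple as a linear combination of the head outputs. The head names below mirror the tasks listed above; the weights themselves are illustrative assumptions that would be tuned through A/B experiments.

```python
# Hypothetical composite scoring over the multi-task heads; weights are illustrative.
TASK_WEIGHTS = {
    "p_click": 1.0,
    "p_watch_50": 4.0,
    "p_like": 2.0,
    "p_share": 3.0,
    "p_not_interested": -5.0,   # negative signal demotes the candidate
}

def composite_score(head_outputs: dict[str, float]) -> float:
    """Combine per-task probabilities into a single ranking score."""
    return sum(TASK_WEIGHTS[task] * p for task, p in head_outputs.items())

score = composite_score({"p_click": 0.31, "p_watch_50": 0.22, "p_like": 0.05,
                         "p_share": 0.01, "p_not_interested": 0.02})
```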

Stage 3 (Post-Ranking): business logic is applied — diversity injection (no more than 2 videos from the same creator in a row), freshness boost for recently uploaded content, demotion of previously seen videos, and ad slot insertion. The final ranked list is returned to the client. A Logging Service records the recommendation and subsequent user interactions for model training.
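
As a concrete example of post-ranking business logic, here is a sketch of the creator-diversity rule (no more than two consecutive videos from the same creator). Candidates are assumed to be dicts carrying a creator_id; the field names are illustrative.

```python
# Greedy reordering so no creator appears more than max_run times in a row.
def enforce_creator_diversity(ranked: list[dict], max_run: int = 2) -> list[dict]:
    def violates(result: list[dict], item: dict) -> bool:
        tail = result[-max_run:]
        return (len(tail) == max_run
                and all(v["creator_id"] == item["creator_id"] for v in tail))

    result, deferred = [], []
    for item in ranked:                 # ranked: highest score first
        if violates(result, item):
            deferred.append(item)       # push the video down, retry later
            continue
        result.append(item)
        still_deferred = []
        for d in deferred:              # re-insert deferred videos once they fit
            (result if not violates(result, d) else still_deferred).append(d)
        deferred = still_deferred
    return result + deferred            # leftovers go at the tail
```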

Core Components

Candidate Retrieval

The collaborative filtering retriever uses user and video embeddings (128-dimensional vectors) trained via Alternating Least Squares on the user-video interaction matrix. At serving time, the user's embedding vector is used as a query against a FAISS index (IVF-PQ quantization) holding all 500M video embeddings. This returns the top 500 nearest neighbors in <10ms. The content-based retriever uses a two-tower model: one tower encodes the user's recent watch history into an embedding, the other encodes video metadata (title, tags, visual features from a CNN). The dot product of the two towers produces a relevance score. Both retrievers run in parallel on GPU inference servers.
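
The retrieval path looks roughly like the FAISS sketch below. The index parameters (nlist, PQ code size, nprobe) and the stand-in embeddings are assumptions for illustration, not the production configuration.

```python
# Illustrative FAISS IVF-PQ retrieval for the collaborative-filtering retriever.
import numpy as np
import faiss

DIM = 128
index = faiss.index_factory(DIM, "IVF1024,PQ32", faiss.METRIC_INNER_PRODUCT)

# Stand-in for the 500M video embeddings produced by the ALS model.
video_embeddings = np.random.rand(100_000, DIM).astype("float32")
index.train(video_embeddings)            # learn coarse centroids + PQ codebooks
index.add(video_embeddings)
index.nprobe = 32                        # inverted lists scanned per query

# At serving time, the user's embedding is the query vector.
user_embedding = np.random.rand(1, DIM).astype("float32")
scores, neighbor_ids = index.search(user_embedding, 500)   # top-500 candidates
```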

Feature Store

The Feature Store is the backbone of the ranking system and serves two types of features. Batch features (e.g., video popularity percentile, creator subscriber count, user long-term preferences) are precomputed hourly or daily by Spark jobs and stored in a Redis cluster with 10TB capacity. Real-time features (e.g., videos watched in the current session, last search query, time since last interaction) are computed by a Flink streaming job consuming from Kafka and written to a separate Redis cluster with sub-second update latency. At serving time, the Ranking Service issues a batch GET to Redis for all features of all 1,000 candidates; this is the most latency-sensitive operation in the request path.
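
At its simplest, that batch GET is one pipelined round trip to Redis per request. A minimal sketch with redis-py follows; the key naming scheme and host are assumptions for illustration (see Scaling & Bottlenecks for how this is collapsed further).

```python
# Pipelined feature fetch for all candidates in a single round trip (redis-py).
import redis

r = redis.Redis(host="feature-store.internal", port=6379)   # hypothetical host

def fetch_features(video_ids: list[str], feature_names: list[str]) -> dict[str, bytes]:
    keys = [f"feat:{vid}:{name}" for vid in video_ids for name in feature_names]
    pipe = r.pipeline(transaction=False)
    for key in keys:
        pipe.get(key)
    return dict(zip(keys, pipe.execute()))
```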

Model Training Pipeline

The ranking model is trained continuously on fresh engagement data. An ETL pipeline (Spark) processes the engagement event log (stored in Kafka → S3 → Delta Lake) to produce training examples: each example is a (user, video, context, label) tuple where labels are multi-task (clicked: 0/1, watch_fraction: 0-1, liked: 0/1). The model is trained on a GPU cluster (8×A100 nodes) using PyTorch with distributed data parallelism. A new model checkpoint is produced every 6 hours. The model is validated on a holdout set; if quality metrics (NDCG@10, watch time lift) regress, the checkpoint is rejected. Approved checkpoints are deployed to inference servers via a blue-green deployment with canary traffic.
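
A stripped-down sketch of such a multi-task model in PyTorch is shown below: shared lower layers feed task-specific heads, and the per-task losses are summed. The layer sizes, flattened feature dimension, and loss weighting are illustrative assumptions; distributed data parallelism and the real feature pipeline are omitted.

```python
# Skeleton of a multi-task ranking model with shared layers and per-task heads.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiTaskRanker(nn.Module):
    def __init__(self, feature_dim: int = 512):
        super().__init__()
        self.shared = nn.Sequential(                 # shared lower layers
            nn.Linear(feature_dim, 1024), nn.ReLU(),
            nn.Linear(1024, 256), nn.ReLU(),
        )
        self.heads = nn.ModuleDict({                 # task-specific heads
            "click": nn.Linear(256, 1),
            "watch_fraction": nn.Linear(256, 1),
            "like": nn.Linear(256, 1),
        })

    def forward(self, x: torch.Tensor) -> dict[str, torch.Tensor]:
        h = self.shared(x)
        return {task: head(h).squeeze(-1) for task, head in self.heads.items()}

model = MultiTaskRanker()
features = torch.randn(64, 512)                      # one toy training batch
labels = {"click": torch.randint(0, 2, (64,)).float(),
          "watch_fraction": torch.rand(64),
          "like": torch.randint(0, 2, (64,)).float()}

out = model(features)
loss = (F.binary_cross_entropy_with_logits(out["click"], labels["click"])
        + F.mse_loss(torch.sigmoid(out["watch_fraction"]), labels["watch_fraction"])
        + F.binary_cross_entropy_with_logits(out["like"], labels["like"]))
loss.backward()
```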

Database Design

User profiles (user_id, demographics, account_age, subscription_status) are stored in PostgreSQL. User watch history (user_id, video_id, watch_duration, timestamp) is stored in Cassandra partitioned by user_id with a TTL of 90 days — only recent history is relevant for recommendations. Video metadata (video_id, title, description, tags, creator_id, upload_time, duration, category) lives in Elasticsearch for text search and PostgreSQL for structured queries.
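
For the watch-history table, the write path might look like the sketch below (DataStax cassandra-driver); the keyspace, table, column names, and endpoint are illustrative assumptions, while the 90-day TTL matches the retention described above.

```python
# Hypothetical Cassandra write for watch history, partitioned by user_id with a 90-day TTL.
from cassandra.cluster import Cluster

session = Cluster(["cassandra.internal"]).connect("recsys")   # hypothetical endpoint

def record_watch(user_id: str, video_id: str, watch_duration_ms: int, ts) -> None:
    session.execute(
        "INSERT INTO watch_history (user_id, video_id, watch_duration, timestamp) "
        "VALUES (%s, %s, %s, %s) USING TTL 7776000",           # 90 days in seconds
        (user_id, video_id, watch_duration_ms, ts),
    )
```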

The embedding store uses a custom format: video embeddings (500M × 128 floats = 256GB uncompressed) are held in IVF-PQ-compressed FAISS indexes loaded into GPU memory on retrieval servers, while user embeddings (500M × 128 floats = 256GB) are stored for lookup and used as query vectors at serving time. Indexes are rebuilt nightly from the latest model checkpoint. Engagement events flow through Kafka (partitioned by user_id, 30-day retention) into S3 for batch processing and into Flink for real-time feature computation.

API Design

  • GET /api/v1/recommendations/home?user_id={id}&count=30&session_id={sid} — Fetch personalized home feed recommendations (example request after this list)
  • GET /api/v1/recommendations/related?video_id={id}&user_id={uid}&count=20 — Fetch related video recommendations (sidebar/end-screen)
  • POST /api/v1/events/engagement — Log an engagement event; body contains user_id, video_id, event_type (impression, click, watch, like), watch_duration_ms
  • GET /api/v1/recommendations/trending?region={code}&count=20 — Fetch trending videos for a region (non-personalized)
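
A hypothetical client call against the home-feed endpoint; the base URL, timeout, and response field are assumptions for illustration.

```python
# Illustrative client call for the home-feed recommendations endpoint.
import requests

resp = requests.get(
    "https://api.example.com/api/v1/recommendations/home",   # hypothetical base URL
    params={"user_id": "u_123", "count": 30, "session_id": "s_456"},
    timeout=0.2,   # comfortably above the 100ms p99 budget
)
resp.raise_for_status()
recommendations = resp.json().get("recommendations", [])     # assumed response field
```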

Scaling & Bottlenecks

The Feature Store batch GET is the primary latency bottleneck. Fetching 200 features for 1,000 candidates means 200K key lookups per request. Redis handles roughly 500K GETs/sec per node, so a single request's lookups consume about 40% of one node's per-second throughput. The solution is feature pre-aggregation: instead of storing individual features, all features for each video are pre-packed into a single Redis value (a serialized feature vector). This reduces 200K lookups to 1,000 lookups per request, a 200x reduction. User features, which are identical across all 1,000 candidates, are fetched once and broadcast.
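
A sketch of the pre-aggregation scheme: the writer packs all 200 features for a video into one serialized value, and the ranking path fetches 1,000 candidates with a single MGET. The key name, struct layout, and host are illustrative assumptions.

```python
# Feature pre-aggregation: one packed value per video instead of 200 separate keys.
import struct
import redis

r = redis.Redis(host="feature-store.internal", port=6379)   # hypothetical host
NUM_FEATURES = 200

def write_packed_features(video_id: str, features: list[float]) -> None:
    """Writer path (Spark/Flink jobs): serialize 200 floats into a single value."""
    r.set(f"video_feat:{video_id}", struct.pack(f"{NUM_FEATURES}f", *features))

def read_packed_features(video_ids: list[str]) -> dict[str, tuple[float, ...]]:
    """Ranking path: 1,000 candidates -> one MGET of 1,000 keys."""
    blobs = r.mget([f"video_feat:{vid}" for vid in video_ids])
    return {vid: struct.unpack(f"{NUM_FEATURES}f", blob)
            for vid, blob in zip(video_ids, blobs) if blob is not None}
```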

Model inference at 15K req/sec (each scoring 1,000 candidates) requires significant GPU compute. The ranking model runs on NVIDIA T4/A10 inference GPUs using TensorRT for optimized inference. Batching is critical: multiple user requests are batched together (batch size 64) to maximize GPU utilization. With TensorRT, scoring 1,000 candidates takes ~5ms on an A10 GPU. The inference fleet uses auto-scaling based on request queue depth; during off-peak hours, 60% of GPU instances are released.
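
Server-side micro-batching can be sketched as a small queue-and-flush loop: requests wait until either the batch fills or a few milliseconds elapse, then a single GPU call scores them all. The batch size, wait budget, request shape, and score_batch callable are illustrative assumptions; the production path would sit behind a TensorRT-optimized inference server.

```python
# Sketch of request micro-batching in front of the GPU ranking model.
import asyncio

BATCH_SIZE = 64
MAX_WAIT_MS = 5

queue: asyncio.Queue = asyncio.Queue()

async def batching_loop(score_batch) -> None:
    """Collect up to BATCH_SIZE pending requests (or wait MAX_WAIT_MS), score once."""
    while True:
        first = await queue.get()
        batch = [first]
        deadline = asyncio.get_running_loop().time() + MAX_WAIT_MS / 1000
        while len(batch) < BATCH_SIZE:
            remaining = deadline - asyncio.get_running_loop().time()
            if remaining <= 0:
                break
            try:
                batch.append(await asyncio.wait_for(queue.get(), remaining))
            except asyncio.TimeoutError:
                break
        scores = score_batch([req["features"] for req in batch])   # one GPU call
        for req, score in zip(batch, scores):
            req["future"].set_result(score)                        # unblock each caller
```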

Key Trade-offs

  • Multi-stage funnel vs single-model end-to-end: The funnel approach (cheap retrieval + expensive ranking) runs the full model on only ~1,000 candidates instead of all 500M videos, cutting ranking compute by orders of magnitude, at the cost of potentially missing relevant long-tail videos that the retrievers don't surface
  • Real-time features vs batch-only: Real-time features (current session context) improve recommendation relevance by 15% but add complexity (Flink pipeline, low-latency Redis cluster) — justified by the watch time uplift
  • Multi-task learning vs single-objective: Predicting multiple engagement signals (click, watch, like) with a shared model captures richer preferences than optimizing for clicks alone — prevents clickbait optimization
  • 6-hour model refresh vs continuous learning: More frequent model updates capture trending content faster but increase training infrastructure cost and deployment risk — 6-hour cadence balances freshness with stability
