System Design: Social Feed Ranking Algorithm
A design for a social feed ranking system covering candidate retrieval, ML-based scoring, real-time personalization, and the infrastructure required to rank feeds for hundreds of millions of daily users in under 200ms.
Requirements
Functional Requirements:
- Rank posts from followed accounts and recommended sources into a personalized feed
- Score each candidate post based on predicted user engagement (like, comment, share, dwell time)
- Support multiple feed types (Home, Explore, Following-only chronological)
- Incorporate real-time signals (user just liked a post about cooking → boost food content)
- Inject ads and sponsored content at configurable positions
- Allow users to provide negative feedback ("See less like this", "Not interested")
Non-Functional Requirements:
- Generate ranked feeds for 500M DAU at an average of 10 feed loads per day, i.e. 5 billion feed requests/day
- End-to-end ranking latency under 200ms (p99) including candidate retrieval and scoring
- Model freshness: incorporate engagement signals within 30 seconds of user action
- 99.99% availability; feed is the core product experience
- A/B testing infrastructure: support 50 concurrent ranking experiments
Scale Estimation
5 billion feed requests/day equals approximately 58K requests/sec. Each feed request retrieves 500 candidate posts from multiple sources, scores them with an ML model, and returns the top 20. That is roughly 29 million model inferences/sec (58K requests/sec × 500 candidates), or about 2.5 trillion inferences/day. At 0.5ms per candidate on a GPU, scoring 500 candidates sequentially would take 250ms, which alone exceeds the latency budget, so each request's candidates must be scored in batched, parallelized forward passes. Assuming a single batched forward pass over 500 candidates takes approximately 15ms for a transformer model on an A100 GPU, the fleet needs approximately 870 A100 GPUs (58K requests/sec × 15ms per request) to handle the average load, with additional headroom for peaks.
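As a sanity check, the arithmetic above can be reproduced directly (the 15ms batch latency is the assumed figure from the estimate, not a measured one):

```python
# Back-of-the-envelope check of the scale estimates above.
DAU = 500_000_000
LOADS_PER_DAY = 10
CANDIDATES_PER_REQUEST = 500
BATCH_LATENCY_MS = 15  # assumed: one 500-candidate forward pass on an A100

requests_per_sec = DAU * LOADS_PER_DAY / 86_400                 # ~57.9K
inferences_per_sec = requests_per_sec * CANDIDATES_PER_REQUEST  # ~28.9M
gpus_needed = requests_per_sec * BATCH_LATENCY_MS / 1000        # GPU-seconds per second

print(f"{requests_per_sec:,.0f} req/s, {inferences_per_sec / 1e6:.0f}M inferences/s, "
      f"~{gpus_needed:,.0f} A100s at full utilization")
```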
High-Level Architecture
The feed ranking system follows a multi-stage funnel architecture with four stages: source selection, candidate retrieval, scoring, and post-processing. Stage 1 (Source Selection): the Feed Orchestrator determines which content sources to query based on the user's follow graph, interest profile, and feed type. Sources include: Following Source (posts from followed accounts), Recommendation Source (posts from non-followed accounts predicted to be interesting), and Ad Source (sponsored content from the ad auction).
Stage 2 (Candidate Retrieval): each source returns a set of candidate posts. The Following Source queries a pre-materialized Feed Cache (Redis sorted set of post_ids written by a fan-out service). The Recommendation Source queries an Approximate Nearest Neighbor (ANN) index (FAISS on GPU) using the user's interest embedding to retrieve semantically similar posts from the broader content pool. The Ad Source queries the Ad Ranking Service which runs its own auction. All sources return candidates in parallel, and the Orchestrator merges them into a single candidate pool of ~500 posts.
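A minimal sketch of the Recommendation Source's embedding lookup. A flat (exact) inner-product index is used here for brevity; a production deployment would use a genuinely approximate structure (IVF or HNSW) sharded across GPUs. The pool size and retrieval depth are illustrative:

```python
import numpy as np
import faiss  # pip install faiss-gpu

DIM = 128  # matches the embedding size used by the ranking model

# Index over the recommendable content pool's item embeddings. IndexFlatIP
# is exact; production would swap in an approximate index (IVF/HNSW).
item_embeddings = np.random.rand(1_000_000, DIM).astype("float32")
index = faiss.IndexFlatIP(DIM)  # inner-product similarity
index.add(item_embeddings)

def retrieve_recommendations(user_embedding: np.ndarray, k: int = 200) -> list[int]:
    """Return indices of the k posts whose embeddings best match the user."""
    _scores, ids = index.search(user_embedding.reshape(1, DIM).astype("float32"), k)
    return ids[0].tolist()

candidates = retrieve_recommendations(np.random.rand(DIM))
```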
Stage 3 (Scoring): the candidate pool is sent to the Ranking Service, which runs a two-tower neural network model. The user tower encodes user features (demographics, historical engagement patterns, session context, real-time signals) into a 128-dimensional user embedding. The item tower encodes post features (content type, creator quality score, engagement velocity, freshness) into a 128-dimensional item embedding. The final score is computed as a weighted combination of dot product similarity and a cross-attention head that captures user-item interaction features. The model predicts multiple objectives: P(like), P(comment), P(share), P(long_dwell), and P(negative_feedback). The final ranking score is a weighted sum of these predictions calibrated by business objectives.
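A simplified PyTorch sketch of the scoring model. The tower layer sizes, the per-objective heads (standing in for the cross-attention block described above), and the objective weights are all illustrative assumptions, not production values:

```python
import torch
import torch.nn as nn

OBJECTIVES = ["like", "comment", "share", "long_dwell", "negative_feedback"]

class TwoTowerRanker(nn.Module):
    def __init__(self, user_dim: int, item_dim: int, emb_dim: int = 128):
        super().__init__()
        self.user_tower = nn.Sequential(
            nn.Linear(user_dim, 256), nn.ReLU(), nn.Linear(256, emb_dim))
        self.item_tower = nn.Sequential(
            nn.Linear(item_dim, 256), nn.ReLU(), nn.Linear(256, emb_dim))
        # Per-objective heads over the concatenated pair, a stand-in for the
        # dot-product + cross-attention combination described above.
        self.heads = nn.ModuleDict(
            {name: nn.Linear(2 * emb_dim, 1) for name in OBJECTIVES})

    def forward(self, user_feats: torch.Tensor, item_feats: torch.Tensor):
        u = self.user_tower(user_feats)   # [1, emb_dim] - one user
        v = self.item_tower(item_feats)   # [N, emb_dim] - N candidates
        pair = torch.cat([u.expand(v.size(0), -1), v], dim=-1)
        return {name: torch.sigmoid(head(pair)).squeeze(-1)
                for name, head in self.heads.items()}

# Illustrative business weights; negative feedback is penalized, not rewarded.
WEIGHTS = {"like": 1.0, "comment": 4.0, "share": 6.0,
           "long_dwell": 2.0, "negative_feedback": -10.0}

def final_score(preds: dict) -> torch.Tensor:
    """Weighted sum of per-objective probabilities, as described above."""
    return sum(WEIGHTS[k] * preds[k] for k in OBJECTIVES)
```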
Stage 4 (Post-processing): the top-scored candidates go through a policy layer that enforces diversity (no more than 3 consecutive posts from the same creator), freshness constraints (at least 30% of feed items from the last 4 hours), ad insertion (every 5th position is an ad slot), and negative feedback filtering (suppress content similar to items the user flagged "not interested").
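The diversity and ad-insertion rules reduce to simple list passes. A sketch under assumed data shapes (the Post class and function signatures are hypothetical; freshness and negative-feedback filtering are omitted for brevity):

```python
from dataclasses import dataclass

@dataclass
class Post:  # hypothetical minimal candidate shape
    post_id: int
    creator_id: int
    score: float

def apply_diversity(ranked: list[Post], max_run: int = 3) -> list[Post]:
    """Demote (not drop) posts that would create more than max_run
    consecutive posts from the same creator."""
    feed, deferred = [], []
    run_creator, run_len = None, 0
    for post in ranked:
        if post.creator_id == run_creator and run_len >= max_run:
            deferred.append(post)
            continue
        run_len = run_len + 1 if post.creator_id == run_creator else 1
        run_creator = post.creator_id
        feed.append(post)
    # NOTE: if deferred posts share a creator with the tail, a production
    # pass would re-apply the rule rather than append blindly.
    return feed + deferred

def insert_ads(feed: list[Post], ads: list[Post], every: int = 5) -> list[Post]:
    """Fill every `every`-th slot with an ad, falling back to organic
    posts once the ad inventory is exhausted."""
    out, organic, ad_queue = [], iter(feed), list(ads)
    pos = 1
    while True:
        if pos % every == 0 and ad_queue:
            out.append(ad_queue.pop(0))
        else:
            post = next(organic, None)
            if post is None:
                break
            out.append(post)
        pos += 1
    return out
```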
Core Components
Feature Store
The Feature Store provides low-latency access to user and item features for the ranking model. It has two tiers: a real-time tier (Redis) storing session-level signals updated within seconds of user actions (last 10 posts viewed, last 5 posts liked, current session duration, device type), and a batch tier (Feast on Redis) storing pre-computed features updated hourly by Spark jobs (user interest vector from 30-day engagement history, creator quality score, user demographic cluster). The Feature Store serves approximately 58K feature fetch requests/sec, each returning ~200 features totaling 4KB. Redis cluster with 50 nodes handles this with sub-1ms p99 latency.
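A sketch of the two-tier fetch. Feast exposes its own client API for the batch tier; the raw-Redis version below only illustrates the merge semantics, with key names and JSON encoding as assumptions:

```python
import json
import redis

rt_tier = redis.Redis(host="feature-rt", decode_responses=True)     # session signals
batch_tier = redis.Redis(host="feature-batch", decode_responses=True)  # hourly Spark output

def fetch_user_features(user_id: int) -> dict:
    """Merge batch features with real-time session features; the
    real-time tier wins when the same feature exists in both."""
    batch_blob = batch_tier.get(f"features:batch:{user_id}")
    rt_blob = rt_tier.get(f"features:rt:{user_id}")
    features = json.loads(batch_blob) if batch_blob else {}
    features.update(json.loads(rt_blob) if rt_blob else {})
    return features
```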
ML Scoring Service
The Scoring Service runs on a fleet of GPU inference servers (NVIDIA Triton Inference Server on A100 GPUs). Each request batches 500 candidate items and runs a single forward pass through the ranking model. The model architecture: a two-tower transformer with 50M parameters, trained on 90 days of engagement logs (positive: liked/commented/shared/watched >10s; negative: scrolled past quickly, flagged). The model is retrained daily and deployed via shadow scoring: new models run alongside production for 4 hours, comparing ranking quality via offline metrics (NDCG, AUC) before promotion. During A/B tests, traffic is split at the user level — each user is deterministically assigned to a model variant via consistent hashing on user_id.
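A sketch of the deterministic user-level variant assignment. Variant names and splits are made up; hashing the (experiment, user_id) pair keeps assignments stable for a given user while staying independent across experiments:

```python
import hashlib

VARIANTS = ["model_v42_prod", "model_v43_shadow_a", "model_v43_shadow_b"]
SPLITS = [9000, 500, 500]  # traffic split in basis points (illustrative)

def assign_variant(user_id: int, experiment: str) -> str:
    """Deterministically map a user to a model variant via a stable hash."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).digest()
    bucket = int.from_bytes(digest[:4], "big") % 10_000
    cumulative = 0
    for variant, share in zip(VARIANTS, SPLITS):
        cumulative += share
        if bucket < cumulative:
            return variant
    return VARIANTS[0]
```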
Real-Time Signal Ingestion
Real-time signals are critical for feed personalization within a session. When a user likes a food photo, subsequent feed loads should boost food content. An Event Stream Processor (Flink) consumes engagement events from Kafka, updates the user's real-time feature vector in Redis (a list of recent engagement signals with timestamps), and triggers a re-ranking of pre-fetched but not-yet-displayed feed items. The target signal propagation latency is 30 seconds from user action to updated feed ranking. The budget breaks down as Kafka produce latency under 10ms, Flink processing under 5ms, and the Redis write under 5ms, with the remainder covering the time until the next feed request picks up the updated features.
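The core of the ingestion job, sketched with a plain Kafka consumer for readability (the production job runs in Flink; the topic name, key layout, list bound, and TTL are assumptions):

```python
import json
import time
import redis
from kafka import KafkaConsumer  # pip install kafka-python

r = redis.Redis(host="feature-rt")
consumer = KafkaConsumer(
    "engagement-events",               # assumed topic name
    bootstrap_servers="kafka:9092",
    value_deserializer=lambda b: json.loads(b))

MAX_SIGNALS = 50  # bound the per-user recency list (assumed)

for msg in consumer:
    event = msg.value  # e.g. {"user_id": 7, "action": "like", "topic_tag": "food"}
    key = f"features:rt:signals:{event['user_id']}"
    pipe = r.pipeline()
    pipe.lpush(key, json.dumps({**event, "ts": time.time()}))  # newest first
    pipe.ltrim(key, 0, MAX_SIGNALS - 1)                        # keep last N
    pipe.expire(key, 3600)  # session signals expire after an hour (assumed)
    pipe.execute()
```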
Database Design
The Feed Cache stores pre-materialized candidate lists in Redis sorted sets. Key: feed:{user_id}, members: post_ids, scores: post timestamps. This cache is populated by a Fan-Out Service that writes new post_ids to followers' feed caches on publish. Maximum cache size: 1,000 items per user, evicting the oldest. For users who follow many active accounts, only posts from the top 500 most-engaged-with creators are included (determined by a batch affinity score). Redis memory: 500M users × 1,000 items × 8 bytes ≈ 4TB of raw post_id data (the 8-byte timestamp scores and sorted-set overhead push the real footprint several times higher), spread across a 500-node Redis Cluster.
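A sketch of the fan-out write path against the sorted-set layout described above (a single-node client is used for readability; production is the 500-node cluster):

```python
import redis

r = redis.Redis(host="feed-cache")  # production: 500-node Redis Cluster

MAX_FEED_ITEMS = 1000

def fan_out(post_id: int, created_ts: float, follower_ids: list[int]) -> None:
    """On publish, push the post into each follower's feed cache.
    Score = post timestamp, so ZREVRANGE reads newest-first.
    Celebrity accounts skip this path (see Scaling & Bottlenecks)."""
    for uid in follower_ids:
        key = f"feed:{uid}"
        pipe = r.pipeline()
        pipe.zadd(key, {str(post_id): created_ts})
        pipe.zremrangebyrank(key, 0, -(MAX_FEED_ITEMS + 1))  # evict oldest
        pipe.execute()
```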
The engagement log (training data for the ranking model) is stored in a data lake on S3 in Parquet format, partitioned by date. Each row contains: user_id, post_id, action (impression/like/comment/share/skip/hide), timestamp, session_id, position_in_feed, and all features at the time of scoring. This log grows at approximately 50 billion rows/day (10 actions per feed load × 5B loads). A daily Spark job processes this into training datasets, sampling negatives at a 1:10 positive-to-negative ratio. The trained model artifacts (PyTorch checkpoints) are stored in S3 and loaded by Triton on deployment.
API Design
- GET /api/v1/feed?type=home&cursor={encoded_cursor}&limit=20: fetch the next page of the ranked feed; returns post objects with ranking metadata (score, source, model_version)
- POST /api/v1/feed/signals: submit a batch of client-side engagement signals (dwell times, scroll velocity, viewport visibility); used for real-time feature updates
- POST /api/v1/feed/feedback: submit negative feedback on a feed item; body contains post_id and feedback_type (not_interested, see_less, hide); updates the user preference model
- GET /api/v1/feed/debug?user_id={id}&post_id={id}: internal endpoint for ranking engineers; returns the full feature vector and per-objective scores explaining why a post was ranked at a given position
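An illustrative client call against the feed endpoint; the host, auth scheme, and exact response shape are assumptions consistent with the fields listed above:

```python
import requests  # pip install requests

resp = requests.get(
    "https://api.example.com/api/v1/feed",        # assumed host
    params={"type": "home", "cursor": "", "limit": 20},
    headers={"Authorization": "Bearer <token>"},  # assumed auth scheme
    timeout=1.0,
)
page = resp.json()
# Plausible response shape, matching the ranking metadata described above:
# {"posts": [{"post_id": "...", "score": 0.87, "source": "following",
#             "model_version": "v42"}, ...],
#  "next_cursor": "opaque-pagination-token"}
```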
Scaling & Bottlenecks
The GPU inference fleet is the most expensive and hardest-to-scale component. At 870 A100 GPUs, the annual compute cost exceeds $30M. Optimization strategies: (1) model distillation — a smaller 10M parameter student model handles the first-pass scoring of 500→100 candidates on CPU, and the full 50M parameter teacher model scores only the top 100 on GPU, reducing GPU load by 5x; (2) quantization — INT8 quantization of the ranking model doubles throughput per GPU with <0.5% ranking quality loss; (3) result caching — if a user refreshes the feed within 60 seconds, serve the cached ranking with only the top positions re-scored for freshness.
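A sketch of the distillation cascade from strategy (1); the model interfaces are hypothetical:

```python
def cascade_score(candidates: list, student, teacher, cut: int = 100) -> list:
    """Two-pass cascade: a cheap distilled student prunes 500 -> `cut`
    candidates on CPU, then the full teacher scores the survivors on GPU.
    `student.score` / `teacher.score` are hypothetical interfaces that
    return one score per candidate."""
    rough = student.score(candidates)                # fast, approximate
    survivors = [c for _, c in sorted(
        zip(rough, candidates), key=lambda p: p[0], reverse=True)[:cut]]
    precise = teacher.score(survivors)               # expensive, accurate
    return [c for _, c in sorted(
        zip(precise, survivors), key=lambda p: p[0], reverse=True)]
```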
The Fan-Out Service for feed cache population faces the celebrity problem: when a user with 100M followers posts, writing to 100M Redis sorted sets takes significant time and bandwidth. The solution is lazy fan-out for celebrity content: their posts are not written to individual feed caches but are fetched at ranking time from a separate Celebrity Posts Cache and merged with the user's precomputed feed candidates. This bounds the fan-out to O(1) per celebrity post while keeping the ranking pipeline unaware of the optimization.
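A sketch of the read-time merge for lazy fan-out; the celebrity cache key and retrieval depths are assumptions:

```python
import redis

r = redis.Redis(host="feed-cache")

def get_candidates(user_id: int, followed_celebrities: list[int]) -> list:
    """Read-time merge: the user's precomputed feed cache plus recent posts
    from each followed celebrity, fetched in one pipelined round trip."""
    pipe = r.pipeline()
    pipe.zrevrange(f"feed:{user_id}", 0, 499)                 # newest 500 cached
    for celeb_id in followed_celebrities:
        pipe.zrevrange(f"celebrity_posts:{celeb_id}", 0, 19)  # their last 20 posts
    results = pipe.execute()
    # Deduplicate while merging; downstream scoring handles ordering.
    return list({pid for batch in results for pid in batch})
```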
Key Trade-offs
- Multi-objective scoring vs. single engagement metric: Predicting multiple objectives (like, comment, share, dwell) with a weighted combination allows business-level tuning of feed quality vs. engagement — the trade-off is increased model complexity and the need for careful weight calibration
- Two-tower architecture vs. cross-encoder: Two-tower models allow pre-computing user and item embeddings independently (enabling ANN retrieval), but miss fine-grained user-item interactions that cross-encoders capture — the hybrid approach (two-tower + cross-attention) balances efficiency and quality
- Fan-out on write vs. read for feed candidates: Write fan-out gives instant feed loads but is prohibitive for celebrities; read fan-out reduces write amplification but adds latency — the hybrid model with celebrity-specific lazy evaluation is the industry standard
- Daily model retraining vs. online learning: Daily retraining captures macro trends while real-time feature updates capture micro trends within a session — full online learning would be more responsive but is operationally risky (feedback loops, adversarial manipulation)