
System Design: Facebook

End-to-end system design of Facebook's core platform including News Feed, social graph, photo storage, and real-time notifications at 3 billion user scale.

20 min read · Updated Jan 15, 2025
Tags: system-design, facebook, social-media, news-feed

Requirements

Functional Requirements:

  • Users create profiles, post text/photos/videos, and interact via likes and comments
  • Users have friends (bidirectional) and can follow pages/groups
  • A ranked News Feed aggregates content from friends and followed entities
  • Real-time notifications for interactions
  • Groups and Pages with their own feeds
  • Marketplace, Events, and Messenger (separate subsystems)

Non-Functional Requirements:

  • 3 billion MAU, 2 billion DAU — largest social network by users
  • News Feed must load under 300ms; real-time notifications within 2 seconds
  • 99.99% availability; zero data loss for user posts
  • Strong consistency for profile data; eventual consistency for feed

Scale Estimation

2B DAU × 10 feed loads/day = 20B feed reads/day = 231K reads/sec. Post writes: 500M posts/day = ~5,800 writes/sec. Photos: 350M photos uploaded daily = ~40TB/day at 120KB average (after compression). With 3B users and average 300 friends each, the social graph has 450 billion edges — this is TAO's primary design challenge. Total storage including all media exceeds 100 exabytes.
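
The arithmetic above can be reproduced as a quick back-of-envelope script (the inputs are the article's estimates, not measured values):

```python
# Back-of-envelope capacity estimates using the figures above.

SECONDS_PER_DAY = 86_400

dau = 2_000_000_000
feed_loads_per_user = 10
feed_reads_per_sec = dau * feed_loads_per_user // SECONDS_PER_DAY   # ~231K/sec

posts_per_day = 500_000_000
post_writes_per_sec = posts_per_day // SECONDS_PER_DAY              # ~5,800/sec

photos_per_day = 350_000_000
avg_photo_bytes = 120 * 1024            # 120 KB average after compression
photo_tb_per_day = photos_per_day * avg_photo_bytes / 1024**4       # ~40 TB/day

users = 3_000_000_000
avg_friends = 300
graph_edges = users * avg_friends // 2  # undirected friendships -> 450B edges

print(feed_reads_per_sec, post_writes_per_sec, round(photo_tb_per_day), graph_edges)
```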

High-Level Architecture

Facebook's platform is built around three foundational systems. First, TAO (The Associations and Objects), a distributed graph database that stores all social graph data — friendships, likes, group memberships — as typed objects and associations. TAO provides a geographically distributed cache layer over MySQL, handling 99% of social graph reads from RAM. Second, the News Feed service uses a multi-stage ranking pipeline: candidate generation pulls from a user's social connections, a lightweight scorer filters candidates, and a heavy ML ranker (trained on engagement signals) produces the final ordered feed. Third, the media pipeline: Haystack object store handles billions of photos with a custom blob storage layer optimized for small random reads.

The request path for a News Feed load: API Gateway → News Feed Service → fan-out cache (a regional Memcache cluster) → TAO for social graph lookups → MySQL read replicas for post content. Memcache is accessed through mcrouter, Facebook's open-sourced routing proxy, with a regional leader-follower replication model; the 'thundering herd' problem on cache misses is handled via lease tokens, a server-side extension Facebook added to memcached and described in its Memcache scaling work. (Tectonic, sometimes mentioned in this context, is Facebook's exabyte-scale distributed filesystem, not a Memcache tier.)
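
The lease mechanism can be sketched as follows. This is a simplified single-process model, not the real memcached leases, which are server-issued tokens with arbitration and expiry; class and key names are illustrative:

```python
import threading

class LeaseCache:
    """Toy cache with lease tokens to prevent thundering herds on misses."""
    def __init__(self):
        self._data = {}
        self._leases = {}          # key -> outstanding lease token
        self._lock = threading.Lock()
        self._next_token = 0

    def get(self, key):
        """Return (value, lease). On a miss, only the first caller receives a
        lease (permission to fill the key); others get neither and retry."""
        with self._lock:
            if key in self._data:
                return self._data[key], None
            if key in self._leases:          # someone is already refilling
                return None, None
            self._next_token += 1
            self._leases[key] = self._next_token
            return None, self._next_token

    def set(self, key, value, lease):
        """Only the current lease holder may fill the key."""
        with self._lock:
            if self._leases.get(key) == lease:
                self._data[key] = value
                del self._leases[key]
                return True
            return False

cache = LeaseCache()
value, lease = cache.get("feed:42")       # miss: caller 1 gets a lease
_, other = cache.get("feed:42")           # concurrent miss: no lease issued
assert lease is not None and other is None
cache.set("feed:42", ["post1", "post2"], lease)   # lease holder fills the key
value, _ = cache.get("feed:42")           # now a cache hit
```

Only one caller per key hits the database on a miss; everyone else backs off until the lease holder repopulates the entry.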

Core Components

TAO (Social Graph Store)

TAO is a geographically distributed data store for the social graph. It stores objects (users, pages, posts) and associations (friend edges, likes, comments) with a read-through cache over MySQL. TAO uses a follower cache tier in each datacenter that replicates from a primary region. Read operations are served locally; writes go to the primary region and propagate asynchronously. The cache uses LRU eviction with short TTLs for associations to balance freshness vs. load.
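
TAO's data model reduces to a small API over objects and typed associations. A minimal in-memory sketch — the method names follow the TAO paper's assoc_get/assoc_add, but everything else here is illustrative:

```python
from collections import defaultdict
import time

class Tao:
    """Toy TAO: typed objects plus typed, time-ordered associations."""
    def __init__(self):
        self.objects = {}                  # id -> (otype, data dict)
        self.assocs = defaultdict(list)    # (id1, atype) -> [(time, id2)]

    def obj_add(self, oid, otype, **data):
        self.objects[oid] = (otype, data)

    def assoc_add(self, id1, atype, id2, t=None):
        self.assocs[(id1, atype)].append((t or time.time(), id2))

    def assoc_get(self, id1, atype, limit=10):
        """Associations newest-first — the hot path TAO serves from cache."""
        edges = sorted(self.assocs[(id1, atype)], reverse=True)
        return [id2 for _, id2 in edges[:limit]]

    def assoc_count(self, id1, atype):
        return len(self.assocs[(id1, atype)])

tao = Tao()
tao.obj_add(1, "user", name="alice")
tao.obj_add(2, "user", name="bob")
tao.assoc_add(1, "friend", 2, t=100)
tao.assoc_add(2, "friend", 1, t=100)   # friendship stored in both directions
tao.assoc_add(1, "likes", 900, t=200)  # alice likes post 900
assert tao.assoc_get(1, "friend") == [2]
assert tao.assoc_count(1, "likes") == 1
```

Note that a bidirectional friendship is two directed associations, which is why the edge count in the scale estimate halves the users-times-friends product.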

News Feed Ranking Pipeline

News Feed generation runs in three stages: (1) Retrieval — pull the top N candidate posts from followed friends/pages using a lightweight linear model over freshness, affinity, and content-type scores; (2) Ranking — apply a heavy deep-learning model to score each candidate on predicted engagement (like, comment, and share probabilities); (3) Filtering — apply policy rules (deduplication, diversity constraints, ad injection) and assemble the final feed. Results are cached per user with a short TTL.
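
The three stages can be sketched as a pipeline. The scoring functions and feature weights here are stand-ins for the real models, and the post features are invented:

```python
def retrieve(candidates, n=500):
    """Stage 1: lightweight linear score over cheap features."""
    def light_score(p):
        return 0.5 * p["freshness"] + 0.3 * p["affinity"] + 0.2 * p["type_score"]
    return sorted(candidates, key=light_score, reverse=True)[:n]

def rank(candidates, n=25):
    """Stage 2: heavy model — stand-in combines predicted engagement probs."""
    def heavy_score(p):
        return p["p_like"] + 2 * p["p_comment"] + 3 * p["p_share"]
    return sorted(candidates, key=heavy_score, reverse=True)[:n]

def filter_feed(ranked):
    """Stage 3: policy — a crude one-post-per-author diversity rule."""
    feed, seen_authors = [], set()
    for p in ranked:
        if p["author"] in seen_authors:
            continue
        seen_authors.add(p["author"])
        feed.append(p)
    return feed

posts = [
    {"id": 1, "author": "a", "freshness": 0.9, "affinity": 0.5, "type_score": 0.7,
     "p_like": 0.3, "p_comment": 0.1, "p_share": 0.05},
    {"id": 2, "author": "a", "freshness": 0.4, "affinity": 0.9, "type_score": 0.2,
     "p_like": 0.2, "p_comment": 0.05, "p_share": 0.01},
    {"id": 3, "author": "b", "freshness": 0.6, "affinity": 0.6, "type_score": 0.9,
     "p_like": 0.5, "p_comment": 0.2, "p_share": 0.1},
]
feed = filter_feed(rank(retrieve(posts)))
print([p["id"] for p in feed])   # post 2 dropped: same author as post 1
```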

Haystack Photo Storage

Haystack is Facebook's custom blob store optimized for the many-small-files problem. Traditional NAS systems carry too much metadata overhead for billions of 50KB photos. Haystack aggregates many photos into large logical volumes (~100GB each), storing a compact in-memory index mapping photo ID to byte offset within the volume. This reduces metadata lookups from a multi-step NFS path traversal to a single in-memory hash-table lookup, so each photo read costs at most one disk seek.
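
The core idea — one in-memory hash lookup per read — can be sketched with an append-only volume of "needles" (photo records). The on-disk record layout here is invented for illustration; the real needle format carries more fields (flags, checksum, cookie):

```python
import io, struct

class HaystackVolume:
    """Toy Haystack volume: photos appended to one large file as 'needles',
    with an in-memory index photo_id -> (offset, size). One hash lookup plus
    one sequential read replaces an NFS path traversal per photo."""
    HEADER = struct.Struct("<QI")   # photo_id (u64), payload size (u32)

    def __init__(self):
        self.store = io.BytesIO()   # stands in for a ~100GB volume file
        self.index = {}

    def put(self, photo_id, data):
        offset = self.store.seek(0, io.SEEK_END)
        self.store.write(self.HEADER.pack(photo_id, len(data)))
        self.store.write(data)
        self.index[photo_id] = (offset, len(data))

    def get(self, photo_id):
        offset, size = self.index[photo_id]           # single hash lookup
        self.store.seek(offset + self.HEADER.size)    # one seek into the volume
        return self.store.read(size)

vol = HaystackVolume()
vol.put(101, b"\x89PNG...photo-bytes")
vol.put(102, b"\xff\xd8JPEG-bytes")
assert vol.get(101) == b"\x89PNG...photo-bytes"
assert vol.get(102) == b"\xff\xd8JPEG-bytes"
```

Because the index fits in RAM, a read never touches disk for metadata — only for the photo bytes themselves.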

Database Design

User profile data and post content live in a sharded MySQL cluster (hundreds of shards). Sharding key is user_id; posts are stored in a Posts table with columns: post_id (BIGINT), user_id, content (TEXT), media_refs (JSON array of Haystack URLs), created_at, privacy_setting. A separate Comments table and Reactions table reference post_id. TAO sits in front of MySQL for all social graph queries, serving ~1 billion reads/sec across the fleet from DRAM caches.
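
Routing by user_id can be sketched with simple hash sharding. The shard count and hash choice here are illustrative — production systems typically map users to logical shards that can be rebalanced independently of physical hosts:

```python
import hashlib

NUM_SHARDS = 256   # illustrative; the text says "hundreds of shards"

def shard_for(user_id: int) -> int:
    """Stable hash of the sharding key -> MySQL shard index."""
    digest = hashlib.md5(str(user_id).encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS

def posts_query(user_id: int) -> str:
    """All of a user's posts live on one shard, so profile + timeline
    reads never fan out across shards."""
    return (f"-- route to shard {shard_for(user_id)}\n"
            f"SELECT post_id, content, media_refs, created_at "
            f"FROM Posts WHERE user_id = {user_id} ORDER BY created_at DESC")

assert 0 <= shard_for(12345) < NUM_SHARDS
assert shard_for(12345) == shard_for(12345)   # deterministic routing
```

Sharding by user_id keeps single-user reads local, but cross-user queries (like feed assembly) must go through TAO or scatter-gather, which is exactly why the graph layer sits in front of MySQL.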

For session data and ephemeral state, a global Memcache fleet with ~28TB RAM per region handles authentication tokens, user settings, and feed cache. ZooKeeper manages cluster membership and configuration. A separate Scribe/LogDevice pipeline ingests all activity logs into Hadoop/Hive for analytics and ML feature generation.

API Design

  • POST /v1/posts — Create a post with text and optional media_ids; returns post object
  • GET /v1/feed?fields=id,message,from,created_time&limit=25&after={cursor} — Fetch ranked News Feed
  • POST /v1/{object_id}/reactions — Add a reaction (Like, Love, Haha, etc.) to an object
  • GET /v1/{user_id}/friends?limit=100&after={cursor} — Fetch friend list via TAO
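
Cursor-based pagination for the feed endpoint can be sketched from the client side. The endpoint shape follows the list above, but `fetch_page` is a hypothetical transport stubbed with canned data:

```python
def fetch_page(after=None, limit=25):
    """Hypothetical transport for GET /v1/feed — stubbed with canned pages."""
    pages = {
        None: ({"data": [{"id": "p1"}, {"id": "p2"}]}, "cursor_a"),
        "cursor_a": ({"data": [{"id": "p3"}]}, None),
    }
    body, next_cursor = pages[after]
    body["paging"] = {"cursors": {"after": next_cursor}} if next_cursor else {}
    return body

def iter_feed():
    """Walk the feed until the server stops returning an 'after' cursor."""
    cursor = None
    while True:
        page = fetch_page(after=cursor)
        yield from page["data"]
        cursor = page.get("paging", {}).get("cursors", {}).get("after")
        if not cursor:
            break

print([p["id"] for p in iter_feed()])
```

Opaque cursors (rather than numeric offsets) are what make pagination stable while the underlying ranked feed keeps changing between requests.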

Scaling & Bottlenecks

The News Feed ranking is computationally expensive — running a deep neural network over hundreds of candidates per user per feed load. Facebook addresses this with a tiered ranking approach: a fast linear model pre-filters candidates down to roughly the top 500, and the heavy model then re-ranks those 500 into the final top 25. The heavy model runs on GPU inference clusters with sub-10ms p99 latency. Results are cached per user for 5 minutes; background jobs refresh the cache proactively before it expires.
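
The proactive-refresh idea can be sketched as a TTL cache with a refresh window. The class, timings, and refresh margin below are all illustrative:

```python
import time

class FeedCache:
    """Per-user feed cache with a TTL plus a proactive-refresh window:
    entries older than (ttl - refresh_margin) are flagged for background
    regeneration so users rarely hit a cold cache."""
    def __init__(self, ttl=300.0, refresh_margin=60.0):
        self.ttl = ttl
        self.refresh_margin = refresh_margin
        self._entries = {}   # user_id -> (feed, stored_at)

    def put(self, user_id, feed, now=None):
        self._entries[user_id] = (feed, now if now is not None else time.time())

    def get(self, user_id, now=None):
        """Return (feed, needs_refresh); (None, True) means a hard miss."""
        now = now if now is not None else time.time()
        feed, stored_at = self._entries.get(user_id, (None, None))
        if feed is None or now - stored_at >= self.ttl:
            return None, True            # miss: must rank synchronously
        needs_refresh = now - stored_at >= self.ttl - self.refresh_margin
        return feed, needs_refresh       # hit; maybe schedule background job

cache = FeedCache(ttl=300, refresh_margin=60)
cache.put("u1", ["p9", "p7"], now=0)
assert cache.get("u1", now=100) == (["p9", "p7"], False)   # fresh hit
assert cache.get("u1", now=250) == (["p9", "p7"], True)    # hit, refresh soon
assert cache.get("u1", now=400) == (None, True)            # expired: hard miss
```

The refresh flag lets a background job re-rank while the user still gets the cached feed, trading slight staleness for consistently low read latency.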

The TAO cache fleet is the single largest infrastructure component, handling the read amplification of social graph traversals. Cache invalidation uses a publishing model: writes to MySQL emit invalidation messages to a distributed queue (Wormhole, Facebook's CDC system), which fan out to all regional TAO caches. The 'cold start' problem for new datacenters is handled by bulk-loading cache from MySQL snapshots before taking live traffic.
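
The write path with asynchronous invalidation can be sketched as below, with Wormhole replaced by an in-process queue; all names are illustrative:

```python
from collections import deque

class RegionCache:
    """A regional follower cache that only ever invalidates, never updates."""
    def __init__(self):
        self.data = {}
    def invalidate(self, key):
        self.data.pop(key, None)

class PrimaryRegion:
    """Writes land in the primary's MySQL; a CDC-style queue fans
    invalidations out to every regional follower cache."""
    def __init__(self, followers):
        self.mysql = {}
        self.followers = followers
        self.cdc_queue = deque()

    def write(self, key, value):
        self.mysql[key] = value
        self.cdc_queue.append(("invalidate", key))   # emitted at commit time

    def drain_cdc(self):
        """Asynchronous fan-out: followers see the invalidation later,
        hence the sub-second stale-read window mentioned above."""
        while self.cdc_queue:
            _, key = self.cdc_queue.popleft()
            for cache in self.followers:
                cache.invalidate(key)

east, west = RegionCache(), RegionCache()
primary = PrimaryRegion([east, west])
east.data["friends:1"] = [2, 3]            # warm cache entry in one region
primary.write("friends:1", [2, 3, 4])      # friend added at the primary
assert east.data["friends:1"] == [2, 3]    # stale until the CDC drains
primary.drain_cdc()
assert "friends:1" not in east.data        # next read refills from primary
```

Invalidation (rather than pushing new values) keeps the fan-out messages small and makes the follower caches self-healing: the next read simply refills from the source of truth.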

Key Trade-offs

  • TAO over direct MySQL: The object-association model with a read-through cache handles the read-heavy social graph workload at a fraction of the MySQL cost — the trade-off is complexity of cache invalidation
  • ML ranking over chronological: Engagement-optimized ranking dramatically increases time-on-site but introduces filter bubble and content quality challenges
  • Eventual consistency for TAO: Accepting stale reads from follower caches (typically <1 second lag) allows each datacenter to serve reads locally without cross-region latency
  • Haystack over POSIX filesystem: Custom blob store eliminates NFS metadata bottlenecks for small files but requires significant engineering investment to operate
