
System Design: User Follow/Follower System

Detailed system design of a follow/follower social graph covering bidirectional relationship storage, fan-out strategies, follower count consistency, and graph traversal at billion-user scale.


Requirements

Functional Requirements:

  • Users can follow and unfollow other users (asymmetric relationship)
  • Users can view their follower list and following list with pagination
  • Display follower and following counts on user profiles
  • Show mutual followers (friends) between two users
  • Suggest users to follow based on social graph proximity
  • Support follow requests for private accounts with accept/reject workflow

Non-Functional Requirements:

  • 1 billion users with an average of 200 following and 200 followers each — 200 billion edges
  • Follow/unfollow operations must complete within 100ms
  • Follower count queries must return within 50ms
  • 99.99% availability; the social graph is foundational to all platform features
  • Strong consistency for follow state (a user must never see stale follow status)

Scale Estimation

1 billion users × 200 average follows = 200 billion directed edges. At 16 bytes per edge (8-byte follower_id + 8-byte followee_id), raw edge storage is 3.2TB — comfortably fits in a distributed database. Follow/unfollow write rate: assuming 1% of DAU (500M) modify their follow list daily with 2 operations each, that is 10M writes/day = 116 writes/sec — a light write load. Read traffic is much heavier: every feed load, profile view, and story tray requires graph lookups. At 500M DAU with 50 graph queries per session, that is 25 billion reads/day = 290K reads/sec. The mutual followers query requires an intersection of two adjacency lists, averaging 200 elements each — O(n) with sorted lists.
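The arithmetic above is easy to sanity-check. A quick back-of-envelope script, using only the numbers from this section:

```python
# Back-of-envelope numbers from the estimates above.
USERS = 1_000_000_000
AVG_FOLLOWS = 200
EDGE_BYTES = 16                       # 8-byte follower_id + 8-byte followee_id

edges = USERS * AVG_FOLLOWS           # 200 billion directed edges
raw_storage_tb = edges * EDGE_BYTES / 1e12
print(f"{edges / 1e9:.0f}B edges, {raw_storage_tb:.1f} TB raw")  # 200B edges, 3.2 TB

DAU = 500_000_000
writes_per_day = DAU * 0.01 * 2       # 1% of DAU modify follows, 2 ops each
reads_per_day = DAU * 50              # 50 graph queries per session
print(f"{writes_per_day / 86_400:.0f} writes/s")        # -> 116 writes/s
print(f"{reads_per_day / 86_400 / 1e3:.0f}K reads/s")   # -> 289K reads/s
```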

High-Level Architecture

The follow system is built as a dedicated Social Graph Service with its own data layer, separated from the main application database. The core storage is a sharded graph store optimized for adjacency list operations. Two separate adjacency lists are maintained per user: a following list (users this person follows) and a followers list (users who follow this person). Every follow operation writes to both lists transactionally to maintain consistency.

The service exposes a gRPC API consumed by other platform services (Feed Service, Story Service, Notification Service, Recommendation Service). A Redis cache layer sits in front of the graph store, caching the most frequently accessed adjacency lists and follower counts. The cache uses a write-through strategy: follow/unfollow operations update the database first and then the cache synchronously in the same request path, so a failed cache write can at worst leave a stale entry that is evicted and rebuilt lazily. For the follow suggestions feature, a separate Graph Analytics Service runs nightly batch jobs on a Neo4j or Apache Giraph cluster to compute 2-hop neighbor recommendations and PageRank-based influence scores.
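A minimal sketch of the write-through path, assuming the redis-py client; insert_edge is a hypothetical DAO call into the sharded graph store, not a real API:

```python
import time
import redis  # redis-py client (assumed available)

r = redis.Redis()

def follow(follower_id: int, followee_id: int, insert_edge) -> None:
    """Write-through follow: persist to MySQL first, then refresh the cache.

    insert_edge is a hypothetical DAO call into the sharded graph store.
    If the cache writes fail, the stale entries are evicted and rebuilt
    lazily on the next read; the database remains the source of truth.
    """
    ts = time.time()
    insert_edge(follower_id, followee_id, ts)   # durable write first
    pipe = r.pipeline()                         # batch the cache updates
    pipe.zadd(f"following:{follower_id}", {str(followee_id): ts})
    pipe.zadd(f"followers:{followee_id}", {str(follower_id): ts})
    pipe.incr(f"count:{follower_id}:following")
    pipe.incr(f"count:{followee_id}:followers")
    pipe.execute()
```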

Core Components

Graph Storage Engine

The graph store uses a sharded MySQL cluster (Vitess) with two denormalized tables: user_following (partition key: follower_id, columns: followee_id, created_at) and user_followers (partition key: followee_id, columns: follower_id, created_at). This dual-write pattern optimizes for the two most common queries: 'who does user X follow?' and 'who follows user X?'. Each table uses a composite primary key of (partition key, counterpart user ID) to keep edges unique, plus a secondary index on (partition key, created_at DESC) for efficient reverse-chronological pagination. Because the two tables partition on different keys, the two rows for a single follow usually land on different shards, so the operation cannot run as one local transaction: a Saga pattern with compensating transactions (delete the first row if the second write fails) keeps the tables consistent, with a two-phase commit coordinator as the stricter but slower alternative.
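A sketch of the cross-shard dual-write with compensation. The in-memory dicts stand in for the two sharded tables; a production coordinator would also persist saga state so an interrupted compensation can be retried:

```python
from collections import defaultdict

NUM_SHARDS = 1024

# In-memory stand-ins for the two sharded tables (illustration only).
following_shards = defaultdict(dict)   # shard -> {(follower, followee): ts}
followers_shards = defaultdict(dict)   # shard -> {(followee, follower): ts}

def shard_for(user_id: int) -> int:
    return user_id % NUM_SHARDS

def follow_saga(follower_id: int, followee_id: int, ts: float) -> None:
    """Dual-write as a saga: forward writes in order, compensate on failure."""
    fshard = shard_for(follower_id)
    following_shards[fshard][(follower_id, followee_id)] = ts
    try:
        followers_shards[shard_for(followee_id)][(followee_id, follower_id)] = ts
    except Exception:
        # Compensating transaction: roll back the first write so the two
        # tables never disagree permanently.
        following_shards[fshard].pop((follower_id, followee_id), None)
        raise
```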

Follower Count Service

Follower and following counts are materialized as separate counters in Redis (key: count:{user_id}:followers and count:{user_id}:following). These counters are atomically incremented/decremented on every follow/unfollow operation using Redis INCRBY. Because failed or retried operations can cause the counter to drift from the source of truth, a periodic reconciliation job runs every hour: it counts the rows in the user_followers table for the top 10,000 accounts by follower count and updates the Redis counter if the drift exceeds 0.1%. Follower counts are displayed with rounding for large numbers (e.g., '12.4M followers') to mask minor inconsistencies.
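A sketch of the hourly reconciliation, using redis-py; count_followers_in_db is a hypothetical helper that runs SELECT COUNT(*) against the user_followers shard:

```python
import redis

r = redis.Redis()
DRIFT_THRESHOLD = 0.001  # 0.1%, as above

def reconcile_follower_count(user_id: int, count_followers_in_db) -> None:
    """Repair a materialized Redis counter if it drifted from MySQL."""
    key = f"count:{user_id}:followers"
    cached = int(r.get(key) or 0)
    actual = count_followers_in_db(user_id)  # SELECT COUNT(*) on the shard
    if actual and abs(cached - actual) / actual > DRIFT_THRESHOLD:
        r.set(key, actual)  # last-writer-wins; small races are re-fixed next hour
```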

Mutual Followers & Graph Queries

The mutual followers query ('show me followers of user B who I also follow') requires intersecting two sets. For small accounts (< 10K followers), this is done at query time: fetch both adjacency lists from cache/DB and merge-intersect them; with lists kept sorted by user ID, the intersection is O(n). For large accounts (> 10K followers), the query-time intersection is too slow. Instead, a precomputed mutual followers cache is maintained: when user A views user B's profile, the service computes the intersection asynchronously and caches the result in Redis with a 24-hour TTL. The 'friend suggestions' feature uses a Jaccard similarity metric on follower overlap: similarity(A, B) = |followers(A) ∩ followers(B)| / |followers(A) ∪ followers(B)|, computed nightly via a Spark job on the full graph export.
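The O(n) merge-intersection and the Jaccard score, in miniature (plain Python over ID-sorted lists; the production version of the similarity computation runs in Spark):

```python
def intersect_sorted(a: list[int], b: list[int]) -> list[int]:
    """Two-pointer merge over ID-sorted adjacency lists: O(len(a) + len(b))."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

def jaccard(a: list[int], b: list[int]) -> float:
    """|A ∩ B| / |A ∪ B| on follower sets, as used for friend suggestions."""
    inter = len(intersect_sorted(a, b))
    union = len(a) + len(b) - inter
    return inter / union if union else 0.0
```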

Database Design

The primary graph store is MySQL (Vitess) with the following schema. Table user_following: follower_id (BIGINT, partition key), followee_id (BIGINT), created_at (DATETIME), status (ENUM: active, pending for private accounts). Table user_followers: followee_id (BIGINT, partition key), follower_id (BIGINT), created_at (DATETIME). Each table holds approximately 200 billion rows (one per edge), sharded across 1,024 Vitess shards at roughly 200M rows per table per shard. Each shard runs on a MySQL 8.0 instance with 500GB SSD and 64GB RAM, with two read replicas for read scaling.
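The schema above, written out as DDL (a sketch: column types come from this section; the exact key layout is an assumption consistent with the pagination scheme described earlier):

```python
# DDL for the two denormalized edge tables (sketch; applied per Vitess shard).
USER_FOLLOWING_DDL = """
CREATE TABLE user_following (
    follower_id BIGINT NOT NULL,           -- partition key
    followee_id BIGINT NOT NULL,
    created_at  DATETIME NOT NULL,
    status      ENUM('active', 'pending') NOT NULL DEFAULT 'active',
    PRIMARY KEY (follower_id, followee_id),
    KEY idx_recency (follower_id, created_at DESC)
) ENGINE=InnoDB;
"""

USER_FOLLOWERS_DDL = """
CREATE TABLE user_followers (
    followee_id BIGINT NOT NULL,           -- partition key
    follower_id BIGINT NOT NULL,
    created_at  DATETIME NOT NULL,
    PRIMARY KEY (followee_id, follower_id),
    KEY idx_recency (followee_id, created_at DESC)
) ENGINE=InnoDB;
"""
```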

Redis caches adjacency lists as sorted sets: following:{user_id} and followers:{user_id} with the score set to the follow timestamp for chronological ordering. To bound memory, only users with fewer than 50K followers have their full adjacency list cached; celebrity follower lists are paginated from MySQL on demand. Redis memory for adjacency list caching: assuming 500M active users with average 200 entries at 8 bytes each = 800GB of raw member payload (sorted-set overhead inflates the real footprint several-fold), spread across a 100-node Redis Cluster.
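A read-path sketch against the sorted-set cache, using redis-py; fetch_followers_from_mysql is a hypothetical fallback for uncached (celebrity) lists:

```python
import redis

r = redis.Redis()

def get_followers_page(user_id: int, cursor_ts: float, limit: int = 50,
                       fetch_followers_from_mysql=None):
    """Page a follower list newest-first; the zset score is the follow timestamp."""
    key = f"followers:{user_id}"
    if r.exists(key):
        # '(' makes the max bound exclusive so the cursor row isn't repeated.
        return r.zrevrangebyscore(key, f"({cursor_ts}", "-inf",
                                  start=0, num=limit, withscores=True)
    # Cache miss (or a celebrity list that is never cached): go to MySQL.
    return fetch_followers_from_mysql(user_id, cursor_ts, limit)
```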

API Design

  • POST /api/v1/users/{user_id}/follow — Follow a user; returns follow status (followed or pending for private accounts)
  • DELETE /api/v1/users/{user_id}/follow — Unfollow a user; idempotent
  • GET /api/v1/users/{user_id}/followers?cursor={timestamp}&limit=50 — Paginated follower list sorted by recency
  • GET /api/v1/users/{user_id}/following?cursor={timestamp}&limit=50 — Paginated following list sorted by recency
  • GET /api/v1/users/{user_id}/mutual-followers/{other_user_id}?limit=20 — Mutual followers between two users
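The cursor is simply the created_at of the last row returned. A client-side sketch of walking the follower list; the paths and query parameters come from the list above, while the base URL and response field names (followers, next_cursor) are assumptions for illustration:

```python
import requests  # assumed HTTP client

def iter_followers(user_id: int, base_url: str = "https://api.example.com"):
    """Walk GET /users/{id}/followers page by page via the timestamp cursor."""
    cursor = None
    while True:
        params = {"limit": 50}
        if cursor:
            params["cursor"] = cursor
        page = requests.get(
            f"{base_url}/api/v1/users/{user_id}/followers", params=params
        ).json()
        yield from page["followers"]
        cursor = page.get("next_cursor")  # created_at of the last row returned
        if not cursor:
            return
```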

Scaling & Bottlenecks

The 290K reads/sec graph query load is handled by the Redis cache layer, which absorbs 95% of reads. Cache misses (approximately 14.5K/sec) are served by MySQL read replicas. The cache hit rate is high because of temporal locality — users repeatedly check the same profiles. Cache warming for new users is done lazily on first access. The write path (116 writes/sec) is light but requires cross-shard consistency for the dual-write pattern. The Saga coordinator processes follows sequentially per user (using a per-user lock in Redis) to prevent race conditions like simultaneous follow/unfollow of the same user.
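A sketch of the per-user serialization lock, using the standard SET NX PX pattern with a token check on release (redis-py; the Lua release script is the usual safe-unlock idiom):

```python
import uuid
import redis

r = redis.Redis()

# Release only if we still own the lock, to avoid deleting a lock that
# expired and was re-acquired by another worker.
RELEASE = """
if redis.call('get', KEYS[1]) == ARGV[1] then
    return redis.call('del', KEYS[1])
end
return 0
"""

def with_user_lock(user_id: int, fn, ttl_ms: int = 5_000):
    """Serialize follow/unfollow for one user to avoid racing dual-writes."""
    key, token = f"lock:follow:{user_id}", str(uuid.uuid4())
    if not r.set(key, token, nx=True, px=ttl_ms):
        raise RuntimeError("operation in flight for this user; retry")
    try:
        return fn()
    finally:
        r.eval(RELEASE, 1, key, token)
```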

The biggest scalability challenge is the follow suggestions feature, which requires analyzing the global graph structure. Running Personalized PageRank or label propagation on a 200-billion-edge graph requires a dedicated graph processing cluster. Apache Giraph on a 500-node Hadoop cluster processes the full graph in approximately 4 hours. The output (top 100 follow suggestions per user) is written to a Redis-backed feature store consumed by the Recommendation API. This batch computation runs nightly during off-peak hours.
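The 2-hop idea in miniature: candidates are accounts followed by accounts you already follow, ranked by how many 2-hop paths reach them. This is a toy, single-machine version of what the batch job computes; the Giraph pipeline adds PageRank weighting and filtering on top:

```python
from collections import Counter

def two_hop_suggestions(user: int, following: dict[int, set[int]], k: int = 10):
    """Rank friend-of-friend candidates by number of 2-hop paths."""
    mine = following.get(user, set())
    counts = Counter()
    for friend in mine:
        for candidate in following.get(friend, set()):
            if candidate != user and candidate not in mine:
                counts[candidate] += 1  # one more path through a mutual follow
    return [uid for uid, _ in counts.most_common(k)]
```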

Key Trade-offs

  • Dual-write denormalization vs. single table with index: Dual tables (following + followers) double the write cost but make both query directions O(1) lookups by partition key — with a read/write ratio of 2500:1, optimizing reads is clearly the right trade-off
  • MySQL over a native graph database: MySQL with adjacency list tables handles the primary queries (list followers, list following, check follow status) efficiently; a native graph DB like Neo4j would be better for multi-hop traversals but adds operational complexity for a single-hop-dominant workload
  • Redis materialized counts vs. COUNT(*) queries: Materialized counters are O(1) but can drift from the source of truth — the hourly reconciliation job keeps drift below 0.1%, which is acceptable for display purposes
  • Strong consistency for follow state vs. eventual: Follow/unfollow must be strongly consistent (a user who unfollows should never see content from that account again) — this requires synchronous dual-writes with distributed transactions, adding latency but ensuring correctness
