System Design: Product Recommendation Engine
System design of a product recommendation engine combining collaborative filtering, content-based signals, and real-time personalization to drive e-commerce conversions.
Requirements
Functional Requirements:
- Generate personalized product recommendations on homepage, product pages, and cart
- Support multiple recommendation types: 'Frequently bought together', 'Similar items', 'Customers also viewed'
- Cold start handling for new users (popularity-based) and new products (content-based)
- A/B testing framework for comparing recommendation algorithms
- Real-time adaptation to current browsing session behavior
Non-Functional Requirements:
- Serve recommendations for 50M DAU with sub-50ms latency (p99)
- Process 500M click/view events per day for model training
- Model refresh: offline models retrained every 4 hours; online features updated in real-time
- 99.9% availability; fallback to popularity-based recommendations on failure
- Support 100M+ product catalog with embedding-based retrieval
Scale Estimation
50M DAU × 10 recommendation requests/session = 500M recommendation requests/day ≈ 5,787 QPS average. Each request retrieves roughly 200 candidates from the embedding index, scores them with a ranking model, and returns the top 8. Event ingestion: 500M events/day ≈ 5,787 events/sec into Kafka. Model training: 500M events processed in a 4-hour Spark job on a 100-node cluster. Embedding index: 100M products × 128-dimensional float32 vectors = 51.2 GB, which fits in GPU memory across 4 A100 GPUs.
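A quick sanity check on the arithmetic behind these estimates, using only the numbers stated above:

```python
# Back-of-envelope check of the traffic and index-size estimates above.
SECONDS_PER_DAY = 86_400

requests_per_day = 50_000_000 * 10                  # 50M DAU x 10 requests each
avg_qps = requests_per_day / SECONDS_PER_DAY        # ~5,787 QPS average

events_per_day = 500_000_000
events_per_sec = events_per_day / SECONDS_PER_DAY   # ~5,787 events/sec into Kafka

index_bytes = 100_000_000 * 128 * 4                 # 100M items x 128 dims x 4-byte float32
index_gb = index_bytes / 1e9                        # ~51.2 GB

print(f"avg QPS:         {avg_qps:,.0f}")
print(f"event rate:      {events_per_sec:,.0f}/sec")
print(f"embedding index: {index_gb:.1f} GB")
```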
High-Level Architecture
The recommendation engine uses a two-phase architecture: offline training and online serving. The offline pipeline runs on Spark: raw click/purchase events from Kafka → feature engineering (user purchase history, product co-occurrence matrices, category affinity scores) → model training. Two models are trained: (1) a collaborative filtering model (Alternating Least Squares matrix factorization) producing user and item embeddings; (2) a ranking model (XGBoost gradient-boosted trees) that scores candidate items using 50+ features. Trained models and embeddings are pushed to the Model Registry (MLflow).
The online serving pipeline: Recommendation API receives a request with user_id and context (current page, cart contents) → Retrieval Layer fetches 200 candidates via ANN search (FAISS index of item embeddings, queried with the user's embedding) + rule-based retrievers (same category, same brand) → Ranking Layer scores candidates using the XGBoost model with real-time features (current session clicks, time of day, device type) → Post-Processing applies business rules (filter out-of-stock, deduplicate, diversity injection) → return top 8 items.
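A minimal sketch of that serving path follows; the retriever, ranker, and post-processing callables are hypothetical stand-ins for the FAISS ANN search, the XGBoost scorer, and the business-rule layer, not actual service interfaces:

```python
from typing import Callable, Iterable

def recommend(
    user_id: str,
    context: dict,
    retrievers: Iterable[Callable[[str, dict], list[str]]],
    rank: Callable[[str, list[str], dict], dict[str, float]],
    post_process: Callable[[dict[str, float], dict], dict[str, float]],
    k: int = 8,
) -> list[str]:
    # Retrieval layer: union of ~200 ANN candidates plus rule-based candidates.
    candidates: list[str] = []
    for retrieve in retrievers:
        candidates.extend(retrieve(user_id, context))
    candidates = list(dict.fromkeys(candidates))  # de-duplicate, preserve order

    # Ranking layer: score every candidate using model + real-time features.
    scores = rank(user_id, candidates, context)

    # Post-processing: stock filter, deduplication, diversity injection.
    scores = post_process(scores, context)

    # Return the top-k item ids by final score.
    return sorted(scores, key=scores.get, reverse=True)[:k]
```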
Real-time features are maintained in Redis: user_session:{id} stores the last 20 products viewed/clicked in the current session, updated by a Kafka consumer processing clickstream events with <1 second latency.
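A sketch of that consumer, assuming the kafka-python and redis-py clients, a topic named 'clickstream', and a JSON event payload (topic name, payload shape, and the 30-minute session TTL are assumptions):

```python
import json

import redis
from kafka import KafkaConsumer

r = redis.Redis(host="localhost", port=6379)
consumer = KafkaConsumer(
    "clickstream",                          # assumed topic name
    bootstrap_servers="kafka:9092",
    group_id="realtime-feature-updater",
    value_deserializer=lambda v: json.loads(v),
)

SESSION_TTL_SECONDS = 30 * 60               # expire idle sessions (assumed TTL)

for event in consumer:
    payload = event.value                   # e.g. {"user_id": ..., "product_id": ..., "type": "view"}
    key = f"user_session:{payload['user_id']}"

    # Keep only the last 20 products viewed/clicked in the current session.
    pipe = r.pipeline()
    pipe.lpush(key, payload["product_id"])
    pipe.ltrim(key, 0, 19)
    pipe.expire(key, SESSION_TTL_SECONDS)
    pipe.execute()
```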
Core Components
Collaborative Filtering Pipeline
The ALS matrix factorization model decomposes the user-item interaction matrix (roughly 500M interaction records) into user embeddings and item embeddings of dimension 128. Training runs on Spark MLlib every 4 hours using implicit feedback signals: views (weight 1), add-to-cart (weight 3), purchases (weight 5). The trained item embeddings are loaded into a FAISS IVF index (inverted file index with 4096 clusters) on GPU servers for fast ANN retrieval. User embeddings are stored in Redis keyed by user_id. For new users without embeddings, the system uses the average embedding of their last 5 viewed products as a proxy.
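A condensed sketch of the ALS training step and the FAISS IVF index build, assuming a Spark DataFrame of weighted interactions with columns user_id, item_id, and weight (the input path, column names, and the inner-product metric are assumptions; rank=128 and 4096 clusters come from the text):

```python
import faiss
import numpy as np
from pyspark.ml.recommendation import ALS
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("als-training").getOrCreate()
interactions = spark.read.parquet("s3://feature-store/interactions/")  # assumed path

# Implicit-feedback ALS; the 'weight' column carries view=1, add-to-cart=3, purchase=5.
als = ALS(
    rank=128,
    implicitPrefs=True,
    userCol="user_id",
    itemCol="item_id",
    ratingCol="weight",
    coldStartStrategy="drop",
)
model = als.fit(interactions)

# Build the IVF index (4096 clusters) over the trained item embeddings.
item_factors = model.itemFactors.orderBy("id").toPandas()   # columns: id, features
item_vectors = np.array(item_factors["features"].tolist(), dtype="float32")

d, nlist = 128, 4096
quantizer = faiss.IndexFlatIP(d)
index = faiss.IndexIVFFlat(quantizer, d, nlist, faiss.METRIC_INNER_PRODUCT)
index.train(item_vectors)
index.add(item_vectors)                      # row i maps back to item_factors["id"][i]
faiss.write_index(index, "item_embeddings.ivf")  # then uploaded to S3 for the servers
```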
Ranking Model Service
The ranking model (XGBoost with 200 trees, max depth 8) scores each candidate item using features: user-item dot product (from embeddings), price ratio (item price / user's median purchase price), category match score (cosine similarity of the user's category affinity vector and the item's category), item popularity (7-day click count, log-scaled), seller rating, and real-time session features (how many items from this category the user has viewed in the current session). The model is served via a custom C++ inference service that scores a candidate in roughly 0.1ms; with 200 candidates, total ranking time is ~20ms.
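A Python sketch of batch scoring with a trained booster; the feature names and model path are illustrative, and the production path is the C++ inference service described above:

```python
import numpy as np
import xgboost as xgb

FEATURES = [
    "user_item_dot",            # dot product of user and item embeddings
    "price_ratio",              # item price / user's median purchase price
    "category_match",           # cosine similarity of category affinity vectors
    "log_popularity",           # log-scaled 7-day click count
    "seller_rating",
    "session_category_views",   # real-time: views of this category in the session
]

booster = xgb.Booster()
booster.load_model("ranker.xgb")            # assumed artifact name from the registry

def score_candidates(feature_rows: list[dict[str, float]]) -> np.ndarray:
    """Score a batch of candidates; one dict of feature values per candidate."""
    matrix = np.array([[row[f] for f in FEATURES] for row in feature_rows],
                      dtype=np.float32)
    return booster.predict(xgb.DMatrix(matrix, feature_names=FEATURES))
```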
A/B Testing Framework
The recommendation engine supports concurrent A/B experiments. Users are assigned to experiment buckets via consistent hashing on user_id + experiment_id. Each experiment can modify: the retrieval algorithm, the ranking model version, the post-processing rules, or the number of candidates. Metrics tracked: click-through rate (CTR), add-to-cart rate, conversion rate, and revenue per recommendation impression. Experiments run for a minimum of 7 days with statistical significance testing (chi-squared test, p < 0.05) before promotion to production.
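A sketch of deterministic bucket assignment by hashing user_id + experiment_id; the MD5 choice, the 100-bucket granularity, and the even split are assumptions beyond the consistent-hashing description above:

```python
import hashlib

def assign_variant(user_id: str, experiment_id: str, variants: list[str]) -> str:
    """Deterministically map a user to an experiment variant."""
    digest = hashlib.md5(f"{user_id}:{experiment_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100                      # stable bucket in [0, 100)
    # Even split across variants; unequal splits would adjust the boundaries.
    return variants[bucket * len(variants) // 100]

# The same user always lands in the same variant for a given experiment,
# but may land in different variants across experiments.
assert assign_variant("u42", "ranker_v2", ["control", "treatment"]) == \
       assign_variant("u42", "ranker_v2", ["control", "treatment"])
```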
Database Design
The feature store uses two tiers. Offline features (user purchase history, product co-occurrence counts, category affinity vectors) are stored in a Parquet-based feature store on S3, loaded into Redis daily for online serving. Real-time features (current session activity) are maintained in Redis hashes: user_features:{id} → {last_viewed: [...], session_categories: {...}, cart_items: [...]}. Item features are stored in a Redis hash per product: item_features:{product_id} → {embedding_vector, popularity_score, category_id, price, avg_rating}.
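A sketch of the online read/write path against that Redis layout; the field encodings (JSON strings for nested values) are assumptions:

```python
import json

import redis

r = redis.Redis(host="localhost", port=6379)

# Item features: one hash per product, refreshed by the daily offline load from S3.
r.hset("item_features:sku-123", mapping={
    "embedding_vector": json.dumps([0.12, -0.04, 0.33]),  # 128 dims in production
    "popularity_score": 8.7,
    "category_id": 42,
    "price": 19.99,
    "avg_rating": 4.6,
})

# Online read path: fetch user and item features in a single round trip.
pipe = r.pipeline()
pipe.hgetall("user_features:u42")
pipe.hgetall("item_features:sku-123")
user_features, item_features = pipe.execute()
```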
The FAISS index is stored as a serialized binary file on S3, loaded into GPU memory on model server startup. Index rebuilds happen every 4 hours after ALS training completes. A blue-green deployment strategy ensures zero downtime during index swaps: new index is loaded on standby servers, traffic is switched, old servers are drained.
API Design
- GET /api/v1/recommendations/homepage?user_id={id}&count=8 - Personalized homepage recommendations; falls back to trending products for anonymous users
- GET /api/v1/recommendations/similar?product_id={id}&count=8 - Similar products for a product detail page; uses item-to-item embedding similarity
- GET /api/v1/recommendations/cart?cart_items={ids}&count=4 - 'Frequently bought together' recommendations based on co-purchase patterns
- POST /api/v1/events - Log a user interaction event (view, click, add_to_cart, purchase) for real-time feature updates
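An illustrative client call to the homepage endpoint; the host name, timeout, and response shape are assumptions, since the source specifies only the endpoints and their parameters:

```python
import requests

resp = requests.get(
    "https://api.example.com/api/v1/recommendations/homepage",  # assumed host
    params={"user_id": "u42", "count": 8},
    timeout=0.1,  # client-side budget aligned with the low-latency SLA (assumption)
)
resp.raise_for_status()
items = resp.json()  # assumed: a ranked list of product ids with scores
```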
Scaling & Bottlenecks
FAISS index serving is the primary latency bottleneck. With 100M product embeddings, a single ANN query takes ~5ms on GPU (IVF index with nprobe=32). To handle 5,800 QPS, the system runs 4 GPU servers each handling 1,450 QPS. GPU servers are expensive; cost optimization strategies: (1) batch incoming queries — accumulate requests for 2ms windows and execute batch ANN search (FAISS supports batch queries efficiently); (2) cache popular query results — homepage recommendations for the same user don't change within 5 minutes; (3) use HNSW index on CPU for fallback — slightly slower (15ms) but runs on cheaper hardware.
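A sketch of the 2ms micro-batching idea for ANN queries: incoming query vectors are queued, then searched in one batched FAISS call; the queue-and-thread plumbing here is simplified relative to a real async serving framework:

```python
import queue
import threading
import time

import faiss
import numpy as np

K, BATCH_WINDOW_SEC = 200, 0.002

index = faiss.read_index("item_embeddings.ivf")   # built by the offline pipeline
index.nprobe = 32

pending = queue.Queue()                           # items: (query_vector, reply_queue)

def batcher() -> None:
    """Accumulate queries for up to 2 ms, then run one batched ANN search."""
    while True:
        batch = [pending.get()]                   # block until at least one query
        deadline = time.monotonic() + BATCH_WINDOW_SEC
        while True:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                break
            try:
                batch.append(pending.get(timeout=remaining))
            except queue.Empty:
                break
        vectors = np.stack([vec for vec, _ in batch])       # shape (n, 128)
        _, ids = index.search(vectors, K)                    # one call for all queries
        for (_, reply), row in zip(batch, ids):
            reply.put(row)                                   # hand back top-K item ids

threading.Thread(target=batcher, daemon=True).start()
```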
The event ingestion pipeline must handle 500M events/day without data loss. Kafka with 32 partitions provides throughput headroom. Events are consumed by two consumer groups: (1) real-time feature updater (writes to Redis with <1s latency) and (2) batch feature pipeline (writes to S3 Parquet for next training cycle). Consumer lag monitoring alerts when the real-time consumer falls behind by >10 seconds.
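A sketch of an offset-based lag check with kafka-python; the topic and group names are assumptions, and the 58,000-event threshold is simply the 10-second alert converted at ~5,787 events/sec:

```python
from kafka import KafkaConsumer, TopicPartition

TOPIC, GROUP = "clickstream", "realtime-feature-updater"   # assumed names

consumer = KafkaConsumer(bootstrap_servers="kafka:9092", group_id=GROUP)
partitions = [TopicPartition(TOPIC, p) for p in consumer.partitions_for_topic(TOPIC)]

end_offsets = consumer.end_offsets(partitions)             # newest offset per partition
total_lag = 0
for tp in partitions:
    committed = consumer.committed(tp) or 0                # last committed offset
    total_lag += end_offsets[tp] - committed

# ~5,787 events/sec overall, so >10 s of lag is roughly >58,000 unconsumed events.
if total_lag > 58_000:
    print(f"ALERT: real-time feature consumer lag is {total_lag} events")
```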
Key Trade-offs
- ALS collaborative filtering over deep learning: ALS is simpler to train and debug, with comparable accuracy for e-commerce; deep learning models (e.g., two-tower) would improve accuracy by 3-5% but at 10x training cost
- XGBoost ranking over neural ranker: XGBoost provides interpretable feature importances and fast inference (0.1ms/candidate), critical for the 50ms latency SLA
- 4-hour model refresh over real-time training: Batch retraining is operationally simpler; real-time session features compensate for model staleness
- FAISS IVF over exact nearest neighbor: Approximate search with <5% recall loss enables sub-5ms queries over 100M vectors — exact search would take 500ms+