
System Design: ML-based Search Ranking

Design an ML-based search ranking system that combines query understanding, multi-stage retrieval, and learning-to-rank models to deliver highly relevant search results. Covers offline training, online feature assembly, A/B testing of ranking models, and query analysis.

15 min read · Updated Jan 15, 2025
Tags: system-design, search-ranking, learning-to-rank, information-retrieval, ml

Requirements

Functional Requirements:

  • Return a ranked list of relevant documents/products for a text query within 200ms
  • Support multi-stage ranking: fast retrieval (BM25 + ANN) followed by expensive re-ranking (LTR model)
  • Incorporate diverse signals: query-document relevance, user engagement signals, freshness, and business rules
  • Handle query understanding: spell correction, query expansion, synonym mapping, and intent classification
  • Personalize results based on user history, location, and device
  • Apply business rules post-ranking: boost promoted items, demote out-of-stock items, enforce diversity

Non-Functional Requirements:

  • End-to-end search latency under 200ms at the 99th percentile
  • Support 50,000 search queries per second at peak
  • Index corpus of 1 billion documents with sub-minute indexing latency for new documents
  • NDCG@10 improvements tracked for every ranking model update
  • 99.99% availability; degraded mode returns BM25 results without ML re-ranking

Scale Estimation

50,000 QPS with a 200ms budget. Stage 1 (retrieval): BM25 + ANN over 1 billion documents must return the top 1,000 candidates in 50ms. Stage 2 (re-ranking): the LTR model scores 1,000 candidates per query at 50,000 QPS = 50 million LTR inferences per second. A gradient-boosted-tree inference takes ~1 microsecond, so re-ranking 1,000 candidates adds ~1ms of compute per query; budgeting an effective throughput of 100,000 inferences per second per worker (well under the ~1 million/second theoretical single-core rate, leaving headroom for feature marshaling and batching), the fleet needs ~500 inference workers.
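As a quick sanity check, the arithmetic above can be worked through directly (a sketch; the per-worker throughput figure is the assumption stated above):

```python
# Back-of-the-envelope capacity check for the re-ranking stage.
QPS = 50_000                       # peak search queries per second
CANDIDATES_PER_QUERY = 1_000       # Stage 1 output size
INFERENCE_US = 1                   # ~1 microsecond per GBDT inference

inferences_per_sec = QPS * CANDIDATES_PER_QUERY                      # 50,000,000
rerank_ms_per_query = CANDIDATES_PER_QUERY * INFERENCE_US / 1_000    # 1.0 ms

# Assumed effective throughput per worker (leaves headroom below the
# ~1M/sec theoretical single-core rate):
WORKER_THROUGHPUT = 100_000
workers_needed = inferences_per_sec / WORKER_THROUGHPUT              # 500.0

print(inferences_per_sec, rerank_ms_per_query, workers_needed)
```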

High-Level Architecture

The search ranking system has five subsystems: Query Understanding, Candidate Retrieval, Feature Assembly, Learning-to-Rank, and Business Rules Engine. These execute sequentially within the 200ms budget: Query Understanding (10ms) → Retrieval (50ms) → Feature Assembly (30ms) → LTR Scoring (20ms) → Business Rules (5ms) → Response serialization (5ms) = 120ms with 80ms headroom.

Query Understanding applies NLP to the raw query: spell correction (SymSpell), tokenization, stop-word removal, synonym expansion (from a domain-specific thesaurus), and intent classification (navigational vs. informational vs. transactional). For product search, entity extraction identifies brand names, product attributes, and price constraints. The processed query is represented as both a sparse vector (BM25 terms) and a dense vector (BERT-encoded query embedding for semantic retrieval).
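A minimal sketch of this pipeline, with toy lookup tables standing in for SymSpell, the thesaurus, and the intent classifier (every name here is illustrative, not a production implementation):

```python
from dataclasses import dataclass

# Toy lookup tables standing in for SymSpell and the domain thesaurus.
SPELL_FIXES = {"iphnoe": "iphone"}
SYNONYMS = {"laptop": ["notebook"], "tv": ["television"]}
STOP_WORDS = {"a", "an", "the", "for"}

@dataclass
class ProcessedQuery:
    terms: list          # sparse (BM25) term representation
    expanded: list       # terms plus synonyms
    intent: str          # navigational / informational / transactional

def understand(raw_query: str) -> ProcessedQuery:
    tokens = [SPELL_FIXES.get(t, t) for t in raw_query.lower().split()]
    terms = [t for t in tokens if t not in STOP_WORDS]
    expanded = list(terms)
    for t in terms:
        expanded.extend(SYNONYMS.get(t, []))
    # Toy intent rule; production uses a trained classifier.
    intent = "transactional" if {"buy", "price"} & set(terms) else "informational"
    return ProcessedQuery(terms, expanded, intent)

print(understand("buy iphnoe case"))
# ProcessedQuery(terms=['buy', 'iphone', 'case'], expanded=[...], intent='transactional')
```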

The LTR model is trained offline using implicit feedback (click-through data, dwell time, purchase signals) collected from production search logs. Training uses LambdaMART (gradient-boosted trees driven by LambdaRank gradients), a listwise objective that optimizes NDCG directly. Features include: query-document relevance features (BM25 score, semantic similarity, exact match signals), document quality features (freshness, popularity, authority), and personalization features (user historical preference match, location match). The model is served as an ONNX-exported gradient boosted tree (LightGBM), scoring 1 million candidates per second per CPU core.
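On the serving side, scoring a candidate batch with the trained booster might look like the following sketch (the file path and feature layout are illustrative; an ONNX deployment would swap in onnxruntime):

```python
import numpy as np
import lightgbm as lgb

# Model artifact loaded once per ranking server (path is illustrative;
# production fetches it from S3 and keeps it in memory).
booster = lgb.Booster(model_file="ltr_model.txt")

# One row per candidate: [bm25, semantic_sim, exact_match, freshness, ...];
# random values here stand in for the assembled feature vectors.
features = np.random.rand(1_000, 50).astype(np.float32)

scores = booster.predict(features)   # one relevance score per candidate
ranked = np.argsort(-scores)         # candidate indices, best first
```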

Core Components

Multi-Stage Retrieval

Stage 1a (BM25): Elasticsearch retrieves the top 500 documents by BM25 score. Elasticsearch shards the 1-billion-document index across 20 nodes; queries fan out to all shards, and the coordinating node merges the top 500 results in 30ms. Stage 1b (ANN): a FAISS HNSW index over document embeddings retrieves the top 500 semantically similar documents in 20ms. Both retrieval stages run in parallel; results are merged by union (up to 1,000 unique candidates) and passed to re-ranking. Documents appearing in both lists receive a hybrid score that boosts their rank in re-ranking.
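A sketch of the merge step, assuming a FAISS HNSW index for the ANN leg and a hypothetical bm25_search wrapper for the Elasticsearch leg:

```python
import faiss
import numpy as np

def ann_search(index: faiss.Index, query_vec: np.ndarray, k: int = 500):
    """Return (doc_id, distance) pairs from a FAISS HNSW index."""
    q = np.ascontiguousarray(query_vec.reshape(1, -1), dtype=np.float32)
    dists, ids = index.search(q, k)
    return list(zip(ids[0].tolist(), dists[0].tolist()))

def merge_candidates(bm25_hits, ann_hits):
    """Union both lists; documents found by both legs carry a hybrid flag
    that the LTR stage uses as a rank-boosting feature."""
    candidates = {}
    for doc_id, score in bm25_hits:
        candidates[doc_id] = {"bm25": score, "ann_sim": 0.0, "both": False}
    for doc_id, dist in ann_hits:
        sim = 1.0 / (1.0 + dist)                 # distance -> similarity
        if doc_id in candidates:
            candidates[doc_id].update(ann_sim=sim, both=True)
        else:
            candidates[doc_id] = {"bm25": 0.0, "ann_sim": sim, "both": False}
    return candidates                             # up to 1,000 unique docs

# bm25_hits = bm25_search(processed_query)   # hypothetical ES wrapper
# ann_hits  = ann_search(hnsw_index, query_embedding)
# candidates = merge_candidates(bm25_hits, ann_hits)
```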

Feature Assembly for LTR

For each of the 1,000 candidates, the feature assembler computes: query-document features (BM25 score from Elasticsearch, cosine similarity from ANN, exact title match, query term coverage), document features (click-through rate from Redis cache, freshness score, content quality score from a pre-computed table), and user-document features (user's historical CTR on documents with this category, geographic match). Feature assembly runs as 3 parallel Redis pipeline calls (one per feature group) completing in 10ms.
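A sketch of the Redis-backed lookups using pipelined round trips; the doc_features key layout follows the Database Design section below, while user_ctr:{user_id} is an illustrative addition:

```python
import redis

r = redis.Redis(decode_responses=True)

def assemble_features(doc_ids, user_id, category_by_doc):
    """Pipelined Redis lookups for two of the feature groups; query-document
    features (BM25, cosine) arrive with the retrieval results themselves.
    Production issues these pipelines concurrently rather than in sequence."""
    # Static document features: one hash per doc, refreshed hourly.
    pipe = r.pipeline(transaction=False)
    for doc_id in doc_ids:
        pipe.hgetall(f"doc_features:{doc_id}")
    doc_feats = pipe.execute()

    # User-document features, e.g. the user's historical CTR per category.
    pipe = r.pipeline(transaction=False)
    for doc_id in doc_ids:
        pipe.hget(f"user_ctr:{user_id}", category_by_doc[doc_id])
    user_ctrs = pipe.execute()

    return list(zip(doc_ids, doc_feats, user_ctrs))
```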

Learning-to-Rank Training Pipeline

A daily training pipeline reads 30 days of search session logs from the data warehouse. Positive labels are assigned to clicked results; negative labels to results shown but not clicked. A position-bias correction model (inverse propensity scoring) debiases the training labels to account for position effects (items shown at rank 1 are clicked more due to position, not quality). LightGBM trains on the debiased dataset using LambdaMART, optimizing NDCG@10. The new model is A/B tested on 10% of traffic before full rollout.
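A sketch of the debiased training step with LightGBM's native API; the propensity estimates and hyperparameters are illustrative:

```python
import numpy as np
import lightgbm as lgb

def train_ltr(X, y, group, propensity):
    """X: feature matrix sorted by query; y: 1 if clicked, else 0;
    group: results per query (listwise grouping);
    propensity: estimated P(examined | rank shown) from the bias model."""
    weights = 1.0 / np.clip(propensity, 0.05, 1.0)   # clipped IPS weights
    train_set = lgb.Dataset(X, label=y, group=group, weight=weights)
    params = {
        "objective": "lambdarank",   # LambdaMART = LambdaRank gradients + GBDT
        "metric": "ndcg",
        "ndcg_eval_at": [10],
        "learning_rate": 0.05,
        "num_leaves": 63,
    }
    return lgb.train(params, train_set, num_boost_round=500)
```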

Database Design

Elasticsearch for BM25 retrieval: document index with fields (title, body, category, tags) with custom analyzers (stemming, synonym token filter). Document feature pre-computation table (Redis): key doc_features:{doc_id} → JSON hash of static features (CTR, quality_score, freshness_score), refreshed hourly. LTR model artifact stored in S3 and loaded in-memory on ranking servers. Click log (Kafka → ClickHouse): (query_id, session_id, user_id, query_text, result_doc_id, rank_shown, clicked BOOL, dwell_time_ms, purchased BOOL, ts TIMESTAMP).
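A sketch of the index creation with the custom analyzer chain, assuming the 8.x Elasticsearch Python client (index name and synonym entries are illustrative):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

es.indices.create(
    index="documents",
    settings={
        "number_of_shards": 20,
        "analysis": {
            "filter": {
                "english_stemmer": {"type": "stemmer", "language": "english"},
                "domain_synonyms": {
                    "type": "synonym",
                    "synonyms": ["laptop, notebook", "tv, television"],  # illustrative
                },
            },
            "analyzer": {
                "search_text": {
                    "tokenizer": "standard",
                    "filter": ["lowercase", "domain_synonyms", "english_stemmer"],
                }
            },
        },
    },
    mappings={
        "properties": {
            "title": {"type": "text", "analyzer": "search_text"},
            "body": {"type": "text", "analyzer": "search_text"},
            "category": {"type": "keyword"},
            "tags": {"type": "keyword"},
        }
    },
)
```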

API Design

GET /search?q={query}&user_id={id}&filters={json}&page=1&page_size=20 — Execute a search query and return ranked results with relevance scores. POST /index/documents — Index new or updated documents; returns estimated propagation time to all search nodes. GET /search/explain?query_id={id}&doc_id={id} — Return a feature breakdown explaining why a document was ranked at its position. POST /search/feedback — Submit explicit relevance feedback (thumbs up/down) for a search result.
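An illustrative call to the search endpoint; the host and response shape are assumptions consistent with the design above, not a documented contract:

```python
import requests

resp = requests.get(
    "https://search.example.com/search",   # host is illustrative
    params={"q": "wireless headphones", "user_id": "u123",
            "page": 1, "page_size": 20},
)
results = resp.json()
# Assumed response shape:
# {
#   "query_id": "q-8c1f...",
#   "results": [
#     {"doc_id": "d42", "title": "...", "relevance_score": 0.93, "rank": 1},
#     ...
#   ],
#   "total_hits": 18234,
#   "latency_ms": 118
# }
```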

Scaling & Bottlenecks

Elasticsearch query fan-out to 20 shards adds coordination overhead: each query requires a scatter-gather across all shards, and the coordinating node merges 20 * 500 = 10,000 candidate documents. Reducing shard count by sizing shards at roughly 50 GB each (Elasticsearch's recommended range) balances query parallelism against coordination overhead. Index routing (directing queries to the shards containing the relevant document namespace) reduces fan-out for tenant-partitioned indices.
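A sketch of shard routing with the Python client; choosing tenant_id as the routing key is an assumption:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")
tenant_id = "tenant-42"   # illustrative routing key

# Index and search with the same routing value so a query touches only
# the shards holding that tenant's documents instead of all 20.
es.index(index="documents", id="d42", routing=tenant_id,
         document={"title": "wireless headphones", "category": "audio"})

hits = es.search(
    index="documents",
    routing=tenant_id,
    query={"match": {"title": "wireless headphones"}},
    size=500,
)
```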

Feature assembly for 1,000 candidates * 50 features = 50,000 feature lookups per query, which at 50,000 QPS is 2.5 billion lookups per second. Batch fetching (HMGET for multiple hash fields in one round trip) and pre-joining static document features at indexing time (storing pre-computed feature vectors in the search index alongside documents) reduce live lookups to only the dynamic user features (~50 lookups instead of 50,000), cutting feature assembly from 30ms to 5ms.
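A sketch of the post-optimization path, where static features travel with each retrieved hit and only dynamic user features are fetched live (the user_features:{user_id} key is illustrative):

```python
import redis

r = redis.Redis(decode_responses=True)

def assemble_features_fast(retrieved_hits, user_id):
    """Static document features were pre-joined into the search index at
    indexing time, so each hit already carries them; only the dynamic
    user features need a live lookup."""
    user_feats = r.hgetall(f"user_features:{user_id}")   # one round trip
    return [{**hit["static_features"], **user_feats} for hit in retrieved_hits]
```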

Key Trade-offs

  • BM25 vs. dense retrieval: BM25 handles exact keyword matching and rare terms well; dense retrieval handles semantic and paraphrase matching but misses rare terms absent from the embedding model's training data. Hybrid retrieval combines both.
  • LTR model complexity vs. serving latency: Deeper LTR models (neural rankers) achieve higher NDCG but require GPU inference; gradient-boosted trees (LightGBM) achieve 90% of neural ranker quality at 100x lower latency on CPU, making them the industry default for re-ranking.
  • Global vs. personalized ranking: Global ranking is simpler and easier to debug; personalized ranking improves CTR by 15–30% but requires user history features, increases privacy complexity, and makes A/B testing harder.
  • Implicit vs. explicit feedback for training: Implicit feedback (clicks, purchases) is abundant and continuously updated but noisy (position bias, presentation effects); explicit feedback (ratings, thumbs up/down) is precise but scarce.
