
System Design: Product Search (Amazon-scale)

Design a product search system at Amazon scale handling 500 million product listings, real-time inventory, and personalized ranking across hundreds of millions of users. Covers A9 algorithm concepts, sponsored results, and query understanding.

16 min read · Updated Jan 15, 2025

Tags: system-design · product-search · e-commerce · amazon · personalization · query-understanding

Requirements

Functional Requirements:

  • Search 500 million product listings by keyword, category, and attributes
  • Return personalized, purchase-intent-ranked results
  • Support faceted filtering (brand, price, rating, Prime eligibility, color)
  • Handle sponsored/advertising slots integrated with organic results
  • Support query understanding: synonyms, abbreviations, brand disambiguation
  • Return results within 100ms

Non-Functional Requirements:

  • 100ms p99 end-to-end latency
  • 500,000 peak QPS
  • 500 million products with near-real-time inventory and price updates
  • 99.999% availability — outages directly cost revenue
  • Support for 20+ international marketplaces with locale-specific catalogs

Scale Estimation

Amazon serves over 350 million product searches per day (~4,000 QPS average, 500,000 QPS peak). With 500 million products at 5 KB metadata each (title, description, attributes, images), the catalog is 2.5 TB. The search index (inverted index + doc values) is 500 GB–1 TB after compression. Price and inventory update rates reach 10,000 writes/sec during flash sales. Personalization requires per-user feature vectors for 300 million active users: at 1 KB per vector, that's 300 GB in a user feature store.
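The estimates above are straightforward back-of-envelope arithmetic; a quick sketch (all constants taken from the paragraph, not measured values):

```python
# Back-of-envelope estimates from the figures above (illustrative, not measured).
SEARCHES_PER_DAY = 350_000_000
avg_qps = SEARCHES_PER_DAY / 86_400                 # ~4,050 QPS average

PRODUCTS = 500_000_000
METADATA_BYTES = 5 * 1024                           # 5 KB of metadata per product
catalog_tib = PRODUCTS * METADATA_BYTES / 1024**4   # ~2.3 TiB raw catalog

ACTIVE_USERS = 300_000_000
VECTOR_BYTES = 1024                                 # 1 KB per user feature vector
feature_store_gib = ACTIVE_USERS * VECTOR_BYTES / 1024**3  # ~286 GiB
```

Peak QPS (500,000) is roughly 125x the average, which is why the capacity plan below is driven by the peak, not the mean.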

High-Level Architecture

The architecture consists of five major subsystems: Query Understanding, Retrieval, Ranking, Ads Serving, and Result Assembly. Query Understanding (QU) preprocesses the query: language detection, spelling correction, tokenization, synonym expansion, brand recognition, intent classification (navigational vs. transactional). QU output is an enriched query object passed downstream. Retrieval fetches candidate products from the index. Ranking scores and sorts candidates. Ads Serving inserts sponsored placements. Result Assembly formats and caches the final SERP.

Query Understanding uses a combination of rule-based systems and ML models. A spell corrector (based on a product-catalog-specific language model) handles common misspellings. A synonym dictionary (maintained by merchant category teams) maps "headphones" → "earphones", "headsets". Named entity recognition identifies brands ("Nike"), models ("iPhone 15 Pro"), and product categories. Intent classification distinguishes navigational queries ("Amazon basics USB cable" — exact product match desired) from exploratory queries ("running shoes" — discovery desired), which routes to different ranking model configurations.

Retrieval uses a tiered index. A primary full-text index (custom Lucene-based system, similar to OpenSearch) handles keyword matching. A dense retrieval index (bi-encoder embeddings) handles semantic matching for queries with no keyword overlap. A structured attribute index handles filtered queries ("size:10 brand:Nike category:shoes"). All three retrieval paths execute in parallel; results are merged before ranking. Candidate set size per retrieval path is capped at 5,000; the merged set going to ranking is 10,000.
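The parallel fan-out and merge described above can be sketched as follows. The three `*_search` callables stand in for the real retrieval clients (keyword, dense, and attribute paths); each is assumed to return `(asin, path_score)` pairs:

```python
from concurrent.futures import ThreadPoolExecutor

PER_PATH_CAP = 5_000   # cap per retrieval path
MERGED_CAP = 10_000    # cap on the merged set sent to ranking

def merge_candidates(query, keyword_search, dense_search, attribute_search):
    """Run the three retrieval paths in parallel and merge/dedupe by ASIN.

    The *_search callables are stand-ins for the real retrieval clients;
    each returns a list of (asin, path_score) pairs.
    """
    with ThreadPoolExecutor(max_workers=3) as pool:
        futures = [pool.submit(fn, query)
                   for fn in (keyword_search, dense_search, attribute_search)]
        paths = [f.result()[:PER_PATH_CAP] for f in futures]

    merged = {}
    for candidates in paths:
        for asin, score in candidates:
            # A product found by multiple paths keeps its best path score.
            merged[asin] = max(score, merged.get(asin, float("-inf")))

    # Truncate the merged set before it goes to the ranking stage.
    ranked = sorted(merged.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:MERGED_CAP]
```

How duplicate candidates are scored across paths (max here) is a design choice; a real system might instead keep all per-path scores as ranking features.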

Core Components

Query Understanding Pipeline

The QU pipeline completes in under 20 ms end to end. Components execute in a directed acyclic graph (DAG): language detection → tokenization → spell correction → NER (brand/model) → synonym expansion → intent classification → category prediction. Models are lightweight: FastText for language detection (microseconds), a trie-based spell corrector against the product catalog vocabulary, and a distilled BERT model for intent classification. Output is a structured query object: {original_text, corrected_text, tokens, entities: [{text, type, confidence}], intent, predicted_categories}.
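A minimal sketch of the enriched query object and the DAG's spine, with the actual models replaced by injected callables (all names are illustrative):

```python
from dataclasses import dataclass, field

@dataclass
class Entity:
    text: str
    type: str            # e.g. "BRAND", "MODEL", "CATEGORY"
    confidence: float

@dataclass
class EnrichedQuery:
    original_text: str
    corrected_text: str
    tokens: list
    entities: list = field(default_factory=list)
    intent: str = "exploratory"               # or "navigational"
    predicted_categories: list = field(default_factory=list)

def understand(raw, spell_correct, ner, classify_intent):
    """Toy QU DAG: spell correction -> tokenization -> NER -> intent.
    The three callables stand in for the real models."""
    corrected = spell_correct(raw)
    tokens = corrected.lower().split()
    return EnrichedQuery(
        original_text=raw,
        corrected_text=corrected,
        tokens=tokens,
        entities=ner(tokens),
        intent=classify_intent(tokens),
    )
```

Downstream stages consume only the `EnrichedQuery`, so individual QU models can be swapped without touching retrieval or ranking.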

Ranking Model (A9-inspired)

Amazon's A9 algorithm weighs three primary dimensions: relevance (does the product match the query?), performance (does the product sell well for this query?), and availability (is it in stock, Prime-eligible, deliverable?). Relevance signals: text match score (BM25), title match, category alignment, embedding similarity. Performance signals: conversion rate for this query, historical CTR, review count/rating, sales velocity, return rate. Availability signals: Prime eligibility, current stock level, delivery speed, seller rating. A gradient-boosted model (XGBoost) combines 300+ features. Personalization layer adds user affinity scores (purchase history, browsing history category weights) as multiplicative re-rank factors.
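To make the structure concrete, here is a heavily simplified sketch: a weighted sum stands in for the gradient-boosted model over 300+ features, and user affinity is applied as the multiplicative re-rank factor described above (weights and function names are illustrative, not A9's actual formula):

```python
def a9_style_score(relevance, performance, availability,
                   user_affinity=1.0, weights=(0.4, 0.4, 0.2)):
    """Toy stand-in for the production ranker: combine the three A9
    dimensions (each normalized to [0, 1]), then apply the user's
    affinity as a multiplicative personalization factor."""
    w_r, w_p, w_a = weights
    base = w_r * relevance + w_p * performance + w_a * availability
    return base * user_affinity

def rerank(candidates, affinity_for):
    """candidates: list of (asin, relevance, performance, availability).
    affinity_for: callable asin -> user affinity multiplier."""
    scored = [(asin, a9_style_score(r, p, a, affinity_for(asin)))
              for asin, r, p, a in candidates]
    return sorted(scored, key=lambda kv: kv[1], reverse=True)
```

The multiplicative form means personalization reorders among comparably relevant products but cannot promote a product the base model scores near zero.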

Real-Time Inventory & Price Updates

A dedicated Inventory Update Service consumes a Kafka stream of warehouse events (stock level changes, price changes). Rather than full reindexing, partial document updates use a lightweight update API: only the price, in_stock, prime_eligible, and delivery_days fields are updated in the search index. These fields are stored as numeric doc values, allowing fast update propagation (sub-second index refresh) without segment rewrite. Price changes for time-limited deals (Lightning Deals) use a TTL-based overlay mechanism: a deal record specifies a discounted price plus a validity window, and the ranking model reads the effective price at query time.
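The query-time price resolution can be sketched as a small pure function over the deal overlay (the schema and field names here are hypothetical, not an actual Amazon data model):

```python
import time

def effective_price(product, deal_overlay, now=None):
    """Resolve the price the ranker should see at query time.

    product: dict with at least {"asin", "price"} (the indexed list price).
    deal_overlay: dict keyed by ASIN, e.g.
        {"B0X...": {"deal_price": 79.0, "starts": 1700000000, "ends": 1700003600}}
    Deals outside their validity window are ignored, so an expired Lightning
    Deal falls back to the list price with no index write required.
    """
    now = time.time() if now is None else now
    deal = deal_overlay.get(product["asin"])
    if deal and deal["starts"] <= now < deal["ends"]:
        return deal["deal_price"]
    return product["price"]
```

Because expiry is evaluated at read time, a deal ending requires no write at all; only the deal's start needs to be pushed to the overlay.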

Database Design

Product catalog data is stored in DynamoDB (primary source of truth, keyed by ASIN) and mirrored to the search index via a CDC pipeline. The search index stores a subset of fields optimized for search: ASIN, title (analyzed), brand (keyword), category_path (keyword array), price (double), rating_avg (half_float), review_count (integer), prime_eligible (boolean), in_stock (boolean), embedding_vector (dense_vector, 768-dim). User feature vectors are stored in a purpose-built feature store (DynamoDB or Cassandra) keyed by user_id. Query logs are stored in S3 (partitioned by date/marketplace) and queried via Athena for training data generation.

API Design
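A minimal sketch of the search request and response shapes, expressed as Python dicts (every endpoint and field name here is hypothetical, chosen to match the facets and ranking signals discussed elsewhere in this design):

```python
# Hypothetical shapes for GET /search (all names illustrative).
search_request = {
    "q": "running shoes",
    "marketplace": "US",
    "filters": {"brand": ["Nike"], "prime_eligible": True, "price_max": 150},
    "sort": "relevance",           # relevance | price_asc | price_desc | rating
    "page": 1,
    "page_size": 24,
    "session_id": "abc123",        # enables personalized ranking
}

search_response = {
    "results": [
        {"asin": "B0EXAMPLE1", "title": "Example running shoe",
         "price": 89.99, "rating_avg": 4.5, "review_count": 1023,
         "prime_eligible": True, "sponsored": False},
    ],
    "facets": {"brand": {"Nike": 812, "Adidas": 640}},
    "total_matches": 15230,
    "query_understanding": {"corrected_text": "running shoes",
                            "intent": "exploratory"},
}
```

Sponsored placements are flagged per result rather than returned in a separate list, which keeps interleaving decisions server-side and lets clients render a single ordered SERP.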

Scaling & Bottlenecks

The most acute scaling challenge is the ranking step: scoring 10,000 candidates per query at 500,000 QPS requires 5 billion scoring operations per second. Optimization strategies: (1) early termination — stop retrieving candidates once quality plateaus; (2) cascaded ranking — a cheap linear model reduces 10,000 to 500 candidates before the full XGBoost model scores the remainder; (3) feature caching — document-level features (product rating, category) are precomputed and cached, only query-document interaction features computed online; (4) hardware acceleration — SIMD-vectorized XGBoost inference on modern CPUs.
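Strategy (2), cascaded ranking, reduces expensive-model invocations by 20x and can be sketched in a few lines; `cheap_score` and `full_score` stand in for the linear model and the XGBoost model respectively:

```python
def cascade_rank(candidates, cheap_score, full_score, stage1_keep=500):
    """Two-stage ranking cascade: a cheap model prunes the candidate set
    (here 10,000 -> 500), then the expensive model scores only the
    survivors. cheap_score/full_score are stand-ins for the real models."""
    stage1 = sorted(candidates, key=cheap_score, reverse=True)[:stage1_keep]
    return sorted(stage1, key=full_score, reverse=True)
```

The cascade trades a small recall risk (a product the cheap model misjudges never reaches the full model) for a 20x reduction in heavy scoring work, which is what makes 500,000 QPS feasible.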

Cache architecture is multi-layer: L1 = in-process result cache (LRU, 10,000 most popular queries per node, ~5 min TTL), L2 = regional Redis cluster (100,000 queries, 5 min TTL), L3 = CDN edge cache (top-1,000 popular queries, 30 sec TTL). Cache hit rate for L1+L2 is ~60% during peak traffic, reducing effective QPS to ranking to 200,000. CDN caching is used only for non-personalized search (guest users or queries with identical filter states across many users).
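The L1 layer described above (LRU with a TTL) is straightforward to sketch; this is a single-threaded illustration, whereas a production in-process cache would also need locking or sharding:

```python
import time
from collections import OrderedDict

class TtlLruCache:
    """Sketch of the L1 in-process result cache: LRU eviction plus a
    per-entry TTL (defaults mirror the numbers in the text)."""

    def __init__(self, max_entries=10_000, ttl_seconds=300):
        self.max_entries, self.ttl = max_entries, ttl_seconds
        self._store = OrderedDict()          # key -> (value, expires_at)

    def get(self, key, now=None):
        now = time.time() if now is None else now
        item = self._store.get(key)
        if item is None or item[1] <= now:   # missing or expired
            self._store.pop(key, None)
            return None
        self._store.move_to_end(key)         # mark as recently used
        return item[0]

    def put(self, key, value, now=None):
        now = time.time() if now is None else now
        self._store[key] = (value, now + self.ttl)
        self._store.move_to_end(key)
        if len(self._store) > self.max_entries:
            self._store.popitem(last=False)  # evict least recently used
```

The short TTLs are the mechanism that bounds staleness of cached SERPs against the sub-second inventory and price updates described earlier.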

Key Trade-offs

  • Relevance vs. conversion optimization: Pure conversion-rate optimization surfaces bestsellers regardless of query relevance; pure text relevance may surface obscure exact-match products that never sell; A9 balances both
  • Personalization depth vs. latency: Deep personalization (session-level attention model) adds 30ms; lightweight category affinity re-ranking adds 1ms with 60% of the lift
  • Sponsored integration vs. organic quality: Interleaving ads with organic results monetizes the search page but degrades UX if ads dominate above-the-fold positions
  • Global index vs. per-marketplace index: A global index simplifies ops but requires complex locale filtering; per-marketplace indices are operationally heavier but faster and more relevant
