System Design: Search Suggestion & Spell Check
Design a search suggestion and spell correction system that handles typos, phonetic variations, and context-aware query completion in real time. Covers edit-distance algorithms, n-gram language models, and contextual correction.
Requirements
Functional Requirements:
- Correct spelling errors in search queries in real time
- Suggest alternative queries for zero-result searches
- Support phonetic correction ("Brittney" → "Britney") and transposition errors
- Provide "did you mean?" suggestions ranked by likelihood
- Handle domain-specific vocabulary (product names, brand names, technical terms)
- Support multiple languages with language-specific spell models
Non-Functional Requirements:
- Correction latency under 10ms per query
- Handle 100,000 QPS
- Correction accuracy >95% on common misspellings
- False positive rate <1% (do not "correct" correctly spelled but uncommon words)
- Support corpus updates without service restart
Scale Estimation
Query spell check requires comparing each query token against a dictionary of ~5 million words. A BK-tree (Burkhard-Keller tree) over 5 million words with max edit distance 2 performs ~500 node comparisons per lookup. At 100,000 QPS with queries averaging 3 tokens, that's 300,000 BK-tree lookups/sec, or ~150 million node comparisons/sec. At roughly 1 microsecond per edit-distance comparison in optimized C++/Java, that requires ~150 CPU cores. Language models (5-gram) for candidate re-ranking are stored in compressed form (~1 GB per language) and queried as O(1) hash lookups. Total memory footprint per language: ~2 GB.
High-Level Architecture
The spell correction pipeline has three stages: error detection, candidate generation, and candidate ranking. Error detection identifies which tokens are likely misspelled (not found in vocabulary or low frequency). Candidate generation produces a set of plausible correct spellings using edit-distance enumeration. Candidate ranking scores candidates by language model probability and domain frequency, selecting the most likely intended query. The pipeline runs inline with query processing, triggered when a query token has low corpus frequency or produces zero/few results.
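The three stages can be sketched as a small skeleton. Everything here is illustrative: `vocab`, `generate`, and `rank` are stand-ins for the vocabulary service, candidate generator, and ranker described in the following sections.

```python
# Skeleton of the detect → generate → rank pipeline described above.
# The toy generate/rank functions below are placeholders, not the real
# BK-tree and noisy-channel components.

def correct_query(tokens, vocab, generate, rank):
    """Run correction only on tokens the vocabulary flags as suspect."""
    out = []
    for tok in tokens:
        if tok in vocab:                  # error detection: known word, skip
            out.append(tok)
            continue
        cands = generate(tok)             # candidate generation
        out.append(rank(tok, cands) if cands else tok)  # candidate ranking
    return out

# Toy components for demonstration only.
vocab = {"chicken", "recipe", "soup"}
gen = lambda t: {w for w in vocab if abs(len(w) - len(t)) <= 1}
rk = lambda t, cs: min(cs, key=lambda c: sum(a != b for a, b in zip(t, c)))

print(correct_query(["chiken", "soup"], vocab, gen, rk))  # → ['chicken', 'soup']
```

Note that known tokens bypass the pipeline entirely, which is what keeps average latency low: only suspect tokens pay for candidate generation and ranking.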
A vocabulary service maintains a trie and frequency table of known words, populated from the query log (frequent queries are treated as valid vocabulary) and domain-specific dictionaries (product catalog, brand names). Words not in the vocabulary trigger the correction pipeline. A candidate generator uses a BK-tree (metric tree for edit-distance queries) to enumerate all dictionary words within Damerau-Levenshtein distance 2 (handles insertions, deletions, substitutions, and transpositions). The candidate set is filtered by minimum frequency threshold to avoid suggesting rare words.
Candidate ranking uses a noisy channel model: the probability of a candidate C given observed typo T is proportional to P(T|C) * P(C). P(C) is the unigram frequency of C in the query log. P(T|C) is the confusion probability — how often a specific error pattern occurs (e.g., swapping "e" and "i", omitting a doubled letter). A confusion matrix is learned from aligned pairs of (typo, correction) extracted from query click logs (if a user searches for "receipe", gets no results, and immediately searches "recipe", this is a correction pair). An n-gram language model (5-gram, Kneser-Ney smoothing) scores the corrected query in context.
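A toy noisy-channel ranker shows the scoring arithmetic. The unigram counts and confusion probabilities below are invented placeholders; real values come from query logs and mined (typo, correction) pairs.

```python
# Noisy channel: rank candidates by log P(T|C) + log P(C).
import math

unigram_count = {"recipe": 9_000_000, "receipt": 4_000_000, "recite": 120_000}
total = sum(unigram_count.values())

# P(T|C): probability of observing typo T given intended word C
# (learned from aligned correction pairs; values here are made up).
confusion = {("receipe", "recipe"): 0.01,
             ("receipe", "receipt"): 0.002,
             ("receipe", "recite"): 0.0005}

def score(typo: str, cand: str) -> float:
    """log P(T|C) + log P(C) — higher means a more likely intended word."""
    p_c = unigram_count[cand] / total
    p_t_given_c = confusion.get((typo, cand), 1e-9)  # floor for unseen pairs
    return math.log(p_t_given_c) + math.log(p_c)

candidates = ["recipe", "receipt", "recite"]
best = max(candidates, key=lambda c: score("receipe", c))
print(best)  # → recipe (high frequency and a common error pattern)
```

"recipe" wins on both factors: it is far more frequent than "recite", and dropping a letter from "receipe" is a much more common error pattern than the edits that would produce "receipt".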
Core Components
BK-Tree (Edit-Distance Index)
A BK-tree organizes words in a metric space where the distance function is Damerau-Levenshtein edit distance. Each node stores a word and its children, keyed by their edit distance from the parent. To find all words within distance k of a query word, recursively visit children whose key is in the range [dist-k, dist+k]. This prunes the search space dramatically compared to a linear scan. For a vocabulary of 5 million words and max distance 2, a BK-tree visits ~500 nodes (0.01% of vocabulary) per lookup. The tree is built once offline and loaded into memory at service startup.
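A minimal in-memory sketch of the structure, using the optimal-string-alignment variant of Damerau-Levenshtein distance. The word list is illustrative; a production index would be an optimized native implementation over the full vocabulary.

```python
def damerau_levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions, substitutions,
    and adjacent transpositions (optimal string alignment variant)."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(a)][len(b)]

class BKTree:
    def __init__(self, words):
        it = iter(words)
        self.root = [next(it), {}]  # node = [word, {distance: child}]
        for w in it:
            self.add(w)

    def add(self, word):
        node = self.root
        while True:
            dist = damerau_levenshtein(word, node[0])
            if dist == 0:
                return
            if dist not in node[1]:
                node[1][dist] = [word, {}]
                return
            node = node[1][dist]

    def query(self, word, k):
        """All words within distance k, pruning children whose edge
        key falls outside [dist - k, dist + k]."""
        results, stack = [], [self.root]
        while stack:
            node = stack.pop()
            dist = damerau_levenshtein(word, node[0])
            if dist <= k:
                results.append((node[0], dist))
            for key, child in node[1].items():
                if dist - k <= key <= dist + k:
                    stack.append(child)
        return results

tree = BKTree(["recipe", "receipt", "recite", "replace", "relapse"])
print(sorted(tree.query("receipe", 2)))
# → [('receipt', 1), ('recipe', 1), ('recite', 2)]
```

The pruning condition follows from the triangle inequality: if the query is at distance `dist` from a node, any word within `k` of the query must be at distance between `dist - k` and `dist + k` from that node.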
N-Gram Language Model
A 5-gram language model captures the probability of a word sequence given the preceding 4 words: P(w5 | w1, w2, w3, w4). Trained on billions of query log entries, it captures query-specific language patterns ("how to" → "install", not "instill"). Kneser-Ney smoothing handles unseen n-grams by interpolating with lower-order models. The model is stored as a hash map of n-gram strings to log probabilities, compressed using KenLM binary format (~1 GB for a 5-gram model with 100M unique n-grams). At correction time, each candidate spelling correction is scored by the n-gram model over the full query context; the highest-scoring candidate is the "did you mean" suggestion.
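A toy context scorer illustrates why n-gram scoring resolves "install" vs. "instill". The table uses trigrams instead of 5-grams to stay small, the log-probabilities are invented, and the flat backoff penalty is a crude stand-in for real Kneser-Ney backoff weights.

```python
# Toy backoff n-gram scorer (trigram for brevity; the text describes 5-grams).

logprob = {  # log10 P(w | context), keyed by the full n-gram (invented values)
    ("how", "to", "install"): -0.5,
    ("how", "to", "instill"): -3.2,
    ("to", "install"): -1.0,
    ("to", "instill"): -3.5,
    ("install",): -2.0,
    ("instill",): -4.0,
}
BACKOFF_PENALTY = -0.7  # crude stand-in for trained backoff weights

def score_word(context: tuple, word: str) -> float:
    """Back off from trigram to bigram to unigram, penalizing each drop."""
    penalty = 0.0
    for order in range(min(2, len(context)), -1, -1):
        key = tuple(context[len(context) - order:]) + (word,)
        if key in logprob:
            return logprob[key] + penalty
        penalty += BACKOFF_PENALTY
    return -10.0  # out-of-vocabulary floor

def score_query(tokens):
    return sum(score_word(tuple(tokens[max(0, i - 2):i]), w)
               for i, w in enumerate(tokens))

print(score_query(["how", "to", "install"]) >
      score_query(["how", "to", "instill"]))  # → True
```

Both candidate queries contain only valid words, so token-level checking alone cannot choose between them; the context model prefers "install" because "how to install" is vastly more common in query logs.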
Phonetic Correction
Phonetic algorithms (Soundex, Metaphone, Double Metaphone) encode words by their pronunciation, collapsing phonetically similar words to the same code. A phonetic index maps each code to the set of words sharing that pronunciation. Queries are checked against this index for zero-result handling: if a query token's phonetic code matches a known word, a phonetic correction candidate is generated. Double Metaphone is preferred for English as it handles more phonetic patterns. For proper nouns (brand names), a brand-name-specific phonetic dictionary is maintained separately to avoid false positive corrections of intentionally unusual brand spellings (e.g., "Häagen-Dazs").
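As a self-contained illustration, here is classic Soundex and a phonetic index built on it (Double Metaphone, preferred above, follows the same index pattern but with a more elaborate encoding). The word list is illustrative.

```python
from collections import defaultdict

# Soundex consonant groups: phonetically similar letters share a digit.
CODES = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
         **dict.fromkeys("dt", "3"), "l": "4",
         **dict.fromkeys("mn", "5"), "r": "6"}

def soundex(word: str) -> str:
    """Classic 4-character Soundex code: first letter + up to 3 digits."""
    word = word.lower()
    first = word[0].upper()
    digits = []
    prev = CODES.get(word[0], "")
    for ch in word[1:]:
        code = CODES.get(ch, "")
        if code and code != prev:
            digits.append(code)
        if ch not in "hw":  # h/w are transparent: they don't reset prev
            prev = code
    return (first + "".join(digits) + "000")[:4]

# Phonetic index: code → words sharing that pronunciation.
index = defaultdict(set)
for w in ["britney", "brittney", "brittany", "recipe"]:
    index[soundex(w)].add(w)

# Zero-result handling: look up the misspelled token's code.
print(index[soundex("britny")])
```

Because "britny", "britney", and "brittney" all collapse to the same code, the index recovers candidates that may be beyond edit distance 2 and thus invisible to the BK-tree.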
Database Design
Vocabulary data is stored in memory: a hash map (word → frequency) for O(1) frequency lookup, and a BK-tree for edit-distance search. Vocabulary is built from query logs (stored in BigQuery/Hive, updated nightly) and supplemented with product catalog terms (queried from DynamoDB). The n-gram language model is stored in KenLM binary format on disk and memory-mapped at startup. Correction pairs (typo → correction, count) from click logs are stored in BigQuery and used for confusion matrix updates. Model refresh happens daily: a batch job retrains the confusion matrix and updates the BK-tree vocabulary; new artifacts are deployed via a rolling update.
API Design
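One possible shape for the correction endpoint is sketched below; the path, parameter names, and response fields are illustrative, not a fixed contract. Returning ranked suggestions with confidence scores lets the caller decide between auto-applying a correction and showing a "did you mean?" prompt.

```
GET /v1/spellcheck?q=chiken+recipy&lang=en

200 OK
{
  "original":   "chiken recipy",
  "corrected":  "chicken recipe",
  "confidence": 0.93,
  "suggestions": [
    {"query": "chicken recipe",  "score": 0.93},
    {"query": "chicken recipes", "score": 0.81}
  ],
  "auto_applied": false
}
```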
Scaling & Bottlenecks
The correction pipeline is CPU-bound at high QPS. BK-tree lookups are single-threaded per query but can be parallelized across query tokens. For a 5-token query at 100,000 QPS, 500,000 BK-tree lookups/sec are needed. A single optimized BK-tree implementation handles ~1 million lookups/sec per core, so with 2 cores per correction server, a small fleet covers the load with comfortable headroom for ranking and phonetic lookups. In practice, spelling correction is only triggered for low-frequency or zero-result queries (30–40% of queries), reducing effective load further.
Vocabulary updates present a cold-start challenge: new product names, trending terms (from social media), and proper nouns are unknown to the model. A hot vocabulary layer in Redis stores recently added terms with their frequencies. The correction pipeline checks this hot layer before declaring a word unknown. New words are added to the hot layer within 15 minutes of appearing in query logs with frequency > 100. The hot layer is periodically merged into the offline vocabulary during nightly model rebuilds.
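The lookup order matters: offline vocabulary first, then the hot layer, and only then the correction pipeline. A plain dict stands in for the Redis hot layer in this sketch (in production it would be a `GET` on a key such as `hotvocab:<word>` via a Redis client); the counts and the trending term are invented.

```python
# "Known word?" check with a hot layer in front of the offline vocabulary.

offline_vocab = {"recipe": 9_000_000, "receipt": 4_000_000}  # nightly build
hot_layer = {"glowberry": 450}  # hypothetical trending term, added within 15 min

MIN_FREQ = 100  # matches the frequency threshold described above

def is_known(word: str) -> bool:
    """A token is 'known' if either layer has it above the threshold;
    unknown tokens trigger the correction pipeline."""
    if offline_vocab.get(word, 0) >= MIN_FREQ:
        return True
    return hot_layer.get(word, 0) >= MIN_FREQ

print(is_known("glowberry"), is_known("glowbery"))  # → True False
```

Without the hot layer, "glowberry" would be flagged as a typo and possibly "corrected" to some nearby dictionary word, which is exactly the false-positive failure mode the requirements cap at 1%.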
Key Trade-offs
- Aggressive correction vs. false positives: Aggressive correction helps casual users but risks "correcting" valid technical terms, acronyms, and proper nouns; confidence thresholds and vocabulary whitelists mitigate this
- Edit distance 1 vs. 2: Distance-2 correction catches more errors but generates 100x more candidates, slowing ranking; adaptive distance (start with 1, fall back to 2 on zero candidates) balances speed and recall
- Contextual vs. token-level correction: Token-level correction is faster but misses context; contextual n-gram scoring is more accurate but requires the full query to be assembled before scoring
- Domain vocabulary breadth vs. false correction rate: Including all product names in vocabulary prevents false corrections but makes the dictionary so large it rarely triggers correction for actual typos
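The adaptive-distance trade-off from the list above can be sketched directly. The linear scan and plain Levenshtein here are stand-ins for the BK-tree and Damerau-Levenshtein distance described earlier; the vocabulary is illustrative.

```python
def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein (a real system would use Damerau-Levenshtein)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[j - 1] + 1,         # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

VOCAB = {"recipe", "receipt", "recite", "record"}

def candidates(word: str, max_k: int = 2):
    """Adaptive distance: cheap distance-1 pass first, widen only on miss."""
    for k in range(1, max_k + 1):
        found = {w for w in VOCAB if edit_distance(word, w) <= k}
        if found:
            return found  # stop early: avoids the ~100x larger k=2 set
    return set()

print(candidates("recipee"))  # distance-1 hit, no fallback needed
print(candidates("reciepy"))  # empty at k=1, falls back to k=2
```

Most typos are a single edit away from the intended word, so the common case never pays for the much larger distance-2 candidate set.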