System Design: Typeahead Search (Autocomplete)
Design a real-time autocomplete system that serves query suggestions as users type, handling millions of concurrent sessions with sub-50ms latency. Covers trie (prefix-tree) data structures and caching strategies.
Requirements
Functional Requirements:
- Return top-10 query suggestions as user types each character
- Suggestions ranked by historical query frequency and personalization
- Support multiple languages and locales
- Handle trending queries (updated within minutes of a spike)
- Filter suggestions for SafeSearch and spam
- Support A/B testing of different ranking models
Non-Functional Requirements:
- P99 latency under 50ms end-to-end (including network)
- Handle 100,000 QPS globally
- 99.99% availability
- Suggestions refresh every 10–30 minutes for trending queries
- Storage for billions of historical query strings
Scale Estimation
Assume Google-scale: 100,000 autocomplete requests/sec at peak. With an average prefix length of 5 characters, each submitted search generates about 5 autocomplete lookups, so the aggregation pipeline ingests submitted queries at roughly 20,000 events/sec. Storing 5 billion unique query strings at 50 bytes each = 250 GB for the raw query corpus. A compressed trie for the top 10 million queries fits in roughly 2 GB in memory. Each data center maintains a hot in-memory trie replicated across multiple nodes.
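A quick sketch of the arithmetic behind these figures, with the assumptions above restated as constants:

```python
# Back-of-the-envelope sizing; constants restate the assumptions above.
AUTOCOMPLETE_QPS = 100_000      # peak prefix-lookup requests per second
AVG_PREFIX_LEN = 5              # keystrokes (prefix lookups) per submitted query
UNIQUE_QUERIES = 5_000_000_000  # historical query corpus
BYTES_PER_QUERY = 50            # average serialized query string

submitted_qps = AUTOCOMPLETE_QPS // AVG_PREFIX_LEN        # ~20,000 queries/sec into aggregation
raw_corpus_gb = UNIQUE_QUERIES * BYTES_PER_QUERY / 1e9    # ~250 GB raw query corpus

print(f"submitted queries/sec into the pipeline: {submitted_qps:,}")
print(f"raw query corpus: {raw_corpus_gb:.0f} GB")
```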
High-Level Architecture
The system has two main paths: an offline aggregation pipeline that processes query logs to compute suggestion scores, and a real-time serving layer that responds to prefix lookups with ranked suggestions. The aggregation pipeline runs on a stream processor (Kafka + Flink) to continuously update query frequencies. Results are written to a distributed key-value store (Redis Cluster or a custom trie service). The serving layer handles prefix lookups against the in-memory data structure.
Query logs are streamed into Kafka as users submit searches. A Flink job aggregates counts over a 24-hour sliding window, applies decay (recent queries weighted higher), and outputs a ranked list of (query, score) pairs. These are written to a Trie Builder service that constructs compressed trie snapshots. Snapshots are pushed to serving nodes via a pub-sub distribution mechanism. Each serving node loads the new snapshot atomically (read-write lock swap) to avoid downtime during updates.
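A minimal sketch of the atomic snapshot swap on a serving node, assuming the snapshot object exposes a lookup(prefix) method; the read-write lock from the description is approximated here with a plain lock guarding the reference swap:

```python
import threading

class TrieServingNode:
    """Serves prefix lookups and swaps in new trie snapshots without downtime."""

    def __init__(self, snapshot):
        self._lock = threading.Lock()   # guards the snapshot reference
        self._snapshot = snapshot

    def suggest(self, prefix, k=10):
        # Readers hold the lock only long enough to grab the current reference,
        # so lookups keep flowing while the next snapshot is built on the side.
        with self._lock:
            snapshot = self._snapshot
        return snapshot.lookup(prefix)[:k]

    def swap_snapshot(self, new_snapshot):
        # The new snapshot is fully loaded before this call; the swap itself is
        # a single reference assignment, so in-flight lookups never observe a
        # half-loaded trie.
        with self._lock:
            self._snapshot = new_snapshot
```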
For personalization, a per-user query history service provides a small set of user-specific boosts. At query time, the base trie results are re-ranked using a blend of global frequency score + personalization boost + context signals (locale, device type). The blending is done in-memory with simple linear scoring.
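A sketch of that linear blend; the weights and the boost dictionaries are illustrative placeholders, not tuned values from the text:

```python
def rerank(candidates, user_boosts, context_boosts,
           w_freq=1.0, w_personal=0.5, w_context=0.2):
    """Linear blend of global frequency, per-user history, and context signals.

    candidates: list of (query, global_score) pairs from the base trie lookup.
    user_boosts / context_boosts: dicts mapping query -> boost (default 0).
    """
    def blended(item):
        query, global_score = item
        return (w_freq * global_score
                + w_personal * user_boosts.get(query, 0.0)
                + w_context * context_boosts.get(query, 0.0))

    # Return the top-10 after blending, matching the product requirement.
    return [q for q, _ in sorted(candidates, key=blended, reverse=True)[:10]]
```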
Core Components
Trie / Prefix Tree
The core data structure is a compressed trie (radix tree) where each node represents a common prefix. Each node stores the top-K (typically 10) candidate completions for its prefix, pre-sorted by score; because lookups can terminate at any prefix rather than only at leaves, this enables O(P) lookup where P is the prefix length, returning pre-ranked results without additional sorting. The trie is stored as a serialized byte array for a cache-friendly memory layout. Each internal node stores child pointers and a bitmask for quick child-presence checks. Nodes for high-frequency prefixes are laid out contiguously and cache-line aligned so they tend to stay resident in CPU caches.
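An uncompressed sketch of the per-node top-K idea; a production build would add path compression, the child bitmask, and the serialized byte layout described above:

```python
import heapq

class TrieNode:
    __slots__ = ("children", "top_k")

    def __init__(self):
        self.children = {}   # char -> TrieNode (compressed edges in production)
        self.top_k = []      # min-heap of (score, query) pairs for this prefix

class SuggestionTrie:
    def __init__(self, k=10):
        self.root = TrieNode()
        self.k = k

    def insert(self, query, score):
        node = self.root
        for ch in query:
            node = node.children.setdefault(ch, TrieNode())
            # Keep only the k best completions at every prefix node.
            heapq.heappush(node.top_k, (score, query))
            if len(node.top_k) > self.k:
                heapq.heappop(node.top_k)

    def lookup(self, prefix):
        node = self.root
        for ch in prefix:                 # O(P) walk, P = prefix length
            node = node.children.get(ch)
            if node is None:
                return []
        # Results are already pre-ranked; no corpus-wide sort at query time.
        return [q for _, q in sorted(node.top_k, reverse=True)]

trie = SuggestionTrie()
trie.insert("weather today", 9000)
trie.insert("weather radar", 7500)
print(trie.lookup("wea"))   # ['weather today', 'weather radar']
```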
Query Aggregation Service
A Kafka-based pipeline ingests raw query events. A Flink streaming job groups events by query string, computes time-decayed frequency scores (exponential decay with half-life of 7 days), and produces ranked suggestion lists per prefix. An offline Spark job runs nightly to rebuild the full trie from the complete query corpus. The streaming job handles incremental updates. Trending detection uses a separate spike detector that compares short-window counts (15-minute) to long-window baselines and flags queries with >10x sudden growth for priority insertion.
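A sketch of the two scoring pieces, the 7-day half-life decay and the 15-minute spike check; window sizes beyond those two figures and the noise floor are assumptions:

```python
HALF_LIFE_DAYS = 7.0

def decayed_score(bucketed_counts):
    """bucketed_counts: iterable of (count, age_days) pairs for one query.
    Exponential decay with a 7-day half-life, as described above."""
    return sum(count * 0.5 ** (age_days / HALF_LIFE_DAYS)
               for count, age_days in bucketed_counts)

def is_trending(short_count, long_count,
                short_minutes=15, long_minutes=24 * 60,
                spike_factor=10.0, min_count=50):
    """Flag a query whose recent rate exceeds 10x its long-window baseline.

    min_count is an assumed noise floor so one-off queries don't trip the detector.
    """
    baseline_rate = long_count / long_minutes   # queries per minute, baseline
    recent_rate = short_count / short_minutes   # queries per minute, last 15 min
    return short_count >= min_count and recent_rate > spike_factor * max(baseline_rate, 1e-9)
```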
Serving Cache Layer
Each serving node maintains the full trie in memory (2–5 GB). An additional Redis cache layer stores pre-computed results for the top 10,000 most frequent prefixes. Cache hit rates exceed 90% for prefixes of length 1–3 (highest traffic). Remaining requests fall through to the in-memory trie. Results are cached at the CDN edge for common prefixes with a TTL of 60 seconds. CDN caching alone absorbs ~40% of global autocomplete traffic.
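A sketch of that lookup path, assuming redis-py; the cache key scheme and the length-3 caching cutoff are assumptions, not the production format:

```python
import json
import redis  # third-party: redis-py

r = redis.Redis(host="localhost", port=6379)   # placeholder connection details
CACHE_TTL_SECONDS = 60

def suggest(prefix, trie, k=10):
    """Check the Redis result cache first, then fall through to the in-memory trie."""
    key = f"ac:{prefix}"
    cached = r.get(key)
    if cached is not None:
        return json.loads(cached)

    results = trie.lookup(prefix)[:k]
    # Short prefixes carry most of the traffic, so only they are cached.
    if len(prefix) <= 3:
        r.setex(key, CACHE_TTL_SECONDS, json.dumps(results))
    return results
```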
Database Design
Query frequency data is stored in a time-series format: (query_string, timestamp_bucket, count). The primary store is Apache Cassandra, partitioned by (prefix_hash, time_bucket) for efficient range scans when building suggestion lists. The trie snapshot is stored as a binary blob in S3/GCS, versioned by build timestamp. Serving nodes download the latest snapshot on startup and poll for updates every 10 minutes. A ZooKeeper-managed distributed lock ensures only one node rebuilds and distributes the snapshot at a time.
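A hypothetical CQL definition matching that layout; the table and column names are illustrative, not the production schema:

```python
# Hypothetical CQL for the (prefix_hash, time_bucket) partitioning described above.
QUERY_COUNTS_DDL = """
CREATE TABLE IF NOT EXISTS query_counts (
    prefix_hash  bigint,     -- hash of the query's leading characters
    time_bucket  timestamp,  -- aggregation bucket (e.g. hourly)
    query_string text,
    count        bigint,     -- time-bucketed frequency written by the pipeline
    PRIMARY KEY ((prefix_hash, time_bucket), query_string)
);
"""
```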
API Design
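The serving layer exposes a single read endpoint that takes the typed prefix plus context signals and returns ranked suggestions. One plausible shape, sketched with FastAPI; the path, parameter names, and response fields are assumptions rather than a documented contract:

```python
from fastapi import FastAPI, Query
from pydantic import BaseModel

app = FastAPI()

class SuggestResponse(BaseModel):
    prefix: str
    suggestions: list[str]

def lookup_suggestions(prefix: str, locale: str, limit: int) -> list[str]:
    # Stub standing in for the cache / in-memory trie lookup described earlier.
    return []

@app.get("/v1/suggest", response_model=SuggestResponse)
def suggest(
    q: str = Query(..., min_length=1, max_length=100),   # the typed prefix
    locale: str = "en-US",                                # locale/context signal
    limit: int = Query(10, le=10),                        # top-10 per the requirements
):
    return SuggestResponse(prefix=q, suggestions=lookup_suggestions(q, locale, limit))
```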
Scaling & Bottlenecks
The main bottleneck is the trie update frequency. Pushing full trie snapshots (2 GB) to 1,000 serving nodes every 10 minutes creates 200 GB/min of internal bandwidth. To address this, delta updates (only changed trie nodes) are computed and distributed. Delta size is typically <1% of full trie size. Consistent hashing assigns prefix ranges to serving node groups, so only the relevant group receives a delta for changed prefixes.
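A minimal sketch of the consistent-hash routing idea, mapping a changed prefix to the serving node group that owns it; the virtual-node count and group names are illustrative:

```python
import bisect
import hashlib

class PrefixRing:
    """Consistent-hash ring assigning prefixes to serving node groups."""

    def __init__(self, groups, vnodes=64):
        entries = []
        for group in groups:
            for i in range(vnodes):
                entries.append((self._hash(f"{group}#{i}"), group))
        entries.sort()
        self._hashes = [h for h, _ in entries]
        self._groups = [g for _, g in entries]

    @staticmethod
    def _hash(key):
        return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")

    def group_for(self, prefix):
        # The first ring position clockwise from the prefix hash owns the prefix.
        idx = bisect.bisect(self._hashes, self._hash(prefix)) % len(self._hashes)
        return self._groups[idx]

ring = PrefixRing(["group-a", "group-b", "group-c"])
# A delta for a changed prefix ships only to the group returned here:
print(ring.group_for("wea"))
```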
For extreme traffic spikes (e.g., breaking news queries), the trending detection pipeline bypasses the standard aggregation cycle and directly injects hot queries into the serving layer via a fast-path Redis write. Serving nodes subscribe to a Redis Pub/Sub channel for trending query updates and apply them as overlays on top of the base trie results. This ensures trending queries appear in suggestions within 2–3 minutes of a traffic spike.
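A sketch of the serving-node side of that fast path, assuming redis-py and a JSON payload; the channel name and message shape are assumptions:

```python
import json
import threading
import redis  # third-party: redis-py

TRENDING_CHANNEL = "trending-queries"   # channel name is an assumption

class TrendingOverlay:
    """Applies fast-path trending updates on top of base trie results."""

    def __init__(self, redis_client):
        self._boosts = {}                  # query -> overlay score
        self._lock = threading.Lock()
        self._pubsub = redis_client.pubsub()
        self._pubsub.subscribe(TRENDING_CHANNEL)

    def listen_forever(self):
        # Typically run on a background thread on each serving node.
        for message in self._pubsub.listen():
            if message["type"] != "message":
                continue
            update = json.loads(message["data"])   # e.g. {"query": "...", "score": 42.0}
            with self._lock:
                self._boosts[update["query"]] = update["score"]

    def apply(self, base_results, k=10):
        """base_results: ranked (query, score) pairs from the base trie lookup."""
        with self._lock:
            merged = dict(base_results)
            for query, boost in self._boosts.items():
                merged[query] = max(merged.get(query, 0.0), boost)
        return [q for q, _ in sorted(merged.items(), key=lambda kv: kv[1], reverse=True)[:k]]
```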
Key Trade-offs
- Snapshot freshness vs. update cost: More frequent trie rebuilds improve relevance but increase network and CPU overhead on serving nodes
- Personalization depth vs. latency: Deep personalization requires per-user model inference, adding 10–20ms; shallow frequency boosts add <1ms
- Global vs. regional tries: A single global trie simplifies ops but may surface irrelevant regional queries; per-region tries improve local relevance at 5–10x storage cost
- Trie depth vs. coverage: Storing suggestions only for prefixes of length ≤ 10 covers 99%+ of traffic but misses long-tail professional queries