
System Design: Google Search

Learn how to design a web-scale search engine like Google, covering crawling, indexing, ranking, and query processing across billions of pages. A must-know system design interview topic for senior engineers.


Requirements

Functional Requirements:

  • Crawl and index billions of web pages continuously
  • Support keyword, phrase, and boolean queries
  • Return ranked, relevant results in under 200ms
  • Support freshness — newly published pages appear within hours
  • Handle personalization, SafeSearch, and localization
  • Provide spell correction, query suggestions, and featured snippets

Non-Functional Requirements:

  • 99.999% availability — search must never go down
  • Sub-200ms p99 latency for query serving
  • Petabyte-scale storage for the index
  • Horizontally scalable to handle 8.5 billion queries per day
  • Fault-tolerant crawling with deduplication

Scale Estimation

Google processes approximately 8.5 billion searches per day, which works out to roughly 100,000 QPS on average, with peaks several times higher. The web contains an estimated 5–50 billion indexable pages. Assuming an average page size of 100 KB, raw crawl storage runs from 500 TB to 5 PB. The forward and inverted indexes together require tens of petabytes. Crawling at scale requires 100,000+ concurrent fetches per second distributed across global data centers.
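
These figures follow from simple arithmetic. The sketch below recomputes them from the inputs quoted above; the ~2x peak-to-average factor is an assumption for illustration:

    # Back-of-the-envelope check of the scale estimates above.
    queries_per_day = 8.5e9
    avg_qps = queries_per_day / 86_400      # ~98,000 QPS on average
    peak_qps = 2 * avg_qps                  # assumed ~2x peak-to-average factor

    pages_low, pages_high = 5e9, 50e9       # estimated indexable pages
    page_size = 100_000                     # 100 KB average page
    raw_low = pages_low * page_size         # 5e14 bytes = 500 TB
    raw_high = pages_high * page_size       # 5e15 bytes = 5 PB

    print(f"avg ~{avg_qps:,.0f} QPS, raw crawl {raw_low/1e12:.0f} TB to {raw_high/1e15:.0f} PB")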

High-Level Architecture

The system splits into three major pipelines: crawling, indexing, and query serving. The crawl pipeline fetches web pages via distributed crawlers, deduplicates content, and stores raw HTML in a distributed store like GFS/Colossus. The indexing pipeline parses HTML, extracts tokens, computes link graphs, runs PageRank, and builds the inverted index. The query serving layer handles incoming searches, executes lookups against the sharded index, runs ranking models, and assembles the search engine results page (SERP).

Crawlers follow a frontier of URLs managed by a URL scheduler. They respect robots.txt, detect redirects, and store raw pages in a distributed blob store. A document processor pipeline runs offline (MapReduce/Dataflow jobs) to parse, tokenize, and produce postings lists. The inverted index maps each token to a postings list: (docID, term frequency, positions). These lists are sharded by term across thousands of index servers.
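
To make the postings structure concrete, here is a toy in-memory indexer that produces term → (docID, term frequency, positions) entries. It is a single-process sketch of what the offline pipeline computes, not the distributed job itself:

    from collections import defaultdict

    def build_postings(docs):
        """docs: {doc_id: text}. Returns term -> [(doc_id, tf, positions), ...]."""
        index = defaultdict(list)
        for doc_id, text in sorted(docs.items()):
            positions = defaultdict(list)
            for pos, token in enumerate(text.lower().split()):
                positions[token].append(pos)
            for term, plist in positions.items():
                index[term].append((doc_id, len(plist), plist))
        return index

    index = build_postings({1: "web scale search", 2: "web crawling at web scale"})
    # index["web"] -> [(1, 1, [0]), (2, 2, [0, 3])]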

At query time, a query processor parses the search string, expands synonyms, identifies entities, and fans out to multiple index shards in parallel. Each shard scores its candidate documents using a combination of TF-IDF, BM25, and learned ranking signals. A top-K merge step assembles global results. A separate ranker applies deep neural ranking (e.g., BERT-based models) to the top few hundred candidates before the final SERP is assembled.
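
The fan-out/merge step can be sketched as follows, assuming each shard has already returned its candidates sorted by descending score (the shard contents and scores are invented for illustration):

    import heapq, itertools

    def merge_top_k(shard_results, k=10):
        # Each shard supplies (score, doc_id) pairs sorted by descending score;
        # heapq.merge lazily interleaves them without concatenating everything.
        merged = heapq.merge(*shard_results, key=lambda hit: -hit[0])
        return list(itertools.islice(merged, k))

    shard_a = [(0.92, "doc7"), (0.40, "doc3")]
    shard_b = [(0.88, "doc9"), (0.15, "doc1")]
    print(merge_top_k([shard_a, shard_b], k=3))
    # [(0.92, 'doc7'), (0.88, 'doc9'), (0.4, 'doc3')]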

Core Components

Web Crawler

The crawler is a distributed system of thousands of fetcher nodes coordinated by a URL frontier. The frontier prioritizes URLs by PageRank, recrawl schedule, and freshness signals. DNS resolution is pre-cached to reduce latency. Politeness constraints cap request rates per domain. Bloom filters detect already-crawled URLs to avoid duplicates. A content fingerprint (SimHash) detects near-duplicate pages. Raw HTML and metadata are written to a distributed object store keyed by URL hash.
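
A minimal SimHash sketch shows why near-duplicate pages get nearby fingerprints: every token votes on every bit, so documents that share most tokens agree on most bits. MD5 here is a stand-in for whatever token hash a production crawler would use:

    import hashlib

    def simhash(tokens, bits=64):
        """Fingerprint a token sequence; similar pages yield nearby hashes."""
        votes = [0] * bits
        for tok in tokens:
            h = int(hashlib.md5(tok.encode()).hexdigest(), 16)
            for i in range(bits):
                votes[i] += 1 if (h >> i) & 1 else -1
        return sum(1 << i for i in range(bits) if votes[i] > 0)

    def hamming(a, b):
        return bin(a ^ b).count("1")

    a = simhash("cheap flights to new york book today".split())
    b = simhash("cheap flights to new york book now".split())
    print(hamming(a, b))   # small distance -> likely near-duplicates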

Inverted Index

The inverted index is the core data structure: a mapping from term → sorted postings list of (docID, TF, positions). It is built offline in MapReduce jobs. Postings lists are compressed by delta-encoding the gaps between consecutive docIDs and then variable-byte coding the gaps. The index is sharded across thousands of machines by term range, each shard holding a portion of the term space. Index servers keep hot portions of the index in memory (DRAM or SSD) for fast random access. DocIDs are derived from URL hashes, and postings are kept sorted by docID so that AND queries can intersect lists with a fast merge or galloping search.
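
A sketch of that compression, assuming docIDs arrive sorted: store the gap to the previous docID, then emit each gap as variable-byte with the high bit marking the final byte (one of several common varbyte conventions):

    def delta_varbyte_encode(doc_ids):
        """Compress a sorted docID list: delta-encode, then varbyte each gap."""
        out, prev = bytearray(), 0
        for d in doc_ids:
            gap, prev = d - prev, d
            while gap >= 128:
                out.append(gap & 0x7F)
                gap >>= 7
            out.append(gap | 0x80)        # high bit marks the last byte of a gap
        return bytes(out)

    def delta_varbyte_decode(data):
        doc_ids, cur, shift, prev = [], 0, 0, 0
        for b in data:
            cur |= (b & 0x7F) << shift
            if b & 0x80:                  # last byte of this gap
                prev += cur
                doc_ids.append(prev)
                cur, shift = 0, 0
            else:
                shift += 7
        return doc_ids

    ids = [824, 829, 215406]
    assert delta_varbyte_decode(delta_varbyte_encode(ids)) == ids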

Ranking Engine

Ranking is multi-stage. Stage 1 uses lightweight BM25 scoring over the inverted index to retrieve top-10,000 candidates per shard. Stage 2 applies feature-rich gradient-boosted tree models (LambdaMART) over hundreds of signals: anchor text, backlink count, PageRank, content freshness, click-through rate, and user engagement metrics. Stage 3 uses a BERT-based neural re-ranker on the top 200 candidates for semantic relevance. Final results are re-ranked for diversity and filtered for spam.
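
Stage 1's per-term score is the standard BM25 formula, transcribed directly below; a document's Stage 1 score is the sum over its query terms. The example counts are invented:

    import math

    def bm25(tf, df, doc_len, avg_doc_len, n_docs, k1=1.2, b=0.75):
        """Standard BM25: IDF weight times a saturating, length-normalized TF."""
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        tf_norm = tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))
        return idf * tf_norm

    # A term appearing 3x in a 500-token page, present in 1M of 5B documents:
    print(bm25(tf=3, df=1_000_000, doc_len=500, avg_doc_len=400, n_docs=5_000_000_000))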

Database Design

The document store (raw HTML, metadata) uses a distributed key-value store keyed by URL fingerprint — conceptually similar to Bigtable with row key = URL hash and columns for raw content, HTTP headers, crawl timestamp, and PageRank score. The inverted index is stored as SSTable files on GFS/Colossus, loaded into memory-mapped segments on index servers. Link graph data (for PageRank) is stored in a graph-oriented format, processed in bulk by Pregel-style graph computation jobs. Each document gets a numeric docID for compact postings list storage.
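
Conceptually, one row of the document store might look like the sketch below. The field names are illustrative, not Bigtable's actual schema or API:

    import hashlib, time
    from dataclasses import dataclass

    def url_fingerprint(url: str) -> str:
        """Row key: a stable URL hash spreads rows evenly across tablet servers."""
        return hashlib.sha256(url.encode()).hexdigest()

    @dataclass
    class DocumentRow:
        row_key: str          # url_fingerprint(url)
        raw_content: bytes    # column: raw HTML from the crawl
        http_headers: dict    # column: response headers
        crawl_ts: float       # column: crawl timestamp (epoch seconds)
        pagerank: float       # column: precomputed PageRank score

    row = DocumentRow(url_fingerprint("https://example.com/"),
                      b"<html>...</html>", {"Content-Type": "text/html"},
                      time.time(), 0.0042)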

API Design
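
A minimal query-serving interface, sketched with illustrative endpoint and field names (not Google's actual API):

    # GET /search?q=<query>&start=<offset>&num=<count>&hl=<lang>&safe=<on|off>
    from dataclasses import dataclass, field
    from typing import Optional

    @dataclass
    class SearchResult:
        url: str
        title: str
        snippet: str
        score: float

    @dataclass
    class SearchResponse:
        query: str
        corrected_query: Optional[str]   # set when spell correction rewrote q
        total_estimate: int              # estimated match count, never exact
        latency_ms: float
        results: list = field(default_factory=list)   # list of SearchResult

Pagination via start/num keeps the serving layer stateless, and the match count is an estimate because no shard can count exhaustively within its latency budget.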

Scaling & Bottlenecks

The primary scaling challenge is index sharding. As the web grows, both term shards (horizontal) and document shards (vertical) must scale independently. Google uses a two-tier index: a real-time index for fresh documents (written within hours via incremental indexing) and a base index built from full crawl batch jobs. Merging these at query time adds complexity but ensures freshness. Replication of each shard (3x) provides read throughput and fault tolerance. Cache layers (Memcached clusters) cache popular query results for sub-millisecond responses.
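
Merging the two tiers at query time can be sketched as below; when a document appears in both, the real-time entry wins because it reflects the newer crawl. The (score, docID) hit format is assumed for illustration:

    def merge_tiers(realtime_hits, base_hits):
        """Combine real-time and base index results for one query.
        A docID present in both tiers keeps its real-time entry."""
        fresh = {doc_id for _, doc_id in realtime_hits}
        merged = list(realtime_hits)
        merged.extend(hit for hit in base_hits if hit[1] not in fresh)
        return sorted(merged, key=lambda hit: -hit[0])   # descending score

    rt = [(0.91, "doc42")]                    # doc42 was recrawled an hour ago
    base = [(0.80, "doc42"), (0.75, "doc7")]  # stale duplicate gets dropped
    print(merge_tiers(rt, base))              # [(0.91, 'doc42'), (0.75, 'doc7')]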

Query latency at scale is managed through strict budgets: each index shard must respond within 50ms or is skipped. A hedged request strategy sends duplicate requests to backup replicas if the primary is slow. The query fan-out to thousands of shards is executed in parallel. Global load balancers route queries to the nearest data center. Serving infrastructure is deployed across 20+ regions worldwide, with regional indexes maintained in sync via continuous replication of index deltas.
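
The hedging logic can be sketched with asyncio: ask the primary replica first, launch a duplicate request on a backup if the primary is still silent after a short hedge delay, and return nothing once the 50ms shard budget expires so the merger simply skips that shard. The 10ms hedge delay is an assumed tuning value:

    import asyncio

    async def hedged_fetch(primary, backup, hedge_after=0.010, budget=0.050):
        """primary/backup: zero-arg coroutine functions querying two replicas."""
        first = asyncio.create_task(primary())
        try:
            # Happy path: the primary answers before the hedge delay expires.
            return await asyncio.wait_for(asyncio.shield(first), hedge_after)
        except asyncio.TimeoutError:
            pass                                 # primary is slow: hedge it
        second = asyncio.create_task(backup())
        done, pending = await asyncio.wait(
            {first, second},
            timeout=budget - hedge_after,
            return_when=asyncio.FIRST_COMPLETED,
        )
        for task in pending:
            task.cancel()
        return done.pop().result() if done else None   # None -> skip this shard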

Key Trade-offs

  • Freshness vs. completeness: Real-time indexing favors fresh pages but requires dual-index complexity; batch indexing is simpler but stale
  • Recall vs. latency: Scanning more shards improves recall but increases tail latency; strict per-shard timeouts sacrifice recall for speed
  • Precision vs. spam tolerance: Aggressive spam filtering reduces junk results but risks false positives on legitimate new sites
  • Personalization vs. privacy: User history improves relevance but raises significant data privacy concerns
