System Design: Full-Text Search Engine
Design a full-text search engine capable of indexing millions of documents and serving complex keyword queries with relevance ranking. Covers Lucene internals, Elasticsearch architecture, and distributed indexing.
Requirements
Functional Requirements:
- Index text documents and support full-text keyword search
- Support phrase queries, boolean operators (AND, OR, NOT), and wildcard queries
- Return results ranked by relevance (TF-IDF / BM25)
- Support field-level search (title, body, author, tags)
- Near-real-time indexing — new documents searchable within 1 second
- Support highlighting of matched terms in snippets
Non-Functional Requirements:
- Sub-100ms query latency at p99
- Index 1 billion documents totaling 10 TB of text
- Handle 10,000 search QPS
- 99.9% availability with no data loss
- Horizontal scalability — add nodes to increase capacity
Scale Estimation
With 1 billion documents averaging 10 KB of raw text, raw storage is ~10 TB. The inverted index (compressed postings + term dictionary) typically runs 20–30% of raw text size: ~2–3 TB. Note that under scatter-gather, every query touches one copy of each shard, so per-shard query rate tracks cluster QPS rather than QPS divided by shard count; keeping per-shard work to ~20 ms and spreading shard copies across ~50 nodes keeps per-node load manageable at 10,000 QPS. Index replication (one replica per shard) doubles index storage to ~4–6 TB but provides fault tolerance and lets queries be served from either copy, halving per-copy read load.
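As a sanity check, the arithmetic above in a few lines of Python (the constants are the stated assumptions, not measurements):

```python
# Back-of-envelope estimate using the assumptions above (decimal TB).
DOCS = 1_000_000_000
DOC_SIZE_KB = 10

raw_tb = DOCS * DOC_SIZE_KB / 1e9            # KB -> TB: 10.0 TB raw text
index_lo, index_hi = raw_tb * 0.2, raw_tb * 0.3   # index at 20-30% of raw
repl_lo, repl_hi = index_lo * 2, index_hi * 2     # one replica doubles it

print(f"raw text:       {raw_tb:.1f} TB")
print(f"inverted index: {index_lo:.1f}-{index_hi:.1f} TB")
print(f"with 1 replica: {repl_lo:.1f}-{repl_hi:.1f} TB")
```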
High-Level Architecture
The system mirrors Elasticsearch's architecture. An indexing pipeline receives documents via a REST API, applies analysis (tokenization, stemming, stop word removal), and writes to Lucene segments on each primary shard. A query coordinator receives search requests, fans them out to all relevant shards, collects partial results, and merges them into a ranked global result set. A cluster state manager (ZooKeeper or Raft-based) tracks shard assignments, node health, and index metadata.
Documents are indexed by first applying a configurable analysis chain: character filters (HTML stripping), tokenizers (standard whitespace/Unicode), and token filters (lowercase, stop words, Porter stemmer, synonyms). Each token is added to an in-memory buffer managed by the Lucene IndexWriter. When the buffer reaches a threshold (e.g., 512 MB), it is flushed to a new immutable Lucene segment on disk. Segments are periodically merged (TieredMergePolicy) to reduce the number of open file handles and improve query performance.
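A toy version of such an analysis chain makes the stages concrete; the stop word list and suffix-stripping "stemmer" below are deliberately simplistic stand-ins for real analyzers:

```python
import re

STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to"}  # illustrative subset

def analyze(text: str) -> list[str]:
    """Toy analysis chain: char filter -> tokenizer -> token filters.
    A real chain would use a Unicode tokenizer and a proper Porter stemmer."""
    text = re.sub(r"<[^>]+>", " ", text)           # char filter: strip HTML
    tokens = re.findall(r"\w+", text.lower())      # tokenizer + lowercase
    tokens = [t for t in tokens if t not in STOP_WORDS]       # stop words
    return [t[:-1] if t.endswith("s") else t for t in tokens] # crude stemming

print(analyze("<p>The Quick Brown Foxes</p>"))  # ['quick', 'brown', 'foxe']
```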
Query execution uses the Coordinating Node pattern. A client sends a query to any node, which acts as coordinator. The coordinator broadcasts the query to one copy (primary or replica) of each shard. Each shard executes the query locally, computing BM25 scores, and returns its top-K (e.g., top-100) hits with scores. The coordinator merges these partial results into a global ranking, fetches source fields for the final top-N, and returns the response. This two-phase approach (scatter-gather, then fetch) keeps per-shard work bounded.
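The coordinator's merge step can be sketched as a k-way merge of pre-sorted per-shard result lists (a simplification; a real coordinator also handles shard failures and the separate fetch phase):

```python
import heapq, itertools

def merge_shard_results(shard_hits, n=10):
    """Coordinator-side merge of per-shard top-K lists.

    Each shard returns (doc_id, score) pairs already sorted by descending
    score, so a k-way heap merge yields the global top-N without
    re-sorting everything."""
    merged = heapq.merge(*shard_hits, key=lambda hit: hit[1], reverse=True)
    return list(itertools.islice(merged, n))

shard_a = [("a3", 9.1), ("a7", 4.2)]
shard_b = [("b1", 7.5), ("b9", 6.0)]
print(merge_shard_results([shard_a, shard_b], n=3))
# [('a3', 9.1), ('b1', 7.5), ('b9', 6.0)]
```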
Core Components
Lucene Segment Engine
Each shard is a Lucene index composed of immutable segments. Each segment contains a term dictionary (an FST over the sorted term set), postings lists (docID + term frequency + positions, delta-encoded), stored fields (original document values), doc values (columnar storage for sorting/aggregations), and norms (per-field length normalization). Boolean queries execute by merging postings lists: intersection for AND, union for OR, with NOT applied as an exclusion filter. Phrase queries additionally consult position data to verify token adjacency. Wildcard and fuzzy queries walk the FST to enumerate matching terms efficiently.
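The AND case reduces to a linear intersection of two sorted doc-ID lists, which is essentially what a postings merge does (minus skip lists and block-max pruning):

```python
def intersect(p1, p2):
    """AND of two postings lists (sorted doc IDs), the core of boolean
    query evaluation. Real engines add skip lists to jump ahead."""
    out, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            out.append(p1[i]); i += 1; j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return out

print(intersect([1, 4, 9, 20], [4, 9, 15, 20]))  # [4, 9, 20]
```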
BM25 Scoring
BM25 replaces classic TF-IDF as the default relevance model. The BM25 score for a document D given query Q is: sum over query terms t of IDF(t) * (TF(t,D) * (k1+1)) / (TF(t,D) + k1 * (1 - b + b * |D|/avgdl)). Default parameters k1=1.2, b=0.75 are tunable per-field. IDF is computed from shard-local statistics, so document-frequency skew across shards can distort scores slightly; a global-statistics phase (DFS query-then-fetch) corrects this at the cost of an extra round trip. Field-level boosts (e.g., title matches worth 3x body) are applied as multiplicative factors. Function score queries allow incorporating non-text signals like recency, popularity, or geographic distance.
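A direct transcription of the formula above, using Lucene's smoothed IDF variant (parameter names follow the text):

```python
import math

def bm25(tf, df, N, doc_len, avgdl, k1=1.2, b=0.75):
    """Per-term BM25 contribution, matching the formula above.
    tf: term frequency in the document; df: documents containing the
    term; N: documents in the collection (shard-local in practice)."""
    idf = math.log(1 + (N - df + 0.5) / (df + 0.5))  # Lucene's smoothed IDF
    norm = k1 * (1 - b + b * doc_len / avgdl)        # length normalization
    return idf * tf * (k1 + 1) / (tf + norm)

# The same rare term scores higher in a short document than a long one:
print(bm25(tf=2, df=100, N=1_000_000, doc_len=50,  avgdl=200))
print(bm25(tf=2, df=100, N=1_000_000, doc_len=500, avgdl=200))
```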
Near-Real-Time Indexing
Lucene supports NRT search by refreshing the IndexReader to expose newly written (but not yet committed to disk) segments. A refresh operation (default: 1 second in Elasticsearch) opens a new reader over the current segment state. The translog (WAL) ensures documents are not lost if a node crashes before a full fsync commit. On recovery, the translog replays unconfirmed operations against the last committed Lucene checkpoint. Primary-to-replica replication happens at the operation level: each indexing operation is forwarded to replicas before the primary acknowledges the client.
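The write path can be caricatured in a few lines; the translog, buffer, and searchable view below are hypothetical in-memory stand-ins for their on-disk counterparts:

```python
class Shard:
    """Minimal sketch of the NRT write path described above."""

    def __init__(self):
        self.translog = []     # WAL: appended (and fsync'd) on every write
        self.ram_buffer = []   # IndexWriter's in-memory buffer
        self.searchable = []   # documents visible to the current reader

    def index(self, doc):
        self.translog.append(doc)    # 1. durable append to the WAL
        self.ram_buffer.append(doc)  # 2. buffer for the next segment

    def refresh(self):
        """Cheap and frequent (~1s): flush the buffer to a new segment
        and reopen the reader so buffered docs become searchable."""
        self.searchable.extend(self.ram_buffer)
        self.ram_buffer.clear()

    def commit(self):
        """Expensive and rare: fsync segments to disk; translog entries
        covered by the commit can then be truncated."""
        self.refresh()
        self.translog.clear()
```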
Database Design
The index metadata (shard assignments, field mappings, index settings) is stored in the cluster state, managed by elected master nodes and replicated via Raft consensus. Each shard is stored as a directory of Lucene files: .tim (term dictionary), .doc (postings), .pos (positions), .pay (payloads), .nvd/.nvm (norms), .dvd/.dvm (doc values), .fdx/.fdt (stored fields). A dedicated .tlog translog file records all uncommitted writes. Index aliases (virtual index names) allow zero-downtime reindexing by atomically swapping which physical index an alias points to.
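An alias swap for zero-downtime reindexing might look like the following Elasticsearch-style _aliases call (the index and alias names are hypothetical):

```python
import requests

# Atomically repoint the "products" alias from the old physical index to
# the new one; searches against the alias never see a gap.
requests.post("http://localhost:9200/_aliases", json={
    "actions": [
        {"remove": {"index": "products_v1", "alias": "products"}},
        {"add":    {"index": "products_v2", "alias": "products"}},
    ],
})
```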
API Design
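The external API can mirror Elasticsearch's document and search endpoints: PUT to index a document, POST to search with a structured query body. A sketch follows (host, index name, and fields are illustrative):

```python
import requests

BASE = "http://localhost:9200"   # hypothetical cluster endpoint

# Index a document (searchable within ~1s, after the next refresh):
requests.put(f"{BASE}/articles/_doc/42", json={
    "title": "Designing Search", "body": "...", "tags": ["search"],
})

# Boolean + phrase query with a field boost and highlighting:
requests.post(f"{BASE}/articles/_search", json={
    "query": {"bool": {
        "must":     [{"match_phrase": {"body": "inverted index"}}],
        "should":   [{"match": {"title": {"query": "search", "boost": 3}}}],
        "must_not": [{"term": {"tags": "draft"}}],
    }},
    "highlight": {"fields": {"body": {}}},
    "size": 10,
})
```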
Scaling & Bottlenecks
The primary scaling levers are shard count and replica count. More shards distribute indexing and query load but increase coordination overhead (scatter-gather to more nodes). A common pattern is to over-shard at index creation time (e.g., 10 primary shards) so the cluster can scale to 10 nodes without reindexing. Hot spotting occurs when all documents route to one shard (e.g., timestamp-based routing); routing by content hash distributes more evenly. Index lifecycle management (ILM) automatically rolls over hot indices to warm/cold tiers, moving older segments to cheaper storage.
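Content-hash routing is a one-liner; the sketch below uses MD5 as a stand-in hash (Elasticsearch itself hashes the routing value with murmur3):

```python
import hashlib

def route(doc_id: str, num_primary_shards: int) -> int:
    """Content-hash routing: a stable document-to-shard assignment that
    avoids the hot-spotting of timestamp-based routing."""
    digest = hashlib.md5(doc_id.encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_primary_shards

print(route("doc-12345", 10))  # always the same shard for this ID
```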
Segment merging is a significant I/O bottleneck during heavy indexing. Merge throttling (max_bytes_per_sec) prevents merges from saturating disk I/O. For bulk indexing (initial population), setting refresh_interval=-1 and disabling replicas during ingestion, then re-enabling both afterward, can achieve 5–10x faster ingestion. JVM heap pressure from large in-memory caches is a common operational concern; field data caches for aggregations in particular must be carefully bounded to avoid long GC pauses.
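The bulk-ingestion settings flip described above, as Elasticsearch-style _settings calls (endpoint and index name are hypothetical):

```python
import requests

BASE = "http://localhost:9200"   # hypothetical cluster endpoint

# Before bulk ingestion: stop refreshes and drop replicas.
requests.put(f"{BASE}/articles/_settings", json={
    "index": {"refresh_interval": "-1", "number_of_replicas": 0},
})

# ... run the bulk load ...

# Afterward: restore NRT refreshes and replication.
requests.put(f"{BASE}/articles/_settings", json={
    "index": {"refresh_interval": "1s", "number_of_replicas": 1},
})
```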
Key Trade-offs
- NRT freshness vs. query performance: More frequent refreshes expose new documents sooner but create more small segments that slow queries until merged
- Shard count vs. overhead: More shards enable more parallelism but increase scatter-gather fan-out and cluster state complexity
- Stored fields vs. storage cost: Storing the original document enables rich highlighting but doubles storage; use source filtering to reduce overhead
- Replica count vs. write throughput: More replicas improve read throughput and fault tolerance, but every indexing operation must be replayed on each replica, so write cost grows linearly with replica count