System Design: Semantic Search Engine
Design a semantic search engine that understands query intent and document meaning using dense vector representations, enabling similarity search beyond keyword matching. Covers bi-encoders, cross-encoders, and RAG patterns.
Requirements
Functional Requirements:
- Return semantically relevant results even without keyword overlap between query and document
- Support hybrid search combining keyword (BM25) and dense vector retrieval
- Index documents up to 10,000 tokens (long-form articles, PDFs, code)
- Support domain-specific embedding models (legal, medical, code)
- Enable Retrieval Augmented Generation (RAG) as a downstream use case
- Re-rank retrieved candidates using a cross-encoder for precision
Non-Functional Requirements:
- P99 query latency under 200ms for semantic search
- Index 100 million documents
- Handle 10,000 QPS
- Embedding pipeline processes 10,000 documents/min
- 99.9% availability
Scale Estimation
With 100 million documents and 768-dimensional float32 embeddings (BERT-base), raw embedding storage is 100M * 768 * 4 bytes = 307 GB. With int8 quantization, this drops to 76 GB — manageable on a medium-sized cluster. Long documents are chunked into 256-token passages; a 5,000-token document generates ~20 chunks. If average document length is 2,000 tokens, total passage count is ~800 million, requiring ~2.3 TB of embedding storage. Hierarchical indexing (document-level + passage-level) addresses this.
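A quick back-of-envelope check of these figures (decimal units, so the results round slightly differently from the prose above):

```python
# Storage estimates from the numbers above (decimal GB/TB).
DOCS = 100_000_000
DIM = 768
FLOAT32_BYTES = 4

doc_level_fp32 = DOCS * DIM * FLOAT32_BYTES          # raw float32 document embeddings
doc_level_int8 = DOCS * DIM * 1                      # after int8 quantization

avg_tokens, chunk_tokens = 2_000, 256
chunks_per_doc = avg_tokens / chunk_tokens           # ~7.8 passages per document
passage_level_fp32 = DOCS * chunks_per_doc * DIM * FLOAT32_BYTES

print(f"doc-level fp32: {doc_level_fp32 / 1e9:.0f} GB")        # ~307 GB
print(f"doc-level int8: {doc_level_int8 / 1e9:.0f} GB")        # ~77 GB
print(f"passage fp32:   {passage_level_fp32 / 1e12:.1f} TB")   # ~2.4 TB (~2.2 TiB)
```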
High-Level Architecture
The system combines a sparse retrieval layer (BM25 via Elasticsearch) with a dense retrieval layer (vector search via a specialized ANN index). A query encoder converts user queries to dense vectors in real time. An offline embedding pipeline encodes all documents and passages at index time. A hybrid fusion layer combines BM25 scores and cosine similarity scores using Reciprocal Rank Fusion (RRF) or learned weights. An optional cross-encoder re-ranker provides high-precision re-ranking of the top-50 candidates.
Document processing starts with chunking: long documents are split into fixed-size overlapping passages (256 tokens with a 32-token overlap) to handle models with limited context windows. Each chunk is encoded by a bi-encoder (e.g., sentence-transformers/all-mpnet-base-v2 or a domain-fine-tuned model). Embeddings are stored in a vector database (Weaviate, Qdrant, or Milvus). The original text and metadata are stored in Elasticsearch for BM25 retrieval and metadata filtering.
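A minimal sketch of the chunk-and-encode step, assuming a pre-tokenized document and the sentence-transformers library; real pipelines typically chunk on sentence boundaries and batch across many documents:

```python
from typing import List
from sentence_transformers import SentenceTransformer

def chunk_tokens(tokens: List[int], size: int = 256, overlap: int = 32) -> List[List[int]]:
    """Split a token sequence into fixed-size windows with the given overlap."""
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break
    return chunks

# Detokenized chunk text is encoded offline in batches on the GPU cluster.
model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")
passages = ["chunk text 1 ...", "chunk text 2 ..."]   # placeholder passage strings
embeddings = model.encode(passages, batch_size=64, normalize_embeddings=True)
```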
At query time, the query is encoded by the same bi-encoder (online inference, ~5ms on GPU). The query vector is searched against the ANN index (HNSW graph), returning top-100 candidates. Simultaneously, a BM25 query runs in Elasticsearch. The two result lists are merged via RRF. The top-50 fused candidates are passed to a cross-encoder (BERT-based, computing full attention over query-document pairs) for precision re-ranking. The cross-encoder's scores are the final ranking signal.
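The same query-time flow as a structural sketch; query_encoder, dense_search, bm25_search, rrf_fuse, and cross_encoder_score are hypothetical helpers standing in for the components described above:

```python
def search(query: str, top_k: int = 10):
    # Structural sketch only: the helpers below are hypothetical stand-ins.
    q_vec = query_encoder.encode(query)              # online bi-encoder inference (~5 ms)
    dense_hits = dense_search(q_vec, limit=100)      # HNSW top-100 candidates
    sparse_hits = bm25_search(query, limit=100)      # Elasticsearch BM25 top-100
    fused = rrf_fuse(sparse_hits, dense_hits)[:50]   # Reciprocal Rank Fusion of both lists
    scores = cross_encoder_score(query, fused)       # full-attention re-ranking of top-50
    ranked = sorted(zip(scores, fused), key=lambda x: x[0], reverse=True)
    return [passage for _, passage in ranked[:top_k]]
```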
Core Components
Bi-Encoder Embedding Service
The bi-encoder independently encodes queries and documents into a shared vector space where cosine similarity approximates semantic relevance. Model choices range from lightweight (MiniLM-L6: 384-dim, 22M params, 4ms/query) to high-quality (E5-large: 1024-dim, 335M params, 30ms/query). Models are served via TorchServe or Triton Inference Server with dynamic batching. Document embeddings are computed offline in batch (GPU cluster) and stored. Query embeddings are computed online on each request. ONNX export with FP16 quantization reduces inference time by 2–4x with minimal quality loss.
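A small illustration of the shared vector space, using a lightweight public checkpoint; the model ID and example texts are illustrative, not prescribed by the design:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")   # 384-dim, fast
query_vec = model.encode("how do I reset my password", normalize_embeddings=True)
doc_vecs = model.encode(
    ["Visit account settings to change your login credentials.",
     "Our office is closed on public holidays."],
    normalize_embeddings=True,
)

# With normalized vectors, the dot product equals cosine similarity.
scores = doc_vecs @ query_vec
print(scores)   # the first passage scores higher despite no keyword overlap with the query
```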
HNSW Vector Index
Hierarchical Navigable Small World (HNSW) graphs provide state-of-the-art ANN search performance. Each node in the graph is a document embedding; edges connect geometrically close neighbors. Search starts at a fixed entry point in the top layer, greedily descends through the lower layers, and performs a beam search in the bottom layer. HNSW achieves 95%+ recall@10 at O(log N) query complexity; index construction is O(N log N). Parameters M (maximum neighbors per node, default 16) and ef_construction (size of the candidate list explored while building the index, default 200) control the accuracy-speed trade-off. Qdrant and Weaviate implement HNSW with on-disk storage for indices larger than RAM.
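The design assumes a managed vector database, but the HNSW parameters are easiest to see with a raw library; a sketch using hnswlib with random vectors as stand-ins for passage embeddings:

```python
import numpy as np
import hnswlib

dim, n = 768, 100_000
vectors = np.random.rand(n, dim).astype(np.float32)           # stand-in for passage embeddings

index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=n, M=16, ef_construction=200)   # the parameters discussed above
index.add_items(vectors, np.arange(n))

index.set_ef(128)                                             # query-time beam width (ef_search)
labels, distances = index.knn_query(vectors[:1], k=100)       # top-100 approximate neighbors
```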
Cross-Encoder Re-Ranker
The cross-encoder takes a (query, document) pair as a single input, processes them with full cross-attention (unlike the bi-encoder which encodes independently), and outputs a scalar relevance score. This captures fine-grained query-document interactions that bi-encoders miss. Cross-encoders are 10–50x slower per pair than bi-encoder similarity but are only applied to the top-50 candidates. A DistilBERT-based cross-encoder scores 50 pairs in ~15ms on GPU, fitting within the 200ms total budget. MS-MARCO fine-tuned cross-encoders are commonly used as off-the-shelf re-rankers.
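A re-ranking sketch with a publicly available MS-MARCO cross-encoder; the specific checkpoint, query, and candidate texts are illustrative:

```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "statute of limitations for breach of contract"
candidates = ["passage text 1 ...", "passage text 2 ..."]   # the top-50 fused candidates in practice
scores = reranker.predict([(query, passage) for passage in candidates])
reranked = [p for _, p in sorted(zip(scores, candidates), key=lambda x: x[0], reverse=True)]
```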
Database Design
The system uses three storage layers:
- Elasticsearch for BM25 retrieval, metadata storage, and filtered search (fields: doc_id, passage_id, text, title, author, tags, date)
- Qdrant/Weaviate for vector storage and HNSW-based ANN search (embedding vectors plus payload metadata for filtering)
- S3 for original documents and computed embedding files (backup and batch reprocessing)
Passage-to-document mappings are stored in Redis for fast parent-document lookup during result assembly. An embedding version registry tracks which model version was used for each document chunk, enabling selective recomputation when models are updated.
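A sketch of the vector-store side using the Qdrant client; the collection name, placeholder vector, and version string are illustrative, while the payload fields follow the schema above:

```python
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct

client = QdrantClient(url="http://localhost:6333")
client.create_collection(
    collection_name="passages",
    vectors_config=VectorParams(size=768, distance=Distance.COSINE),
)
client.upsert(
    collection_name="passages",
    points=[PointStruct(
        id=1,
        vector=[0.0] * 768,                       # placeholder for a real passage embedding
        payload={"doc_id": "doc-42", "passage_id": 3, "tags": ["legal"],
                 "date": "2023-06-01", "embedding_version": "all-mpnet-base-v2"},
    )],
)
```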
API Design
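A plausible shape for the search endpoint, shown as a client call; the endpoint path, field names, and defaults are illustrative assumptions rather than a fixed contract:

```python
import requests

resp = requests.post("https://search.example.com/v1/search", json={
    "query": "statute of limitations for breach of contract",
    "top_k": 10,
    "mode": "hybrid",              # "dense" | "sparse" | "hybrid"
    "rerank": True,                # apply the cross-encoder to the fused candidates
    "filters": {"tags": ["legal"], "date_after": "2020-01-01"},
})
results = resp.json()["results"]   # each hit carries doc_id, passage_id, score, snippet
```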
Scaling & Bottlenecks
Embedding computation is the primary bottleneck for indexing throughput. A single A100 GPU running E5-large processes ~2,000 passages/sec. For 800 million passages (100M docs x 8 avg chunks), initial indexing requires ~400,000 GPU-seconds, or ~5 days on a single A100. A cluster of 100 GPUs reduces this to ~1 hour. Incremental indexing (new documents only) keeps throughput manageable. The HNSW index build also becomes expensive at scale; Qdrant supports distributed HNSW with sharding by document range.
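The indexing-time arithmetic, spelled out:

```python
passages = 800_000_000
per_gpu_throughput = 2_000                       # passages/sec, single A100 running E5-large
gpu_seconds = passages / per_gpu_throughput      # 400,000 GPU-seconds

print(gpu_seconds / 86_400)                      # ~4.6 days on one GPU
print(gpu_seconds / 100 / 3_600)                 # ~1.1 hours spread across 100 GPUs
```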
Hybrid fusion quality depends on score normalization. BM25 scores are unbounded; cosine similarities lie in [-1, 1], so direct combination requires normalization. Reciprocal Rank Fusion (RRF) sidesteps this by using only rank positions: RRF_score = 1/(k + rank_bm25) + 1/(k + rank_dense), where k = 60 (the constant from the original RRF paper) works well in practice. RRF generally outperforms simple score interpolation across diverse query types and requires no per-dataset hyperparameter tuning.
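RRF in a few lines; rank positions start at 1 and the id lists are toy examples:

```python
def rrf_fuse(bm25_ids, dense_ids, k=60):
    """Reciprocal Rank Fusion over two ranked lists of ids (rank 1 = best)."""
    scores = {}
    for ranking in (bm25_ids, dense_ids):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

print(rrf_fuse(["d3", "d1", "d7"], ["d1", "d9", "d3"]))
# ['d1', 'd3', 'd9', 'd7']: documents ranked well by both retrievers rise to the top
```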
Key Trade-offs
- Bi-encoder vs. cross-encoder: Bi-encoders enable precomputed document embeddings (fast ANN search); cross-encoders require online (query, doc) inference but achieve higher precision
- Dense vs. hybrid retrieval: Dense-only retrieval handles paraphrase queries well but misses exact keyword matches; hybrid combines both at the cost of system complexity
- Chunk size vs. context: Smaller chunks (128 tokens) improve passage-level precision but lose document context; larger chunks (512 tokens) retain context but dilute relevance signals
- Model size vs. latency: Larger models (E5-large) produce higher-quality embeddings but push online query encoding to 30ms, consuming 15% of the total latency budget