
System Design: Image Search Engine

Design a large-scale image search engine supporting keyword-based, reverse image, and visual similarity search. Covers CNN-based feature extraction, approximate nearest neighbor search, and multi-modal indexing.


Requirements

Functional Requirements:

  • Search images by text query (keyword search)
  • Support reverse image search — upload an image, find visually similar images
  • Return results ranked by visual similarity and relevance
  • Index 10 billion images
  • Support filtering by image type, color, size, license
  • Support object-level search ("find images containing cats")

Non-Functional Requirements:

  • Query latency under 300ms for text queries, under 500ms for reverse image search
  • 99.9% availability
  • Index new images within 1 hour of upload
  • Handle 50,000 text search QPS and 5,000 reverse image QPS
  • Store 10 billion image embeddings efficiently

Scale Estimation

At 10 billion images with 2,048-dimensional float32 embeddings, raw embedding storage is 10B * 2,048 * 4 bytes = 80 TB. Product quantization (PQ) compresses each embedding from 8 KB to ~64 bytes (128x compression), reducing storage to ~640 GB. With PQ-compressed embeddings fitting in RAM across a cluster of 100 machines (6.4 GB each), approximate nearest neighbor (ANN) search becomes feasible. Text metadata (tags, alt text, surrounding context) for 10 billion images at 500 bytes average = 5 TB in the text index.
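
The arithmetic behind these figures can be sanity-checked with a few lines; the constants below simply restate the assumptions above (10 billion images, 2,048-dim float32 embeddings, 64-byte PQ codes, 500 bytes of text metadata per image).

```python
# Back-of-the-envelope storage estimates for the numbers quoted above.
NUM_IMAGES = 10_000_000_000      # 10 billion images
EMBED_DIM = 2_048                # embedding dimensionality
FLOAT_BYTES = 4                  # float32
PQ_CODE_BYTES = 64               # bytes per PQ-compressed embedding
METADATA_BYTES = 500             # average text metadata per image
NUM_MACHINES = 100               # machines holding the in-RAM vector index

raw_embeddings_tb = NUM_IMAGES * EMBED_DIM * FLOAT_BYTES / 1e12   # ~80 TB
pq_embeddings_gb = NUM_IMAGES * PQ_CODE_BYTES / 1e9               # ~640 GB
compression_ratio = (EMBED_DIM * FLOAT_BYTES) / PQ_CODE_BYTES     # 128x
per_machine_gb = pq_embeddings_gb / NUM_MACHINES                  # ~6.4 GB per machine
text_index_tb = NUM_IMAGES * METADATA_BYTES / 1e12                # ~5 TB

print(f"raw: {raw_embeddings_tb:.0f} TB, PQ: {pq_embeddings_gb:.0f} GB "
      f"({compression_ratio:.0f}x), per machine: {per_machine_gb:.1f} GB, "
      f"text index: {text_index_tb:.0f} TB")
```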

High-Level Architecture

The system has two primary search paths: text-to-image (keyword query → ranked image results) and image-to-image (query image → visually similar images). Both paths converge on a shared multi-modal index. Text search uses a standard inverted index over image metadata (alt text, captions, surrounding page text, OCR-extracted text). Image similarity search uses a vector index of CNN embeddings supporting approximate nearest neighbor queries.

A feature extraction pipeline runs continuously, processing new and existing images through a CNN (EfficientNet-B7 or CLIP vision encoder) to produce 2,048-dimensional embedding vectors. CLIP is preferred for its shared embedding space with text, enabling zero-shot text-to-image retrieval. Extracted embeddings are stored in a vector database (Faiss, Milvus, or Pinecone). Text metadata is extracted via OCR (Tesseract) and crawl-time context extraction, then indexed in Elasticsearch.

At query time, text queries follow a dual path: the query is encoded into a CLIP text embedding and searched against the vector index (semantic image retrieval), while a keyword search runs in parallel against the text metadata index. Results from both paths are merged and re-ranked by a fusion model. For reverse image search, the query image is encoded by the same CLIP vision encoder, and its embedding is used directly for ANN search in the vector index.

Core Components

CNN Feature Extraction Pipeline

Images ingested from crawlers or user uploads are passed through a preprocessing pipeline (resize to 224×224, normalize pixel values) before being fed into the CNN. A GPU cluster runs inference in batches of 256 images. Each image produces a 2,048-dim L2-normalized embedding. CLIP's shared embedding space allows the text query "sunset over mountains" to be encoded as a text embedding that is directly comparable to image embeddings via cosine similarity. Embeddings are stored in Faiss with Product Quantization (PQ) for compression and IVF (Inverted File Index) for partitioned search.
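
As a concrete sketch of the extraction step, assuming the open-source open_clip package and its ViT-L/14 checkpoint (which emits 768-dim embeddings rather than the 2,048-dim figure used in the estimates above); batching across the GPU cluster and device placement are omitted:

```python
import torch
import open_clip
from PIL import Image

# Sketch only: model choice and embedding width are assumptions (ViT-L/14 -> 768-dim).
model, _, preprocess = open_clip.create_model_and_transforms("ViT-L-14", pretrained="openai")
tokenizer = open_clip.get_tokenizer("ViT-L-14")
model.eval()

@torch.no_grad()
def embed_images(paths: list[str]) -> torch.Tensor:
    # Preprocess (resize, center-crop, normalize pixel values) and stack into one batch.
    batch = torch.stack([preprocess(Image.open(p).convert("RGB")) for p in paths])
    feats = model.encode_image(batch)
    return feats / feats.norm(dim=-1, keepdim=True)   # L2-normalize

@torch.no_grad()
def embed_text(query: str) -> torch.Tensor:
    feats = model.encode_text(tokenizer([query]))
    return feats / feats.norm(dim=-1, keepdim=True)

# With both sides L2-normalized, cosine similarity is a plain dot product:
# scores = embed_text("sunset over mountains") @ embed_images(paths).T
```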

Approximate Nearest Neighbor Index

Faiss IVF-PQ is the core ANN structure. The index partitions the embedding space into K=65,536 Voronoi cells using K-means clustering. Each cell maintains an inverted list of PQ-compressed embeddings. At query time, the coarse quantizer identifies the C=256 cells nearest to the query embedding, and only those cells' inverted lists are scanned. Approximate distances are computed from the PQ codes via precomputed lookup tables (asymmetric distance computation). This achieves ~95% recall@10 while scanning only about 0.4% of the full index. The index is sharded across machines by ranges of IVF cells.
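
A toy-scale sketch of this structure using Faiss; the parameters are shrunk from the production values (K=65,536 cells, 2,048-dim embeddings, ~10B vectors) so it runs on a laptop:

```python
import numpy as np
import faiss  # assumes the faiss-cpu (or faiss-gpu) package is installed

d, nlist, m = 128, 1024, 16          # toy: dim, IVF cells, PQ sub-quantizers (m bytes/vector)
xb = np.random.rand(200_000, d).astype("float32")
faiss.normalize_L2(xb)               # on unit vectors, L2 ranking matches cosine ranking

quantizer = faiss.IndexFlatL2(d)                      # coarse quantizer over cell centroids
index = faiss.IndexIVFPQ(quantizer, d, nlist, m, 8)   # 8 bits per PQ sub-code

index.train(xb[:100_000])   # K-means over a sample (production: ~1% of the corpus)
index.add(xb)

index.nprobe = 32           # cells scanned per query (production: C=256 of 65,536 ≈ 0.4%)
distances, ids = index.search(xb[:5], 10)   # approximate top-10 neighbors per query
```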

Multi-Modal Fusion Ranker

Results from text search and vector search are fused using a late fusion model. Each result receives a text relevance score (BM25) and a visual similarity score (cosine similarity). A learned linear combination, tuned on user click data, merges the two scores. Object detection labels (COCO classes) from a secondary detection model (YOLO or Faster R-CNN) provide structured labels ("contains: person, car, tree") that enrich the text index and enable structured filtering. Color histograms enable color-based filtering as a lightweight structured field.
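
A minimal sketch of the fusion step, assuming each candidate arrives with a BM25 score and a cosine similarity; the weights are placeholders for values learned from click data, and min-max normalization is one simple way to put the two score scales on a common footing:

```python
from typing import Dict, List, Tuple

# Placeholder weights; in production these are fit on click-through data.
W_TEXT, W_VISUAL = 0.6, 0.4

def _minmax(scores: Dict[str, float]) -> Dict[str, float]:
    # Rescale scores to [0, 1] so BM25 and cosine values are comparable.
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0
    return {img: (s - lo) / span for img, s in scores.items()}

def fuse(bm25: Dict[str, float], cosine: Dict[str, float], k: int = 50) -> List[Tuple[str, float]]:
    """Late fusion: linear combination of normalized text and visual scores."""
    text, visual = _minmax(bm25), _minmax(cosine)
    fused = {
        img: W_TEXT * text.get(img, 0.0) + W_VISUAL * visual.get(img, 0.0)
        for img in set(text) | set(visual)
    }
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)[:k]

# Example usage with toy scores from the two retrieval paths.
print(fuse({"img_1": 12.3, "img_2": 8.1}, {"img_1": 0.82, "img_3": 0.91}, k=3))
```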

Database Design

Image metadata is stored in a combination of systems: Cassandra stores the image catalog (image_id, url, source_domain, crawl_time, width, height, license, ocr_text) partitioned by image_id. Elasticsearch indexes the textual signals (alt_text, caption, ocr_text, object_labels, color_tags). Faiss (in-memory, sharded) stores the compressed embedding vectors. S3/GCS stores the raw image thumbnails and full-resolution images. An image deduplication service uses perceptual hashing (pHash) to detect and merge near-duplicate images, reducing index size by 20–30%.
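
A sketch of the near-duplicate check, assuming the open-source ImageHash library; the Hamming threshold is illustrative and would be tuned against a labeled duplicate set:

```python
from PIL import Image
import imagehash

# Perceptual hashes are stable under resizing and re-encoding, so near-duplicates
# land within a small Hamming distance of each other.
HAMMING_THRESHOLD = 8   # illustrative threshold

def phash(path: str) -> imagehash.ImageHash:
    return imagehash.phash(Image.open(path))

def is_near_duplicate(path_a: str, path_b: str) -> bool:
    # ImageHash overloads subtraction to return the Hamming distance between hashes.
    return (phash(path_a) - phash(path_b)) <= HAMMING_THRESHOLD
```

At 10-billion-image scale the comparison cannot be pairwise; in practice hashes are bucketed (for example by hash prefix) so only candidates sharing a bucket are compared.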

API Design
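
A plausible minimal API surface for the two search paths, sketched here with FastAPI; endpoint names, parameters, and filter fields are illustrative rather than fixed by the design above:

```python
from fastapi import FastAPI, File, Query, UploadFile

app = FastAPI()

@app.get("/v1/images/search")
def search_by_text(
    q: str = Query(..., description="keyword query"),
    image_type: str | None = None,    # structured filters from the requirements:
    color: str | None = None,         # type, color, size, license
    license_type: str | None = None,
    limit: int = 50,
):
    """Text query: dual-path retrieval (BM25 + CLIP text embedding), fused and ranked."""
    ...

@app.post("/v1/images/search-by-image")
def search_by_image(file: UploadFile = File(...), limit: int = 50):
    """Reverse image search: encode the upload with the CLIP vision encoder, then ANN search."""
    ...
```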

Scaling & Bottlenecks

CNN inference is the primary throughput bottleneck. A single A100 GPU processes ~5,000 images/sec for CLIP ViT-L/14 inference. For 10,000 new images/sec ingestion, two GPUs are required for feature extraction alone. The vector index build is an offline process: Faiss IVF training (K-means on 1% sample) runs periodically, and new embeddings are added incrementally to existing IVF cells. Full index rebuilds run weekly to rebalance cluster centroids. Online ANN queries execute in 10–30ms for 10B images with IVF-PQ.
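
The train-once, add-incrementally pattern looks roughly like this in Faiss (toy sizes again; the production index is sharded and its coarse quantizer is trained on a ~1% sample):

```python
import numpy as np
import faiss

d, nlist = 128, 1024
index = faiss.index_factory(d, f"IVF{nlist},PQ16")   # same IVF-PQ layout as the toy index above

# Offline: train coarse centroids and PQ codebooks on a sample of the corpus.
sample = np.random.rand(100_000, d).astype("float32")
index.train(sample)

# Streaming: newly ingested embeddings are appended under their image IDs without
# retraining; centroids are only refreshed by the periodic full rebuild.
new_ids = np.arange(1_000_000, 1_001_000, dtype="int64")
new_vecs = np.random.rand(1_000, d).astype("float32")
index.add_with_ids(new_vecs, new_ids)
```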

For reverse image search at 5,000 QPS, each query requires one CNN inference (~0.2ms on GPU with batching) plus one ANN lookup (~20ms). GPU batching is critical: queries are held in a micro-batch queue (max wait: 5ms) and processed together. Without batching, GPU utilization drops below 5% for bursty traffic. The ANN index search is parallelized across shards; each shard returns its top-K, and a global merge selects the final top-K results.
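
A simplified sketch of the micro-batch queue (standard library only; `encode_batch` stands in for the batched GPU forward pass):

```python
import queue
import threading
import time

MAX_BATCH = 256      # GPU batch size
MAX_WAIT_S = 0.005   # 5 ms maximum wait for batch-mates

request_queue: queue.Queue = queue.Queue()

def batching_worker(encode_batch):
    """Collect requests for up to MAX_WAIT_S (or MAX_BATCH items), then run one GPU call."""
    while True:
        item, reply = request_queue.get()          # block until at least one request arrives
        batch = [(item, reply)]
        deadline = time.monotonic() + MAX_WAIT_S
        while len(batch) < MAX_BATCH:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                break
            try:
                batch.append(request_queue.get(timeout=remaining))
            except queue.Empty:
                break
        embeddings = encode_batch([img for img, _ in batch])   # one batched forward pass
        for (_, reply_q), emb in zip(batch, embeddings):
            reply_q.put(emb)                                   # hand each caller its result

def encode(image):
    """Called per request; blocks until the micro-batched embedding is ready."""
    reply: queue.Queue = queue.Queue(maxsize=1)
    request_queue.put((image, reply))
    return reply.get()

# Start the worker once, e.g.:
# threading.Thread(target=batching_worker, args=(my_gpu_encode_fn,), daemon=True).start()
```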

Key Trade-offs

  • Exact NN vs. ANN: Exact nearest neighbor guarantees perfect recall but is O(N) per query; ANN with IVF-PQ achieves 90–95% recall at O(√N) cost
  • CLIP vs. task-specific CNN: CLIP provides unified text-image embeddings enabling zero-shot retrieval but produces lower recall than fine-tuned domain-specific models
  • Embedding dimensionality vs. ANN accuracy: Higher-dimensional embeddings capture more visual detail but require more storage and slower ANN search
  • Deduplication aggressiveness vs. diversity: Aggressive near-duplicate removal reduces storage but may remove visually similar but contextually distinct images
