System Design: Real-Time Search Indexing
Design a real-time search indexing system that makes newly created or updated content searchable within seconds of publication. Covers change data capture, Kafka-based pipelines, and near-real-time Lucene indexing.
Requirements
Functional Requirements:
- Make new or updated documents searchable within 1 second of write
- Support inserts, updates, and deletes propagated to the search index
- Handle high write throughput (100,000 document writes/sec)
- Preserve document ordering and avoid partial updates visible to users
- Support schema evolution without downtime
- Enable backfill reindexing when the index schema changes
Non-Functional Requirements:
- End-to-end indexing latency (write → searchable) under 1 second at p95
- No data loss during pipeline failures
- Horizontal scalability — add pipeline workers to increase throughput
- Exactly-once semantics for index updates
- 99.99% indexing pipeline availability
Scale Estimation
At 100,000 document writes/sec and an average document size of 5 KB, the ingestion rate is 500 MB/sec. Each Elasticsearch indexing worker can handle ~10,000 docs/sec (bulk API, 10 MB batches), so 10 indexing workers are needed, each consuming from a dedicated Kafka partition. The topic therefore needs at least 10 partitions, putting per-partition throughput at ~50 MB/sec — comfortably within a well-tuned broker's capacity. With 3x Kafka replication, total Kafka storage for a 24-hour retention window is 500 MB/sec * 86,400 sec * 3 = ~130 TB.
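The arithmetic above can be checked directly; all figures (write rate, document size, per-worker throughput) are the assumptions stated in the text, not measurements.

```python
# Back-of-envelope check of the scale estimation above.
WRITES_PER_SEC = 100_000      # document writes/sec (assumed)
DOC_SIZE_KB = 5               # average document size (assumed)
WORKER_DOCS_PER_SEC = 10_000  # per-worker bulk throughput (assumed)
REPLICATION = 3               # Kafka replication factor
RETENTION_SEC = 86_400        # 24-hour retention window

# Ingestion rate in MB/sec: 100,000 * 5 KB = 500 MB/sec
ingest_mb_per_sec = WRITES_PER_SEC * DOC_SIZE_KB / 1_000

# Workers (and hence minimum partitions) to cover the write rate
workers = WRITES_PER_SEC // WORKER_DOCS_PER_SEC

# Total replicated Kafka storage for the retention window, in TB
storage_tb = ingest_mb_per_sec * RETENTION_SEC * REPLICATION / 1_000_000
```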
High-Level Architecture
The pipeline has four stages: source capture, event streaming, index processing, and index writing. Source capture uses Change Data Capture (CDC) to extract write events from the primary database (e.g., Debezium reading the MySQL binlog or PostgreSQL WAL). Events are published to Apache Kafka. Index processors consume from Kafka, transform events into indexable documents (field mapping, enrichment, normalization), and batch-write to Elasticsearch using the bulk API. A consumer group offset per Kafka partition tracks processing progress, enabling safe restarts and, combined with idempotent index writes, effectively exactly-once processing.
Change Data Capture with Debezium monitors the database write-ahead log. Every INSERT, UPDATE, or DELETE generates a structured event with the operation type, old values, and new values. For updates, only changed fields are included (delta updates). For deletes, a tombstone record is published to trigger document deletion from the search index. Debezium checkpoints its log position in Kafka (using a separate offsets topic), ensuring no events are missed after a restart. Multiple CDC connectors run in parallel for different database tables, each writing to a dedicated Kafka topic.
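A minimal sketch of turning a Debezium-style change event into an index operation. The envelope shape (`op`, `before`, `after`) follows Debezium's conventions; the `IndexAction` type is a hypothetical internal representation, not a Debezium API.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class IndexAction:
    op: str              # "index" or "delete"
    doc_id: str
    doc: Optional[dict]  # None for deletes

def to_index_action(event: dict) -> IndexAction:
    """Map a CDC change event to a search-index operation."""
    op = event["op"]  # Debezium codes: "c"=create, "u"=update, "d"=delete
    if op == "d":
        # Tombstone: the deleted row is in "before"; remove it from the index.
        return IndexAction("delete", event["before"]["id"], None)
    # Creates and updates both carry the new row state in "after".
    return IndexAction("index", event["after"]["id"], event["after"])
```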
Index processors use a Kafka consumer group with N workers. Each worker maintains a micro-batch buffer (up to 1,000 documents or a 100 ms maximum wait), applies field transformations (date parsing, text analysis, embedding computation for semantic fields), and submits to Elasticsearch's bulk API. Each bulk write is appended to the primary shard's translog (fsynced to disk by default), making it durable immediately; the NRT refresh (default 1 second) then converts the in-memory indexing buffer into a searchable segment, exposing the documents to queries. Workers commit Kafka offsets only after Elasticsearch confirms the bulk write succeeded, ensuring at-least-once delivery. Idempotent writes (using doc_id as the Elasticsearch _id) make this effectively exactly-once.
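The worker's micro-batch logic above (flush at 1,000 documents or 100 ms, commit offsets only after a confirmed bulk write) can be sketched as follows; `bulk_index` and `commit_offsets` are assumed callbacks standing in for the Elasticsearch bulk client and the Kafka consumer's commit, not real library calls.

```python
import time

class MicroBatcher:
    """Buffers records; flushes at max_docs or max_wait_s, whichever first."""

    def __init__(self, bulk_index, commit_offsets,
                 max_docs: int = 1_000, max_wait_s: float = 0.1):
        self.bulk_index = bulk_index
        self.commit_offsets = commit_offsets
        self.max_docs = max_docs
        self.max_wait_s = max_wait_s
        self.buffer = []
        self.started = 0.0

    def add(self, record, now: float = None):
        now = time.monotonic() if now is None else now
        if not self.buffer:
            self.started = now  # clock starts at the batch's first record
        self.buffer.append(record)
        if len(self.buffer) >= self.max_docs or now - self.started >= self.max_wait_s:
            self.flush()

    def flush(self):
        if not self.buffer:
            return
        # If bulk_index raises, offsets are never committed, so Kafka
        # redelivers the batch: at-least-once, idempotent via doc _id.
        self.bulk_index(self.buffer)
        self.commit_offsets(self.buffer)
        self.buffer = []
```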
Core Components
Change Data Capture (Debezium)
Debezium is an open-source CDC platform that captures row-level changes from database transaction logs. For MySQL, it reads the binlog. For PostgreSQL, it uses logical decoding (the pgoutput plugin). For MongoDB, it tails the oplog (via change streams in recent versions). Events are serialized as Avro with schemas registered in Confluent Schema Registry. Schema Registry enforces backward/forward compatibility, enabling schema evolution without breaking consumers. Debezium handles table schema changes (column additions/removals) by automatically evolving the Avro schema, which consumers must handle gracefully (defaulting missing fields).
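The "defaulting missing fields" behavior can be sketched as a tolerant decode step: events produced under an older schema fall back to defaults for fields added later. The field names and default values here are illustrative, not from a real schema.

```python
# Illustrative defaults for fields added in later schema versions.
DEFAULTS = {"category": "uncategorized", "in_stock": True}

def with_defaults(record: dict) -> dict:
    """Fill fields absent from older-schema events; present values win."""
    return {**DEFAULTS, **record}
```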
Kafka Event Bus
Kafka provides durable, ordered, partitioned log storage between CDC and the indexing workers. Topics are partitioned by document ID (consistent hashing), ensuring all events for the same document are processed by the same worker (preserving order for that document). Kafka's retention (24 hours) provides a replay buffer: if the indexing worker falls behind, it can reprocess events at higher speed without data loss. Kafka Streams or Apache Flink can be added as an intermediate processing layer for complex transformations (joining multiple change streams, computing derived fields, deduplication within a time window).
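Deterministic partition assignment by document ID, as described above, can be sketched like this. Kafka's default partitioner hashes the record key with murmur2; a stable MD5-based hash is used here purely for illustration (Python's built-in `hash()` is not stable across processes).

```python
import hashlib

def partition_for(doc_id: str, num_partitions: int) -> int:
    """Map a document ID to a fixed partition so all events for that
    document are consumed, in order, by the same worker."""
    digest = hashlib.md5(doc_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions
```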
Elasticsearch Indexing Worker
Each worker manages a Kafka consumer, a transformation pipeline, and a bulk indexer. The transformation pipeline applies: field extraction (map DB column names to ES field names), type coercion (parse ISO date strings to epoch millis), text normalization (trim whitespace, strip HTML), and optional embedding computation (for semantic fields). The bulk indexer accumulates operations and flushes when the batch reaches 1,000 documents or 100ms elapses. It handles partial failures: if some documents in a bulk request fail (e.g., mapping conflicts), they are written to a dead-letter Kafka topic for manual review. Successful documents are indexed; the Kafka offset advances only for fully processed batches.
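Per-item error handling for a bulk response can be sketched as below. The response shape (an `items` array where each entry is keyed by its operation type and carries an `error` object on failure) follows Elasticsearch's bulk API; `send_to_dlq` is an assumed callback that publishes to the dead-letter Kafka topic.

```python
def route_bulk_failures(bulk_response: dict, send_to_dlq) -> int:
    """Send failed items from a bulk response to the DLQ; return the count."""
    failed = 0
    for item in bulk_response["items"]:
        # Each item has a single key: "index", "delete", etc.
        result = next(iter(item.values()))
        if "error" in result:
            send_to_dlq({"doc_id": result["_id"], "error": result["error"]})
            failed += 1
    return failed
```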
Database Design
The primary source of truth remains the relational database (MySQL/PostgreSQL). The search index is a derived, eventually consistent view. A pipeline metadata store (PostgreSQL or DynamoDB) tracks: kafka_topic, partition, last_committed_offset, last_indexed_timestamp, error_count, lag (in messages and milliseconds). This metadata powers a monitoring dashboard that alerts when indexing lag exceeds 5 seconds. An index version registry tracks schema versions: each Elasticsearch index is named with a version suffix (e.g., products_v3), and an alias (products) points to the active version. Reindexing writes to a new versioned index, then atomically flips the alias.
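The atomic alias flip at the end of a reindex is a single `_aliases` call whose actions apply together; the request body looks like this (index names follow the versioned convention described above).

```python
# POST /_aliases with this body: the remove and add apply atomically,
# so "products" never points at zero or two indices.
alias_swap = {
    "actions": [
        {"remove": {"index": "products_v2", "alias": "products"}},
        {"add":    {"index": "products_v3", "alias": "products"}},
    ]
}
```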
Scaling & Bottlenecks
The primary bottleneck is Elasticsearch write throughput. A 3-node cluster handles ~30,000 docs/sec with default settings. Tuning for write throughput: (1) disable auto-refresh during bulk loading (refresh_interval=-1, re-enable after); (2) increase indexing buffer to 20% of JVM heap; (3) use async translog durability (durability=async) to reduce fsync latency (risk: last-second data in translog may be lost on crash, but Kafka replay ensures recovery); (4) use fewer, larger shards (5 shards per node maximum for write-heavy workloads). With these optimizations, throughput reaches 100,000+ docs/sec on a properly sized cluster.
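Tuning points (1) and (3) above are index settings applied before a bulk load and reverted afterwards (via `PUT /<index>/_settings`); the setting keys are Elasticsearch's, while the restore values assume the defaults described earlier.

```python
# Applied before a bulk load: no auto-refresh, async translog fsync.
bulk_load_settings = {
    "index": {
        "refresh_interval": "-1",              # (1) disable NRT refresh
        "translog": {"durability": "async"},   # (3) skip per-request fsync;
                                               # Kafka replay covers a crash
    }
}

# Restored after the load: 1 s NRT refresh, per-request translog fsync.
restore_settings = {
    "index": {
        "refresh_interval": "1s",
        "translog": {"durability": "request"},
    }
}
```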
Kafka consumer lag is the primary operational risk. If the indexing workers cannot keep up with CDC events, lag grows, and the search index falls behind. Horizontal scaling (add more consumer partitions + workers) handles gradual growth. For sudden spikes (e.g., bulk database migrations), the pipeline should support a "catch-up mode" where workers skip the NRT refresh and batch more aggressively (10,000 docs per bulk request, 1-second flush interval) to maximize throughput at the cost of temporary staleness.
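The switch into catch-up mode can be sketched as lag-driven batch sizing; the lag threshold is an illustrative assumption, and the two parameter sets are the normal and catch-up figures from the text.

```python
def batch_params(lag_messages: int, catch_up_threshold: int = 100_000) -> dict:
    """Widen batching when consumer lag crosses the threshold,
    trading freshness for throughput until the worker catches up."""
    if lag_messages > catch_up_threshold:
        return {"max_docs": 10_000, "max_wait_s": 1.0}   # catch-up mode
    return {"max_docs": 1_000, "max_wait_s": 0.1}        # normal NRT mode
```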
Key Trade-offs
- At-least-once vs. exactly-once: At-least-once with idempotent writes is operationally simpler; true exactly-once (Kafka transactions + Elasticsearch conditional writes) adds significant complexity
- NRT refresh interval vs. freshness: 1-second refresh exposes new documents quickly but creates many small segments; 30-second refresh is more efficient but increases visible staleness
- Synchronous vs. asynchronous translog: Async translog improves write throughput by 2–3x but risks losing up to 5 seconds of uncommitted writes on a hard crash (mitigated by Kafka replay)
- Delta vs. full-document updates: Delta updates (update only changed fields) reduce indexing CPU but require the index worker to fetch the current document state for missing fields; full-document reindexing is simpler but more expensive per event
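The delta-update trade-off in the last bullet can be made concrete: the worker fetches the current document state and merges the changed fields over it before reindexing. `fetch_current` is an assumed callback (a GET against the index or a local cache), not a real client API.

```python
def apply_delta(doc_id: str, changed_fields: dict, fetch_current) -> dict:
    """Merge a CDC delta over the current indexed document.
    The extra fetch is the cost delta updates pay to stay correct."""
    current = fetch_current(doc_id) or {}
    return {**current, **changed_fields}
```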