System Design: Message Search (at scale)
System design for message search at scale, covering real-time indexing, per-user access control in search results, relevance ranking, and search infrastructure over petabytes of message data.
Requirements
Functional Requirements:
- Full-text search across all messages in a user's conversations
- Support for search operators: from:user, in:channel, before:date, after:date, has:link, has:file
- Relevance-ranked results with recency boost
- Search results respect access control (user only sees messages from conversations they belong to)
- Near real-time indexing: new messages searchable within 10 seconds
- Typeahead/autocomplete for recent messages and contacts
Non-Functional Requirements:
- Index 50 billion messages across 500 million users
- Search query latency under 200ms (p99)
- Indexing throughput: 500K new messages per second
- Search index must not exceed 2x the raw message data size
- Support for 20 languages with language-specific tokenization and stemming
Scale Estimation
With 50 billion messages at an average of 200 bytes each, the raw message corpus is ~10TB. The inverted index (with positional data) runs approximately 1.5-2x the raw text size, so the search index is ~15-20TB. At 500K new messages per second, the indexer must ingest 100MB/sec of raw message data; with the 1.5-2x index expansion (plus segment-merge amplification), sustained index write volume is higher still. Search query volume: 500M users with an average of 2 searches per day = 1 billion searches/day ≈ 11,574 queries per second. Each query scans a user-scoped partition of the index; with proper partitioning, each query touches only a small subset (one user's data) of the total index.
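These estimates reduce to a few lines of arithmetic; a quick sanity-check script (all constants are taken from the requirements above):

```python
MESSAGES = 50e9                  # total messages
AVG_MSG_BYTES = 200              # average message size
INDEX_FACTOR = (1.5, 2.0)        # inverted index size vs. raw text

raw_corpus_tb = MESSAGES * AVG_MSG_BYTES / 1e12               # ~10 TB
index_tb = [raw_corpus_tb * f for f in INDEX_FACTOR]          # ~15-20 TB

WRITE_RATE = 500_000             # new messages per second
ingest_mb_s = WRITE_RATE * AVG_MSG_BYTES / 1e6                # ~100 MB/s raw ingest

USERS = 500e6
SEARCHES_PER_DAY = 2
qps = USERS * SEARCHES_PER_DAY / 86_400                       # ~11,574 QPS

print(f"corpus: {raw_corpus_tb:.0f} TB, index: {index_tb[0]:.0f}-{index_tb[1]:.0f} TB")
print(f"ingest: {ingest_mb_s:.0f} MB/s, search load: {qps:,.0f} QPS")
```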
High-Level Architecture
The search system has three pipelines: indexing, serving, and autocomplete. The Indexing Pipeline consumes messages from a Kafka topic (messages-all) emitted by the messaging service. A fleet of Index Workers tokenizes, normalizes, and writes documents to Elasticsearch. The index is partitioned by user_id (or workspace_id for enterprise contexts): each user's messages form an isolated index partition. Scoping every search query to a single partition both enforces access control and keeps query cost proportional to one user's data rather than the whole corpus.
The Serving Pipeline handles search queries. A Search API receives the query, parses search operators (e.g., from:alice in:#engineering after:2025-01-01), translates them into an Elasticsearch DSL query scoped to the user's partition, executes the query, and returns results with highlighted snippets. The query is enriched with relevance scoring: BM25 for text relevance, combined with a recency decay function (newer messages score higher) and an engagement boost (messages the user has starred or reacted to score higher).
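One way to express this scoring in Elasticsearch is a function_score query: the match clause supplies BM25 text relevance, a gauss decay on timestamp provides the recency boost, and a weighted filter boosts starred messages. A minimal sketch; field names follow the mapping in the Database Design section, while the decay scale and weight values are illustrative assumptions:

```python
def build_scored_query(text: str) -> dict:
    """BM25 match + recency decay + engagement boost, as Elasticsearch DSL."""
    return {
        "function_score": {
            "query": {"match": {"content": text}},  # BM25 text relevance
            "functions": [
                {   # recency decay: newer messages score higher
                    "gauss": {"timestamp": {"origin": "now", "scale": "30d", "decay": 0.5}}
                },
                {   # engagement boost: starred messages score higher (weight is illustrative)
                    "filter": {"term": {"is_starred": True}},
                    "weight": 1.5,
                },
            ],
            "score_mode": "multiply",  # how function scores combine with each other
            "boost_mode": "multiply",  # how they combine with the BM25 score
        }
    }
```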
The Autocomplete Pipeline powers the search-as-you-type experience. It uses a separate data structure: a prefix trie backed by Redis, populated with recent message snippets, conversation names, and contact names. As the user types, the client queries the autocomplete endpoint every 200ms (debounced), which returns matching suggestions from the trie. The autocomplete index is per-user and limited to the last 10K messages and all conversation/contact names.
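A common way to approximate a per-user prefix trie in Redis is a sorted set queried with ZRANGEBYLEX, which gives trie-like prefix lookups without a custom data structure. A minimal sketch assuming redis-py and a hypothetical autocomplete:{user_id} key scheme:

```python
import redis

r = redis.Redis()  # assumed local instance

def add_suggestion(user_id: str, phrase: str) -> None:
    # All members get score 0: with equal scores Redis sorts members
    # lexicographically, which is exactly what prefix search needs.
    r.zadd(f"autocomplete:{user_id}", {phrase.lower(): 0})

def suggest(user_id: str, prefix: str, limit: int = 5) -> list[str]:
    p = prefix.lower()
    # "[" makes the bound inclusive; appending "\xff" makes the upper bound
    # sort after any plausible continuation of the prefix.
    hits = r.zrangebylex(
        f"autocomplete:{user_id}", f"[{p}", f"[{p}\xff", start=0, num=limit
    )
    return [h.decode() for h in hits]
```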
Core Components
Index Worker
The Index Worker processes raw messages from Kafka and writes to Elasticsearch. For each message, it: (1) detects the language using a compact n-gram classifier, (2) tokenizes using the appropriate language analyzer (e.g., ICU tokenizer for CJK languages, standard tokenizer for English with stemming via Snowball), (3) extracts structured fields (sender, timestamp, conversation_id, message_type, URLs, file metadata), and (4) writes the document to the user's Elasticsearch index shard. Workers are partitioned by user_id hash to ensure all messages for a user go to the same worker, enabling local batching (50 documents per Elasticsearch bulk request).
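A sketch of the worker loop under stated assumptions: detect_language, extract_urls, and es_bulk_write are hypothetical stand-ins for the n-gram classifier, the field extractor, and a wrapper over the Elasticsearch bulk API; messages arrive as dicts from the Kafka consumer.

```python
BATCH_SIZE = 50  # documents per Elasticsearch bulk request, as sized above

def build_document(msg: dict) -> dict:
    """Steps (1)-(3): detect language, extract structured fields, build the doc."""
    language = detect_language(msg["content"])  # hypothetical n-gram classifier
    return {
        "_id": msg["message_id"],
        "routing": msg["user_id"],              # keeps one user's docs on one shard
        "_source": {
            "content": msg["content"],
            "language": language,               # drives analyzer selection downstream
            "sender_id": msg["sender_id"],
            "conversation_id": msg["conversation_id"],
            "timestamp": msg["timestamp"],
            "message_type": msg["message_type"],
            "urls": extract_urls(msg["content"]),  # hypothetical URL extractor
        },
    }

def run_worker(consume):
    """consume() yields raw messages from this worker's Kafka partitions."""
    batch = []
    for msg in consume():
        batch.append(build_document(msg))
        if len(batch) >= BATCH_SIZE:            # step (4): local batching
            es_bulk_write(batch)                # hypothetical bulk-API wrapper
            batch.clear()
```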
Query Parser
The Query Parser translates human-readable search queries into Elasticsearch DSL. It recognizes operators: from: maps to a term query on the sender field, in: maps to a term query on conversation_name or channel_id, before:/after: map to range queries on timestamp, has:link maps to an exists query on the urls field. Free text is parsed into a match query on the content field with BM25 scoring. Multiple operators are combined with boolean must clauses. The parser also handles quoted phrases (exact match) and negation (-from:bot excludes messages from bots).
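A simplified sketch of the operator grammar above, rendered as a Python function that emits the Elasticsearch bool query (it assumes at most one quoted phrase and leaves handle-to-ID resolution as a comment):

```python
import re

OPERATOR = re.compile(r'(-?)(from|in|before|after|has):(\S+)')

def parse_query(raw: str) -> dict:
    """Translate operator syntax into an Elasticsearch bool query (simplified)."""
    must, must_not, filters = [], [], []

    def clause(op: str, value: str) -> dict:
        if op == "from":
            # a production parser would resolve the handle to sender_id
            return {"term": {"sender_name": value}}
        if op == "in":
            return {"term": {"conversation_name": value.lstrip("#")}}
        if op in ("before", "after"):
            bound = "lt" if op == "before" else "gt"
            return {"range": {"timestamp": {bound: value}}}
        if op == "has":
            # has:link -> urls field, has:file -> file_names (per the mapping)
            return {"exists": {"field": {"link": "urls", "file": "file_names"}[value]}}
        raise ValueError(op)

    for negated, op, value in OPERATOR.findall(raw):
        (must_not if negated else filters).append(clause(op, value))

    free_text = OPERATOR.sub("", raw).strip()
    if free_text:
        # quoted phrases become exact matches, the rest a BM25 match query
        kind = "match_phrase" if free_text.startswith('"') else "match"
        must.append({kind: {"content": free_text.strip('"')}})

    return {"bool": {"must": must, "filter": filters, "must_not": must_not}}
```

For example, parse_query('from:alice in:#engineering after:2025-01-01 deploy') yields term filters on sender and conversation, a range filter on timestamp, and a match clause on "deploy".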
Access Control Layer
Every search query is scoped to prevent unauthorized access. The Access Control Layer adds a mandatory filter to every Elasticsearch query: for a per-user partitioned index, the query only hits that user's shard (isolation by partition). For per-workspace partitioned indexes (enterprise), a filter clause {terms: {conversation_id: [user's conversation IDs]}} is appended to every query. The user's conversation list is fetched from a Redis cache (conversations:{user_id} → Set<conversation_id>) and refreshed on conversation join/leave events. This ensures a user can never see messages from conversations they don't belong to.
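A minimal sketch of the per-workspace variant, assuming redis-py and the conversations:{user_id} key scheme from the text; the per-user case needs no filter because routing already confines the query to one shard:

```python
import redis

r = redis.Redis()  # assumed cache instance

def scope_to_user(query: dict, user_id: str) -> dict:
    """Wrap any parsed query with the mandatory conversation-membership filter."""
    conv_ids = [c.decode() for c in r.smembers(f"conversations:{user_id}")]
    return {
        "bool": {
            "must": [query],
            # mandatory filter: only conversations the user belongs to
            "filter": [{"terms": {"conversation_id": conv_ids}}],
        }
    }
```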
Database Design
Elasticsearch index mapping for the message document: {message_id (keyword), conversation_id (keyword), conversation_name (text + keyword), sender_id (keyword), sender_name (text + keyword), content (text, analyzed with language-specific analyzer), timestamp (date), message_type (keyword), urls (keyword array), file_names (text), has_attachment (boolean), reactions_count (integer), is_starred (boolean)}. The content field uses a custom analyzer chain: character filter (HTML strip), tokenizer (standard or ICU), token filters (lowercase, stemmer, stop words removal).
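The same mapping written out as an index-creation body; the analyzer name is illustrative, and the icu_tokenizer alternative noted in the comment requires the ICU plugin:

```python
MESSAGE_INDEX = {
    "settings": {
        "analysis": {
            "analyzer": {
                "message_text": {
                    "type": "custom",
                    "char_filter": ["html_strip"],       # strip HTML remnants
                    "tokenizer": "standard",             # or "icu_tokenizer" for CJK
                    "filter": ["lowercase", "stop", "snowball"],
                }
            }
        }
    },
    "mappings": {
        "properties": {
            "message_id": {"type": "keyword"},
            "conversation_id": {"type": "keyword"},
            "conversation_name": {"type": "text", "fields": {"raw": {"type": "keyword"}}},
            "sender_id": {"type": "keyword"},
            "sender_name": {"type": "text", "fields": {"raw": {"type": "keyword"}}},
            "content": {"type": "text", "analyzer": "message_text"},
            "timestamp": {"type": "date"},
            "message_type": {"type": "keyword"},
            "urls": {"type": "keyword"},
            "file_names": {"type": "text"},
            "has_attachment": {"type": "boolean"},
            "reactions_count": {"type": "integer"},
            "is_starred": {"type": "boolean"},
        }
    },
}
```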
Index partitioning strategy: for consumer messaging (billions of users with small message volumes each), each user gets a virtual index within a shared physical index using routing key = user_id. Elasticsearch routes all documents and queries for a user to the same shard. For enterprise messaging (fewer but larger workspaces), each workspace gets a dedicated index with multiple shards based on message volume. Index lifecycle management (ILM) moves old message data from hot nodes (SSD) to warm nodes (HDD) after 90 days, and to cold (compressed, read-only) after 1 year.
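Assuming the elasticsearch-py 8.x client, the routing key appears symmetrically in the index and search calls; a minimal sketch:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed endpoint

def index_message(user_id: str, msg_id: str, doc: dict) -> None:
    # routing=user_id sends all of one user's documents to the same shard
    es.index(index="messages", id=msg_id, routing=user_id, document=doc)

def search_messages(user_id: str, query: dict) -> dict:
    # the same routing key confines the query to that single shard,
    # so a user-scoped search never fans out across the cluster
    return es.search(index="messages", routing=user_id, query=query)
```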
API Design
- GET /api/v1/search?q={query}&limit=20&cursor={token} — Full-text search; the query supports operators and returns {results: [{message_id, conversation_id, sender, content_snippet, timestamp, highlight}], next_cursor}
- GET /api/v1/search/autocomplete?prefix={text}&limit=5 — Typeahead; returns matching message snippets, conversation names, and contact names
- POST /api/v1/search/reindex/{conversation_id} — Trigger reindexing of a conversation (admin endpoint for repair scenarios)
- GET /api/v1/search/stats — Index statistics: total documents, index size, indexing lag
Scaling & Bottlenecks
The primary bottleneck is indexing throughput. At 500K messages/sec, the Elasticsearch cluster must sustain continuous bulk writes. The solution uses a tiered indexing approach: messages are first written to a Redis-backed near-real-time (NRT) index that supports search within 1 second but holds only the last hour of messages. A background batch indexer writes to the main Elasticsearch index every 10 seconds. This decouples search freshness from bulk indexing throughput. The NRT index is queried in parallel with the main index, and results are merged, deduplicated, and re-ranked.
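A sketch of the merge step; search_nrt and search_elasticsearch are hypothetical wrappers over the two tiers, and the two calls are written sequentially here although production would issue them concurrently:

```python
def merged_search(query: dict, user_id: str, limit: int = 20) -> list[dict]:
    """Query both tiers, deduplicate by message_id, re-rank by score."""
    nrt_hits = search_nrt(query, user_id)             # last hour, 1s freshness
    main_hits = search_elasticsearch(query, user_id)  # full history, 10s freshness

    # A message indexed in both tiers keeps its NRT copy (listed first),
    # then the union is re-ranked by relevance score.
    seen, merged = set(), []
    for hit in nrt_hits + main_hits:
        if hit["message_id"] not in seen:
            seen.add(hit["message_id"])
            merged.append(hit)
    merged.sort(key=lambda h: h["score"], reverse=True)
    return merged[:limit]
```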
Elasticsearch cluster sizing: with 20TB of index data and a 3x replication factor, total storage is 60TB. Using 2TB SSD drives, the hot tier needs 30 nodes. Each node handles approximately 400 queries/sec (11,574 QPS / 30 nodes). CPU is the typical bottleneck for search, not I/O, since hot data fits in the OS page cache. The cluster uses shard allocation awareness to ensure primary and replica shards are on different racks. Index rollover (create a new index when the current one reaches 50GB shard size) prevents individual shards from growing too large.
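The sizing arithmetic, spelled out:

```python
INDEX_TB = 20          # index data from the scale estimation
REPLICATION = 3        # primary + replicas
NODE_SSD_TB = 2        # per-node hot-tier storage

total_tb = INDEX_TB * REPLICATION        # 60 TB including replicas
hot_nodes = total_tb // NODE_SSD_TB      # 30 hot-tier nodes
qps_per_node = 11_574 / hot_nodes        # ~386 QPS/node, rounded to ~400 above
```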
Key Trade-offs
- Per-user partitioning over global index: Per-user partitioning provides built-in access control and isolates query scope to a small data subset, but prevents cross-user search (needed for admin/compliance tools) and creates many small shards for low-volume users — addressed via routing key to share physical shards
- NRT Redis index + batch Elasticsearch over pure Elasticsearch: The NRT tier provides 1-second search freshness without stressing Elasticsearch with per-document writes, but adds architectural complexity and requires result merging
- BM25 + recency decay over pure recency sorting: Relevance ranking surfaces the most contextually appropriate results (not just newest), but users sometimes expect chat search to be purely chronological — offering a 'sort by date' toggle addresses this
- Language detection per message over per-user language setting: Per-message detection handles multilingual users and code-switching, but adds CPU overhead (~0.5ms per message) and can misclassify very short messages; the mitigation is falling back to the user locale for messages under 20 characters (sketched below)
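The fallback rule from the last trade-off, as a short sketch; detect_language is the same hypothetical n-gram classifier assumed in the Index Worker section:

```python
MIN_DETECT_LEN = 20  # below this, n-gram detection is unreliable (per the trade-off above)

def message_language(content: str, user_locale: str) -> str:
    if len(content) < MIN_DETECT_LEN:
        return user_locale               # too short to classify: use the user's locale
    return detect_language(content)      # hypothetical n-gram classifier
```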