Blog / AI Engineering
AI Engineering

Building Production RAG Pipelines That Don't Fall Apart

A practical guide to designing RAG systems that survive real-world traffic, covering chunking, retrieval quality, re-ranking, and the failure modes nobody warns you about.

Akhil Sharma

Akhil Sharma

January 12, 2026

12 min read

Building Production RAG Pipelines That Don't Fall Apart

Every RAG demo looks impressive. Ingest some PDFs, run a vector search, generate an answer — magic. Then you ship it, and users start reporting hallucinated answers, stale results, and responses that confidently cite documents that say the opposite of what the model claims.

The gap between a RAG demo and a production RAG system is where most teams get stuck. Here's what that gap looks like and how to close it.

Architecture Overview

A production RAG pipeline has more moving parts than the tutorial version:

Each component has failure modes. Let's walk through them.

Chunking: Where Most Pipelines Break

Chunking determines the atomic unit of retrieval. Get it wrong and everything downstream suffers.

Fixed-size chunking (split every N tokens) is the default in most tutorials. It's fast and simple, but it breaks sentences mid-thought, splits tables, and separates headers from their content.

Recursive character splitting is marginally better — it tries to split on paragraph boundaries, then sentences, then words. LangChain's RecursiveCharacterTextSplitter does this. It's a reasonable default.

Semantic chunking groups sentences by embedding similarity. When consecutive sentences diverge in meaning, that's a chunk boundary. This produces more coherent chunks but is slower to compute:

python

AI Engineering Cohort

We build this end-to-end in the cohort.

Live sessions, real systems, your questions answered in real time. Next cohort starts 2nd July 2026 — 20 seats.

Reserve your spot →

What actually works in production: Use document-structure-aware chunking. If you're processing Markdown, split on headers. If you're processing code, split on function/class boundaries. If you're processing PDFs, use layout detection to respect columns and tables. The chunking strategy should match the document format — there's no universal chunker.

Chunk sizes that work well: 256-512 tokens for precise factual retrieval, 512-1024 tokens for summarization and reasoning tasks. Always include 64-128 token overlap.

Embedding Selection

Your embedding model determines the ceiling of your retrieval quality. No amount of re-ranking fixes bad embeddings.

The practical choice in 2026:

ModelDimensionsMTEB ScoreLatency (p50)Cost
text-embedding-3-large307264.6~50ms$0.13/1M tokens
voyage-3102467.1~80ms$0.06/1M tokens
all-MiniLM-L6-v238456.3~5ms (local)Free
bge-large-en-v1.5102463.6~10ms (local)Free

For most production use cases, voyage-3 or text-embedding-3-large gives the best quality-cost trade-off when you're using an API. If you need to run locally for privacy or latency, bge-large-en-v1.5 or fine-tuned variants are solid.

Critical rule: always embed queries and documents with the same model. This sounds obvious, but I've seen production systems where the ingestion pipeline used one model version and the query path used another after an upgrade.

Retrieval Quality: The Metrics That Matter

You need to measure retrieval quality independently from generation quality. Otherwise you can't tell if a bad answer is caused by bad retrieval or bad generation.

python

Build a golden test set of 100+ query-document pairs. Automate this evaluation to run on every change to your chunking, embedding, or retrieval logic. If recall@10 drops below 0.7, your pipeline has a retrieval gap that generation cannot compensate for.

Re-ranking: The Highest-ROI Improvement

If you do one thing to improve your RAG pipeline, add a cross-encoder re-ranker. Cross-encoders process the query and document together (rather than independently), which dramatically improves relevance scoring.

The pattern: retrieve 20-50 candidates with cheap vector search, then re-rank the top results with a cross-encoder and keep the top 5.

python

In our benchmarks, adding a re-ranker improved answer accuracy by 15-25% with only 50-100ms additional latency. That's a better return than switching embedding models or changing chunk sizes.

The Failure Modes Nobody Warns You About

Stale embeddings. Your documents update but embeddings don't. Build an incremental ingestion pipeline that detects document changes (via content hash or modified timestamp) and re-embeds only changed chunks. Track embedding model versions — when you upgrade the model, you need to re-embed everything.

Chunk boundary hallucination. The model retrieves a chunk that contains a partial table or an incomplete code block. It then "completes" the missing information by hallucinating. Mitigation: include chunk metadata that signals truncation, and add an instruction in the system prompt to acknowledge when information appears incomplete.

Retrieval-generation mismatch. The retrieved context contains the answer, but the model ignores it and generates from parametric knowledge instead. This is especially common with strong models that are confident about popular topics. Force faithfulness by structuring your prompt:

Multi-hop questions. The user asks a question that requires synthesizing information across multiple documents. Standard top-k retrieval grabs documents relevant to the surface-level query but misses the connecting pieces. Solution: implement iterative retrieval where the first LLM call identifies what additional information is needed, which triggers a second retrieval pass.

Monitoring in Production

Ship your RAG pipeline with these observability hooks:

  1. Log every retrieval — query, retrieved doc IDs, similarity scores, re-ranker scores
  2. Track retrieval latency by stage — embedding, search, re-ranking
  3. Sample and review 1-5% of query-response pairs weekly for quality
  4. Monitor embedding drift — if your document corpus changes significantly, retrieval quality will degrade
  5. Alert on low-confidence retrievals — when the top result's similarity score falls below your threshold

The difference between a RAG prototype and a production system isn't the model or the framework — it's the chunking strategy, retrieval evaluation, re-ranking, and monitoring that you build around it.

RAG Vector Search LLM Embeddings

become an engineering leader

Advanced System Design Cohort