Building Production RAG Pipelines That Don't Fall Apart
A practical guide to designing RAG systems that survive real-world traffic, covering chunking, retrieval quality, re-ranking, and the failure modes nobody warns you about.
Akhil Sharma
January 12, 2026
Building Production RAG Pipelines That Don't Fall Apart
Every RAG demo looks impressive. Ingest some PDFs, run a vector search, generate an answer — magic. Then you ship it, and users start reporting hallucinated answers, stale results, and responses that confidently cite documents that say the opposite of what the model claims.
The gap between a RAG demo and a production RAG system is where most teams get stuck. Here's what that gap looks like and how to close it.
Architecture Overview
A production RAG pipeline has more moving parts than the tutorial version:
Each component has failure modes. Let's walk through them.
Chunking: Where Most Pipelines Break
Chunking determines the atomic unit of retrieval. Get it wrong and everything downstream suffers.
Fixed-size chunking (split every N tokens) is the default in most tutorials. It's fast and simple, but it breaks sentences mid-thought, splits tables, and separates headers from their content.
Recursive character splitting is marginally better — it tries to split on paragraph boundaries, then sentences, then words. LangChain's RecursiveCharacterTextSplitter does this. It's a reasonable default.
Semantic chunking groups sentences by embedding similarity. When consecutive sentences diverge in meaning, that's a chunk boundary. This produces more coherent chunks but is slower to compute:
AI Engineering Cohort
We build this end-to-end in the cohort.
Live sessions, real systems, your questions answered in real time. Next cohort starts 2nd July 2026 — 20 seats.
Reserve your spot →What actually works in production: Use document-structure-aware chunking. If you're processing Markdown, split on headers. If you're processing code, split on function/class boundaries. If you're processing PDFs, use layout detection to respect columns and tables. The chunking strategy should match the document format — there's no universal chunker.
Chunk sizes that work well: 256-512 tokens for precise factual retrieval, 512-1024 tokens for summarization and reasoning tasks. Always include 64-128 token overlap.
Embedding Selection
Your embedding model determines the ceiling of your retrieval quality. No amount of re-ranking fixes bad embeddings.
The practical choice in 2026:
| Model | Dimensions | MTEB Score | Latency (p50) | Cost |
|---|---|---|---|---|
text-embedding-3-large | 3072 | 64.6 | ~50ms | $0.13/1M tokens |
voyage-3 | 1024 | 67.1 | ~80ms | $0.06/1M tokens |
all-MiniLM-L6-v2 | 384 | 56.3 | ~5ms (local) | Free |
bge-large-en-v1.5 | 1024 | 63.6 | ~10ms (local) | Free |
For most production use cases, voyage-3 or text-embedding-3-large gives the best quality-cost trade-off when you're using an API. If you need to run locally for privacy or latency, bge-large-en-v1.5 or fine-tuned variants are solid.
Critical rule: always embed queries and documents with the same model. This sounds obvious, but I've seen production systems where the ingestion pipeline used one model version and the query path used another after an upgrade.
Retrieval Quality: The Metrics That Matter
You need to measure retrieval quality independently from generation quality. Otherwise you can't tell if a bad answer is caused by bad retrieval or bad generation.
Build a golden test set of 100+ query-document pairs. Automate this evaluation to run on every change to your chunking, embedding, or retrieval logic. If recall@10 drops below 0.7, your pipeline has a retrieval gap that generation cannot compensate for.
Re-ranking: The Highest-ROI Improvement
If you do one thing to improve your RAG pipeline, add a cross-encoder re-ranker. Cross-encoders process the query and document together (rather than independently), which dramatically improves relevance scoring.
The pattern: retrieve 20-50 candidates with cheap vector search, then re-rank the top results with a cross-encoder and keep the top 5.
In our benchmarks, adding a re-ranker improved answer accuracy by 15-25% with only 50-100ms additional latency. That's a better return than switching embedding models or changing chunk sizes.
The Failure Modes Nobody Warns You About
Stale embeddings. Your documents update but embeddings don't. Build an incremental ingestion pipeline that detects document changes (via content hash or modified timestamp) and re-embeds only changed chunks. Track embedding model versions — when you upgrade the model, you need to re-embed everything.
Chunk boundary hallucination. The model retrieves a chunk that contains a partial table or an incomplete code block. It then "completes" the missing information by hallucinating. Mitigation: include chunk metadata that signals truncation, and add an instruction in the system prompt to acknowledge when information appears incomplete.
Retrieval-generation mismatch. The retrieved context contains the answer, but the model ignores it and generates from parametric knowledge instead. This is especially common with strong models that are confident about popular topics. Force faithfulness by structuring your prompt:
Multi-hop questions. The user asks a question that requires synthesizing information across multiple documents. Standard top-k retrieval grabs documents relevant to the surface-level query but misses the connecting pieces. Solution: implement iterative retrieval where the first LLM call identifies what additional information is needed, which triggers a second retrieval pass.
Monitoring in Production
Ship your RAG pipeline with these observability hooks:
- Log every retrieval — query, retrieved doc IDs, similarity scores, re-ranker scores
- Track retrieval latency by stage — embedding, search, re-ranking
- Sample and review 1-5% of query-response pairs weekly for quality
- Monitor embedding drift — if your document corpus changes significantly, retrieval quality will degrade
- Alert on low-confidence retrievals — when the top result's similarity score falls below your threshold
The difference between a RAG prototype and a production system isn't the model or the framework — it's the chunking strategy, retrieval evaluation, re-ranking, and monitoring that you build around it.
More in AI Engineering
Building Reliable LLM Evaluation Pipelines
How to evaluate LLM outputs systematically with automated metrics, LLM-as-judge, human review, and CI/CD integration for prompt regression testing.
Prompt Caching Strategies That Cut Your LLM Costs in Half
Practical caching strategies for LLM applications — from exact match to semantic similarity caching to provider-level prefix caching — with real cost/latency numbers.