Retrieval-Augmented Generation has become the default architecture for enterprise AI applications. Ask any company building with LLMs what they're shipping, and the answer is probably a RAG system.
But here's the uncomfortable truth: most RAG systems that work in demos fail in production.
The demo retrieves 3 relevant documents from a curated test set. Production retrieves 3 irrelevant documents from 10 million noisy ones. The model hallucinates. Users lose trust. The project fails.
I've audited dozens of production RAG systems. The failure patterns are remarkably consistent—and remarkably fixable.
The Fundamental Trade-Off
Every RAG system lives on a spectrum between precision and recall:
**High Precision**: Retrieved documents are highly relevant, but you might miss some good ones.
**High Recall**: You capture most relevant documents, but include some irrelevant ones.
The LLM can filter irrelevant context to some degree—but at the cost of latency and accuracy. The right balance depends on your use case: a compliance search that must not miss a relevant clause leans toward recall, while a customer-facing assistant that must not cite the wrong document leans toward precision.
Chunking Strategies
How you split documents into chunks has a massive impact on retrieval quality. The core tension: smaller chunks match queries more precisely but lose surrounding context, while larger chunks preserve context but bury the relevant passage in filler.
Recursive Chunking
The most robust general-purpose approach. Start with high-level separators (paragraphs, sections), then recursively split if chunks remain too large. Research shows recursive token-based chunking with a 100-token base size consistently outperforms alternatives.
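As a rough sketch of the idea (using a character budget as a stand-in for the token budget; libraries like LangChain's RecursiveCharacterTextSplitter do this more carefully):

```python
# Minimal sketch of recursive chunking: try coarse separators first,
# fall back to finer ones only when a piece is still too large.
SEPARATORS = ["\n\n", "\n", ". ", " "]  # paragraphs -> lines -> sentences -> words

def recursive_chunk(text: str, max_chars: int = 400, separators=SEPARATORS) -> list[str]:
    if len(text) <= max_chars or not separators:
        return [text]  # small enough, or nothing finer to split on
    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = (current + sep + piece) if current else piece
        if len(candidate) <= max_chars:
            current = candidate
        else:
            if current:
                chunks.append(current)
            if len(piece) > max_chars:
                # This piece alone is too big: recurse with a finer separator.
                chunks.extend(recursive_chunk(piece, max_chars, finer))
                current = ""
            else:
                current = piece
    if current:
        chunks.append(current)
    return chunks
```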
Semantic Chunking
Split based on meaning, not structure. Analyze sentence similarity and create chunks where topics shift. Preserves meaning but requires additional embedding computation.
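A minimal sketch of the approach, assuming an `embed(sentences)` function from whichever embedding model you already use that returns one normalized vector per sentence:

```python
import numpy as np

def semantic_chunk(sentences: list[str], embed, similarity_floor: float = 0.7) -> list[str]:
    """Group consecutive sentences; start a new chunk where adjacent similarity drops."""
    vectors = np.asarray(embed(sentences))
    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        # Dot product equals cosine similarity because the vectors are normalized.
        similarity = float(vectors[i - 1] @ vectors[i])
        if similarity < similarity_floor:
            chunks.append(" ".join(current))
            current = []
        current.append(sentences[i])
    chunks.append(" ".join(current))
    return chunks
```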
Structure-Aware Methods
For structured documents (Markdown, HTML, PDF with clear headers), use structure-aware splitters. This is often the single biggest improvement you can make—headers provide natural semantic boundaries.
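For Markdown, a header-aware splitter can be as simple as cutting at heading lines and carrying the heading along as chunk metadata; a minimal sketch:

```python
import re

def split_markdown_by_headers(markdown: str) -> list[dict]:
    """Split a Markdown document at headings, keeping the heading as chunk metadata."""
    sections, header, buffer = [], "ROOT", []
    for line in markdown.splitlines():
        match = re.match(r"^(#{1,6})\s+(.*)", line)
        if match:
            if buffer:
                sections.append({"header": header, "text": "\n".join(buffer).strip()})
                buffer = []
            header = match.group(2)
        else:
            buffer.append(line)
    if buffer:
        sections.append({"header": header, "text": "\n".join(buffer).strip()})
    return sections
```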
When Not to Chunk
Small, focused documents that directly answer user questions may not need chunking at all. Chunking these documents can actually hurt retrieval.
Embedding Selection
Your embedding model maps text to vectors. The quality of this mapping determines retrieval quality.
General-Purpose Options
Off-the-shelf models such as OpenAI's text-embedding-3 family, Cohere Embed, and open-weight options like BGE and E5 handle general English text well; benchmark two or three against your own queries before committing.
Domain-Specific Fine-Tuning
For specialized domains—legal, medical, technical—fine-tuning embeddings on domain data can dramatically improve retrieval. Even 10,000 domain-specific examples can meaningfully improve performance.
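One hedged sketch of what this can look like with the sentence-transformers library's classic training loop; the model name and example pairs are placeholders, and you should check the current API against the library's documentation:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Hypothetical (query, relevant_passage) pairs mined from your domain;
# in practice you would load thousands of these from your own data.
pairs = [
    ("what is the notice period for termination?",
     "Either party may terminate this agreement with thirty (30) days written notice."),
    ("who owns work product created under this agreement?",
     "All work product created by Contractor shall be the sole property of Client."),
]

model = SentenceTransformer("BAAI/bge-base-en-v1.5")
examples = [InputExample(texts=[query, passage]) for query, passage in pairs]
loader = DataLoader(examples, shuffle=True, batch_size=32)
# In-batch negatives: other passages in the batch serve as negatives for each query.
loss = losses.MultipleNegativesRankingLoss(model)
model.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=100)
model.save("domain-tuned-embeddings")
```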
Multilingual Considerations
If your documents span languages, you need multilingual embeddings. Options like Cohere's multilingual embeddings or BGE-M3 handle this well.
Retrieval Strategies
Vector Search Alone Isn't Enough
Semantic search is powerful but has blind spots. It can miss exact matches for names, codes, and rare terms. Hybrid search—combining vector similarity with BM25 keyword matching—captures both semantic relevance and exact matches.
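One common way to merge the two result lists is reciprocal rank fusion, which needs only rank positions. A sketch, where `vector_search` and `keyword_search` stand in for your own retrievers returning ranked document IDs:

```python
from collections import defaultdict

def reciprocal_rank_fusion(result_lists: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked ID lists; documents ranked highly by either retriever float to the top."""
    scores: dict[str, float] = defaultdict(float)
    for results in result_lists:
        for rank, doc_id in enumerate(results):
            scores[doc_id] += 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

# Usage: fuse semantic and keyword results for one query.
# fused = reciprocal_rank_fusion([vector_search(query, 20), keyword_search(query, 20)])
```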
Reranking
Initial retrieval is fast but imprecise. Reranking models (Cohere Rerank, ColBERT) take the top-k results and reorder by relevance. This is computationally expensive but significantly improves precision.
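A sketch using an open cross-encoder from the sentence-transformers library (hosted rerankers like Cohere Rerank fill the same role behind an API call); the model name is illustrative:

```python
from sentence_transformers import CrossEncoder

# Cross-encoders score each (query, document) pair jointly: slower than a bi-encoder,
# but considerably more precise, which is exactly the reranking trade-off.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, candidates: list[str], top_n: int = 5) -> list[str]:
    scores = reranker.predict([(query, doc) for doc in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in ranked[:top_n]]
```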
Metadata Filtering
Use metadata to narrow retrieval before semantic search. If you know the user is asking about 2024 contracts, filter to 2024 contracts first. This improves precision and reduces computation.
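What this looks like depends on your vector store. As one hedged example, Chroma-style collections accept a `where` filter alongside the query; the metadata field names below are hypothetical:

```python
import chromadb

client = chromadb.Client()
contracts = client.get_or_create_collection("contracts")

# Narrow the candidate set with metadata before semantic ranking.
# "doc_type" and "year" are hypothetical fields attached at indexing time.
results = contracts.query(
    query_texts=["termination clauses with penalties"],
    n_results=5,
    where={"$and": [{"doc_type": {"$eq": "contract"}}, {"year": {"$eq": 2024}}]},
)
```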
Production Architecture
Caching
Cache frequent queries. If 100 users ask about vacation policy, retrieve once. Cache invalidation strategy matters—balance freshness against cost.
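A minimal in-process sketch with a time-to-live, so hot queries are served from memory but eventually refreshed (a production system would more likely use Redis and normalize the query before keying on it):

```python
import time

class TTLCache:
    """Tiny query cache: reuse results for `ttl_seconds`, then recompute."""

    def __init__(self, ttl_seconds: float = 3600):
        self.ttl = ttl_seconds
        self.entries: dict[str, tuple[float, object]] = {}

    def get_or_compute(self, key: str, compute):
        now = time.monotonic()
        hit = self.entries.get(key)
        if hit and now - hit[0] < self.ttl:
            return hit[1]  # fresh enough: serve the cached value
        value = compute()
        self.entries[key] = (now, value)
        return value

# Usage:
# cache = TTLCache(ttl_seconds=900)
# answer = cache.get_or_compute(user_query.lower().strip(), lambda: run_rag(user_query))
```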
Async Processing
For non-realtime applications, process retrieval asynchronously: queue queries, process them in batches, and return results via a callback.
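A sketch of that shape with asyncio, where `run_batch` and `callback` are placeholders for your own batched pipeline and delivery mechanism:

```python
import asyncio

async def batch_worker(queue: asyncio.Queue, run_batch, callback, batch_size: int = 16):
    """Drain queued (query, request_id) items, process in batches, deliver via callback."""
    while True:
        batch = [await queue.get()]                     # block until at least one item
        while len(batch) < batch_size and not queue.empty():
            batch.append(queue.get_nowait())            # opportunistically fill the batch
        results = await run_batch([query for query, _ in batch])
        for (query, request_id), result in zip(batch, results):
            await callback(request_id, result)
        for _ in batch:
            queue.task_done()
```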
Monitoring
Track everything: retrieval latency, which documents are retrieved and how often, reranker scores, token usage, and whether users accept or correct the answers. Without monitoring, you can't optimize.
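One lightweight way to make this concrete is to emit a structured record per query; the fields below are illustrative:

```python
import json
import time
from dataclasses import dataclass, asdict

@dataclass
class RetrievalLog:
    query: str
    retrieved_ids: list[str]
    top_score: float
    latency_ms: float
    answered: bool  # did the LLM produce a grounded answer?

def log_retrieval(record: RetrievalLog) -> None:
    # Structured logs make it cheap to aggregate hit rates and latencies later.
    print(json.dumps({"ts": time.time(), **asdict(record)}))
```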
Graceful Degradation
What happens when retrieval fails? When the LLM API times out? Design fallback behaviors—cached responses, human escalation, transparent error messages.
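A sketch of a fallback chain, assuming an async `rag_pipeline(query)` and a simple cache with a `get` method:

```python
import asyncio

async def answer_with_fallbacks(query: str, rag_pipeline, cache, timeout_s: float = 10.0) -> str:
    """Try the full pipeline, then a cached answer, then a transparent failure message."""
    try:
        return await asyncio.wait_for(rag_pipeline(query), timeout=timeout_s)
    except Exception:
        cached = cache.get(query)
        if cached is not None:
            return cached + "\n\n(Note: this is a cached answer; live retrieval is unavailable.)"
        return ("I couldn't retrieve the documents needed to answer this reliably. "
                "Please try again in a few minutes or contact support.")
```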
Common Failure Modes
Over-Retrieval
Retrieving too many chunks stuffs the context window with marginally relevant information, diluting the good stuff. Start with fewer chunks (3-5) and increase only if needed.
Poor Query Preprocessing
User queries are often ambiguous, misspelled, or conversational. Preprocess queries—expand abbreviations, correct spelling, rewrite as statements—before retrieval.
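A minimal sketch of the idea; the abbreviation map is hypothetical and would be built from your own query logs (many teams also use a cheap LLM call to rewrite the query):

```python
# Hypothetical abbreviation map; build yours from real query logs.
ABBREVIATIONS = {"pto": "paid time off", "wfh": "work from home", "nda": "non-disclosure agreement"}

def preprocess_query(query: str) -> str:
    words = query.strip().lower().split()
    expanded = [ABBREVIATIONS.get(word.strip("?.,"), word) for word in words]
    return " ".join(expanded)

# "How much PTO do I get?" -> "how much paid time off do i get?"
```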
Ignoring Document Quality
RAG retrieves what you put in. If your document corpus is full of outdated, contradictory, or poorly-written content, your RAG system will confidently cite it. Document curation is often more important than retrieval optimization.
One-Size-Fits-All
Different query types benefit from different strategies. A factual lookup needs precision. An exploratory question needs breadth. Consider routing queries to different retrieval configurations.
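A sketch of a crude router; in practice a small classifier or an LLM call usually decides the route, and the configurations shown are illustrative:

```python
def route_query(query: str) -> dict:
    """Pick a retrieval configuration based on a simple query-type heuristic."""
    exploratory_markers = ("overview", "compare", "explain", "why", "how should")
    if any(marker in query.lower() for marker in exploratory_markers):
        return {"top_k": 12, "rerank": True, "hybrid": True}   # breadth first
    return {"top_k": 4, "rerank": True, "hybrid": True}        # precision first
```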
The Path to Production
Step 1: Build an Evaluation Dataset
Before optimizing, know what good looks like. Build a dataset of 100+ query-answer pairs with human-verified correct answers. Run every change against this dataset.
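A sketch of such a harness, assuming each entry records the query and the document IDs a human judged relevant, and `retrieve` is the retriever under test:

```python
def evaluate_retrieval(eval_set: list[dict], retrieve, k: int = 5) -> dict:
    """Score a retriever against query -> relevant_doc_ids pairs (hit rate and MRR@k)."""
    hits, reciprocal_ranks = 0, []
    for item in eval_set:
        retrieved = retrieve(item["query"], k)          # ranked list of doc IDs
        relevant = set(item["relevant_doc_ids"])
        ranks = [i for i, doc_id in enumerate(retrieved, start=1) if doc_id in relevant]
        if ranks:
            hits += 1
            reciprocal_ranks.append(1.0 / ranks[0])
        else:
            reciprocal_ranks.append(0.0)
    n = len(eval_set)
    return {"hit_rate@k": hits / n, "mrr@k": sum(reciprocal_ranks) / n}
```

Run this before and after every change so improvements are measured, not assumed.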
Step 2: Establish Baseline Metrics
Measure current performance: precision, recall, latency, cost. You can't improve what you don't measure.
Step 3: Iterate Systematically
Change one thing at a time. Measure impact. Keep what works, discard what doesn't. Resist the temptation to change everything at once.
Step 4: Monitor in Production
Production data differs from evaluation data. Monitor retrieval quality continuously. Build feedback loops to identify failures.
Step 5: Continuous Improvement
RAG systems degrade over time as document corpora evolve. Schedule regular reindexing and re-evaluation.
The Bottom Line
RAG is not a solved problem. Building RAG systems that work reliably at production scale requires careful engineering across chunking, embedding, retrieval, and monitoring.
The good news: the techniques are well-understood. The hard work is applying them systematically rather than hoping the demo scales.