RAG in the Real World: Retrieval That Doesn't Hallucinate
Retrieval-augmented generation is simple to start and brutal to get right. A practical look at chunking, ranking, and the failure modes nobody warns you about.

Retrieval-augmented generation has a reputation for being easy: embed your documents, search them, stuff the results into a prompt, done. That version works in a weekend and breaks in production. The hard part of RAG isn't wiring up a search call; it's making sure the model gets the right context, in the right shape, at the right time.
Garbage retrieval, garbage answers
An LLM can only reason over what you hand it. If retrieval surfaces the wrong passages, the model will faithfully build a wrong answer on top of them — and sound completely confident doing it. Most "the AI hallucinated" complaints are actually retrieval failures wearing a disguise. Fix retrieval and a surprising share of hallucinations disappear.
Chunking is a design decision
How you split documents determines what's retrievable. Chunks that are too big drown the signal in noise; chunks that are too small lose the context that makes a passage meaningful. Respect the structure of the source — split on sections and semantic boundaries, not arbitrary character counts — and keep enough overlap that an idea is never cut in half.
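A minimal sketch of structure-aware chunking, assuming the source is plain text with blank lines between paragraphs. The function name `chunk_by_paragraph` and the parameters `max_chars` and `overlap_paras` are illustrative, not taken from any library. It packs whole paragraphs into a chunk and carries the last paragraph forward as overlap, so a split never lands mid-idea.

```python
import re


def chunk_by_paragraph(text: str, max_chars: int = 800, overlap_paras: int = 1) -> list[str]:
    """Pack whole paragraphs into chunks, repeating trailing paragraphs as overlap.

    max_chars is a soft limit: a single long paragraph (or overlap plus one
    paragraph) may exceed it, because paragraphs are never cut in half.
    """
    paras = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks: list[str] = []
    current: list[str] = []
    for para in paras:
        candidate = current + [para]
        if current and len("\n\n".join(candidate)) > max_chars:
            chunks.append("\n\n".join(current))
            # Carry the tail of the finished chunk into the next one.
            current = current[-overlap_paras:] if overlap_paras > 0 else []
        current.append(para)
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

For documents with real headings (Markdown, HTML), the same idea would likely apply one level up: split on sections first, then pack paragraphs within each section.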
Search is not one step
Pure vector similarity gets you a good part of the way and then plateaus. The systems that hold up combine approaches: keyword search to catch exact terms and names, vector search for meaning, and a reranking pass to put the genuinely relevant results on top. Retrieval quality is a pipeline, not a single embedding call.
- Hybrid search — combine keyword and semantic retrieval, don't pick one
- Rerank the top candidates before they ever reach the model
- Attach metadata (source, date, section) and filter on it
- Return citations so every claim is traceable to a chunk
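The checklist above can be sketched in a few lines. This is one possible shape, not a reference implementation: it merges a keyword ranking and a vector ranking with reciprocal rank fusion (a common fusion method, used here as an assumption since the article does not name one), filters on metadata, and labels each chunk so answers can cite it. The `Chunk` type and the function names are hypothetical, and a real reranker such as a cross-encoder would sit between fusion and selection.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Chunk:
    chunk_id: str
    text: str
    source: str
    section: str
    updated: date


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked ID lists: each list contributes 1 / (k + rank) per document."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for stable output.
    return sorted(scores, key=lambda d: (-scores[d], d))


def select_context(fused_ids: list[str], chunks: dict[str, Chunk],
                   min_date: date, top_n: int = 3) -> list[Chunk]:
    """Apply a metadata filter (here: last-updated date), then keep the top results."""
    kept = [chunks[i] for i in fused_ids if i in chunks and chunks[i].updated >= min_date]
    return kept[:top_n]


def format_with_citations(selected: list[Chunk]) -> str:
    """Number each chunk and attach its metadata so claims can point back to it."""
    return "\n".join(
        f"[{n}] ({c.source}, {c.section}, {c.updated.isoformat()}) {c.text}"
        for n, c in enumerate(selected, start=1)
    )
```

Fusing on ranks rather than raw scores sidesteps the problem that keyword and vector scores live on different scales.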
The failure modes nobody mentions
Real corpora are messy. Documents go stale, duplicates pile up, and the most-cited page is often outdated. Without freshness signals, your RAG system will happily quote a policy that changed last year. Build in recency weighting, deduplication, and a way to retire old content — otherwise retrieval quietly rots while the demo still looks fine.
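As a rough illustration of recency weighting and deduplication, the sketch below assumes each document is a dict with `text` and `updated` (a `date`) keys. The half-life of 180 days is an arbitrary placeholder, and the exact-match fingerprint only catches duplicates that differ in case or whitespace; near-duplicates would need something fuzzier.

```python
import hashlib
from datetime import date


def recency_weight(updated: date, today: date, half_life_days: float = 180.0) -> float:
    """Exponential decay: a document loses half its weight every half_life_days."""
    age_days = max((today - updated).days, 0)
    return 0.5 ** (age_days / half_life_days)


def fingerprint(text: str) -> str:
    """Hash of the text after lowercasing and collapsing whitespace."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def dedupe_keep_newest(docs: list[dict]) -> list[dict]:
    """Among documents with the same fingerprint, keep only the most recent."""
    newest: dict[str, dict] = {}
    for doc in docs:
        key = fingerprint(doc["text"])
        if key not in newest or doc["updated"] > newest[key]["updated"]:
            newest[key] = doc
    return list(newest.values())
```

One way to use the weight is to multiply it into the relevance score before the final sort, so a stale page has to be clearly more relevant to outrank a current one.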
“RAG doesn't fail loudly. It fails by confidently retrieving yesterday's truth — which is why evaluation matters more than the model you pick.”
Measure, then improve
You can't improve retrieval you don't measure. Build a set of representative questions with the passages that should be retrieved, and track whether they actually surface. Once retrieval is measurable, every change becomes an experiment with a clear answer instead of a vibe — and that's the foundation we insist on before any RAG system goes live.
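A small sketch of what such a measurement might look like, using recall@k: the fraction of expected passages that appear in the top k results. The eval-set shape (`question` and `relevant_ids` keys) and the `retrieve` callable are assumptions standing in for whatever your system actually exposes.

```python
from typing import Callable


def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the expected passages found in the top-k retrieved IDs."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)


def evaluate(eval_set: list[dict], retrieve: Callable[[str], list[str]], k: int = 5) -> float:
    """Mean recall@k over all questions in the evaluation set."""
    if not eval_set:
        raise ValueError("eval_set must contain at least one case")
    scores = [
        recall_at_k(retrieve(case["question"]), set(case["relevant_ids"]), k)
        for case in eval_set
    ]
    return sum(scores) / len(scores)
```

Run it before and after each retrieval change (new chunk size, new reranker, new filter) and compare the two numbers on the same question set.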
