RAG's three failure modes · and the diagnostic table we use on every audit
Three failure modes, one table. 30 minutes of diagnosis, then you know what to fix. Stop guessing.
We have audited 18 RAG systems in the last 18 months, and the pattern is invariant: every failing RAG is failing in one of three ways. The team usually does not know which one. They assume 'the model is bad' and start swapping models. That is rarely the issue.
The three modes:

1. retrieval miss · the right chunk was never returned
2. retrieval-yes-answer-no · the right chunk was returned, but the model ignored, misread, or hallucinated around it
3. stale truth · the chunk was returned and the answer used it, but the chunk itself is out of date

Each has a different fix and a different cost.
Mode 1 · retrieval miss. The right chunk never made it into the top-k: the vector search ranked it 47th, you took the top 5. Symptoms: the model produces a plausible-sounding wrong answer, or 'I don't know'. The chunk exists in the corpus.
Causes, in order of the frequency we see them:

- dense-only retrieval · the embedding misses exact terms the document uses, where plain BM25 would have matched
- bad chunking · chunks too long to rank well, or too short to carry the answer, with no overlap
- query mismatch · the user's phrasing shares almost no vocabulary with the document's (the query rewriting / HyDE row in the table below)
Fastest fix: turn on hybrid search (BM25 + dense) and re-chunk to 400-800 tokens with a 50-token overlap. This alone resolves 60% of the mode 1 cases we see.
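A minimal sketch of the hybrid half, using reciprocal rank fusion to merge a BM25 ranking with a dense ranking. The names and data are illustrative, and RRF is one common fusion choice, not the only one:

```python
# Sketch of reciprocal rank fusion (RRF) over two ranked lists of chunk
# IDs: one from BM25 (lexical), one from dense vector search. All names
# and data here are illustrative.

def rrf_fuse(bm25_ranking, dense_ranking, k=60, top_n=5):
    """score(doc) = sum over rankings of 1 / (k + rank), rank from 1.

    k=60 damps the influence of any single ranker.
    """
    scores = {}
    for ranking in (bm25_ranking, dense_ranking):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

# A chunk that only the lexical side finds still surfaces:
bm25 = ["c9", "c3", "c1", "c7", "c2"]
dense = ["c5", "c8", "c4", "c6", "c2"]
print(rrf_fuse(bm25, dense, top_n=3))  # c2, ranked by both sides, comes first
```

The same idea works whether the two rankings come from one engine's hybrid mode or from two separate indexes you query in parallel.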
Mode 2 · retrieval-yes-answer-no. The right chunk is in the top 5, but the model still produces the wrong answer. Symptoms: hallucinated citations, facts mixed across chunks, the answer contradicting the very chunk it cites. This is the failure mode that looks like 'the LLM is hallucinating' · because it is · but the cause is upstream of the model.
Fastest fix: `top_k = 5`, explicit chunk separators, and a prompt template that says 'answer only from these chunks, cite by ID, say I don't know if not sure'. Most teams already have this · but their `top_k = 20` and the missing chunk IDs in the prompt undo it.
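A minimal sketch of that prompt-side grounding. `build_grounded_prompt` and the exact wording are illustrative, not from any library; the point is the low `top_k`, the visible separators, the chunk IDs, and the escape hatch:

```python
# Sketch of a grounded prompt. build_grounded_prompt is a hypothetical
# helper, not a library call; adapt the wording to your stack.

TOP_K = 5  # keep the context small; 20 chunks invites cross-chunk mixing

def build_grounded_prompt(question, ranked_chunks):
    """ranked_chunks: list of (chunk_id, text) pairs, best first."""
    context = "\n---\n".join(          # explicit separator between chunks
        f"[{cid}]\n{text}" for cid, text in ranked_chunks[:TOP_K]
    )
    return (
        "Answer using ONLY the chunks below. Cite the chunk ID in brackets "
        "for every claim. If the chunks do not contain the answer, reply: "
        "I don't know.\n\n"
        f"{context}\n\nQuestion: {question}\nAnswer:"
    )

prompt = build_grounded_prompt(
    "What is the refund window?",
    [("doc3#c12", "Refunds are accepted within 30 days of purchase."),
     ("doc1#c04", "Shipping takes 3-5 business days.")],
)
```

With IDs in the context, a hallucinated citation becomes checkable: if the answer cites an ID that was never in the prompt, you can flag it mechanically.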
Mode 3 · stale truth. The chunk is returned, the model answers from it, and the answer is still wrong because the chunk is six months old. The product changed. The price changed. The policy changed. The corpus did not.
Fastest fix: schedule daily re-ingestion. Add a `last_modified` field to each chunk and expose it as a retrieval bias for time-sensitive queries. Periodically dedupe by source path.
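A sketch of what the date bias can look like at re-rank time, assuming each chunk record carries the `last_modified` timestamp from ingestion. The half-life constant is an illustrative choice, tune it per corpus:

```python
# Sketch of a recency bias at re-rank time. Assumes each chunk record
# carries a last_modified timestamp added at ingestion; the half-life
# constant is an illustrative choice.
from datetime import datetime, timezone

HALF_LIFE_DAYS = 90  # a chunk's weight halves every 90 days of staleness

def recency_score(similarity, last_modified, now=None):
    """Decay the raw similarity score by the age of the chunk."""
    now = now or datetime.now(timezone.utc)
    age_days = (now - last_modified).total_seconds() / 86400
    return similarity * 0.5 ** (age_days / HALF_LIFE_DAYS)

# A six-month-old chunk now needs a much higher raw similarity
# to outrank a fresh one:
now = datetime(2025, 6, 1, tzinfo=timezone.utc)
fresh = recency_score(0.80, datetime(2025, 5, 25, tzinfo=timezone.utc), now)
stale = recency_score(0.85, datetime(2024, 12, 1, tzinfo=timezone.utc), now)
```

Apply the decay only on queries classified as time-sensitive; for evergreen content a flat bias just buries the best chunk.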
When we sit down with a client, this is the literal table on the whiteboard. We pick 10 known-bad cases and walk down the table for each.
| symptom | retrieval rank of right chunk | answer cites chunk? | source up to date? | mode | fastest fix |
|---|---|---|---|---|---|
| plausible wrong answer | not in top 20 | no | yes | 1 | hybrid search + re-chunk |
| 'I don't know' | not in top 20 | no | yes | 1 | hybrid search + re-chunk |
| confident wrong | top 5 | yes | yes | 2 | grounding prompt + lower top_k |
| confident wrong | top 5 | hallucinated cite | yes | 2 | chunk IDs + 'cite by ID' rule |
| outdated price/policy | top 5 | yes | NO | 3 | re-ingest + date bias |
| 'two answers' | top 5 (2 versions) | yes | partial | 3 | dedupe by source path |
| right chunk, but partial | top 5 | yes | yes | 2 | larger answer budget, fewer chunks |
| user phrase mismatch | not in top 20 | no | yes | 1 | query rewriting / HyDE |

Most failed RAG projects we are called in on share the same root cause: the team never separated 'is this a retrieval problem or a generation problem?'. They throw a bigger model at it (a mode-2-ish fix) when the chunk is stale (mode 3) or was never retrieved (mode 1). Six weeks of work, the bill grows, the answer is still wrong.
If you want to skip an audit cycle, run the table on your own RAG today. 10 cases, 30 minutes, then you know which mode you are fighting.
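The table collapses into a small decision function. This sketch mirrors the columns · rank of the right chunk, citation behaviour, source freshness · and returns the mode plus the table's fastest fix:

```python
# The diagnostic table as a tiny decision function. The fix strings are
# the table's own; the ordering (retrieval first, freshness second) is
# how we walk each known-bad case.

def diagnose(rank, cites, up_to_date):
    """rank: rank of the right chunk (None if absent from the top 20).
    cites: 'yes', 'no', or 'hallucinated'. Returns (mode, fastest fix)."""
    if rank is None or rank > 20:
        return 1, "hybrid search + re-chunk (or query rewriting / HyDE)"
    if not up_to_date:
        return 3, "re-ingest + date bias (dedupe if two versions are live)"
    if cites == "hallucinated":
        return 2, "chunk IDs + 'cite by ID' rule"
    return 2, "grounding prompt + lower top_k"

mode, fix = diagnose(rank=47, cites="no", up_to_date=True)  # mode 1
```

Ten cases through this function and the modes cluster fast: most systems we audit are overwhelmingly in one mode, not spread across all three.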

Founder, DField Solutions
I've shipped production systems from fintech to creator tooling · for startups and enterprises, from Budapest to San Francisco.
Let's talk about your project. 30 minutes, no strings.