How to evaluate a RAG system
Stand up an evaluation harness for a RAG system that scores retrieval and generation separately — so you know which half is failing, catch regressions on every change, and stop shipping on vibes.
Before you start
- A working RAG pipeline, even a rough one — evaluation built alongside beats evaluation bolted on
- Access to real user questions, or a stakeholder who can supply plausible ones
- Somewhere to store evaluation runs over time; the trend is the signal
Steps
- 1
Build the golden set from real questions
Collect thirty to a hundred questions your users actually ask, each paired with the passages that should be retrieved and a reference answer that a domain expert has signed off on. Resist inventing tidy questions: the value of the set is its mess, meaning ambiguous phrasings, questions whose answer spans documents, and questions your corpus cannot answer at all. That last category is mandatory; a RAG system's behaviour when the answer does not exist is half its trustworthiness.
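One way to keep the golden set honest is to give each entry a fixed shape and check the set before every run. The sketch below is illustrative only: the `GoldenItem` fields, the `validate_golden_set` helper, and the convention that a missing reference answer marks an unanswerable question are all assumptions, not part of any particular framework.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class GoldenItem:
    """One golden-set entry (hypothetical schema)."""
    question: str
    # IDs of the passages a correct retriever should surface.
    relevant_passage_ids: list[str] = field(default_factory=list)
    # Expert-approved answer; None marks a question the corpus cannot answer.
    reference_answer: Optional[str] = None

    @property
    def answerable(self) -> bool:
        return self.reference_answer is not None


def validate_golden_set(items: list[GoldenItem], min_unanswerable: int = 1) -> list[str]:
    """Return a list of human-readable problems; an empty list means the set passes."""
    problems = []
    for i, item in enumerate(items):
        if item.answerable and not item.relevant_passage_ids:
            problems.append(f"item {i}: answerable but no relevant passages listed")
        if not item.answerable and item.relevant_passage_ids:
            problems.append(f"item {i}: unanswerable but relevant passages listed")
    unanswerable = sum(1 for item in items if not item.answerable)
    if unanswerable < min_unanswerable:
        problems.append(
            f"only {unanswerable} unanswerable question(s); need at least {min_unanswerable}"
        )
    return problems
```

Running the check as the first step of the harness means a set that has quietly lost its no-answer cases fails loudly instead of flattering the next change.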
- 2
Score retrieval on its own
Before judging any generated answer, measure whether the right chunks surfaced: for each golden question, check whether the known-relevant passages appear in the top k results (recall at k) and how high they rank (for example, the reciprocal rank of the first hit). Retrieval failures account for a large share of RAG failures, and they are invisible if you only read final answers: a fluent response over wrong passages reads exactly like a right one until someone checks.
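The two checks in this step map onto two standard retrieval metrics. A minimal sketch, assuming the retriever returns an ordered list of passage IDs and the golden set lists the relevant IDs per question; the function names are illustrative:

```python
from typing import Optional, Sequence


def recall_at_k(retrieved_ids: Sequence[str], relevant_ids: Sequence[str], k: int) -> Optional[float]:
    """Fraction of the known-relevant passages that appear in the top k results.

    Returns None when there are no relevant passages (an unanswerable question),
    so those items can be excluded from the average rather than counted as zero.
    """
    relevant = set(relevant_ids)
    if not relevant:
        return None
    hits = relevant & set(retrieved_ids[:k])
    return len(hits) / len(relevant)


def reciprocal_rank(retrieved_ids: Sequence[str], relevant_ids: Sequence[str]) -> float:
    """1 / rank of the first relevant passage, or 0.0 if none was retrieved."""
    relevant = set(relevant_ids)
    for rank, passage_id in enumerate(retrieved_ids, start=1):
        if passage_id in relevant:
            return 1.0 / rank
    return 0.0


# Example: one of two relevant passages is found, at rank 2.
print(recall_at_k(["a", "b", "c"], ["b", "d"], k=2))  # 0.5
print(reciprocal_rank(["a", "b", "c"], ["b", "d"]))   # 0.5
```

Averaging each metric over the answerable golden questions gives one retrieval score per run that is independent of anything the generator does.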
- 3
Score generation against the retrieved context
With retrieval held fixed, judge the answer on three separable questions: is it faithful to the retrieved passages, does it actually answer what was asked, and does it decline cleanly when the context does not contain the answer. Faithfulness is the one to watch — it is where hallucination hides in RAG, dressed as synthesis. Human review on a sample beats automated scoring alone; LLM-as-judge scales the middle ground if you calibrate it against your human sample first.
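Calibrating the judge can be as simple as comparing its verdicts with human verdicts on the same sample, one dimension at a time. The sketch below uses Cohen's kappa, a chance-corrected agreement measure, as one reasonable choice; the `Verdict` fields and the `judge_calibration` helper are hypothetical names, and the decision of how much agreement is enough is left to you.

```python
from dataclasses import dataclass

# The three separable questions from this step, as boolean verdicts (assumed schema).
DIMENSIONS = ("faithful", "answers_question", "declines_when_unsupported")


@dataclass(frozen=True)
class Verdict:
    faithful: bool
    answers_question: bool
    declines_when_unsupported: bool


def cohens_kappa(a: list, b: list) -> float:
    """Chance-corrected agreement between two raters over the same items."""
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    labels = set(a) | set(b)
    expected = sum((a.count(label) / n) * (b.count(label) / n) for label in labels)
    if expected == 1.0:
        # Both raters used a single identical label throughout; agreement is total.
        return 1.0
    return (observed - expected) / (1.0 - expected)


def judge_calibration(human: list[Verdict], judge: list[Verdict]) -> dict[str, float]:
    """Kappa per dimension between human verdicts and LLM-judge verdicts."""
    return {
        dim: cohens_kappa([getattr(v, dim) for v in human],
                          [getattr(v, dim) for v in judge])
        for dim in DIMENSIONS
    }
```

Reporting agreement per dimension matters because a judge can track humans well on whether the question was answered while still being unreliable on faithfulness, the dimension this step singles out.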
- 4
Wire the harness into every change
Run the full set on every meaningful change: chunking, embedding model, prompt, retriever settings, and above all the underlying model version, which a provider can update behind a stable name. Store scores per run and alert on regression, not just failure. The harness is what converts tuning from folklore into engineering: a chunking change that lifts retrieval recall but drops faithfulness is now a visible trade, not a mystery.
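Alerting on regression rather than failure comes down to comparing each run with a stored baseline. A minimal sketch, assuming every metric is a score where higher is better and that runs are kept as plain metric-name-to-score dictionaries; the metric names and the tolerance value are illustrative:

```python
def find_regressions(baseline: dict[str, float],
                     current: dict[str, float],
                     tolerance: float = 0.02) -> dict[str, float]:
    """Return the metrics that dropped by more than `tolerance`, with the size of the drop.

    Metrics missing from the current run are skipped here; a real harness
    would probably want to flag them separately.
    """
    regressions = {}
    for metric, base_score in baseline.items():
        if metric not in current:
            continue
        drop = base_score - current[metric]
        if drop > tolerance:
            regressions[metric] = round(drop, 4)
    return regressions


# Example: a chunking change lifts recall but costs faithfulness.
baseline = {"recall_at_5": 0.80, "faithfulness": 0.90}
current = {"recall_at_5": 0.86, "faithfulness": 0.84}
print(find_regressions(baseline, current))  # {'faithfulness': 0.06}
```

The tolerance absorbs run-to-run noise; setting it from the observed variance of repeated runs on an unchanged system is usually safer than picking a number by feel.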
- 5
Close the loop from production
Sample live queries on a regular cadence and route the failures back into the golden set. Real failures are the highest-value test cases you will ever get, and they keep the set honest as usage drifts away from what you predicted. If your system serves multiple audiences or corpora, stratify the sample so a regression in one segment cannot hide in another's average.
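Stratifying the production sample can be sketched in a few lines. This example assumes each logged query is a dictionary carrying a `segment` key (audience, corpus, or whatever split matters to you); that log shape and the `stratified_sample` name are assumptions for illustration.

```python
import random
from collections import defaultdict


def stratified_sample(queries: list[dict], per_segment: int, seed: int = 0) -> list[dict]:
    """Draw up to `per_segment` queries from every segment.

    Small segments contribute everything they have, so a low-traffic audience
    is never crowded out of the review sample by a high-traffic one.
    """
    rng = random.Random(seed)  # seeded so a review batch can be reproduced
    by_segment = defaultdict(list)
    for query in queries:
        by_segment[query["segment"]].append(query)

    sample = []
    for segment in sorted(by_segment):  # sorted for a stable output order
        group = by_segment[segment]
        sample.extend(rng.sample(group, min(per_segment, len(group))))
    return sample
```

Scoring the sample per segment, rather than pooling it, is what keeps a regression in one segment from disappearing into another's average.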
Common pitfalls
- Evaluating only end-to-end answers, so retrieval and generation failures blur into one unfixable score. Separating them is the whole trick.
- A golden set of easy questions — it will flatter every change and catch nothing. The set earns its keep on the ambiguous and the unanswerable.
- Skipping the no-answer cases. Systems tuned only on answerable questions learn to improvise when the corpus is silent, which is the worst possible behaviour in front of a user.
- Treating model-version bumps as free. The provider's upgrade is your regression risk; the harness exists for exactly that morning.
- Letting LLM-as-judge scores float unanchored — calibrate the judge against human ratings on a sample, or the metric drifts with the judge's own model updates.