How to evaluate a RAG system

Goal

Stand up an evaluation harness for a RAG system that scores retrieval and generation separately — so you know which half is failing, catch regressions on every change, and stop shipping on vibes.

Before you start

  • A working RAG pipeline, even a rough one — evaluation built alongside beats evaluation bolted on
  • Access to real user questions, or a stakeholder who can supply plausible ones
  • Somewhere to store evaluation runs over time; the trend is the signal

Steps

  1. Build the golden set from real questions

    Collect thirty to a hundred questions your users actually ask, each paired with the passages that should be retrieved and a reference answer that a domain expert has signed off on. Resist inventing tidy questions: the value of the set is its mess, meaning ambiguous phrasings, questions whose answers span several documents, and questions your corpus cannot answer at all. That last category is mandatory, because how a RAG system behaves when the answer does not exist is a large part of whether users can trust it.
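
    One possible shape for a golden-set entry is sketched below. The `GoldenItem` class, its field names, and the `validate` helper are illustrative assumptions rather than part of any particular framework; the only idea taken from the step above is that each question carries its expected passages and a reference answer, and that unanswerable questions must be present.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class GoldenItem:
    """One golden-set entry (hypothetical schema for illustration)."""
    question: str
    # IDs of the passages that should be retrieved; empty if unanswerable.
    relevant_passage_ids: List[str] = field(default_factory=list)
    # Expert-approved answer; None marks a question the corpus cannot answer.
    reference_answer: Optional[str] = None
    # Free-form labels such as "ambiguous" or "multi-document".
    tags: List[str] = field(default_factory=list)

    @property
    def answerable(self) -> bool:
        return self.reference_answer is not None


def validate(golden_set: List[GoldenItem]) -> List[str]:
    """Return a list of problems; an empty list means the set passes."""
    problems = []
    for item in golden_set:
        if item.answerable and not item.relevant_passage_ids:
            problems.append(f"answerable but no passages: {item.question!r}")
        if not item.answerable and item.relevant_passage_ids:
            problems.append(f"unanswerable but has passages: {item.question!r}")
    # The no-answer category is mandatory, so its absence is an error.
    if all(item.answerable for item in golden_set):
        problems.append("no unanswerable questions in the set")
    return problems
```

    Running `validate` whenever the set is edited keeps the no-answer category from quietly disappearing as questions are added and removed.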

  2. Score retrieval on its own

    Before judging any generated answer, measure whether the right chunks surfaced: for each golden question, check whether the known-relevant passages appear in the top results and how high they rank. Retrieval failures account for a large share of RAG failures, and they are invisible if you only read final answers. A fluent response built on the wrong passages reads exactly like a correct one until someone checks the sources.
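
    The two checks described above ("do the relevant passages appear in the top results" and "how high do they rank") map onto recall@k and mean reciprocal rank. The sketch below assumes passages are identified by string IDs and that unanswerable questions are skipped for retrieval scoring; the function names are illustrative.

```python
from typing import Dict, List, Optional, Sequence, Tuple


def recall_at_k(retrieved: Sequence[str], relevant: Sequence[str], k: int) -> Optional[float]:
    """Fraction of relevant passages found in the top k results."""
    relevant_set = set(relevant)
    if not relevant_set:
        return None  # unanswerable question: recall is undefined
    hits = relevant_set & set(retrieved[:k])
    return len(hits) / len(relevant_set)


def reciprocal_rank(retrieved: Sequence[str], relevant: Sequence[str]) -> float:
    """1 / rank of the first relevant passage, or 0.0 if none surfaced."""
    relevant_set = set(relevant)
    for rank, passage_id in enumerate(retrieved, start=1):
        if passage_id in relevant_set:
            return 1.0 / rank
    return 0.0


def score_retrieval(
    results: List[Tuple[Sequence[str], Sequence[str]]], k: int = 5
) -> Dict[str, Optional[float]]:
    """Average both metrics over (retrieved_ids, relevant_ids) pairs."""
    recalls, ranks = [], []
    for retrieved, relevant in results:
        recall = recall_at_k(retrieved, relevant, k)
        if recall is None:
            continue  # skip questions with no known-relevant passages
        recalls.append(recall)
        ranks.append(reciprocal_rank(retrieved, relevant))
    n = len(recalls)
    if n == 0:
        return {"recall_at_k": None, "mrr": None, "n": 0}
    return {"recall_at_k": sum(recalls) / n, "mrr": sum(ranks) / n, "n": n}
```

    Reporting both numbers is a deliberate choice: recall@k says whether the material reached the generator at all, while reciprocal rank shows whether it arrived near the top or was buried.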

  3. Score generation against the retrieved context

    With retrieval held fixed, judge the answer on three separable questions: is it faithful to the retrieved passages, does it actually answer what was asked, and does it decline cleanly when the context does not contain the answer. Faithfulness is the one to watch, because it is where hallucination hides in RAG, dressed as synthesis. Human review of a sample beats automated scoring alone; an LLM-as-judge can extend coverage to the answers humans do not have time to read, provided you calibrate it against your human sample first.
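
    Calibrating a judge against the human sample could look like the sketch below. The `Verdict` record mirrors the three questions above; using Cohen's kappa as the agreement statistic is an assumption made here for illustration, not something the guide prescribes, and the names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Verdict:
    """One rater's judgement of one answer (hypothetical schema)."""
    faithful: bool                             # supported by the retrieved passages
    answers_question: bool                     # addresses what was asked
    declined_correctly: Optional[bool] = None  # only set for no-answer cases


def agreement_and_kappa(human: List[bool], judge: List[bool]) -> Tuple[float, Optional[float]]:
    """Raw agreement and Cohen's kappa between two raters on the same items."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(human)
    observed = sum(h == j for h, j in zip(human, judge)) / n
    # Agreement expected by chance, from each rater's label frequencies.
    labels = set(human) | set(judge)
    expected = sum((human.count(l) / n) * (judge.count(l) / n) for l in labels)
    if expected == 1.0:
        return observed, None  # both raters used a single label; kappa undefined
    return observed, (observed - expected) / (1.0 - expected)


def calibrate_faithfulness(human: List[Verdict], judge: List[Verdict]):
    """Compare the judge with human raters on the faithfulness question."""
    return agreement_and_kappa(
        [v.faithful for v in human], [v.faithful for v in judge]
    )
```

    Raw agreement can look high simply because most answers are faithful, which is why a chance-corrected figure is worth tracking next to it. The same comparison can be repeated for the other two questions.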

  4. Wire the harness into every change

    Run the full set on every meaningful change: chunking, embedding model, prompt, retriever settings, and above all the underlying model version, which can change behind a stable model name. Store scores per run and alert on regression, not just failure. The harness is what converts tuning from folklore into engineering: a chunking change that lifts retrieval recall but drops faithfulness is now a visible trade, not a mystery.
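
    Alerting on regression rather than failure amounts to comparing each run with a stored baseline. The sketch below assumes every metric is a higher-is-better score between 0 and 1 and that a drop of more than a fixed tolerance counts as a regression; the metric names and the 0.02 default are illustrative, not recommended values.

```python
from typing import Dict, Tuple


def find_regressions(
    baseline: Dict[str, float],
    current: Dict[str, float],
    tolerance: float = 0.02,
) -> Dict[str, Tuple[float, float]]:
    """Return {metric: (old, new)} for metrics that dropped by more than tolerance.

    Assumes higher is better for every metric. Metrics missing from the
    current run are ignored here; a stricter harness might flag them too.
    """
    regressions = {}
    for metric, old in baseline.items():
        new = current.get(metric)
        if new is None:
            continue
        if old - new > tolerance:
            regressions[metric] = (old, new)
    return regressions


# Example: a chunking change lifts recall but costs faithfulness.
baseline = {"recall_at_5": 0.80, "faithfulness": 0.92}
after_chunking_change = {"recall_at_5": 0.86, "faithfulness": 0.85}
flagged = find_regressions(baseline, after_chunking_change)
# flagged == {"faithfulness": (0.92, 0.85)}
```

    Checking each metric separately is what makes the trade visible: a single combined score could rise while faithfulness falls.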

  5. Close the loop from production

    Sample live queries on a cadence and route the failures back into the golden set — real failures are the highest-value test cases you will ever get, and they keep the set honest as usage drifts away from what you predicted. If your system serves multiple audiences or corpora, stratify the sample so a regression in one segment cannot hide in another's average.
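
    Stratifying the production sample can be as simple as grouping queries by segment and drawing a fixed number from each. In the sketch below, the query records, the `audience` key, and the `stratified_sample` helper are hypothetical; how you identify a segment depends on what your logs actually capture.

```python
import random
from collections import defaultdict
from typing import Callable, Dict, Hashable, List, TypeVar

Q = TypeVar("Q")


def stratified_sample(
    queries: List[Q],
    segment_of: Callable[[Q], Hashable],
    per_segment: int,
    seed: int = 0,
) -> Dict[Hashable, List[Q]]:
    """Draw up to per_segment queries from every segment.

    Small segments contribute everything they have, so a low-traffic
    audience is never crowded out by a high-traffic one.
    """
    rng = random.Random(seed)  # seeded so a review batch can be reproduced
    by_segment = defaultdict(list)
    for query in queries:
        by_segment[segment_of(query)].append(query)
    return {
        segment: rng.sample(items, min(per_segment, len(items)))
        for segment, items in by_segment.items()
    }


# Example: 45 internal queries and 5 partner queries, 3 reviewed from each.
live_queries = [
    {"text": f"q{i}", "audience": "internal" if i % 10 else "partner"}
    for i in range(50)
]
review_batch = stratified_sample(live_queries, lambda q: q["audience"], per_segment=3)
```

    Scoring each segment's batch separately, rather than pooling them, is what stops a regression in the smaller segment from vanishing into the larger one's average.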

Common pitfalls

  • Evaluating only end-to-end answers, so retrieval and generation failures blur into one unfixable score. Separating them is the whole trick.
  • A golden set of easy questions — it will flatter every change and catch nothing. The set earns its keep on the ambiguous and the unanswerable.
  • Skipping the no-answer cases. Systems tuned only on answerable questions learn to improvise when the corpus is silent, which is the worst possible behaviour in front of a user.
  • Treating model-version bumps as free. The provider's upgrade is your regression risk; the harness exists for exactly that morning.
  • Letting LLM-as-judge scores float unanchored — calibrate the judge against human ratings on a sample, or the metric drifts with the judge's own model updates.

Frequently asked questions

How large does the evaluation set need to be?

Thirty well-chosen questions catch most regressions; a hundred gives you headroom to stratify by topic and difficulty. Size matters less than honesty — ten genuinely hard questions outperform two hundred easy ones.