What you evaluate and why it matters

LLM evaluation covers several dimensions at once. Accuracy checks whether factual claims are correct. Faithfulness, most relevant for RAG systems, checks whether the response stays within the provided source material or fabricates content. Relevance measures whether the response addresses the question that was actually asked. Format compliance checks whether outputs follow the required structure. Toxicity and safety evaluations catch harmful content. No single dimension is sufficient: a response can be accurate but toxic, or relevant but unfaithful to the source.
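As a rough illustration of scoring several dimensions side by side, the sketch below keeps one score per dimension and passes a response only if every dimension clears a threshold. The `EvalResult` schema, the 0-to-1 score scale, the 0.7 default threshold, and the JSON-based format rule are all assumptions made for this example, not a standard.

```python
import json
from dataclasses import dataclass, astuple


@dataclass
class EvalResult:
    """Per-dimension scores in [0, 1] for one response (hypothetical schema)."""
    accuracy: float
    faithfulness: float
    relevance: float
    format_compliance: float
    safety: float

    def passes(self, threshold: float = 0.7) -> bool:
        # Use the minimum rather than the average, so a strong score on one
        # dimension cannot hide a failure on another (accurate but toxic).
        return min(astuple(self)) >= threshold


def check_format_compliance(output: str, required_keys: set[str]) -> float:
    """Rule-based check: 1.0 if output is a JSON object containing every
    required key, otherwise 0.0. The JSON requirement is an assumption."""
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return 0.0
    if not isinstance(parsed, dict):
        return 0.0
    return 1.0 if required_keys.issubset(parsed.keys()) else 0.0
```

Format compliance is the easiest dimension to check with rules like this; the other scores would typically come from an LLM judge or from human raters.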

Human evaluation vs automated evaluation

Human evaluation, in which people rate outputs for quality, is generally treated as the most reliable method, but it is slow and expensive. Automated evaluation uses a second LLM as a judge, predefined rubrics, or rule-based checks to score outputs at scale. Automated approaches are practical for continuous evaluation in production but introduce their own biases, particularly when the judge model comes from the same family as the model being evaluated and may favor outputs that resemble its own. Most production systems combine both: automated evaluation for coverage, human evaluation for calibration and high-stakes cases.
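One way to use human ratings for calibration is to have people label a small sample and then measure how often the automated judge agrees with them. The sketch below computes raw agreement and Cohen's kappa, which corrects for agreement expected by chance. The function names and the pass/fail labels are illustrative assumptions; any categorical labels would work.

```python
from collections import Counter


def agreement_rate(human: list[str], judge: list[str]) -> float:
    """Fraction of items where the judge label equals the human label."""
    if len(human) != len(judge) or not human:
        raise ValueError("label lists must be non-empty and the same length")
    return sum(h == j for h, j in zip(human, judge)) / len(human)


def cohens_kappa(human: list[str], judge: list[str]) -> float:
    """Chance-corrected agreement between human and judge labels."""
    n = len(human)
    p_observed = agreement_rate(human, judge)
    human_counts = Counter(human)
    judge_counts = Counter(judge)
    labels = set(human_counts) | set(judge_counts)
    # Agreement expected if both raters labeled independently at their own rates.
    p_expected = sum(human_counts[l] * judge_counts[l] for l in labels) / (n * n)
    if p_expected == 1.0:
        # Both raters used a single identical label; kappa is undefined here,
        # and this sketch reports 1.0 as a convention.
        return 1.0
    return (p_observed - p_expected) / (1 - p_expected)


human_labels = ["pass", "pass", "fail", "fail"]
judge_labels = ["pass", "fail", "fail", "fail"]
print(agreement_rate(human_labels, judge_labels))  # 0.75
print(cohens_kappa(human_labels, judge_labels))    # 0.5
```

A low kappa on the human-labeled sample suggests the judge's scores should not be trusted at scale until its prompt or rubric is revised.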

Where evaluation fits in the development cycle

Evaluation should happen at three stages. Offline evaluation during development tests prompts and models against a benchmark dataset before anything reaches production. Pre-deployment evaluation gates a release: a new model version or prompt change ships only if it meets quality thresholds. Online evaluation in production samples real traffic to detect distribution shift and quality degradation over time. Treating evaluation as only a pre-deployment gate ignores two facts: production inputs differ from benchmark inputs, and models themselves change over time.
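The pre-deployment gate and the online sampling step can each be sketched in a few lines. In the example below, `passes_gate` blocks a release unless the mean score on every thresholded dimension meets its minimum, and `should_sample` deterministically selects a fraction of production requests for evaluation. Both function names, the dimension names, and the threshold values are hypothetical; real thresholds depend on the application.

```python
import hashlib
from statistics import mean


def passes_gate(scores: dict[str, list[float]],
                thresholds: dict[str, float]) -> bool:
    """True only if the mean benchmark score for every thresholded
    dimension meets its minimum. A missing dimension fails the gate."""
    for dimension, minimum in thresholds.items():
        values = scores.get(dimension)
        if not values or mean(values) < minimum:
            return False
    return True


def should_sample(request_id: str, rate: float) -> bool:
    """Select roughly `rate` of requests for online evaluation.

    Hashing the request ID makes the decision deterministic, so the same
    request is always either in or out of the evaluation sample."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < rate


# Hypothetical benchmark results for a candidate prompt change.
candidate_scores = {"accuracy": [0.9, 0.8], "faithfulness": [0.95]}
minimums = {"accuracy": 0.8, "faithfulness": 0.9}
print(passes_gate(candidate_scores, minimums))  # True
```

Sampled production responses can then be scored with the same checks used offline, which makes it possible to compare live quality against the benchmark numbers that passed the gate.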