What is LLM evaluation?

LLM evaluation is the systematic process of testing a language model's outputs against defined quality criteria such as accuracy, relevance, faithfulness, safety, and task performance. Its purpose is to determine whether a model or prompt configuration meets the bar required for a specific production use case.

What you evaluate and why it matters

LLM evaluation covers several dimensions at once. Accuracy checks whether factual claims are correct. Faithfulness, most relevant for RAG systems, checks whether the response sticks to the provided source material or fabricates content that the sources do not support. Relevance measures whether the response addresses the question that was actually asked. Format compliance checks whether outputs follow the required structure. Toxicity and safety evaluations catch harmful content. No single dimension is sufficient: a response can be accurate but toxic, or relevant but unfaithful to the source.
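The point that no single dimension is sufficient can be made concrete with a small sketch. The `DimensionScores` schema and the `check_format` helper below are hypothetical names introduced for illustration; the format check assumes the application requires a JSON object with certain keys, which is only one possible format requirement.

```python
import json
from dataclasses import dataclass


@dataclass
class DimensionScores:
    """Per-dimension verdicts for one model response (hypothetical schema)."""
    accurate: bool
    faithful: bool
    relevant: bool
    format_ok: bool
    safe: bool

    def failed_dimensions(self) -> list[str]:
        # Report every failing dimension instead of one blended score,
        # so an accurate-but-toxic response is not hidden by an average.
        return [name for name, ok in vars(self).items() if not ok]

    def passes(self) -> bool:
        # A response passes only if every dimension passes.
        return not self.failed_dimensions()


def check_format(response: str, required_keys: set[str]) -> bool:
    """Rule-based format check: a JSON object containing all required keys."""
    try:
        parsed = json.loads(response)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and required_keys.issubset(parsed)
```

Keeping the verdicts separate makes it possible to see which dimension regressed when a prompt or model changes, rather than only that an aggregate score moved.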

Human evaluation vs automated evaluation

Human evaluation, in which people rate outputs for quality, is generally treated as the reference standard, but it is slow and expensive. Automated evaluation uses a second LLM as a judge, predefined rubrics, or rule-based checks to score outputs at scale. Automated approaches are practical for continuous evaluation in production but introduce their own biases, particularly when the judge model comes from the same family as the model being evaluated. Most production systems combine both: automated evaluation for coverage, human evaluation for calibration and high-stakes cases.
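One way to use human evaluation for calibration is to have people label a small sample and measure how closely the automated judge agrees with them. The sketch below computes Cohen's kappa, a standard agreement statistic that corrects for chance agreement. The function name and the "pass"/"fail" labels are illustrative assumptions, not part of any particular tool.

```python
from collections import Counter


def cohens_kappa(human: list[str], judge: list[str]) -> float:
    """Chance-corrected agreement between human labels and judge labels."""
    if len(human) != len(judge) or not human:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(human)

    # Fraction of items where the two raters gave the same label.
    observed = sum(h == j for h, j in zip(human, judge)) / n

    # Agreement expected by chance, from each rater's label frequencies.
    human_counts = Counter(human)
    judge_counts = Counter(judge)
    expected = sum(human_counts[label] * judge_counts[label]
                   for label in human_counts) / (n * n)

    if expected == 1.0:
        # Both raters used one identical label for everything.
        return 1.0
    return (observed - expected) / (1 - expected)


human_labels = ["pass", "pass", "fail", "fail"]
judge_labels = ["pass", "pass", "fail", "pass"]
kappa = cohens_kappa(human_labels, judge_labels)  # 0.5 for this toy sample
```

A low kappa on the human-labeled sample suggests the judge's scores should not be trusted for coverage until its rubric or prompt is revised.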

Where evaluation fits in the development cycle

Evaluation should happen at three stages. Offline evaluation during development tests prompts and models against a benchmark dataset before anything reaches production. Pre-deployment evaluation gates a release: a new model version or prompt change ships only if it meets quality thresholds. Online evaluation in production samples real traffic to detect distribution shift and quality degradation over time. Treating evaluation as only a pre-deployment gate ignores two facts: production inputs differ from benchmark inputs, and model behavior can change over time.
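The pre-deployment gate and the online sampling step described above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the metric names, threshold values, and the `gate` and `should_sample` helpers are hypothetical, and hash-based sampling is just one reasonable way to pick a stable subset of traffic.

```python
import hashlib


def gate(metrics: dict[str, float], thresholds: dict[str, float]) -> list[str]:
    """Return the metrics that miss their minimum; an empty list means ship."""
    failures = []
    for name, minimum in thresholds.items():
        value = metrics.get(name)
        # A missing metric counts as a failure rather than a silent pass.
        if value is None or value < minimum:
            failures.append(name)
    return failures


def should_sample(request_id: str, rate: float) -> bool:
    """Deterministically select roughly `rate` of requests for online evaluation."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    # Map the first 4 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:4], "big") / 2**32
    return bucket < rate


# Hypothetical offline results for a candidate prompt change.
candidate = {"faithfulness": 0.93, "format_compliance": 0.99}
minimums = {"faithfulness": 0.95, "format_compliance": 0.98}
blocked_by = gate(candidate, minimums)  # ["faithfulness"]
```

Hashing the request ID, rather than drawing a random number, means the same request is always either in or out of the sample, which makes online results easier to reproduce.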

What is LLM evaluation? — FAQ

What is the difference between LLM evaluation and LLM benchmarking?

Benchmarking compares models on standardized tasks to assess general capability. Evaluation assesses whether a model's outputs meet the requirements of your specific application on your specific data. Benchmark scores do not reliably predict application performance; evaluation on your own data measures it directly.

How many examples do I need for a meaningful evaluation?

There is no universal number. For a narrow task with a clear rubric, a few hundred examples can be sufficient to detect meaningful differences. For complex tasks where quality is hard to operationalize, scores are noisier, so expect to need more examples than a first estimate suggests. The practical minimum is enough examples to cover the distribution of inputs your application actually receives, not just the easy cases.
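One way to reason about sample size is to look at the uncertainty around a measured pass rate. The sketch below uses the Wilson score interval, a standard confidence interval for a proportion; the function name and the example counts are illustrative assumptions, and it treats each example as an independent pass/fail outcome.

```python
import math


def wilson_interval(passes: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a pass rate; z=1.96 gives roughly 95% confidence."""
    if total <= 0 or not 0 <= passes <= total:
        raise ValueError("need total > 0 and 0 <= passes <= total")
    p = passes / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return center - half_width, center + half_width


# Same observed pass rate (85%), different sample sizes.
small = wilson_interval(17, 20)
large = wilson_interval(170, 200)
```

With 20 examples the interval spans roughly 0.64 to 0.95; with 200 it narrows to roughly 0.79 to 0.89. If two prompt variants differ by only a few points, the smaller sample cannot distinguish them. The interval says nothing about coverage, though: a tight interval on easy cases still misses the inputs the application actually receives.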