What is agent evaluation?

What makes agent evaluation different from LLM evaluation

LLM evaluation assesses a single model call in isolation. Agent evaluation assesses a multi-step system where decisions compound: an agent that reasons correctly at step one but selects the wrong tool at step two fails the task regardless of step one's quality. The unit of evaluation is the full task execution: did the agent complete the goal, and did it do so without taking harmful side actions? Partial-credit scoring matters here, because an agent that reaches step four of five is qualitatively different from one that fails immediately, and a binary pass/fail score hides that difference.
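
One way to make the partial-credit idea concrete is sketched below. It assumes a task can be broken into ordered, individually checkable steps, and that credit stops at the first failure because later steps depend on earlier ones. The names StepResult, partial_credit, and task_succeeded are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class StepResult:
    """Outcome of one step in an agent run (hypothetical structure)."""
    name: str
    passed: bool


def partial_credit(steps: list[StepResult]) -> float:
    """Fraction of steps completed before the first failure.

    Assumption: steps are sequential, so nothing after the first
    failed step earns credit.
    """
    if not steps:
        return 0.0
    completed = 0
    for step in steps:
        if not step.passed:
            break
        completed += 1
    return completed / len(steps)


def task_succeeded(steps: list[StepResult]) -> bool:
    """Binary view: the task counts only if every step passed."""
    return bool(steps) and all(step.passed for step in steps)


# An agent that reaches step four of five, and one that fails immediately.
reached_step_four = [StepResult(f"step-{i}", i <= 4) for i in range(1, 6)]
failed_immediately = [StepResult(f"step-{i}", False) for i in range(1, 6)]

print(task_succeeded(reached_step_four), partial_credit(reached_step_four))    # False 0.8
print(task_succeeded(failed_immediately), partial_credit(failed_immediately))  # False 0.0
```

Both runs fail the task, but the partial-credit score separates them. Whether a linear fraction is the right weighting depends on the task; some steps may deserve more weight than others.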

What to measure in an agent evaluation

Task completion rate is the primary metric: for a defined set of tasks, what fraction did the agent complete correctly? Supporting metrics include tool selection accuracy (did the agent pick the right tool for each step?), unnecessary tool calls (wasted steps that increase cost and latency), and escalation behavior (did the agent correctly recognize when to hand off to a human?). Safety metrics, which track actions the agent should never take, are equally important and are often tested with adversarial inputs designed to push the agent toward prohibited behavior.
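
The metrics above can be computed from logged runs roughly as follows. This is a sketch under stated assumptions, not a standard scoring scheme: it assumes each run records the tools the agent called alongside a single reference tool sequence, and it compares them position by position. RunRecord, the tool names, and the prohibited-tool set are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RunRecord:
    """One logged agent run (hypothetical structure)."""
    task_id: str
    completed: bool
    tools_called: list[str]
    expected_tools: list[str]  # assumed reference sequence for the task


def task_completion_rate(runs: list[RunRecord]) -> float:
    """Fraction of tasks the agent completed correctly."""
    return sum(r.completed for r in runs) / len(runs) if runs else 0.0


def tool_selection_accuracy(runs: list[RunRecord]) -> float:
    """Share of reference steps where the agent called the expected tool."""
    correct = total = 0
    for run in runs:
        for i, expected in enumerate(run.expected_tools):
            total += 1
            if i < len(run.tools_called) and run.tools_called[i] == expected:
                correct += 1
    return correct / total if total else 0.0


def unnecessary_calls(run: RunRecord) -> int:
    """Tool calls beyond the length of the reference sequence."""
    return max(0, len(run.tools_called) - len(run.expected_tools))


def safety_violations(run: RunRecord, prohibited: set[str]) -> int:
    """Number of calls to tools the agent should never use."""
    return sum(1 for tool in run.tools_called if tool in prohibited)


runs = [
    RunRecord("refund-1", True,
              ["lookup_order", "issue_refund"],
              ["lookup_order", "issue_refund"]),
    RunRecord("refund-2", False,
              ["lookup_order", "lookup_order", "send_email"],
              ["lookup_order", "issue_refund"]),
]

print(task_completion_rate(runs))                      # 0.5
print(tool_selection_accuracy(runs))                   # 0.75
print(unnecessary_calls(runs[1]))                      # 1
print(safety_violations(runs[1], {"delete_account"}))  # 0
```

Position-wise comparison penalizes valid alternative paths, so these step-level numbers work best as diagnostics alongside task completion rate, not as a replacement for it.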

Evaluation environments and ground truth

Agent evaluation requires an environment where tool calls execute without production consequences: sandboxed versions of real tools, mock tool responses, or isolated test environments. Ground truth is harder to establish than in LLM evaluation: the correct path through a multi-step task is often not unique, so evaluation rubrics must capture correct outcomes rather than correct steps. Building and maintaining evaluation environments is often the largest ongoing cost of an agent evaluation program.
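
A small sketch of outcome-based checking in a sandbox follows. It assumes an in-memory stand-in for an order system; SandboxOrders, its methods, and the order ID are hypothetical. The rubric inspects the final state of the environment, so two different tool paths that reach the same outcome both pass.

```python
class SandboxOrders:
    """In-memory stand-in for an order system (hypothetical).

    Nothing here touches production; state lives only in this object.
    """

    def __init__(self) -> None:
        self.orders = {"A100": {"status": "paid", "refunded": False}}
        self.call_log: list[tuple[str, str]] = []

    def lookup_order(self, order_id: str) -> dict:
        self.call_log.append(("lookup_order", order_id))
        return dict(self.orders[order_id])

    def issue_refund(self, order_id: str) -> dict:
        self.call_log.append(("issue_refund", order_id))
        self.orders[order_id]["refunded"] = True
        return {"ok": True}


def outcome_correct(env: SandboxOrders) -> bool:
    """Rubric on the end state: was order A100 refunded?"""
    return env.orders["A100"]["refunded"] is True


def path_with_lookup(env: SandboxOrders) -> None:
    """Simulated agent run that checks the order before refunding."""
    env.lookup_order("A100")
    env.issue_refund("A100")


def path_direct(env: SandboxOrders) -> None:
    """Simulated agent run that refunds without a lookup."""
    env.issue_refund("A100")


env_a, env_b = SandboxOrders(), SandboxOrders()
path_with_lookup(env_a)
path_direct(env_b)

print(outcome_correct(env_a), len(env_a.call_log))  # True 2
print(outcome_correct(env_b), len(env_b.call_log))  # True 1
```

Each run gets a fresh environment so state from one evaluation cannot leak into the next. The call log is still kept, which allows step-level diagnostics without making them part of the pass/fail rubric.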

What is agent evaluation? — FAQ

Can I use the same evaluation dataset for agents and LLMs?

Rarely. LLM evaluation datasets are typically single-turn question-answer pairs. Agent evaluation requires multi-step task specifications with defined success conditions, available tools, and, where the path matters, expected tool usage. You usually need to build agent-specific evaluation suites, although LLM evaluation still applies to individual reasoning steps within an agent run.
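
The difference in dataset shape can be illustrated with two record types. This is a sketch, not a standard schema: LLMExample, AgentTask, and all field names are hypothetical, and the success condition is assumed to be a function over the final environment state.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class LLMExample:
    """Single-turn question-answer pair, typical of LLM evaluation sets."""
    question: str
    expected_answer: str


@dataclass
class AgentTask:
    """Multi-step task specification for agent evaluation (hypothetical)."""
    goal: str
    available_tools: list[str]
    expected_tools: list[str]                   # reference usage, for diagnostics
    success_condition: Callable[[dict], bool]   # checked against the final state
    max_steps: int = 10


llm_item = LLMExample(
    question="What is the refund window for paid orders?",
    expected_answer="30 days",
)

agent_item = AgentTask(
    goal="Refund order A100 and notify the customer.",
    available_tools=["lookup_order", "issue_refund", "send_email"],
    expected_tools=["lookup_order", "issue_refund", "send_email"],
    success_condition=lambda state: state.get("refunded") is True
    and state.get("customer_notified") is True,
)

print(agent_item.success_condition({"refunded": True, "customer_notified": True}))   # True
print(agent_item.success_condition({"refunded": True, "customer_notified": False}))  # False
```

The LLM record has nothing to say about tools or end state, which is why it cannot be reused directly as an agent task.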

How do I evaluate an agent that uses external APIs that change?

Mock the external APIs in your evaluation environment and update the mocks when real APIs change. Real external calls make evaluations non-deterministic, expensive, and slow. The mock should reflect the behavior the agent is designed to work with — not an idealized response — so failure modes from real API variability are still covered.
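
A mock that covers real API variability might look like the sketch below. It assumes a scripted sequence of responses that includes a rate-limit error and a payload with a missing value before a clean response. MockWeatherAPI, the response shape, and fetch_temperature (standing in for the agent's tool-handling logic) are all hypothetical.

```python
class MockWeatherAPI:
    """Deterministic mock of an external API (hypothetical).

    Replays a fixed script of responses so every evaluation run sees the
    same behavior, including the non-ideal responses.
    """

    def __init__(self, scripted_responses: list[dict]) -> None:
        self._responses = list(scripted_responses)
        self.calls = 0

    def get(self, city: str) -> dict:
        # Repeat the last scripted response once the script is exhausted.
        index = min(self.calls, len(self._responses) - 1)
        self.calls += 1
        return self._responses[index]


# Assumed failure modes: a rate limit, then a payload with a missing
# value, then a clean response.
REALISTIC_SCRIPT = [
    {"status": 429, "body": {"error": "rate_limited"}},
    {"status": 200, "body": {"city": "Oslo", "temp_c": None}},
    {"status": 200, "body": {"city": "Oslo", "temp_c": 4.5}},
]


def fetch_temperature(api: MockWeatherAPI, city: str, max_attempts: int = 3):
    """Stand-in for agent logic: retry until a usable value arrives."""
    for _ in range(max_attempts):
        response = api.get(city)
        if response["status"] == 200 and response["body"].get("temp_c") is not None:
            return response["body"]["temp_c"]
    return None


patient = MockWeatherAPI(REALISTIC_SCRIPT)
impatient = MockWeatherAPI(REALISTIC_SCRIPT)

print(fetch_temperature(patient, "Oslo"), patient.calls)                    # 4.5 3
print(fetch_temperature(impatient, "Oslo", max_attempts=2), impatient.calls)  # None 2
```

Because the script is fixed, the evaluation stays deterministic while still exercising retry and missing-data handling. When the real API changes its error format or payload, the script is the single place to update.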