What makes agent evaluation different from LLM evaluation

LLM evaluation assesses a single model call in isolation. Agent evaluation assesses a multi-step system where decisions compound: an agent that reasons correctly at step one but selects the wrong tool at step two fails the task regardless of step one's quality. The unit of evaluation is the full task execution — did the agent complete the goal, and did it do so without taking harmful side actions? Partial-credit scoring matters here, because an agent that reaches step four of five is qualitatively different from one that fails immediately.
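One way to make partial credit concrete is to score each run by how far it got before its first failure. The sketch below assumes the task can be broken into an ordered list of checkpoints, which will not hold for every task; `StepResult` and `partial_credit` are illustrative names, not part of any particular framework.

```python
from dataclasses import dataclass


@dataclass
class StepResult:
    """Outcome of one checkpoint in a task (hypothetical structure)."""
    name: str
    passed: bool


def partial_credit(steps: list[StepResult]) -> float:
    """Return the fraction of steps completed before the first failure.

    Steps after a failure earn no credit, reflecting that errors compound:
    a wrong tool choice at step two invalidates everything built on it.
    """
    if not steps:
        return 0.0
    completed = 0
    for step in steps:
        if not step.passed:
            break
        completed += 1
    return completed / len(steps)


# An agent that reaches step four of five versus one that fails immediately.
names = ["plan", "select_tool", "call_tool", "parse_result", "report"]
late_failure = [StepResult(n, i < 4) for i, n in enumerate(names)]
early_failure = [StepResult(n, False) for n in names]

print(partial_credit(late_failure), partial_credit(early_failure))
```

Counting only the prefix before the first failure is a deliberate choice in this sketch; a rubric that awards credit for later steps regardless of earlier errors would hide exactly the compounding effect described above.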

What to measure in an agent evaluation

Task completion rate is the primary metric: for a defined set of tasks, what fraction did the agent complete correctly? Supporting metrics include tool selection accuracy (did the agent pick the right tool for each step?), unnecessary tool calls (wasted steps that increase cost and latency), and escalation behavior (did the agent correctly recognize when to hand off to a human?). Safety metrics, which count actions the agent should never take, are equally important and are often tested with adversarial inputs designed to push the agent toward prohibited behavior.
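These metrics can be aggregated from per-run records. The following is a minimal sketch that assumes each run has already been graded into the fields shown; `TaskRun`, its field names, and `summarize` are hypothetical and would need to match whatever your harness actually logs.

```python
from dataclasses import dataclass


@dataclass
class TaskRun:
    """Graded record of one task execution (hypothetical schema)."""
    completed: bool            # did the agent finish the task correctly?
    tool_choices_correct: int  # steps where the right tool was picked
    tool_choices_total: int    # steps where a tool had to be picked
    unnecessary_calls: int     # tool calls that were not needed
    should_escalate: bool      # ground truth: was a human handoff required?
    did_escalate: bool         # what the agent actually did
    prohibited_actions: int    # count of actions the agent must never take


def summarize(runs: list[TaskRun]) -> dict[str, float]:
    """Aggregate per-run records into the metrics described above."""
    if not runs:
        raise ValueError("need at least one run to summarize")
    n = len(runs)
    total_choices = sum(r.tool_choices_total for r in runs)
    correct_choices = sum(r.tool_choices_correct for r in runs)
    return {
        "task_completion_rate": sum(r.completed for r in runs) / n,
        "tool_selection_accuracy": (
            correct_choices / total_choices if total_choices else 0.0
        ),
        "unnecessary_calls_per_task": sum(r.unnecessary_calls for r in runs) / n,
        # A run counts as correct if the agent escalated exactly when it should.
        "escalation_accuracy": sum(
            r.should_escalate == r.did_escalate for r in runs
        ) / n,
        "safety_violations": float(sum(r.prohibited_actions for r in runs)),
    }


runs = [
    TaskRun(True, 3, 3, 0, False, False, 0),
    TaskRun(False, 1, 2, 2, True, False, 1),
]
summary = summarize(runs)
```

Safety violations are reported as a raw count rather than a rate in this sketch, on the assumption that any nonzero value warrants inspection of the individual runs.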

Evaluation environments and ground truth

Agent evaluation requires an environment where tool calls execute without production consequences: sandboxed versions of real tools, mock tool responses, or isolated test environments. Ground truth is harder to establish than in LLM evaluation: the correct path through a multi-step task is often not unique, so evaluation rubrics must capture correct outcomes rather than correct steps. Building and maintaining evaluation environments is often the largest ongoing cost of an agent evaluation program.
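To illustrate outcome-based grading against mock tools, the sketch below uses a small in-memory environment and checks only the final state, not the sequence of calls. Everything here is an assumption made for illustration: the `MockEnvironment` class, the tool names (`get_ticket`, `close_ticket`, `send_email`), the ticket ID, and the three scripted "agents", which stand in for real agent runs.

```python
from typing import Any, Callable


class MockEnvironment:
    """In-memory stand-in for real tools; nothing here touches production."""

    def __init__(self) -> None:
        self.state: dict[str, Any] = {"tickets": {"T-1": "open"}, "emails_sent": []}
        self.call_log: list[tuple[str, dict[str, Any]]] = []

    def call(self, tool: str, **kwargs: Any) -> Any:
        """Record the call, then apply its effect to the mock state."""
        self.call_log.append((tool, kwargs))
        if tool == "get_ticket":
            return self.state["tickets"].get(kwargs["ticket_id"])
        if tool == "close_ticket":
            self.state["tickets"][kwargs["ticket_id"]] = "closed"
            return "ok"
        if tool == "send_email":
            self.state["emails_sent"].append(kwargs["to"])
            return "ok"
        raise ValueError(f"unknown tool: {tool}")


def outcome_ok(env: MockEnvironment) -> bool:
    """Rubric on the final state: ticket closed, and no email sent as a side action."""
    ticket_closed = env.state["tickets"]["T-1"] == "closed"
    no_side_effects = not env.state["emails_sent"]
    return ticket_closed and no_side_effects


def evaluate(agent: Callable[[MockEnvironment], None]) -> bool:
    """Run one agent in a fresh environment and grade the outcome."""
    env = MockEnvironment()
    agent(env)
    return outcome_ok(env)


# Scripted stand-ins for agent behavior: two valid paths and one with a side action.
def direct_path(env: MockEnvironment) -> None:
    env.call("close_ticket", ticket_id="T-1")


def cautious_path(env: MockEnvironment) -> None:
    env.call("get_ticket", ticket_id="T-1")
    env.call("close_ticket", ticket_id="T-1")


def noisy_path(env: MockEnvironment) -> None:
    env.call("close_ticket", ticket_id="T-1")
    env.call("send_email", to="customer@example.com")


results = {f.__name__: evaluate(f) for f in (direct_path, cautious_path, noisy_path)}
```

Both `direct_path` and `cautious_path` pass even though they take different routes, while `noisy_path` fails on the unwanted email despite closing the ticket. The `call_log` is kept so that step-level metrics can still be computed separately if needed.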