How to evaluate AI agents

Goal

Design and run an evaluation program that measures whether your AI agent completes its defined tasks correctly, safely, and within acceptable performance bounds — both before release and as part of ongoing production monitoring.

Before you start

  • A clearly defined scope for the agent: what tasks it should complete and what outcomes count as success
  • A sandboxed version of the agent's environment where tool calls do not affect production systems
  • A representative set of tasks drawn from the actual distribution of inputs the agent will receive

Steps

  1. Define success criteria before building evaluation tasks

    Evaluation is only as good as the success criteria it measures against. For each task type your agent handles, define what a correct outcome looks like — not what the correct path looks like. Multi-step agents can reach the right outcome via different paths, and evaluating path correctness instead of outcome correctness produces misleading results. Define failure criteria too: actions the agent should never take regardless of the task, and task-incomplete states that should be distinguished from task-failed states. Write these criteria down in a form your evaluation logic can apply consistently.
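
    A minimal sketch of criteria written in a form that evaluation logic can apply consistently. The names here (`Verdict`, `RunRecord`, `SuccessCriteria`, and the example action names) are hypothetical and only illustrate the idea: judge the final outcome, treat prohibited actions as an automatic failure, and keep incomplete runs separate from failed ones.

```python
from dataclasses import dataclass, field
from enum import Enum


class Verdict(Enum):
    SUCCESS = "success"
    INCOMPLETE = "incomplete"   # agent stopped or handed off before finishing
    FAILED = "failed"           # agent finished, but the outcome is wrong
    VIOLATION = "violation"     # agent took an action it must never take


@dataclass
class RunRecord:
    """What the harness captured for one task (hypothetical shape)."""
    final_state: dict
    actions: list = field(default_factory=list)   # tool names, in call order
    finished: bool = True


@dataclass
class SuccessCriteria:
    expected_state: dict                      # the outcome that counts as correct
    prohibited_actions: frozenset = frozenset()

    def judge(self, run: RunRecord) -> Verdict:
        # Prohibited actions override every other result.
        if self.prohibited_actions & set(run.actions):
            return Verdict.VIOLATION
        if not run.finished:
            return Verdict.INCOMPLETE
        # Outcome check only: the order and number of actions is ignored.
        matches = all(
            run.final_state.get(key) == value
            for key, value in self.expected_state.items()
        )
        return Verdict.SUCCESS if matches else Verdict.FAILED
```

    Two runs that reach the same final state through different tool sequences both receive `Verdict.SUCCESS` under this sketch, which is the point of judging outcomes rather than paths.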

  2. Build a task set that covers the input distribution

    Your evaluation tasks should represent the actual inputs your agent receives, including the edge cases. Draw tasks from production logs if available — these represent real inputs rather than idealized ones. If the agent is pre-launch, construct tasks from the task specification and include: straightforward cases the agent should handle easily, cases that require multiple tool calls or reasoning steps, edge cases where the task is ambiguous or information is incomplete, and adversarial cases designed to push the agent toward prohibited actions. A task set that covers only the easy cases will not reveal how the agent behaves under the conditions that actually cause problems.
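
    One way to keep the task set from drifting toward easy cases is to tag each task with a category and check coverage automatically. The sketch below assumes each task is a dict with a `category` key; the category names mirror the four kinds of cases listed above, and both the labels and the `coverage_report` helper are illustrative.

```python
from collections import Counter

# Labels for the four kinds of cases described above (assumed naming).
REQUIRED_CATEGORIES = ("straightforward", "multi_step", "ambiguous", "adversarial")


def coverage_report(tasks, min_per_category=1):
    """Count tasks per category and list the categories below the minimum."""
    counts = Counter(task["category"] for task in tasks)
    missing = [c for c in REQUIRED_CATEGORIES if counts[c] < min_per_category]
    return dict(counts), missing


tasks = [
    {"id": "t1", "category": "straightforward"},
    {"id": "t2", "category": "straightforward"},
    {"id": "t3", "category": "multi_step"},
    {"id": "t4", "category": "ambiguous"},
]
counts, missing = coverage_report(tasks)
print(counts)
print("under-covered:", missing)
```

    In this example the report flags `adversarial` as under-covered, because no task carries that label.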

  3. Set up an isolated evaluation environment

    The evaluation environment must let agent tool calls execute without affecting production systems. For each external integration the agent uses, implement one of: a sandboxed version of the real system (a test database, a staging API), a mock that returns realistic but controlled responses, or read-only access to production data with writes intercepted. The mock or sandbox should reflect actual system behavior — including failure modes — rather than always returning success. Evaluation results from an environment that never returns errors do not tell you how the agent handles the error conditions it will actually encounter.
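
    A mock that sometimes fails can be sketched as follows. `MockOrderApi`, `ToolError`, and the order-lookup integration are hypothetical stand-ins for whatever the agent actually calls; the failure rate and error messages are assumptions to tune against the behavior of the real system.

```python
import random


class ToolError(Exception):
    """Raised by the mock to imitate a failing integration."""


class MockOrderApi:
    """Stand-in for a hypothetical order-lookup API with injected failures."""

    def __init__(self, orders, failure_rate=0.2, seed=0):
        self._orders = dict(orders)
        self._failure_rate = failure_rate
        self._rng = random.Random(seed)   # seeded so evaluation runs are reproducible
        self.calls = []                   # log of every call, for later inspection

    def get_order(self, order_id):
        self.calls.append(("get_order", order_id))
        # Simulate a transient upstream failure on a fraction of calls.
        if self._rng.random() < self._failure_rate:
            raise ToolError("upstream timeout (simulated)")
        if order_id not in self._orders:
            raise ToolError(f"order {order_id} not found")
        # Return a copy so the agent cannot mutate the fixture data.
        return dict(self._orders[order_id])
```

    Seeding the random generator keeps the injected failures identical between runs, so a change in results reflects a change in the agent rather than a change in the mock.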

  4. Run baseline evaluations and record results by task type

    Run the full task set through the agent in the evaluation environment and record: task completion rate, both overall and broken down by task type; step count per completed task (more steps than expected can indicate inefficiency or reasoning loops); tool selection accuracy where you have ground truth; escalation rate (the share of tasks the agent handed off to humans); and error rate by type. Record the raw outputs for each task, not just summary statistics: you need the outputs to diagnose failures and to compare runs after changes.
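
    The per-type breakdown can be computed from the raw records with a few lines. This sketch assumes each record is a dict with the keys `task_type`, `completed`, `steps`, `escalated`, and `error` (an error label or `None`); the record shape is an assumption, and tool selection accuracy is left out because it depends on how ground truth is stored.

```python
from collections import defaultdict
from statistics import mean


def summarize(results):
    """Aggregate per-task records into metrics grouped by task type."""
    by_type = defaultdict(list)
    for record in results:
        by_type[record["task_type"]].append(record)

    summary = {}
    for task_type, rows in by_type.items():
        completed = [r for r in rows if r["completed"]]
        errors = defaultdict(int)
        for r in rows:
            if r["error"] is not None:
                errors[r["error"]] += 1
        summary[task_type] = {
            "n": len(rows),
            "completion_rate": len(completed) / len(rows),
            # Step count is averaged over completed tasks only.
            "mean_steps_completed": mean(r["steps"] for r in completed) if completed else None,
            "escalation_rate": sum(1 for r in rows if r["escalated"]) / len(rows),
            "errors_by_type": dict(errors),
        }
    return summary
```

    The raw `results` list should be stored alongside the summary, since the summary alone cannot explain why a given task failed.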

  5. Establish a regression baseline and add evaluation to your change process

    The value of evaluation is in comparison: the agent today versus the agent after a prompt change, model update, or new tool. Commit your baseline results and run the same evaluation on every significant change. Define regression thresholds: how much can task completion rate drop, or error rate rise, before the change is blocked? These thresholds should reflect what your use case can tolerate — a small regression in a low-stakes task may be acceptable; any regression in a high-stakes one may not be.
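
    A regression gate of this kind might look like the sketch below. The function name, the metric keys, and the threshold values are illustrative assumptions; the thresholds in particular should come from what your use case can tolerate, with tighter limits for higher-stakes task types.

```python
def check_regression(baseline, candidate, thresholds):
    """Compare candidate metrics with the baseline, per task type.

    Returns a list of violation messages; an empty list means the change passes.
    """
    violations = []
    for task_type, limits in thresholds.items():
        base, cand = baseline[task_type], candidate[task_type]
        drop = base["completion_rate"] - cand["completion_rate"]
        rise = cand["error_rate"] - base["error_rate"]
        if drop > limits["max_completion_drop"]:
            violations.append(f"{task_type}: completion rate dropped by {drop:.3f}")
        if rise > limits["max_error_rise"]:
            violations.append(f"{task_type}: error rate rose by {rise:.3f}")
    return violations


baseline = {
    "faq": {"completion_rate": 0.90, "error_rate": 0.05},
    "refund": {"completion_rate": 0.95, "error_rate": 0.02},
}
candidate = {
    "faq": {"completion_rate": 0.88, "error_rate": 0.05},
    "refund": {"completion_rate": 0.94, "error_rate": 0.02},
}
thresholds = {
    "faq": {"max_completion_drop": 0.05, "max_error_rise": 0.05},     # low stakes
    "refund": {"max_completion_drop": 0.0, "max_error_rise": 0.0},    # high stakes
}
violations = check_regression(baseline, candidate, thresholds)
print(violations)
```

    Here the small drop on `faq` stays within its tolerance, while the drop on `refund` is reported because that task type allows no regression at all. A change process could block the change whenever the returned list is non-empty.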

  6. Layer in production sampling as a quality monitor

    Offline evaluation against a fixed task set cannot account for production inputs changing over time. Complement it with online evaluation: sample a fraction of real production runs and evaluate them against your success criteria, either automatically where criteria can be checked programmatically, or manually where human judgment is needed. Rising failure rates or new failure modes in production samples are often the earliest signal that the agent's performance has drifted from your evaluation baseline.
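
    Sampling a fixed fraction of production runs can be done deterministically by hashing a run identifier, so the same run is always either in or out of the sample. This is one possible approach, not the only one; the `in_sample` helper and the run ID format are assumptions.

```python
import hashlib


def in_sample(run_id: str, rate: float) -> bool:
    """Deterministically select roughly `rate` of runs from a hash of the run ID."""
    digest = hashlib.sha256(run_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the digest to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate


run_ids = [f"run-{i}" for i in range(10_000)]
sampled = [rid for rid in run_ids if in_sample(rid, rate=0.10)]
print(f"sampled {len(sampled)} of {len(run_ids)} runs")
```

    Because selection depends only on the run ID, the decision can be made anywhere in the pipeline without shared state, and re-processing a log yields the same sample.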

Common pitfalls

  • Evaluating path correctness instead of outcome correctness: agents that take a different route than expected but produce the right result are not failing.
  • Building a task set only from easy, well-formed inputs: evaluation tasks should include the irregular and adversarial inputs the agent will encounter in production.
  • Running evaluation against production systems: tool calls that modify state must be sandboxed, or evaluation itself becomes a source of production incidents.
  • Treating a single evaluation run as definitive: agent outputs are probabilistic, and a small task set will have high variance. Run evaluation multiple times or with a large enough task set to distinguish signal from noise.
  • Running evaluation only before release: production inputs change, models update, and quality drifts. Evaluation is ongoing maintenance, not a one-time gate.
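
To make the variance point concrete, a confidence interval on the completion rate shows how much a small task set can hide. The sketch below uses the Wilson score interval, which is a standard choice for binomial proportions and an addition here rather than something the guide prescribes; it also treats tasks as independent trials, which is a simplification.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96):
    """Wilson score interval for a success rate (z = 1.96 gives about 95% coverage)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return max(0.0, center - half_width), min(1.0, center + half_width)


small = wilson_interval(45, 50)     # 90% observed on 50 tasks
large = wilson_interval(450, 500)   # 90% observed on 500 tasks
print(f"50 tasks:  {small[0]:.3f} to {small[1]:.3f}")
print(f"500 tasks: {large[0]:.3f} to {large[1]:.3f}")
```

With 45 of 50 tasks completed, the interval spans roughly 0.79 to 0.96, so a second run that scores a few points lower is not evidence of a regression by itself. The same observed rate on 500 tasks gives a much narrower interval.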

Frequently asked questions

How many evaluation tasks do I need?

Enough to cover the input distribution of your agent's actual workload. For a narrow, well-defined task scope, a few hundred tasks can be sufficient. For broad or complex agents, more is better. The practical test is whether your task set catches failures you know exist — if you introduced a known regression and your evaluation did not catch it, the task set is too small or too narrow.

Can I use an LLM to evaluate my agent's outputs?

Yes, with caveats. LLM-as-judge evaluation scales well and can assess qualities that rule-based checks cannot, like whether an output is helpful or whether reasoning is sound. The limitations: the judge model introduces its own biases, consistency varies across runs, and using the same model family as the evaluated agent creates blind spots. Use LLM evaluation for scale, but calibrate it against human evaluation on a sample and validate that it agrees with human judgments on the kinds of outputs you care about.
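
Calibration against human evaluation can be quantified with an agreement statistic. The sketch below computes Cohen's kappa, which corrects raw agreement for the agreement expected by chance; choosing kappa is an assumption made here, and the pass/fail labels are illustrative.

```python
from collections import Counter


def cohens_kappa(human, judge):
    """Chance-corrected agreement between human labels and judge labels."""
    if len(human) != len(judge) or not human:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(human)
    observed = sum(h == j for h, j in zip(human, judge)) / n
    h_counts, j_counts = Counter(human), Counter(judge)
    # Probability that both raters pick the same label by chance.
    expected = sum(h_counts[label] * j_counts[label] for label in h_counts) / n**2
    if expected == 1.0:
        return 1.0   # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


human = ["pass", "pass", "fail", "fail"]
judge = ["pass", "pass", "fail", "pass"]
print(cohens_kappa(human, judge))
```

In this example the raters agree on three of four items, but half of that agreement is expected by chance, so kappa is 0.5. Inspecting the disagreements themselves is still worthwhile, because a single score does not show which kinds of outputs the judge gets wrong.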
