How to implement agent observability

Goal

Add structured observability to an AI agent system so every LLM call, tool invocation, and reasoning step is recorded in a way that lets you trace failures, audit decisions, and detect quality regressions in production.

Before you start

  • A working agent that makes at least one LLM call and one tool call
  • A log destination — a structured logging library, an observability platform, or a database you control
  • Access to the agent's code and configuration (for example, environment variables) so you can add instrumentation

Steps

  1.

    Define your observability requirements before adding instrumentation

    Decide what questions you need to answer. Debugging failed runs requires the full prompt, completion, and tool-call sequence for that run. Cost monitoring requires token counts and model used per call. Quality monitoring requires output evaluation signals. Audit requirements may specify retention periods and access controls. Writing these down before instrumentation prevents recording data you do not need and missing data you do. The three minimum requirements for any agent observability setup are: trace-level correlation across all events in a single run, per-call recording of inputs and outputs, and error state capture with enough context to reproduce the failure.

  2.

    Generate and propagate a trace ID for every agent run

    A trace ID is a unique identifier assigned at the start of each agent execution and attached to every event — LLM calls, tool calls, state transitions, errors — produced during that run. Without it, you cannot reconstruct the causal sequence of events for a given task. Generate the ID at agent entry and pass it through the execution context rather than regenerating it. If your agent spawns sub-agents, pass the same root trace ID down or generate a child span ID that references the parent, so the full tree of execution is navigable as a unit.
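One way to propagate the ID through the execution context, sketched with Python's standard `contextvars` module. The names `start_trace`, `current_trace_id`, and `child_span` are hypothetical helpers for illustration, not part of any particular framework.

```python
import contextvars
import uuid

# Hypothetical module-level context variable that holds the active trace ID.
_trace_id = contextvars.ContextVar("trace_id", default=None)


def start_trace():
    """Generate the trace ID once, at agent entry, and bind it to the context."""
    trace_id = uuid.uuid4().hex
    _trace_id.set(trace_id)
    return trace_id


def current_trace_id():
    """Read the trace ID set at agent entry; never regenerate it here."""
    return _trace_id.get()


def child_span(parent_span_id=None):
    """Identifiers for a sub-agent: same root trace ID, new span ID, parent link."""
    return {
        "trace_id": current_trace_id(),
        "span_id": uuid.uuid4().hex[:16],
        "parent_span_id": parent_span_id,
    }
```

asyncio tasks copy the current context when they are created, so the ID follows async steps without being passed as an argument. Threads you start manually generally need the ID handed over explicitly.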

  3.

    Instrument LLM calls to capture input, output, and metadata

    For each call to a language model, record: the full prompt (including system message, any injected context, and the user turn), the full completion, the model name and version, token counts for both input and output, latency from request sent to response received, any error code if the call failed, and the trace ID. If your framework wraps model calls, add instrumentation at the wrapper layer so every call is captured consistently. Token counts and latency are most useful when aggregated by prompt template or agent step, so include a step label or prompt identifier alongside the call-level data.
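A minimal sketch of wrapper-layer instrumentation. It assumes a hypothetical `call_model` callable that returns a dict with `text`, `input_tokens`, and `output_tokens`; real client libraries expose these under different names, so adapt the field access to yours. The `sink` list stands in for whatever log destination you use.

```python
import json
import time


def instrumented_llm_call(call_model, *, messages, model, trace_id, step, sink):
    """Call the model and append one JSON record per call to `sink`.

    `call_model` is a hypothetical stand-in for your client or framework call.
    """
    record = {
        "event": "llm_call",
        "trace_id": trace_id,
        "step": step,          # step label or prompt identifier for aggregation
        "model": model,
        "messages": messages,  # the rendered prompt, not the template
    }
    started = time.perf_counter()
    try:
        response = call_model(model=model, messages=messages)
        record.update(
            status="ok",
            completion=response["text"],
            input_tokens=response["input_tokens"],
            output_tokens=response["output_tokens"],
        )
        return response
    except Exception as exc:
        record.update(status="error", error_type=type(exc).__name__, error=str(exc))
        raise
    finally:
        # Runs on success and on failure, so failed calls are recorded too.
        record["latency_ms"] = round((time.perf_counter() - started) * 1000, 2)
        sink.append(json.dumps(record, default=str))
```

Writing the record in `finally` means a failed call still produces a log line with its latency and error details before the exception propagates.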

  4.

    Instrument tool calls to capture what was requested and what came back

    For each tool invocation, record: the tool name, the input arguments exactly as the model passed them, the output returned to the model, call duration, whether the call succeeded or errored, and the trace ID. Tool calls are where agents interact with external systems, so their logs are the audit record of what the agent actually did in the world. If a tool modifies state — writes to a database, sends a message, calls an API — record enough to reconstruct what was changed, not just that the tool was called.
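The same idea can be sketched for tools as a decorator. `traced_tool`, `TOOL_LOG`, and the `update_ticket` tool are hypothetical names for illustration; the sketch assumes tool arguments and results can be serialised to JSON.

```python
import functools
import json
import time

TOOL_LOG = []  # hypothetical in-memory sink; replace with your log destination


def traced_tool(name, mutates_state=False):
    """Wrap a tool so every invocation is logged with its verbatim arguments."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(trace_id, **arguments):
            record = {
                "event": "tool_call",
                "trace_id": trace_id,
                "tool": name,
                "arguments": arguments,  # exactly as passed, no truncation
                "mutates_state": mutates_state,
            }
            started = time.perf_counter()
            try:
                result = fn(**arguments)
                record.update(status="ok", output=result)
                return result
            except Exception as exc:
                record.update(status="error", error_type=type(exc).__name__, error=str(exc))
                raise
            finally:
                record["duration_ms"] = round((time.perf_counter() - started) * 1000, 2)
                TOOL_LOG.append(json.dumps(record, default=str))
        return wrapper
    return decorator


@traced_tool("update_ticket", mutates_state=True)
def update_ticket(ticket_id, status):
    # Hypothetical state-changing tool: return the before and after values so
    # the log shows what changed, not only that the tool ran.
    return {"ticket_id": ticket_id, "previous_status": "open", "new_status": status}
```

Calling `update_ticket("trace-1", ticket_id="42", status="closed")` appends one record to `TOOL_LOG` containing the arguments, the returned before and after values, the duration, and the trace ID.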

  5.

    Add step-level events for agent state transitions

    Beyond individual LLM and tool calls, record events at the agent workflow level: task start (with the initial goal or input), each step boundary in a multi-step plan, escalations or human handoffs, and task completion (with a success or failure status and the final output). These events give you the skeleton of the execution that the LLM and tool logs flesh out. When debugging a failure, you typically start with the step-level timeline to locate where things went wrong, then drill into the LLM and tool logs for that step.
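A sketch of the workflow-level skeleton, using a context manager to mark step boundaries. The event names (`task_start`, `step_start`, `step_end`, `task_end`) and the two placeholder steps are assumptions chosen for illustration.

```python
import time
from contextlib import contextmanager

EVENTS = []  # hypothetical in-memory sink


def emit(event, trace_id, **fields):
    EVENTS.append({"event": event, "trace_id": trace_id, "ts": time.time(), **fields})


@contextmanager
def step(trace_id, name):
    """Emit a start event and a matching end event with ok or error status."""
    emit("step_start", trace_id, step=name)
    try:
        yield
    except Exception as exc:
        emit("step_end", trace_id, step=name, status="error", error=str(exc))
        raise
    else:
        emit("step_end", trace_id, step=name, status="ok")


def run_task(trace_id, goal):
    emit("task_start", trace_id, goal=goal)
    try:
        with step(trace_id, "plan"):
            plan = ["look up data", "draft answer"]  # placeholder for real planning
        with step(trace_id, "execute"):
            output = f"{goal}: completed {len(plan)} planned actions"  # placeholder
    except Exception as exc:
        emit("task_end", trace_id, status="failure", error=str(exc))
        raise
    emit("task_end", trace_id, status="success", output=output)
    return output
```

Filtering `EVENTS` by trace ID and sorting by `ts` yields the step-level timeline; the step name on each event tells you which LLM and tool records to inspect next.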

  6.

    Set up anomaly detection on the metrics that matter to you

    Once baseline data is flowing, define thresholds that trigger investigation. Useful signals include: error rate per agent type rising above baseline, average steps-per-task increasing significantly (which often indicates reasoning loops), tool call failure rate by tool (which catches external dependency degradation), and task completion rate dropping. These do not need to be sophisticated — a threshold alert on a rolling average is sufficient to catch most regressions before users report them. Tune thresholds against your baseline data rather than using arbitrary values.
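A threshold alert on a rolling average can be sketched in a few lines. `RollingRateAlert` is a hypothetical class, and the `baseline` and `tolerance` values are placeholders you would derive from your own data.

```python
from collections import deque


class RollingRateAlert:
    """Signal when the failure rate over the last `window` outcomes exceeds
    baseline + tolerance."""

    def __init__(self, window, baseline, tolerance):
        self.outcomes = deque(maxlen=window)
        self.threshold = baseline + tolerance

    def observe(self, failed):
        """Record one outcome; return True if the alert should fire."""
        self.outcomes.append(1 if failed else 0)
        if len(self.outcomes) < self.outcomes.maxlen:
            return False  # not enough data yet to compare against the baseline
        rate = sum(self.outcomes) / len(self.outcomes)
        return rate > self.threshold
```

The same shape works for tool failure rate per tool or for steps-per-task: keep one instance per agent type or tool, and feed it a numeric value instead of 0 or 1 if you are tracking an average rather than a rate.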

Common pitfalls

  • Recording only errors and omitting successful runs: you need the full distribution of executions to detect quality degradation, not just failure cases.
  • Logging the prompt template but not the rendered prompt with values filled in: the filled-in prompt is what the model received and what you need to reproduce a failure.
  • Using different trace IDs for the same logical run because the ID was not propagated through sub-agents or async steps.
  • Not capturing tool call inputs verbatim: if you summarize or truncate tool arguments in the log, you cannot reliably reproduce the call.
  • Storing raw prompts and completions without access controls: this data may contain sensitive information from the context window and should be treated accordingly.

Frequently asked questions

Do I need a third-party observability platform or can I use my existing logging stack?

Your existing logging stack is sufficient for the basics — structured JSON logs with a trace ID will let you query and reconstruct executions. Dedicated observability platforms add value when you need cross-run analytics, visual trace exploration, or built-in quality evaluation. Start with your existing infrastructure and add a dedicated platform when log querying becomes the bottleneck.
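As an illustration of the basics, here is a sketch that emits structured JSON lines through Python's standard `logging` module. The `JsonFormatter` class and the `trace_id` and `fields` keys passed via `extra` are conventions assumed for this example, not a standard.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record):
        payload = {
            "level": record.levelname,
            "message": record.getMessage(),
            "trace_id": getattr(record, "trace_id", None),
        }
        # Merge any structured fields supplied through `extra`.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload, default=str)


logger = logging.getLogger("agent")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Every event carries the trace ID, so a run can be reconstructed by filtering on it.
logger.info("tool_call", extra={"trace_id": "abc123", "fields": {"tool": "search", "status": "ok"}})
```

Any log store that can filter on a JSON field can then reconstruct a run from its `trace_id`.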

How much does full prompt and completion logging cost in storage?

At typical token lengths, a single LLM call record including prompt and completion is a few kilobytes. Storage cost depends on call volume and retention period. For most teams, the cost is modest compared to the API cost of the calls themselves. If storage is a concern, consider sampling — logging every call during development, and a representative sample (with all errors) in production — rather than omitting prompt data entirely.
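One possible sampling rule, sketched below, hashes the trace ID so that every call in a run is kept or dropped together, and always keeps errors. The function name `should_log_payload` and the `sample_rate` parameter are hypothetical.

```python
import hashlib


def should_log_payload(trace_id, is_error, sample_rate):
    """Decide whether to store the full prompt and completion for a call.

    Errors are always kept. Other calls are kept when the trace ID hashes into
    the sampled fraction, so the decision is the same for every call in a run.
    """
    if is_error:
        return True
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") / 2**32  # value in [0, 1)
    return bucket < sample_rate
```

A run that fails late may already have skipped payloads for its earlier calls. If you need complete traces for failures, one option is to buffer a run's records in memory and write them all once the outcome is known.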
