Agent Observability

How to trace LLM calls

Goal

Set up LLM tracing to record the inputs, outputs, latency, and token usage of every model call in an AI application, giving you the observability data needed to debug failures, optimize costs, and monitor quality in production.

Before you start

  • An AI application that makes LLM API calls (any provider)
  • A tracing destination — a self-hosted or managed observability backend, or an LLM-specific tracing platform
  • Access to the application's source code to instrument the model calls

Steps

  1. 1

    Understand what LLM tracing records

    LLM tracing captures the full context of each model call as a structured record: the prompt sent (system message, user turns, any retrieved context), the model's response, the model and version used, the number of input and output tokens, latency from request start to response completion, and any tool calls with their arguments and results. For multi-step workflows and agents, traces group related calls into a parent span so you can see the full sequence of model interactions that produced a final output — not just individual isolated calls. This structured record is what makes debugging, cost attribution, and quality monitoring possible at scale.

  2. 2

    Choose an instrumentation approach

    Two instrumentation approaches exist. Auto-instrumentation wraps the LLM client library so that every call is traced automatically without code changes — most LLM observability tools provide an auto-instrumentation package for major providers that you initialize once and forget. Manual instrumentation gives you explicit control over what is captured and how spans are named and structured — you call the tracing SDK at the points in your code where model calls happen. Auto-instrumentation is the right starting point: it captures all calls immediately and requires minimal code. Add manual instrumentation on top when you need to capture application-specific context — user IDs, session IDs, request metadata — that the auto-instrumentation does not know about.

  3. 3

    Initialize your tracing backend

    A tracing backend receives, stores, and makes queryable the trace data your application sends. Options range from self-hosted systems built on OpenTelemetry-compatible collectors to managed LLM observability platforms purpose-built for model call data. For OpenTelemetry-based setups, configure an OTLP exporter pointing at your collector endpoint and set the service name and deployment environment as resource attributes — these attributes are how you filter traces by application and environment in the observability UI. For managed LLM platforms, follow the platform's SDK initialization instructions, which typically require setting an API key and optionally a project identifier. Initialize the tracing configuration before your application starts making model calls — typically at application startup or at the top of the entry module.

  4. 4

    Instrument your first model call and verify

    With initialization in place, make a model call through your application and verify that a trace appears in your observability backend. Check that the trace record contains: the prompt text, the completion text, the model identifier, token counts for input and output, and the call latency. If any of these are missing, the most common causes are: the auto-instrumentation package not wrapping the correct version of the client library, the tracing backend not receiving data due to a misconfigured endpoint or missing API key, or the client library being imported before the instrumentation was initialized (import order matters). Resolve missing fields before proceeding — traces with partial data are less useful and harder to correct retroactively once you have a large volume.

  5. 5

    Add parent spans for multi-step workflows

    If your application chains multiple model calls — an agent loop, a RAG pipeline, a multi-step generation workflow — wrap the full workflow in a parent span so related calls are grouped in the trace view. Create the parent span at the entry point of the workflow, pass the span context through to child operations, and close the parent span when the workflow completes. In OpenTelemetry terms this means starting a root span before the first model call and using context propagation so child model calls attach to it automatically. In LLM-specific platforms, the equivalent is a trace or session identifier attached to all calls within a workflow run. Without this grouping, a trace view shows a flat list of unrelated model calls with no way to see which calls belonged to the same user request.

  6. 6

    Capture application context as span attributes

    Model call data alone — prompt, response, tokens, latency — is necessary but not sufficient for debugging production issues. You also need to know which user made the request, which feature or endpoint triggered the workflow, and which version of your application was running. Add these as custom attributes on the trace or parent span: user ID or session ID, request ID for correlation with application logs, the specific workflow or feature name, and any other identifiers that will help you filter and group traces during investigation. Do not add personal data or sensitive content as trace attributes — trace data may be retained for extended periods and accessed by multiple team members.

  7. 7

    Set up cost and token tracking

    Token counts in trace data enable per-request and aggregate cost tracking. Most tracing backends can compute cost from token counts and model identifiers if you configure the cost per token for each model you use. Verify that input tokens, output tokens, and the model name are captured accurately in every trace — these are the inputs to cost calculations. Set up aggregate views or dashboards that show total token usage and estimated cost by model, by feature, and over time. This data is what lets you identify which parts of your application are the most expensive, whether costs are growing proportionally to usage, and where prompt optimization or model substitution would have the most impact.

  8. 8

    Define quality signals and sampling strategy

    Recording every model call at full fidelity is the right starting point but may not be sustainable at scale — high-volume applications can generate trace data faster than it can be stored and queried economically. Once baseline tracing is working, define a sampling strategy: trace all calls in development and staging; in production, consider head-based sampling that captures a percentage of requests, or tail-based sampling that captures all traces where a quality signal (high latency, error, low confidence score) was observed. Also define the quality signals you want to detect in traces: unexpectedly high latency, token counts that suggest runaway prompts, model errors, and downstream indicators like user abandonment or negative feedback that can be joined to the trace that produced the response. These signals turn a passive record into an active quality monitoring system.

Common pitfalls

  • Importing the LLM client library before initializing instrumentation — auto-instrumentation wraps the client at import time, so the order matters. Always initialize tracing before any other imports in your entry module.
  • Logging prompt content to traces without redacting personally identifiable information — trace data is often accessible to more team members and stored longer than application logs. Apply redaction or masking to user-generated content before it enters the trace.
  • Treating trace data as a compliance or audit log — tracing is an observability tool for debugging and optimization, not a system of record. Do not architect your application to rely on trace data for legal, billing, or compliance requirements; use dedicated logging systems for those purposes.
  • Skipping parent span setup for multi-step workflows and then wondering why the trace view shows unrelated calls — grouping is opt-in and requires explicit context propagation.

Frequently asked questions

Does LLM tracing affect application performance?

With asynchronous export configured — where trace data is buffered and sent in background batches — the overhead of tracing is negligible for most applications: typically under a millisecond per call. Synchronous export, where the application waits for the trace to be acknowledged before continuing, adds latency equal to the network round-trip to the tracing backend. Always use async export in production. The buffer and batch configuration in your tracing SDK controls how frequently batches are sent and how large they can grow before being flushed.

What is the difference between LLM tracing and application logging?

Application logs are text records of events, typically structured as time-ordered lines that you search by keyword or filter by log level. Traces are structured records of causally related operations — spans with parent-child relationships — that represent the full lifecycle of a request. For AI applications, traces capture the specific model inputs and outputs in a queryable format, enable latency breakdown across steps, and support cost attribution at the request level. Logs tell you something happened and when; traces tell you what the model received, what it returned, and how long each step took in the context of the full request.

Can I trace model calls through third-party libraries and frameworks?

Most major LLM frameworks — orchestration libraries, agent frameworks, RAG libraries — either emit OpenTelemetry spans natively or have official instrumentation packages that do. If a framework you use does not have instrumentation, you can add manual spans around the framework's entry and exit points to capture the workflow-level trace even if individual model calls within the framework are not individually traced. Check whether your framework's documentation covers observability integration before writing custom instrumentation.

Is your organisation ready for AI agents?

Take the assessment →