What LLM observability tracks
Each LLM call should produce a record containing, at minimum: the prompt sent, the completion received, the model and version, input and output token counts, wall-clock latency, cost if using a paid API, and any error codes. Beyond these basics, good observability adds quality annotations: whether the response followed instructions, avoided hallucination, stayed on topic, and met application-specific criteria. That quality layer is what separates a plain log from useful observability data.
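One way to picture such a record is as a small data structure. The sketch below is only illustrative: the class name `LLMCallRecord`, its field names, and the shape of the `quality` dictionary are assumptions made for this example, not a standard schema.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass
class LLMCallRecord:
    """Hypothetical per-call record; all field names are illustrative."""
    prompt: str
    completion: str
    model: str
    model_version: str
    input_tokens: int
    output_tokens: int
    latency_s: float                   # wall-clock latency in seconds
    cost_usd: Optional[float] = None   # None when the model is not billed per call
    error_code: Optional[str] = None   # None when the call succeeded
    # Quality annotations, e.g. {"followed_instructions": True, "on_topic": True}
    quality: dict = field(default_factory=dict)
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    def to_json(self) -> str:
        """Serialize the record as one JSON line for a log file or queue."""
        return json.dumps(asdict(self))


record = LLMCallRecord(
    prompt="Summarize the release notes.",
    completion="The release adds two features.",
    model="example-model",
    model_version="2024-01",
    input_tokens=12,
    output_tokens=7,
    latency_s=0.84,
    quality={"followed_instructions": True, "on_topic": True},
)
line = record.to_json()
```

Keeping the quality annotations in the same record as the operational fields means a later query can relate, say, latency or model version to instruction-following without joining separate stores.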
How it differs from classic application monitoring
In conventional software, a deterministic function returns the same output for the same input. LLM outputs are probabilistic and sensitive to prompt wording, temperature settings, context length, and model version. A prompt that worked well last week may degrade after a model update. Observability catches this by giving you enough data to compare behavior over time, prompt by prompt and model version by model version, and to flag shifts in output distributions that raise no explicit errors.
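As a rough illustration of comparing behavior across model versions, the sketch below groups logged calls by version and flags a drop in the share of responses that passed a quality check. The record keys (`model_version`, `passed`), the function names, and the 0.1 threshold are all assumptions for this example; a real setup would likely use larger samples and a proper statistical test.

```python
from collections import defaultdict
from statistics import mean


def pass_rate_by_version(records):
    """Return {model_version: fraction of records whose quality check passed}."""
    grouped = defaultdict(list)
    for r in records:
        grouped[r["model_version"]].append(1.0 if r["passed"] else 0.0)
    return {version: mean(values) for version, values in grouped.items()}


def flag_shift(rates, baseline, candidate, max_drop=0.1):
    """True when the candidate's pass rate fell more than max_drop below baseline."""
    return rates[baseline] - rates[candidate] > max_drop


# Hypothetical log: the same prompt evaluated under two model versions.
records = [
    {"model_version": "v1", "passed": True},
    {"model_version": "v1", "passed": True},
    {"model_version": "v1", "passed": True},
    {"model_version": "v1", "passed": True},
    {"model_version": "v2", "passed": True},
    {"model_version": "v2", "passed": False},
    {"model_version": "v2", "passed": True},
    {"model_version": "v2", "passed": False},
]

rates = pass_rate_by_version(records)
degraded = flag_shift(rates, baseline="v1", candidate="v2")
```

Neither version threw an error in this toy log; the regression only shows up because the quality outcome was recorded alongside the version.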
Where to start
The practical entry point is capturing all prompts and completions with timestamps and session identifiers. From there, add latency, token-count, and cost tracking. Once baseline metrics are in place, layer in quality evaluation; even manual sampling of a small fraction of completions reveals patterns. The goal is to move from 'the LLM sometimes gives bad answers' to 'the LLM gives bad answers under this specific condition, which occurs at a measurable rate.'
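The first steps above can be sketched as a thin wrapper around whatever function actually calls the model. Everything here is hypothetical: `observed_call`, the `llm_fn` callable, the in-memory `log` list, and the 5% default sample rate are stand-ins chosen for this example. Token counts and cost are left out because where they come from depends on the provider's response format.

```python
import random
import time
from datetime import datetime, timezone


def observed_call(llm_fn, prompt, session_id, log, sample_rate=0.05):
    """Call llm_fn(prompt) and append one log entry, whether it succeeds or fails.

    llm_fn is any callable that takes a prompt string and returns a completion
    string; log is any object with an append method (a list in this sketch).
    """
    started = time.perf_counter()
    completion, error = None, None
    try:
        completion = llm_fn(prompt)
        return completion
    except Exception as exc:
        error = repr(exc)
        raise  # logging must not swallow the failure
    finally:
        log.append({
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "session_id": session_id,
            "prompt": prompt,
            "completion": completion,
            "error": error,
            "latency_s": time.perf_counter() - started,
            # Randomly mark a small fraction of calls for manual quality review.
            "needs_review": random.random() < sample_rate,
        })


def fake_llm(prompt):
    """Stand-in for a real model call, used only to exercise the wrapper."""
    return prompt.upper()


log = []
answer = observed_call(fake_llm, "hello", session_id="session-1", log=log)
```

Writing the entry in a `finally` block means failed calls are recorded too, which matters when the aim is to tie bad behavior to specific conditions.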