Are AI voice agents just chatbots with text-to-speech?

The naive ones are, and they fail on contact: real calls involve interruption, crosstalk, mishearing, and silence, none of which a turn-based chatbot models. A production voice agent is built around streaming and recovery from the start, and governed on the understanding that its output is a spoken commitment, not reviewable text.

AI voice agents: the stack, the stakes, and the readiness gap

The stack under the voice

Three stages run in a loop measured in milliseconds: speech-to-text transcribes the caller, the language model — holding the conversation state and whatever tools the agent has — decides what to say or do, and text-to-speech answers. Latency budgets dominate every design choice, because a pause that reads as thoughtful in chat reads as broken on a call; production systems stream every stage, interleave listening with thinking, and handle interruptions as first-class events rather than errors. The model layer is the same [agent loop](/learn/agentic-ai-architecture) as everywhere else — voice changes the deadlines, not the architecture.
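
Treating an interruption as an event rather than an error can be sketched with a playback task that is cancelled the moment the caller starts speaking. This is a minimal illustration, not a production pipeline: `speak`, `run_turn`, and the `barge_in` event are hypothetical names, and the `sleep` call stands in for real audio playback.

```python
import asyncio


async def speak(chunks, spoken, delay=0.01):
    """Stand-in for streaming text-to-speech: 'plays' one chunk at a time."""
    for chunk in chunks:
        await asyncio.sleep(delay)  # simulated playback time per chunk (assumption)
        spoken.append(chunk)


async def run_turn(reply_chunks, barge_in, spoken):
    """Play a reply, but stop as soon as the caller starts talking.

    `barge_in` is an asyncio.Event that a (hypothetical) voice-activity
    detector would set when it hears caller speech.
    """
    playback = asyncio.create_task(speak(reply_chunks, spoken))
    interrupt = asyncio.create_task(barge_in.wait())

    done, pending = await asyncio.wait(
        {playback, interrupt}, return_when=asyncio.FIRST_COMPLETED
    )

    # Whichever task lost the race is cancelled; an interruption is a
    # normal outcome of the turn, not an exception path.
    for task in pending:
        task.cancel()
    await asyncio.gather(*pending, return_exceptions=True)

    return "completed" if playback in done else "interrupted"
```

After an `"interrupted"` result, `spoken` holds only the chunks the caller actually heard, which is what the conversation state would need to record before the loop goes back to listening.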

Where the stakes differ from chat

A chat agent's mistake sits in a transcript a user can re-read; a voice agent's mistake is spoken once into the air and acted on. Mishearing is a new failure class: a transcription error upstream becomes a confident wrong action downstream, so consequential commitments need read-back confirmation the way payments need second factors. And the medium carries its own compliance load: call recording is subject to consent rules that vary by jurisdiction, transcripts are personal data flowing through every vendor in the stack, and the disclosure question (does the caller know it is a machine?) is regulated in a growing number of places. Treat the legal review as part of the build, not the launch.
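
The read-back confirmation described above can be sketched as a small gate in front of tool execution. Everything here is an illustrative assumption: the `PendingAction` type, the `next_step` function, the wording of the prompts, and the short list of accepted affirmatives. A real system would need a much more robust way to interpret the caller's reply.

```python
from dataclasses import dataclass

# Replies treated as explicit confirmation (illustrative, deliberately narrow).
AFFIRMATIVE = {"yes", "correct", "that's right"}


@dataclass
class PendingAction:
    name: str
    params: dict
    consequential: bool  # True for commitments that must be read back first


def read_back_prompt(action):
    """Build the sentence that repeats what the agent believes it heard."""
    details = ", ".join(f"{key} {value}" for key, value in action.params.items())
    return f"To confirm: {action.name} with {details}. Is that correct?"


def next_step(action, caller_reply=None):
    """Decide whether to execute, ask for confirmation, or re-collect details."""
    if not action.consequential:
        return ("execute", None)
    if caller_reply is None:
        return ("ask", read_back_prompt(action))
    if caller_reply.strip().lower().rstrip(".!") in AFFIRMATIVE:
        return ("execute", None)
    # Anything short of an explicit yes is treated as a possible mishearing.
    return ("recollect", "Sorry, let me take those details again.")
```

The design choice worth noting is the default: an unclear or negative reply sends the agent back to collecting details rather than forward to the action.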

The readiness checklist, voice edition

Everything on the standard agent checklist applies ([identity, scoped tools, gates on consequential actions, audit trail](/guides/secure-agentic-ai)), with voice-specific additions: redaction of payment card and health details from recordings and traces at capture time, an escalation path to a human that works mid-call without losing context, and evaluation that scores transcription accuracy separately from decision quality, because they fail independently and only one of them is the language model's fault. The agents most likely to skip all this are the ones bought as a product rather than built, which is an argument for putting purchased voice agents through the same [inventory and registry](/guides/inventory-your-ai-agents) as everything else, not for exempting them.
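
Scoring transcription separately from decisions can be sketched as two independent metrics over the same set of evaluated calls. The version below assumes word error rate (word-level edit distance divided by reference length) as the transcription metric and exact match on the chosen action as the decision metric; the record fields `reference`, `transcript`, `expected_action`, and `actual_action` are hypothetical names for this example.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")

    # Standard dynamic-programming edit distance, kept to two rows.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,                          # word dropped by the transcriber
                cur[j - 1] + 1,                       # word inserted by the transcriber
                prev[j - 1] + (ref_word != hyp_word), # word substituted (or matched)
            ))
        prev = cur
    return prev[-1] / len(ref)


def score_calls(calls):
    """Report transcription and decision quality as separate numbers."""
    n = len(calls)
    mean_wer = sum(word_error_rate(c["reference"], c["transcript"]) for c in calls) / n
    decision_accuracy = sum(c["expected_action"] == c["actual_action"] for c in calls) / n
    return {"mean_wer": mean_wer, "decision_accuracy": decision_accuracy}
```

Keeping the two numbers apart makes it visible when a call was transcribed perfectly but decided wrongly, or misheard but still handled correctly, which a single blended score would hide.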
