How to secure the AI agents you run
Put a working security model around your agentic AI — identity, scoped permissions, untrusted-input handling, and a kill switch — so an agent that goes wrong is contained rather than catastrophic.
Before you start
- An inventory of the agents in scope, with the systems each one reads and writes
- Authority to issue and revoke service credentials
- Somewhere durable to send logs — agent security work produces evidence, and it needs a home
Steps
- 1
Map each agent's attack surface before its threat list
For every agent, write down three things: the tools it can invoke, the credentials those tools carry, and every place untrusted content enters — user messages, retrieved documents, web pages, emails, API responses. That third list is the one teams skip, and it is where agent incidents start: anything an agent reads can try to steer it. The attack surface of an agent is its tool list multiplied by its input sources, not its model.
- 2
Give every agent its own identity and least-privilege scopes
One agent, one service identity, never a shared account or a developer's token. Then scope each tool's credential to the narrowest grant that works — read-only where the agent only reads, this-table not this-database, this-channel not this-workspace. The test of the setup is operational: can you revoke one agent in one step without breaking anything else, and can you tell, from any log line downstream, which agent acted?
- 3
Treat everything the agent reads as untrusted input
Prompt injection is the agent-era equivalent of SQL injection: instructions hidden in content the agent was merely supposed to process — a support ticket, a web page, a calendar invite. Defences are layered, not absolute: separate instructions from data in your prompting, strip or flag instruction-like content in retrieved material, and — the control that actually bounds the damage — refuse to let high-risk tool calls proceed on the strength of retrieved content alone. Assume injection will sometimes work, and design so that when it does, the blast radius is a scoped tool, not the estate.
- 4
Gate the consequential actions at runtime
Decide which action classes are expensive to reverse — payments, deletions, sending external messages, granting access — and put a control in front of each: a human approval, a policy check, a rate limit, a dollar ceiling. The gate must live outside the agent's own reasoning; an agent persuaded by injected instructions will also be persuaded that the action is fine. Runtime enforcement is what stands when the prompt falls.
- 5
Log every action as evidence, not telemetry
Record each tool call with its inputs, outputs, initiating agent identity, and the chain of reasoning context that led to it, redacting sensitive fields as you capture. The standard you are aiming for: a security reviewer can reconstruct any consequential action after the fact without asking the team that built the agent. If your audit trail cannot answer "why did the agent do that", it is monitoring, not security.
- 6
Red-team the agent before launch and after every model change
Run an adversarial pass against the assembled system, not the model in isolation: injection attempts through every input source, tool-misuse chains, data-exfiltration paths through innocuous-looking outputs. Re-run it when the underlying model version changes, because behaviour shifts under your feet even when your code does not. Keep the failing cases as a regression suite — the attacks that worked once are the first ones to try again.
- 7
Build the kill switch before you need it
Decide now how you stop an agent in under a minute: revoke its identity, disable its credentials at the providers, halt its orchestrator. Write the steps down where on-call can find them, and rehearse once — a revocation path that has never been exercised is a hypothesis. Containment speed, not prevention, is what separates an agent incident from an agent story.
Common pitfalls
- Securing the model instead of the tool surface. Jailbreak resistance is the vendor's problem; what the agent can write to is yours, and it is the part you control completely.
- A shared service account across agents — one compromise revokes everything or nothing, and the audit trail can no longer say which agent acted.
- Trusting retrieved content because it came from an internal source. The wiki page an agent reads may have been edited by anyone, including a previous agent.
- A one-time security review while the model underneath changes quarterly. Agent security has the shelf life of the model version it was tested against.
- Treating the audit trail as done because logs exist. Logs nobody can query during an incident are storage, not security.