Getting started

How to build an agentic AI system you can put in production

Goal

Build a first agentic AI system that does real work — and arrives in production with the identity, permissions, evaluation, and audit trail that let it stay there.

Before you start

  • API access to a model provider (or an internal gateway in front of one)
  • A sandbox or non-production environment the agent can act in safely
  • A named owner — a person, not a team — who answers for what the agent does

Steps

  1. Pick a task with a verifiable end state

    Choose work that is narrow, repeatable, and checkable: triage inbound tickets against a rubric, draft and file the weekly report, open a pull request that fixes lint findings. The test is whether you can state, in one sentence, how anyone would verify the task was done correctly. Avoid tasks whose writes are irreversible — your first agent will be wrong sometimes, and you want wrong to be cheap.

  2. Define the tool surface before you write a prompt

    List every action the agent may take as a named tool — read this API, write that record, send this message — and grant each tool the narrowest credential that works. The tool list is the agent's real permission boundary; the prompt is only advice. Give the agent its own service identity from day one, never a developer's personal token, so its actions are distinguishable and revocable later.
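
    One way to make the tool list the real boundary is a registry that refuses anything not registered. The sketch below is illustrative only: `Tool`, `ToolRegistry`, the `tickets:read` scope string, and the `svc-ticket-triage` identity are hypothetical names, and the handler is a stand-in for a real API call.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class Tool:
    name: str
    scope: str                      # narrowest credential scope this tool needs (assumed naming)
    handler: Callable[[dict], dict]


class ToolRegistry:
    """Holds the agent's whole tool surface; unregistered actions are refused."""

    def __init__(self, service_identity: str) -> None:
        self.service_identity = service_identity   # the agent's own identity, not a person's
        self._tools: Dict[str, Tool] = {}

    def register(self, tool: Tool) -> None:
        self._tools[tool.name] = tool

    def scopes(self) -> set:
        """Every credential scope the agent holds, for review and revocation."""
        return {tool.scope for tool in self._tools.values()}

    def call(self, name: str, args: dict) -> dict:
        if name not in self._tools:
            raise PermissionError(
                f"{self.service_identity} has no tool named {name!r}"
            )
        return self._tools[name].handler(args)


def read_ticket(args: dict) -> dict:
    # Stand-in for a real read-only API call.
    return {"id": args["id"], "subject": "Printer offline"}


registry = ToolRegistry(service_identity="svc-ticket-triage")
registry.register(Tool("read_ticket", "tickets:read", read_ticket))

print(registry.call("read_ticket", {"id": 7}))
```

    Whatever the prompt says, a call to an unregistered name such as `delete_ticket` raises `PermissionError` here, and `registry.scopes()` lists exactly what would need revoking.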

  3. Build the loop, not just the call

    An agentic system is a loop: the model plans a step, acts through a tool, observes the result, and decides what is next. Use a framework if it helps — LangGraph, CrewAI, and the provider SDKs all implement this pattern, and the [Model Context Protocol](/mcp) standardises how tools are exposed — but make the loop's exit conditions explicit: maximum steps, budget per task, and the states that end the run. A loop that cannot tell when it is done is the one that runs all weekend.
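
    A minimal sketch of such a loop, with each exit condition named in the return value. It assumes a hypothetical `plan` callable standing in for the model and an `act` callable standing in for tool execution; the `Step` shape, the status strings, and the default limits are illustrative, not any framework's API.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Step:
    tool: str
    args: dict
    cost: float          # cost of the model call that planned this step
    done: bool = False   # set by the planner when it believes the task is finished


def run_agent(
    plan: Callable[[Optional[dict]], Step],
    act: Callable[[str, dict], dict],
    max_steps: int = 10,
    budget: float = 1.00,
) -> dict:
    """Plan, act, observe, repeat. Every way out of the loop is explicit."""
    spent = 0.0
    observation: Optional[dict] = None
    for n in range(1, max_steps + 1):
        step = plan(observation)
        spent += step.cost
        if spent > budget:                       # exit: budget per task
            return {"status": "budget_exceeded", "steps": n, "spent": round(spent, 4)}
        if step.done:                            # exit: planner reached an end state
            return {"status": "done", "steps": n, "spent": round(spent, 4)}
        observation = act(step.tool, step.args)  # act, then observe the result
    # exit: maximum steps
    return {"status": "max_steps", "steps": max_steps, "spent": round(spent, 4)}


def stub_plan(observation: Optional[dict]) -> Step:
    # Stand-in planner: reads one ticket, then declares the task finished.
    if observation is None:
        return Step("read_ticket", {"id": 7}, cost=0.01)
    return Step("", {}, cost=0.01, done=True)


def stub_act(tool: str, args: dict) -> dict:
    # Stand-in for real tool execution.
    return {"tool": tool, "ok": True}


result = run_agent(stub_plan, stub_act)
print(result["status"], result["steps"])
```

    A planner that never sets `done` still stops here, with status `max_steps` or `budget_exceeded` depending on which limit it hits first.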

  4. Instrument from the first run

    Record every step the loop takes: which tool, what inputs, what came back, what it cost. Capture prompts and outputs with redaction where they touch sensitive data. This is not optional polish — the first time the agent does something surprising, the trace is the difference between a five-minute answer and an afternoon of guessing. Cost per completed task, computed from these traces, is also the number that decides whether the agent earns its keep.
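
    As a rough illustration, a trace can be a list of per-step records with redaction applied at write time. The field names, the `SENSITIVE_KEYS` set, and the choice to count failed tasks' spend in the cost figure are all assumptions made for this sketch.

```python
import json
import time
from dataclasses import asdict, dataclass, field
from typing import List, Optional, Set

SENSITIVE_KEYS = {"email", "password", "token"}   # assumed list; adjust to your data


def redact(payload: dict) -> dict:
    """Mask top-level sensitive fields before anything is stored."""
    return {k: ("[REDACTED]" if k in SENSITIVE_KEYS else v) for k, v in payload.items()}


@dataclass
class TraceEvent:
    task_id: str
    step: int
    tool: str
    inputs: dict
    output: dict
    cost: float
    ts: float = field(default_factory=time.time)


class Trace:
    def __init__(self) -> None:
        self.events: List[TraceEvent] = []

    def record(self, task_id: str, step: int, tool: str,
               inputs: dict, output: dict, cost: float) -> None:
        self.events.append(
            TraceEvent(task_id, step, tool, redact(inputs), redact(output), cost)
        )

    def to_jsonl(self) -> str:
        """One JSON object per step, ready to ship to a log store."""
        return "\n".join(json.dumps(asdict(e)) for e in self.events)


def cost_per_completed_task(trace: Trace, completed: Set[str]) -> Optional[float]:
    """Total spend, including failed tasks, divided by tasks that completed."""
    if not completed:
        return None
    return sum(e.cost for e in trace.events) / len(completed)


trace = Trace()
trace.record("t1", 1, "read_ticket", {"id": 7, "email": "a@example.com"}, {"ok": True}, 0.25)
trace.record("t1", 2, "set_priority", {"id": 7, "priority": "high"}, {"ok": True}, 0.25)
trace.record("t2", 1, "read_ticket", {"id": 8}, {"ok": False}, 0.5)   # t2 never completes

print(cost_per_completed_task(trace, {"t1"}))
```

    Dividing all spend by completed tasks only is a deliberate choice in this sketch: failed runs still cost money, so they count against the agent.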

  5. Evaluate before you grant autonomy

    Assemble a set of representative tasks with known-good outcomes and run the agent against it after every meaningful change. Score correctness, not completion — an agent that confidently finishes the wrong thing scores zero. Autonomy is then promoted on evidence: the agent runs supervised until its evaluation record says the supervision is theatre.
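
    The gap between completion and correctness is easy to show in a small harness. This sketch assumes exact-match scoring against a known-good answer, which suits rubric-style tasks; `EvalCase`, `evaluate`, and the stub agent are hypothetical names, and real tasks may need a richer comparison.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class EvalCase:
    task: str
    expected: str   # the known-good outcome for this task


def evaluate(agent: Callable[[str], str], cases: List[EvalCase]) -> dict:
    """Report completion and correctness separately so one cannot hide the other."""
    completed = 0
    correct = 0
    for case in cases:
        try:
            answer = agent(case.task)
        except Exception:
            continue            # a crashed run is neither completed nor correct
        completed += 1
        if answer == case.expected:
            correct += 1
    total = len(cases)
    return {
        "completion_rate": completed / total,
        "correctness": correct / total,
    }


def overconfident_agent(task: str) -> str:
    # Stand-in agent that always finishes and always gives the same answer.
    return "billing"


cases = [
    EvalCase("Invoice shows the wrong amount", "billing"),
    EvalCase("Cannot log in after password reset", "access"),
    EvalCase("Charged twice this month", "billing"),
    EvalCase("App crashes on launch", "bug"),
]

report = evaluate(overconfident_agent, cases)
print(report)
```

    The stub agent finishes every task and gets half of them right, so its completion rate alone would look perfect.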

  6. Put a human in front of the consequential writes

    Route actions that are expensive to reverse — payments, deletions, anything customer-visible — through an approval step, and keep the approval queue small enough that reviewing it stays a real act rather than a rubber stamp. The goal is not permanent supervision; it is a control you can loosen deliberately, one action class at a time, as the evaluation record earns it.
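
    A sketch of such a gate, assuming action classes are identified by tool name. `ApprovalGate`, `issue_refund`, and the in-memory queue are illustrative; a real deployment would persist the queue and record who approved what.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Set


@dataclass
class PendingAction:
    tool: str
    args: dict


@dataclass
class ApprovalGate:
    needs_approval: Set[str]                       # action classes a human must approve
    queue: List[PendingAction] = field(default_factory=list)

    def submit(self, tool: str, args: dict, execute: Callable[[str, dict], dict]) -> dict:
        """Run low-risk actions immediately; park consequential ones for review."""
        if tool in self.needs_approval:
            self.queue.append(PendingAction(tool, args))
            return {"status": "pending_approval"}
        return {"status": "executed", "result": execute(tool, args)}

    def approve(self, index: int, execute: Callable[[str, dict], dict]) -> dict:
        """A reviewer releases one queued action."""
        action = self.queue.pop(index)
        return {"status": "executed", "result": execute(action.tool, action.args)}

    def loosen(self, tool: str) -> None:
        """Remove one action class from review once the evaluation record earns it."""
        self.needs_approval.discard(tool)


executed: List[str] = []


def execute(tool: str, args: dict) -> dict:
    # Stand-in for real tool execution; records what actually ran.
    executed.append(tool)
    return {"ok": True}


gate = ApprovalGate(needs_approval={"issue_refund"})
first = gate.submit("read_ticket", {"id": 7}, execute)
second = gate.submit("issue_refund", {"id": 7, "amount": 20}, execute)

print(first["status"], second["status"], len(gate.queue))
```

    Calling `gate.loosen("issue_refund")` removes exactly one action class from review, which keeps the loosening deliberate and easy to audit.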

  7. Register it and hand it over

    Before the agent touches production, record it where your organisation keeps its agent inventory: owner, purpose, systems it reads and writes, credentials it holds, and its current autonomy level. Then write the one-page risk profile while the design is fresh. An agent that exists only in the repo of the person who built it is a shadow agent with extra steps.
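
    If your inventory accepts structured records, the entry might look something like the sketch below. The field names mirror the list above; the `AUTONOMY_LEVELS` values, the example agent, and the owner name are invented for illustration, so map them to whatever your inventory actually uses.

```python
import json
from dataclasses import asdict, dataclass
from typing import Tuple

# Assumed autonomy ladder; replace with your organisation's own levels.
AUTONOMY_LEVELS = ("supervised", "approval_required", "autonomous")


@dataclass(frozen=True)
class AgentRecord:
    name: str
    owner: str                      # a named person, not a team
    purpose: str
    reads: Tuple[str, ...]          # systems the agent reads
    writes: Tuple[str, ...]         # systems the agent writes
    credentials: Tuple[str, ...]    # credentials the agent holds
    autonomy_level: str

    def __post_init__(self) -> None:
        if not self.owner.strip():
            raise ValueError("an agent needs a named owner")
        if self.autonomy_level not in AUTONOMY_LEVELS:
            raise ValueError(f"unknown autonomy level: {self.autonomy_level!r}")


record = AgentRecord(
    name="ticket-triage",
    owner="Jane Doe",
    purpose="Triage inbound tickets against the support rubric",
    reads=("ticketing-api",),
    writes=("ticketing-api",),
    credentials=("svc-ticket-triage",),
    autonomy_level="supervised",
)

print(json.dumps(asdict(record), indent=2))
```

    Validating at construction time means an agent without an owner cannot be registered at all.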

Common pitfalls

  • Prompt-first development — weeks tuning instructions while the tool surface, credentials, and exit conditions stay an afterthought. The prompt is the most visible part of the system and the least load-bearing.
  • Shipping the demo loop. A demo needs only the happy path; production also needs a verify step that checks the result of each action before taking the next. A missing verify step is a common cause of agents that run all weekend.
  • Running on a personal token because the sandbox did. The day the builder leaves or rotates credentials, the agent dies — or worse, keeps acting under a human's name.
  • Skipping evaluation because the demo looked right. Five hand-picked runs are an anecdote; the eval set is what tells you the agent still works after the model version changes underneath it.
  • Measuring completion instead of correctness — agents complete tasks fluently whether or not the work is right, so completion rate flatters every agent that should worry you.

Frequently asked questions

Do I need a framework to build an agentic AI system?

No — the loop is a few hundred lines against any provider SDK, and starting bare teaches you what the frameworks abstract. Frameworks earn their place when you need durable state, parallel branches, or human-approval steps as first-class constructs rather than your own plumbing.

Python or TypeScript?

Whichever your team operates best. The ecosystem leans Python — most framework examples and evaluation tooling assume it — but the operational requirements in this guide are language-independent, and an agent your on-call rota cannot debug is the wrong agent in any language.
