What is retrieval-augmented generation?

RAG lets a model answer from your documents instead of only its training data. This explainer covers how retrieval works, where it fails, and what changes when agents do the retrieving.

A language model knows what it was trained on, and nothing after, and nothing private. Retrieval-augmented generation is the standard answer to that limit: before the model writes a word, the system searches a knowledge source for the passages most relevant to the question and places them in front of the model as context. The model then generates from what it just read rather than from what it once memorised. The technique is why a chatbot can answer questions about your policies, your codebase, or yesterday's documents without anyone retraining anything.

The mechanics are a pipeline. Documents are split into chunks: passages small enough to search precisely and large enough to carry meaning. Each chunk is converted into an embedding, a numeric representation of its meaning, and stored in a vector index. At question time the query is embedded the same way, the index returns the nearest chunks, and the top-ranked ones are assembled into the prompt alongside the question. The design choices in that pipeline (chunk size, what gets embedded, how many results are returned, whether a re-ranker filters them) often move answer quality more than swapping the model does, which is why RAG work is mostly retrieval work.
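The pipeline above can be sketched in a few functions. This is a minimal illustration, not a production design: the embedding here is a toy bag-of-words count standing in for a real embedding model, the "vector index" is a plain list scanned linearly, and every function name and the prompt wording are assumptions made for the example.

```python
import math
import re
from collections import Counter


def chunk(text: str, max_words: int = 40) -> list[str]:
    """Split text into consecutive chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def embed(text: str) -> Counter:
    """Toy embedding (assumption): lowercase word counts.
    A real system would call an embedding model here."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(documents: list[str], max_words: int = 40) -> list[tuple[str, Counter]]:
    """Chunk every document and store each chunk next to its embedding."""
    return [(c, embed(c)) for doc in documents for c in chunk(doc, max_words)]


def retrieve(index: list[tuple[str, Counter]], query: str, k: int = 2) -> list[str]:
    """Embed the query the same way and return the k most similar chunks."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


def build_prompt(query: str, passages: list[str]) -> str:
    """Assemble numbered passages and the question into one prompt string."""
    context = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return (
        "Answer using only the sources below.\n\n"
        f"Sources:\n{context}\n\nQuestion: {query}"
    )


# Hypothetical two-document corpus, for illustration only.
docs = [
    "Employees accrue 25 days of annual leave per year.",
    "The deploy script lives in the tools directory and requires Python.",
]
index = build_index(docs)
question = "How many days of annual leave?"
prompt = build_prompt(question, retrieve(index, question, k=1))
```

The knobs the paragraph names map directly onto parameters: `max_words` is chunk size, `embed` decides what gets embedded, and `k` is how many results reach the prompt. A re-ranker would be one more function between `retrieve` and `build_prompt`.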

What RAG buys, and what it cannot

The honest pitch is threefold: answers grounded in sources you control, currency without retraining, and citations — the retrieved chunks are evidence a reader can check. The equally honest limits: a RAG system is capped by its retrieval, because the model cannot use what the search failed to find; it remains a probabilistic generator, so grounding reduces invention without abolishing it; and the knowledge source becomes part of the attack surface — a poisoned or stale document becomes a confident answer, and retrieved content can carry injected instructions as easily as facts. Teams that treat the document corpus with the care they give code — ownership, freshness, review — get RAG's benefits; teams that point it at an unmaintained wiki get fluent recitations of whatever rotted there.
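One way to picture "ownership, freshness, review" in practice is a gate that keeps unowned or long-unreviewed documents out of the index. The sketch below is an assumption-laden illustration: the `Doc` fields, the `eligible` function, and the 180-day threshold are all hypothetical and not taken from any particular system.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass(frozen=True)
class Doc:
    doc_id: str
    owner: Optional[str]   # hypothetical field: who answers for this document
    last_reviewed: date    # hypothetical field: when someone last checked it


def eligible(docs: list[Doc], today: date, max_age_days: int = 180) -> list[Doc]:
    """Keep only documents that have an owner and a recent enough review."""
    cutoff = today - timedelta(days=max_age_days)
    return [d for d in docs if d.owner and d.last_reviewed >= cutoff]
```

A filter like this does not make a document correct; it only ensures that someone is accountable for it and has looked at it recently.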

RAG in the agent era

In a classic RAG pipeline, retrieval happens once, before generation. Agentic systems fold retrieval into the loop: the agent decides *mid-task* that it needs something, queries for it — increasingly through MCP servers exposing search over real systems — reads the result, and decides again. That upgrade is real (the agent can notice a gap and go fill it) and it sharpens every caution above, because retrieved content now steers actions rather than wording. An agent that treats whatever it retrieved as instruction is an agent waiting to be steered; the controls in securing agentic AI — and the rule that consequential actions never proceed on retrieved content alone — exist for exactly this seam.
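The rule that consequential actions never proceed on retrieved content alone can be sketched as a small check in the agent loop. This is a simplified illustration under stated assumptions: the action names, the `Evidence` type, and the two source labels are hypothetical, and a real deployment would need stronger provenance tracking than a string tag.

```python
from dataclasses import dataclass

# Hypothetical set of actions treated as consequential.
CONSEQUENTIAL = {"send_payment", "delete_records"}


@dataclass(frozen=True)
class Evidence:
    text: str
    source: str  # assumed labels: "retrieved" or "operator"


def may_proceed(action: str, evidence: list[Evidence]) -> bool:
    """Allow a consequential action only if at least one piece of
    supporting evidence did not come from retrieval."""
    if action not in CONSEQUENTIAL:
        return True
    return any(e.source != "retrieved" for e in evidence)
```

With this gate, a retrieved passage that reads like an instruction can still inform the agent's reasoning, but it cannot by itself authorise a payment or a deletion; something outside the retrieval path has to confirm it.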

Where to go from here

If you are building, RAG is usually a component inside a larger system — the agentic architecture explainer shows where it sits, and the build guide sequences the work around it. If you are assessing readiness, the question to ask of any RAG deployment is the same one the maturity curve asks of agents generally: who owns the knowledge source, who would notice it going wrong, and how would they prove it. The assessment will tell you where you stand.

For the agentic version of this pattern — retrieval folded into the agent loop — the agentic RAG explainer covers the specifics.

Frequently asked questions

What is retrieval-augmented generation (RAG)?

RAG lets a model answer from your documents instead of only its training data. Before the model writes, the system searches a knowledge source for the passages most relevant to the question and places them in front of the model as context, so it generates from what it just read.

How does a RAG pipeline work?

Documents are split into chunks, each converted into an embedding and stored in a vector index. At question time the query is embedded the same way, the index returns the nearest chunks, and they are assembled into the prompt alongside the question. Retrieval choices often move answer quality more than swapping the model does.

What are the limitations of RAG?

A RAG system is capped by its retrieval — the model cannot use what the search failed to find — and it remains a probabilistic generator, so grounding reduces invention without abolishing it. The knowledge source also becomes attack surface: a poisoned or stale document becomes a confident answer.

How does RAG change with AI agents?

Classic RAG retrieves once before generation; agentic systems fold retrieval into the loop, so the agent decides mid-task that it needs something, queries for it (often through MCP servers), reads the result, and decides again. Retrieved content now steers actions, so consequential actions should never proceed on it alone.
