A language model knows what it was trained on, and nothing after, and nothing private. Retrieval-augmented generation is the standard answer to that limit: before the model writes a word, the system searches a knowledge source for the passages most relevant to the question and places them in front of the model as context. The model then generates from what it just read rather than from what it once memorised. The technique is why a chatbot can answer questions about your policies, your codebase, or yesterday's documents without anyone retraining anything.
The mechanics are a pipeline. Documents are split into chunks: passages small enough to search precisely and large enough to carry meaning. Each chunk is converted into an embedding, a numeric representation of its meaning, and stored in a vector index. At question time the query is embedded the same way, the index returns the nearest chunks, and the winners are assembled into the prompt alongside the question. The design choices in that pipeline (chunk size, what gets embedded, how many results are returned, whether a re-ranker filters them) often move answer quality more than swapping the model does, which is why RAG work is mostly retrieval work.
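The pipeline just described can be sketched in a few dozen lines. This is a toy illustration, not a production design: the `embed` function is a bag-of-words counter standing in for a real embedding model, the "vector index" is a plain list scanned in full, and every name here (`chunk_text`, `build_index`, `retrieve`, `build_prompt`, the sample documents) is hypothetical.

```python
import math
import re
from collections import Counter


def chunk_text(text: str, max_words: int = 40) -> list[str]:
    """Naive fixed-size splitter: chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def embed(text: str) -> Counter:
    """ASSUMPTION: toy stand-in for an embedding model (word counts, not meaning)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(documents: dict[str, str], max_words: int = 40) -> list[tuple[str, str, Counter]]:
    """Chunk every document and store (chunk_id, chunk_text, embedding) entries."""
    index = []
    for doc_id, text in documents.items():
        for n, chunk in enumerate(chunk_text(text, max_words)):
            index.append((f"{doc_id}#{n}", chunk, embed(chunk)))
    return index


def retrieve(index: list[tuple[str, str, Counter]], query: str, top_k: int = 2) -> list[tuple[str, str]]:
    """Embed the query the same way as the chunks and return the top_k nearest."""
    q = embed(query)
    ranked = sorted(index, key=lambda entry: cosine(q, entry[2]), reverse=True)
    return [(chunk_id, chunk) for chunk_id, chunk, _ in ranked[:top_k]]


def build_prompt(query: str, hits: list[tuple[str, str]]) -> str:
    """Assemble retrieved chunks, labelled by id so they can be cited, with the question."""
    sources = "\n".join(f"[{chunk_id}] {chunk}" for chunk_id, chunk in hits)
    return (
        "Answer using only the sources below and cite their ids.\n\n"
        f"Sources:\n{sources}\n\nQuestion: {query}"
    )


# Hypothetical two-document corpus, for illustration only.
DOCS = {
    "leave-policy": "Employees accrue two days of paid leave per month. Unused leave expires in March.",
    "expenses": "Expense reports must be filed within thirty days. Receipts are required above fifty euros.",
}
INDEX = build_index(DOCS)

if __name__ == "__main__":
    question = "When does unused leave expire?"
    print(build_prompt(question, retrieve(INDEX, question, top_k=1)))
```

The knobs the paragraph names map onto parameters here: `max_words` is chunk size, `top_k` is how many results are returned, and a re-ranker would be one more function applied to the output of `retrieve` before `build_prompt`.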
What RAG buys, and what it cannot
The honest pitch is threefold: answers grounded in sources you control, currency without retraining, and citations — the retrieved chunks are evidence a reader can check. The equally honest limits: a RAG system is capped by its retrieval, because the model cannot use what the search failed to find; it remains a probabilistic generator, so grounding reduces invention without abolishing it; and the knowledge source becomes part of the attack surface — a poisoned or stale document becomes a confident answer, and retrieved content can carry injected instructions as easily as facts. Teams that treat the document corpus with the care they give code — ownership, freshness, review — get RAG's benefits; teams that point it at an unmaintained wiki get fluent recitations of whatever rotted there.
RAG in the agent era
In a classic RAG pipeline, retrieval happens once, before generation. Agentic systems fold retrieval into the loop: the agent decides *mid-task* that it needs something, queries for it — increasingly through MCP servers exposing search over real systems — reads the result, and decides again. That upgrade is real (the agent can notice a gap and go fill it) and it sharpens every caution above, because retrieved content now steers actions rather than wording. An agent that treats whatever it retrieved as instruction is an agent waiting to be steered; the controls in securing agentic AI — and the rule that consequential actions never proceed on retrieved content alone — exist for exactly this seam.
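One way to picture the rule that consequential actions never proceed on retrieved content alone is a small authorization gate sitting between the agent's decision and its execution. The sketch below is a minimal illustration under stated assumptions: the action names, the `origin` field, and the `authorize` function are all hypothetical, and it assumes the surrounding system can reliably tell whether an action was requested by the user or only suggested by retrieved text, which in practice is the hard part.

```python
from dataclasses import dataclass

# ASSUMPTION: hypothetical action names, chosen only for illustration.
CONSEQUENTIAL_ACTIONS = {"send_email", "delete_record", "transfer_funds"}


@dataclass(frozen=True)
class ProposedAction:
    name: str      # what the agent wants to do
    argument: str  # the payload for the action
    origin: str    # "user" if the user asked for it, "retrieved" if only retrieved text suggested it


def authorize(action: ProposedAction, human_approved: bool = False) -> bool:
    """Return True if the action may run.

    Low-impact steps (for example another search) always pass.
    Consequential actions pass only when the user requested them directly
    or when a human has approved them independently of the retrieved text.
    """
    if action.name not in CONSEQUENTIAL_ACTIONS:
        return True
    if action.origin == "user":
        return True
    return human_approved


if __name__ == "__main__":
    suggested_by_document = ProposedAction("send_email", "forward the report", origin="retrieved")
    print(authorize(suggested_by_document))                       # blocked without approval
    print(authorize(suggested_by_document, human_approved=True))  # allowed once a human confirms
```

The design choice is that retrieved text can inform what the agent proposes but is never sufficient on its own to authorize an action with side effects.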