The indexing path

Documents enter, vectors come out, and three decisions on the way set the ceiling for everything downstream. Chunking decides the unit of retrieval — too small and meaning fragments, too large and queries match passages diluted with noise; most teams land between a paragraph and a page, split on structure rather than character counts. The embedding model decides what similarity means — and changing it later means re-embedding the corpus, which is why it is the stickiest choice in the stack. Metadata decides what you can filter by: source, date, access level. Skipping metadata at indexing time is the mistake teams pay for at query time, when they discover retrieval cannot respect permissions it never recorded.
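To make the chunking and metadata decisions concrete, here is a minimal sketch in Python. It assumes blank lines mark paragraph boundaries and that a character budget is a reasonable stand-in for chunk size. The `Chunk` type, the `chunk_by_paragraph` function, and the `max_chars` parameter are hypothetical names for illustration, not part of any particular library.

```python
import re
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    source: str        # where the chunk came from
    date: str          # ISO date string, recorded at indexing time
    access_level: str  # recorded now so retrieval can filter on it later


def chunk_by_paragraph(doc_text, source, date, access_level, max_chars=1200):
    """Split on blank lines, then pack whole paragraphs up to max_chars.

    Paragraphs are never cut in the middle; a single paragraph longer than
    max_chars becomes its own oversized chunk.
    """
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", doc_text) if p.strip()]
    chunks = []
    current = ""
    for para in paragraphs:
        candidate = f"{current}\n\n{para}" if current else para
        if current and len(candidate) > max_chars:
            # Adding this paragraph would exceed the budget: close the chunk.
            chunks.append(Chunk(current, source, date, access_level))
            current = para
        else:
            current = candidate
    if current:
        chunks.append(Chunk(current, source, date, access_level))
    return chunks
```

Every chunk carries its source, date, and access level from the start, so a later permission filter has something to filter on. A real pipeline would likely split on headings and list boundaries as well as paragraphs.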

The query path

The question is embedded with the same model as the chunks, the index returns the nearest neighbours, and the top results are assembled into the prompt. Two refinements separate production systems from demos. Hybrid retrieval pairs vector similarity with classic keyword search, because embeddings often miss exact identifiers (part numbers, error codes, names) that keyword search catches trivially. Re-ranking takes a generous candidate set and lets a stronger model order it before the handful that fit the context window are chosen; it is often the cheapest large quality win in a pipeline.
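One way to sketch the two refinements is shown below. The text does not prescribe how to merge vector and keyword results; this example uses reciprocal rank fusion, a common choice, with the conventional constant `k=60`. The function names and the `score_fn` callback, which stands in for whatever stronger re-ranking model is available, are assumptions for illustration.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of chunk ids into one ranked list.

    Each list contributes 1 / (k + rank) for every id it contains, so an id
    found by both vector and keyword search outranks an id found by one.
    Ties are broken by id to keep the output deterministic.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


def rerank(query, candidates, score_fn, top_n):
    """Order a generous candidate set with a stronger scorer, keep top_n.

    score_fn(query, chunk_id) is a placeholder for a re-ranking model.
    """
    ordered = sorted(candidates, key=lambda cid: score_fn(query, cid), reverse=True)
    return ordered[:top_n]
```

In use, the vector index and the keyword index each return their own ranked list of ids, `reciprocal_rank_fusion` merges them into a candidate set, and `rerank` trims that set to the few chunks that fit the context window.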

Where the architecture is heading

Two extensions matter operationally. Long-context models tempt teams to skip retrieval and stuff everything into the prompt, which works until the corpus grows, costs compound per call, and answers lose citations; retrieval stays the discipline that keeps context deliberate. And agentic systems fold the query path into their loop, retrieving mid-task as needs emerge. That pattern is covered in [agentic RAG](/learn/agentic-rag), which inherits this pipeline and adds decision-making on top of it. The architecture on this page is the stable core both futures build on.

Reading a RAG diagram like an operator

When someone shows you a RAG architecture diagram, the boxes are rarely where the risk is. Ask where access control is enforced — at retrieval, or nowhere. Ask what happens when retrieval returns nothing relevant — does the model say so, or improvise. Ask which boxes log their inputs and outputs, because a wrong answer with no trace of what was retrieved cannot be debugged, only re-rolled. The [build guide's](/guides/build-an-agentic-ai-system) instrumentation rule applies to every arrow in the picture.
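The three questions above can be sketched as a single step between retrieval and prompt assembly. This is a hedged illustration, not a reference design: the hit dictionary shape, the `min_score` threshold, the `build_context` name, and the `NO_ANSWER` message are all hypothetical.

```python
import json
import logging

logger = logging.getLogger("rag")

NO_ANSWER = "I could not find relevant material for that question."


def build_context(query, user_levels, hits, min_score=0.3):
    """Return (context, fallback); exactly one of the two is None.

    hits: list of dicts with 'id', 'text', 'score', 'access_level'.
    """
    # 1. Access control: drop anything the user is not cleared to see.
    allowed = [h for h in hits if h["access_level"] in user_levels]
    # 2. Relevance floor: weak matches count as "nothing relevant".
    relevant = [h for h in allowed if h["score"] >= min_score]
    # 3. Trace: record what was retrieved and what was actually used.
    logger.info(json.dumps({
        "query": query,
        "retrieved": [h["id"] for h in hits],
        "after_access_filter": [h["id"] for h in allowed],
        "used": [h["id"] for h in relevant],
    }))
    if not relevant:
        # Say so explicitly instead of letting the model improvise.
        return None, NO_ANSWER
    return "\n\n".join(h["text"] for h in relevant), None
```

Filtering after retrieval keeps the sketch short. Where the index supports metadata filters, pushing the access check into the index query itself is usually the safer choice, because restricted chunks then never leave the store.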