Core components of a RAG framework
A typical RAG framework has five components that work in sequence. The indexing pipeline takes source documents, splits them into chunks, and hands those chunks off to be embedded and stored as searchable representations. The embedding model converts text (both document chunks and incoming queries) into vector representations that capture semantic meaning. The vector store indexes the document embeddings and retrieves the most similar chunks for a given query vector. The retrieval layer receives a query, embeds it with the same embedding model, queries the vector store, and returns the text of the top-matching chunks. The generation layer passes the query and the retrieved chunks to a language model with instructions to answer based on the provided context. Most RAG frameworks are modular: each component can be swapped or reconfigured independently.
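The five components can be sketched in a few dozen lines of Python. This is a toy illustration under stated assumptions, not how any particular framework implements them: the bag-of-words embed function stands in for a real embedding model, VectorStore is a plain in-memory list, and llm is a hypothetical callable that takes a prompt string and returns text. All names here are illustrative.

```python
import math
import re
from collections import Counter


def chunk(text, max_words=40):
    """Indexing step: split a document into fixed-size word windows."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def embed(text):
    """Toy embedding: bag-of-words counts (a stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class VectorStore:
    """In-memory store: keeps (embedding, chunk text) pairs and does brute-force search."""

    def __init__(self):
        self.items = []

    def add(self, chunk_text):
        self.items.append((embed(chunk_text), chunk_text))

    def search(self, query_vec, k=2):
        ranked = sorted(self.items, key=lambda item: cosine(query_vec, item[0]), reverse=True)
        return [text for _, text in ranked[:k]]


def retrieve(store, query, k=2):
    """Retrieval layer: embed the query, search the store, return chunk text."""
    return store.search(embed(query), k=k)


def build_prompt(query, chunks):
    """Generation layer input: the query plus retrieved context and an instruction."""
    context = "\n".join(f"- {c}" for c in chunks)
    return f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {query}"


def answer(store, query, llm, k=2):
    """End-to-end call; llm is a hypothetical prompt -> text callable."""
    return llm(build_prompt(query, retrieve(store, query, k)))


# Small demo corpus (invented for illustration).
docs = [
    "The vector store indexes embeddings for similarity search.",
    "Chunking splits long documents into smaller passages.",
    "Bananas are rich in potassium.",
]
store = VectorStore()
for doc in docs:
    for piece in chunk(doc):
        store.add(piece)

print(retrieve(store, "How does the vector store search embeddings?", k=1))
```

Because each step is a separate function or class, swapping one out (for example, replacing embed with a call to a real embedding model) leaves the others untouched, which is the modularity described above.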
Naive vs. advanced RAG architectures
Naive RAG implements the basic pipeline: chunk, embed, retrieve top-k, generate. Advanced RAG architectures address the failure modes of naive retrieval, such as queries worded differently from the documents, top-k results that are only loosely relevant, and answers spread across several documents. Query rewriting improves retrieval by reformulating the user's query before embedding it. Hypothetical Document Embeddings (HyDE) generates a hypothetical answer to the query, embeds that answer, and retrieves the documents nearest to it, which can outperform direct query embedding for knowledge-intensive tasks. Re-ranking applies a cross-encoder model after initial retrieval, scoring each query-chunk pair jointly to re-order the retrieved chunks by relevance. Multi-hop retrieval handles questions that require synthesizing information from multiple documents by iterating between retrieval and generation steps.
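As a rough sketch of how query rewriting, HyDE, and re-ranking layer on top of naive retrieval, the function below wires them together as optional stages. Everything passed in is a hypothetical callable assumed for illustration: search(text, k) stands for the embed-and-search step, rewrite and hypothesize would normally call a language model, and rerank_score stands for a cross-encoder that scores a (query, chunk) pair. Multi-hop retrieval is left out to keep the example short.

```python
from typing import Callable, List, Optional


def advanced_retrieve(
    query: str,
    search: Callable[[str, int], List[str]],
    rewrite: Optional[Callable[[str], str]] = None,
    hypothesize: Optional[Callable[[str], str]] = None,
    rerank_score: Optional[Callable[[str, str], float]] = None,
    k_initial: int = 20,
    k_final: int = 5,
) -> List[str]:
    """Retrieve chunks with optional rewriting, HyDE, and re-ranking stages."""
    # Query rewriting: reformulate the query before anything is embedded.
    search_query = rewrite(query) if rewrite else query

    # HyDE: search with a generated hypothetical answer instead of the query text.
    search_text = hypothesize(search_query) if hypothesize else search_query

    # First stage: the cheap vector search over-fetches candidates.
    candidates = search(search_text, k_initial)

    # Second stage: re-score each (query, chunk) pair and keep the best ones.
    if rerank_score:
        candidates = sorted(
            candidates,
            key=lambda chunk_text: rerank_score(search_query, chunk_text),
            reverse=True,
        )
    return candidates[:k_final]
```

With all optional stages left as None, the function reduces to naive top-k retrieval, so each technique can be enabled and measured on its own.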
Evaluation and quality signals
RAG framework quality is measured at three levels. Retrieval quality: are the right chunks being retrieved for a given query? This can be evaluated with retrieval-specific metrics such as precision and recall over the top-k results, computed against annotated ground truth. Generation quality: is the model producing accurate answers from the retrieved context, or is it hallucinating facts not present in the chunks? This requires evaluating the faithfulness of the output to the source material. End-to-end quality: does the system answer the user's actual question correctly? A system can score well on both retrieval and generation and still fail end-to-end if the retrieved context is relevant but incomplete, or if the question requires synthesizing across more chunks than are retrieved.
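The retrieval-level metrics are simple to compute once each test query has a set of annotated relevant chunks. The sketch below assumes chunks are identified by unique string ids and that the ranked list contains no duplicates; the ids and the function name are made up for illustration. Faithfulness and end-to-end correctness usually need human or model-based judgment and are not covered here.

```python
def precision_recall_at_k(retrieved_ids, relevant_ids, k):
    """Precision and recall over the top-k retrieved chunk ids.

    retrieved_ids: ranked list of chunk ids returned by the retriever (assumed unique).
    relevant_ids: annotated ground-truth ids for the query.
    """
    top_k = retrieved_ids[:k]
    relevant = set(relevant_ids)
    hits = sum(1 for chunk_id in top_k if chunk_id in relevant)
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical annotations for one query.
retrieved = ["c1", "c4", "c2", "c9"]
relevant = {"c1", "c2", "c3"}

precision, recall = precision_recall_at_k(retrieved, relevant, k=4)
print(f"precision@4={precision:.2f} recall@4={recall:.2f}")
# prints: precision@4=0.50 recall@4=0.67
```

In this example two of the four retrieved chunks are relevant (precision 0.50), but one annotated chunk, c3, is never retrieved (recall 0.67). That is the "relevant but incomplete" situation that can cause an end-to-end failure even when generation is faithful.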