RAG

Retrieval-augmented generation tutorial

Goal

Build a working retrieval-augmented generation pipeline that answers questions about a document corpus by finding relevant passages and generating answers grounded in those passages, without fabricating information from the model's training data.

Before you start

  • Access to a language model API or local model
  • A document corpus to query — at minimum, a folder of text files or PDFs
  • A Python environment with pip access, or equivalent programming environment

Steps

  1. 1

    Understand the RAG architecture before building

    Retrieval-augmented generation combines two components: a retrieval system that finds relevant document passages for a given query, and a language model that generates an answer using those passages as context. The retrieval step is what prevents fabrication — the model is instructed to answer only from the provided passages, not from general training knowledge. Before writing any code, confirm that your use case is appropriate for RAG: it works well for question answering over a fixed document corpus, and less well for tasks that require synthesizing across the entire corpus or drawing on knowledge not present in the documents.

  2. 2

    Prepare and chunk your documents

    Language models have context window limits, so you cannot pass an entire document corpus as context for every query. Documents must be split into chunks small enough to fit in context alongside other passages and the model's response. Chunk at natural boundaries — paragraphs, sections, or logical units — rather than at fixed character counts, which can split sentences mid-thought. Typical chunk sizes range from 256 to 1024 tokens. Smaller chunks retrieve more precisely but may lose context; larger chunks retain context but reduce precision. Test different sizes on a representative set of queries to find the right balance for your content.

  3. 3

    Generate embeddings and build a vector index

    Retrieval in RAG works by comparing the semantic similarity between the query and document chunks. Both query and chunks are converted to embedding vectors — numerical representations of meaning — and the closest chunks are retrieved. Use an embedding model to convert each chunk to a vector, then store the vectors in a vector index or database that supports similarity search. The embedding model you use for documents and the one you use for queries must be the same model, because different models produce incompatible vector spaces. Rebuild the index whenever documents are added or updated.

  4. 4

    Implement the retrieval step

    At query time, embed the user's question using the same embedding model used for indexing, query the vector index for the top-k most similar chunks (k=3 to 5 is a common starting point), and retrieve the corresponding chunk text. Test the retrieval step in isolation before connecting it to the language model: given a set of test questions, are the retrieved chunks actually relevant to those questions? Poor retrieval quality is the most common source of RAG failures, and diagnosing it requires inspecting retrieved chunks, not just final answers.

  5. 5

    Construct the prompt and generate an answer

    Pass the retrieved chunks as context to the language model with instructions to answer based only on the provided information. A basic prompt structure: a system instruction that specifies the model's role and instructs it not to answer from general knowledge, the retrieved chunks clearly labeled as source material, and the user's question. Test that the model respects the grounding instruction: on questions where the answer is not in the retrieved chunks, the model should say it does not have enough information rather than generating a plausible-sounding answer from training data.

  6. 6

    Evaluate and iterate on retrieval quality

    Evaluate the pipeline on a set of test questions with known correct answers from the corpus. For each question, check: were the correct chunks retrieved? Did the model answer accurately from those chunks? When the answer is wrong, trace back to whether retrieval or generation failed — retrieval failure is more common and is fixed by adjusting chunk size, the number of chunks retrieved, or the chunking strategy. Generation failures — where the right chunks were retrieved but the model answered incorrectly — point to prompt issues or model capability limits for the task.

Common pitfalls

  • Chunking documents at fixed token counts without respecting natural boundaries: mid-sentence or mid-paragraph chunks degrade retrieval quality.
  • Using different embedding models for indexing documents and encoding queries: vectors from different models are not comparable, producing garbage retrieval results.
  • Skipping isolated retrieval testing and only evaluating end-to-end answers: retrieval failures are harder to diagnose from final answer quality alone.
  • Not grounding the model's answer in the retrieved context: without an explicit instruction to answer only from provided passages, the model will mix retrieved information with training data, making hallucination harder to detect.
  • Treating RAG as a finished solution without ongoing evaluation: document corpora change, and retrieval quality should be re-evaluated after significant corpus updates.

Frequently asked questions

What vector database should I use for RAG?

For a basic implementation, an in-memory vector search library is sufficient — no separate database infrastructure required. For production systems with large corpora or high query volume, a dedicated vector database provides persistence, scalability, and more efficient indexing. The choice depends on corpus size, query volume, and whether you need the index to persist across restarts. Start with the simplest option that meets your requirements.

How many chunks should I retrieve per query?

Three to five is a common starting point. Retrieving more chunks increases the chance of including the relevant passage but also increases context window usage and may introduce noise. The right number depends on how specific your queries are relative to your chunk size — narrow queries over well-chunked documents may need only one or two chunks; broad queries may need more. Test on representative queries and measure whether accuracy improves as you increase k, and where additional chunks stop adding value.

Is your organisation ready for AI agents?

Take the assessment →