Learn AI Series (#64) - Retrieval-Augmented Generation (RAG) - Basics
What will I learn
- You will learn why LLMs hallucinate and how RAG addresses the problem;
- the RAG pipeline: query, retrieve, generate;
- document chunking strategies -- size, overlap, and semantic boundaries;
- embedding models for retrieval -- choosing the right one;
- context window management -- fitting retrieved content into the prompt;
- building a working document Q&A system from scratch.
Requirements
- A working modern computer running macOS, Windows or Ubuntu;
- An installed Python 3(.11+) distribution;
- The ambition to learn AI and machine learning.
Difficulty
- Beginner
Curriculum (of the Learn AI Series):
- Learn AI Series (#1) - What Machine Learning Actually Is
- Learn AI Series (#2) - Setting Up Your AI Workbench - Python and NumPy
- Learn AI Series (#3) - Your Data Is Just Numbers - How Machines See the World
- Learn AI Series (#4) - Your First Prediction - No Math, Just Intuition
- Learn AI Series (#5) - Patterns in Data - What "Learning" Actually Looks Like
- Learn AI Series (#6) - From Intuition to Math - Why We Need Formulas
- Learn AI Series (#7) - The Training Loop - See It Work Step by Step
- Learn AI Series (#8) - The Math You Actually Need (Part 1) - Linear Algebra
- Learn AI Series (#9) - The Math You Actually Need (Part 2) - Calculus and Probability
- Learn AI Series (#10) - Your First ML Model - Linear Regression From Scratch
- Learn AI Series (#11) - Making Linear Regression Real
- Learn AI Series (#12) - Classification - Logistic Regression From Scratch
- Learn AI Series (#13) - Evaluation - How to Know If Your Model Actually Works
- Learn AI Series (#14) - Data Preparation - The 80% Nobody Talks About
- Learn AI Series (#15) - Feature Engineering and Selection
- Learn AI Series (#16) - Scikit-Learn - The Standard Library of ML
- Learn AI Series (#17) - Decision Trees - How Machines Make Decisions
- Learn AI Series (#18) - Random Forests - Wisdom of Crowds
- Learn AI Series (#19) - Gradient Boosting - The Kaggle Champion
- Learn AI Series (#20) - Support Vector Machines - Drawing the Perfect Boundary
- Learn AI Series (#21) - Mini Project - Predicting Crypto Market Regimes
- Learn AI Series (#22) - K-Means Clustering - Finding Groups
- Learn AI Series (#23) - Advanced Clustering - Beyond K-Means
- Learn AI Series (#24) - Dimensionality Reduction - PCA
- Learn AI Series (#25) - Advanced Dimensionality Reduction - t-SNE and UMAP
- Learn AI Series (#26) - Anomaly Detection - Finding What Doesn't Belong
- Learn AI Series (#27) - Recommendation Systems - "Users Like You Also Liked..."
- Learn AI Series (#28) - Time Series Fundamentals - When Order Matters
- Learn AI Series (#29) - Time Series Forecasting - Predicting What Comes Next
- Learn AI Series (#30) - Natural Language Processing - Text as Data
- Learn AI Series (#31) - Word Embeddings - Meaning in Numbers
- Learn AI Series (#32) - Bayesian Methods - Thinking in Probabilities
- Learn AI Series (#33) - Ensemble Methods Deep Dive - Stacking and Blending
- Learn AI Series (#34) - ML Engineering - From Notebook to Production
- Learn AI Series (#35) - Data Ethics and Bias in ML
- Learn AI Series (#36) - Mini Project - Complete ML Pipeline
- Learn AI Series (#37) - The Perceptron - Where It All Started
- Learn AI Series (#38) - Neural Networks From Scratch - Forward Pass
- Learn AI Series (#39) - Neural Networks From Scratch - Backpropagation
- Learn AI Series (#40) - Training Neural Networks - Practical Challenges
- Learn AI Series (#41) - Optimization Algorithms - SGD, Momentum, Adam
- Learn AI Series (#42) - PyTorch Fundamentals - Tensors and Autograd
- Learn AI Series (#43) - PyTorch Data and Training
- Learn AI Series (#44) - PyTorch nn.Module - Building Real Networks
- Learn AI Series (#45) - Convolutional Neural Networks - Theory
- Learn AI Series (#46) - CNNs in Practice - Classic to Modern Architectures
- Learn AI Series (#47) - CNN Applications - Detection, Segmentation, Style Transfer
- Learn AI Series (#48) - Recurrent Neural Networks - Sequences
- Learn AI Series (#49) - LSTM and GRU - Solving the Memory Problem
- Learn AI Series (#50) - Sequence-to-Sequence Models
- Learn AI Series (#51) - Attention Mechanisms
- Learn AI Series (#52) - The Transformer Architecture (Part 1)
- Learn AI Series (#53) - The Transformer Architecture (Part 2)
- Learn AI Series (#54) - Vision Transformers
- Learn AI Series (#55) - Generative Adversarial Networks
- Learn AI Series (#56) - Mini Project - Building a Transformer From Scratch
- Learn AI Series (#57) - Language Modeling - Predicting the Next Word
- Learn AI Series (#58) - GPT Architecture - Decoder-Only Transformers
- Learn AI Series (#59) - BERT and Encoder Models
- Learn AI Series (#60) - Training Large Language Models
- Learn AI Series (#61) - Instruction Tuning and Alignment
- Learn AI Series (#62) - Prompt Engineering - Getting the Most from LLMs
- Learn AI Series (#63) - Embeddings and Vector Search
- Learn AI Series (#64) - Retrieval-Augmented Generation (RAG) - Basics (this post)
Solutions to Episode #63 Exercises
Exercise 1: Document deduplication system with threshold-based flagging and precision/recall analysis.
from sentence_transformers import SentenceTransformer
import numpy as np
model = SentenceTransformer('all-MiniLM-L6-v2')
documents = [
# 5 pairs of near-duplicates
"Python is widely used for data science and machine learning.",
"Python is commonly used in data science and ML applications.",
"The Eiffel Tower stands 330 meters tall in Paris.",
"Standing at 330 meters, the Eiffel Tower is located in Paris.",
"Neural networks learn by adjusting weights through backpropagation.",
"Backpropagation adjusts neural network weights during the learning process.",
"Amsterdam is the capital city of the Netherlands.",
"The capital of the Netherlands is Amsterdam.",
"Random forests combine many decision trees for better predictions.",
"By combining multiple decision trees, random forests improve prediction accuracy.",
# 10 unique documents
"Quantum computing uses qubits instead of classical bits.",
"The Great Wall of China spans over 21,000 kilometers.",
"Photosynthesis converts sunlight into chemical energy in plants.",
"Shakespeare wrote approximately 37 plays during his lifetime.",
"The Pacific Ocean is the largest and deepest ocean on Earth.",
"DNA carries genetic instructions for biological development.",
"Mount Everest is the tallest mountain above sea level.",
"The speed of light is approximately 300,000 km per second.",
"Honey never spoils due to its low moisture content.",
"The human brain contains roughly 86 billion neurons.",
]
# Ground truth: pairs (i, j) that are duplicates
ground_truth = {(0,1), (2,3), (4,5), (6,7), (8,9)}
def find_duplicates(docs, threshold=0.85):
    embeddings = model.encode(docs, normalize_embeddings=True)
    n = len(docs)
    flagged = []
    for i in range(n):
        for j in range(i+1, n):
            sim = float(embeddings[i] @ embeddings[j])
            if sim >= threshold:
                flagged.append((i, j, sim))
    return flagged
print(f"{'Threshold':>10} {'Flagged':>8} {'TP':>5} {'FP':>5} "
      f"{'Precision':>10} {'Recall':>8}")
print("-" * 52)
for thresh in [0.70, 0.75, 0.80, 0.85, 0.90, 0.95]:
    flagged = find_duplicates(documents, threshold=thresh)
    flagged_pairs = {(i, j) for i, j, _ in flagged}
    tp = len(flagged_pairs & ground_truth)
    fp = len(flagged_pairs - ground_truth)
    precision = tp / len(flagged_pairs) if flagged_pairs else 0
    recall = tp / len(ground_truth)
    print(f"{thresh:>10.2f} {len(flagged):>8} {tp:>5} {fp:>5} "
          f"{precision:>10.2f} {recall:>8.2f}")
print("\nDetailed flags at threshold=0.85:")
for i, j, sim in find_duplicates(documents, 0.85):
    is_true = (i, j) in ground_truth
    tag = "TRUE DUP" if is_true else "FALSE POS"
    print(f" [{tag}] sim={sim:.3f}: "
          f"'{documents[i][:40]}' vs '{documents[j][:40]}'")
Lower thresholds catch more duplicates (higher recall) but also flag non-duplicates (lower precision). The sweet spot depends on your tolerance for false positives vs missing actual duplicates. For most deduplication tasks, 0.85 is a reasonable starting point -- precise enough to avoid flagging loosely related documents, sensitive enough to catch genuine paraphrases.
Exercise 2: IVF index built from scratch using NumPy.
import numpy as np
from sklearn.cluster import KMeans
import time
np.random.seed(42)
class SimpleIVF:
    def __init__(self):
        self.centroids = None
        self.clusters = {}
        self.vectors = None

    def train(self, vectors, n_clusters):
        km = KMeans(n_clusters=n_clusters, random_state=42, n_init=5)
        km.fit(vectors)
        self.centroids = km.cluster_centers_
        self.centroids = (self.centroids /
                          np.linalg.norm(self.centroids, axis=1, keepdims=True))

    def add(self, vectors):
        self.vectors = vectors
        # Assign each vector to nearest centroid
        sims = vectors @ self.centroids.T
        assignments = np.argmax(sims, axis=1)
        self.clusters = {}
        for idx, cluster_id in enumerate(assignments):
            self.clusters.setdefault(int(cluster_id), []).append(idx)

    def search(self, query, k=10, nprobe=5):
        # Find nprobe nearest centroids
        centroid_sims = query @ self.centroids.T
        top_clusters = np.argsort(centroid_sims)[::-1][:nprobe]
        # Brute force within those clusters
        candidate_indices = []
        for c in top_clusters:
            candidate_indices.extend(self.clusters.get(int(c), []))
        if not candidate_indices:
            return np.array([]), np.array([])
        candidates = np.array(candidate_indices)
        cand_vecs = self.vectors[candidates]
        sims = cand_vecs @ query
        top_local = np.argsort(sims)[::-1][:k]
        return candidates[top_local], sims[top_local]
# Benchmark
d = 128
n = 50_000
vectors = np.random.randn(n, d).astype('float32')
vectors = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
ivf = SimpleIVF()
ivf.train(vectors, n_clusters=100)
ivf.add(vectors)
# Brute force baseline
queries = np.random.randn(50, d).astype('float32')
queries = queries / np.linalg.norm(queries, axis=1, keepdims=True)
t0 = time.time()
exact_results = []
for q in queries:
    sims = vectors @ q
    exact_results.append(np.argsort(sims)[::-1][:10])
t_exact = time.time() - t0
print(f"{'nprobe':>7} {'Time (ms)':>10} {'Speedup':>8} {'Recall@10':>10}")
print("-" * 40)
for nprobe in [1, 3, 5, 10, 25]:
    t0 = time.time()
    ivf_results = []
    for q in queries:
        idx, _ = ivf.search(q, k=10, nprobe=nprobe)
        ivf_results.append(set(idx))
    t_ivf = time.time() - t0
    recalls = []
    for exact, approx in zip(exact_results, ivf_results):
        recalls.append(len(set(exact) & approx) / 10.0)
    mean_recall = np.mean(recalls)
    print(f"{nprobe:>7} {t_ivf*1000:>10.1f} {t_exact/t_ivf:>8.1f}x "
          f"{mean_recall:>10.3f}")
print(f"\nExact search: {t_exact*1000:.1f}ms for {len(queries)} queries")
The pattern is clear: nprobe=1 is fast but misses quite a few true neighbors (the query's nearest vectors might live in an adjacent cluster). nprobe=25 searches 25% of all clusters and gets recall above 95%. The tradeoff curve is exactly what production systems like FAISS optimize -- except FAISS also uses product quantization and SIMD instructions for the inner loop, which we're not doing here.
Exercise 3: Semantic search evaluation harness with standard IR metrics.
from sentence_transformers import SentenceTransformer
import numpy as np
model = SentenceTransformer('all-MiniLM-L6-v2')
topics = {
"programming": [
"Python is a high-level interpreted programming language.",
"JavaScript runs natively in web browsers for interactive pages.",
"Variable scope determines where a variable can be accessed.",
"Object-oriented programming uses classes to organize code.",
"Debugging involves finding and fixing errors in source code.",
"Compilation translates human-readable code into machine instructions.",
"Version control with Git tracks changes to source files.",
"APIs allow different software systems to communicate.",
"Unit tests verify that individual functions work correctly.",
"Recursion is when a function calls itself to solve a problem.",
],
"biology": [
"DNA carries the genetic instructions for all living organisms.",
"Mitosis is the process of cell division for growth and repair.",
"Enzymes are proteins that catalyze biochemical reactions.",
"Photosynthesis converts sunlight to chemical energy in plants.",
"Evolution occurs through natural selection over generations.",
"Neurons transmit electrical signals through the nervous system.",
"Red blood cells carry oxygen throughout the body.",
"Chromosomes contain tightly coiled DNA molecules.",
"Bacteria are single-celled organisms found everywhere on Earth.",
"The immune system defends the body against pathogens.",
],
"geography": [
"The Amazon River is the largest river by water volume.",
"Mount Everest is the highest point above sea level on Earth.",
"The Sahara Desert spans most of North Africa.",
"Tectonic plates float on the asthenosphere and cause earthquakes.",
"The Pacific Ocean is the largest ocean covering one-third of Earth.",
"Glaciers are large masses of ice that move slowly over land.",
"Volcanoes form at convergent plate boundaries and hotspots.",
"The equator divides the Earth into northern and southern hemispheres.",
"Coral reefs are underwater ecosystems built by tiny organisms.",
"Continental drift explains how continents moved over millions of years.",
],
"physics": [
"Gravity is a force that attracts objects with mass toward each other.",
"Light travels at approximately 300,000 km per second in vacuum.",
"Electrons orbit the nucleus of an atom in energy levels.",
"Entropy measures the disorder or randomness in a system.",
"Electromagnetic waves include radio, microwave, and visible light.",
"Newton's laws describe the relationship between force and motion.",
"Kinetic energy is the energy an object has due to its motion.",
"Quantum mechanics describes behavior at the atomic scale.",
"Friction is a force that opposes the relative motion of surfaces.",
"Thermodynamics studies heat, work, and energy transfer.",
],
"history": [
"The Roman Empire lasted from 27 BC until 476 AD in the west.",
"The printing press was invented by Gutenberg around 1440.",
"The Industrial Revolution began in Britain in the late 1700s.",
"World War II lasted from 1939 to 1945 and involved most nations.",
"The Renaissance was a cultural rebirth in Europe from the 1300s.",
"Ancient Egypt built pyramids as tombs for their pharaohs.",
"The French Revolution began in 1789 with the fall of the Bastille.",
"The Silk Road connected East Asia and Europe through trade routes.",
"The Cold War was a geopolitical rivalry between the US and USSR.",
"The Declaration of Independence was signed in 1776.",
],
}
# Build corpus
all_docs = []
doc_topics = []
for topic, docs in topics.items():
    for doc in docs:
        all_docs.append(doc)
        doc_topics.append(topic)
embeddings = model.encode(all_docs, normalize_embeddings=True)
queries = [
("How do computers run programs?", "programming"),
("What programming techniques help organize code?", "programming"),
("How do living cells reproduce?", "biology"),
("What molecules carry genetic information?", "biology"),
("What are the largest natural features on Earth?", "geography"),
("How does the surface of the Earth change?", "geography"),
("What forces govern the physical universe?", "physics"),
("How does energy move between systems?", "physics"),
("What were major conflicts in world history?", "history"),
("How did technology change civilization?", "history"),
]
def precision_at_k(retrieved, relevant, k):
    return sum(1 for d in retrieved[:k] if d in relevant) / k

def recall_at_k(retrieved, relevant, k):
    if not relevant:
        return 0
    return sum(1 for d in retrieved[:k] if d in relevant) / len(relevant)

def mrr(retrieved, relevant):
    for rank, d in enumerate(retrieved, 1):
        if d in relevant:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(retrieved, relevant, k):
    dcg = 0.0
    for i, d in enumerate(retrieved[:k]):
        if d in relevant:
            dcg += 1.0 / np.log2(i + 2)  # i+2 because i is 0-indexed
    # Ideal: all relevant docs at top positions
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / np.log2(i + 2) for i in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0
# Run evaluation
k_values = [1, 3, 5, 10]
results_by_topic = {}
for query_text, query_topic in queries:
    q_emb = model.encode([query_text], normalize_embeddings=True)
    scores = (embeddings @ q_emb.T).flatten()
    ranked = np.argsort(scores)[::-1]
    relevant = {i for i, t in enumerate(doc_topics) if t == query_topic}
    for k in k_values:
        key = (query_topic, k)
        if key not in results_by_topic:
            results_by_topic[key] = {"p": [], "r": [], "mrr": [], "ndcg": []}
        results_by_topic[key]["p"].append(precision_at_k(ranked, relevant, k))
        results_by_topic[key]["r"].append(recall_at_k(ranked, relevant, k))
        results_by_topic[key]["mrr"].append(mrr(ranked, relevant))
        results_by_topic[key]["ndcg"].append(ndcg_at_k(ranked, relevant, k))
print(f"{'Topic':<14} {'k':>3} {'P@k':>7} {'R@k':>7} {'MRR':>7} {'NDCG@k':>7}")
print("-" * 50)
for topic in topics:
    for k in k_values:
        key = (topic, k)
        m = results_by_topic[key]
        print(f"{topic:<14} {k:>3} {np.mean(m['p']):>7.3f} "
              f"{np.mean(m['r']):>7.3f} {np.mean(m['mrr']):>7.3f} "
              f"{np.mean(m['ndcg']):>7.3f}")
    print()
Topics with more distinctive vocabulary (biology, physics) tend to perform better because their terms are further apart in embedding space. More generic topics (programming, history) can overlap more -- "technology" appears in both programming and history contexts, which dilutes retrieval precision. This is exactly why domain-specific embedding models exist and why hybrid search (combining embeddings with keyword matching, as we covered in episode #63) outperforms pure semantic search in practice.
On to today's episode
Here we go! In episode #63 we built the entire infrastructure for semantic search -- embedding text into vectors, finding similar content using cosine similarity, scaling with vector databases and approximate nearest neighbor algorithms. All of that was the plumbing. Today we connect it to something practical that has changed how people actually use language models in the real world.
The problem is straightforward. LLMs have a knowledge cutoff. Everything they know was baked into their weights during pre-training (as we covered in episode #60). Ask about something that happened after the cutoff and they don't know. Ask about your company's internal docs and they can't have read them. Ask for a specific citation and they might hallucinate one that sounds completely convincing but points to a paper that doesn't exist.
Retrieval-Augmented Generation -- RAG -- fixes this by giving the model a reference library at inference time. Instead of relying solely on memorised knowledge in the weights, the model retrieves relevant documents from an external knowledge base and uses them as context when generating its response. It's the difference between answering from memory and answering with an open book in front of you, basically ;-)
The name comes from a 2020 paper by Lewis et al. at Facebook AI Research, but the core idea is almost embarrassingly intuitive: search first, then answer. And what makes it especially relevant for us right now is that we already built every component we need. Episode #63 gave us embeddings and vector search. Episode #62 gave us prompt engineering. This episode connects them into one complete system.
Why not just use a bigger context window?
Before we build anything, I want to address the obvious question. Modern LLMs have context windows of 128K tokens or more. Why not just dump all your documents into the prompt and skip the retrieval step?
Three reasons:
Cost: tokens cost money (or compute time, if you're running locally). Stuffing 100K tokens of mostly-irrelevant context into every query is wasteful. RAG retrieves only the 3-5 most relevant chunks -- maybe 1000-2000 tokens total. That's a 50-100x reduction in token usage per query.
Quality: this is the one that surprises people. More context doesn't always mean better answers. Research on the "lost in the middle" phenomenon (Liu et al., 2023) shows that LLMs pay much more attention to content at the beginning and end of the context, and tend to ignore information buried in the middle. With a 100K-token context, your answer might be on page 47 of the documents -- right in the zone the model ignores. With RAG, the relevant chunk is right there at the top.
Freshness: you can update a knowledge base in seconds. Replacing a document takes no retraining, no fine-tuning, no model changes. A context window approach requires you to re-assemble the entire document set for every query, which is fine for static collections but impractical when your data changes frequently.
# The economic argument for RAG
def estimate_cost(n_docs, avg_doc_tokens, queries_per_day,
                  cost_per_1k_tokens=0.01):
    """Compare costs: full context vs RAG retrieval."""
    # Full context: stuff everything in every time
    full_ctx_tokens = n_docs * avg_doc_tokens
    full_cost_daily = queries_per_day * (full_ctx_tokens / 1000) * cost_per_1k_tokens
    # RAG: retrieve top 5 chunks of ~300 tokens each
    rag_tokens = 5 * 300
    rag_cost_daily = queries_per_day * (rag_tokens / 1000) * cost_per_1k_tokens
    savings = full_cost_daily - rag_cost_daily
    ratio = full_cost_daily / rag_cost_daily if rag_cost_daily > 0 else float('inf')
    return {
        "full_context_daily": full_cost_daily,
        "rag_daily": rag_cost_daily,
        "savings_daily": savings,
        "cost_ratio": ratio,
    }
scenarios = [
("Small (100 docs, 50 queries/day)", 100, 500, 50),
("Medium (1K docs, 500 queries/day)", 1000, 500, 500),
("Large (10K docs, 5K queries/day)", 10000, 500, 5000),
]
print(f"{'Scenario':<40} {'Full ctx':>10} {'RAG':>10} {'Ratio':>8}")
print("-" * 72)
for name, n, avg, qpd in scenarios:
    r = estimate_cost(n, avg, qpd)
    print(f"{name:<40} ${r['full_context_daily']:>9.2f} "
          f"${r['rag_daily']:>9.2f} {r['cost_ratio']:>7.0f}x")
print("\nFor 10K documents queried 5K times/day, RAG is ~3333x cheaper.")
print("And it performs BETTER because of the 'lost in the middle' effect.")
The numbers speak for themselves. RAG isn't just a workaround for small context windows -- it's the architecturally correct approach even when you have a large context window. Having said that, the two approaches aren't mutually exclusive. Some production systems use RAG to retrieve relevant documents, then stuff those documents into a large context window alongside the conversation history. Best of both worlds.
The RAG pipeline
Every RAG system follows the same three steps:
- Query: the user asks a question
- Retrieve: find relevant documents from a knowledge base using embedding similarity (episode #63)
- Generate: feed the retrieved documents plus the question into the LLM's prompt and let it generate a grounded answer
The retrieval step is where our previous work pays off. We embed the query, search the vector database, get back the top-k most similar chunks, and inject those into the prompt as context.
from sentence_transformers import SentenceTransformer
import numpy as np
class SimpleRAG:
    def __init__(self, embed_model='all-MiniLM-L6-v2'):
        self.embedder = SentenceTransformer(embed_model)
        self.chunks = []
        self.embeddings = None

    def add_documents(self, texts):
        """Embed and store document chunks."""
        self.chunks = texts
        self.embeddings = self.embedder.encode(
            texts, normalize_embeddings=True)

    def retrieve(self, query, top_k=3):
        """Find top-k most relevant chunks for a query."""
        q_emb = self.embedder.encode(
            [query], normalize_embeddings=True)
        scores = self.embeddings @ q_emb.T
        top_idx = np.argsort(scores.flatten())[::-1][:top_k]
        return [(self.chunks[i], float(scores[i][0])) for i in top_idx]

    def build_prompt(self, query, contexts):
        """Build a retrieval-augmented prompt."""
        context_text = "\n\n---\n\n".join([c for c, _ in contexts])
        return f"""Answer the question based on the following context.
If the context doesn't contain enough information, say so.
Context:
{context_text}
Question: {query}
Answer:"""
# Usage
rag = SimpleRAG()
rag.add_documents([
"Python was created by Guido van Rossum and first released in 1991.",
"PyTorch is maintained by Meta and is the most popular deep learning framework.",
"The transformer architecture was introduced in the 2017 paper Attention Is All You Need.",
"BERT uses masked language modeling for pre-training on unlabeled text.",
"RAG combines retrieval with generation to reduce hallucination in LLMs.",
"Scikit-learn provides efficient tools for predictive data analysis.",
"Word2Vec introduced the idea of learning dense word representations.",
"GPT models are decoder-only transformers trained for next-token prediction.",
])
query = "When was Python created?"
contexts = rag.retrieve(query, top_k=2)
prompt = rag.build_prompt(query, contexts)
print("Retrieved contexts:")
for chunk, score in contexts:
    print(f" [{score:.3f}] {chunk}")
print(f"\n--- Generated Prompt ---\n{prompt}")
The key thing happening here: the LLM doesn't need to have memorised when Python was created. The relevant document was retrieved from the knowledge base and placed directly in the prompt. The model reads it and synthesises an answer. If the information changes (say Python releases version 4.0 next year), you update the knowledge base, NOT the model. No retraining, no fine-tuning, no cost beyond swapping out a document.
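To make the "update the knowledge base, NOT the model" point concrete, here is a minimal sketch of an in-memory store where replacing a document re-embeds only that one entry. `TinyStore`, `toy_embed`, and the document ids are hypothetical names for illustration. `toy_embed` is a bag-of-words stand-in so the sketch runs without downloading a model; in the pipeline above you would call the real embedder instead.

```python
import math
from collections import Counter

def toy_embed(text):
    """Stand-in embedder (assumption): lowercase bag-of-words counts."""
    return Counter(text.lower().replace("?", "").replace(".", "").split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class TinyStore:
    """Hypothetical in-memory knowledge base keyed by document id."""
    def __init__(self):
        self.docs = {}      # doc_id -> text
        self.vectors = {}   # doc_id -> embedding

    def upsert(self, doc_id, text):
        # Adding and replacing are the same operation: only this
        # one entry is re-embedded, nothing else is touched.
        self.docs[doc_id] = text
        self.vectors[doc_id] = toy_embed(text)

    def retrieve(self, query, top_k=1):
        q = toy_embed(query)
        ranked = sorted(self.docs,
                        key=lambda d: cosine(q, self.vectors[d]),
                        reverse=True)
        return [self.docs[d] for d in ranked[:top_k]]

store = TinyStore()
store.upsert("refund-policy", "Refunds are accepted within 14 days of purchase.")
store.upsert("shipping", "Orders ship within 2 business days.")
before = store.retrieve("How many days do I have for refunds?")

# The policy changes: swap the document, no model change involved.
store.upsert("refund-policy", "Refunds are accepted within 30 days of purchase.")
after = store.retrieve("How many days do I have for refunds?")
print(before[0])
print(after[0])
```

The same query returns the old policy before the swap and the new one after it, and the generator never needs to know anything changed.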
Why RAG beats pure memorisation
Three fundamental advantages over relying on model weights alone:
Freshness: update your knowledge base any time. Company policies change? Swap the document. New product release? Add the spec. The LLM always answers from the latest information. Zero retraining involved.
Grounding: the model generates answers based on specific source documents, not vague training memories. You can trace every claim back to a source. This is critical for applications where accuracy matters -- legal, medical, financial. "According to document X, the answer is Y" is infinitely more useful than "I think the answer is Y" when the stakes are high.
Scale: an LLM's context window is limited (4K to 128K tokens depending on the model). But your knowledge base can contain millions of documents. RAG searches the entire knowledge base and retrieves only the relevant portions, effectively giving the LLM access to unlimited information within a fixed-size prompt.
# Demonstrating the three advantages
advantages = {
"Freshness": {
"scenario": "Company updated its refund policy yesterday",
"without_rag": "LLM uses 6-month-old policy from training data",
"with_rag": "LLM retrieves current policy document, answers correctly",
},
"Grounding": {
"scenario": "User asks about drug interactions",
"without_rag": "LLM might hallucinate plausible-sounding interactions",
"with_rag": "LLM cites specific medical database entries with sources",
},
"Scale": {
"scenario": "Knowledge base has 500,000 technical documents",
"without_rag": "Can't fit in context window. Fine-tuning is expensive.",
"with_rag": "Retrieves 3-5 relevant chunks per query. 500K docs searchable.",
},
}
for name, info in advantages.items():
    print(f"\n{name}:")
    print(f" Scenario: {info['scenario']}")
    print(f" Without RAG: {info['without_rag']}")
    print(f" With RAG: {info['with_rag']}")
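The grounding advantage only works if the prompt carries source identifiers through to the model. A minimal sketch, assuming each chunk is stored together with a source label: the `build_cited_prompt` name, the `(source, text)` tuple layout, and the file names are illustrative and not part of `SimpleRAG` above.

```python
def build_cited_prompt(query, contexts):
    """Build a prompt whose context blocks are numbered and labeled.

    contexts: list of (source, text) tuples (hypothetical layout).
    """
    blocks = []
    for n, (source, text) in enumerate(contexts, 1):
        blocks.append(f"[{n}] (source: {source})\n{text}")
    context_text = "\n\n".join(blocks)
    return (
        "Answer the question using only the numbered context below.\n"
        "Cite the block number, e.g. [1], after each claim.\n"
        "If the context doesn't contain enough information, say so.\n\n"
        f"Context:\n{context_text}\n\n"
        f"Question: {query}\nAnswer:"
    )

prompt = build_cited_prompt(
    "When was Python first released?",
    [("python_history.txt",
      "Python was created by Guido van Rossum and first released in 1991."),
     ("frameworks.txt",
      "Scikit-learn provides efficient tools for predictive data analysis.")],
)
print(prompt)
```

Whether the model actually follows the citation instruction depends on the model, so treat the cited block numbers as something to verify rather than trust blindly.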
Document chunking -- the hidden make-or-break step
Before you can retrieve documents, you need to chunk them. A 50-page PDF won't fit into an embedding model's input window (many models truncate their input after a few hundred tokens), and even if it did, a single embedding for the whole thing would be too diluted to match specific queries. A 200-word paragraph is specific enough to be retrievable but still contains enough context to be useful.
Chunking splits your documents into pieces that can be independently embedded and retrieved. And the chunk size creates a fundamental tradeoff that directly impacts retrieval quality:
- Too small (50-100 tokens): chunks lack context. "It was founded in 1991" means nothing without knowing what "it" refers to. High precision on exact matches but low recall because the embedding doesn't capture the full meaning.
- Too large (2000+ tokens): chunks contain multiple topics. The embedding becomes a vague average of everything in the chunk. A relevant sentence is buried in pages of irrelevant text, and the vector doesn't match the query well.
- Sweet spot (200-500 tokens): enough context to be self-contained, focused enough to match specific queries. This is where most production systems operate.
def chunk_by_tokens(text, chunk_size=400, overlap=50):
    """Fixed-size chunking with overlap (whitespace-split words approximate tokens)."""
    if not 0 <= overlap < chunk_size:
        # overlap >= chunk_size would never advance and loop forever
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    chunks = []
    start = 0
    while start < len(words):
        end = start + chunk_size
        chunk = ' '.join(words[start:end])
        chunks.append(chunk)
        if end >= len(words):
            break  # reached the end; avoid a redundant tail chunk
        start = end - overlap  # overlap prevents splitting mid-concept
    return chunks

def chunk_by_paragraphs(text, max_size=500):
    """Paragraph-aware chunking that respects document structure."""
    paragraphs = text.split('\n\n')
    chunks = []
    current = ""
    for para in paragraphs:
        combined_len = len(current.split()) + len(para.split())
        if combined_len > max_size and current:
            chunks.append(current.strip())
            current = para
        else:
            current += "\n\n" + para if current else para
    if current.strip():
        chunks.append(current.strip())
    return chunks
# Compare the two approaches on a sample document
sample_doc = """Machine learning is a branch of artificial intelligence. It enables
computers to learn from data without being explicitly programmed.
The most common types are supervised learning, unsupervised learning,
and reinforcement learning. Each has different use cases and requirements.

Supervised learning uses labeled training data. The model learns a mapping
from inputs to outputs by examining examples. Common algorithms include
linear regression, decision trees, and neural networks.

Unsupervised learning finds patterns in unlabeled data. Clustering and
dimensionality reduction are the main techniques. K-means and PCA are
the classic examples we covered in episodes 22 and 24.

Reinforcement learning involves an agent interacting with an environment.
The agent receives rewards for good actions and penalties for bad ones.
This is how game-playing AI systems like AlphaGo were trained."""
token_chunks = chunk_by_tokens(sample_doc, chunk_size=30, overlap=5)
para_chunks = chunk_by_paragraphs(sample_doc, max_size=50)
print(f"Token-based ({len(token_chunks)} chunks):")
for i, c in enumerate(token_chunks):
    print(f" Chunk {i}: '{c[:60]}...' ({len(c.split())} words)")
print(f"\nParagraph-based ({len(para_chunks)} chunks):")
for i, c in enumerate(para_chunks):
    print(f" Chunk {i}: '{c[:60]}...' ({len(c.split())} words)")
The overlap between chunks is important. If a concept spans a chunk boundary (which happens more often than you'd think), the overlap ensures both chunks contain enough context to match queries about that concept. A 50-word overlap means the end of chunk N and the beginning of chunk N+1 share 50 words.
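As a quick worked example of the overlap arithmetic: with a chunk size of 400 words and an overlap of 50, each new chunk starts 350 words after the previous one. The helper below (a hypothetical `chunk_spans`, counting whitespace-separated words rather than real tokens) lists the start and end positions so the shared region is visible.

```python
def chunk_spans(n_words, chunk_size=400, overlap=50):
    """Return (start, end) word positions for fixed-size chunks."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    stride = chunk_size - overlap   # how far each chunk start advances
    spans = []
    start = 0
    while start < n_words:
        end = min(start + chunk_size, n_words)
        spans.append((start, end))
        if end == n_words:          # last chunk reached the end of the text
            break
        start += stride
    return spans

spans = chunk_spans(1000)
for s, e in spans:
    print(f"words {s:>4} to {e:>4}")
# Consecutive chunks share `overlap` words: chunk 0 ends at 400,
# chunk 1 starts at 350.
```

A 1000-word document therefore becomes three chunks, not 2.5: the overlap costs you some storage and embedding time in exchange for not losing concepts at the boundaries.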
Paragraph-aware chunking respects natural document structure. Instead of slicing mid-sentence at exactly the 400th token, it splits at paragraph boundaries and merges small paragraphs together until reaching the target size. This produces more coherent chunks -- each chunk is about one topic, not half of one topic and half of another.
There's also semantic chunking (more advanced): use the embedding model itself to detect topic boundaries. Embed each sentence, compute cosine similarity between consecutive sentences, and split where the similarity drops below a threshold. Computationally expensive but produces the most coherent chunks because the boundaries correspond to actual topic shifts in the text. We'll explore this in the next episode when we cover advanced RAG techniques.
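Until then, a rough sketch of the idea, under loud assumptions: `toy_embed` is a word-count stand-in for a real sentence embedder, the regex sentence splitter is naive, and the 0.2 threshold is arbitrary. `semantic_chunks` is a hypothetical name, not a library function.

```python
import math
import re
from collections import Counter

def toy_embed(sentence):
    """Stand-in for a real sentence embedder (assumption): word counts."""
    return Counter(re.findall(r"[a-z0-9]+", sentence.lower()))

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a if w in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def semantic_chunks(text, embed=toy_embed, threshold=0.2):
    """Split where similarity between consecutive sentences drops."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text)
                 if s.strip()]
    if not sentences:
        return []
    vectors = [embed(s) for s in sentences]
    chunks = [[sentences[0]]]
    for prev, cur, sent in zip(vectors, vectors[1:], sentences[1:]):
        if cosine(prev, cur) < threshold:
            chunks.append([sent])       # similarity dropped: start new chunk
        else:
            chunks[-1].append(sent)     # same topic: extend current chunk
    return [" ".join(c) for c in chunks]

text = ("Supervised learning uses labeled data. "
        "Supervised learning models map inputs to labeled outputs. "
        "Glaciers are large masses of ice. "
        "These masses of ice move slowly over land.")
for chunk in semantic_chunks(text):
    print("-", chunk)
```

With a real embedding model you would pass its encode function as `embed` and tune the threshold on your own documents; the structure of the loop stays the same.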
Choosing an embedding model
The embedding model is the single most important component in a RAG system. A poor embedding model means you retrieve irrelevant documents, and even the best LLM can't produce a good answer from bad context. Garbage in, garbage out -- the oldest rule in computing.
# Practical embedding model comparison
models_info = {
"all-MiniLM-L6-v2": {
"dim": 384,
"speed": "fast (80ms/batch)",
"quality": "good for general English",
"use_case": "prototyping, small apps",
},
"E5-large-v2 (Microsoft)": {
"dim": 1024,
"speed": "medium (250ms/batch)",
"quality": "strong retrieval quality",
"use_case": "production search systems",
},
"text-embedding-3-small (OpenAI)": {
"dim": 1536,
"speed": "API call (~200ms)",
"quality": "excellent, benchmark-leading",
"use_case": "production apps with budget for API",
},
"BGE-base-en (BAAI)": {
"dim": 768,
"speed": "fast (100ms/batch)",
"quality": "competitive with commercial models",
"use_case": "production, self-hosted, open-source",
},
"nomic-embed-text": {
"dim": 768,
"speed": "fast (90ms/batch)",
"quality": "strong, 8192 token context",
"use_case": "long documents, self-hosted",
},
}
print(f"{'Model':<32} {'Dim':>5} {'Speed':<20} {'Best for'}")
print("-" * 85)
for name, info in models_info.items():
    print(f"{name:<32} {info['dim']:>5} {info['speed']:<20} "
          f"{info['use_case']}")
print("\nHow to choose:")
print(" Prototyping: all-MiniLM-L6-v2 (fast, good enough)")
print(" Production (self-hosted): BGE-base-en or nomic-embed-text")
print(" Production (API budget): text-embedding-3-small/large")
print(" Specialized domain: fine-tune on domain pairs (next episode)")
A practical test you should always do before committing to a model: embed your query and 5 documents you KNOW are relevant plus 5 documents you KNOW are irrelevant. Check that the relevant ones rank higher. If they don't, try a different model. This takes 10 minutes and saves you from building an entire system on a model that doesn't understand your domain.
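That sanity check could look roughly like the sketch below. The function names are hypothetical, the example uses two documents per group instead of five to stay short, and `word_overlap` is a toy stand-in scorer so the snippet runs without downloading a model. With a real embedding model, `score_fn` would be the cosine similarity between the query embedding and the document embedding.

```python
def passes_sanity_check(query, relevant, irrelevant, score_fn):
    """True if every known-relevant doc outscores every known-irrelevant doc."""
    worst_relevant = min(score_fn(query, d) for d in relevant)
    best_irrelevant = max(score_fn(query, d) for d in irrelevant)
    return worst_relevant > best_irrelevant

def word_overlap(query, doc):
    """Toy stand-in scorer (an assumption): Jaccard overlap of lowercase words."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / len(q | d) if q | d else 0.0

query = "how does rag reduce hallucination"
relevant = ["rag can reduce hallucination by grounding answers",
            "how retrieval helps rag avoid hallucination"]
irrelevant = ["chocolate chip cookies recipe",
              "the eiffel tower is in paris"]
ok = passes_sanity_check(query, relevant, irrelevant, word_overlap)
print("Model passes sanity check:", ok)
```

This version is strict: a single irrelevant document that outranks a relevant one fails the check. A softer variant could compare average scores instead.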
One subtlety worth knowing: some embedding models are asymmetric -- they're trained with short queries matched against long passages (the "query-document" paradigm). These models expect different input for queries vs documents. For example, E5 models use a prefix: queries get "query: How does RAG work?" while documents get "passage: RAG combines retrieval with generation...". Using the right prefix can improve retrieval quality by 5-10%. Check the model's documentation.
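A small helper keeps the prefixes consistent. This is a sketch: `add_e5_prefix` is a hypothetical name, and the prefix strings are the E5-style ones quoted above. Other asymmetric models may use different prefixes or none at all.

```python
def add_e5_prefix(texts, kind):
    """Prepend the E5-style role prefix to each text before embedding."""
    prefixes = {"query": "query: ", "passage": "passage: "}
    if kind not in prefixes:
        raise ValueError("kind must be 'query' or 'passage'")
    return [prefixes[kind] + t for t in texts]

queries = add_e5_prefix(["How does RAG work?"], "query")
passages = add_e5_prefix(["RAG combines retrieval with generation..."], "passage")
print(queries[0])   # query: How does RAG work?
print(passages[0])  # passage: RAG combines retrieval with generation...
```

The prefixed strings are what you would pass to the model's encode step, both at indexing time (passages) and at query time (queries).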
Context window management
You've retrieved your top 5 chunks. Now you need to fit them into the LLM's prompt alongside the system instruction, the question, and room for the answer. This is context window management, and getting it wrong is one of the most common RAG failures.
def build_rag_prompt(query, chunks, max_context_tokens=3000):
    """Build a prompt that respects the context window budget."""
    system = ("Answer the question based on the provided context. "
              "Cite the source when possible. If the context doesn't "
              "contain the answer, say 'I don't have enough information.'")
    context_parts = []
    total_tokens = 0
    for i, (chunk, score) in enumerate(chunks):
        # Rough token estimate: 1 token ~= 4 English characters
        chunk_tokens = len(chunk) // 4
        if total_tokens + chunk_tokens > max_context_tokens:
            break
        context_parts.append(
            f"[Source {i+1}, relevance: {score:.2f}]\n{chunk}")
        total_tokens += chunk_tokens
    context = "\n\n---\n\n".join(context_parts)
    prompt = f"""{system}
Context:
{context}
Question: {query}
Answer:"""
    prompt_tokens = len(prompt) // 4
    print(f"Context: {len(context_parts)} chunks, "
          f"~{total_tokens} tokens")
    print(f"Total prompt: ~{prompt_tokens} tokens")
    print(f"Remaining for answer: "
          f"~{4096 - prompt_tokens} tokens (assuming 4K window)")
    return prompt
# Example
chunks = [
("RAG combines retrieval with generation. The model first searches "
"a knowledge base for relevant documents, then uses them as context "
"for generating its response. This grounds the output in real data.",
0.92),
("The original RAG paper by Lewis et al. (2020) at Facebook AI "
"demonstrated that combining a pre-trained seq2seq model with a "
"neural retriever significantly improved performance on knowledge-"
"intensive NLP tasks compared to pure generation.", 0.87),
("Hallucination in language models refers to the generation of "
"plausible-sounding but factually incorrect information. RAG "
"mitigates this by providing real documents as evidence.", 0.84),
("The transformer architecture uses self-attention to process "
"sequences in parallel rather than sequentially.", 0.45),
]
prompt = build_rag_prompt("What is RAG and why does it work?", chunks)
A few important principles for context management:
Rank matters: put the most relevant chunks first. Remember the "lost in the middle" phenomenon from the section above -- the model pays more attention to content near the start and end of the context. Your highest-scoring chunk should be at position 1, not buried in the middle of the list.
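Relevance threshold: don't include chunks below a minimum similarity score. That last chunk in our example (score 0.45, about transformers) is only tangentially related to the query about RAG. Including it adds noise without signal, and can actually confuse the model. A threshold of 0.5-0.6 for cosine similarity is a reasonable starting point, but score ranges differ between embedding models, so calibrate the cutoff on your own data (the small all-MiniLM-L6-v2 examples in this episode use lower cutoffs of 0.3-0.4).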
Token budgeting: reserve tokens for each component of the prompt. System instruction takes ~50-100 tokens, the question ~20-50, and you want at least 500-1000 tokens for the answer. What's left over is your context budget.
Source attribution: label each chunk with its source document name or ID. This enables the model to say "According to document X..." which makes answers verifiable. Traceability is the whole point of RAG.
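The budgeting, threshold and ranking principles above can be combined in a short sketch. Both function names are hypothetical, the token counts reuse the rough 4-characters-per-token estimate from `build_rag_prompt`, and the 4096-token window and the reserved amounts are example numbers taken from the ranges mentioned above.

```python
def context_budget(window, system_tokens, question_tokens, answer_reserve):
    """Tokens left for retrieved context after reserving the other parts."""
    return max(0, window - system_tokens - question_tokens - answer_reserve)

def select_chunks(scored_chunks, budget, min_score=0.5):
    """Keep chunks above min_score, best first, until the budget is used up."""
    selected, used = [], 0
    for text, score in sorted(scored_chunks, key=lambda c: c[1], reverse=True):
        cost = len(text) // 4  # rough estimate: 1 token ~= 4 characters
        if score < min_score or used + cost > budget:
            continue  # skip weak or oversized chunks, keep trying smaller ones
        selected.append((text, score))
        used += cost
    return selected

budget = context_budget(window=4096, system_tokens=100,
                        question_tokens=50, answer_reserve=1000)
print(budget)  # 4096 - 100 - 50 - 1000 = 2946

# Placeholder chunks of 400 characters each (about 100 tokens)
candidates = [("a" * 400, 0.45), ("b" * 400, 0.92), ("c" * 400, 0.84)]
picked = select_chunks(candidates, budget)
print([score for _, score in picked])  # [0.92, 0.84]
```

One design choice worth noting: this sketch uses `continue` rather than `break` when a chunk does not fit, so a shorter lower-ranked chunk can still use the remaining budget. Whether that is desirable depends on how much you trust the lower-ranked results.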
Building a complete document Q&A system
Time to put the entire pipeline together. This is a fully functional RAG system you can actually use:
from sentence_transformers import SentenceTransformer
import numpy as np
class DocumentQA:
    """Complete RAG pipeline: ingest, embed, retrieve, generate prompt."""
    def __init__(self, model_name='all-MiniLM-L6-v2'):
        self.embedder = SentenceTransformer(model_name)
        self.chunks = []
        self.sources = []
        self.embeddings = None
    def ingest(self, documents):
        """Ingest documents as (text, source_name) tuples."""
        for text, source in documents:
            doc_chunks = self._chunk_text(text, max_size=300)
            for chunk in doc_chunks:
                self.chunks.append(chunk)
                self.sources.append(source)
        self.embeddings = self.embedder.encode(
            self.chunks, normalize_embeddings=True,
            show_progress_bar=True)
        print(f"Ingested {len(documents)} documents -> "
              f"{len(self.chunks)} chunks")
    def _chunk_text(self, text, max_size=300):
        """Split text into paragraph-aware chunks."""
        paragraphs = text.split('\n\n')
        chunks = []
        current = ""
        for para in paragraphs:
            if len(current.split()) + len(para.split()) > max_size:
                if current:
                    chunks.append(current.strip())
                current = para
            else:
                current += "\n\n" + para if current else para
        if current.strip():
            chunks.append(current.strip())
        return chunks if chunks else [text]
    def answer(self, query, top_k=3, threshold=0.3):
        """Retrieve relevant chunks and build a grounded prompt."""
        q_emb = self.embedder.encode(
            [query], normalize_embeddings=True)
        scores = (self.embeddings @ q_emb.T).flatten()
        top_idx = np.argsort(scores)[::-1][:top_k]
        contexts = []
        for i in top_idx:
            if scores[i] > threshold:
                contexts.append({
                    'text': self.chunks[i],
                    'source': self.sources[i],
                    'score': float(scores[i])
                })
        prompt = self._build_prompt(query, contexts)
        return prompt, contexts
    def _build_prompt(self, query, contexts):
        if not contexts:
            return (f"No relevant context found for: {query}\n"
                    f"Please rephrase your question.")
        ctx_parts = []
        for c in contexts:
            ctx_parts.append(
                f"[From: {c['source']}, score: {c['score']:.2f}]\n"
                f"{c['text']}")
        ctx = "\n\n---\n\n".join(ctx_parts)
        return f"""Answer the question based on the context below.
Cite sources when possible. If unsure, say so.
{ctx}
Question: {query}
Answer:"""
# Demo usage
qa = DocumentQA()
qa.ingest([
("Python 3.12 introduced performance improvements such as "
"comprehension inlining (PEP 709). More flexible f-string parsing "
"and improved error messages were also added. "
"The release was published in October 2023.",
"python-changelog.md"),
("The transformer architecture was introduced in the 2017 paper "
"Attention Is All You Need by Vaswani et al. It replaces recurrence "
"with self-attention, processing all positions in parallel. This "
"enabled much faster training on GPUs compared to RNNs and LSTMs.",
"ml-notes.md"),
("Retrieval-Augmented Generation (RAG) was proposed by Lewis et al. "
"in 2020. It combines a neural retriever with a seq2seq generator. "
"The retriever finds relevant passages from a knowledge base, and "
"the generator produces an answer conditioned on those passages. "
"RAG significantly reduces hallucination compared to pure generation.",
"rag-paper-summary.md"),
("FAISS (Facebook AI Similarity Search) is a library for efficient "
"similarity search over dense vectors. It supports GPU acceleration "
"and approximate nearest neighbor algorithms like IVF and HNSW. "
"For collections under 100K vectors, flat (exact) search is fast "
"enough. For millions of vectors, use IVF with product quantization.",
"vector-db-guide.md"),
])
queries = [
"What's new in Python 3.12?",
"How does RAG reduce hallucination?",
"What vector search library should I use for a million vectors?",
]
for q in queries:
    prompt, sources = qa.answer(q)
    print(f"\nQuery: '{q}'")
    print(f"Retrieved {len(sources)} relevant chunks:")
    for s in sources:
        print(f" [{s['score']:.3f}] from {s['source']}: "
              f"{s['text'][:60]}...")
    print()
This is a complete, working RAG pipeline up to the generation step. The prompt goes to any LLM -- feed it into an API call or a local model, and the response will be grounded in your documents. The sources list provides full traceability. You know exactly which documents informed the answer.
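The hand-off to the LLM can be sketched as follows. Everything here is an assumption for illustration: `ask` is a hypothetical wrapper, `generate` stands for whatever function calls your LLM (API client or local model), and the two `fake_*` stubs exist only so the snippet runs on its own. In practice you would pass `qa.answer` as `build_prompt`.

```python
def ask(build_prompt, generate, query):
    """Retrieve-then-generate: build the grounded prompt, then call the LLM."""
    prompt, contexts = build_prompt(query)
    if not contexts:
        # Nothing relevant was retrieved: skip the LLM call entirely
        return "I don't have enough information.", []
    return generate(prompt), [c["source"] for c in contexts]

# Stubs (hypothetical) standing in for qa.answer and a real LLM client
def fake_build_prompt(query):
    ctx = [{"text": "RAG was proposed in 2020.",
            "source": "rag-paper-summary.md", "score": 0.8}]
    return f"Context: {ctx[0]['text']}\nQuestion: {query}\nAnswer:", ctx

def fake_llm(prompt):
    return "RAG was proposed in 2020 [rag-paper-summary.md]."

answer, sources = ask(fake_build_prompt, fake_llm, "When was RAG proposed?")
print(answer)
print("Sources:", sources)
```

Keeping the LLM behind a plain callable means the retrieval code never needs to know which model or provider generates the answer.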
Evaluation -- how do you know your RAG system works?
Building a RAG system is one thing. Knowing if it's actually good is another. You need metrics that capture both retrieval quality and answer quality separately, because a failure in either component breaks the whole pipeline.
import numpy as np
def evaluate_rag_retrieval(queries, ground_truth_docs, retrieved_docs):
    """Evaluate the retrieval component of a RAG system.
    queries: list of query strings
    ground_truth_docs: list of sets of relevant doc indices
    retrieved_docs: list of lists of retrieved doc indices
    """
    metrics = {"precision@3": [], "recall@3": [], "mrr": []}
    for gt, retrieved in zip(ground_truth_docs, retrieved_docs):
        top3 = retrieved[:3]
        # Precision@3: of the 3 retrieved, how many were relevant?
        p3 = sum(1 for d in top3 if d in gt) / 3
        metrics["precision@3"].append(p3)
        # Recall@3: of all relevant docs, how many did we find in top 3?
        r3 = sum(1 for d in top3 if d in gt) / len(gt) if gt else 0
        metrics["recall@3"].append(r3)
        # MRR: reciprocal rank of first relevant result
        rr = 0
        for rank, d in enumerate(retrieved, 1):
            if d in gt:
                rr = 1.0 / rank
                break
        metrics["mrr"].append(rr)
    print("RAG Retrieval Evaluation:")
    print(f" Precision@3: {np.mean(metrics['precision@3']):.3f}")
    print(f" Recall@3: {np.mean(metrics['recall@3']):.3f}")
    print(f" MRR: {np.mean(metrics['mrr']):.3f}")
    return metrics
# Simulated evaluation
# 10 queries, each with known relevant docs and retrieval results
np.random.seed(42)
n_queries = 10
ground_truth = [set(np.random.choice(50, size=3, replace=False))
for _ in range(n_queries)]
# Simulate a decent retriever: ~70% of the time the top result is relevant
retrieved = []
for gt in ground_truth:
    gt_list = list(gt)
    if np.random.random() < 0.7:
        # Good retrieval: put a relevant doc first, then random filler
        r = [gt_list[0]] + list(np.random.choice(50, size=4, replace=False))
    else:
        # Bad retrieval: purely random docs (relevant ones appear only by chance)
        r = list(np.random.choice(50, size=5, replace=False))
    retrieved.append(r)
evaluate_rag_retrieval(
[f"query_{i}" for i in range(n_queries)],
ground_truth,
retrieved)
print("\nFor end-to-end evaluation, you also need:")
print(" - Answer correctness (does the final answer match ground truth?)")
print(" - Faithfulness (is the answer supported by retrieved context?)")
print(" - Answer relevance (does the answer address the question?)")
In production, you'd evaluate the full pipeline end-to-end using a test set of question-answer pairs. The key metrics are:
- Retrieval quality: did the retriever find the right documents? (Precision, Recall, MRR -- same metrics from exercise 3 of episode #63)
- Answer correctness: does the generated answer match the expected answer?
- Faithfulness: is every claim in the answer supported by the retrieved context? (A model that ignores the context and hallucinates anyway is NOT doing RAG properly)
- Answer relevance: does the answer actually address the question, or does it talk about something tangentially related?
There are now standardized frameworks for RAG evaluation -- RAGAS is one of the popular ones -- that automate these checks. But understanding what they measure is more important than knowing the framework. You want to catch two types of failures: retrieval failures (right answer exists in the knowledge base but wasn't retrieved) and generation failures (right context was retrieved but the model answered wrong anyway).
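Separating the two failure types can be sketched like this. `classify_failure` is a hypothetical helper, and it assumes you already know, per test case, which chunk IDs contain the answer and whether the final answer was judged correct (by a human or some other grader, which is outside this sketch).

```python
def classify_failure(gold_chunks, retrieved_chunks, answer_correct):
    """Label one test case as ok, retrieval_failure, or generation_failure."""
    if answer_correct:
        return "ok"
    if not set(gold_chunks) & set(retrieved_chunks):
        return "retrieval_failure"   # the needed chunk never reached the prompt
    return "generation_failure"      # right context retrieved, wrong answer anyway

# (gold chunk IDs, retrieved chunk IDs, was the answer correct?)
cases = [
    ({3}, [3, 7, 9], True),
    ({3}, [1, 7, 9], False),
    ({3}, [3, 7, 9], False),
]
labels = [classify_failure(*c) for c in cases]
print(labels)  # ['ok', 'retrieval_failure', 'generation_failure']
```

Counting the labels over a test set tells you where to spend effort: many retrieval failures point at chunking or the embedding model, many generation failures point at the prompt or the LLM.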
Common pitfalls and how to avoid them
I've seen these mistakes enough times that they're worth calling out explicitly:
Bad chunking: chunks that split mid-sentence or mid-concept produce poor embeddings and confusing context. Always inspect your chunks manually on a sample before scaling. Print 20 random chunks and read them. If they don't make sense as standalone pieces of text, your chunking needs work.
Embedding-query mismatch: some embedding models are trained with asymmetric input. Using a symmetric model (designed for sentence-sentence similarity) for query-document retrieval is like using a screwdriver as a hammer -- it sort of works but you're losing quality. Check if your model needs query/passage prefixes.
Missing context across chunks: if the answer requires information from two different chunks that aren't retrieved together, the model can't synthesise them. Overlap helps. So does increasing top_k from 3 to 5. But the fundamental fix is ensuring your chunks are self-contained enough to answer questions independently.
Over-retrieval: stuffing too many chunks into the context dilutes the relevant information. The model has to wade through noise to find the signal. I'd rather retrieve 3 highly relevant chunks than 10 marginally relevant ones. Quality over quantity.
No relevance filtering: if you always return top_k results regardless of their actual relevance score, you'll sometimes inject completely irrelevant chunks. That low-scoring chunk about cooking recipes has no business being in the context for a question about Python programming. Always apply a minimum threshold.
# Demonstrating the relevance threshold problem
from sentence_transformers import SentenceTransformer
import numpy as np
model = SentenceTransformer('all-MiniLM-L6-v2')
knowledge_base = [
"RAG combines retrieval and generation for better answers.",
"The Eiffel Tower is 330 meters tall.",
"Python decorators use the @ symbol before function definitions.",
"Chocolate chip cookies were invented by accident in 1938.",
"Vector databases store and search embedding vectors efficiently.",
]
embeddings = model.encode(knowledge_base, normalize_embeddings=True)
query = "How does RAG improve LLM responses?"
q_emb = model.encode([query], normalize_embeddings=True)
scores = (embeddings @ q_emb.T).flatten()
print("All chunks ranked by relevance:")
for idx in np.argsort(scores)[::-1]:
    status = "INCLUDE" if scores[idx] > 0.4 else "REJECT"
    print(f" [{status}] {scores[idx]:.3f}: {knowledge_base[idx]}")
print("\nWithout threshold: ALL 5 chunks go into context (noise!)")
print("With threshold 0.4: only relevant chunks go into context")
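The manual chunk inspection recommended under "Bad chunking" above can be partly automated. The sketch below is an illustration only: `sample_chunks_for_review` is a hypothetical helper, and the "ends with punctuation" test is a crude heuristic that helps you decide which chunks to read first. It does not replace actually reading them.

```python
import random

def sample_chunks_for_review(chunks, n=20, seed=0):
    """Pick up to n random chunks and flag ones that look cut off mid-sentence."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    picked = rng.sample(chunks, min(n, len(chunks)))
    report = []
    for chunk in picked:
        text = chunk.strip()
        # Crude heuristic (an assumption): a complete chunk ends with punctuation
        looks_cut = not text.endswith((".", "!", "?", '"', ")"))
        report.append((looks_cut, text))
    return report

chunks = ["RAG retrieves documents before answering.",
          "The overlap between chunks is important because",
          "Vector databases store embeddings."]
report = sample_chunks_for_review(chunks, n=3)
for looks_cut, text in report:
    print("CHECK" if looks_cut else "ok   ", "|", text[:60])
```

Chunks flagged with CHECK are the ones most likely to have been split mid-sentence and are worth reading first.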
Putting it all in perspective
RAG is not magic. It's an engineering pattern -- a well-understood combination of retrieval (episode #63) and generation (episode #62) that solves a real problem. The beauty is that every component is replaceable and improvable independently. Better embedding model? Swap it in, rebuild your index. Better chunking strategy? Re-process your documents. Better LLM? Just change the generation endpoint. Each improvement compounds.
What we built today is the basic RAG pipeline -- sometimes called "naive RAG" in the literature. It works surprisingly well for many use cases, but there are clear limitations. What happens when the query needs information scattered across 10 different documents? What if the chunks are ambiguous without their parent document's title? What about questions that require multi-step reasoning over the retrieved content? These are the problems that advanced RAG techniques address, and that's where we're going next ;-)
The bottom line
- RAG augments LLMs with external knowledge at inference time -- search first, then answer. No retraining needed, no fine-tuning, just better prompts;
- The pipeline is: embed query -> find similar chunks in vector store -> inject into prompt -> LLM generates grounded answer;
- Document chunking splits large texts into embeddable, retrievable pieces -- 200-500 tokens with overlap is the sweet spot for most applications;
- Chunk at natural boundaries (paragraphs, sections) rather than arbitrary token counts. Semantic chunking detects topic shifts automatically;
- The embedding model is the most critical component -- a bad retriever means bad context means bad answers, regardless of how good the LLM is;
- Context window management requires balancing retrieved content, system instructions, the question, and answer space. Put most relevant chunks first (lost in the middle effect);
- Relevance thresholds prevent low-quality chunks from polluting the context. Better to retrieve 3 great chunks than 10 mediocre ones;
- Source attribution enables traceability -- every claim can be traced to a specific document. This is what makes RAG trustworthy in production.
Exercises
Exercise 1: Build a multi-source RAG system with source tracking. Create 5 "documents" (each 200-400 words) on different topics: one about Python, one about machine learning, one about cooking, one about geography, and one about music. Chunk each document into 3-5 pieces. Build a RAG pipeline that retrieves the top 3 chunks for any query and constructs a prompt that labels each chunk with its source document name. Test with 10 queries (2 per topic) and verify that the retriever correctly pulls chunks from the right source documents. Print a confusion matrix showing which source documents were retrieved for queries from each topic. Calculate and print the "source precision" -- the fraction of retrieved chunks that came from the correct topic's document.
Exercise 2: Implement a chunk quality analyzer. Write a function analyze_chunks(text, chunk_sizes=[100, 200, 300, 500]) that chunks the same document at different sizes (using word-count-based chunking with 10% overlap) and for each size computes: (a) number of chunks, (b) average chunk length in words, (c) "self-containment score" -- embed each chunk and the full document, compute cosine similarity between each chunk and the full document, and average those similarities, (d) "inter-chunk distinctness" -- average pairwise cosine distance between chunks (higher = chunks cover different topics, which is what you want). Run this on a 1000+ word text you create (or grab from Wikipedia) and print a table comparing the four chunk sizes. Which chunk size gives the best balance of self-containment and distinctness?
Exercise 3: Build a RAG evaluation harness. Create a test set of 10 question-answer pairs where you know which chunks contain the answer (ground truth). Build a RAG pipeline, run all 10 queries, and evaluate: (a) Retrieval Recall@3 -- did the right chunk appear in the top 3? (b) Context Relevance -- average similarity score of retrieved chunks to the query, (c) "Answer Coverage" -- embed the ground truth answer and the retrieved context, compute their cosine similarity as a proxy for whether the retrieved context contains enough information to answer the question. Print a per-query breakdown and summary statistics. Identify which queries the system fails on and hypothesize why (wrong chunks retrieved? right chunks but too diluted? query too vague?).