Learn AI Series (#65) - RAG - Advanced Techniques

What will I learn

You will learn hybrid search -- combining dense vector search with sparse keyword matching for better retrieval;
re-ranking with cross-encoders -- improving retrieval quality with a second-stage model;
query expansion and transformation -- rewriting queries to bridge vocabulary gaps;
HyDE (Hypothetical Document Embeddings) -- a counterintuitive trick that works;
multi-hop retrieval -- reasoning across multiple documents;
RAG evaluation metrics -- faithfulness, relevance, completeness;
RAG vs fine-tuning -- when to use which approach, and when to use both.

Requirements

A working modern computer running macOS, Windows or Ubuntu;
An installed Python 3(.11+) distribution;
The ambition to learn AI and machine learning.

Difficulty

Beginner

Curriculum (of the `Learn AI Series`):

Learn AI Series (#65) - RAG - Advanced Techniques

Solutions to Episode #64 Exercises

Exercise 1: Multi-source RAG system with source tracking and confusion matrix.

from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer('all-MiniLM-L6-v2')

sources = {
    "python_guide.md": [
        "Python uses indentation to define code blocks instead of curly braces. "
        "This design choice makes Python code readable by default.",
        "List comprehensions in Python provide a concise way to create lists. "
        "The syntax is [expr for item in iterable if condition].",
        "Python's GIL (Global Interpreter Lock) prevents true parallel execution "
        "of threads. Use multiprocessing for CPU-bound parallelism.",
        "Decorators in Python are functions that modify the behavior of other "
        "functions. They use the @ syntax placed above the decorated function.",
    ],
    "ml_intro.md": [
        "Supervised learning requires labeled training data where each example "
        "has an input and a known correct output.",
        "Gradient descent iteratively adjusts model parameters by moving in the "
        "direction that reduces the loss function.",
        "Overfitting occurs when a model memorizes training data instead of "
        "learning general patterns. Regularization helps prevent this.",
        "Cross-validation splits data into k folds and trains k separate models "
        "to get a robust estimate of performance.",
    ],
    "cooking_101.md": [
        "The Maillard reaction occurs when proteins and sugars are heated above "
        "140 degrees Celsius, creating brown color and complex flavors.",
        "Emulsification combines two immiscible liquids like oil and vinegar "
        "using an emulsifier such as egg yolk or mustard.",
        "Braising involves searing meat at high heat then slowly cooking it "
        "in liquid at low temperature for several hours.",
        "Fermentation uses microorganisms to convert sugars into acids, gases, "
        "or alcohol. Bread, yogurt, and kimchi all rely on fermentation.",
    ],
    "geography_facts.md": [
        "The Mariana Trench reaches a depth of approximately 11,034 meters, "
        "making it the deepest known point in Earth's oceans.",
        "Iceland sits on the Mid-Atlantic Ridge where the North American and "
        "Eurasian tectonic plates are spreading apart.",
        "The Sahara Desert was green and lush around 6000 years ago during "
        "the African Humid Period, with lakes and vegetation.",
        "The Amazon rainforest produces roughly 20 percent of the world's "
        "oxygen and contains approximately 10 percent of all species.",
    ],
    "music_theory.md": [
        "A chord is three or more notes played simultaneously. Major chords "
        "sound bright while minor chords sound darker or melancholic.",
        "Time signatures indicate the rhythmic structure of music. 4/4 time "
        "has four quarter-note beats per measure.",
        "The circle of fifths shows the relationships between the twelve "
        "tones of the chromatic scale and their key signatures.",
        "Counterpoint is the art of combining independent melodic lines so "
        "they sound harmonious when played together.",
    ],
}

# Build the index
all_chunks = []
chunk_sources = []
for source_name, chunks in sources.items():
    for chunk in chunks:
        all_chunks.append(chunk)
        chunk_sources.append(source_name)

embeddings = model.encode(all_chunks, normalize_embeddings=True)

# Define queries with expected source
queries = [
    ("How does Python handle code structure?", "python_guide.md"),
    ("What is list comprehension syntax?", "python_guide.md"),
    ("How do you prevent overfitting?", "ml_intro.md"),
    ("What is gradient descent used for?", "ml_intro.md"),
    ("What makes food turn brown when heated?", "cooking_101.md"),
    ("How does bread rise?", "cooking_101.md"),
    ("What is the deepest point in the ocean?", "geography_facts.md"),
    ("Where are tectonic plates pulling apart?", "geography_facts.md"),
    ("What makes a chord major or minor?", "music_theory.md"),
    ("How are musical keys related to each other?", "music_theory.md"),
]

source_names = list(sources.keys())
confusion = np.zeros((len(source_names), len(source_names)), dtype=int)

total_correct = 0
total_retrieved = 0
top_k = 3

print(f"{'Query':<45} {'Expected':<20} {'Retrieved sources'}")
print("-" * 95)

for query_text, expected_source in queries:
    q_emb = model.encode([query_text], normalize_embeddings=True)
    scores = (embeddings @ q_emb.T).flatten()
    top_idx = np.argsort(scores)[::-1][:top_k]

    retrieved_sources = [chunk_sources[i] for i in top_idx]
    exp_idx = source_names.index(expected_source)

    correct = sum(1 for s in retrieved_sources if s == expected_source)
    total_correct += correct
    total_retrieved += top_k

    for rs in retrieved_sources:
        ret_idx = source_names.index(rs)
        confusion[exp_idx][ret_idx] += 1

    src_str = ", ".join(f"{s}" for s in retrieved_sources)
    print(f"{query_text[:43]:<45} {expected_source:<20} {src_str}")

source_precision = total_correct / total_retrieved
print(f"\nSource precision: {source_precision:.2f} "
      f"({total_correct}/{total_retrieved} chunks from correct source)")

print(f"\nConfusion matrix (rows=expected, cols=retrieved):")
header = "".join(f"{s[:8]:>10}" for s in source_names)
print(f"{'':>18}{header}")
for i, name in enumerate(source_names):
    row = "".join(f"{confusion[i][j]:>10}" for j in range(len(source_names)))
    print(f"{name[:16]:<18}{row}")

Source precision above 0.8 means the retriever consistently pulls from the right topic document. Confusion between sources (off-diagonal entries) tells you where the semantic boundaries are fuzzy -- if cooking queries occasionally retrieve chemistry-related ML content, that's an embedding model limitation, not a bug in your system.
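To see which sources are responsible for the off-diagonal confusion, you can normalize each row of the matrix and read off a per-source precision. Below is a minimal sketch of that idea. The matrix values and the shortened source names are made up for illustration (they are not output from the script above); the layout (rows = expected, columns = retrieved) matches the exercise.

```python
import numpy as np

# Hypothetical confusion matrix (rows = expected, cols = retrieved).
# These counts are invented for illustration, not measured results.
source_names = ["python", "ml", "cooking"]
confusion = np.array([
    [5, 1, 0],
    [1, 5, 0],
    [0, 2, 4],
])

# Share of retrieved chunks per expected source, split by retrieved source
row_totals = confusion.sum(axis=1, keepdims=True)
row_share = confusion / row_totals

# Diagonal = per-source precision (share retrieved from the correct source)
per_source_precision = np.diag(row_share)

for i, name in enumerate(source_names):
    # Mask the diagonal so argmax finds the most common WRONG source
    off_diag = confusion[i].copy()
    off_diag[i] = -1
    worst = source_names[int(np.argmax(off_diag))]
    print(f"{name:<10} precision={per_source_precision[i]:.2f} "
          f"most confused with: {worst}")
```

A source with low per-source precision and one dominant "most confused with" neighbour is a good candidate for metadata filtering or for more distinctive chunking.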

Exercise 2: Chunk quality analyzer comparing different chunk sizes.

from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer('all-MiniLM-L6-v2')

long_text = """Machine learning has transformed virtually every industry over the
past decade. In healthcare, ML models analyze medical images to detect tumors,
predict patient outcomes, and recommend treatment plans. The accuracy of these
systems often matches or exceeds human radiologists on specific tasks.

In finance, algorithmic trading systems use reinforcement learning to execute
trades at speeds no human could match. Fraud detection systems process millions
of transactions per second, flagging suspicious patterns that would be invisible
to manual review. Credit scoring models evaluate loan applications using hundreds
of features simultaneously.

Natural language processing has enabled conversational AI assistants, automated
translation between languages, and sentiment analysis of customer feedback at
scale. The transformer architecture, introduced in 2017, fundamentally changed
how we process text data. Models like BERT and GPT demonstrated that pre-training
on massive text corpora produces representations useful for virtually any NLP task.

Computer vision applications range from autonomous vehicles that interpret road
scenes in real time to manufacturing systems that detect defects on production
lines. Object detection, image segmentation, and pose estimation all rely on
convolutional neural networks or more recently vision transformers.

Recommendation systems power the content feeds of social media platforms, the
product suggestions on e-commerce sites, and the playlist generation on music
streaming services. Collaborative filtering, content-based filtering, and hybrid
approaches each have tradeoffs in terms of cold start behavior, scalability, and
the diversity of recommendations produced.

The deployment of ML systems in production raises important questions about
fairness, accountability, and transparency. Models trained on biased data
perpetuate and amplify existing biases. Explaining why a model made a specific
prediction remains an active research area, especially for deep neural networks
where the decision process is distributed across millions of parameters.

Data engineering is often the most time-consuming part of any ML project. Cleaning
data, handling missing values, encoding categorical variables, and building feature
pipelines account for an estimated 80 percent of a data scientist's time. Tools
like Apache Spark, dbt, and Airflow help manage data workflows at scale.

Transfer learning has democratized ML by allowing practitioners to start from
pre-trained models rather than training from scratch. Fine-tuning a model that
already understands language or images requires far less data and compute than
training one from zero. This has lowered the barrier to entry significantly."""

def chunk_text(text, chunk_size, overlap_pct=0.1):
    """Split text into word-based chunks with a fractional overlap."""
    words = text.split()
    overlap = max(1, int(chunk_size * overlap_pct))
    chunks = []
    start = 0
    while start < len(words):
        end = start + chunk_size
        chunks.append(' '.join(words[start:end]))
        if end >= len(words):
            break  # last chunk reached; avoid a tiny overlap-only tail chunk
        start = end - overlap
    return chunks

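def analyze_chunks(text, chunk_sizes=(100, 200, 300, 500)):
    # Note: all-MiniLM-L6-v2 truncates inputs at 256 word pieces, so the
    # full document (and any chunk longer than that) is only partly embedded.
    doc_emb = model.encode([text], normalize_embeddings=True)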

    print(f"{'Size':>6} {'Chunks':>7} {'Avg len':>8} "
          f"{'Self-contain':>13} {'Distinctness':>13}")
    print("-" * 52)

    for size in chunk_sizes:
        chunks = chunk_text(text, size)
        n = len(chunks)
        avg_len = np.mean([len(c.split()) for c in chunks])

        chunk_embs = model.encode(chunks, normalize_embeddings=True)

        # Self-containment: similarity to full document
        self_scores = (chunk_embs @ doc_emb.T).flatten()
        self_contain = np.mean(self_scores)

        # Inter-chunk distinctness: average pairwise distance
        if n > 1:
            sim_matrix = chunk_embs @ chunk_embs.T
            mask = np.triu(np.ones((n, n), dtype=bool), k=1)
            pairwise_sims = sim_matrix[mask]
            distinctness = 1.0 - np.mean(pairwise_sims)
        else:
            distinctness = 0.0

        print(f"{size:>6} {n:>7} {avg_len:>8.0f} "
              f"{self_contain:>13.3f} {distinctness:>13.3f}")

analyze_chunks(long_text)
print("\nSmall chunks: high distinctness (different topics) but low "
      "self-containment (too little context per chunk).")
print("Large chunks: high self-containment but low distinctness "
      "(each chunk covers too many topics, diluting the embedding).")
print("The sweet spot balances both -- typically 200-300 words.")

Smaller chunks (100 words) are more distinct from each other (they cover different sub-topics) but less self-contained (each chunk is too fragmented to stand on its own). Larger chunks (500 words) are more self-contained but less distinct -- they blur multiple topics into one embedding. The sweet spot is in the middle, which is exactly what we observed in episode #64 with the 200-500 token recommendation.

Exercise 3: RAG evaluation harness with ground truth and multiple metrics.

from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer('all-MiniLM-L6-v2')

# Knowledge base with known chunks
chunks = [
    "Python was created by Guido van Rossum and first released in 1991.",
    "The transformer architecture was introduced in the Attention Is All "
    "You Need paper by Vaswani et al. in 2017.",
    "RAG combines retrieval with generation to reduce hallucination.",
    "FAISS is a library by Meta for efficient similarity search over vectors.",
    "Gradient descent updates parameters by computing partial derivatives "
    "of the loss function with respect to each parameter.",
    "BERT uses masked language modeling as its pre-training objective.",
    "Convolutional neural networks use filters to detect spatial patterns.",
    "K-means clustering partitions data into k groups by minimizing "
    "within-cluster variance.",
    "Dropout randomly sets neuron activations to zero during training "
    "to prevent overfitting.",
    "Word2Vec learns dense word representations by predicting context "
    "words from a target word or vice versa.",
    "Batch normalization normalizes layer inputs to stabilize training.",
    "Adam optimizer combines momentum with adaptive learning rates.",
    "Cosine similarity measures the angle between two vectors, ignoring "
    "magnitude differences.",
    "LSTM networks use gates to control information flow, solving the "
    "vanishing gradient problem in recurrent networks.",
    "The attention mechanism lets models focus on relevant parts of the "
    "input sequence when producing each output token.",
]

embeddings = model.encode(chunks, normalize_embeddings=True)

# Test set: (question, answer, relevant_chunk_indices)
test_set = [
    ("Who created Python?",
     "Guido van Rossum created Python, releasing it in 1991.", {0}),
    ("What paper introduced transformers?",
     "The Attention Is All You Need paper by Vaswani et al in 2017.", {1}),
    ("How does RAG reduce hallucination?",
     "RAG retrieves relevant documents to ground the generation.", {2}),
    ("What library does Meta offer for vector search?",
     "FAISS by Meta for efficient similarity search.", {3}),
    ("How does gradient descent work?",
     "It computes partial derivatives and updates parameters.", {4}),
    ("What is BERT's pre-training objective?",
     "Masked language modeling -- predicting masked tokens.", {5}),
    ("How do CNNs detect patterns?",
     "They use convolutional filters for spatial patterns.", {6}),
    ("What does k-means minimize?",
     "Within-cluster variance to partition data into k groups.", {7}),
    ("How does dropout regularize?",
     "It randomly zeroes activations during training.", {8}),
    ("What does attention allow a model to do?",
     "Focus on relevant parts of input when generating output.", {14}),
]

def retrieval_recall_at_k(retrieved, relevant, k):
    return sum(1 for i in retrieved[:k] if i in relevant) / len(relevant)

results = []
top_k = 3

print(f"{'Query':<42} {'R@3':>5} {'Ctx Rel':>8} {'Coverage':>9}")
print("-" * 68)

for question, answer, relevant_idx in test_set:
    q_emb = model.encode([question], normalize_embeddings=True)
    scores = (embeddings @ q_emb.T).flatten()
    top_idx = np.argsort(scores)[::-1][:top_k]

    # Retrieval Recall@3
    recall = retrieval_recall_at_k(top_idx, relevant_idx, top_k)

    # Context Relevance: avg similarity of retrieved chunks to query
    ctx_rel = np.mean([scores[i] for i in top_idx])

    # Answer Coverage: similarity between retrieved context and
    # ground truth answer (proxy for information sufficiency)
    retrieved_text = " ".join([chunks[i] for i in top_idx])
    ret_emb = model.encode([retrieved_text], normalize_embeddings=True)
    ans_emb = model.encode([answer], normalize_embeddings=True)
    coverage = float((ret_emb @ ans_emb.T)[0, 0])

    results.append({
        "recall": recall, "ctx_rel": ctx_rel, "coverage": coverage,
        "question": question, "top_idx": top_idx.tolist(),
        "relevant": relevant_idx,
    })

    hit = "HIT" if recall > 0 else "MISS"
    print(f"{question[:40]:<42} {recall:>5.2f} {ctx_rel:>8.3f} "
          f"{coverage:>9.3f} [{hit}]")

print(f"\n--- Summary ---")
print(f"Mean Recall@3:        "
      f"{np.mean([r['recall'] for r in results]):.3f}")
print(f"Mean Context Relevance: "
      f"{np.mean([r['ctx_rel'] for r in results]):.3f}")
print(f"Mean Answer Coverage:   "
      f"{np.mean([r['coverage'] for r in results]):.3f}")

misses = [r for r in results if r['recall'] == 0]
if misses:
    print(f"\nFailed queries ({len(misses)}):")
    for r in misses:
        print(f"  '{r['question'][:50]}' -- retrieved {r['top_idx']}, "
              f"needed {r['relevant']}")

Queries with distinctive vocabulary ("FAISS", "BERT", "k-means") tend to get perfect recall because those terms are strong discriminators in embedding space. More generic queries ("How does gradient descent work?") can miss because the vocabulary overlaps with multiple chunks about optimization. This is exactly the kind of analysis that tells you where hybrid search (adding keyword matching, as we covered at the end of episode #63) would help the most.

On to today's episode

Here we go! In episode #64 we built a complete RAG pipeline from scratch -- document chunking, embedding, retrieval, prompt construction, and evaluation. That pipeline (sometimes called "naive RAG" in the literature) works surprisingly well for straightforward question-answering. But it breaks down in predictable ways once your queries or your knowledge base get more complex.

The user asks "How do I make Python faster?" but the documents talk about "performance optimization", "profiling", and "caching strategies." The embedding model can't bridge that vocabulary gap. Or a question like "How does company X's Q3 revenue compare to the industry average?" requires pulling facts from two different documents and combining them. Or you retrieve 20 candidates and the right answer is somewhere in the middle of the ranked list -- not at the top where it needs to be.

These aren't edge cases. They're the bread and butter of real-world information retrieval, and every production RAG system has to deal with them. The good news: there are well-understood techniques for each failure mode. And they build directly on what we already know ;-)

Hybrid search -- dense plus sparse

Pure vector search has a blind spot that I hinted at in episode #63 when we discussed combining semantic and keyword approaches. Here's the problem stated clearly: if your document mentions "PyTorch 2.1 torch.compile" and the user searches for "torch.compile", a keyword search finds it instantly. A semantic search might rank it lower because semantically similar concepts like "model compilation" or "JIT tracing" score higher in embedding space.

Hybrid search fixes this by running BOTH a vector search (dense retrieval) and a keyword search (sparse retrieval), then merging the results.

BM25 is the standard sparse retrieval algorithm. We touched on TF-IDF back in episode #30 -- BM25 is essentially a refined version that accounts for term frequency saturation, document length normalization, and corpus statistics. It's been the backbone of search engines since the 1990s and it's still excellent at what it does: finding documents that contain the exact terms you're looking for.
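To make "term frequency saturation" and "document length normalization" a bit more concrete, here is a minimal from-scratch sketch of the textbook BM25 formula. It is meant for intuition only: the function name `bm25_scores`, the toy documents and the parameter values `k1=1.5`, `b=0.75` are my own assumptions, and it uses the common "plus one" IDF variant, so the numbers will not necessarily match what `rank_bm25` produces.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Score every document against the query with textbook BM25."""
    n_docs = len(docs_tokens)
    avg_len = sum(len(d) for d in docs_tokens) / n_docs
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for d in docs_tokens for term in set(d))

    scores = []
    for doc in docs_tokens:
        tf = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue
            # Corpus statistics: rare terms get a higher IDF weight
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalization: longer-than-average docs are penalized
            norm = k1 * (1 - b + b * len(doc) / avg_len)
            # Saturation: this ratio approaches (k1 + 1) as tf grows
            score += idf * tf[term] * (k1 + 1) / (tf[term] + norm)
        scores.append(score)
    return scores

docs = [
    "torch.compile speeds up pytorch models".split(),
    "model compilation improves inference speed".split(),
    "torch.compile torch.compile torch.compile everywhere".split(),
]
scores = bm25_scores(["torch.compile"], docs)
for doc, s in zip(docs, scores):
    print(f"{s:.3f}  {' '.join(doc)}")
```

The third document repeats the term three times and scores higher than the first, but nowhere near three times as high. That is the saturation effect. The second document never mentions the term and scores exactly zero, no matter how semantically related it is.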

from rank_bm25 import BM25Okapi
import numpy as np

class HybridSearch:
    """Combine dense (semantic) and sparse (keyword) retrieval."""

    def __init__(self, embedder):
        self.embedder = embedder
        self.chunks = []
        self.dense_embeddings = None
        self.bm25 = None

    def index(self, chunks):
        self.chunks = chunks
        # Dense: encode all chunks
        self.dense_embeddings = self.embedder.encode(
            chunks, normalize_embeddings=True)
        # Sparse: build BM25 index from tokenized chunks
        tokenized = [c.lower().split() for c in chunks]
        self.bm25 = BM25Okapi(tokenized)

    def search(self, query, top_k=5, alpha=0.5):
        """Search with weighted combination of dense + sparse.

        alpha=1.0 -> pure semantic
        alpha=0.0 -> pure keyword
        """
        # Dense retrieval (cosine similarity)
        q_emb = self.embedder.encode(
            [query], normalize_embeddings=True)
        dense_scores = (self.dense_embeddings @ q_emb.T).flatten()

        # Sparse retrieval (BM25)
        sparse_scores = self.bm25.get_scores(query.lower().split())

        # Normalize both to [0, 1]
        d_min, d_max = dense_scores.min(), dense_scores.max()
        dense_norm = (dense_scores - d_min) / (d_max - d_min + 1e-8)

        s_min, s_max = sparse_scores.min(), sparse_scores.max()
        sparse_norm = (sparse_scores - s_min) / (s_max - s_min + 1e-8)

        # Weighted combination
        combined = alpha * dense_norm + (1 - alpha) * sparse_norm
        top_idx = np.argsort(combined)[::-1][:top_k]

        return [(self.chunks[i], float(combined[i]),
                 float(dense_norm[i]), float(sparse_norm[i]))
                for i in top_idx]


# Demo: show where hybrid outperforms either alone
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('all-MiniLM-L6-v2')

search = HybridSearch(model)
search.index([
    "PyTorch 2.1 introduced torch.compile for model optimization",
    "Model compilation improves inference speed significantly",
    "The torch.nn module provides neural network building blocks",
    "Compiling Python code with Cython can boost performance",
    "TorchScript traces models for deployment in C++ environments",
    "JIT compilation converts code to machine instructions at runtime",
    "Performance optimization requires profiling bottlenecks first",
    "The torch.compile function uses TorchDynamo and TorchInductor",
])

query = "torch.compile"
print(f"Query: '{query}'\n")
print(f"{'Rank':<5} {'Combined':>9} {'Dense':>7} {'Sparse':>8}  Text")
print("-" * 80)
for rank, (text, combined, dense, sparse) in enumerate(
        search.search(query, top_k=5, alpha=0.5), 1):
    print(f"{rank:<5} {combined:>9.3f} {dense:>7.3f} {sparse:>8.3f}  "
          f"{text[:55]}")

print("\n--- Compare approaches ---")
for name, alpha_val in [("Pure semantic", 1.0), ("Pure keyword", 0.0),
                         ("Hybrid (0.5)", 0.5)]:
    results = search.search(query, top_k=3, alpha=alpha_val)
    top_texts = [t[:45] for t, _, _, _ in results]
    print(f"{name:<16}: {' | '.join(top_texts)}")

The alpha parameter controls the balance. As a rule of thumb, an alpha between 0.3 and 0.7 is a sensible starting range for most datasets. The optimal value depends on your content: technical documentation with lots of specific identifiers benefits from more keyword weight (lower alpha), while conversational content benefits from more semantic weight (higher alpha). You can tune it on a small validation set of query-relevance pairs.
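Tuning alpha on a validation set can be as simple as a grid search. The sketch below assumes you have already computed min-max normalized dense and sparse score arrays per validation query (the way `HybridSearch.search` does internally) and know which chunk index is relevant. The helper names `recall_at_k` and `tune_alpha` and the tiny validation set are hypothetical, and the scores are invented to show the mechanics.

```python
import numpy as np

def recall_at_k(combined, relevant_idx, k):
    """1.0 if the relevant chunk is in the top k, else 0.0."""
    top = np.argsort(combined)[::-1][:k]
    return float(relevant_idx in top)

def tune_alpha(validation, alphas, k=1):
    """Pick the alpha with the best mean recall@k.

    validation: list of (dense_norm, sparse_norm, relevant_idx), where the
    score arrays are already min-max normalized to [0, 1].
    """
    best_alpha, best_recall = None, -1.0
    for alpha in alphas:
        recalls = [
            recall_at_k(alpha * d + (1 - alpha) * s, rel, k)
            for d, s, rel in validation
        ]
        mean_recall = float(np.mean(recalls))
        # Strict '>' keeps the first alpha that reaches the best recall
        if mean_recall > best_recall:
            best_alpha, best_recall = alpha, mean_recall
    return best_alpha, best_recall

# Invented scores: query A needs the keyword signal, query B the semantic one
validation = [
    (np.array([0.0, 1.0, 0.6]), np.array([0.0, 0.0, 1.0]), 2),  # query A
    (np.array([1.0, 0.3, 0.0]), np.array([0.4, 1.0, 0.0]), 0),  # query B
]
best_alpha, best_recall = tune_alpha(validation, [0.0, 0.25, 0.5, 0.75, 1.0])
print(f"best alpha={best_alpha}, recall@1={best_recall:.2f}")
```

In this toy set, pure semantic search and pure keyword search each miss one of the two queries, and only the balanced mix gets both right. With real data you would want at least a few dozen labeled queries before trusting the chosen value.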

Why does this matter so much? Because semantic search and keyword search fail on different queries. Semantic search catches paraphrases and conceptual matches ("how to speed up Python" matches "performance optimization techniques"). Keyword search catches exact terms, names, version numbers, and identifiers that embeddings sometimes flatten into generic concept space. The combination is stronger than either alone -- consistently, across essentially every benchmark I've seen.

Re-ranking -- the second pass

The embedding model used for initial retrieval is designed to be fast. It encodes query and document independently (that's the bi-encoder architecture from episode #59), which means you can pre-compute all document embeddings and just compute the query embedding at search time. Quick matrix multiply, done. But encoding query and document independently means the model can't see fine-grained interactions between them.

A re-ranker (typically implemented as a cross-encoder) is a different beast. It takes the query AND document together as a single input, processes them jointly through the transformer, and outputs a relevance score. It's slower, because every query-document pair needs its own forward pass and nothing can be pre-computed, but MUCH more accurate.

The pattern: retrieve top 20-50 candidates with the fast bi-encoder, then re-rank those candidates with the accurate cross-encoder, return the top 5.

from sentence_transformers import SentenceTransformer, CrossEncoder
import numpy as np

class RerankedRAG:
    """Two-stage retrieval: fast bi-encoder + accurate cross-encoder."""

    def __init__(self):
        self.embedder = SentenceTransformer('all-MiniLM-L6-v2')
        self.reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')
        self.chunks = []
        self.embeddings = None

    def index(self, chunks):
        self.chunks = chunks
        self.embeddings = self.embedder.encode(
            chunks, normalize_embeddings=True)

    def search(self, query, initial_k=20, final_k=5):
        # Stage 1: fast retrieval with bi-encoder
        q_emb = self.embedder.encode(
            [query], normalize_embeddings=True)
        scores = (self.embeddings @ q_emb.T).flatten()
        candidates_idx = np.argsort(scores)[::-1][:initial_k]

        # Stage 2: re-rank candidates with cross-encoder
        pairs = [(query, self.chunks[i]) for i in candidates_idx]
        rerank_scores = self.reranker.predict(pairs)

        # Sort by reranker score, return top final_k
        ranked = sorted(
            zip(candidates_idx, rerank_scores),
            key=lambda x: x[1], reverse=True)

        return [(self.chunks[i], float(s)) for i, s in ranked[:final_k]]


# Demonstrate where re-ranking corrects bi-encoder mistakes
rag = RerankedRAG()
rag.index([
    "Python is not suitable for real-time embedded systems.",
    "Python is the most popular language for data science.",
    "Python is widely used in machine learning applications.",
    "Python can be used for embedded systems with MicroPython.",
    "Embedded systems require low-level memory control.",
    "Real-time systems have strict timing constraints.",
    "C and Rust are preferred for embedded programming.",
    "Python's garbage collector makes it unsuitable for hard RT.",
    "MicroPython runs on microcontrollers with limited resources.",
    "Data science workflows benefit from Python's rich ecosystem.",
])

query = "Is Python good for embedded systems?"
print(f"Query: '{query}'\n")

# Compare: bi-encoder only vs with re-ranking
q_emb = rag.embedder.encode([query], normalize_embeddings=True)
bi_scores = (rag.embeddings @ q_emb.T).flatten()
bi_top = np.argsort(bi_scores)[::-1][:5]

print("Bi-encoder top 5:")
for rank, i in enumerate(bi_top, 1):
    print(f"  {rank}. [{bi_scores[i]:.3f}] {rag.chunks[i]}")

print("\nRe-ranked top 5:")
results = rag.search(query, initial_k=10, final_k=5)
for rank, (text, score) in enumerate(results, 1):
    print(f"  {rank}. [{score:.3f}] {text}")

Why does re-ranking help? The bi-encoder processes query and document separately, so it can't model fine-grained word interactions. "Python is not suitable for real-time embedded systems" and "Python is widely used in machine learning applications" both contain "Python" and similar vocabulary, so the bi-encoder might rank them similarly. The cross-encoder processes the query and document together, attending to every word interaction. That lets it take the negation "not suitable" into account when scoring the pair, something the bi-encoder's independent embeddings tend to blur.

The cost is latency. Re-ranking 20 candidates adds roughly 50-200ms depending on the cross-encoder model and your hardware. For most applications that's perfectly fine. For latency-critical systems, you can shrink the candidate pool or use a smaller cross-encoder.

Having said that, the accuracy improvement from re-ranking is often dramatic. In my experience, re-ranking improves retrieval precision by 10-30% on most datasets. It's one of those techniques where the effort-to-payoff ratio is excellent -- a few lines of code for a substantial quality boost.

Query expansion and transformation

Sometimes the user's vocabulary just doesn't match the documents. "How to make Python faster?" versus documents that talk about "performance optimization", "profiling", "caching strategies", and "algorithmic complexity." The semantic gap is too wide for the embedding model to bridge reliably.

Query expansion generates additional search queries from the original, casting a wider net:

def expand_query_with_llm(query, llm_generate):
    """Use an LLM to generate alternative search queries.

    llm_generate: function that takes a prompt and returns text
    """
    prompt = f"""Generate 3 alternative search queries for the
following question. Each should use different vocabulary and
capture a different aspect.
Return only the queries, one per line.

Original: {query}

Alternative queries:"""
    response = llm_generate(prompt)
    alternatives = [q.strip() for q in response.strip().split('\n')
                    if q.strip()]
    return [query] + alternatives[:3]

# Simulated example (without actual LLM call)
original = "How to make Python faster?"
expanded = [
    original,
    "Python performance optimization techniques",
    "profiling and speeding up Python code",
    "Python bottleneck analysis caching strategies",
]

print(f"Original query:  {original}")
print(f"Expanded to {len(expanded)} queries:")
for i, q in enumerate(expanded):
    print(f"  {i+1}. {q}")
print(f"\nRun all {len(expanded)} queries against the vector store,")
print(f"collect all results, deduplicate by chunk ID.")
print(f"This catches documents using different terminology.")

You run all expanded queries against the vector store, collect all results, and deduplicate. The expanded queries use domain-appropriate vocabulary that the user might not have thought of, bridging the gap between the question's language and the document's language.
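The "collect and deduplicate" step could look something like the sketch below. It assumes each expanded query returns a list of `(chunk_id, score)` pairs. That layout, the function name `merge_results` and the example scores are my assumptions for illustration. Keeping the maximum score per chunk is one simple merge rule; summing scores or merging by rank are alternatives.

```python
def merge_results(result_lists, top_k=5):
    """Merge per-query results, keeping each chunk's best score.

    result_lists: one list per expanded query, each containing
    (chunk_id, score) pairs. This layout is an assumption for the sketch.
    """
    best = {}
    for results in result_lists:
        for chunk_id, score in results:
            # Keep the highest score seen for this chunk across all queries
            if chunk_id not in best or score > best[chunk_id]:
                best[chunk_id] = score
    ranked = sorted(best.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_k]

# Hypothetical retrieval output for three expanded queries
per_query = [
    [(4, 0.71), (9, 0.55)],
    [(9, 0.83), (2, 0.60)],
    [(2, 0.58), (7, 0.40)],
]
merged = merge_results(per_query, top_k=3)
print(merged)
```

Chunk 9 shows up for two of the queries but appears only once in the merged list, with its best score.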

HyDE -- Hypothetical Document Embeddings

This one is genuinely counterintuitive. Instead of searching with the user's question, you ask an LLM to generate a hypothetical answer to the question, then embed that answer and search for real documents similar to it.

import numpy as np

def hyde_search(query, llm_generate, embedder, index, chunks,
                top_k=5):
    """HyDE: search with a hypothetical answer, not the question.

    The idea: the hypothetical answer uses domain vocabulary
    that's closer to the actual documents than the question is.
    """
    # Step 1: generate hypothetical answer
    prompt = (f"Write a short, factual paragraph answering: {query}")
    hypothetical = llm_generate(prompt)

    # Step 2: embed the hypothetical answer (NOT the question)
    hyp_emb = embedder.encode(
        [hypothetical], normalize_embeddings=True)

    # Step 3: search for real documents similar to the hypothesis
    scores = (index @ hyp_emb.T).flatten()
    top_idx = np.argsort(scores)[::-1][:top_k]

    return top_idx, hypothetical


# Simulated example
query = "How to make Python faster?"

# What the LLM might generate as a hypothetical answer:
hypothetical_answer = (
    "Python performance can be improved through several techniques. "
    "Profiling with cProfile identifies bottlenecks. List "
    "comprehensions and generator expressions are faster than "
    "explicit loops. NumPy vectorization replaces Python loops with "
    "C-level operations. Caching with functools.lru_cache avoids "
    "redundant computation. For CPU-bound tasks, multiprocessing "
    "bypasses the GIL. Cython and PyPy offer compilation-based "
    "speedups. Algorithm selection often matters more than "
    "micro-optimization."
)

print(f"Original query: '{query}'")
print(f"\nHypothetical answer (generated by LLM):")
print(f"  {hypothetical_answer[:120]}...")
print(f"\nWhy this works:")
print(f"  The question uses: 'make Python faster'")
print(f"  The hypothesis uses: 'profiling', 'cProfile', 'vectorization',")
print(f"    'lru_cache', 'multiprocessing', 'GIL', 'Cython', 'PyPy'")
print(f"  These are the EXACT terms in the documents we want to find.")
print(f"  Even if the hypothesis is factually wrong, its vocabulary")
print(f"  is closer to the real documents than the original question.")

Wait, you're searching with a potentially hallucinated answer? Yes. And it works. The key insight: even if the hypothetical answer is factually imprecise, it uses vocabulary and concepts that are much closer to the actual documents than the user's question was. The embedding of "profiling with cProfile identifies bottlenecks" is closer to a real document about Python profiling than the embedding of "how to make Python faster" is. You're not using the hypothesis as an answer -- you're using it as a better search query.

HyDE was proposed by Gao et al. (2022) and is now widely used in RAG pipelines. It's especially effective when users ask questions in casual language about technical topics. The trade-off is one extra LLM call per query, which adds latency and cost before retrieval even starts.
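To make the vocabulary argument concrete, here is a minimal sketch that measures it. It assumes a toy bag-of-words "embedder" (`bow_embed`, a hypothetical stand-in, not a real embedding model) and a hand-written hypothesis instead of an actual LLM call. A real sentence-embedding model also captures semantics, so the gap you see with one will differ, but the direction of the effect is the same idea.

```python
import math
import re
from collections import Counter


def bow_embed(text):
    """Toy embedder (assumption): bag-of-words token counts.

    A real system would call a sentence-embedding model here.
    """
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[tok] * b[tok] for tok in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


query = "How to make Python faster?"
# Hand-written stand-in for what an LLM might generate
hypothetical = ("Python performance can be improved by profiling with "
                "cProfile and by using NumPy vectorization instead of loops.")
document = ("Profiling with cProfile identifies bottlenecks in Python code. "
            "NumPy vectorization replaces slow loops with C-level operations.")

doc_vec = bow_embed(document)
sim_query = cosine(bow_embed(query), doc_vec)
sim_hyde = cosine(bow_embed(hypothetical), doc_vec)

print(f"query      -> document similarity: {sim_query:.3f}")
print(f"hypothesis -> document similarity: {sim_hyde:.3f}")
```

The question shares only the word "Python" with the document, while the hypothesis shares most of the technical terms, so it scores noticeably higher against the same document.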

Multi-hop retrieval

Some questions simply can't be answered from a single retrieved document. "How does company X's Q3 revenue compare to the industry average?" requires: (1) finding the company's Q3 revenue figures, (2) finding the industry average, and (3) comparing them. No single chunk contains both pieces of information.

Multi-hop retrieval breaks complex questions into sub-questions, retrieves for each independently, and synthesises:

def multi_hop_rag(question, retrieve_fn, llm_generate, top_k=3):
    """Break complex questions into sub-questions, retrieve for each.

    retrieve_fn: function(query, k) -> list of (chunk, score)
    llm_generate: function(prompt) -> str
    """
    # Step 1: decompose the question
    decompose_prompt = f"""Break this question into 2-3 simpler
sub-questions that can each be answered from a single document:

Question: {question}

Sub-questions (one per line):"""

    sub_qs_text = llm_generate(decompose_prompt)
    sub_questions = [q.strip() for q in sub_qs_text.strip().split('\n')
                     if q.strip()]

    # Step 2: retrieve for each sub-question
    all_contexts = {}
    for sq in sub_questions:
        results = retrieve_fn(sq, k=top_k)
        for chunk, score in results:
            if chunk not in all_contexts:
                all_contexts[chunk] = {
                    'score': score, 'sources': [sq]}
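            else:
                entry = all_contexts[chunk]
                entry['sources'].append(sq)
                # Keep the best score seen across sub-questions
                entry['score'] = max(entry['score'], score)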

    # Step 3: build combined context (deduplicated)
    sorted_ctx = sorted(all_contexts.items(),
                        key=lambda x: x[1]['score'], reverse=True)

    # Step 4: generate final answer from all gathered context
    context_text = "\n\n".join([chunk for chunk, _ in sorted_ctx])
    final_prompt = f"""Answer the question using the context below.
Cite which parts of the context support your answer.

Context:
{context_text}

Question: {question}
Answer:"""

    return final_prompt, sub_questions, sorted_ctx


# Simulated example
question = ("How does Company X's Q3 revenue compare to "
            "the industry average?")

# What the LLM would decompose into:
sub_questions = [
    "What was Company X's Q3 revenue?",
    "What is the industry average Q3 revenue?",
    "How do these figures compare?",
]

print(f"Original question: {question}\n")
print(f"Decomposed into {len(sub_questions)} sub-questions:")
for i, sq in enumerate(sub_questions, 1):
    print(f"  {i}. {sq}")
print(f"\nEach sub-question gets its own retrieval call.")
print(f"Results are deduplicated and merged.")
print(f"The LLM then synthesises from the combined context.")

Multi-hop is more expensive -- multiple retrieval calls, multiple LLM calls for decomposition and synthesis. But it's necessary for questions that require combining information from different parts of your knowledge base. The decomposition step is itself a prompt engineering challenge -- the LLM needs to correctly identify what sub-information is needed without generating redundant or irrelevant sub-questions.

In production systems, you'd add caching (store sub-question embeddings for reuse), parallelise the retrieval calls, and set a hard limit on the number of sub-questions to control cost.
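Those three production measures can be sketched with the standard library alone. This is an illustration under assumptions, not a reference implementation: `MAX_SUB_QUESTIONS`, `make_cached_retriever`, `retrieve_for_sub_questions` and `fake_retrieve` are hypothetical names, and for simplicity the cache stores whole retrieval results per sub-question rather than the embeddings themselves.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import lru_cache

MAX_SUB_QUESTIONS = 3  # assumed hard cap; tune to your cost budget


def make_cached_retriever(retrieve_fn, maxsize=1024):
    """Wrap retrieve_fn(query, k) so repeated queries skip retrieval."""
    @lru_cache(maxsize=maxsize)
    def cached(query, k):
        # Tuples are immutable, so callers cannot corrupt cached results
        return tuple(retrieve_fn(query, k=k))
    return cached


def retrieve_for_sub_questions(sub_questions, cached_retrieve, top_k=3):
    """Deduplicate and cap the sub-questions, then retrieve in parallel."""
    capped = list(dict.fromkeys(sub_questions))[:MAX_SUB_QUESTIONS]
    if not capped:
        return {}
    with ThreadPoolExecutor(max_workers=len(capped)) as pool:
        results = list(pool.map(
            lambda sq: cached_retrieve(sq, top_k), capped))
    return dict(zip(capped, results))


# Stand-in retriever (hypothetical) that records how often it is called
call_log = []


def fake_retrieve(query, k=3):
    call_log.append(query)
    return [(f"chunk about: {query}", 1.0)][:k]


cached = make_cached_retriever(fake_retrieve)
subs = ["What was Company X's Q3 revenue?",
        "What is the industry average Q3 revenue?",
        "How do these figures compare?",
        "A fourth sub-question that exceeds the cap"]

first = retrieve_for_sub_questions(subs, cached)
second = retrieve_for_sub_questions(subs[:2], cached)  # served from cache

print(f"Sub-questions retrieved: {len(first)}")
print(f"Retriever calls made:    {len(call_log)}")
```

Threads are a reasonable fit here because retrieval calls are usually I/O-bound (a vector database or an API), so the GIL is not the bottleneck.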

Evaluating RAG systems

How do you know if your RAG system is actually good? This question has three separate dimensions, and evaluating only one gives you a misleading picture.

Retrieval quality: are we finding the right documents?

Recall@k: what fraction of relevant documents appear in the top k results?
Mean Reciprocal Rank (MRR): how high does the first relevant document rank?
NDCG: do relevant documents rank above irrelevant ones?

We built all three of these metrics in exercise 3 of episode #63 and in exercise 3 of episode #64. They tell you whether the retrieval pipeline is doing its job -- finding the needle in the haystack.
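As a compact refresher, the three retrieval metrics can be written in a few lines each. This is a sketch assuming binary relevance (a chunk is either relevant or not) and ranked lists without duplicate ids; the function names are mine, not necessarily those used in the earlier exercises.

```python
import math


def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant ids that appear in the top k."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant)
    return hits / len(relevant)


def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / rank of the first relevant id, or 0.0 if none was retrieved."""
    relevant = set(relevant_ids)
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs):
    """Average reciprocal rank over (ranked_ids, relevant_ids) pairs."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)


def ndcg_at_k(ranked_ids, relevant_ids, k):
    """NDCG@k with binary gains: DCG divided by the ideal DCG."""
    relevant = set(relevant_ids)
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, doc_id in enumerate(ranked_ids[:k], start=1)
              if doc_id in relevant)
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1)
               for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0


ranked = [7, 2, 9, 4, 1]   # chunk ids in retrieval order
relevant = {2, 4}          # ground-truth relevant chunk ids

print(f"Recall@3: {recall_at_k(ranked, relevant, 3):.3f}")
print(f"RR:       {reciprocal_rank(ranked, relevant):.3f}")
print(f"NDCG@5:   {ndcg_at_k(ranked, relevant, 5):.3f}")
```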

Generation quality: given the right documents, does the LLM produce a good answer?

Faithfulness: does the answer only contain information present in the retrieved context? (Detecting hallucination is the big one here)
Relevance: does the answer actually address the question?
Completeness: does the answer cover all aspects that the context supports?

End-to-end quality: does the full pipeline produce correct answers?

Answer accuracy: are the answers factually correct?
Citation accuracy: do cited sources actually support the claims?

def evaluate_faithfulness(answer, context, check_fn):
    """Check if every claim in the answer is supported by context.

    In production, check_fn would be an LLM call.
    Here we simulate with keyword overlap.
    """
    answer_sentences = [s.strip() for s in answer.split('.')
                        if s.strip()]
    context_lower = context.lower()

    results = []
    for sent in answer_sentences:
        # Simple check: do key words from the sentence
        # appear in the context?
        words = set(sent.lower().split())
        stopwords = {'the', 'a', 'an', 'is', 'are', 'was', 'were',
                     'in', 'on', 'at', 'to', 'for', 'of', 'and',
                     'or', 'it', 'this', 'that', 'with', 'by'}
        key_words = words - stopwords
        if not key_words:
            continue
        overlap = sum(1 for w in key_words if w in context_lower)
        score = overlap / len(key_words)
        supported = score > 0.5
        results.append({
            'claim': sent[:60],
            'support_score': score,
            'verdict': 'SUPPORTED' if supported else 'NOT SUPPORTED'
        })

    return results


# Example
context = ("RAG combines retrieval with generation to reduce "
           "hallucination. The retriever finds relevant passages "
           "from a knowledge base. The generator produces answers "
           "conditioned on those passages.")

good_answer = ("RAG reduces hallucination by retrieving relevant "
               "passages from a knowledge base. The generator "
               "then produces answers based on those passages.")

bad_answer = ("RAG was invented by Google in 2019 and uses "
              "reinforcement learning to improve retrieval. "
              "It requires a minimum of 1 million documents.")

print("=== Good answer (faithful) ===")
for r in evaluate_faithfulness(good_answer, context, None):
    print(f"  [{r['verdict']:<14}] ({r['support_score']:.2f}) "
          f"{r['claim']}")

print("\n=== Bad answer (hallucinated) ===")
for r in evaluate_faithfulness(bad_answer, context, None):
    print(f"  [{r['verdict']:<14}] ({r['support_score']:.2f}) "
          f"{r['claim']}")

print("\nIn production, use an LLM as judge (not keyword overlap).")
print("Frameworks: RAGAS, TruLens, DeepEval automate this.")

LLM-as-judge -- using one LLM to evaluate another LLM's output -- is the practical standard for RAG evaluation in 2025/2026. You prompt a strong model (like GPT-4) to check whether an answer is faithful to its context, whether it addresses the question, and whether it's complete. Frameworks like RAGAS, TruLens, and DeepEval package this into ready-to-use evaluation pipelines.
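To show the shape of an LLM-as-judge check without depending on any particular provider, here is a minimal sketch. The prompt wording, the FAITHFUL/UNFAITHFUL labels and the `stub_judge` function are all assumptions for illustration; the frameworks mentioned above use their own prompts and scoring schemes, and in practice `llm_generate` would be a call to a real model.

```python
def build_judge_prompt(question, context, answer):
    """Build a faithfulness-grading prompt for a judge model."""
    return (
        "You are grading a RAG answer for faithfulness.\n"
        "Reply with FAITHFUL if every claim in the answer is supported "
        "by the context, otherwise reply with UNFAITHFUL. Put the "
        "verdict alone on the first line, then one line of "
        "justification.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n\n"
        f"Answer: {answer}\n\n"
        "Verdict:"
    )


def parse_verdict(judge_output):
    """Map the judge's first line to a verdict label."""
    lines = judge_output.strip().splitlines()
    if not lines:
        return "UNKNOWN"
    first = lines[0].strip().upper()
    # Check UNFAITHFUL first: "FAITHFUL" is a substring of it
    if first.startswith("UNFAITHFUL"):
        return "UNFAITHFUL"
    if first.startswith("FAITHFUL"):
        return "FAITHFUL"
    return "UNKNOWN"


def judge_faithfulness(question, context, answer, llm_generate):
    """llm_generate: function(prompt) -> str, same contract as above."""
    raw = llm_generate(build_judge_prompt(question, context, answer))
    return parse_verdict(raw)


def stub_judge(prompt):
    """Hypothetical stand-in for a real LLM call."""
    return "UNFAITHFUL\nThe context never mentions Google or 2019."


verdict = judge_faithfulness(
    "What is RAG?",
    "RAG combines retrieval with generation to reduce hallucination.",
    "RAG was invented by Google in 2019.",
    stub_judge)
print(f"Judge verdict: {verdict}")
```

Returning UNKNOWN for unparseable output matters more than it looks: judge models do not always follow the format, and silently counting those cases as passes would inflate your faithfulness score.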

The important thing is understanding what you're measuring. Two types of failures to watch for:

Retrieval failures: the right answer exists in the knowledge base but wasn't retrieved. Fix: better embeddings, hybrid search, query expansion.
Generation failures: the right context was retrieved but the model answered wrong anyway (hallucinated, ignored context, or misunderstood). Fix: better prompting, re-ranking to put the best context first (remember the "lost in the middle" problem from episode #64), or a different LLM.

RAG vs fine-tuning -- when to use which

This is a question I see come up constantly, and the answer is more nuanced than most people make it.

Use RAG when:

Knowledge changes frequently (documents updated daily or weekly)
You need citations and traceability -- "according to document X"
The knowledge base is large and diverse
You need to add new information without retraining
Privacy matters: data stays in your infrastructure, not in model weights
You want to experiment quickly without GPU-intensive training

Use fine-tuning when:

You need the model to learn a specific style or behavior (not just facts)
Latency is critical and you can't afford the retrieval overhead
The knowledge is stable and well-defined (medical protocols, legal codes)
You want the model to internalize domain expertise, not just reference it
The task requires a specialized output format consistently

Use both -- and this is the common production pattern:

# The "RAG + fine-tuned model" architecture
architecture = {
    "Fine-tuned model": {
        "provides": [
            "Domain-specific vocabulary understanding",
            "Consistent output formatting",
            "Appropriate tone and style for the domain",
            "Better instruction following for domain tasks",
        ],
        "example": "Fine-tune on 1000 medical Q&A pairs so the model "
                   "understands clinical terminology and answers in "
                   "structured medical report format.",
    },
    "RAG layer": {
        "provides": [
            "Up-to-date factual information",
            "Citations and source traceability",
            "Access to private/proprietary documents",
            "Easy updates without retraining",
        ],
        "example": "Retrieve from a database of current drug "
                   "interactions, clinical guidelines, and recent "
                   "research papers updated weekly.",
    },
    "Combined": {
        "result": "The fine-tuned model understands medical language "
                  "and output conventions. RAG provides the specific, "
                  "current facts. Together: accurate, well-formatted, "
                  "cited medical answers.",
    }
}

for component, info in architecture.items():
    print(f"\n{component}:")
    if 'provides' in info:
        for item in info['provides']:
            print(f"  - {item}")
    if 'example' in info:
        print(f"  Example: {info['example']}")
    if 'result' in info:
        print(f"  -> {info['result']}")

print("\n--- Decision matrix ---")
factors = [
    ("Knowledge freshness", "Dynamic", "Static", "RAG"),
    ("Need for citations", "Yes", "No", "RAG"),
    ("Output style/format", "Standard", "Custom", "Fine-tune"),
    ("Latency budget", "Relaxed", "Tight", "Fine-tune"),
    ("Domain vocabulary", "General", "Specialized", "Fine-tune"),
    ("Update frequency", "Weekly+", "Yearly", "RAG"),
]

print(f"{'Factor':<22} {'Favors RAG':<14} {'Favors FT':<14} "
      f"{'Winner'}")
print("-" * 60)
for factor, rag_val, ft_val, winner in factors:
    print(f"{factor:<22} {rag_val:<14} {ft_val:<14} {winner}")

The key takeaway: RAG and fine-tuning solve different problems. RAG is about what the model knows (external knowledge). Fine-tuning is about how the model behaves (style, format, domain adaptation). Most production systems benefit from both, and understanding which problem you're solving determines which tool to reach for.

Putting it all together -- the advanced RAG stack

Here's what a production RAG pipeline actually looks like once you layer these techniques:

class AdvancedRAGPipeline:
    """Production RAG pipeline with hybrid search + re-ranking."""

    def __init__(self, embedder, reranker, bm25_index, chunks):
        self.embedder = embedder
        self.reranker = reranker
        self.bm25 = bm25_index
        self.chunks = chunks
        self.embeddings = embedder.encode(
            chunks, normalize_embeddings=True)

    def query(self, question, alpha=0.5, initial_k=20,
              final_k=5, threshold=0.0):
        # Stage 1: Hybrid retrieval (dense + sparse)
        q_emb = self.embedder.encode(
            [question], normalize_embeddings=True)
        dense_scores = (self.embeddings @ q_emb.T).flatten()
        sparse_scores = self.bm25.get_scores(
            question.lower().split())

        # Normalize
        d_range = dense_scores.max() - dense_scores.min() + 1e-8
        dense_norm = (dense_scores - dense_scores.min()) / d_range
        s_range = sparse_scores.max() - sparse_scores.min() + 1e-8
        sparse_norm = (sparse_scores - sparse_scores.min()) / s_range

        combined = alpha * dense_norm + (1 - alpha) * sparse_norm
        candidates = np.argsort(combined)[::-1][:initial_k]

        # Stage 2: Re-rank with cross-encoder
        pairs = [(question, self.chunks[i]) for i in candidates]
        rerank_scores = self.reranker.predict(pairs)

        # Stage 3: Filter by threshold and return top results
        ranked = sorted(
            zip(candidates, rerank_scores),
            key=lambda x: x[1], reverse=True)
        results = [(self.chunks[i], float(s))
                    for i, s in ranked[:final_k]
                    if s > threshold]
        return results

# Usage pattern:
# pipeline = AdvancedRAGPipeline(embedder, reranker, bm25, chunks)
# results = pipeline.query("How does torch.compile work?")
# prompt = build_prompt(question, results)
# answer = llm.generate(prompt)

print("Advanced RAG pipeline stages:")
print("  1. Hybrid retrieval: dense + BM25 keyword search")
print("  2. Re-ranking: cross-encoder scores query-doc pairs")
print("  3. Threshold filtering: remove low-confidence matches")
print("  4. Prompt construction: format context for the LLM")
print("  5. Generation: LLM answers using retrieved context")
print("\nOptional additions:")
print("  - Query expansion (before stage 1)")
print("  - HyDE (before stage 1)")
print("  - Multi-hop decomposition (wraps the entire pipeline)")

Each stage addresses a specific failure mode: hybrid search handles vocabulary mismatch, re-ranking handles imprecise initial scoring, and threshold filtering prevents noise from contaminating the context. You don't need all of them for every application -- start with the basics from episode #64, measure where your system fails, and add the appropriate technique.
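The usage pattern above calls a `build_prompt` helper that this episode does not define. One possible version is sketched below, assuming `results` is the list of `(chunk, score)` tuples that `AdvancedRAGPipeline.query` returns. The numbered-citation format and the fallback text are my own choices, so adapt them to your prompt style.

```python
def build_prompt(question, results, max_chunks=5):
    """Format (chunk, score) results into a numbered-context prompt.

    Hypothetical helper: results are assumed to be sorted best-first.
    """
    numbered = [f"[{n}] {chunk}"
                for n, (chunk, _score)
                in enumerate(results[:max_chunks], start=1)]
    context = ("\n\n".join(numbered) if numbered
               else "(no relevant context found)")
    return (
        "Answer the question using only the numbered context below. "
        "Cite sources as [n]. If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


example_results = [
    ("torch.compile traces the model and generates optimized kernels.",
     4.2),
    ("Eager mode executes operations one at a time.", 1.1),
]
prompt = build_prompt("How does torch.compile work?", example_results)
print(prompt)
```

Handling the empty case explicitly is deliberate: when threshold filtering removes every candidate, the model should be told there is no context rather than be left to answer from memory.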

Common RAG pitfalls in production

Before we wrap up, a few things I've run into that are worth calling out:

Chunk boundaries that split entities: if your chunking splits "The CEO, John Smith, announced..." into one chunk and "...the new product line would launch in Q3" into another, neither chunk is self-sufficient. Overlap helps (as we discussed in episode #64), but for structured documents, consider chunking at section or paragraph boundaries instead of fixed token counts.
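A paragraph-aware chunker can be sketched in a dozen lines. The assumptions here: paragraphs are separated by blank lines, and chunk size is measured in characters as a rough proxy for tokens (`max_chars` is a hypothetical parameter, so swap in a real tokenizer count if you need precision).

```python
def chunk_by_paragraph(text, max_chars=500):
    """Pack whole paragraphs into chunks of at most max_chars.

    Paragraphs are never split. A single paragraph longer than
    max_chars becomes its own oversized chunk.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], ""
    for para in paragraphs:
        candidate = f"{current}\n\n{para}" if current else para
        if not current or len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = para
    if current:
        chunks.append(current)
    return chunks


doc = ("The CEO, John Smith, announced that the new product line "
       "would launch in Q3.\n\n"
       "Analysts expect the launch to lift revenue.\n\n"
       "Separately, the company opened a new office in Rotterdam.")

for i, chunk in enumerate(chunk_by_paragraph(doc, max_chars=130), 1):
    print(f"--- chunk {i} ({len(chunk)} chars) ---")
    print(chunk)
```

Because the split points are paragraph breaks, the sentence about the CEO and the Q3 launch always stays in one chunk, whatever the size limit.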

Stale embeddings: if you update documents but don't re-embed them, the vector index is out of sync with the actual content. Production systems need an embedding refresh pipeline that triggers when documents change.
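One simple way to detect staleness is to store a content hash next to each embedding and compare on every sync. The sketch below assumes documents live in a `{doc_id: text}` dict and that you saved `{doc_id: hash}` at embedding time. Both structures and the function names are hypothetical, and a real pipeline would hook this into whatever event signals a document change.

```python
import hashlib


def content_hash(text):
    """Stable fingerprint of a document's current content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def find_stale(documents, indexed_hashes):
    """Compare current documents against hashes stored at embed time.

    Returns (to_embed, to_delete): ids that are new or changed, and
    ids that are still in the index but no longer exist.
    """
    to_embed = [doc_id for doc_id, text in documents.items()
                if indexed_hashes.get(doc_id) != content_hash(text)]
    to_delete = [doc_id for doc_id in indexed_hashes
                 if doc_id not in documents]
    return to_embed, to_delete


# State of the index after the last embedding run
indexed_hashes = {
    "guide.md": content_hash("Python uses indentation."),
    "old.md": content_hash("This file was later removed."),
}

# Current documents: guide.md was edited, new.md was added
documents = {
    "guide.md": "Python uses indentation to define code blocks.",
    "new.md": "Decorators modify the behavior of functions.",
}

to_embed, to_delete = find_stale(documents, indexed_hashes)
print(f"Re-embed: {to_embed}")
print(f"Delete:   {to_delete}")
```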

Context window waste: stuffing 10 mediocre chunks into the prompt when 3 excellent ones would suffice. The "lost in the middle" phenomenon from episode #64 means the LLM tends to under-use chunks in the middle of the context anyway. Better to retrieve fewer, higher-quality chunks.

Embedding model drift: if you switch embedding models, ALL your stored embeddings become invalid. Cosine similarities between vectors from different models are meaningless. Plan for re-indexing when you upgrade your embedding model.
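A cheap guard against mixing models is to store the embedding model's name and dimension alongside the vectors and refuse to query on a mismatch. This is a sketch with a plain dict as the "index" and hypothetical function names; a real vector store would keep the same information in its collection metadata.

```python
def make_index(vectors, model_name):
    """Bundle stored vectors with the model that produced them."""
    return {
        "model_name": model_name,
        "dim": len(vectors[0]) if vectors else 0,
        "vectors": vectors,
    }


def check_compatible(index, model_name, query_vector):
    """Raise if the query was embedded by a different model."""
    if index["model_name"] != model_name:
        raise ValueError(
            f"Index was built with '{index['model_name']}', but the "
            f"query uses '{model_name}'. Re-index before searching.")
    if len(query_vector) != index["dim"]:
        raise ValueError(
            f"Dimension mismatch: index has {index['dim']}, "
            f"query has {len(query_vector)}.")


index = make_index([[0.1, 0.2, 0.3], [0.3, 0.2, 0.1]],
                   model_name="all-MiniLM-L6-v2")

# Same model: passes silently
check_compatible(index, "all-MiniLM-L6-v2", [0.5, 0.5, 0.5])

# Different model: caught before any similarity is computed
try:
    check_compatible(index, "some-newer-model", [0.5, 0.5, 0.5])
    mismatch_caught = False
except ValueError as err:
    mismatch_caught = True
    print(f"Blocked: {err}")
```

Note that a dimension check alone is not enough: two different models can share the same dimension and still produce incomparable vectors, which is why the model name is checked first.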

import re

# Quick diagnostic: is your RAG system failing at retrieval
# or generation?
def diagnose_rag_failure(question, retrieved_chunks, answer,
                          ground_truth_answer, relevant_chunks):
    """Figure out WHERE the pipeline is breaking.

    retrieved_chunks: chunk texts returned by the retriever
    relevant_chunks: ground-truth chunk texts for this question
    """
    # Check retrieval: did at least one relevant chunk come back?
    retrieved_set = set(retrieved_chunks)
    relevant_set = set(relevant_chunks)
    retrieval_hit = bool(retrieved_set & relevant_set)

    # Check answer quality (simple word overlap proxy).
    # Tokenize with a regex so "1991." and "1991" count as a match.
    gt_words = set(re.findall(r"\w+", ground_truth_answer.lower()))
    ans_words = set(re.findall(r"\w+", answer.lower()))
    stopwords = {'the', 'a', 'an', 'is', 'are', 'was', 'in',
                 'on', 'to', 'for', 'of', 'and', 'or', 'it'}
    gt_key = gt_words - stopwords
    overlap = len(gt_key & ans_words) / max(len(gt_key), 1)

    if not retrieval_hit:
        diagnosis = "RETRIEVAL FAILURE: right docs not retrieved"
        fix = ("Try: hybrid search, query expansion, "
               "better embeddings")
    elif overlap < 0.3:
        diagnosis = ("GENERATION FAILURE: right docs retrieved "
                     "but answer is wrong")
        fix = ("Try: re-ranking, better prompt, "
               "put best chunks first")
    else:
        diagnosis = "PIPELINE OK: good retrieval + good answer"
        fix = "No fix needed"

    print(f"Question: {question[:50]}")
    print(f"Retrieval hit: {retrieval_hit}")
    print(f"Answer overlap: {overlap:.2f}")
    print(f"Diagnosis: {diagnosis}")
    print(f"Suggested fix: {fix}")
    return diagnosis

# Example: the relevant chunk was retrieved and the answer matches
diagnose_rag_failure(
    "What year was Python created?",
    ["Python was released in 1991 by Guido van Rossum."],
    "Python was created in 1991.",
    "Python was created in 1991 by Guido van Rossum.",
    ["Python was released in 1991 by Guido van Rossum."])

The bottom line

Hybrid search (dense + sparse/BM25) consistently outperforms either approach alone -- semantic catches paraphrases, keyword catches exact terms and identifiers;
Re-ranking with a cross-encoder improves precision by modeling fine-grained query-document interactions in a second pass. Typical improvement: 10-30%;
Query expansion (alternative queries, synonyms) and HyDE (hypothetical answers) bridge vocabulary gaps between user questions and document language;
Multi-hop retrieval decomposes complex questions into sub-questions, retrieves for each, and synthesises a combined answer from multiple sources;
Evaluate RAG on three dimensions: retrieval quality (right documents?), generation quality (faithful answer?), and end-to-end accuracy. Use LLM-as-judge for generation metrics;
RAG is for dynamic knowledge with citations; fine-tuning is for style and stable domain expertise. Production systems often use both together -- fine-tune for domain understanding, RAG for current facts;
Diagnose failures by separating retrieval from generation -- fix the right stage.

Exercises

Exercise 1: Build a hybrid search comparison benchmark. Create a knowledge base of 30 chunks spanning 3 topics (10 per topic). Write 15 queries (5 per topic) where some queries use exact terminology from the documents and others use paraphrases. Implement three search modes: pure semantic (alpha=1.0), pure BM25 (alpha=0.0), and hybrid (alpha=0.5). For each mode, compute Precision@3 and Recall@3 against ground truth relevant chunks. Print a comparison table showing where hybrid outperforms the individual approaches. Identify at least 2 queries where semantic search fails but keyword search succeeds, and 2 where the reverse is true.

Exercise 2: Implement a re-ranking evaluation harness. Create a corpus of 50 chunks. Write 10 queries with known relevant chunks (ground truth). For each query, retrieve top-20 candidates using a bi-encoder (SentenceTransformer), then re-rank with a cross-encoder (CrossEncoder). Compare the ranking before and after re-ranking by computing NDCG@5 and MRR for both stages. Print a per-query breakdown showing how many positions the correct chunk moved (up or down) after re-ranking. Calculate the average improvement in NDCG@5 from re-ranking.

Exercise 3: Build a RAG failure diagnostic tool. Create a test set of 10 question-answer pairs with ground truth relevant chunks. Implement a RAG pipeline (retrieve + generate prompt). For each query, classify the failure mode: (a) retrieval failure (relevant chunk not in top-3), (b) generation failure (relevant chunk retrieved but answer doesn't match ground truth), or (c) success. For retrieval failures, test whether query expansion (adding 2 manually written alternative queries) fixes the retrieval. For generation failures, test whether re-ranking the candidates (moving the relevant chunk to position 1) fixes the answer quality. Print a diagnostic report summarizing how many failures each technique would fix.

Thanks, and until next time!

Hive account@scipio

Curriculum (of the `Learn AI Series`):