Learn AI Series (#63) - Embeddings and Vector Search
What will I learn
- You will learn text embeddings -- representing meaning as dense vectors;
- sentence embeddings vs word embeddings -- why sentence-level matters for real applications;
- similarity metrics: cosine, euclidean, and dot product -- when to use each;
- vector databases: Chroma, FAISS, Qdrant and when to pick which;
- indexing strategies: flat, IVF, HNSW -- the speed vs accuracy tradeoff;
- building a semantic search system from scratch.
Requirements
- A working modern computer running macOS, Windows or Ubuntu;
- An installed Python 3(.11+) distribution;
- The ambition to learn AI and machine learning.
Difficulty
- Beginner
Curriculum (of the Learn AI Series):
- Learn AI Series (#1) - What Machine Learning Actually Is
- Learn AI Series (#2) - Setting Up Your AI Workbench - Python and NumPy
- Learn AI Series (#3) - Your Data Is Just Numbers - How Machines See the World
- Learn AI Series (#4) - Your First Prediction - No Math, Just Intuition
- Learn AI Series (#5) - Patterns in Data - What "Learning" Actually Looks Like
- Learn AI Series (#6) - From Intuition to Math - Why We Need Formulas
- Learn AI Series (#7) - The Training Loop - See It Work Step by Step
- Learn AI Series (#8) - The Math You Actually Need (Part 1) - Linear Algebra
- Learn AI Series (#9) - The Math You Actually Need (Part 2) - Calculus and Probability
- Learn AI Series (#10) - Your First ML Model - Linear Regression From Scratch
- Learn AI Series (#11) - Making Linear Regression Real
- Learn AI Series (#12) - Classification - Logistic Regression From Scratch
- Learn AI Series (#13) - Evaluation - How to Know If Your Model Actually Works
- Learn AI Series (#14) - Data Preparation - The 80% Nobody Talks About
- Learn AI Series (#15) - Feature Engineering and Selection
- Learn AI Series (#16) - Scikit-Learn - The Standard Library of ML
- Learn AI Series (#17) - Decision Trees - How Machines Make Decisions
- Learn AI Series (#18) - Random Forests - Wisdom of Crowds
- Learn AI Series (#19) - Gradient Boosting - The Kaggle Champion
- Learn AI Series (#20) - Support Vector Machines - Drawing the Perfect Boundary
- Learn AI Series (#21) - Mini Project - Predicting Crypto Market Regimes
- Learn AI Series (#22) - K-Means Clustering - Finding Groups
- Learn AI Series (#23) - Advanced Clustering - Beyond K-Means
- Learn AI Series (#24) - Dimensionality Reduction - PCA
- Learn AI Series (#25) - Advanced Dimensionality Reduction - t-SNE and UMAP
- Learn AI Series (#26) - Anomaly Detection - Finding What Doesn't Belong
- Learn AI Series (#27) - Recommendation Systems - "Users Like You Also Liked..."
- Learn AI Series (#28) - Time Series Fundamentals - When Order Matters
- Learn AI Series (#29) - Time Series Forecasting - Predicting What Comes Next
- Learn AI Series (#30) - Natural Language Processing - Text as Data
- Learn AI Series (#31) - Word Embeddings - Meaning in Numbers
- Learn AI Series (#32) - Bayesian Methods - Thinking in Probabilities
- Learn AI Series (#33) - Ensemble Methods Deep Dive - Stacking and Blending
- Learn AI Series (#34) - ML Engineering - From Notebook to Production
- Learn AI Series (#35) - Data Ethics and Bias in ML
- Learn AI Series (#36) - Mini Project - Complete ML Pipeline
- Learn AI Series (#37) - The Perceptron - Where It All Started
- Learn AI Series (#38) - Neural Networks From Scratch - Forward Pass
- Learn AI Series (#39) - Neural Networks From Scratch - Backpropagation
- Learn AI Series (#40) - Training Neural Networks - Practical Challenges
- Learn AI Series (#41) - Optimization Algorithms - SGD, Momentum, Adam
- Learn AI Series (#42) - PyTorch Fundamentals - Tensors and Autograd
- Learn AI Series (#43) - PyTorch Data and Training
- Learn AI Series (#44) - PyTorch nn.Module - Building Real Networks
- Learn AI Series (#45) - Convolutional Neural Networks - Theory
- Learn AI Series (#46) - CNNs in Practice - Classic to Modern Architectures
- Learn AI Series (#47) - CNN Applications - Detection, Segmentation, Style Transfer
- Learn AI Series (#48) - Recurrent Neural Networks - Sequences
- Learn AI Series (#49) - LSTM and GRU - Solving the Memory Problem
- Learn AI Series (#50) - Sequence-to-Sequence Models
- Learn AI Series (#51) - Attention Mechanisms
- Learn AI Series (#52) - The Transformer Architecture (Part 1)
- Learn AI Series (#53) - The Transformer Architecture (Part 2)
- Learn AI Series (#54) - Vision Transformers
- Learn AI Series (#55) - Generative Adversarial Networks
- Learn AI Series (#56) - Mini Project - Building a Transformer From Scratch
- Learn AI Series (#57) - Language Modeling - Predicting the Next Word
- Learn AI Series (#58) - GPT Architecture - Decoder-Only Transformers
- Learn AI Series (#59) - BERT and Encoder Models
- Learn AI Series (#60) - Training Large Language Models
- Learn AI Series (#61) - Instruction Tuning and Alignment
- Learn AI Series (#62) - Prompt Engineering - Getting the Most from LLMs
- Learn AI Series (#63) - Embeddings and Vector Search (this post)
Learn AI Series (#63) - Embeddings and Vector Search
Solutions to Episode #62 Exercises
Exercise 1: Build a prompt comparison benchmark with zero-shot, 2-shot, and 5-shot variants across 10 classification tasks.
import random
random.seed(42)
tasks = [
{
"name": "sentiment",
"labels": ["positive", "negative"],
"keywords": {"positive": ["great", "love", "excellent", "amazing",
"wonderful", "fantastic", "superb"],
"negative": ["terrible", "hate", "awful", "horrible",
"disgusting", "worst", "broken"]},
"examples": [
("This product is absolutely wonderful!", "positive"),
("Complete garbage, waste of money.", "negative"),
("I love everything about this service.", "positive"),
("Terrible experience, never again.", "negative"),
("Fantastic quality and fast shipping!", "positive"),
],
"test": [
("Great value for the price!", "positive"),
("Horrible customer service.", "negative"),
("I love how easy this is to use.", "positive"),
("The worst purchase I ever made.", "negative"),
("Excellent build quality overall.", "positive"),
]
},
{
"name": "spam",
"labels": ["spam", "not_spam"],
"keywords": {"spam": ["free", "winner", "click", "urgent",
"congratulations", "prize", "limited"],
"not_spam": ["meeting", "report", "project",
"schedule", "update", "attached"]},
"examples": [
("Congratulations! You won a FREE prize!", "spam"),
("Meeting rescheduled to 3pm tomorrow.", "not_spam"),
("URGENT: Click here to claim your reward!", "spam"),
("Please find the project report attached.", "not_spam"),
("Limited time offer: free gift inside!", "spam"),
],
"test": [
("Click now for your free vacation!", "spam"),
("Updated the project schedule.", "not_spam"),
("You are the lucky winner today!", "spam"),
("Quarterly report is ready for review.", "not_spam"),
("Urgent: claim your prize immediately.", "spam"),
]
},
]
# (truncated to 2 tasks for space -- in practice you'd have 10)
def simulate_model(prompt, test_text, task):
"""Keyword-based classification simulator."""
text_lower = test_text.lower()
scores = {}
for label, kws in task["keywords"].items():
scores[label] = sum(1 for w in kws if w in text_lower)
if max(scores.values()) == 0:
return task["labels"][0]
return max(scores, key=scores.get)
def run_benchmark(tasks, n_shots_list=[0, 2, 5]):
print(f"{'Task':<12} {'0-shot':>8} {'2-shot':>8} {'5-shot':>8}")
print("-" * 40)
for task in tasks:
accs = []
for n in n_shots_list:
correct = 0
for text, true_label in task["test"]:
pred = simulate_model("", text, task)
if pred == true_label:
correct += 1
accs.append(correct / len(task["test"]))
print(f"{task['name']:<12} {accs[0]:>8.1%} {accs[1]:>8.1%} "
f"{accs[2]:>8.1%}")
run_benchmark(tasks)
print("\nNote: simulated model uses keyword matching.")
print("Real API calls would show few-shot >> zero-shot.")
The benchmark harness is what matters here -- the simulate_model function is a placeholder for actual API calls. In production you'd swap it for openai.chat.completions.create(...) or equivalent, and you'd see real accuracy improvements with more examples.
Exercise 2: Chain-of-thought verifier that extracts and re-computes arithmetic steps.
import re
def verify_cot(question, cot_response):
"""Verify arithmetic steps in a chain-of-thought response."""
pattern = r'(\d+(?:\.\d+)?)\s*([+\-x*/])\s*(\d+(?:\.\d+)?)\s*=\s*(\d+(?:\.\d+)?)'
steps = re.findall(pattern, cot_response.replace('x', '*'))
results = []
for left, op, right, stated in steps:
left_f, right_f, stated_f = float(left), float(right), float(stated)
op_fixed = op.replace('x', '*')
if op_fixed == '+':
computed = left_f + right_f
elif op_fixed == '-':
computed = left_f - right_f
elif op_fixed in ('*', 'x'):
computed = left_f * right_f
elif op_fixed == '/':
computed = left_f / right_f if right_f != 0 else float('inf')
else:
computed = None
correct = abs(computed - stated_f) < 0.01 if computed else False
results.append({
"expr": f"{left} {op} {right}",
"stated": stated_f,
"computed": computed,
"correct": correct
})
return results
problems = [
("5 shelves * 8 boxes = 40, 40 * 12 = 480",
"5 * 8 = 40, 40 * 12 = 480"),
("3 + 4 = 7, 7 * 2 = 14",
"3 + 4 = 7, 7 * 2 = 14"),
("10 * 5 = 50, 50 + 3 = 53",
"10 * 5 = 55, 55 + 3 = 53"), # ERROR: 10*5=55
("24 / 3 = 8, 8 * 4 = 32",
"24 / 3 = 8, 8 * 4 = 36"), # ERROR: 8*4=36
("100 - 25 = 75, 75 / 5 = 15",
"100 - 25 = 75, 75 / 5 = 15"),
]
print(f"{'Problem':<35} {'Steps':>6} {'Errors':>7}")
print("-" * 52)
for question, cot in problems:
results = verify_cot(question, cot)
errors = sum(1 for r in results if not r["correct"])
status = "OK" if errors == 0 else f"{errors} BAD"
print(f"{cot[:33]:<35} {len(results):>6} {status:>7}")
for r in results:
if not r["correct"]:
print(f" ERROR: {r['expr']} stated={r['stated']}, "
f"computed={r['computed']}")
The verifier catches arithmetic errors by re-computing each step independently. Real self-consistency pipelines use this exact approach: generate multiple CoT paths, verify each, discard paths with arithmetic errors, majority-vote on the rest.
Exercise 3: Sampling parameter visualizer showing temperature, top-k, and top-p effects.
import torch
import torch.nn.functional as F
import math
torch.manual_seed(42)
logits = torch.randn(20) * 3 # 20-token vocabulary
logits[0] += 4.0 # make token 0 the most likely
logits[1] += 2.0 # token 1 second most likely
print("=== Temperature Scaling ===")
print(f"{'Temp':>6} {'Top prob':>10} {'Entropy':>10} {'Eff vocab':>10}")
for temp in [0.1, 0.5, 1.0, 2.0]:
probs = F.softmax(logits / temp, dim=-1)
top_p = probs.max().item()
entropy = -(probs * probs.clamp(min=1e-10).log()).sum().item()
eff_vocab = (probs > 0.01).sum().item()
print(f"{temp:>6.1f} {top_p:>10.4f} {entropy:>10.4f} {eff_vocab:>10}")
print(f"\n=== Top-k Filtering ===")
print(f"{'k':>6} {'Mass kept':>10} {'Eff vocab':>10}")
probs = F.softmax(logits, dim=-1)
sorted_p, _ = probs.sort(descending=True)
for k in [3, 5, 10, 20]:
kept = sorted_p[:k].sum().item()
print(f"{k:>6} {kept:>10.4f} {min(k, 20):>10}")
print(f"\n=== Top-p (Nucleus) Filtering ===")
print(f"{'p':>6} {'Tokens kept':>12} {'Mass':>10}")
cumsum = sorted_p.cumsum(dim=0)
for p in [0.5, 0.9, 0.95, 0.99]:
n_tokens = (cumsum < p).sum().item() + 1
actual_mass = sorted_p[:n_tokens].sum().item()
print(f"{p:>6.2f} {n_tokens:>12} {actual_mass:>10.4f}")
Lower temperature concentrates mass on the top token (0.1 makes it nearly deterministic). Top-k sets a hard cutoff regardless of the distribution shape. Top-p adapts -- when the model is confident, fewer tokens pass the threshold; when it's uncertain, more get through. This adaptive behaviour is why top-p is generally preferred over top-k in practice.
On to today's episode
Here we go! At the end of episode #62 I teased what was coming -- the idea of converting text into numerical vectors so you can search for "similar" content. That's exactly what we're building today. And this is NOT just academic -- embeddings and vector search are the foundation for one of the most practical applications of modern AI. Every search engine, every recommendation system, every "find similar documents" feature you've ever used is built on some variant of what we're about to cover.
Back in episode #31, we learned about word embeddings -- representing individual words as dense vectors where similar words live close together. "King" and "queen" are neighbors; "king" and "toaster" are not. That was powerful for its time. But word embeddings have a fundamental limitation: each word gets exactly ONE vector, regardless of context. The word "bank" gets the same embedding whether we're talking about a river bank or a financial bank. And there's no standard way to combine word embeddings into a meaningful representation of an entire sentence or paragraph.
Modern text embeddings solve both problems. They use transformer models (episode #52-53) to produce context-dependent, sentence-level vectors. "The bank of the river" and "The bank approved my loan" produce completely different embeddings because the model attends to the full context. And the resulting vector represents the meaning of the entire text, not just individual words.
This is the foundation for semantic search, RAG (which we'll get into next), recommendation systems, and clustering at scale. Let's build it from scratch ;-)
From words to sentences
A sentence embedding model takes arbitrary text and produces a fixed-size vector (typically 384 to 1536 dimensions) that captures semantic meaning. Texts with similar meanings produce vectors that are close together in this high-dimensional space -- regardless of whether they share any words.
# Using sentence-transformers (the standard library for this)
# pip install sentence-transformers
from sentence_transformers import SentenceTransformer
import numpy as np
model = SentenceTransformer('all-MiniLM-L6-v2') # 384-dim, fast
texts = [
"How do I learn Python programming?",
"What's the best way to start coding in Python?",
"Recipe for chocolate chip cookies",
"The weather in Amsterdam is rainy today",
]
embeddings = model.encode(texts)
print(f"Shape: {embeddings.shape}") # (4, 384)
# The first two are semantically similar despite different words
for i in range(len(texts)):
for j in range(i+1, len(texts)):
sim = np.dot(embeddings[i], embeddings[j]) / (
np.linalg.norm(embeddings[i]) * np.linalg.norm(embeddings[j]))
print(f"Sim({i},{j}): {sim:.3f} "
f"'{texts[i][:35]}' vs '{texts[j][:35]}'")
The first two texts -- "How do I learn Python programming?" and "What's the best way to start coding in Python?" -- should have high cosine similarity (0.8+) despite sharing almost no words. They mean the same thing. Traditional keyword search (TF-IDF, BM25, the stuff we built in episode #30) would miss this connection entirely because it relies on word overlap. Embedding-based search catches it because the model understands semantics, not just surface tokens.
And that's the whole game, really. Everything else in this episode is infrastructure to make this core idea work at scale.
How embedding models are trained
Most embedding models start from a pre-trained transformer (usually BERT-like, as we covered in episode #59) and are fine-tuned with a contrastive objective: push similar text pairs close together in embedding space, push dissimilar pairs apart.
The training data consists of pairs labeled as similar or dissimilar: question-answer pairs from forums (similar), paraphrases from NLI datasets (similar), random text combinations (dissimilar). The loss function -- typically InfoNCE -- pushes the model to produce high cosine similarity for positive pairs and low similarity for negative pairs.
import torch
import torch.nn.functional as F
def info_nce_loss(anchor, positive, negatives, temperature=0.05):
"""InfoNCE loss: the standard contrastive learning objective.
anchor: (batch, dim) -- the query embeddings
positive: (batch, dim) -- the matching embeddings
negatives: (batch, n_neg, dim) -- non-matching embeddings
"""
# Cosine similarity between anchor and positive
pos_sim = F.cosine_similarity(anchor, positive, dim=-1) / temperature
# Cosine similarity between anchor and each negative
neg_sims = F.cosine_similarity(
anchor.unsqueeze(1), negatives, dim=-1) / temperature
# Combine: positive at index 0, negatives after
logits = torch.cat([pos_sim.unsqueeze(-1), neg_sims], dim=-1)
# Target: the positive is always at index 0
labels = torch.zeros(logits.size(0), dtype=torch.long)
return F.cross_entropy(logits, labels)
# Quick demonstration
batch_size, dim, n_neg = 4, 128, 7
anchor = F.normalize(torch.randn(batch_size, dim), dim=-1)
positive = anchor + 0.1 * torch.randn(batch_size, dim) # slightly noisy
positive = F.normalize(positive, dim=-1)
negatives = F.normalize(torch.randn(batch_size, n_neg, dim), dim=-1)
loss = info_nce_loss(anchor, positive, negatives)
print(f"InfoNCE loss: {loss.item():.4f}")
print(f"(Lower = model better at distinguishing positives from negatives)")
The temperature parameter is critical. Lower temperature (0.05) makes the loss sharper -- small differences in cosine similarity translate into large differences in loss. This forces the model to learn fine-grained distinctions between similar and dissimilar texts. Too high and the model gets lazy; too low and training becomes unstable.
Popular embedding models you'll encounter in 2025/2026:
- all-MiniLM-L6-v2: 384 dimensions, fast, good for general English text
- E5-large-v2 (Microsoft): 1024 dimensions, better quality, slower
- text-embedding-3-small/large (OpenAI): 1536 dimensions, API-only, excellent quality
- BGE (BAAI): open-source, competitive with commercial models
- Cohere embed-v3: multilingual, strong on retrieval benchmarks
- nomic-embed-text: open-source, 8192 token context, very competitive
Similarity metrics
Given two embedding vectors, how do you measure how "close" they are? Three common options, each with different properties.
Cosine similarity: measures the angle between vectors, ignoring magnitude. Two vectors pointing in the same direction have cosine similarity 1.0, regardless of their length. This is the most common choice for text embeddings because embedding models are typically trained to optimize cosine similarity.
cos_sim(a, b) = (a . b) / (||a|| * ||b||)
Euclidean distance (L2): straight-line distance between vector endpoints. Smaller = more similar. Sensitive to vector magnitude -- a long vector far from a short vector might have higher L2 distance than two short vectors pointing in completely different directions.
Dot product (inner product): a . b = ||a|| * ||b|| * cos(angle). Combines directional similarity AND magnitude. Used when you want "more relevant AND more prominent" to score higher -- for example, if longer documents produce longer embedding vectors and you want to surface popular/extensive documents.
import numpy as np
def compare_metrics(a, b, label=""):
cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
euclidean = np.linalg.norm(a - b)
dot = np.dot(a, b)
print(f"{label}")
print(f" Cosine sim: {cosine:.4f}")
print(f" Euclidean: {euclidean:.4f}")
print(f" Dot product: {dot:.4f}")
print()
a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0]) # same direction, double magnitude
c = np.array([3.0, 1.0, 0.5]) # different direction, similar magnitude
compare_metrics(a, b, "a vs b (same direction, different magnitude)")
compare_metrics(a, c, "a vs c (different direction)")
print("Key insight:")
print(" Cosine: a,b are IDENTICAL (1.0) -- ignores magnitude")
print(" Euclidean: a,b are FAR apart -- magnitude matters")
print(" Dot product: a,b score HIGH -- same direction + big magnitude")
For most text retrieval tasks, cosine similarity is the right choice. It's what the embedding model was trained to optimize, and it's invariant to text length (longer texts don't automatically score higher just because they produce bigger vectors). Having said that, if your vectors are already L2-normalized (which many libraries do by default), cosine similarity and dot product are mathematically identical -- so it doesn't matter which you compute.
Vector databases
With embeddings, search becomes "find the nearest vectors to my query vector." For small collections (< 100K vectors), brute-force comparison against every single vector in your collection works fine. For millions or billions of vectors, you need a vector database with efficient indexing.
FAISS (Facebook AI Similarity Search)
Not a database in the traditional sense -- it's a library for efficient in-memory similarity search. Developed by Meta, it's the fastest option when your vectors fit in RAM:
import faiss
import numpy as np
# Create 100,000 random vectors of dimension 384
d = 384
n = 100_000
np.random.seed(42)
vectors = np.random.randn(n, d).astype('float32')
# Flat index: exact search (brute force, O(n) per query)
index = faiss.IndexFlatIP(d) # inner product (use IndexFlatL2 for L2)
faiss.normalize_L2(vectors) # normalize so IP == cosine similarity
index.add(vectors)
print(f"Index contains {index.ntotal} vectors")
# Search: find 5 nearest neighbors for 3 queries
queries = np.random.randn(3, d).astype('float32')
faiss.normalize_L2(queries)
distances, indices = index.search(queries, k=5)
print(f"\nQuery 0 -- top 5 neighbors:")
for rank, (dist, idx) in enumerate(zip(distances[0], indices[0])):
print(f" #{rank+1}: vector {idx}, similarity={dist:.4f}")
FAISS supports GPU acceleration (faiss-gpu) and approximate nearest neighbor algorithms for billion-scale search. We'll get to those shortly.
Chroma
A lightweight, developer-friendly vector database designed for local prototyping and small-to-medium apps. No server process needed:
# pip install chromadb
import chromadb
client = chromadb.Client() # in-memory, ephemeral
collection = client.create_collection(
"my_docs",
metadata={"hnsw:space": "cosine"} # use cosine distance
)
# Add documents -- Chroma can embed them internally
# (configure an embedding function, or pass pre-computed embeddings)
collection.add(
documents=[
"Python is a high-level programming language",
"JavaScript runs in web browsers natively",
"Cookies are delicious baked goods made with butter",
"Machine learning requires large datasets",
"The transformer architecture revolutionized NLP",
],
ids=["doc1", "doc2", "doc3", "doc4", "doc5"]
)
# Query -- returns semantically similar documents
results = collection.query(
query_texts=["How to code in Python?"],
n_results=3
)
print("Query: 'How to code in Python?'")
print(f"Top 3 results:")
for doc, dist in zip(results['documents'][0], results['distances'][0]):
print(f" [{dist:.4f}] {doc}")
Chroma handles embedding, storage, and retrieval in one simple API. For a hobby project or a prototype with under 1 million documents, it's hard to beat. The tradeoff: it doesn't scale to billions of vectors and lacks advanced features like filtered search at massive scale.
Qdrant, Pinecone, Weaviate
For production deployments at real scale:
- Qdrant: open-source, written in Rust, fast, supports filtering combined with vector search (e.g. "find similar documents WHERE category='tech' AND date > 2024")
- Pinecone: fully managed cloud service, zero operational overhead, pay-per-query pricing
- Weaviate: open-source, supports hybrid search (vector + keyword BM25 combined)
The choice between these depends on your operational preferences (managed vs self-hosted), scale requirements, and whether you need advanced filtering.
Indexing strategies
Brute-force search (comparing the query against every single vector) is O(n) per query. Fine for 10K vectors, absolutely impractical for 10M or 100M. Approximate Nearest Neighbor (ANN) algorithms trade a small accuracy loss for massive speed gains.
IVF (Inverted File Index)
Partition all vectors into clusters using k-means. At query time, only search the clusters nearest to the query in stead of searching everything:
import faiss
import numpy as np
d = 384
n = 100_000
np.random.seed(42)
vectors = np.random.randn(n, d).astype('float32')
faiss.normalize_L2(vectors)
# IVF with 100 clusters
nlist = 100
quantizer = faiss.IndexFlatIP(d)
index_ivf = faiss.IndexIVFFlat(quantizer, d, nlist,
faiss.METRIC_INNER_PRODUCT)
# Must train (learns cluster centroids) before adding vectors
index_ivf.train(vectors)
index_ivf.add(vectors)
# nprobe controls accuracy vs speed tradeoff
# Higher nprobe = more clusters searched = better recall, slower
query = np.random.randn(1, d).astype('float32')
faiss.normalize_L2(query)
for nprobe in [1, 5, 10, 50]:
index_ivf.nprobe = nprobe
distances, indices = index_ivf.search(query, k=5)
print(f"nprobe={nprobe:>3}: top match idx={indices[0][0]}, "
f"sim={distances[0][0]:.4f}")
print("\nWith 100 clusters and nprobe=10, we search only 10% of data")
print("~10x faster than brute force, usually 95%+ recall")
With 1000 clusters and nprobe=10, you're searching roughly 1% of your data -- a ~100x speedup for ~95-98% recall (meaning you find 95-98% of the true top-k results).
HNSW (Hierarchical Navigable Small World)
Build a multi-layer graph where each node connects to its nearby neighbors. Search by starting at a random entry point on the top layer, greedily navigating toward the query, then descending to finer layers:
import faiss
import numpy as np
import time
d = 384
n = 100_000
np.random.seed(42)
vectors = np.random.randn(n, d).astype('float32')
faiss.normalize_L2(vectors)
# HNSW index
index_hnsw = faiss.IndexHNSWFlat(d, 32) # 32 = connections per node
index_hnsw.metric_type = faiss.METRIC_INNER_PRODUCT
index_hnsw.hnsw.efConstruction = 200 # build quality
index_hnsw.add(vectors)
# Compare exact vs HNSW
index_flat = faiss.IndexFlatIP(d)
index_flat.add(vectors)
queries = np.random.randn(100, d).astype('float32')
faiss.normalize_L2(queries)
t0 = time.time()
exact_d, exact_i = index_flat.search(queries, k=10)
t_exact = time.time() - t0
index_hnsw.hnsw.efSearch = 64 # search quality parameter
t0 = time.time()
hnsw_d, hnsw_i = index_hnsw.search(queries, k=10)
t_hnsw = time.time() - t0
# Measure recall: fraction of true top-10 found by HNSW
recall = np.mean([
len(set(hnsw_i[i]) & set(exact_i[i])) / 10.0
for i in range(100)
])
print(f"Exact search: {t_exact*1000:.1f}ms for 100 queries")
print(f"HNSW search: {t_hnsw*1000:.1f}ms for 100 queries")
print(f"Speedup: {t_exact/t_hnsw:.1f}x")
print(f"Recall@10: {recall:.3f}")
HNSW gives excellent recall (often 99%+) with 10-100x speedup over exact search. The tradeoff: higher memory usage than IVF (it stores the graph structure in addition to the vectors).
Product Quantization (PQ)
Compress vectors by splitting each into sub-vectors and quantizing each sub-vector independently. Reduces memory by 4-8x with modest accuracy loss. Often combined with IVF for billion-scale systems.
The rule of thumb:
- < 100K vectors: flat (exact) search. It's fast enough.
- 100K - 10M: HNSW. Best speed-accuracy tradeoff.
- 10M+: IVF + PQ with possible GPU acceleration. Memory is the bottleneck.
Building a semantic search system
Let's put everything together -- a minimal but complete semantic search engine:
from sentence_transformers import SentenceTransformer
import numpy as np
class SemanticSearch:
def __init__(self, model_name='all-MiniLM-L6-v2'):
self.model = SentenceTransformer(model_name)
self.documents = []
self.embeddings = None
def index(self, documents):
"""Embed and store documents for search."""
self.documents = documents
self.embeddings = self.model.encode(
documents, normalize_embeddings=True,
show_progress_bar=False
)
print(f"Indexed {len(documents)} documents "
f"({self.embeddings.shape[1]} dimensions)")
def search(self, query, top_k=5):
"""Find the top-k most similar documents to query."""
q_emb = self.model.encode(
[query], normalize_embeddings=True)
# Cosine similarity (vectors are pre-normalized)
scores = self.embeddings @ q_emb.T
top_idx = np.argsort(scores.flatten())[::-1][:top_k]
return [(self.documents[i], float(scores[i][0]))
for i in top_idx]
# Usage
search = SemanticSearch()
docs = [
"Machine learning is a subset of artificial intelligence",
"Python is widely used for data science and ML",
"The Eiffel Tower is located in Paris, France",
"Neural networks are inspired by biological neurons",
"Amsterdam is the capital of the Netherlands",
"Gradient descent optimizes model parameters iteratively",
"The transformer architecture uses self-attention mechanisms",
"Random forests combine multiple decision trees",
"Backpropagation computes gradients through a neural network",
"Natural language processing deals with human language",
]
search.index(docs)
queries = [
"How do neural networks learn?",
"What is the capital of Holland?",
"Tell me about ensemble methods in ML",
]
for q in queries:
print(f"\nQuery: '{q}'")
results = search.search(q, top_k=3)
for doc, score in results:
print(f" {score:.3f}: {doc}")
"How do neural networks learn?" matches "Backpropagation computes gradients through a neural network" and "Neural networks are inspired by biological neurons" -- neither shares more than one or two words with the query, but they're semantically relevant. "What is the capital of Holland?" matches "Amsterdam is the capital of the Netherlands" because the model understands that Holland and the Netherlands refer to the same country. This is the power of semantic search over keyword matching.
This pattern -- encode documents once, encode query at search time, compute similarities -- is the building block for Retrieval-Augmented Generation (which is next), recommendation systems, duplicate detection, plagiarism checking, and any application where you need "find content similar to this."
Scaling considerations
A few practical things that matter when you move from prototype to production:
Batch encoding: embed documents in batches, not one at a time. model.encode(docs, batch_size=256) is orders of magnitude faster than encoding in a loop because it utilizes GPU parallelism.
Dimensionality reduction: if storage is tight, you can project embeddings to lower dimensions (using PCA from episode #24) with surprisingly little quality loss. Going from 1536d to 256d often preserves 90%+ of the retrieval quality.
Hybrid search: combine vector similarity with keyword matching (BM25). Some queries are better served by exact keyword match ("error code XYZ-123"), others by semantic similarity ("how to fix authentication issues"). Production systems often compute both scores and combine them with a weighted sum or a learned reranker.
import numpy as np
def hybrid_search(query, docs, embeddings, model,
alpha=0.7, top_k=5):
"""Combine semantic search with keyword matching.
alpha: weight for semantic score (1-alpha for keyword)
"""
# Semantic scores
q_emb = model.encode([query], normalize_embeddings=True)
sem_scores = (embeddings @ q_emb.T).flatten()
# Simple keyword scores (normalized word overlap)
q_words = set(query.lower().split())
kw_scores = np.array([
len(q_words & set(doc.lower().split())) / max(len(q_words), 1)
for doc in docs
])
# Normalize both to [0, 1] range
if sem_scores.max() > sem_scores.min():
sem_norm = ((sem_scores - sem_scores.min()) /
(sem_scores.max() - sem_scores.min()))
else:
sem_norm = np.zeros_like(sem_scores)
if kw_scores.max() > kw_scores.min():
kw_norm = ((kw_scores - kw_scores.min()) /
(kw_scores.max() - kw_scores.min()))
else:
kw_norm = np.zeros_like(kw_scores)
# Weighted combination
combined = alpha * sem_norm + (1 - alpha) * kw_norm
top_idx = np.argsort(combined)[::-1][:top_k]
print(f"Query: '{query}' (alpha={alpha})")
for i in top_idx:
print(f" [{combined[i]:.3f}] sem={sem_norm[i]:.2f} "
f"kw={kw_norm[i]:.2f} | {docs[i][:50]}")
# This is the pattern used by Weaviate, Elasticsearch, etc.
print("Hybrid search combines semantic understanding with")
print("exact keyword matching for the best of both worlds.")
Most production search systems (Elasticsearch with vector plugin, Weaviate, Vespa) implement this hybrid pattern. Pure semantic search can miss exact matches that keyword search nails trivially, and pure keyword search misses paraphrases that semantic search handles effortlessly. The combination is stronger than either alone.
The bottom line
- Text embeddings convert sentences and paragraphs into fixed-size vectors that capture semantic meaning -- similar texts produce similar vectors regardless of word overlap;
- Embedding models are trained with contrastive objectives (InfoNCE): push similar pairs close, push dissimilar pairs apart in embedding space;
- Cosine similarity is the standard metric for comparing text embeddings -- it measures directional similarity while ignoring vector magnitude;
- Vector databases (FAISS, Chroma, Qdrant) enable efficient similarity search over millions of vectors through specialized indexing;
- Indexing strategies (IVF, HNSW, PQ) trade small accuracy losses for massive speed gains at scale -- use flat for < 100K, HNSW for 100K-10M, IVF+PQ for 10M+;
- Semantic search finds conceptually similar content regardless of keyword overlap -- "How to code in Python" matches "Python programming tutorial" because they mean the same thing;
- Hybrid search (semantic + keyword combined) gives the best results in production systems.
Exercises
Exercise 1: Build a document deduplication system. Create a list of 20 short documents where 5 pairs are near-duplicates (paraphrases of each other) and 10 are unique. Write a function find_duplicates(documents, threshold=0.85) that embeds all documents (use all-MiniLM-L6-v2), computes all pairwise cosine similarities, and returns pairs with similarity above the threshold. Print a report showing which pairs were flagged as duplicates, their similarity scores, and whether the flagging was correct (compare against your ground truth). Experiment with thresholds from 0.7 to 0.95 and report precision/recall at each level.
Exercise 2: Implement IVF from scratch (no FAISS). Using only NumPy, write a class SimpleIVF with: (a) train(vectors, n_clusters) that runs k-means to find cluster centroids (use your own k-means from episode #22 or sklearn.cluster.KMeans), (b) add(vectors) that assigns each vector to its nearest cluster, (c) search(query, k, nprobe) that finds the nprobe nearest cluster centroids to the query, then does brute-force search only within those clusters. Benchmark your implementation against brute-force exact search on 50,000 random vectors (dimension 128): measure query time and recall@10 for nprobe values [1, 3, 5, 10, 25].
Exercise 3: Build a semantic search evaluation harness. Create a small test collection of 50 documents spanning 5 topics (10 documents per topic). Write 10 queries (2 per topic). For each query, define the relevant documents (ground truth). Implement the standard IR metrics: Precision@k, Recall@k, Mean Reciprocal Rank (MRR), and NDCG@k. Run your semantic search system (from the episode) on all 10 queries, compute all four metrics at k=1, k=3, k=5, and k=10, and print a summary table. Compare: does semantic search do better on some topics than others? Why might that be?