Learn AI Series (#63) - Embeddings and Vector Search

What will I learn

You will learn text embeddings -- representing meaning as dense vectors;
sentence embeddings vs word embeddings -- why sentence-level matters for real applications;
similarity metrics: cosine, euclidean, and dot product -- when to use each;
vector databases: Chroma, FAISS, Qdrant and when to pick which;
indexing strategies: flat, IVF, HNSW -- the speed vs accuracy tradeoff;
building a semantic search system from scratch.

Requirements

A working modern computer running macOS, Windows or Ubuntu;
An installed Python 3(.11+) distribution;
The ambition to learn AI and machine learning.

Difficulty

Beginner

Curriculum (of the `Learn AI Series`):

Learn AI Series (#63) - Embeddings and Vector Search

Solutions to Episode #62 Exercises

Exercise 1: Build a prompt comparison benchmark with zero-shot, 2-shot, and 5-shot variants across 10 classification tasks.

import random

random.seed(42)

tasks = [
    {
        "name": "sentiment",
        "labels": ["positive", "negative"],
        "keywords": {"positive": ["great", "love", "excellent", "amazing",
                                   "wonderful", "fantastic", "superb"],
                     "negative": ["terrible", "hate", "awful", "horrible",
                                  "disgusting", "worst", "broken"]},
        "examples": [
            ("This product is absolutely wonderful!", "positive"),
            ("Complete garbage, waste of money.", "negative"),
            ("I love everything about this service.", "positive"),
            ("Terrible experience, never again.", "negative"),
            ("Fantastic quality and fast shipping!", "positive"),
        ],
        "test": [
            ("Great value for the price!", "positive"),
            ("Horrible customer service.", "negative"),
            ("I love how easy this is to use.", "positive"),
            ("The worst purchase I ever made.", "negative"),
            ("Excellent build quality overall.", "positive"),
        ]
    },
    {
        "name": "spam",
        "labels": ["spam", "not_spam"],
        "keywords": {"spam": ["free", "winner", "click", "urgent",
                              "congratulations", "prize", "limited"],
                     "not_spam": ["meeting", "report", "project",
                                  "schedule", "update", "attached"]},
        "examples": [
            ("Congratulations! You won a FREE prize!", "spam"),
            ("Meeting rescheduled to 3pm tomorrow.", "not_spam"),
            ("URGENT: Click here to claim your reward!", "spam"),
            ("Please find the project report attached.", "not_spam"),
            ("Limited time offer: free gift inside!", "spam"),
        ],
        "test": [
            ("Click now for your free vacation!", "spam"),
            ("Updated the project schedule.", "not_spam"),
            ("You are the lucky winner today!", "spam"),
            ("Quarterly report is ready for review.", "not_spam"),
            ("Urgent: claim your prize immediately.", "spam"),
        ]
    },
]

# (truncated to 2 tasks for space -- in practice you'd have 10)

def simulate_model(prompt, test_text, task):
    """Keyword-based classification simulator."""
    text_lower = test_text.lower()
    scores = {}
    for label, kws in task["keywords"].items():
        scores[label] = sum(1 for w in kws if w in text_lower)
    if max(scores.values()) == 0:
        return task["labels"][0]
    return max(scores, key=scores.get)

def run_benchmark(tasks, n_shots_list=[0, 2, 5]):
    print(f"{'Task':<12} {'0-shot':>8} {'2-shot':>8} {'5-shot':>8}")
    print("-" * 40)
    for task in tasks:
        accs = []
        for n in n_shots_list:
            correct = 0
            for text, true_label in task["test"]:
                pred = simulate_model("", text, task)
                if pred == true_label:
                    correct += 1
            accs.append(correct / len(task["test"]))
        print(f"{task['name']:<12} {accs[0]:>8.1%} {accs[1]:>8.1%} "
              f"{accs[2]:>8.1%}")

run_benchmark(tasks)
print("\nNote: simulated model uses keyword matching.")
print("Real API calls would show few-shot >> zero-shot.")

The benchmark harness is what matters here -- the simulate_model function is a placeholder for actual API calls. In production you'd swap it for openai.chat.completions.create(...) or equivalent, and you'd see real accuracy improvements with more examples.

Exercise 2: Chain-of-thought verifier that extracts and re-computes arithmetic steps.

import re

def verify_cot(question, cot_response):
    """Verify arithmetic steps in a chain-of-thought response."""
    pattern = r'(\d+(?:\.\d+)?)\s*([+\-x*/])\s*(\d+(?:\.\d+)?)\s*=\s*(\d+(?:\.\d+)?)'
    steps = re.findall(pattern, cot_response.replace('x', '*'))
    results = []
    for left, op, right, stated in steps:
        left_f, right_f, stated_f = float(left), float(right), float(stated)
        op_fixed = op.replace('x', '*')
        if op_fixed == '+':
            computed = left_f + right_f
        elif op_fixed == '-':
            computed = left_f - right_f
        elif op_fixed in ('*', 'x'):
            computed = left_f * right_f
        elif op_fixed == '/':
            computed = left_f / right_f if right_f != 0 else float('inf')
        else:
            computed = None
        correct = abs(computed - stated_f) < 0.01 if computed else False
        results.append({
            "expr": f"{left} {op} {right}",
            "stated": stated_f,
            "computed": computed,
            "correct": correct
        })
    return results

problems = [
    ("5 shelves * 8 boxes = 40, 40 * 12 = 480",
     "5 * 8 = 40, 40 * 12 = 480"),
    ("3 + 4 = 7, 7 * 2 = 14",
     "3 + 4 = 7, 7 * 2 = 14"),
    ("10 * 5 = 50, 50 + 3 = 53",
     "10 * 5 = 55, 55 + 3 = 53"),  # ERROR: 10*5=55
    ("24 / 3 = 8, 8 * 4 = 32",
     "24 / 3 = 8, 8 * 4 = 36"),  # ERROR: 8*4=36
    ("100 - 25 = 75, 75 / 5 = 15",
     "100 - 25 = 75, 75 / 5 = 15"),
]

print(f"{'Problem':<35} {'Steps':>6} {'Errors':>7}")
print("-" * 52)
for question, cot in problems:
    results = verify_cot(question, cot)
    errors = sum(1 for r in results if not r["correct"])
    status = "OK" if errors == 0 else f"{errors} BAD"
    print(f"{cot[:33]:<35} {len(results):>6} {status:>7}")
    for r in results:
        if not r["correct"]:
            print(f"  ERROR: {r['expr']} stated={r['stated']}, "
                  f"computed={r['computed']}")

The verifier catches arithmetic errors by re-computing each step independently. Real self-consistency pipelines use this exact approach: generate multiple CoT paths, verify each, discard paths with arithmetic errors, majority-vote on the rest.

Exercise 3: Sampling parameter visualizer showing temperature, top-k, and top-p effects.

import torch
import torch.nn.functional as F
import math

torch.manual_seed(42)
logits = torch.randn(20) * 3  # 20-token vocabulary
logits[0] += 4.0  # make token 0 the most likely
logits[1] += 2.0  # token 1 second most likely

print("=== Temperature Scaling ===")
print(f"{'Temp':>6} {'Top prob':>10} {'Entropy':>10} {'Eff vocab':>10}")
for temp in [0.1, 0.5, 1.0, 2.0]:
    probs = F.softmax(logits / temp, dim=-1)
    top_p = probs.max().item()
    entropy = -(probs * probs.clamp(min=1e-10).log()).sum().item()
    eff_vocab = (probs > 0.01).sum().item()
    print(f"{temp:>6.1f} {top_p:>10.4f} {entropy:>10.4f} {eff_vocab:>10}")

print(f"\n=== Top-k Filtering ===")
print(f"{'k':>6} {'Mass kept':>10} {'Eff vocab':>10}")
probs = F.softmax(logits, dim=-1)
sorted_p, _ = probs.sort(descending=True)
for k in [3, 5, 10, 20]:
    kept = sorted_p[:k].sum().item()
    print(f"{k:>6} {kept:>10.4f} {min(k, 20):>10}")

print(f"\n=== Top-p (Nucleus) Filtering ===")
print(f"{'p':>6} {'Tokens kept':>12} {'Mass':>10}")
cumsum = sorted_p.cumsum(dim=0)
for p in [0.5, 0.9, 0.95, 0.99]:
    n_tokens = (cumsum < p).sum().item() + 1
    actual_mass = sorted_p[:n_tokens].sum().item()
    print(f"{p:>6.2f} {n_tokens:>12} {actual_mass:>10.4f}")

Lower temperature concentrates mass on the top token (0.1 makes it nearly deterministic). Top-k sets a hard cutoff regardless of the distribution shape. Top-p adapts -- when the model is confident, fewer tokens pass the threshold; when it's uncertain, more get through. This adaptive behaviour is why top-p is generally preferred over top-k in practice.

On to today's episode

Here we go! At the end of episode #62 I teased what was coming -- the idea of converting text into numerical vectors so you can search for "similar" content. That's exactly what we're building today. And this is NOT just academic -- embeddings and vector search are the foundation for one of the most practical applications of modern AI. Every search engine, every recommendation system, every "find similar documents" feature you've ever used is built on some variant of what we're about to cover.

Back in episode #31, we learned about word embeddings -- representing individual words as dense vectors where similar words live close together. "King" and "queen" are neighbors; "king" and "toaster" are not. That was powerful for its time. But word embeddings have a fundamental limitation: each word gets exactly ONE vector, regardless of context. The word "bank" gets the same embedding whether we're talking about a river bank or a financial bank. And there's no standard way to combine word embeddings into a meaningful representation of an entire sentence or paragraph.

Modern text embeddings solve both problems. They use transformer models (episode #52-53) to produce context-dependent, sentence-level vectors. "The bank of the river" and "The bank approved my loan" produce completely different embeddings because the model attends to the full context. And the resulting vector represents the meaning of the entire text, not just individual words.

This is the foundation for semantic search, RAG (which we'll get into next), recommendation systems, and clustering at scale. Let's build it from scratch ;-)

From words to sentences

A sentence embedding model takes arbitrary text and produces a fixed-size vector (typically 384 to 1536 dimensions) that captures semantic meaning. Texts with similar meanings produce vectors that are close together in this high-dimensional space -- regardless of whether they share any words.

# Using sentence-transformers (the standard library for this)
# pip install sentence-transformers
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer('all-MiniLM-L6-v2')  # 384-dim, fast

texts = [
    "How do I learn Python programming?",
    "What's the best way to start coding in Python?",
    "Recipe for chocolate chip cookies",
    "The weather in Amsterdam is rainy today",
]
embeddings = model.encode(texts)
print(f"Shape: {embeddings.shape}")  # (4, 384)

# The first two are semantically similar despite different words
for i in range(len(texts)):
    for j in range(i+1, len(texts)):
        sim = np.dot(embeddings[i], embeddings[j]) / (
            np.linalg.norm(embeddings[i]) * np.linalg.norm(embeddings[j]))
        print(f"Sim({i},{j}): {sim:.3f}  "
              f"'{texts[i][:35]}' vs '{texts[j][:35]}'")

The first two texts -- "How do I learn Python programming?" and "What's the best way to start coding in Python?" -- should have high cosine similarity (0.8+) despite sharing almost no words. They mean the same thing. Traditional keyword search (TF-IDF, BM25, the stuff we built in episode #30) would miss this connection entirely because it relies on word overlap. Embedding-based search catches it because the model understands semantics, not just surface tokens.

And that's the whole game, really. Everything else in this episode is infrastructure to make this core idea work at scale.

How embedding models are trained

Most embedding models start from a pre-trained transformer (usually BERT-like, as we covered in episode #59) and are fine-tuned with a contrastive objective: push similar text pairs close together in embedding space, push dissimilar pairs apart.

The training data consists of pairs labeled as similar or dissimilar: question-answer pairs from forums (similar), paraphrases from NLI datasets (similar), random text combinations (dissimilar). The loss function -- typically InfoNCE -- pushes the model to produce high cosine similarity for positive pairs and low similarity for negative pairs.

import torch
import torch.nn.functional as F

def info_nce_loss(anchor, positive, negatives, temperature=0.05):
    """InfoNCE loss: the standard contrastive learning objective.

    anchor: (batch, dim) -- the query embeddings
    positive: (batch, dim) -- the matching embeddings
    negatives: (batch, n_neg, dim) -- non-matching embeddings
    """
    # Cosine similarity between anchor and positive
    pos_sim = F.cosine_similarity(anchor, positive, dim=-1) / temperature

    # Cosine similarity between anchor and each negative
    neg_sims = F.cosine_similarity(
        anchor.unsqueeze(1), negatives, dim=-1) / temperature

    # Combine: positive at index 0, negatives after
    logits = torch.cat([pos_sim.unsqueeze(-1), neg_sims], dim=-1)
    # Target: the positive is always at index 0
    labels = torch.zeros(logits.size(0), dtype=torch.long)
    return F.cross_entropy(logits, labels)

# Quick demonstration
batch_size, dim, n_neg = 4, 128, 7
anchor = F.normalize(torch.randn(batch_size, dim), dim=-1)
positive = anchor + 0.1 * torch.randn(batch_size, dim)  # slightly noisy
positive = F.normalize(positive, dim=-1)
negatives = F.normalize(torch.randn(batch_size, n_neg, dim), dim=-1)

loss = info_nce_loss(anchor, positive, negatives)
print(f"InfoNCE loss: {loss.item():.4f}")
print(f"(Lower = model better at distinguishing positives from negatives)")

The temperature parameter is critical. Lower temperature (0.05) makes the loss sharper -- small differences in cosine similarity translate into large differences in loss. This forces the model to learn fine-grained distinctions between similar and dissimilar texts. Too high and the model gets lazy; too low and training becomes unstable.

Popular embedding models you'll encounter in 2025/2026:

all-MiniLM-L6-v2: 384 dimensions, fast, good for general English text
E5-large-v2 (Microsoft): 1024 dimensions, better quality, slower
text-embedding-3-small/large (OpenAI): 1536 dimensions, API-only, excellent quality
BGE (BAAI): open-source, competitive with commercial models
Cohere embed-v3: multilingual, strong on retrieval benchmarks
nomic-embed-text: open-source, 8192 token context, very competitive

Similarity metrics

Given two embedding vectors, how do you measure how "close" they are? Three common options, each with different properties.

Cosine similarity: measures the angle between vectors, ignoring magnitude. Two vectors pointing in the same direction have cosine similarity 1.0, regardless of their length. This is the most common choice for text embeddings because embedding models are typically trained to optimize cosine similarity.

cos_sim(a, b) = (a . b) / (||a|| * ||b||)

Euclidean distance (L2): straight-line distance between vector endpoints. Smaller = more similar. Sensitive to vector magnitude -- a long vector far from a short vector might have higher L2 distance than two short vectors pointing in completely different directions.

Dot product (inner product): a . b = ||a|| * ||b|| * cos(angle). Combines directional similarity AND magnitude. Used when you want "more relevant AND more prominent" to score higher -- for example, if longer documents produce longer embedding vectors and you want to surface popular/extensive documents.

import numpy as np

def compare_metrics(a, b, label=""):
    cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    euclidean = np.linalg.norm(a - b)
    dot = np.dot(a, b)
    print(f"{label}")
    print(f"  Cosine sim: {cosine:.4f}")
    print(f"  Euclidean:  {euclidean:.4f}")
    print(f"  Dot product: {dot:.4f}")
    print()

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])  # same direction, double magnitude
c = np.array([3.0, 1.0, 0.5])  # different direction, similar magnitude

compare_metrics(a, b, "a vs b (same direction, different magnitude)")
compare_metrics(a, c, "a vs c (different direction)")

print("Key insight:")
print("  Cosine: a,b are IDENTICAL (1.0) -- ignores magnitude")
print("  Euclidean: a,b are FAR apart -- magnitude matters")
print("  Dot product: a,b score HIGH -- same direction + big magnitude")

For most text retrieval tasks, cosine similarity is the right choice. It's what the embedding model was trained to optimize, and it's invariant to text length (longer texts don't automatically score higher just because they produce bigger vectors). Having said that, if your vectors are already L2-normalized (which many libraries do by default), cosine similarity and dot product are mathematically identical -- so it doesn't matter which you compute.

Vector databases

With embeddings, search becomes "find the nearest vectors to my query vector." For small collections (< 100K vectors), brute-force comparison against every single vector in your collection works fine. For millions or billions of vectors, you need a vector database with efficient indexing.

FAISS (Facebook AI Similarity Search)

Not a database in the traditional sense -- it's a library for efficient in-memory similarity search. Developed by Meta, it's the fastest option when your vectors fit in RAM:

import faiss
import numpy as np

# Create 100,000 random vectors of dimension 384
d = 384
n = 100_000
np.random.seed(42)
vectors = np.random.randn(n, d).astype('float32')

# Flat index: exact search (brute force, O(n) per query)
index = faiss.IndexFlatIP(d)  # inner product (use IndexFlatL2 for L2)
faiss.normalize_L2(vectors)   # normalize so IP == cosine similarity
index.add(vectors)

print(f"Index contains {index.ntotal} vectors")

# Search: find 5 nearest neighbors for 3 queries
queries = np.random.randn(3, d).astype('float32')
faiss.normalize_L2(queries)
distances, indices = index.search(queries, k=5)

print(f"\nQuery 0 -- top 5 neighbors:")
for rank, (dist, idx) in enumerate(zip(distances[0], indices[0])):
    print(f"  #{rank+1}: vector {idx}, similarity={dist:.4f}")

FAISS supports GPU acceleration (faiss-gpu) and approximate nearest neighbor algorithms for billion-scale search. We'll get to those shortly.

Chroma

A lightweight, developer-friendly vector database designed for local prototyping and small-to-medium apps. No server process needed:

# pip install chromadb
import chromadb

client = chromadb.Client()  # in-memory, ephemeral
collection = client.create_collection(
    "my_docs",
    metadata={"hnsw:space": "cosine"}  # use cosine distance
)

# Add documents -- Chroma can embed them internally
# (configure an embedding function, or pass pre-computed embeddings)
collection.add(
    documents=[
        "Python is a high-level programming language",
        "JavaScript runs in web browsers natively",
        "Cookies are delicious baked goods made with butter",
        "Machine learning requires large datasets",
        "The transformer architecture revolutionized NLP",
    ],
    ids=["doc1", "doc2", "doc3", "doc4", "doc5"]
)

# Query -- returns semantically similar documents
results = collection.query(
    query_texts=["How to code in Python?"],
    n_results=3
)
print("Query: 'How to code in Python?'")
print(f"Top 3 results:")
for doc, dist in zip(results['documents'][0], results['distances'][0]):
    print(f"  [{dist:.4f}] {doc}")

Chroma handles embedding, storage, and retrieval in one simple API. For a hobby project or a prototype with under 1 million documents, it's hard to beat. The tradeoff: it doesn't scale to billions of vectors and lacks advanced features like filtered search at massive scale.

Qdrant, Pinecone, Weaviate

For production deployments at real scale:

Qdrant: open-source, written in Rust, fast, supports filtering combined with vector search (e.g. "find similar documents WHERE category='tech' AND date > 2024")
Pinecone: fully managed cloud service, zero operational overhead, pay-per-query pricing
Weaviate: open-source, supports hybrid search (vector + keyword BM25 combined)

The choice between these depends on your operational preferences (managed vs self-hosted), scale requirements, and whether you need advanced filtering.

Indexing strategies

Brute-force search (comparing the query against every single vector) is O(n) per query. Fine for 10K vectors, absolutely impractical for 10M or 100M. Approximate Nearest Neighbor (ANN) algorithms trade a small accuracy loss for massive speed gains.

IVF (Inverted File Index)

Partition all vectors into clusters using k-means. At query time, only search the clusters nearest to the query in stead of searching everything:

import faiss
import numpy as np

d = 384
n = 100_000
np.random.seed(42)
vectors = np.random.randn(n, d).astype('float32')
faiss.normalize_L2(vectors)

# IVF with 100 clusters
nlist = 100
quantizer = faiss.IndexFlatIP(d)
index_ivf = faiss.IndexIVFFlat(quantizer, d, nlist,
                                faiss.METRIC_INNER_PRODUCT)

# Must train (learns cluster centroids) before adding vectors
index_ivf.train(vectors)
index_ivf.add(vectors)

# nprobe controls accuracy vs speed tradeoff
# Higher nprobe = more clusters searched = better recall, slower
query = np.random.randn(1, d).astype('float32')
faiss.normalize_L2(query)

for nprobe in [1, 5, 10, 50]:
    index_ivf.nprobe = nprobe
    distances, indices = index_ivf.search(query, k=5)
    print(f"nprobe={nprobe:>3}: top match idx={indices[0][0]}, "
          f"sim={distances[0][0]:.4f}")

print("\nWith 100 clusters and nprobe=10, we search only 10% of data")
print("~10x faster than brute force, usually 95%+ recall")

With 1000 clusters and nprobe=10, you're searching roughly 1% of your data -- a ~100x speedup for ~95-98% recall (meaning you find 95-98% of the true top-k results).

HNSW (Hierarchical Navigable Small World)

Build a multi-layer graph where each node connects to its nearby neighbors. Search by starting at a random entry point on the top layer, greedily navigating toward the query, then descending to finer layers:

import faiss
import numpy as np
import time

d = 384
n = 100_000
np.random.seed(42)
vectors = np.random.randn(n, d).astype('float32')
faiss.normalize_L2(vectors)

# HNSW index
index_hnsw = faiss.IndexHNSWFlat(d, 32)  # 32 = connections per node
index_hnsw.metric_type = faiss.METRIC_INNER_PRODUCT
index_hnsw.hnsw.efConstruction = 200  # build quality
index_hnsw.add(vectors)

# Compare exact vs HNSW
index_flat = faiss.IndexFlatIP(d)
index_flat.add(vectors)

queries = np.random.randn(100, d).astype('float32')
faiss.normalize_L2(queries)

t0 = time.time()
exact_d, exact_i = index_flat.search(queries, k=10)
t_exact = time.time() - t0

index_hnsw.hnsw.efSearch = 64  # search quality parameter
t0 = time.time()
hnsw_d, hnsw_i = index_hnsw.search(queries, k=10)
t_hnsw = time.time() - t0

# Measure recall: fraction of true top-10 found by HNSW
recall = np.mean([
    len(set(hnsw_i[i]) & set(exact_i[i])) / 10.0
    for i in range(100)
])

print(f"Exact search:  {t_exact*1000:.1f}ms for 100 queries")
print(f"HNSW search:   {t_hnsw*1000:.1f}ms for 100 queries")
print(f"Speedup:       {t_exact/t_hnsw:.1f}x")
print(f"Recall@10:     {recall:.3f}")

HNSW gives excellent recall (often 99%+) with 10-100x speedup over exact search. The tradeoff: higher memory usage than IVF (it stores the graph structure in addition to the vectors).

Product Quantization (PQ)

Compress vectors by splitting each into sub-vectors and quantizing each sub-vector independently. Reduces memory by 4-8x with modest accuracy loss. Often combined with IVF for billion-scale systems.

The rule of thumb:

< 100K vectors: flat (exact) search. It's fast enough.
100K - 10M: HNSW. Best speed-accuracy tradeoff.
10M+: IVF + PQ with possible GPU acceleration. Memory is the bottleneck.

Building a semantic search system

Let's put everything together -- a minimal but complete semantic search engine:

from sentence_transformers import SentenceTransformer
import numpy as np

class SemanticSearch:
    def __init__(self, model_name='all-MiniLM-L6-v2'):
        self.model = SentenceTransformer(model_name)
        self.documents = []
        self.embeddings = None

    def index(self, documents):
        """Embed and store documents for search."""
        self.documents = documents
        self.embeddings = self.model.encode(
            documents, normalize_embeddings=True,
            show_progress_bar=False
        )
        print(f"Indexed {len(documents)} documents "
              f"({self.embeddings.shape[1]} dimensions)")

    def search(self, query, top_k=5):
        """Find the top-k most similar documents to query."""
        q_emb = self.model.encode(
            [query], normalize_embeddings=True)
        # Cosine similarity (vectors are pre-normalized)
        scores = self.embeddings @ q_emb.T
        top_idx = np.argsort(scores.flatten())[::-1][:top_k]
        return [(self.documents[i], float(scores[i][0]))
                for i in top_idx]

# Usage
search = SemanticSearch()
docs = [
    "Machine learning is a subset of artificial intelligence",
    "Python is widely used for data science and ML",
    "The Eiffel Tower is located in Paris, France",
    "Neural networks are inspired by biological neurons",
    "Amsterdam is the capital of the Netherlands",
    "Gradient descent optimizes model parameters iteratively",
    "The transformer architecture uses self-attention mechanisms",
    "Random forests combine multiple decision trees",
    "Backpropagation computes gradients through a neural network",
    "Natural language processing deals with human language",
]
search.index(docs)

queries = [
    "How do neural networks learn?",
    "What is the capital of Holland?",
    "Tell me about ensemble methods in ML",
]

for q in queries:
    print(f"\nQuery: '{q}'")
    results = search.search(q, top_k=3)
    for doc, score in results:
        print(f"  {score:.3f}: {doc}")

"How do neural networks learn?" matches "Backpropagation computes gradients through a neural network" and "Neural networks are inspired by biological neurons" -- neither shares more than one or two words with the query, but they're semantically relevant. "What is the capital of Holland?" matches "Amsterdam is the capital of the Netherlands" because the model understands that Holland and the Netherlands refer to the same country. This is the power of semantic search over keyword matching.

This pattern -- encode documents once, encode query at search time, compute similarities -- is the building block for Retrieval-Augmented Generation (which is next), recommendation systems, duplicate detection, plagiarism checking, and any application where you need "find content similar to this."

Scaling considerations

A few practical things that matter when you move from prototype to production:

Batch encoding: embed documents in batches, not one at a time. model.encode(docs, batch_size=256) is orders of magnitude faster than encoding in a loop because it utilizes GPU parallelism.

Dimensionality reduction: if storage is tight, you can project embeddings to lower dimensions (using PCA from episode #24) with surprisingly little quality loss. Going from 1536d to 256d often preserves 90%+ of the retrieval quality.

Hybrid search: combine vector similarity with keyword matching (BM25). Some queries are better served by exact keyword match ("error code XYZ-123"), others by semantic similarity ("how to fix authentication issues"). Production systems often compute both scores and combine them with a weighted sum or a learned reranker.

import numpy as np

def hybrid_search(query, docs, embeddings, model,
                  alpha=0.7, top_k=5):
    """Combine semantic search with keyword matching.

    alpha: weight for semantic score (1-alpha for keyword)
    """
    # Semantic scores
    q_emb = model.encode([query], normalize_embeddings=True)
    sem_scores = (embeddings @ q_emb.T).flatten()

    # Simple keyword scores (normalized word overlap)
    q_words = set(query.lower().split())
    kw_scores = np.array([
        len(q_words & set(doc.lower().split())) / max(len(q_words), 1)
        for doc in docs
    ])

    # Normalize both to [0, 1] range
    if sem_scores.max() > sem_scores.min():
        sem_norm = ((sem_scores - sem_scores.min()) /
                    (sem_scores.max() - sem_scores.min()))
    else:
        sem_norm = np.zeros_like(sem_scores)

    if kw_scores.max() > kw_scores.min():
        kw_norm = ((kw_scores - kw_scores.min()) /
                   (kw_scores.max() - kw_scores.min()))
    else:
        kw_norm = np.zeros_like(kw_scores)

    # Weighted combination
    combined = alpha * sem_norm + (1 - alpha) * kw_norm
    top_idx = np.argsort(combined)[::-1][:top_k]

    print(f"Query: '{query}' (alpha={alpha})")
    for i in top_idx:
        print(f"  [{combined[i]:.3f}] sem={sem_norm[i]:.2f} "
              f"kw={kw_norm[i]:.2f} | {docs[i][:50]}")

# This is the pattern used by Weaviate, Elasticsearch, etc.
print("Hybrid search combines semantic understanding with")
print("exact keyword matching for the best of both worlds.")

Most production search systems (Elasticsearch with vector plugin, Weaviate, Vespa) implement this hybrid pattern. Pure semantic search can miss exact matches that keyword search nails trivially, and pure keyword search misses paraphrases that semantic search handles effortlessly. The combination is stronger than either alone.

The bottom line

Text embeddings convert sentences and paragraphs into fixed-size vectors that capture semantic meaning -- similar texts produce similar vectors regardless of word overlap;
Embedding models are trained with contrastive objectives (InfoNCE): push similar pairs close, push dissimilar pairs apart in embedding space;
Cosine similarity is the standard metric for comparing text embeddings -- it measures directional similarity while ignoring vector magnitude;
Vector databases (FAISS, Chroma, Qdrant) enable efficient similarity search over millions of vectors through specialized indexing;
Indexing strategies (IVF, HNSW, PQ) trade small accuracy losses for massive speed gains at scale -- use flat for < 100K, HNSW for 100K-10M, IVF+PQ for 10M+;
Semantic search finds conceptually similar content regardless of keyword overlap -- "How to code in Python" matches "Python programming tutorial" because they mean the same thing;
Hybrid search (semantic + keyword combined) gives the best results in production systems.

Exercises

Exercise 1: Build a document deduplication system. Create a list of 20 short documents where 5 pairs are near-duplicates (paraphrases of each other) and 10 are unique. Write a function find_duplicates(documents, threshold=0.85) that embeds all documents (use all-MiniLM-L6-v2), computes all pairwise cosine similarities, and returns pairs with similarity above the threshold. Print a report showing which pairs were flagged as duplicates, their similarity scores, and whether the flagging was correct (compare against your ground truth). Experiment with thresholds from 0.7 to 0.95 and report precision/recall at each level.

Exercise 2: Implement IVF from scratch (no FAISS). Using only NumPy, write a class SimpleIVF with: (a) train(vectors, n_clusters) that runs k-means to find cluster centroids (use your own k-means from episode #22 or sklearn.cluster.KMeans), (b) add(vectors) that assigns each vector to its nearest cluster, (c) search(query, k, nprobe) that finds the nprobe nearest cluster centroids to the query, then does brute-force search only within those clusters. Benchmark your implementation against brute-force exact search on 50,000 random vectors (dimension 128): measure query time and recall@10 for nprobe values [1, 3, 5, 10, 25].

Exercise 3: Build a semantic search evaluation harness. Create a small test collection of 50 documents spanning 5 topics (10 documents per topic). Write 10 queries (2 per topic). For each query, define the relevant documents (ground truth). Implement the standard IR metrics: Precision@k, Recall@k, Mean Reciprocal Rank (MRR), and NDCG@k. Run your semantic search system (from the episode) on all 10 queries, compute all four metrics at k=1, k=3, k=5, and k=10, and print a summary table. Compare: does semantic search do better on some topics than others? Why might that be?

Bedankt en tot de volgende keer!

Hive account@scipio

Learn AI Series (#63) - Embeddings and Vector Search

Learn AI Series (#63) - Embeddings and Vector Search

What will I learn

Requirements

Difficulty

Curriculum (of the Learn AI Series):

Learn AI Series (#63) - Embeddings and Vector Search

Solutions to Episode #62 Exercises

On to today's episode

From words to sentences

How embedding models are trained

Similarity metrics

Vector databases

FAISS (Facebook AI Similarity Search)

Chroma

Qdrant, Pinecone, Weaviate

Indexing strategies

IVF (Inverted File Index)

HNSW (Hierarchical Navigable Small World)

Product Quantization (PQ)

Building a semantic search system

Scaling considerations

The bottom line

Exercises

Bedankt en tot de volgende keer!

Curriculum (of the `Learn AI Series`):