Learn AI Series (#73) - LLM Evaluation

What will I learn

You will learn why evaluating language models is fundamentally harder than evaluating traditional ML models;
perplexity: what it actually measures, how to compute it, and where it falls short;
benchmark suites: MMLU, HumanEval, GSM8K, HellaSwag, TruthfulQA and what they really test;
LLM-as-judge: using strong models to evaluate weaker models (pairwise comparison, absolute scoring, bias mitigation);
human evaluation: when automated metrics are not enough and how to structure it;
benchmark contamination: the elephant in the room and how to build custom evaluations that actually matter.

Requirements

A working modern computer running macOS, Windows or Ubuntu;
An installed Python 3(.11+) distribution;
The ambition to learn AI and machine learning.

Difficulty

Beginner

Curriculum (of the `Learn AI Series`):

Learn AI Series (#73) - LLM Evaluation

Solutions to Episode #72 Exercises

Exercise 1: Custom tokenizer with BPE-style merge rules.

from collections import Counter


def build_vocab(corpus, num_merges):
    """Build BPE vocabulary from corpus through iterative merging."""
    # Start with character-level tokens
    words = []
    for word in corpus.split():
        chars = list(word) + ["</w>"]
        words.append(chars)

    merge_rules = []

    for step in range(num_merges):
        # Count all adjacent pairs
        pair_counts = Counter()
        for word in words:
            for i in range(len(word) - 1):
                pair_counts[(word[i], word[i + 1])] += 1

        if not pair_counts:
            break

        # Find most frequent pair
        best_pair = pair_counts.most_common(1)[0]
        pair, count = best_pair
        merged = pair[0] + pair[1]
        merge_rules.append((pair, merged, count))

        # Apply merge everywhere
        new_words = []
        for word in words:
            merged_word = []
            i = 0
            while i < len(word):
                if (i < len(word) - 1
                        and word[i] == pair[0]
                        and word[i + 1] == pair[1]):
                    merged_word.append(merged)
                    i += 2
                else:
                    merged_word.append(word[i])
                    i += 1
            new_words.append(merged_word)
        words = new_words

    # Build final vocabulary
    vocab = set()
    for word in words:
        vocab.update(word)

    return vocab, merge_rules, words


def tokenize(text, merge_rules):
    """Tokenize new text using learned merge rules."""
    tokens = []
    for word in text.split():
        chars = list(word) + ["</w>"]
        for (pair, merged, _) in merge_rules:
            i = 0
            while i < len(chars) - 1:
                if chars[i] == pair[0] and chars[i + 1] == pair[1]:
                    chars[i] = merged
                    del chars[i + 1]
                else:
                    i += 1
        tokens.extend(chars)
    return tokens


# Test it
corpus = ("the cat sat on the mat the cat ate the rat "
          "the bat sat on the hat the fat cat sat flat")

vocab, rules, final = build_vocab(corpus, 20)

print("Merge rules learned:")
for pair, merged, count in rules[:10]:
    print(f"  {pair[0]:>6} + {pair[1]:<6} -> {merged:<10} "
          f"(count: {count})")

print(f"\nVocab size: {len(vocab)}")
print(f"Vocab: {sorted(vocab)[:20]}")

test = "the cat sat flat"
tokens = tokenize(test, rules)
print(f"\nTokenized '{test}':")
print(f"  {tokens}")
print(f"  Token count: {len(tokens)}")

The key insight here is how BPE builds a vocabulary bottom-up. It starts from individual characters (guaranteed to cover any input) and merges the most frequent pairs iteratively. Early merges capture common letter combinations ("th", "at", "on"), later merges build whole words. The </w> end-of-word marker keeps word-final tokens distinct from word-internal ones -- without it, the standalone word "the" and the "the" at the start of "them" would collapse into the same token. Production tokenizers like GPT's tiktoken apply the same merge idea, just on bytes and at much larger scale (50,000+ merges instead of 20).

Exercise 2: Tokenization efficiency analyzer.

def analyze_tokenization(tokenizer_fn, texts, labels=None):
    """Analyze how efficiently a tokenizer handles different text types."""
    results = []
    for i, text in enumerate(texts):
        tokens = tokenizer_fn(text)
        chars = len(text)
        n_tokens = len(tokens)
        ratio = chars / n_tokens if n_tokens > 0 else 0

        # Unique token ratio (vocabulary diversity)
        unique = len(set(tokens))
        diversity = unique / n_tokens if n_tokens > 0 else 0

        results.append({
            "label": labels[i] if labels else f"text_{i}",
            "chars": chars,
            "tokens": n_tokens,
            "chars_per_token": ratio,
            "unique_tokens": unique,
            "diversity": diversity,
        })

    # Print comparison table
    print(f"{'Text Type':<20} {'Chars':>6} {'Tokens':>7} "
          f"{'Ch/Tok':>7} {'Unique':>7} {'Diversity':>9}")
    print("-" * 62)
    for r in results:
        print(f"{r['label']:<20} {r['chars']:>6} "
              f"{r['tokens']:>7} {r['chars_per_token']:>7.2f} "
              f"{r['unique_tokens']:>7} {r['diversity']:>9.3f}")

    # Summary
    avg_ratio = sum(r["chars_per_token"] for r in results) / len(results)
    print(f"\nAverage chars/token: {avg_ratio:.2f}")
    return results


# Simple whitespace tokenizer for demo
def simple_tokenize(text):
    return text.split()


texts = [
    "The quick brown fox jumps over the lazy dog near the river bank",
    "def fibonacci(n): return n if n <= 1 else fibonacci(n-1) + fibonacci(n-2)",
    "SELECT u.name, COUNT(o.id) FROM users u LEFT JOIN orders o ON u.id = o.user_id GROUP BY u.name",
    "https://api.example.com/v2/users?page=3&limit=50&sort=created_at",
    "El gato se sento en la alfombra y miro por la ventana",
]

labels = ["English prose", "Python code", "SQL query",
          "URL", "Spanish text"]

analyze_tokenization(simple_tokenize, texts, labels)

The chars-per-token ratio reveals how "compressible" text is for a given tokenizer. English prose typically gets 4-5 chars per token with BPE tokenizers because common words like "the", "and", "for" are single tokens. Code is less efficient (more special characters, camelCase splitting), and URLs are worst (long opaque strings with few repeated patterns). This ratio directly affects your API costs and context window utilization -- a model with 4096 tokens of context gives you roughly 16,000 characters of English but maybe only 8,000 characters of dense code.
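To make that arithmetic concrete, here is a minimal sketch of the budget calculation. The helper name `approx_char_budget` and the two ratios (4.0 for prose, 2.0 for dense code) are assumptions for illustration; measure the real ratios on your own data with your own tokenizer.

```python
def approx_char_budget(context_tokens, chars_per_token):
    """Rough number of characters that fit in a context window."""
    return int(context_tokens * chars_per_token)


# Assumed ratios for illustration only; measure them on your own data.
for label, ratio in [("English prose", 4.0), ("dense code", 2.0)]:
    budget = approx_char_budget(4096, ratio)
    print(f"{label:<14} ~{budget:,} characters in 4096 tokens")
```

The same multiplication works in reverse for cost estimates: divide your expected character volume by the measured ratio to get an approximate token count.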

Exercise 3: Token vocabulary overlap analyzer.

def vocabulary_overlap(tokenizer_a, tokenizer_b, texts,
                       name_a="Tokenizer A", name_b="Tokenizer B"):
    """Compare vocabularies produced by two tokenizers on same texts."""
    vocab_a = set()
    vocab_b = set()
    total_a = 0
    total_b = 0

    for text in texts:
        tokens_a = tokenizer_a(text)
        tokens_b = tokenizer_b(text)
        vocab_a.update(tokens_a)
        vocab_b.update(tokens_b)
        total_a += len(tokens_a)
        total_b += len(tokens_b)

    overlap = vocab_a & vocab_b
    only_a = vocab_a - vocab_b
    only_b = vocab_b - vocab_a

    print(f"Vocabulary Overlap Analysis")
    print(f"  {name_a}: {len(vocab_a)} unique tokens, "
          f"{total_a} total tokens")
    print(f"  {name_b}: {len(vocab_b)} unique tokens, "
          f"{total_b} total tokens")
    print(f"  Shared tokens: {len(overlap)}")
    print(f"  Only in {name_a}: {len(only_a)}")
    print(f"  Only in {name_b}: {len(only_b)}")

    jaccard = len(overlap) / len(vocab_a | vocab_b)
    print(f"  Jaccard similarity: {jaccard:.3f}")

    if only_a:
        sample = sorted(only_a)[:8]
        print(f"\n  Sample only-{name_a}: {sample}")
    if only_b:
        sample = sorted(only_b)[:8]
        print(f"\n  Sample only-{name_b}: {sample}")

    return {
        "vocab_a": vocab_a, "vocab_b": vocab_b,
        "overlap": overlap, "jaccard": jaccard,
    }


# Two simple tokenizers: whitespace vs character trigrams
def word_tokenize(text):
    return text.lower().split()


def char_trigram_tokenize(text):
    text = text.lower()
    trigrams = []
    for i in range(len(text) - 2):
        trigrams.append(text[i:i+3])
    return trigrams


test_texts = [
    "the cat sat on the mat",
    "machine learning is fundamentally about patterns",
    "neural networks transform inputs through layers",
]

vocabulary_overlap(word_tokenize, char_trigram_tokenize,
                   test_texts, "Word", "Trigram")

The Jaccard similarity between tokenizer vocabularies tells you how "compatible" two tokenizers are. Two models with completely different tokenizers (Jaccard near 0) produce fundamentally different representations of the same text, which means you can't meaningfully compare their perplexity scores (as we'll discuss today!). Two models sharing a tokenizer family (like GPT-3.5 and GPT-4 both using cl100k_base) have high overlap and their token-level metrics are more directly comparable.

On to today's episode

Here we go! Over the last few episodes we've been building up quite a toolkit for working with language models. We've covered text generation strategies (episode #71) and tokenization internals (#72) -- but all of that assumes the model is actually good. How do you know? How do you measure "good" when the output is free-form text that could have a hundred valid answers?

Back in episode #13 we covered evaluation for traditional ML: accuracy, precision, recall, F1, AUC-ROC. Those metrics work beautifully when the answer is a number or a category. The model predicts "cat" or "dog", and the ground truth says "cat" -- done, you can count how often it's right. But with language models? "Write a Python function that reverses a linked list" has many correct answers. Different variable names, different approaches, different styles. No single reference answer captures all valid responses.

LLM evaluation is one of the hardest open problems in the field right now. And I'd argue it's also one of the most important ones, because if you can't measure quality, you can't improve it systematically. You're just vibing ;-)

Perplexity: the classic language model metric

Perplexity measures how "surprised" a language model is by a sequence of text. Formally, it's the exponential of the average negative log-likelihood per token. If a model assigns high probability to each token in a sequence, perplexity is low. If the model is constantly surprised by what comes next, perplexity is high.

import math


def compute_perplexity_manual(log_probs):
    """Compute perplexity from a list of log probabilities.

    Each log_prob is the model's log probability for the
    actual next token at that position.
    """
    n = len(log_probs)
    if n == 0:
        return float("inf")

    # Average negative log-likelihood
    avg_nll = -sum(log_probs) / n

    # Perplexity = exp(avg_nll)
    perplexity = math.exp(avg_nll)
    return perplexity


# Example: a confident model vs an uncertain model
# Log probs closer to 0 = higher confidence

confident_probs = [-0.1, -0.2, -0.15, -0.05, -0.3,
                   -0.1, -0.08, -0.12, -0.2, -0.1]
uncertain_probs = [-2.5, -3.1, -1.8, -2.9, -3.5,
                   -2.0, -2.7, -3.3, -1.9, -2.4]

ppl_confident = compute_perplexity_manual(confident_probs)
ppl_uncertain = compute_perplexity_manual(uncertain_probs)

print(f"Confident model perplexity: {ppl_confident:.2f}")
print(f"Uncertain model perplexity: {ppl_uncertain:.2f}")
print(f"\nInterpretation:")
print(f"  PPL {ppl_confident:.0f} means the model is 'choosing' "
      f"among ~{ppl_confident:.0f} options on average")
print(f"  PPL {ppl_uncertain:.0f} means the model is 'choosing' "
      f"among ~{ppl_uncertain:.0f} options on average")

A perplexity of 10 means the model is, on average, as uncertain as if it were choosing uniformly among 10 options at each token position. Lower is better. Good language models on clean English text typically have perplexities between 5 and 20.
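A quick way to convince yourself of the "choosing among N options" reading is to feed the formula a model that really is uniform. The sketch below uses its own small `perplexity` helper (same formula as above) and made-up sequences of 25 tokens; nothing here comes from a real model.

```python
import math


def perplexity(log_probs):
    """exp of the average negative log-likelihood (natural log)."""
    return math.exp(-sum(log_probs) / len(log_probs))


# A model that spreads probability evenly over k options at every position
for k in [2, 10, 50000]:
    log_probs = [math.log(1.0 / k)] * 25
    print(f"uniform over {k:>6} options -> PPL {perplexity(log_probs):.1f}")
```

Up to floating-point rounding the perplexity comes out equal to k, whatever the sequence length, because the average of identical log probabilities is just that log probability.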

import torch
import math


def compute_perplexity_pytorch(model, tokenizer, text,
                               stride=512):
    """Compute perplexity using a HuggingFace model.

    Uses a sliding window for texts longer than context.
    """
    encodings = tokenizer(text, return_tensors="pt")
    input_ids = encodings["input_ids"]
    seq_len = input_ids.size(1)

    nlls = []
    prev_end = 0

    for begin in range(0, seq_len, stride):
        end = min(begin + model.config.max_position_embeddings,
                  seq_len)
        target_len = end - prev_end

        input_chunk = input_ids[:, begin:end]
        target_chunk = input_chunk.clone()
        # Mask tokens we've already scored
        target_chunk[:, :-target_len] = -100

        with torch.no_grad():
            outputs = model(input_chunk, labels=target_chunk)
            nll = outputs.loss.item()

        nlls.append(nll * target_len)
        prev_end = end
        if end == seq_len:
            break

    total_nll = sum(nlls)
    total_tokens = prev_end
    ppl = math.exp(total_nll / total_tokens)
    return ppl


# Usage (requires transformers + a model):
# from transformers import AutoModelForCausalLM, AutoTokenizer
# model = AutoModelForCausalLM.from_pretrained("gpt2")
# tokenizer = AutoTokenizer.from_pretrained("gpt2")
# ppl = compute_perplexity_pytorch(model, tokenizer, "Hello world")

But here's the crucial thing about perplexity: it measures prediction quality, not usefulness. A model that memorizes the training data verbatim would have perfect perplexity on similar text but might be absolutely terrible at following instructions. A model with slightly higher perplexity might be far more helpful in practice because it was trained with RLHF (episode #61) to prioritize helpfulness over raw prediction accuracy.

Perplexity also can't be compared across models with different tokenizers. Remember what we learned in episode #72? A model with a larger vocabulary produces fewer tokens per document, which changes the perplexity calculation fundamentally. Comparing perplexity between GPT-4 and Llama is meaningless unless you normalize very carefully.
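One common way to normalize is to divide the total negative log-likelihood by something the tokenizer can't change, such as the number of characters, and report bits per character. The sketch below uses hypothetical numbers (a 1000-character document, a total NLL of 600 nats, two made-up token counts) purely to show the effect; the function names are my own.

```python
import math


def token_perplexity(total_nll_nats, n_tokens):
    """Per-token perplexity from a summed negative log-likelihood (nats)."""
    return math.exp(total_nll_nats / n_tokens)


def bits_per_char(total_nll_nats, n_chars):
    """Tokenizer-independent alternative: bits spent per character of text."""
    return total_nll_nats / (math.log(2) * n_chars)


# Hypothetical numbers: both models assign the same total likelihood to the
# same 1000-character document, but split it into different token counts.
doc_chars = 1000
total_nll = 600.0
for name, n_tokens in [("fine-grained tokenizer", 300),
                       ("coarse tokenizer", 200)]:
    ppl = token_perplexity(total_nll, n_tokens)
    bpc = bits_per_char(total_nll, doc_chars)
    print(f"{name:<24} PPL {ppl:6.2f}   bits/char {bpc:.3f}")
```

Both rows describe equally good predictions of the document, yet the per-token perplexities differ a lot while bits per character stays identical. That is the whole problem in two lines of output.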

So when should you use perplexity? For comparing checkpoints during training (same model, same tokenizer, different weights), for evaluating the same model on different datasets (to detect distribution shift), and for sanity-checking that training is actually improving. Don't use it for comparing different model families or for predicting user satisfaction.

Benchmark suites: the leaderboard approach

The community has developed standardized benchmarks that test specific capabilities. These are the big ones you'll see cited in every model release blog post:

MMLU (Massive Multitask Language Understanding): 57 subjects from STEM to humanities, 14,000+ multiple-choice questions. Tests broad knowledge and reasoning. Example: "What is the capital of Burkina Faso? (A) Ouagadougou (B) Bobo-Dioulasso (C) Koudougou (D) Banfora"

HumanEval: 164 Python programming problems with unit tests. The model generates code, the tests verify correctness. The metric is pass@k -- the fraction of problems where at least one of k generated solutions passes all tests.

GSM8K: 8,500 grade-school math word problems. Tests multi-step arithmetic reasoning. "Roger has 5 tennis balls. He buys 2 more cans of 3 balls each. How many does he have now?"

HellaSwag: commonsense reasoning through sentence completion. Given a context, choose the most plausible continuation from 4 options.

TruthfulQA: tests whether the model generates truthful answers rather than popular misconceptions. "Is cracking your knuckles bad for you?" -- the truthful answer is "no evidence of harm" rather than the common belief.
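For math benchmarks like GSM8K, scoring usually comes down to comparing the final number in the model's response with the reference, whose solutions end in a line like "#### 11". Here is a minimal sketch of that idea. The helper names and the "take the last number in the text" heuristic are my own simplification, not the official scoring script, and it will misread expressions like "5-2".

```python
import re


def extract_final_number(text):
    """Return the last number found in text as a float, or None."""
    matches = re.findall(r"-?\d[\d,]*(?:\.\d+)?", text)
    if not matches:
        return None
    # Drop thousands separators (and any trailing comma) before converting
    return float(matches[-1].replace(",", ""))


def numeric_match(response, reference, tol=1e-6):
    """True if response and reference end on the same number."""
    pred = extract_final_number(response)
    gold = extract_final_number(reference)
    if pred is None or gold is None:
        return False
    return abs(pred - gold) <= tol


response = "He starts with 5 and buys 2 * 3 = 6 more, so he has 11 balls."
print(numeric_match(response, "#### 11"))
```

A heuristic this simple is fragile (units, ranges and restated questions can all fool it), so in practice you'd also instruct the model to end with a clearly marked final answer.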

def evaluate_multiple_choice(model_fn, questions):
    """Evaluate a model on multiple-choice questions.

    model_fn: callable that takes prompt, returns text
    questions: list of dicts with 'question', 'choices', 'answer'
    """
    correct = 0
    total = 0
    per_subject = {}

    for q in questions:
        # Format the prompt
        prompt = q["question"] + "\n"
        for i, choice in enumerate(q["choices"]):
            prompt += f"({chr(65 + i)}) {choice}\n"
        prompt += "Answer: ("

        response = model_fn(prompt)
        # Extract just the letter
        predicted = response.strip()[0].upper() if response.strip() else ""
        expected = q["answer"].upper()

        is_correct = predicted == expected
        correct += int(is_correct)
        total += 1

        # Track per-subject accuracy
        subject = q.get("subject", "unknown")
        if subject not in per_subject:
            per_subject[subject] = {"correct": 0, "total": 0}
        per_subject[subject]["correct"] += int(is_correct)
        per_subject[subject]["total"] += 1

    overall = correct / total if total > 0 else 0

    print(f"Overall accuracy: {correct}/{total} = {overall:.1%}")
    print(f"\nPer-subject breakdown:")
    for subj, stats in sorted(per_subject.items()):
        acc = stats["correct"] / stats["total"]
        print(f"  {subj:<20} {stats['correct']:>3}/"
              f"{stats['total']:<3} = {acc:.1%}")

    return overall, per_subject


# Demo with simulated questions
demo_questions = [
    {"question": "What is the time complexity of binary search?",
     "choices": ["O(1)", "O(log n)", "O(n)", "O(n log n)"],
     "answer": "B", "subject": "computer_science"},
    {"question": "What does DNA stand for?",
     "choices": ["Deoxyribose Nucleic Acid",
                 "Deoxyribonucleic Acid",
                 "Dinitro Nucleic Acid",
                 "Dynamic Nucleotide Array"],
     "answer": "B", "subject": "biology"},
    {"question": "Which planet has the most moons?",
     "choices": ["Jupiter", "Saturn", "Uranus", "Neptune"],
     "answer": "B", "subject": "astronomy"},
]

# Simulate a model that guesses at random (about 25% expected
# accuracy on four-option questions)
import random
random.seed(42)

def fake_model(prompt):
    return random.choice(["A", "B", "C", "D"])

evaluate_multiple_choice(fake_model, demo_questions)

Benchmark scores are useful for rough comparisons but have serious caveats. A model can be "taught to the test" -- trained on benchmark-similar data without genuine understanding. MMLU scores have become inflated over the past two years as training data increasingly includes similar question formats. Having said that, benchmarks are still the best starting point we have for comparing models at scale ;-)

Code evaluation: pass@k

For code generation (which is increasingly one of the most important use cases), we need a different approach. You can't just check if the output matches a reference -- you need to check if it actually works. The pass@k metric from the HumanEval paper handles this elegantly:

import math
import random


def estimate_pass_at_k(n, c, k):
    """Estimate pass@k from n samples with c correct.

    n: total samples generated
    c: number that passed all tests
    k: number of attempts allowed

    Uses the unbiased estimator from the HumanEval paper:
    pass@k = 1 - C(n-c, k) / C(n, k)
    """
    if n - c < k:
        return 1.0

    # Use log space to avoid overflow with large numbers
    log_result = 0.0
    for i in range(k):
        log_result += math.log(n - c - i) - math.log(n - i)
    return 1.0 - math.exp(log_result)


# Example: model generated 20 solutions, 8 passed
n_samples = 20
n_correct = 8

for k in [1, 5, 10, 20]:
    score = estimate_pass_at_k(n_samples, n_correct, k)
    print(f"  pass@{k:<3} = {score:.3f}")

print(f"\nInterpretation:")
print(f"  pass@1:  probability ONE random attempt works")
print(f"  pass@10: probability at least 1 of 10 attempts works")
print(f"  pass@20: if all 20 were tried, did any work?")


# Simulate full evaluation across multiple problems
def simulate_humaneval(num_problems, samples_per_problem,
                       base_solve_rate):
    """Simulate HumanEval-style evaluation."""
    results = []
    for p in range(num_problems):
        # Each problem has a slightly different difficulty
        difficulty = random.uniform(0.5, 1.5)
        solve_prob = min(1.0, base_solve_rate / difficulty)

        # Generate n samples, count how many pass
        passes = sum(1 for _ in range(samples_per_problem)
                     if random.random() < solve_prob)
        results.append({
            "problem": p,
            "n": samples_per_problem,
            "c": passes,
        })

    # Compute pass@k for different k values
    print(f"\nSimulated HumanEval ({num_problems} problems, "
          f"{samples_per_problem} samples each):")
    for k in [1, 5, 10]:
        scores = [estimate_pass_at_k(r["n"], r["c"], k)
                  for r in results]
        avg = sum(scores) / len(scores)
        print(f"  pass@{k:<3} = {avg:.1%}")

    return results


random.seed(42)
simulate_humaneval(50, 20, 0.4)

The pass@k metric is clever because it reports the model's capability at different sampling budgets. pass@1 tells you how reliably the model gets it right on the first try. pass@10 tells you whether the model can solve the problem when given multiple attempts. A model with pass@1=30% but pass@10=85% knows how to solve the problem but isn't consistent about it -- which suggests that sampling several candidates and filtering them (for example by running the tests), or steering the model with better prompts, could help quite a lot.
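If the log-space loop feels opaque, you can cross-check the formula with exact binomial coefficients. This is a small sketch using `math.comb` (Python 3.8+); the function name `pass_at_k_exact` is my own, and the numbers reuse the 20-samples, 8-correct example from above.

```python
import math


def pass_at_k_exact(n, c, k):
    """pass@k = 1 - C(n-c, k) / C(n, k), using exact integer binomials."""
    if n - c < k:
        # Fewer than k failures exist, so any k samples include a pass
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)


for k in [1, 5, 10]:
    print(f"pass@{k:<2} = {pass_at_k_exact(20, 8, k):.3f}")
```

For k=1 the formula collapses to c/n (8/20 here), which is a handy sanity check. The log-space version is still the one to prefer for large n, where the binomials get huge.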

LLM-as-judge: models evaluating models

For open-ended generation tasks, traditional metrics like BLEU and ROUGE don't correlate well with human judgment. Those metrics compare word overlap with a reference -- but a brilliant response that phrases things completely differently would score low. A much more effective approach has emerged: use a strong model to evaluate a weaker model's outputs.

def llm_judge_absolute(judge_fn, question, response,
                       criteria=None):
    """Use an LLM to score a response on multiple criteria."""
    if criteria is None:
        criteria = [
            "Accuracy: Is the information factually correct?",
            "Completeness: Does it address all parts?",
            "Clarity: Is the explanation easy to follow?",
            "Conciseness: Appropriately brief without "
            "omitting important details?",
        ]

    criteria_text = "\n".join(f"- {c}" for c in criteria)

    prompt = (
        f"Evaluate this response on a scale of 1-5 for "
        f"each criterion.\n\n"
        f"Question: {question}\n\n"
        f"Response: {response}\n\n"
        f"Criteria:\n{criteria_text}\n\n"
        f"For each criterion provide:\n"
        f"- Score (1-5)\n"
        f"- One-sentence justification\n\n"
        f"Then provide an overall score (1-5)."
    )

    evaluation = judge_fn(prompt)
    return evaluation


def llm_judge_pairwise(judge_fn, question,
                       response_a, response_b):
    """Pairwise comparison -- more reliable than absolutes."""
    prompt = (
        f"Compare these two responses to the same question.\n\n"
        f"Question: {question}\n\n"
        f"Response A:\n{response_a}\n\n"
        f"Response B:\n{response_b}\n\n"
        f"Which response is better? Consider accuracy, "
        f"helpfulness, and clarity.\n"
        f"First explain your reasoning in 2-3 sentences, "
        f"then state your choice: A or B."
    )
    judgment = judge_fn(prompt)
    return judgment


def mitigate_position_bias(judge_fn, question,
                           response_a, response_b):
    """Run pairwise eval twice with swapped positions."""
    # First evaluation: A then B
    j1 = llm_judge_pairwise(judge_fn, question,
                            response_a, response_b)
    # Second evaluation: B then A
    j2 = llm_judge_pairwise(judge_fn, question,
                            response_b, response_a)

    # If both agree, we have a confident winner
    # If they disagree, it's a tie
    return {"eval_ab": j1, "eval_ba": j2}


# Example usage (with a simulated judge)
def mock_judge(prompt):
    return "Response A is more accurate and concise. Choice: A"

result = llm_judge_pairwise(
    mock_judge,
    "What causes rain?",
    "Water evaporates, forms clouds, condenses, and falls.",
    "Rain happens when clouds get heavy with water droplets."
)
print("Judge verdict:", result)

Pairwise comparison is almost always more reliable than absolute scoring. Instead of asking "rate this response 1-5" (where different judges calibrate their scales wildly differently), ask "which of these two responses is better?" Humans find relative judgments easier than absolute ones, and so do LLM judges.

LLM-as-judge has known biases though -- three big ones you need to watch for:

Verbosity bias: models prefer longer responses, even when shorter would be better
Self-preference bias: a model used as judge tends to prefer outputs that resemble its own style
Position bias: the response listed first gets a slight advantage

Mitigation: swap positions and run twice (as shown above), explicitly instruct the judge to prefer concise answers, and use a different model family as judge than the models being evaluated.
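The swap-and-run-twice trick only pays off once you turn the two raw judgments into a single verdict. Below is a minimal sketch of that step. It assumes the judge ends its answer with "Choice: A" or "Choice: B" (the format our mock judge uses); real judges drift from the requested format, so the `parse_choice` helper and its regex are illustrative rather than robust.

```python
import re


def parse_choice(judgment):
    """Pull 'A' or 'B' out of a verdict containing 'Choice: X'."""
    match = re.search(r"choice:\s*\(?([AB])\b", judgment,
                      flags=re.IGNORECASE)
    return match.group(1).upper() if match else None


def resolve_swapped(eval_ab, eval_ba):
    """Combine two pairwise verdicts where the second run swapped positions.

    Returns 'A' or 'B' (in the ORIGINAL labelling), 'tie', or 'unparsed'.
    """
    first = parse_choice(eval_ab)
    second = parse_choice(eval_ba)
    if first is None or second is None:
        return "unparsed"
    # In the swapped run, "A" on screen was really response B and vice versa
    second_unswapped = "A" if second == "B" else "B"
    return first if first == second_unswapped else "tie"


print(resolve_swapped("Clearer and more complete. Choice: A",
                      "The second one is clearer. Choice: B"))  # consistent
print(resolve_swapped("Choice: A", "Choice: A"))  # position-driven verdict
```

A judge that always picks whatever is listed first (like a mock that always answers "Choice: A") ends up with a tie, which is exactly what you want: the preference carried no information about the responses.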

Human evaluation: the expensive ground truth

For high-stakes applications, there's no substitute for human evaluation. Humans catch things automated metrics miss entirely: factual errors that sound plausible, culturally inappropriate suggestions, subtle logical flaws, and whether the response actually helps with the user's real intent (which is sometimes different from what they literally asked).

class HumanEvalPipeline:
    """Structure for organizing human evaluation campaigns."""

    def __init__(self, criteria):
        self.criteria = criteria
        self.annotations = []

    def create_task(self, prompt, response, annotator_id):
        """Create a single annotation task."""
        task = {
            "prompt": prompt,
            "response": response,
            "annotator": annotator_id,
            "scores": {},
            "comments": "",
        }
        return task

    def submit_annotation(self, task, scores, comments=""):
        """Record a completed annotation."""
        task["scores"] = scores
        task["comments"] = comments
        self.annotations.append(task)

    def compute_agreement(self, annotator_a, annotator_b):
        """Compute inter-annotator agreement (Cohen's kappa)."""
        a_scores = {(a["prompt"], a["response"]): a["scores"]
                    for a in self.annotations
                    if a["annotator"] == annotator_a}
        b_scores = {(b["prompt"], b["response"]): b["scores"]
                    for b in self.annotations
                    if b["annotator"] == annotator_b}

        shared_keys = set(a_scores.keys()) & set(b_scores.keys())

        if not shared_keys:
            return {"kappa": None, "n": 0}

        agreements = 0
        total = 0
        for key in shared_keys:
            for criterion in self.criteria:
                sa = a_scores[key].get(criterion, 0)
                sb = b_scores[key].get(criterion, 0)
                # "Agreement" = within 1 point
                if abs(sa - sb) <= 1:
                    agreements += 1
                total += 1

        observed = agreements / total if total > 0 else 0
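        # Simplified kappa: chance agreement is a hard-coded placeholder.
        # Real Cohen's kappa estimates it from each annotator's own
        # score distribution, so treat this value as illustrative only.
        expected = 0.36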
        kappa = ((observed - expected) / (1 - expected)
                 if expected < 1 else 0)

        return {"kappa": kappa, "observed_agreement": observed,
                "n": len(shared_keys)}

    def summary(self):
        """Aggregate all annotations into a report."""
        if not self.annotations:
            return "No annotations yet."

        scores_by_criterion = {c: [] for c in self.criteria}
        for ann in self.annotations:
            for c in self.criteria:
                if c in ann["scores"]:
                    scores_by_criterion[c].append(ann["scores"][c])

        print("Human Evaluation Summary")
        print(f"Total annotations: {len(self.annotations)}")
        print(f"\n{'Criterion':<25} {'Mean':>6} {'Std':>6} {'N':>4}")
        print("-" * 45)

        for c in self.criteria:
            vals = scores_by_criterion[c]
            if vals:
                mean = sum(vals) / len(vals)
                variance = sum((v - mean) ** 2 for v in vals) / len(vals)
                std = variance ** 0.5
                print(f"{c:<25} {mean:>6.2f} {std:>6.2f} "
                      f"{len(vals):>4}")


# Demo
pipeline = HumanEvalPipeline(
    criteria=["accuracy", "helpfulness", "tone"]
)

# Simulate two annotators rating the same responses
import random
random.seed(42)

for i in range(20):
    prompt = f"Question {i}"
    response = f"Response to question {i}"

    for annotator in ["alice", "bob"]:
        task = pipeline.create_task(prompt, response, annotator)
        scores = {
            "accuracy": random.randint(2, 5),
            "helpfulness": random.randint(2, 5),
            "tone": random.randint(3, 5),
        }
        pipeline.submit_annotation(task, scores)

pipeline.summary()
agreement = pipeline.compute_agreement("alice", "bob")
print(f"\nInter-annotator agreement (kappa): "
      f"{agreement['kappa']:.3f}")
print(f"Observed agreement: "
      f"{agreement['observed_agreement']:.1%}")

The inter-annotator agreement metric is critical. If two human reviewers can't agree on whether a response is good, your evaluation criteria are ambiguous. A kappa below 0.4 means your criteria need reworking before you can trust any results from the evaluation.
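For reference, here is a sketch of the textbook Cohen's kappa, which counts only exact matches and estimates chance agreement from each annotator's own label frequencies rather than a fixed constant. The function name and the two short score lists are made up for illustration.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Unweighted Cohen's kappa for two annotators over the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)

    # Observed agreement: fraction of items with identical labels
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: product of each annotator's label frequencies
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[lab] * freq_b[lab] for lab in freq_a) / (n * n)

    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical 1-5 ratings from two annotators on the same 8 responses
alice = [5, 4, 4, 3, 5, 2, 4, 3]
bob = [5, 4, 3, 3, 5, 2, 4, 4]
print(f"Cohen's kappa: {cohens_kappa(alice, bob):.3f}")
```

For ordinal 1-5 scales a weighted kappa (which gives partial credit for near-misses) is often a better fit, but the unweighted version is the easiest to reason about and matches the 0.4 rule of thumb above.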

Human evaluation is expensive and slow. Use it for: validating that automated metrics actually correlate with quality, evaluating models before high-stakes deployment, and calibrating LLM-as-judge systems. Don't use it for routine A/B testing during development or comparing every checkpoint during training -- that's what perplexity and automated benchmarks are for.

Benchmark contamination: the elephant in the room

Here's an uncomfortable truth about those impressive benchmark scores: many of them may be inflated because the benchmark data leaked into the model's training set. If MMLU questions appear in Common Crawl (the massive web scrape that feeds most LLM training sets), the model doesn't need to actually "know" physics -- it just needs to memorize the answer. And Common Crawl is BIG. Really big. It contains quite a lot of benchmark data that was posted publicly on forums, study guides, and educational websites.

def check_memorization(model_fn, examples, n_continuation=50):
    """Detect potential benchmark contamination.

    Give the model the question + first few tokens of the
    canonical answer. If it can perfectly continue,
    it likely memorized the example.
    """
    suspicious = 0
    clean = 0

    for ex in examples:
        question = ex["question"]
        full_answer = ex["answer"]

        # Give model a running start with the first 20 chars
        prefix = full_answer[:20]
        prompt = f"{question}\n{prefix}"

        generated = model_fn(prompt)

        # Check overlap with the rest of the answer
        remainder = full_answer[20:]
        overlap = compute_ngram_overlap(generated, remainder, n=4)

        if overlap > 0.8:
            suspicious += 1
            print(f"  SUSPICIOUS: {question[:50]}...")
            print(f"    Overlap: {overlap:.1%}")
        else:
            clean += 1

    total = suspicious + clean
    print(f"\nContamination scan: {suspicious}/{total} "
          f"suspicious ({suspicious/total:.0%})")
    return suspicious, clean


def compute_ngram_overlap(text_a, text_b, n=4):
    """Compute n-gram overlap between two texts."""
    def get_ngrams(text, n):
        words = text.lower().split()
        return set(tuple(words[i:i+n])
                   for i in range(len(words) - n + 1))

    if not text_a.strip() or not text_b.strip():
        return 0.0

    ngrams_a = get_ngrams(text_a, n)
    ngrams_b = get_ngrams(text_b, n)

    if not ngrams_a or not ngrams_b:
        return 0.0

    overlap = ngrams_a & ngrams_b
    return len(overlap) / min(len(ngrams_a), len(ngrams_b))


# Simulated contamination check
CANONICAL = ("the answer is clearly B because Ouagadougou "
             "is the capital city of Burkina Faso since 1960")

def mock_model_memorized(prompt):
    # A memorizing model continues the canonical answer from
    # wherever the running-start prefix in the prompt stops
    prefix = prompt.split("\n", 1)[-1]
    return CANONICAL[len(prefix):]

def mock_model_clean(prompt):
    return "I think the answer involves the largest city"

examples = [
    {"question": "Capital of Burkina Faso?", "answer": CANONICAL},
]

print("Testing potentially contaminated model:")
check_memorization(mock_model_memorized, examples)

print("\nTesting clean model:")
check_memorization(mock_model_clean, examples)

Contamination is hard to detect definitively. Models can memorize paraphrases, not just exact text. Some labs publish contamination analyses alongside their models (good practice); many don't.
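One cheap probe for multiple-choice benchmarks is to reorder the answer options and see whether accuracy survives. The sketch below is a toy illustration of that idea, not a standard tool: the function names, the item format (four unique options, answer as a letter) and both mock models are assumptions made up for this example. A big drop after reordering hints that the model learned "question X -> letter B" rather than the underlying fact; no drop does not prove the benchmark is clean.

```python
LETTERS = "ABCD"


def rotate_options(item, k=1):
    """Return a copy of the item with its options rotated by k positions."""
    correct_text = item["options"][LETTERS.index(item["answer"])]
    options = item["options"][k:] + item["options"][:k]
    return {
        "question": item["question"],
        "options": options,
        # The correct letter moves along with the correct option text
        "answer": LETTERS[options.index(correct_text)],
    }


def format_prompt(item):
    lines = [item["question"]]
    lines += [f"{l}. {o}" for l, o in zip(LETTERS, item["options"])]
    return "\n".join(lines) + "\nAnswer:"


def accuracy(model_fn, items):
    hits = sum(
        model_fn(format_prompt(it)).strip().upper()[:1] == it["answer"]
        for it in items
    )
    return hits / len(items) if items else 0.0


def reorder_gap(model_fn, items, k=1):
    """Accuracy on the original items vs. the same items with rotated options."""
    rotated = [rotate_options(it, k) for it in items]
    return accuracy(model_fn, items), accuracy(model_fn, rotated)


ITEMS = [
    {"question": "Capital of Burkina Faso?",
     "options": ["Accra", "Ouagadougou", "Bamako", "Niamey"], "answer": "B"},
    {"question": "Chemical symbol for sodium?",
     "options": ["Na", "So", "Sd", "N"], "answer": "A"},
]


def letter_memorizer(prompt):
    # Mock model that memorized the answer letter of the published benchmark
    memorized = {"Capital of Burkina Faso?": "B",
                 "Chemical symbol for sodium?": "A"}
    return memorized[prompt.split("\n")[0]]


def content_knower(prompt):
    # Mock model that knows the fact and looks for it among the options
    facts = {"Capital of Burkina Faso?": "Ouagadougou",
             "Chemical symbol for sodium?": "Na"}
    lines = prompt.split("\n")
    for line in lines[1:-1]:
        letter, text = line.split(". ", 1)
        if text == facts[lines[0]]:
            return letter
    return "?"


for name, fn in [("letter_memorizer", letter_memorizer),
                 ("content_knower", content_knower)]:
    before, after = reorder_gap(fn, ITEMS)
    print(f"{name}: original {before:.0%}, reordered {after:.0%}")
```

In this toy setup the letter memorizer falls from a perfect score to zero once the options move, while the model that matches on content is unaffected.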

The solution is dynamic evaluation: create new benchmarks regularly, use held-out test sets that are never published on the internet, and test on real user tasks rather than static benchmarks. The Chatbot Arena leaderboard addresses this brilliantly by using real user conversations and live head-to-head comparisons -- it's much harder to game than static benchmarks because the test set is constantly changing.
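To make "dynamic evaluation" a bit more concrete, here is a minimal sketch that generates fresh test items from a template instead of reading them from a fixed file. Everything in it is an assumption for illustration: the template, the function names and the mock solver are invented, and real dynamic benchmarks are far richer than one arithmetic pattern. The point is only that a new seed gives a new test set whose exact wording and numbers were never published anywhere.

```python
import random
import re


def make_arithmetic_items(n, seed):
    """Generate n word problems with known answers from a template.

    The seed stands in for "this week's" test set: change it and the
    concrete numbers change, so memorizing old items does not help.
    """
    rng = random.Random(seed)
    items = []
    for _ in range(n):
        price = rng.randint(2, 20)
        qty = rng.randint(3, 12)
        paid = price * qty + rng.randint(1, 50)
        items.append({
            "question": (f"A notebook costs {price} euros. You buy {qty} "
                         f"and pay with {paid} euros. "
                         f"How much change do you get?"),
            "answer": paid - price * qty,
        })
    return items


def score(model_fn, items):
    """Fraction of items where the last number in the reply is correct."""
    correct = 0
    for it in items:
        numbers = re.findall(r"-?\d+", model_fn(it["question"]))
        if numbers and int(numbers[-1]) == it["answer"]:
            correct += 1
    return correct / len(items) if items else 0.0


def mock_solver(prompt):
    # Mock model that actually does the arithmetic
    price, qty, paid = (int(x) for x in re.findall(r"\d+", prompt))
    return f"The change is {paid - price * qty} euros."


def mock_guesser(prompt):
    # Mock model that has no idea
    return "I don't know"


items = make_arithmetic_items(5, seed=2024)
print(f"solver:  {score(mock_solver, items):.0%}")
print(f"guesser: {score(mock_guesser, items):.0%}")
```

Keeping the seed fixed makes a run reproducible; rotating it per evaluation round is what makes the set a moving target.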

Building custom evaluations for your application

For production applications, public benchmarks are a starting point but not the destination. Your users have specific needs that no general benchmark captures. A model that scores 90% on MMLU might be useless for your particular customer support chatbot if it hallucinates product features that don't exist.

class ProductionEvalSuite:
    """Custom evaluation framework for YOUR specific use case."""

    def __init__(self):
        self.test_cases = []
        self.results = []

    def add_case(self, name, prompt, checks, category="general"):
        """Add a test case with automated checks."""
        self.test_cases.append({
            "name": name,
            "prompt": prompt,
            "checks": checks,
            "category": category,
        })

    def evaluate(self, model_fn):
        """Run all test cases against a model."""
        self.results = []
        for case in self.test_cases:
            response = model_fn(case["prompt"])
            check_results = {}
            for check_name, check_fn in case["checks"].items():
                try:
                    check_results[check_name] = bool(check_fn(response))
                except Exception:
                    # A check that crashes counts as a failed check
                    check_results[check_name] = False

            passed = all(check_results.values())
            self.results.append({
                "name": case["name"],
                "category": case["category"],
                "passed": passed,
                "checks": check_results,
                "response_preview": response[:150],
            })

        return self.summary()

    def summary(self):
        """Print evaluation report grouped by category."""
        by_cat = {}
        for r in self.results:
            cat = r["category"]
            if cat not in by_cat:
                by_cat[cat] = {"passed": 0, "total": 0, "fails": []}
            by_cat[cat]["total"] += 1
            if r["passed"]:
                by_cat[cat]["passed"] += 1
            else:
                by_cat[cat]["fails"].append(r["name"])

        total_passed = sum(c["passed"] for c in by_cat.values())
        total_tests = sum(c["total"] for c in by_cat.values())

        if total_tests == 0:
            # No cases were run: avoid dividing by zero below
            print("No test cases to report.")
            return 0.0

        print("Evaluation Report")
        print("=" * 50)
        print(f"Overall: {total_passed}/{total_tests} passed "
              f"({total_passed/total_tests:.0%})\n")

        for cat, stats in sorted(by_cat.items()):
            rate = stats["passed"] / stats["total"]
            print(f"  {cat}: {stats['passed']}/{stats['total']} "
                  f"({rate:.0%})")
            for fail in stats["fails"]:
                print(f"    FAIL: {fail}")

        return total_passed / total_tests


# Build a suite for a customer support chatbot
suite = ProductionEvalSuite()

suite.add_case(
    "password_reset",
    "How do I reset my password?",
    {
        "mentions_email": lambda r: "email" in r.lower(),
        "has_steps": lambda r: any(c.isdigit() for c in r),
        "reasonable_length": lambda r: 20 < len(r.split()) < 300,
        "no_hallucinated_urls": lambda r: (
            "http" not in r or "example.com" in r
        ),
    },
    category="account_management",
)

suite.add_case(
    "refund_request",
    "I want to cancel my subscription and get a refund.",
    {
        "acknowledges": lambda r: "cancel" in r.lower(),
        "mentions_policy": lambda r: (
            "policy" in r.lower() or "refund" in r.lower()
        ),
        "empathetic": lambda r: any(
            w in r.lower()
            for w in ["sorry", "understand", "appreciate", "help"]
        ),
    },
    category="billing",
)

suite.add_case(
    "out_of_scope",
    "What is the meaning of life?",
    {
        "stays_on_topic": lambda r: any(
            w in r.lower()
            for w in ["support", "help", "assist", "question"]
        ),
        "not_philosophical": lambda r: "42" not in r,
    },
    category="guardrails",
)


# Test with a mock model
def mock_support_bot(prompt):
    if "password" in prompt.lower():
        return ("To reset your password: 1) Go to the login page "
                "2) Click 'Forgot Password' 3) Enter your email "
                "address 4) Check your inbox for the reset link.")
    elif "cancel" in prompt.lower():
        return ("I understand you'd like to cancel. Per our refund "
                "policy, we can process a full refund within 30 days. "
                "Let me help you with that.")
    else:
        return ("I'm here to help with support questions! Could "
                "you tell me more about what you need assistance with?")


suite.evaluate(mock_support_bot)

The best evaluation suites combine three layers:

Exact checks: does the output contain required information? Is it valid JSON? Does the code compile?
LLM-as-judge: is the response helpful? Is the tone appropriate? Does it contradict previous statements?
Human review: sample 5-10% of responses for manual inspection. Focus human attention on cases where automated checks are uncertain or disagree.
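As a rough sketch of how the first and third layers could look in code: two exact checks (valid JSON, Python source that at least parses) and a helper that builds the human review queue. The record format with "id", "exact_pass" and "judge_pass" is a hypothetical shape chosen for this example, and the sampling rule is one reasonable reading of the advice above, not a fixed recipe.

```python
import json
import random


def is_valid_json(text):
    """Exact check: does the response parse as JSON?"""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


def python_compiles(source):
    """Exact check: syntax only. compile() parses the code, it does not run it."""
    try:
        compile(source, "<response>", "exec")
        return True
    except (SyntaxError, ValueError):
        return False


def pick_for_human_review(records, sample_rate=0.05, seed=0):
    """Queue every record where the layers disagree, plus a random
    sample of the records where they agree.

    Each record is a dict with the (hypothetical) keys "id",
    "exact_pass" and "judge_pass".
    """
    rng = random.Random(seed)
    disagreements = [r for r in records
                     if r["exact_pass"] != r["judge_pass"]]
    agreed = [r for r in records
              if r["exact_pass"] == r["judge_pass"]]
    # At least one spot check whenever there is anything to sample from
    k = min(len(agreed), max(1, round(len(agreed) * sample_rate)))
    return disagreements + rng.sample(agreed, k)


print(is_valid_json('{"status": "ok"}'))
print(is_valid_json("{status: ok}"))
print(python_compiles("x = 1"))
print(python_compiles("def f(:\n    pass"))

# 100 fake results: the first 20 are cases where the judge disagrees
records = [{"id": i, "exact_pass": True, "judge_pass": i >= 20}
           for i in range(100)]
queue = pick_for_human_review(records, sample_rate=0.05)
print(f"{len(queue)} of {len(records)} responses queued for review")
```

Disagreements go to the front of the queue because that is where a human look is most likely to teach you something about either the check or the judge.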

Start with 50-100 test cases that represent your actual use cases. Expand as you discover failure modes. Every bug report becomes a new test case. Over time, your custom eval suite becomes the most reliable predictor of model quality for your specific application -- way more useful than knowing the model scored 87% on MMLU.
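Turning a bug report into a test case can be as small as the helper below. It is a sketch under assumptions: the helper name, the report ID and the example phrases are all made up, and "the bot said something it must never say again" is only one kind of bug. The returned dict uses the same name/prompt/checks/category shape that add_case() above takes.

```python
def case_from_bug_report(report_id, prompt, forbidden_phrases,
                         category="regression"):
    """Build a regression test case from a bug report.

    forbidden_phrases are claims the model made in the bad response
    and must not repeat.
    """
    checks = {
        # p=phrase binds the current phrase; without it every lambda
        # would see only the last phrase of the loop
        f"avoids_{i}": (lambda r, p=phrase: p.lower() not in r.lower())
        for i, phrase in enumerate(forbidden_phrases)
    }
    return {
        "name": f"bug_{report_id}",
        "prompt": prompt,
        "checks": checks,
        "category": category,
    }


# Hypothetical report: the bot promised a feature that does not exist
case = case_from_bug_report(
    "1042",
    "Does the Pro plan include phone support?",
    ["24/7 phone support"],
)

bad_reply = "Yes! Pro includes 24/7 phone support."
good_reply = "Pro includes email support with a one-day response time."

for label, reply in [("bad", bad_reply), ("good", good_reply)]:
    passed = all(fn(reply) for fn in case["checks"].values())
    print(f"{case['name']} on {label} reply: {'PASS' if passed else 'FAIL'}")
```

With the suite from earlier in this episode, such a case could be registered via suite.add_case(case["name"], case["prompt"], case["checks"], category=case["category"]).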

Summary

Perplexity measures prediction quality but not usefulness -- use it for comparing training runs and detecting distribution shift, not for comparing model families with different tokenizers;
standard benchmarks (MMLU, HumanEval, GSM8K) provide useful rough comparisons but suffer from contamination and "teaching to the test" -- treat leaderboard scores as a starting point, not gospel;
pass@k separates a model's capability from its consistency -- pass@1 for reliability, pass@10 for "can it solve this at all?";
LLM-as-judge scales well for open-ended evaluation -- pairwise comparison is more reliable than absolute scoring, but you need to actively mitigate verbosity bias and position bias;
human evaluation remains the ground truth for high-stakes decisions -- expensive but irreplaceable for catching subtle failures that automated metrics miss;
benchmark contamination inflates scores; prefer dynamic evaluations and real user tasks over static benchmarks;
custom evaluation suites tailored to your actual use case are the most valuable long-term investment in model quality -- combine automated checks, LLM-as-judge, and sampled human review.

Exercises

Exercise 1: Build a benchmark suite runner. Create a class BenchmarkRunner that manages a collection of evaluation tasks. It should: (a) support both multiple-choice tasks and open-ended generation tasks (use a task_type field), (b) for multiple-choice: format the prompt, parse the model's answer letter, compute accuracy per subject and overall, (c) for generation: run a list of check functions against each response and compute pass rate, (d) produce a formatted report showing accuracy per task type, per subject/category, and overall. Pre-populate it with 15 multiple-choice questions across 3 subjects (math, science, history -- 5 each) and 5 generation tasks with check functions. Use a simulated model function that returns predetermined answers (so you can verify the scoring logic). Print the full report.

Exercise 2: Build a pairwise tournament evaluator. Create a class TournamentEvaluator that: (a) takes a list of "model" functions and a set of test prompts, (b) generates responses from every model for every prompt, (c) runs pairwise comparisons using a simulated judge function (you define the judging logic -- e.g., prefer longer responses that contain keywords from the question), (d) computes an Elo rating for each model (start at 1000, K-factor 32, use standard Elo formula), (e) runs position-bias mitigation by evaluating each pair twice with swapped order. Print the final Elo rankings and a win/loss/draw matrix. Use 4 simulated models with different "quality levels" (one always verbose, one always concise, one accurate but short, one long and inaccurate) and 10 test prompts.

Exercise 3: Build a contamination detector. Create a function detect_contamination(model_fn, test_examples) that: (a) takes a model function and a list of benchmark examples (each with question + canonical answer), (b) for each example, gives the model the question + first 30 characters of the answer as a "running start", (c) measures 4-gram overlap between the model's continuation and the remaining canonical answer, (d) classifies each example as "likely memorized" (>80% overlap), "possibly memorized" (50-80%), or "clean" (<50%), (e) prints a contamination report with counts per category and examples of the most suspicious cases. Test with two simulated models: one that "memorizes" (returns text very close to the canonical answer) and one that generates original responses. Create at least 10 test examples.

Regards! Thanks for reading.

Hive account: @scipio
