Learn AI Series (#72) - Tokenization Deep Dive
What will I learn
- You will learn why tokenization is one of the most underappreciated decisions in language model design;
- character-level, word-level, and subword tokenization -- and why subword won;
- Byte Pair Encoding (BPE): the algorithm step by step, implemented from scratch;
- WordPiece and Unigram models: alternative subword approaches;
- SentencePiece: language-agnostic tokenization from raw text;
- tiktoken and real tokenizer behavior with actual models;
- vocabulary size tradeoffs and why 32K-128K is the sweet spot;
- tokenization artifacts and how they explain many apparent model failures.
Requirements
- A working modern computer running macOS, Windows or Ubuntu;
- An installed Python 3(.11+) distribution;
- The ambition to learn AI and machine learning.
Difficulty
- Beginner
Curriculum (of the Learn AI Series):
- Learn AI Series (#1) - What Machine Learning Actually Is
- Learn AI Series (#2) - Setting Up Your AI Workbench - Python and NumPy
- Learn AI Series (#3) - Your Data Is Just Numbers - How Machines See the World
- Learn AI Series (#4) - Your First Prediction - No Math, Just Intuition
- Learn AI Series (#5) - Patterns in Data - What "Learning" Actually Looks Like
- Learn AI Series (#6) - From Intuition to Math - Why We Need Formulas
- Learn AI Series (#7) - The Training Loop - See It Work Step by Step
- Learn AI Series (#8) - The Math You Actually Need (Part 1) - Linear Algebra
- Learn AI Series (#9) - The Math You Actually Need (Part 2) - Calculus and Probability
- Learn AI Series (#10) - Your First ML Model - Linear Regression From Scratch
- Learn AI Series (#11) - Making Linear Regression Real
- Learn AI Series (#12) - Classification - Logistic Regression From Scratch
- Learn AI Series (#13) - Evaluation - How to Know If Your Model Actually Works
- Learn AI Series (#14) - Data Preparation - The 80% Nobody Talks About
- Learn AI Series (#15) - Feature Engineering and Selection
- Learn AI Series (#16) - Scikit-Learn - The Standard Library of ML
- Learn AI Series (#17) - Decision Trees - How Machines Make Decisions
- Learn AI Series (#18) - Random Forests - Wisdom of Crowds
- Learn AI Series (#19) - Gradient Boosting - The Kaggle Champion
- Learn AI Series (#20) - Support Vector Machines - Drawing the Perfect Boundary
- Learn AI Series (#21) - Mini Project - Predicting Crypto Market Regimes
- Learn AI Series (#22) - K-Means Clustering - Finding Groups
- Learn AI Series (#23) - Advanced Clustering - Beyond K-Means
- Learn AI Series (#24) - Dimensionality Reduction - PCA
- Learn AI Series (#25) - Advanced Dimensionality Reduction - t-SNE and UMAP
- Learn AI Series (#26) - Anomaly Detection - Finding What Doesn't Belong
- Learn AI Series (#27) - Recommendation Systems - "Users Like You Also Liked..."
- Learn AI Series (#28) - Time Series Fundamentals - When Order Matters
- Learn AI Series (#29) - Time Series Forecasting - Predicting What Comes Next
- Learn AI Series (#30) - Natural Language Processing - Text as Data
- Learn AI Series (#31) - Word Embeddings - Meaning in Numbers
- Learn AI Series (#32) - Bayesian Methods - Thinking in Probabilities
- Learn AI Series (#33) - Ensemble Methods Deep Dive - Stacking and Blending
- Learn AI Series (#34) - ML Engineering - From Notebook to Production
- Learn AI Series (#35) - Data Ethics and Bias in ML
- Learn AI Series (#36) - Mini Project - Complete ML Pipeline
- Learn AI Series (#37) - The Perceptron - Where It All Started
- Learn AI Series (#38) - Neural Networks From Scratch - Forward Pass
- Learn AI Series (#39) - Neural Networks From Scratch - Backpropagation
- Learn AI Series (#40) - Training Neural Networks - Practical Challenges
- Learn AI Series (#41) - Optimization Algorithms - SGD, Momentum, Adam
- Learn AI Series (#42) - PyTorch Fundamentals - Tensors and Autograd
- Learn AI Series (#43) - PyTorch Data and Training
- Learn AI Series (#44) - PyTorch nn.Module - Building Real Networks
- Learn AI Series (#45) - Convolutional Neural Networks - Theory
- Learn AI Series (#46) - CNNs in Practice - Classic to Modern Architectures
- Learn AI Series (#47) - CNN Applications - Detection, Segmentation, Style Transfer
- Learn AI Series (#48) - Recurrent Neural Networks - Sequences
- Learn AI Series (#49) - LSTM and GRU - Solving the Memory Problem
- Learn AI Series (#50) - Sequence-to-Sequence Models
- Learn AI Series (#51) - Attention Mechanisms
- Learn AI Series (#52) - The Transformer Architecture (Part 1)
- Learn AI Series (#53) - The Transformer Architecture (Part 2)
- Learn AI Series (#54) - Vision Transformers
- Learn AI Series (#55) - Generative Adversarial Networks
- Learn AI Series (#56) - Mini Project - Building a Transformer From Scratch
- Learn AI Series (#57) - Language Modeling - Predicting the Next Word
- Learn AI Series (#58) - GPT Architecture - Decoder-Only Transformers
- Learn AI Series (#59) - BERT and Encoder Models
- Learn AI Series (#60) - Training Large Language Models
- Learn AI Series (#61) - Instruction Tuning and Alignment
- Learn AI Series (#62) - Prompt Engineering - Getting the Most from LLMs
- Learn AI Series (#63) - Embeddings and Vector Search
- Learn AI Series (#64) - Retrieval-Augmented Generation (RAG) - Basics
- Learn AI Series (#65) - RAG - Advanced Techniques
- Learn AI Series (#66) - Working with LLM APIs
- Learn AI Series (#67) - Building AI Agents (Part 1) - Foundations
- Learn AI Series (#68) - Building AI Agents (Part 2) - Advanced Patterns
- Learn AI Series (#69) - Fine-Tuning Language Models
- Learn AI Series (#70) - Running Local Models
- Learn AI Series (#71) - Text Generation Techniques
- Learn AI Series (#72) - Tokenization Deep Dive (this post)
Learn AI Series (#72) - Tokenization Deep Dive
Solutions to Episode #71 Exercises
Exercise 1: Temperature visualizer.
import torch
import torch.nn.functional as F
import math
def visualize_temperature(logits, temperatures):
"""Show how temperature changes the output distribution."""
print(f"{'Temp':>6} {'Entropy':>9} {'Top-1 Prob':>11} "
f"{'Eff Vocab':>10}")
print("-" * 40)
results = {}
for t in temperatures:
probs = F.softmax(logits / t, dim=-1)
# Shannon entropy
log_p = torch.log2(probs + 1e-10)
entropy = -(probs * log_p).sum().item()
# Top-1 probability
top1 = probs.max().item()
# Effective vocab: tokens with prob > 1% of total
eff_vocab = (probs > 0.01).sum().item()
results[t] = {
"entropy": entropy,
"top1": top1,
"eff_vocab": eff_vocab,
}
print(f"{t:>6.1f} {entropy:>9.3f} {top1:>10.4f} "
f"{eff_vocab:>10}")
return results
def recommend_temperature(task_type):
"""Return recommended temperature range by task."""
ranges = {
"factual": (0.1, 0.4),
"balanced": (0.5, 0.8),
"creative": (0.9, 1.5),
}
if task_type not in ranges:
return None
low, high = ranges[task_type]
print(f" {task_type}: temperature {low} - {high}")
return (low, high)
# Test it
logits = torch.randn(20) * 2.0
temps = [0.1, 0.3, 0.5, 0.7, 1.0, 1.5, 2.0]
visualize_temperature(logits, temps)
print()
for task in ["factual", "balanced", "creative"]:
recommend_temperature(task)
The entropy column is the one to watch here. At temperature 0.1, entropy is near zero -- the distribution is essentially a one-hot vector, so there's almost no randomness. At temperature 2.0, entropy approaches its maximum value, meaning the model is practically sampling uniformly. The "effective vocabulary" metric gives you a quick intuitive sense for how many tokens the model is genuinely considering at each temperature setting.
Exercise 2: Repetition detector and auto-penalizer.
from collections import Counter
class RepetitionGuard:
"""Monitor generated text for repetitive patterns."""
def __init__(self, window=50):
self.window = window
def check(self, generated_ids):
recent = generated_ids[-self.window:]
if len(recent) < 4:
return {"token_repeats": [], "bigram_repeats": [],
"longest_repeat": 0, "severity": 0.0}
# Token frequencies
counts = Counter(recent)
total = len(recent)
token_repeats = [
(tok, c) for tok, c in counts.most_common(5)
if c > total * 0.15
]
# Bigram frequencies
bigrams = [(recent[i], recent[i + 1])
for i in range(len(recent) - 1)]
bi_counts = Counter(bigrams)
bigram_repeats = [
(bg, c) for bg, c in bi_counts.most_common(5)
if c > total * 0.08
]
# Longest repeated consecutive substring
longest = self._longest_repeat(recent)
# Severity: combine signals
tok_score = min(len(token_repeats) * 0.15, 0.4)
bi_score = min(len(bigram_repeats) * 0.2, 0.4)
len_score = min(longest / 10.0, 0.4)
severity = min(tok_score + bi_score + len_score, 1.0)
return {
"token_repeats": token_repeats,
"bigram_repeats": bigram_repeats,
"longest_repeat": longest,
"severity": severity,
}
def _longest_repeat(self, seq):
"""Find longest substring that appears more than once."""
n = len(seq)
best = 0
for length in range(2, n // 2 + 1):
for start in range(n - length):
pattern = tuple(seq[start:start + length])
for j in range(start + length, n - length + 1):
if tuple(seq[j:j + length]) == pattern:
best = max(best, length)
break
return best
def adaptive_penalty(self, severity):
"""Map severity [0,1] to repetition penalty [1.0,1.5]."""
return 1.0 + severity * 0.5
# Test with intentional repetition
import random
random.seed(42)
tokens = [random.randint(0, 100) for _ in range(50)]
# Insert repeating pattern at positions 50-70
pattern = [10, 20, 30]
for i in range(20):
tokens.append(pattern[i % 3])
# Add more random tokens
tokens.extend([random.randint(0, 100) for _ in range(30)])
guard = RepetitionGuard(window=50)
result = guard.check(tokens)
print(f"Token repeats: {result['token_repeats']}")
print(f"Bigram repeats: {result['bigram_repeats']}")
print(f"Longest repeat: {result['longest_repeat']}")
print(f"Severity: {result['severity']:.2f}")
print(f"Recommended penalty: {guard.adaptive_penalty(result['severity']):.2f}")
The repeating [10, 20, 30] pattern gets flagged immediately by both the bigram detector (the pair (30, 10) appears way too often) and the longest-repeat finder. A severity of ~0.6-0.8 maps to a penalty around 1.3-1.4, which is aggressive enough to break the loop without making the output incoherent. Sensible defaults.
Exercise 3: Decoding strategy comparator.
import torch
import torch.nn.functional as F
import math
from collections import Counter
class DecodingComparator:
"""Compare sampling strategies on a fixed distribution."""
def __init__(self, probs, n_trials=1000):
self.probs = probs
self.n_trials = n_trials
def greedy(self):
token = torch.argmax(self.probs).item()
return [token] * self.n_trials
def pure_sample(self):
return torch.multinomial(
self.probs.unsqueeze(0).expand(self.n_trials, -1),
num_samples=1).squeeze(-1).tolist()
def top_k_sample(self, k):
top_vals, top_idx = torch.topk(self.probs, k)
renorm = top_vals / top_vals.sum()
samples = torch.multinomial(
renorm.unsqueeze(0).expand(self.n_trials, -1),
num_samples=1).squeeze(-1)
return [top_idx[s].item() for s in samples.tolist()]
def top_p_sample(self, p):
sorted_p, sorted_i = torch.sort(
self.probs, descending=True)
cumsum = torch.cumsum(sorted_p, dim=-1)
mask = cumsum - sorted_p > p
sorted_p[mask] = 0.0
sorted_p = sorted_p / sorted_p.sum()
samples = torch.multinomial(
sorted_p.unsqueeze(0).expand(self.n_trials, -1),
num_samples=1).squeeze(-1)
return [sorted_i[s].item() for s in samples.tolist()]
def analyze(self, samples, name):
counts = Counter(samples)
n_unique = len(counts)
# Empirical distribution
emp = torch.zeros_like(self.probs)
for tok, c in counts.items():
emp[tok] = c / self.n_trials
# Entropy of empirical distribution
log_e = torch.log2(emp + 1e-10)
entropy = -(emp * log_e).sum().item()
# Top-1 from original selected at least once?
top1 = torch.argmax(self.probs).item()
top1_hit = top1 in counts
# KL divergence: D_KL(original || empirical)
kl = F.kl_div(
torch.log(emp + 1e-10),
self.probs,
reduction="sum"
).item()
return {
"name": name, "unique": n_unique,
"entropy": entropy, "top1_hit": top1_hit,
"kl_div": kl,
}
def compare_all(self):
strategies = [
(self.greedy(), "greedy"),
(self.pure_sample(), "pure"),
(self.top_k_sample(10), "top-k=10"),
(self.top_k_sample(20), "top-k=20"),
(self.top_k_sample(50), "top-k=50"),
(self.top_p_sample(0.5), "top-p=0.5"),
(self.top_p_sample(0.9), "top-p=0.9"),
(self.top_p_sample(0.95), "top-p=0.95"),
]
results = [self.analyze(s, n) for s, n in strategies]
print(f"{'Strategy':7} {'Entropy':>8} "
f"{'Top-1?':>7} {'KL Div':>8}")
print("-" * 46)
for r in results:
hit = "yes" if r["top1_hit"] else "no"
print(f"{r['name']:7} "
f"{r['entropy']:>8.3f} {hit:>7} "
f"{r['kl_div']:>8.4f}")
return results
# Zipf distribution over 1000 tokens
vocab = 1000
ranks = torch.arange(1, vocab + 1, dtype=torch.float32)
probs = (1.0 / ranks)
probs = probs / probs.sum()
comp = DecodingComparator(probs, n_trials=1000)
comp.compare_all()
The KL divergence column tells the real story. Greedy has infinite KL divergence from the original (it completely ignores 999 tokens). Pure sampling matches the original distribution almost perfectly (near-zero KL). Top-p=0.9 sits in a nice middle ground -- it cuts the tail off but keeps the core distribution intact. That's exactly why top-p is the default for most production systems: it preserves the model's beliefs where they matter while trimming the noise.
On to today's episode
Here we go! We touched on tokenization briefly back in episode #57 when we discussed language models. But tokenization deserves its own deep dive because it's one of those deceptively simple-looking steps that has profound effects on everything downstream. How you chop text into tokens determines what the model can learn, how efficiently it processes language, and what kinds of mistakes it makes.
Here's a question that trips up most people: why can modern language models solve complex calculus but struggle to count the number of "r"s in "strawberry"? The answer is tokenization. The model doesn't see individual characters -- it sees tokens, and "strawberry" gets split into pieces that hide the individual letters. We'll get to exactly how and why by the end of this episode.
If you remember from episode #30 (NLP fundamentals), we already dealt with text preprocessing -- cleaning, lowercasing, removing stopwords. Tokenization sits one layer below all of that. It's the very first transformation that raw text goes through before the model ever sees it. And unlike most preprocessing choices (which you can change during training), the tokenizer is baked into the model. Change the tokenizer and you have a completely different model. No exceptions ;-)
The three levels of tokenization
There are three fundamental approaches to splitting text into pieces, and understanding the tradeoffs between them is critical.
Character-level tokenization splits text into individual characters. "Hello" becomes ["H", "e", "l", "l", "o"]. The vocabulary is tiny -- a few hundred characters covers basically every language. Any text whatsoever can be represented. Sounds great, right? The problem is that sequences become extremely long. A 1000-word document turns into roughly 5000 characters. The model has to learn everything from scratch: spelling, word boundaries, grammar, idioms, all from individual characters. It works (and some research models do use character-level tokenization), but training is slow and the model needs to spend a lot of its capacity just learning to spell.
Word-level tokenization goes the opposite direction: split on whitespace and punctuation. "Hello world" becomes ["Hello", "world"]. Sequences are short and each token is already a meaningful unit. But the vocabulary explodes. English alone has hundreds of thousands of words, and you still can't handle misspellings, new words, made-up compound words, or source code. Any word not in the vocabulary becomes an <UNK> (unknown) token -- a dead end where all information about that word is lost.
Subword tokenization is the compromise that won, and it won decisively. Split common words into single tokens ("the", "hello") while decomposing rare words into meaningful pieces ("un" + "expect" + "edly"). Vocabulary size is bounded (typically 32K-100K tokens), sequences stay reasonably short, and any text can be represented because unknown words naturally decompose into known subwords.
# See it in action with a real tokenizer
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("gpt2")
text = "Tokenization is surprisingly important for LLMs"
tokens = tokenizer.tokenize(text)
ids = tokenizer.encode(text)
print(f"Text: {text}")
print(f"Tokens: {tokens}")
print(f"IDs: {ids}")
print(f"Number of tokens: {len(ids)}")
# Tokens: ['Token', 'ization', ' is', ' surprisingly',
# ' important', ' for', ' LL', 'Ms']
# Common words -> single tokens
# "Tokenization" -> 2 pieces, "LLMs" -> 2 pieces
Every modern language model -- GPT-2, GPT-4, Llama, Mistral, all of them -- uses some form of subword tokenization. The three main algorithms are BPE (Byte Pair Encoding), WordPiece, and Unigram. Let's build each one from scratch.
Byte Pair Encoding: the algorithm that powers GPT
BPE (Sennrich et al., 2015) is the most widely used subword tokenization algorithm. GPT-2, GPT-3, GPT-4, Llama, and most modern LLMs use BPE or a variant of it. The algorithm itself is genuinely elegant.
Training phase (building the vocabulary):
- Start with a base vocabulary of all individual characters (or bytes)
- Count all adjacent pairs of tokens in the training corpus
- Merge the most frequent pair into a single new token
- Repeat from step 2 until the vocabulary reaches the desired size
That's it. Four steps. Let's implement it:
def get_vocab(words):
"""Extract the current vocabulary from tokenized words."""
vocab = set()
for word in words:
for token in word:
vocab.add(token)
return vocab
def train_bpe(corpus, vocab_size):
"""Train BPE from scratch on a corpus."""
# Start with character-level tokens + end-of-word marker
words = {}
for word in corpus.split():
chars = list(word) + [""]
words[tuple(chars)] = words.get(tuple(chars), 0) + 1
merges = []
while len(get_vocab(words)) < vocab_size:
# Count all adjacent pairs across the corpus
pairs = {}
for word, freq in words.items():
for i in range(len(word) - 1):
pair = (word[i], word[i + 1])
pairs[pair] = pairs.get(pair, 0) + freq
if not pairs:
break
# Find the most frequent pair
best_pair = max(pairs, key=pairs.get)
merges.append(best_pair)
# Apply the merge everywhere
new_words = {}
for word, freq in words.items():
new_word = merge_pair(word, best_pair)
new_words[new_word] = freq
words = new_words
return merges
def merge_pair(word, pair):
"""Merge all occurrences of pair in word."""
new_word = []
i = 0
while i < len(word):
if (i < len(word) - 1
and (word[i], word[i + 1]) == pair):
new_word.append(word[i] + word[i + 1])
i += 2
else:
new_word.append(word[i])
i += 1
return tuple(new_word)
The beauty of BPE is that it naturally discovers meaningful subwords from raw frequency statistics. After training on English text, the merge list will look something like: individual characters -> common two-letter combos ("th", "in", "er") -> common words ("the", "and", "is") -> longer words ("because", "important"). Rare words naturally decompose into common pieces.
Encoding phase (tokenizing new text): apply the learned merges in order. Start with characters, apply merge 1, then merge 2, and so on until no more merges apply:
def encode_bpe(text, merges):
"""Tokenize text using trained BPE merges."""
tokens = list(text)
for merge_pair_item in merges:
i = 0
while i < len(tokens) - 1:
if (tokens[i], tokens[i + 1]) == merge_pair_item:
tokens[i] = tokens[i] + tokens[i + 1]
del tokens[i + 1]
else:
i += 1
return tokens
# Quick demo
corpus = ("the cat sat on the mat the cat ate the rat "
"the dog sat on the log")
merges = train_bpe(corpus, vocab_size=30)
print("Learned merges:")
for i, m in enumerate(merges[:10]):
print(f" {i+1}: {m[0]!r} + {m[1]!r}")
encoded = encode_bpe("the cat", merges)
print(f"\n'the cat' -> {encoded}")
Notice the merge order matters. If merge #3 is ("t", "h") and merge #7 is ("th", "e"), then "the" will first become ["th", "e"] after merge #3, then ["the"] after merge #7. The order of merges IS the tokenizer -- it's all you need to reproduce the exact same tokenization on any new text.
WordPiece: how BERT tokenizes
WordPiece (used by BERT and its derivatives) is similar to BPE but uses a different criterion for deciding which pair to merge. In stead of picking the most frequent pair, WordPiece merges the pair that maximizes the likelihood of the training data.
The practical difference: BPE merges ("t", "h") because the pair appears 50,000 times. WordPiece merges ("t", "h") because the combined token "th" is much more likely than "t" and "h" appearing independently. This subtle distinction means WordPiece tends to create slightly more linguistically meaningful subwords.
# WordPiece uses "##" prefix for continuation tokens
from transformers import BertTokenizer
bert_tok = BertTokenizer.from_pretrained("bert-base-uncased")
tokens = bert_tok.tokenize("unbelievable preprocessing")
print(tokens)
# ['un', '##bel', '##ie', '##va', '##ble',
# 'pre', '##pro', '##ces', '##sing']
# The ## prefix means: "this is a continuation of the
# previous token, NOT a new word"
tokens2 = bert_tok.tokenize("I love transformers")
print(tokens2)
# ['i', 'love', 'transformers']
# Common words stay as single tokens
That ## prefix is WordPiece's way of marking word-internal boundaries. When you see ['un', '##bel', '##ie', '##va', '##ble'], you know it's one word ("unbelievable") split into five subword pieces. The first piece has no prefix (it's a word start), and the rest have ## (they're continuations). This explicit boundary information helps the model understand word structure -- which is why BERT is so good at tasks that require word-level understanding like named entity recognition.
Unigram: the probabilistic approach
Unigram (Kudo, 2018) works in the opposite direction from BPE. In stead of starting small and building up, Unigram starts with a very large vocabulary and prunes it down:
- Start with a huge initial vocabulary (all substrings up to some length)
- For each token, compute how much the overall corpus likelihood drops if you remove it
- Remove the tokens that cause the smallest drop (they're the most redundant)
- Repeat until the vocabulary reaches the target size
During encoding, Unigram finds the most probable segmentation of the input. A single word might have dozens of valid segmentations, and Unigram picks the one that maximizes the product of individual token probabilities:
# Unigram can produce different segmentations
# "preprocessing" might become:
# ["pre", "process", "ing"] <- most probable
# ["prep", "rocess", "ing"] <- less probable
# ["p", "r", "e", "p", ...] <- character fallback
# The advantage: Unigram naturally handles ambiguity
# and can even SAMPLE different segmentations, which
# works as a form of data augmentation during training
The cool thing about Unigram is that by sampling different segmentations of the same word during training, you effectively augment your data for free. The model sees "preprocessing" as ["pre", "process", "ing"] in one batch and ["pre", "proc", "essing"] in another. This makes the model more robust to unusual tokenizations at inference time. SentencePiece uses Unigram as one of its two backends.
SentencePiece: language-agnostic tokenization
BPE and WordPiece both assume whitespace separates words. That works for English and most European languages, but it falls apart for Chinese, Japanese, Thai, or any language where words aren't space-delimited. SentencePiece (Kudo and Richardson, 2018) solves this by treating the input as raw bytes -- including spaces.
import sentencepiece as spm
# Train a SentencePiece model from a text file
spm.SentencePieceTrainer.train(
input="training_corpus.txt",
model_prefix="my_tokenizer",
vocab_size=32000,
model_type="bpe", # or "unigram"
character_coverage=0.9995,
byte_fallback=True # handle ANY Unicode char
)
# Load and use the trained model
sp = spm.SentencePieceProcessor(
model_file="my_tokenizer.model")
text = "This handles any language"
pieces = sp.encode(text, out_type=str)
print(pieces)
# Something like: ['_This', '_handles', '_any', '_language']
# The _ character represents a space
Notice that _ (the Unicode lower one eighth block character) represents a space in SentencePiece output. By treating spaces as explicit characters rather than word delimiters, SentencePiece can tokenize any language uniformly. Llama and most multilingual models use SentencePiece.
Byte fallback is a critical feature: if a character isn't in the vocabulary (rare emoji, unusual script, a Unicode codepoint the tokenizer has never seen), SentencePiece falls back to encoding the raw UTF-8 bytes. This means it can represent absolutely any valid text without ever producing an unknown token. No more <UNK> dead ends.
Tiktoken: OpenAI's fast BPE
OpenAI's tiktoken is a fast BPE implementation used by GPT-3.5, GPT-4 and related models. It's worth knowing because when you're counting tokens for API calls (and paying per token), you need to know exactly how your input will be tokenized:
import tiktoken
enc = tiktoken.encoding_for_model("gpt-4")
text = "Tokenization affects everything downstream"
tokens = enc.encode(text)
decoded = [enc.decode([t]) for t in tokens]
print(f"Token count: {len(tokens)}")
print(f"Tokens: {decoded}")
print(f"Token IDs: {tokens}")
# Interesting comparison: code is expensive in tokens
code = ("def fibonacci(n): return n if n <= 1 "
"else fibonacci(n-1) + fibonacci(n-2)")
print(f"\nCode tokens: {len(enc.encode(code))}")
# Non-English text uses MORE tokens per character
dutch = "Tokenisatie is verrassend belangrijk"
english = "Tokenization is surprisingly important"
print(f"Dutch: {len(enc.encode(dutch))} tokens")
print(f"English: {len(enc.encode(english))} tokens")
That Dutch vs English comparison is a real eye-opener. Same semantic content, but Dutch (and most non-English languages) costs more tokens because the tokenizer was trained primarily on English text. English words appear more frequently in the training data, so they get merged into fewer, larger tokens. Dutch words are rarer, so they stay split into more pieces. This has real consequences: the model's effective context window is smaller for non-English text, and API costs are higher per word.
As someone who reads and writes Dutch daily, I find this both annoying and fascinatiing from a technical perspective. The tokenizer is making an implicit economic judgment: English gets premium treatment because it dominated the training corpus ;-)
Vocabulary size: the fundamental tradeoff
Why do most models pick 32K or 100K tokens? Why not 500K or 5K? Vocabulary size is a fundamental engineering tradeoff:
Larger vocabulary (100K+): common words and phrases become single tokens, so sequences are shorter. The model processes text faster and can fit more content in its context window. But the embedding matrix grows (vocab_size x hidden_dim parameters), rare tokens get very few training examples each, and the output softmax layer becomes more expensive to compute. Remember from episode #58 (GPT architecture): the embedding matrix and the output projection share parameters in most models. Making the vocabulary bigger makes both of those bigger.
Smaller vocabulary (8K-16K): sequences are longer because each word takes more tokens. The model has to do more work to process the same text. But every single token appears frequently in the training data (better learned embeddings), the embedding matrix is compact, and the softmax is cheaper.
The sweet spot for most LLMs today is 32K-128K tokens. GPT-4 uses roughly 100K. Llama 3 uses 128K. Smaller models (like GPT-2) used 50K. These numbers all reflect a balance between sequence efficiency and model overhead.
# Quick calculation: how vocabulary affects model size
hidden = 4096 # typical hidden dimension
for vocab in [8000, 32000, 50257, 100000, 128000]:
embed_params = vocab * hidden
embed_mb = embed_params * 2 / 1e6 # fp16
print(f"vocab={vocab:>7,}: embedding = {embed_params:>12,} "
f"params ({embed_mb:>7.1f} MB)")
# vocab= 8,000: embedding = 32,768,000 params ( 65.5 MB)
# vocab= 32,000: embedding = 131,072,000 params ( 262.1 MB)
# vocab= 50,257: embedding = 205,852,672 params ( 411.7 MB)
# vocab=100,000: embedding = 409,600,000 params ( 819.2 MB)
# vocab=128,000: embedding = 524,288,000 params (1048.6 MB)
Going from 32K to 128K vocabulary adds nearly 800MB just for the embedding matrix alone. That's not nothing, especially for models you want to run on consumer hardware (as we discussed in episode #70). Having said that, the benefit of shorter sequences usually outweighs the parameter cost -- which is why the trend has been toward larger vocabularies in recent models.
Tokenization artifacts: when the tokenizer fights you
This is the section that really matters for practical work. Tokenization creates subtle behaviors that look like model failures but are really tokenizer artifacts:
The strawberry problem: "How many r's in strawberry?" is a classic failure case for language models. Why? Because "strawberry" tokenizes as something like ["str", "aw", "berry"]. The model never sees individual "r" characters directly. To answer correctly, it would need to know that "str" contains one "r", "aw" contains zero, and "berry" contains two -- knowledge that has to be learned from training data rather than being directly observable in the token sequence.
Arithmetic gone wrong: "What is 7842 + 1359?" is harder than it should be because numbers tokenize inconsistently. "7842" might be one token, but "78423" could become ["784", "23"]. The model can't do digit-by-digit addition the way we were taught in school because it doesn't see individual digits as separate tokens. This is why chain-of-thought prompting helps with math -- it forces the model to work through the problem step by step in stead of trying to produce the answer in one shot.
The non-English penalty: tokenizers trained primarily on English text require more tokens to represent the same content in other languages. A sentence in Hindi might take 3x more tokens than its English translation. That means the model's effective context window is 3x smaller for Hindi, and inference is 3x more expensive per word.
Whitespace sensitivity: "Hello world" and "Hello world" (two spaces) produce different token sequences. Leading spaces, trailing spaces, tabs, newlines -- they all tokenize differently and can affect model behavior in unexpected ways. This is why prompt engineering sometimes feels like black magic: adding or removing a single space can change the tokenization enough to alter the model's output.
import tiktoken
enc = tiktoken.encoding_for_model("gpt-4")
# Same meaning, very different token counts
texts = [
"the cat sat on the mat",
"THE CAT SAT ON THE MAT", # uppercase = more tokens
"thecatsatonthemat", # no spaces = different split
"the cat sat on the mat", # double spaces = extra tokens
]
for t in texts:
toks = enc.encode(t)
decoded = [enc.decode([x]) for x in toks]
print(f"{len(toks):2d} tokens: {t!r}")
print(f" split: {decoded}")
print()
Code formatting: tabs vs spaces tokenize completely differently in Python code. A model might handle 4-space indentation well but struggle with tab indentation (or vice versa) purely because the token sequences are different for the same logical structure. If you're using a language model for code generation and getting weird indentation, check how your whitespace is tokenizing.
Understanding these artifacts is essential for debugging model behavior. When a model fails at a task, the first question to ask yourself is: "What does this look like after tokenization?" More often than not, the answer explains the failure.
Building your own tokenizer from scratch
Let's put everything together and build a complete BPE tokenizer that trains on any corpus and can encode/decode arbitrary text. This is a simplified version of what tiktoken and SentencePiece do under the hood:
class SimpleBPE:
"""A complete BPE tokenizer you can train and use."""
def __init__(self):
self.merges = []
self.vocab = {}
def train(self, text, vocab_size=256 + 50):
"""Train BPE on a text corpus."""
# Start with byte-level vocab (covers all UTF-8)
self.vocab = {i: bytes([i]) for i in range(256)}
next_id = 256
# Convert text to byte sequence
data = list(text.encode("utf-8"))
while next_id < vocab_size:
# Count adjacent pairs
pairs = {}
for i in range(len(data) - 1):
pair = (data[i], data[i + 1])
pairs[pair] = pairs.get(pair, 0) + 1
if not pairs:
break
# Find most common pair
best = max(pairs, key=pairs.get)
if pairs[best] < 2:
break
# Create new token for this pair
self.merges.append(best)
merged = best[0] * 256 + best[1] # unique ID
self.vocab[next_id] = (
self.vocab.get(best[0], bytes([best[0]]))
+ self.vocab.get(best[1], bytes([best[1]])))
# Replace all occurrences in data
new_data = []
i = 0
while i < len(data):
if (i < len(data) - 1
and (data[i], data[i + 1]) == best):
new_data.append(next_id)
i += 2
else:
new_data.append(data[i])
i += 1
data = new_data
next_id += 1
print(f"Trained: {len(self.merges)} merges, "
f"{len(self.vocab)} vocab entries")
def encode(self, text):
"""Encode text to token IDs."""
data = list(text.encode("utf-8"))
for pair in self.merges:
new_data = []
i = 0
while i < len(data):
if (i < len(data) - 1
and (data[i], data[i + 1]) == pair):
# Find the merge target ID
idx = self.merges.index(pair) + 256
new_data.append(idx)
i += 2
else:
new_data.append(data[i])
i += 1
data = new_data
return data
def decode(self, ids):
"""Decode token IDs back to text."""
raw = b""
for tid in ids:
if tid in self.vocab:
raw += self.vocab[tid]
else:
raw += bytes([tid])
return raw.decode("utf-8", errors="replace")
# Train on some text
corpus = ("the quick brown fox jumps over the lazy dog "
* 100)
corpus += "the tokenizer splits text into pieces " * 50
bpe = SimpleBPE()
bpe.train(corpus, vocab_size=300)
# Encode and decode
test = "the quick tokenizer"
ids = bpe.encode(test)
back = bpe.decode(ids)
print(f"Original: {test!r}")
print(f"Token IDs: {ids}")
print(f"Decoded: {back!r}")
print(f"Match: {test == back}")
This tokenizer starts from raw bytes (covering all of UTF-8) and builds up from there. That's the same approach used by GPT-4's tokenizer -- start from bytes, not from characters. This means you never need an <UNK> token because any possible byte sequence can be represented by the base vocabulary alone. The merges just make common patterns more efficient.
Visualizing how text gets tokenized
One last practical tool. When you're debugging model behavior or optimizing prompts, being able to see exactly how text breaks down into tokens is invaluable:
import tiktoken
def visualize_tokens(text, model="gpt-4"):
"""Show exactly how text splits into tokens."""
enc = tiktoken.encoding_for_model(model)
ids = enc.encode(text)
pieces = [enc.decode([i]) for i in ids]
print(f"Text ({len(text)} chars, {len(ids)} tokens):")
print(f" {text!r}")
print(f"\nTokens:")
for i, (tid, piece) in enumerate(zip(ids, pieces)):
# Show whitespace explicitly
display = piece.replace(" ", "_").replace("\n", "\\n")
print(f" [{i:3d}] id={tid:>6d} {display!r}")
# Token efficiency
ratio = len(text) / len(ids)
print(f"\nEfficiency: {ratio:.1f} chars/token")
return ids
# Compare different types of content
print("=" * 50)
print("English prose:")
visualize_tokens("The quick brown fox jumps over the lazy dog.")
print("\n" + "=" * 50)
print("Python code:")
visualize_tokens("def fib(n):\n if n <= 1:\n return n")
print("\n" + "=" * 50)
print("Mixed language:")
visualize_tokens("Hello wereld, hoe gaat het vandaag?")
You'll notice that common English words like "the", "quick", "brown" are typically single tokens, while code keywords, punctuation, and non-English words often split into multiple pieces. The chars-per-token ratio gives you a quick estimate of how efficiently the tokenizer handles different types of content -- higher is better (fewer tokens for the same text means cheaper API calls and more room in the context window).
Before you close this tab...
- Subword tokenization (BPE, WordPiece, Unigram) won over character-level and word-level because it balances vocabulary size, sequence length, and coverage. Every modern LLM uses some variant;
- BPE iteratively merges the most frequent adjacent pairs, naturally discovering meaningful subwords from character-level building blocks. It's the most widely used approach (GPT-2/3/4, Llama);
- WordPiece merges based on likelihood gain in stead of raw frequency, producing slightly more linguistically meaningful tokens. Used by BERT and its derivatives;
- Unigram starts large and prunes down, selecting the vocabulary that maximizes corpus likelihood. Can sample different segmentations for data augmentation;
- SentencePiece treats text as raw bytes (including spaces), making tokenization truly language-agnostic -- critical for multilingual models like Llama;
- vocabulary size is a real tradeoff: larger vocabularies mean shorter sequences but bigger embedding matrices and sparser training signal for rare tokens. The sweet spot is 32K-128K for most models;
- tokenization artifacts explain many apparent model failures: character counting, arithmetic errors, non-English penalties, and whitespace sensitivity all trace back to how text gets split into tokens. When debugging model behavior, always check the tokenization first.
Exercises
Exercise 1: Build a tokenizer efficiency benchmark. Create a function benchmark_tokenizer(tokenizer_name, texts) that takes a tokenizer (by name, e.g. "gpt2", "gpt-4") and a dictionary of text samples (keys are category names like "english_prose", "python_code", "dutch_text", "json_data", "math_notation"). For each category, compute: (a) average tokens per word (split original on whitespace), (b) average characters per token, (c) the 5 most "expensive" words (words that use the most tokens), and (d) compression ratio (bytes in original / bytes if each token ID were stored as a 2-byte integer). Print a summary table and identify which content type is most token-efficient and which is least. Test with at least 4 categories of content, each with at least 200 words.
Exercise 2: Build a BPE merge visualizer. Create a class BPEVisualizer that trains BPE on a small corpus (provide at least 500 words) and stores the complete merge history. Implement show_merge_history(n) that prints the first n merges with: the pair being merged, the resulting token, how many times that pair appeared, and an example word from the corpus that contains this merge. Implement trace_word(word) that shows step-by-step how a specific word gets tokenized: starting from individual characters, showing which merge applies at each step, until the final tokenization. Implement vocab_growth_curve() that prints the vocabulary size after every 10 merges. Test by training on a paragraph of English text and tracing the words "understanding", "the", and "tokenization".
Exercise 3: Build a cross-lingual tokenization analyzer. Create parallel sentences in 4 languages: English, Dutch, German, and Spanish (at least 5 sentence pairs with identical meaning). Using tiktoken's GPT-4 encoding, for each sentence compute: (a) token count, (b) characters per token ratio, and (c) the individual token breakdown. Print a table showing all 4 languages side by side for each sentence pair. Calculate the "tokenization tax" -- how many extra tokens each non-English language requires compared to English (as a percentage). Finally, find the sentence where the gap between English and the most expensive language is largest, and explain why by examining the token-level breakdown. Which specific words or subwords cause the biggest divergence?