Learn AI Series (#56) - Mini Project - Building a Transformer From Scratch

What will I learn

You will combine every transformer component we've covered into a single working model;
build a small but functional decoder-only transformer language model in PyTorch;
train it on a text corpus and watch it learn to generate coherent text;
generate text with temperature and top-k sampling;
analyze attention patterns to see what the model actually learned;
understand what this teaches us about large language models.

Requirements

A working modern computer running macOS, Windows or Ubuntu;
An installed Python 3(.11+) distribution;
The ambition to learn AI and machine learning.

Difficulty

Beginner

Curriculum (of the `Learn AI Series`):

Learn AI Series (#56) - Mini Project - Building a Transformer From Scratch

Solutions to Episode #55 Exercises

Exercise 1: Complete GAN training pipeline on MNIST with visualization.

import torch
import torch.nn as nn
import torch.optim as optim
from torchvision import datasets, transforms
from torch.utils.data import DataLoader

class Generator(nn.Module):
    def __init__(self, z_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(z_dim, 256), nn.ReLU(),
            nn.Linear(256, 512), nn.ReLU(),
            nn.Linear(512, 784), nn.Tanh()
        )

    def forward(self, z):
        return self.net(z)

class Discriminator(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(784, 512), nn.LeakyReLU(0.2),
            nn.Linear(512, 256), nn.LeakyReLU(0.2),
            nn.Linear(256, 1), nn.Sigmoid()
        )

    def forward(self, x):
        return self.net(x)

transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5,), (0.5,))
])
mnist = datasets.MNIST('.', train=True, download=True, transform=transform)
loader = DataLoader(mnist, batch_size=64, shuffle=True)

z_dim = 64
G = Generator(z_dim)
D = Discriminator()
opt_G = optim.Adam(G.parameters(), lr=2e-4, betas=(0.5, 0.999))
opt_D = optim.Adam(D.parameters(), lr=2e-4, betas=(0.5, 0.999))
criterion = nn.BCELoss()

# Fixed noise for tracking progress
fixed_z = torch.randn(16, z_dim)
all_samples = []

for epoch in range(50):
    d_total, g_total, d_real_total, d_fake_total, n = 0, 0, 0, 0, 0
    for real_imgs, _ in loader:
        batch = real_imgs.size(0)
        real_flat = real_imgs.view(batch, -1)
        real_lbl = torch.ones(batch, 1) * 0.9
        fake_lbl = torch.zeros(batch, 1)

        # Train D
        z = torch.randn(batch, z_dim)
        fake = G(z).detach()
        d_real_score = D(real_flat)
        d_fake_score = D(fake)
        d_loss = criterion(d_real_score, real_lbl) + criterion(d_fake_score, fake_lbl)
        opt_D.zero_grad()
        d_loss.backward()
        opt_D.step()

        # Train G
        z = torch.randn(batch, z_dim)
        fake = G(z)
        g_loss = criterion(D(fake), torch.ones(batch, 1))
        opt_G.zero_grad()
        g_loss.backward()
        opt_G.step()

        d_total += d_loss.item()
        g_total += g_loss.item()
        d_real_total += d_real_score.mean().item()
        d_fake_total += d_fake_score.mean().item()
        n += 1

    if epoch % 10 == 0 or epoch == 49:
        with torch.no_grad():
            samples = G(fixed_z).view(16, 1, 28, 28)
            all_samples.append(samples)
        print(f"Epoch {epoch:>2d}: D_loss={d_total/n:.3f}, G_loss={g_total/n:.3f}, "
              f"D(real)={d_real_total/n:.3f}, D(fake)={d_fake_total/n:.3f}")

# Save final samples as image grid
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt

fig, axes = plt.subplots(4, 4, figsize=(8, 8))
for i, ax in enumerate(axes.flat):
    ax.imshow(all_samples[-1][i, 0].numpy(), cmap='gray')
    ax.axis('off')
plt.suptitle("Generated MNIST after 50 epochs")
plt.tight_layout()
plt.savefig('/tmp/gan_mnist_samples.png', dpi=100)
print("Samples saved to /tmp/gan_mnist_samples.png")

By epoch 50 you should see recognizable digits -- blurry compared to real MNIST, but structurally correct. The D(real) score should hover around 0.7-0.9 and D(fake) around 0.3-0.6. If D(fake) drops to near zero and stays there, the discriminator is winning too easily and the generator isn't learning.

Exercise 2: Mode collapse detection via pairwise similarity.

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torchvision import datasets, transforms
from torch.utils.data import DataLoader

transform = transforms.Compose([
    transforms.ToTensor(), transforms.Normalize((0.5,), (0.5,))])
mnist = datasets.MNIST('.', train=True, download=True, transform=transform)
loader = DataLoader(mnist, batch_size=64, shuffle=True)

z_dim = 64
# Generator and Discriminator are the classes defined in Exercise 1

def train_gan(g_steps_per_d=1, epochs=50, label="standard"):
    G = Generator(z_dim)
    D = Discriminator()
    opt_G = optim.Adam(G.parameters(), lr=2e-4, betas=(0.5, 0.999))
    opt_D = optim.Adam(D.parameters(), lr=2e-4, betas=(0.5, 0.999))
    criterion = nn.BCELoss()

    diversity_scores = []
    pixel_stds = []

    for epoch in range(epochs):
        for real_imgs, _ in loader:
            batch = real_imgs.size(0)
            real_flat = real_imgs.view(batch, -1)

            # Train D (1 step)
            z = torch.randn(batch, z_dim)
            fake = G(z).detach()
            d_loss = (criterion(D(real_flat), torch.ones(batch, 1) * 0.9) +
                      criterion(D(fake), torch.zeros(batch, 1)))
            opt_D.zero_grad()
            d_loss.backward()
            opt_D.step()

            # Train G (g_steps_per_d steps)
            for _ in range(g_steps_per_d):
                z = torch.randn(batch, z_dim)
                fake = G(z)
                g_loss = criterion(D(fake), torch.ones(batch, 1))
                opt_G.zero_grad()
                g_loss.backward()
                opt_G.step()

        # Diversity measurement
        with torch.no_grad():
            z = torch.randn(100, z_dim)
            samples = G(z)  # (100, 784)
            normed = F.normalize(samples, dim=1)
            sim_matrix = normed @ normed.T
            # Exclude diagonal (self-similarity = 1.0)
            mask = ~torch.eye(100, dtype=torch.bool)
            avg_sim = sim_matrix[mask].mean().item()
            diversity_scores.append(avg_sim)
            pixel_stds.append(samples.std(dim=0).mean().item())

        if epoch % 10 == 0 or epoch == epochs - 1:
            print(f"[{label}] Epoch {epoch:>2d}: avg_cosine_sim={avg_sim:.4f}, "
                  f"pixel_std={pixel_stds[-1]:.4f}")

    return diversity_scores, pixel_stds

print("=== Standard training (1 G step per D step) ===")
div_std, pstd_std = train_gan(g_steps_per_d=1, label="standard")

print("\n=== Collapse-prone training (5 G steps per D step) ===")
div_col, pstd_col = train_gan(g_steps_per_d=5, label="collapse")

print(f"\nFinal similarity -- standard: {div_std[-1]:.4f}, collapse: {div_col[-1]:.4f}")
print(f"Final pixel std  -- standard: {pstd_std[-1]:.4f}, collapse: {pstd_col[-1]:.4f}")
print("Higher similarity + lower pixel std = less diversity = more mode collapse")

The collapse-prone configuration (5 G steps per 1 D step) should show higher average cosine similarity and lower pixel standard deviation -- the generator converges to producing fewer distinct outputs. The standard configuration maintains more diversity because the discriminator keeps up with the generator and provides diverse gradient signals.

Exercise 3: DCGAN vs MLP GAN comparison on MNIST.

import torch
import torch.nn as nn
import torch.optim as optim
from torchvision import datasets, transforms
from torch.utils.data import DataLoader

class DCGANGenerator(nn.Module):
    def __init__(self, z_dim=100):
        super().__init__()
        self.net = nn.Sequential(
            nn.ConvTranspose2d(z_dim, 256, 7, 1, 0, bias=False),
            nn.BatchNorm2d(256), nn.ReLU(),
            nn.ConvTranspose2d(256, 128, 4, 2, 1, bias=False),
            nn.BatchNorm2d(128), nn.ReLU(),
            nn.ConvTranspose2d(128, 1, 4, 2, 1, bias=False),
            nn.Tanh()
        )

    def forward(self, z):
        return self.net(z.view(-1, z.size(1), 1, 1))

class DCGANDiscriminator(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 64, 4, 2, 1, bias=False),
            nn.LeakyReLU(0.2),
            nn.Conv2d(64, 128, 4, 2, 1, bias=False),
            nn.BatchNorm2d(128), nn.LeakyReLU(0.2),
            nn.Conv2d(128, 1, 7, 1, 0, bias=False),
            nn.Sigmoid()
        )

    def forward(self, x):
        return self.net(x).view(-1, 1)

import torch.nn.functional as F

transform = transforms.Compose([
    transforms.ToTensor(), transforms.Normalize((0.5,), (0.5,))])
mnist = datasets.MNIST('.', train=True, download=True, transform=transform)
loader = DataLoader(mnist, batch_size=64, shuffle=True)

def train_and_evaluate(G, D, z_dim, name, use_conv=False, epochs=20):
    opt_G = optim.Adam(G.parameters(), lr=2e-4, betas=(0.5, 0.999))
    opt_D = optim.Adam(D.parameters(), lr=2e-4, betas=(0.5, 0.999))
    criterion = nn.BCELoss()

    for epoch in range(epochs):
        for real_imgs, _ in loader:
            batch = real_imgs.size(0)
            if use_conv:
                real_input = real_imgs
            else:
                real_input = real_imgs.view(batch, -1)
            real_lbl = torch.ones(batch, 1) * 0.9
            fake_lbl = torch.zeros(batch, 1)

            z = torch.randn(batch, z_dim)
            fake = G(z).detach()
            # G already emits the shape D expects (flat for the MLP, image for the DCGAN)
            fake_input = fake
            d_loss = (criterion(D(real_input), real_lbl) +
                      criterion(D(fake_input), fake_lbl))
            opt_D.zero_grad()
            d_loss.backward()
            opt_D.step()

            z = torch.randn(batch, z_dim)
            fake = G(z)
            g_loss = criterion(D(fake), torch.ones(batch, 1))
            opt_G.zero_grad()
            g_loss.backward()
            opt_G.step()

    # Evaluate
    with torch.no_grad():
        z = torch.randn(100, z_dim)
        samples = G(z)
        if use_conv:
            flat = samples.view(100, -1)
        else:
            flat = samples
        # Rescale from [-1,1] to [0,1] for stats
        raw = (flat + 1) / 2
        d_scores = D(samples).mean().item()
        normed = F.normalize(flat, dim=1)
        sim = normed @ normed.T
        mask = ~torch.eye(100, dtype=torch.bool)
        avg_sim = sim[mask].mean().item()
        pixel_mean = raw.mean().item()
        pixel_std = raw.std().item()

    print(f"\n{name} results after {epochs} epochs:")
    print(f"  D score on generated: {d_scores:.4f}")
    print(f"  Pixel mean: {pixel_mean:.4f} (MNIST ~0.13)")
    print(f"  Pixel std:  {pixel_std:.4f} (MNIST ~0.31)")
    print(f"  Diversity (avg cosine sim): {avg_sim:.4f}")

mlp_G = Generator(z_dim=64)
mlp_D = Discriminator()
print("Training MLP GAN...")
train_and_evaluate(mlp_G, mlp_D, 64, "MLP GAN", use_conv=False)

dc_G = DCGANGenerator(z_dim=100)
dc_D = DCGANDiscriminator()
print("\nTraining DCGAN...")
train_and_evaluate(dc_G, dc_D, 100, "DCGAN", use_conv=True)

The DCGAN should produce sharper images with better pixel statistics (closer to real MNIST mean/std) because transposed convolutions are better suited for generating spatial data than fully connected layers. The MLP GAN's output tends to be blurrier and the pixel distribution is often more concentrated around zero. The diversity metric should be comparable between the two if both train stably.

On to today's episode

Here we go! This is where it all comes together. For the last 19 episodes -- from episode #37 where we built a single perceptron that couldn't even learn XOR, through backpropagation, PyTorch, CNNs, RNNs, LSTMs, attention, all the way to transformers and GANs -- we've been building tools and understanding. Today we use all of it to build something that actually does something: a decoder-only transformer language model that generates text from a prompt.

This is a miniature GPT. Seriously. Same architecture, same training objective (predict the next token), same generation procedure (autoregressive sampling with temperature and top-k). The only difference is scale -- our model has a little under a million parameters where production models have hundreds of billions. But the engineering principles are identical, and that's the point.

I remember the first time I got a character-level language model to produce actual English words from noise, and it was one of those "wait, that actually works?" moments. You'll see what I mean ;-)

The complete model

We're building a decoder-only transformer -- the GPT architecture. No encoder, no cross-attention. Just masked self-attention, feed-forward layers, and next-token prediction. If you followed episodes #52 and #53, you already know every piece. Today we assemble them into a single coherent system and train it end-to-end.

import torch
import torch.nn as nn
import torch.nn.functional as F
import math

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, n_heads):
        super().__init__()
        self.d_k = d_model // n_heads
        self.n_heads = n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x, mask=None):
        B, T, C = x.shape
        qkv = self.qkv(x).reshape(B, T, 3, self.n_heads, self.d_k)
        qkv = qkv.permute(2, 0, 3, 1, 4)  # (3, B, heads, T, d_k)
        Q, K, V = qkv[0], qkv[1], qkv[2]
        scores = (Q @ K.transpose(-2, -1)) / math.sqrt(self.d_k)
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float('-inf'))
        attn = F.softmax(scores, dim=-1)
        out = (attn @ V).transpose(1, 2).contiguous().view(B, T, C)
        return self.out(out), attn

class TransformerBlock(nn.Module):
    def __init__(self, d_model, n_heads, d_ff, dropout=0.1):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = MultiHeadAttention(d_model, n_heads)
        self.ln2 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(),
            nn.Linear(d_ff, d_model), nn.Dropout(dropout)
        )
        self.drop = nn.Dropout(dropout)

    def forward(self, x, mask=None):
        attn_out, attn_weights = self.attn(self.ln1(x), mask)
        x = x + self.drop(attn_out)
        x = x + self.ff(self.ln2(x))
        return x, attn_weights

class MiniGPT(nn.Module):
    def __init__(self, vocab_size, d_model=128, n_heads=4, n_layers=4,
                 d_ff=512, max_len=256, dropout=0.1):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)
        self.layers = nn.ModuleList([
            TransformerBlock(d_model, n_heads, d_ff, dropout)
            for _ in range(n_layers)
        ])
        self.ln_f = nn.LayerNorm(d_model)
        self.head = nn.Linear(d_model, vocab_size, bias=False)
        # Weight tying: share embedding and output projection weights
        self.head.weight = self.tok_emb.weight
        self.max_len = max_len
        self._init_weights()

    def _init_weights(self):
        for p in self.parameters():
            if p.dim() > 1:
                nn.init.xavier_uniform_(p)

    def forward(self, idx):
        B, T = idx.shape
        assert T <= self.max_len, f"Sequence length {T} exceeds max {self.max_len}"
        tok = self.tok_emb(idx)
        pos = self.pos_emb(torch.arange(T, device=idx.device))
        x = tok + pos
        mask = torch.tril(torch.ones(T, T, device=idx.device)).unsqueeze(0).unsqueeze(0)
        all_attn = []
        for layer in self.layers:
            x, attn_w = layer(x, mask)
            all_attn.append(attn_w)
        x = self.ln_f(x)
        logits = self.head(x)
        return logits, all_attn

model = MiniGPT(vocab_size=256, d_model=128, n_heads=4, n_layers=4, d_ff=512)
n_params = sum(p.numel() for p in model.parameters())
print(f"MiniGPT parameters: {n_params:,}")

Let me walk through the design decisions because they're all deliberate.

QKV fusion: computing Q, K, V in a single linear projection (3 * d_model) is more efficient than three separate projections: one larger matrix multiplication instead of three smaller ones. Many production transformer implementations do this.

Pre-norm: LayerNorm before attention and feed-forward, not after. We discussed why in episode #53 -- pre-norm creates a more direct gradient path through the residual connections, which makes training more stable for deep models. GPT-2 switched to pre-norm and most large language models since have followed.

GELU: smoother than ReLU and the standard activation in GPT-style transformers. It's roughly x * sigmoid(1.702 * x). Unlike ReLU it has no hard zero cutoff that kills gradients for negative inputs, but it still provides nonlinearity. We covered activation functions in episode #40.

Weight tying: the output projection matrix shares weights with the token embedding. This means "the embedding for token X" and "the prediction score for token X" use the same learned vector. Standard in language models (GPT-2, GPT-3 both do this). It acts as a regularizer and saves parameters -- our model would have vocab_size * d_model extra parameters without it.

Learned positional embeddings: instead of the sinusoidal encoding from the original transformer paper (episode #52), we use learned embeddings. Same approach as ViT from episode #54. The model figures out what position information it needs during training.
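To make the weight-tying savings concrete, here is a back-of-the-envelope parameter count in plain Python. It is only a sketch: `count_minigpt_params` is a hypothetical helper (not part of the model code), and it assumes the layer shapes exactly as written in the classes above, with a bias on every Linear except the output head. Note that n_heads does not appear, because splitting d_model across heads does not change the number of weights.

```python
def count_minigpt_params(vocab_size, d_model, n_layers, d_ff, max_len, tied=True):
    """Hand-count trainable parameters for the MiniGPT layout above (hypothetical helper)."""
    # Attention: fused QKV projection plus output projection, both with bias
    attn = (d_model * 3 * d_model + 3 * d_model) + (d_model * d_model + d_model)
    # Feed-forward: expand to d_ff, contract back to d_model, both with bias
    ff = (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)
    # Two LayerNorms per block, each with a weight vector and a bias vector
    norms = 2 * (2 * d_model)
    block = attn + ff + norms

    # Token embedding table plus learned positional embedding table
    embeddings = vocab_size * d_model + max_len * d_model
    final_norm = 2 * d_model
    # The output head has no bias; when tied it reuses the token embedding matrix
    head = 0 if tied else vocab_size * d_model
    return n_layers * block + embeddings + final_norm + head

tied = count_minigpt_params(vocab_size=256, d_model=128, n_layers=4,
                            d_ff=512, max_len=256, tied=True)
untied = count_minigpt_params(vocab_size=256, d_model=128, n_layers=4,
                              d_ff=512, max_len=256, tied=False)
print(f"tied:   {tied:,}")
print(f"untied: {untied:,}")
print(f"saved by tying: {untied - tied:,}")
```

The difference between the two counts is exactly vocab_size * d_model. For a character-level vocabulary that is small, but with a 50K-token BPE vocabulary the same matrix would hold millions of weights.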

Preparing the data

We'll use character-level tokenization -- each character is a token. This keeps things simple (no BPE or WordPiece tokenizer needed) while demonstrating the exact same training procedure used in production LLMs. The model learns to predict the next character given all previous characters.

# Download some text, or use any .txt file you have
# For this example we create a focused dataset
text = """The transformer architecture has revolutionized artificial intelligence.
Self-attention allows every token to attend to every other token in the sequence.
Multi-head attention lets the model look at different aspects simultaneously.
Positional encodings give the model a sense of word order without recurrence.
Layer normalization and residual connections enable deep stacking of blocks.
The decoder uses causal masking to prevent looking at future tokens during training.
GPT models use decoder-only transformers for text generation tasks.
BERT models use encoder-only transformers for understanding tasks.
Training involves predicting the next token given all previous context tokens.
The loss function is cross-entropy between predicted and actual next tokens.
Temperature controls the randomness of text generation during inference.
Top-k filtering restricts sampling to the k most probable next tokens.
Weight tying shares parameters between the embedding and output projection.
The feed-forward network expands and contracts the hidden dimension.
Attention scores are scaled by the square root of the key dimension.
Dropout provides regularization during the training of transformer models.
"""
# In practice you would load a much larger corpus:
# text = open('shakespeare.txt').read()  # ~1MB of Shakespeare works well

chars = sorted(set(text))
vocab_size = len(chars)
ch_to_idx = {c: i for i, c in enumerate(chars)}
idx_to_ch = {i: c for i, c in enumerate(chars)}

def encode(s):
    return torch.tensor([ch_to_idx[c] for c in s], dtype=torch.long)

def decode(ids):
    return ''.join(idx_to_ch[i] for i in ids)

data = encode(text)
print(f"Vocabulary: {vocab_size} characters")
print(f"Dataset: {len(data)} tokens")
print(f"Sample: '{decode(data[:60].tolist())}'")

With character-level tokenization, the "vocabulary" is just the set of unique characters in the text. For ASCII English text, that's typically 50-80 characters (letters, digits, punctuation, whitespace). A production model like GPT uses byte pair encoding (BPE) with a vocabulary of 50,000-100,000 subword tokens -- this is vastly more efficient (one token might represent an entire common word instead of individual characters), but the training mechanics are identical. Predict the next token, whatever "token" means in your tokenization scheme.

For training, we create overlapping windows of context:

def get_batch(data, batch_size=32, block_size=64):
    ix = torch.randint(len(data) - block_size - 1, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    return x, y

x, y = get_batch(data, batch_size=4, block_size=32)
print(f"Input:  {x.shape}")   # (4, 32)
print(f"Target: {y.shape}")   # (4, 32) - same window, one character ahead
print(f"Input:  '{decode(x[0].tolist())}'")
print(f"Target: '{decode(y[0].tolist())}'")

Each training example: given characters 0 through 63, predict characters 1 through 64. At position t, the model sees characters 0..t and must predict character t+1. The causal mask (that lower-triangular matrix from episode #53) ensures position t cannot see positions t+1, t+2, etc. -- exactly the autoregressive constraint we need.

This is teacher forcing (episode #50): during training, we feed the real target sequence as input and predict the shifted version. The mask simulates autoregressive generation without actually doing it sequentially. All 64 positions are predicted in parallel.
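If the shifted-window idea feels abstract, this small plain-Python sketch spells out what one training window contains. `next_char_pairs` is a hypothetical helper for illustration only; it lists, for each position, the context the causal mask allows and the character the model is asked to predict.

```python
def next_char_pairs(text, block_size):
    """Return (context, target) pairs for one training window of length block_size."""
    x = text[:block_size]          # input characters 0 .. block_size-1
    y = text[1:block_size + 1]     # targets: the same window moved one step ahead
    pairs = []
    for t in range(block_size):
        context = x[:t + 1]        # what the causal mask lets position t see
        pairs.append((context, y[t]))
    return pairs

for context, target in next_char_pairs("hello world", block_size=5):
    print(f"{context!r:>8} -> {target!r}")
```

A single window of length 5 therefore yields 5 training examples at once, which is why one forward pass over a (32, 64) batch trains on 2048 predictions.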

The training loop

device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = MiniGPT(vocab_size=vocab_size, d_model=128, n_heads=4,
                n_layers=4, d_ff=512, max_len=128).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
data = data.to(device)

for step in range(2000):
    x, y = get_batch(data, batch_size=32, block_size=64)
    x, y = x.to(device), y.to(device)
    logits, _ = model(x)
    loss = F.cross_entropy(logits.view(-1, vocab_size), y.view(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if step % 200 == 0:
        print(f"Step {step}: loss = {loss.item():.4f}")

print(f"\nFinal loss: {loss.item():.4f}")

We use AdamW instead of Adam. AdamW decouples weight decay from the gradient update, which tends to work better for transformer training. We covered Adam in episode #41; AdamW is the variant that most modern transformers are trained with.
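What "decoupled" means is easiest to see on a single scalar weight. The sketch below is a simplified illustration, not PyTorch's actual implementation: `first_step` is a hypothetical helper that performs the very first Adam-style update (moments starting at zero) either with classic L2 regularization folded into the gradient or with decay applied separately. The data gradient is set to zero so that only weight decay acts.

```python
import math

def first_step(w, grad, lr=0.1, wd=0.01, decoupled=True,
               beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam-style update on a single scalar weight, starting from zero moments."""
    if not decoupled:
        grad = grad + wd * w          # classic L2: decay is folded into the gradient
    m = (1 - beta1) * grad            # first moment after one step
    v = (1 - beta2) * grad * grad     # second moment after one step
    m_hat = m / (1 - beta1)           # bias correction at step 1
    v_hat = v / (1 - beta2)
    step = m_hat / (math.sqrt(v_hat) + eps)
    if decoupled:
        step = step + wd * w          # decoupled: decay bypasses the adaptive scaling
    return w - lr * step

# Zero data gradient, so only weight decay moves the weight
w_l2 = first_step(1.0, 0.0, decoupled=False)
w_adamw = first_step(1.0, 0.0, decoupled=True)
print("Adam + L2:", w_l2)
print("AdamW:    ", w_adamw)
```

With L2 inside the gradient, the adaptive normalization rescales the tiny decay term into a full-size step of about lr, no matter how small wd is. With decoupled decay the weight simply shrinks by lr * wd * w, which is the behaviour you actually asked for.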

The .view(-1, vocab_size) flattens both the batch and sequence dimensions so PyTorch's cross_entropy can process all positions at once. If the input is (32, 64, vocab_size), it becomes (2048, vocab_size) -- 2048 individual next-token predictions, each compared against the actual next token via cross-entropy.

With a small training corpus like ours, you should see the loss drop from roughly 4.0 to well under 1.0 within a few hundred steps. The starting point is the uniform-guess baseline -ln(1/vocab_size), where ln is the natural log; for a 50-character vocabulary that is ln(50) ~ 3.9, and the very first steps may sit a bit above it depending on initialization. The model learns character-level patterns fast: common words, sentence structures, punctuation conventions. On a larger corpus like Shakespeare (~1MB), convergence takes longer but the generated text becomes genuinely impressive for a model this small.
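The baseline number is easy to check yourself. This sketch (plain Python, with `uniform_cross_entropy` as a hypothetical helper name) computes the cross-entropy of a model that spreads its probability evenly over the vocabulary, for a few illustrative vocabulary sizes. It is only a reference point; an untrained network will not start at exactly this value.

```python
import math

def uniform_cross_entropy(vocab_size):
    """Cross-entropy (in nats) of a model that assigns equal probability to every token."""
    return -math.log(1.0 / vocab_size)

for v in (30, 50, 80, 50_000):
    print(f"vocab_size={v:>6}: baseline loss = {uniform_cross_entropy(v):.3f}")
```

If your training loss stays near this baseline for long, the model is effectively still guessing, and the learning rate, data pipeline, or masking deserves a second look.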

Generating text

At inference time, we start with a prompt and generate one character at a time, feeding each generated character back as input. This is the autoregressive loop -- the same procedure that every LLM uses:

@torch.no_grad()
def generate(model, prompt, max_new=200, temperature=0.8, top_k=40):
    model.eval()
    idx = encode(prompt).unsqueeze(0).to(device)
    for _ in range(max_new):
        idx_cond = idx[:, -model.max_len:]  # crop to max context
        logits, _ = model(idx_cond)
        logits = logits[:, -1, :] / temperature  # last position, scale

        # Top-k filtering: mask out every logit below the k-th highest
        if top_k > 0:
            v, _ = torch.topk(logits, min(top_k, logits.size(-1)))
            logits[logits < v[:, -1:]] = float('-inf')

        probs = F.softmax(logits, dim=-1)
        next_id = torch.multinomial(probs, num_samples=1)
        idx = torch.cat([idx, next_id], dim=1)
    return decode(idx[0].tolist())

print("=== temperature=0.5 (conservative) ===")
print(generate(model, "The transformer", temperature=0.5))
print("\n=== temperature=1.0 (standard) ===")
print(generate(model, "The transformer", temperature=1.0))
print("\n=== temperature=1.5 (creative) ===")
print(generate(model, "The transformer", temperature=1.5))

Temperature controls randomness. The logits (raw scores) are divided by temperature before softmax:

temperature = 1.0: sample from the model's learned distribution as-is
temperature < 1.0 (e.g. 0.5): sharpen the distribution -- the model picks higher-probability tokens more often, producing more predictable text
temperature > 1.0 (e.g. 1.5): flatten the distribution -- more random, more "creative", but also more likely to produce gibberish
temperature -> 0: greedy decoding (always picks the single most likely token)
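To see these four cases in numbers, here is a small plain-Python sketch. The logits are made up for illustration and `softmax_with_temperature` is a hypothetical helper; it does the same divide-then-softmax that the generate function performs on the last position.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide logits by temperature, then apply a numerically stable softmax."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5, -1.0]   # made-up scores for four candidate tokens
for t in (0.1, 0.5, 1.0, 1.5):
    probs = softmax_with_temperature(logits, t)
    print(f"T={t}: " + ", ".join(f"{p:.3f}" for p in probs))
```

As the temperature shrinks, almost all of the probability mass moves to the top-scoring token, which is why very low temperatures behave like greedy decoding. Temperature exactly 0 would divide by zero, so implementations typically special-case it as an argmax.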

Top-k filtering restricts sampling to the k most likely tokens at each step. This prevents the model from occasionally sampling very unlikely characters that would derail the text. With top_k=40, any token not in the top 40 is set to probability zero before sampling. Keep in mind that our character vocabulary only has a few dozen symbols, so top_k=40 filters out very little here; try a smaller value such as top_k=10 to see the effect. This is a simple but effective way to keep generation coherent without making it boring.

Having said that, the interaction between temperature and top-k is important. Low temperature + low top-k gives you very deterministic output (practically greedy). High temperature + high top-k gives you the most variety. In practice, temperature=0.7-0.9 with top_k=40-50 is a reasonable default for most text generation tasks. Production systems also use top-p (nucleus sampling, not implemented here but conceptually similar -- sample from the smallest set of tokens whose cumulative probability exceeds p).
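Since top-p is not implemented in the generate function, here is a minimal sketch of the idea in plain Python. `top_p_filter` is a hypothetical helper working on a list of probabilities (a made-up distribution below), and it treats "reaches p" as the cutoff; real implementations differ in such boundary details and operate on tensors.

```python
def top_p_filter(probs, p=0.9):
    """Keep the smallest set of tokens whose cumulative probability reaches p, then renormalize."""
    # Token indices sorted from most to least probable
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    total = sum(probs[i] for i in kept)
    filtered = [0.0] * len(probs)
    for i in kept:
        filtered[i] = probs[i] / total   # renormalize the surviving tokens
    return filtered

probs = [0.5, 0.3, 0.1, 0.06, 0.04]   # made-up next-token distribution
print(top_p_filter(probs, p=0.75))
print(top_p_filter(probs, p=0.95))
```

Unlike top-k, the number of surviving tokens adapts to the distribution: a confident prediction keeps only one or two candidates, while a flat distribution keeps many.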

Comparing different configurations

One thing I want to show you that really drives the architecture concepts home: how changing the model's hyperparameters affects what it learns. Let's train a few variants and compare:

configs = [
    {"name": "tiny",   "d_model": 32,  "n_heads": 2, "n_layers": 2, "d_ff": 128},
    {"name": "small",  "d_model": 64,  "n_heads": 4, "n_layers": 3, "d_ff": 256},
    {"name": "medium", "d_model": 128, "n_heads": 4, "n_layers": 4, "d_ff": 512},
]

for cfg in configs:
    m = MiniGPT(vocab_size=vocab_size, d_model=cfg["d_model"],
                n_heads=cfg["n_heads"], n_layers=cfg["n_layers"],
                d_ff=cfg["d_ff"], max_len=128).to(device)
    opt = torch.optim.AdamW(m.parameters(), lr=3e-4)
    n_p = sum(p.numel() for p in m.parameters())

    for step in range(1000):
        x, y = get_batch(data, batch_size=32, block_size=64)
        x, y = x.to(device), y.to(device)
        logits, _ = m(x)
        loss = F.cross_entropy(logits.view(-1, vocab_size), y.view(-1))
        opt.zero_grad()
        loss.backward()
        opt.step()

    m.eval()
    with torch.no_grad():
        sample = generate(m, "The ", max_new=100, temperature=0.7)

    print(f"{cfg['name']:>8s} ({n_p:>8,} params): loss={loss.item():.3f}")
    print(f"         Sample: {sample[:80]}...")
    print()

You'll see a clear pattern: the bigger model reaches lower loss and produces more coherent text. On our small corpus, even the tiny model memorizes quite well, but the difference becomes dramatic on larger datasets. This is the scaling phenomenon -- the same observation that drives the entire LLM industry. More parameters + more data = better predictions, with remarkably predictable relationships between model size, dataset size, and loss. We'll explore these scaling laws in detail when we get to LLMs proper ;-)

Analyzing attention patterns

One of the nicest things about building from scratch is that we have access to the raw attention weights. We can look inside the model and see what it actually learned:

model.eval()
test_text = "The model learns"
idx = encode(test_text).unsqueeze(0).to(device)
_, all_attn = model(idx)

import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt

fig, axes = plt.subplots(2, 2, figsize=(14, 14))
chars_list = list(test_text)

for layer_idx, ax in enumerate(axes.flat):
    attn = all_attn[layer_idx][0, 0].cpu().detach()  # head 0 of each layer
    ax.imshow(attn.numpy(), cmap='Blues')
    ax.set_xticks(range(len(chars_list)))
    ax.set_xticklabels(chars_list, fontsize=8)
    ax.set_yticks(range(len(chars_list)))
    ax.set_yticklabels(chars_list, fontsize=8)
    ax.set_title(f"Layer {layer_idx}, Head 0")
plt.tight_layout()
plt.savefig('/tmp/attn_patterns.png', dpi=100)
print("Attention patterns saved to /tmp/attn_patterns.png")

Different heads learn different patterns. Even in a model this small, you'll often see some of the following head types emerge (which ones appear, and in which layer, varies between runs, especially on a corpus this small):

Previous-token head: strong diagonal pattern (each position attends heavily to the position just before it). This head captures bigram statistics -- what character typically follows another character.
Space/delimiter head: strong attention on space characters. The model learns that word boundaries carry important information about what comes next.
Copy head: attention on earlier occurrences of the same character. If 'e' appeared earlier in the sequence, the model might attend to that position when deciding whether to output 'e' again.
Positional head: attention concentrated at the beginning of the sequence. Some heads learn to always check the start of the input for context.
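Eyeballing heatmaps is subjective, so it can help to put a number on two of these patterns. The sketch below is one simple way to do that, not an established metric: `previous_token_score` and `first_token_score` are hypothetical helpers that take a single attention matrix as a nested list (rows are query positions, columns are key positions). The toy matrix is made up for illustration.

```python
def previous_token_score(attn):
    """Average attention weight that each position t >= 1 places on position t - 1."""
    T = len(attn)
    if T < 2:
        return 0.0
    return sum(attn[t][t - 1] for t in range(1, T)) / (T - 1)

def first_token_score(attn):
    """Average attention weight that each position t >= 1 places on position 0."""
    T = len(attn)
    if T < 2:
        return 0.0
    return sum(attn[t][0] for t in range(1, T)) / (T - 1)

# Toy causal attention matrix (rows sum to 1) that mostly looks one step back
toy = [
    [1.0, 0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0, 0.0],
    [0.1, 0.9, 0.0, 0.0],
    [0.1, 0.1, 0.8, 0.0],
]
print(f"previous-token score: {previous_token_score(toy):.3f}")
print(f"first-token score:    {first_token_score(toy):.3f}")
```

To try it on the trained model, pass something like `all_attn[layer][0, head].tolist()` for each layer and head, and compare the scores across heads.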

In large language models, these patterns become far more sophisticated -- some heads learn syntactic parsing, others track pronoun reference chains ("she" -> attends to "Alice" from 200 tokens ago), others handle arithmetic by attending to operands. But the fundamental mechanism is the same. Attention is just a learned soft lookup: "given what I'm processing now, which previous positions should I consult?"
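The "soft lookup" picture can be made concrete with a single query in plain Python. This is a sketch with made-up 2-d keys and values, and `attend` is a hypothetical helper; it performs the same scaled dot-product and softmax as the MultiHeadAttention class, just for one query over the positions it is allowed to see.

```python
import math

def attend(query, keys, values):
    """Scaled dot-product attention for one query over the positions it may see."""
    d_k = len(query)
    # Similarity of the query to every key, scaled by sqrt(d_k)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k) for key in keys]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Output is the attention-weighted average of the value vectors
    dim = len(values[0])
    output = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]
    return weights, output

# Three earlier positions with made-up 2-d keys and values
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
weights, output = attend([4.0, 0.0], keys, values)
print("weights:", [round(w, 3) for w in weights])
print("output: ", [round(o, 3) for o in output])
```

The query matches the first and third keys equally well, so those two positions share most of the weight and the output is dominated by their values. A hard dictionary lookup would return exactly one value; attention returns a blend.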

Let's also look at how different heads in the same layer specialize:

fig, axes = plt.subplots(1, 4, figsize=(20, 5))

for head_idx, ax in enumerate(axes.flat):
    attn = all_attn[0][0, head_idx].cpu().detach()  # layer 0, all 4 heads
    ax.imshow(attn.numpy(), cmap='Blues')
    ax.set_xticks(range(len(chars_list)))
    ax.set_xticklabels(chars_list, fontsize=7)
    ax.set_yticks(range(len(chars_list)))
    ax.set_yticklabels(chars_list, fontsize=7)
    ax.set_title(f"Layer 0, Head {head_idx}")
plt.suptitle("Multi-head attention: each head attends differently")
plt.tight_layout()
plt.savefig('/tmp/multihead_patterns.png', dpi=100)
print("Multi-head patterns saved")

This is multi-head attention doing exactly what it was designed to do (episode #52): different heads attend to different relationships. Where a single-head attention would have to compress all patterns into one attention matrix, multi-head attention allocates separate heads for separate aspects. Head 0 might focus on the previous token, head 1 on spaces, head 2 on copying, and head 3 on positional context. Together they give the model a much richer view of the sequence than any single head could provide.

What this teaches about LLMs

Our tiny model and GPT-4 share the same DNA. Let me be very explicit about what's the same and what's different, because I think this is one of the most important takeaways from the entire series so far.

Same architecture: decoder-only transformer with masked self-attention. GPT-4 reportedly has ~120 layers and uses mixture-of-experts, but the forward pass through each expert block is structurally identical to our TransformerBlock.

Same training objective: predict the next token. That's it. No labels for sentiment, no explicit grammar rules, no alignment targets (initially -- that comes later with RLHF, which we'll cover). Just: given this sequence of tokens, what comes next? The finding that this simple objective, scaled up sufficiently, produces models that can reason, code, translate, write poetry, and have conversations is one of the most remarkable discoveries in the history of computer science.

Same generation procedure: autoregressive sampling with temperature and top-k/top-p. Every single word that any LLM produces goes through the same generate loop we just built -- predict probability distribution, sample, feed back, repeat.
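Top-p (nucleus) sampling is mentioned here but was not part of our generate loop, so here is a minimal sketch of the idea in plain Python: keep the smallest set of most likely tokens whose combined probability reaches `p`, renormalize, and sample from that set. The function names and the logits are made up for illustration; a production implementation would operate on tensors.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs, p=0.9):
    """Keep the smallest set of highest-probability tokens whose mass reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    mass = sum(probs[i] for i in kept)
    filtered = [0.0] * len(probs)
    for i in kept:
        filtered[i] = probs[i] / mass  # renormalize so the kept tokens sum to 1
    return filtered

def sample(probs, rng):
    """Draw one token index according to probs."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

logits = [2.0, 1.0, 0.5, -1.0, -3.0]  # made-up logits for a 5-token vocabulary
probs = softmax(logits, temperature=0.8)
nucleus = top_p_filter(probs, p=0.9)
rng = random.Random(0)  # seeded so the run is repeatable
token = sample(nucleus, rng)
print("kept tokens:", [i for i, q in enumerate(nucleus) if q > 0])
print("sampled token index:", token)
```

Unlike top-k, the number of surviving tokens adapts to the distribution: a confident prediction keeps only one or two candidates, while a flat distribution keeps many.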

Different scale: our model has ~300K parameters trained on ~1KB of text. GPT-3 has 175B parameters trained on ~300B tokens. GPT-4 is reportedly even larger. The gap is roughly 6 orders of magnitude in model size and 8 to 9 orders of magnitude in data. Whether this difference in scale produces a qualitative difference (genuine "understanding" vs sophisticated pattern matching) or just a quantitative one (better pattern matching) is -- in my honest opinion -- one of the most interesting open questions in AI research right now.
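The size of that gap is easy to check with a back-of-the-envelope calculation. The sketch below uses the figures quoted above and assumes that ~1KB of text is about 1,000 character-level tokens; note that it compares characters with BPE tokens, so the data gap is only a rough estimate.

```python
import math

our_params, gpt3_params = 300e3, 175e9
# Assumption: ~1KB of text is about 1,000 tokens at character level.
our_tokens, gpt3_tokens = 1e3, 300e9

param_gap = math.log10(gpt3_params / our_params)
data_gap = math.log10(gpt3_tokens / our_tokens)

print(f"model size gap: {param_gap:.1f} orders of magnitude")  # 5.8
print(f"data gap: {data_gap:.1f} orders of magnitude")         # 8.5
```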

Different tokenization: we use characters; production models use BPE subword tokens (50K-100K vocabulary). BPE is more efficient because common words become single tokens ("the" = 1 token instead of 3 characters), which means the model can process longer texts in the same context window.
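To see the effect on sequence length, here is a small sketch that compares character tokens with subword tokens on one sentence. It is not real BPE: the vocabulary is hand-picked and hypothetical, and the tokenizer is a simple greedy longest-match stand-in. A real BPE vocabulary is learned from data by repeatedly merging frequent pairs.

```python
def greedy_tokenize(text, vocab):
    """Split text by always taking the longest vocabulary entry that matches next.

    Falls back to a single character when no longer entry matches.
    """
    max_len = max(len(tok) for tok in vocab)
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens

# Hypothetical hand-picked subword vocabulary, for illustration only.
toy_vocab = {"the", " the", " cat", " sat", " on", " mat"}
text = "the cat sat on the mat"

char_tokens = list(text)
subword_tokens = greedy_tokenize(text, toy_vocab)

print(len(char_tokens), "character tokens")
print(len(subword_tokens), "subword tokens:", subword_tokens)
```

The same sentence takes 22 positions of the context window at character level and 6 with this toy vocabulary, which is the efficiency gain described above.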

Different training infrastructure: we train on a single CPU/GPU in seconds. Production LLMs train on thousands of GPUs for weeks to months, using distributed training techniques (which we'll cover in a later episode).

What you should take away from Arc 3

Over the last 20 episodes, starting from episode #37, you built neural networks from nothing:

Episode #37: a single perceptron that can't even learn XOR
Episode #39: backpropagation, making multi-layer learning possible
Episode #42: PyTorch, replacing manual gradient computation with autograd
Episodes #45-47: CNNs, exploiting spatial structure in images
Episodes #48-49: RNNs and LSTMs, modeling sequential data with memory
Episodes #50-51: seq2seq and attention, connecting sequences to sequences
Episodes #52-53: the transformer, replacing recurrence with parallel attention
Episode #54: Vision Transformers, proving the architecture generalizes beyond text
Episode #55: GANs, neural networks competing to generate new data
Episode #56: a working language model that generates text from a prompt

The trajectory from "weighted sum with a step function" to "generates coherent English text from a prompt" is genuinely one of the most remarkable progressions in computing history. And you built every step of it.

Each piece was necessary. The transformer uses linear projections (from basic neural networks), nonlinear activations (episode #40), residual connections (from CNNs, episode #46), layer normalization (related to batch norm, episode #45), attention (from seq2seq, episode #51), and positional encoding (because we dropped recurrence). Nothing appeared from thin air -- it all connects back to concepts you learned episodes ago.

Arc 4 starts next

Arc 4 takes us into Large Language Models -- the systems that use this exact architecture at massive scale and have transformed what computers can do. We'll go from our character-level toy to understanding how production models like GPT and BERT actually work, how they're trained on internet-scale data, and how you can use them through APIs and fine-tuning. The transformer you just built is the foundation that everything in the next arc rests on. The pieces are in place, and the picture is about to get very interesting.

Did it click? Let's check

A decoder-only transformer (the GPT architecture) uses masked self-attention for next-token prediction -- no encoder needed;
Weight tying shares parameters between the token embedding and output projection, improving efficiency and acting as regularization;
Character-level tokenization is simple but demonstrates the same training procedure used for production LLMs with BPE tokenization;
Temperature scales logits before softmax: lower = more deterministic, higher = more random;
Top-k filtering restricts sampling to the k most likely tokens, preventing low-probability tokens from derailing generation;
Attention patterns reveal what the model learned: previous-token heads, delimiter heads, copy heads emerge even in tiny models;
This architecture IS what powers GPT-4 and other GPT-style LLMs -- the main difference is scale, alongside tokenization and training infrastructure;
Arc 3 took you from a single perceptron to a working text generator -- every concept from the last 20 episodes is present in today's model.

Thanks, and until next time!

Hive account: @scipio
