Learn AI Series (#59) - BERT and Encoder Models

What will I learn

You will learn masked language modeling -- predicting blanked-out tokens instead of next tokens;
bidirectional attention -- why seeing the full context (both left and right) matters for understanding;
the pre-training + fine-tuning paradigm that BERT pioneered and why it changed NLP overnight;
BERT for classification, named entity recognition, and question answering;
the BERT family: RoBERTa, ALBERT, DistilBERT, ELECTRA and what each improved;
when to use encoder models vs decoder models in practice.

Requirements

A working modern computer running macOS, Windows or Ubuntu;
An installed Python 3(.11+) distribution;
The ambition to learn AI and machine learning.

Difficulty

Beginner

Curriculum (of the `Learn AI Series`):

Learn AI Series (#59) - BERT and Encoder Models

Solutions to Episode #58 Exercises

Exercise 1: Build a complete GPT (small and medium), train on character-level text, compare loss and generation quality.

import torch
import torch.nn as nn
import torch.nn.functional as F
import math

class CausalSelfAttention(nn.Module):
    def __init__(self, d_model, n_heads, max_len=256, dropout=0.1):
        super().__init__()
        self.n_heads = n_heads
        self.d_k = d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.proj = nn.Linear(d_model, d_model)
        self.drop = nn.Dropout(dropout)
        mask = torch.tril(torch.ones(max_len, max_len))
        self.register_buffer('mask', mask.view(1, 1, max_len, max_len))

    def forward(self, x):
        B, T, C = x.shape
        qkv = self.qkv(x).reshape(B, T, 3, self.n_heads, self.d_k)
        qkv = qkv.permute(2, 0, 3, 1, 4)
        Q, K, V = qkv[0], qkv[1], qkv[2]
        scores = (Q @ K.transpose(-2, -1)) / math.sqrt(self.d_k)
        scores = scores.masked_fill(self.mask[:, :, :T, :T] == 0, float('-inf'))
        attn = self.drop(F.softmax(scores, dim=-1))
        out = (attn @ V).transpose(1, 2).contiguous().view(B, T, C)
        return self.proj(out)

class GPTBlock(nn.Module):
    def __init__(self, d_model, n_heads, d_ff, max_len=256, dropout=0.1):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = CausalSelfAttention(d_model, n_heads, max_len, dropout)
        self.ln2 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(),
            nn.Linear(d_ff, d_model), nn.Dropout(dropout))

    def forward(self, x):
        x = x + self.attn(self.ln1(x))
        x = x + self.ff(self.ln2(x))
        return x

class GPT(nn.Module):
    def __init__(self, vocab_size, d_model=64, n_heads=4, n_layers=2,
                 d_ff=256, max_len=256, dropout=0.1):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)
        self.drop = nn.Dropout(dropout)
        self.blocks = nn.ModuleList([
            GPTBlock(d_model, n_heads, d_ff, max_len, dropout)
            for _ in range(n_layers)])
        self.ln_f = nn.LayerNorm(d_model)
        self.head = nn.Linear(d_model, vocab_size, bias=False)
        self.head.weight = self.tok_emb.weight
        self.max_len = max_len

    def forward(self, idx):
        B, T = idx.shape
        tok = self.tok_emb(idx)
        pos = self.pos_emb(torch.arange(T, device=idx.device))
        x = self.drop(tok + pos)
        for block in self.blocks:
            x = block(x)
        return self.head(self.ln_f(x))

# Load text (use any text file >= 100KB)
text = open('shakespeare.txt').read()
chars = sorted(set(text))
vocab_size = len(chars)
ch2i = {c: i for i, c in enumerate(chars)}
i2ch = {i: c for c, i in ch2i.items()}
data = torch.tensor([ch2i[c] for c in text], dtype=torch.long)
split = int(0.9 * len(data))
train_data, test_data = data[:split], data[split:]

def get_batch(source, batch_size=32, block_size=64):
    ix = torch.randint(len(source) - block_size - 1, (batch_size,))
    x = torch.stack([source[i:i+block_size] for i in ix])
    y = torch.stack([source[i+1:i+block_size+1] for i in ix])
    return x, y

@torch.no_grad()
def generate(model, prompt_ids, length=200, temperature=0.8):
    model.eval()
    ids = prompt_ids.unsqueeze(0)
    for _ in range(length):
        logits = model(ids[:, -model.max_len:])
        logits = logits[:, -1, :] / temperature
        probs = F.softmax(logits, dim=-1)
        nxt = torch.multinomial(probs, 1)
        ids = torch.cat([ids, nxt], dim=1)
    return ''.join(i2ch[i.item()] for i in ids[0])

configs = {
    "small":  {"d_model": 64,  "n_layers": 2, "n_heads": 4, "d_ff": 256},
    "medium": {"d_model": 128, "n_layers": 4, "n_heads": 4, "d_ff": 512},
}

for name, cfg in configs.items():
    model = GPT(vocab_size, **cfg)
    n_params = sum(p.numel() for p in model.parameters())
    opt = torch.optim.AdamW(model.parameters(), lr=3e-4)
    model.train()
    for step in range(2000):
        x, y = get_batch(train_data)
        logits = model(x)
        loss = F.cross_entropy(logits.view(-1, vocab_size), y.view(-1))
        opt.zero_grad()
        loss.backward()
        opt.step()
        if step % 500 == 0:
            print(f"[{name}] step {step}: loss={loss.item():.3f}")

    print(f"\n{name}: {n_params:,} params, final loss={loss.item():.3f}")
    prompt = torch.tensor([ch2i[c] for c in "The "], dtype=torch.long)
    print(f"Generated: {generate(model, prompt)[:200]}\n")

The medium model (128d, 4 layers) should reach a noticeably lower loss (around 1.3-1.5) compared to the small model (64d, 2 layers, around 1.6-1.8). The generated text from the medium model produces more coherent English -- longer recognizable words, occasional real phrases. The small model generates plausible character sequences but with more nonsense. The parameter count ratio is roughly 7:1 (about 0.83M vs 0.12M, nearly independent of vocabulary size at the character level), and the quality improvement is clear but with diminishing returns per parameter added.

Exercise 2: Causal vs bidirectional attention visualization and entropy comparison.

import torch
import torch.nn as nn
import torch.nn.functional as F
import math

def attention_weights(Q, K, mask=None):
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float('-inf'))
    return F.softmax(scores, dim=-1)

def entropy(weights):
    """Shannon entropy of attention distribution (per position)."""
    w = weights.clamp(min=1e-10)
    return -(w * w.log()).sum(dim=-1)

tokens = ["The", "cat", "sat", "on", "the", "mat"]
T = len(tokens)
d_k = 16
torch.manual_seed(42)
Q = torch.randn(1, T, d_k)
K = torch.randn(1, T, d_k)

# Causal mask (lower triangular)
causal_mask = torch.tril(torch.ones(T, T)).unsqueeze(0)
causal_attn = attention_weights(Q, K, mask=causal_mask)
bidir_attn = attention_weights(Q, K, mask=None)

def print_attn(attn, name, tokens):
    print(f"\n{name} attention weights:")
    print(f"{'':>6}", end="")
    for t in tokens:
        print(f"{t:>6}", end="")
    print()
    for i, tok in enumerate(tokens):
        print(f"{tok:>6}", end="")
        for j in range(len(tokens)):
            print(f"{attn[0, i, j].item():>6.2f}", end="")
        print()

print_attn(causal_attn, "CAUSAL (GPT-style)", tokens)
print_attn(bidir_attn, "BIDIRECTIONAL (BERT-style)", tokens)

causal_ent = entropy(causal_attn[0])
bidir_ent = entropy(bidir_attn[0])
print(f"\nAttention entropy per position:")
for i, tok in enumerate(tokens):
    print(f"  {tok:>4}: causal={causal_ent[i].item():.3f}, "
          f"bidir={bidir_ent[i].item():.3f}")
print(f"  Mean: causal={causal_ent.mean():.3f}, bidir={bidir_ent.mean():.3f}")
print(f"\nBidirectional entropy is higher because each position")
print(f"distributes attention across all {T} positions, not just past ones.")

The key observation: in causal attention, position 0 ("The") has zero entropy -- it can only attend to itself. Each later position can spread its attention over one more token, so the maximum possible entropy grows with position and peaks at position 5 ("mat"), which sees all 6 tokens. In bidirectional attention, every position can attend to all 6 tokens, so entropy is roughly similar across positions and higher on average (with random Q and K the weights are spread out, though not uniform). This is the structural difference that helps bidirectional models on understanding tasks -- they aggregate context from everywhere, not just the past.
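As a quick sanity check on those numbers, here is a small sketch of the upper bounds involved. The entropy of an attention row over n visible positions can never exceed ln(n), so a causal row at position i is capped at ln(i+1), while every bidirectional row is capped at ln(T). The helper name `max_entropy` is made up for this illustration.

```python
import math

def max_entropy(n_visible):
    """Upper bound (in nats) on the entropy of a distribution over n_visible items."""
    return math.log(n_visible)

T = 6  # sequence length used in the exercise
causal_caps = [max_entropy(i + 1) for i in range(T)]  # position i sees i+1 tokens
bidir_caps = [max_entropy(T)] * T                     # every position sees all T

for i, (c, b) in enumerate(zip(causal_caps, bidir_caps)):
    print(f"position {i}: causal cap={c:.3f}, bidirectional cap={b:.3f}")
```

The measured entropies from the exercise should sit at or below these caps; only the last position has the same cap under both masks.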

Exercise 3: Simple few-shot in-context learning test with a small GPT.

import torch
import torch.nn as nn
import torch.nn.functional as F
import math

# Reuse the GPT class from Exercise 1
# Train on text containing "X -> Y" patterns

# Create synthetic training data with mapping patterns
patterns = """
apple -> fruit
carrot -> vegetable
banana -> fruit
broccoli -> vegetable
grape -> fruit
spinach -> vegetable
mango -> fruit
celery -> vegetable
cherry -> fruit
lettuce -> vegetable
"""

train_text = (patterns * 200).strip()
chars = sorted(set(train_text))
vocab_size = len(chars)
ch2i = {c: i for i, c in enumerate(chars)}
i2ch = {i: c for c, i in ch2i.items()}
data = torch.tensor([ch2i[c] for c in train_text], dtype=torch.long)

model = GPT(vocab_size, d_model=128, n_heads=4, n_layers=4,
            d_ff=512, max_len=256)
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)

block_size = 128
for step in range(3000):
    ix = torch.randint(len(data) - block_size - 1, (32,))
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    logits = model(x)
    loss = F.cross_entropy(logits.view(-1, vocab_size), y.view(-1))
    opt.zero_grad()
    loss.backward()
    opt.step()
    if step % 500 == 0:
        print(f"Step {step}: loss={loss.item():.3f}")

model.eval()
test_cases = [("peach", "fruit"), ("kale", "vegetable")]

for n_examples in [1, 2, 3, 5]:
    examples = [
        "apple -> fruit", "carrot -> vegetable",
        "banana -> fruit", "broccoli -> vegetable",
        "grape -> fruit",
    ][:n_examples]
    correct = 0
    for test_word, expected in test_cases:
        prompt = "\n".join(examples) + f"\n{test_word} -> "
        ids = torch.tensor([ch2i[c] for c in prompt if c in ch2i],
                           dtype=torch.long).unsqueeze(0)
        with torch.no_grad():
            logits = model(ids)
        pred_id = logits[0, -1].argmax().item()
        pred_char = i2ch[pred_id]
        exp_char = expected[0]
        if pred_char == exp_char:
            correct += 1
        print(f"  {n_examples} examples: '{test_word}' -> "
              f"predicted '{pred_char}', expected '{exp_char}'")
    print(f"  Accuracy with {n_examples} examples: "
          f"{correct}/{len(test_cases)}\n")

At this small scale, ICL accuracy is unreliable -- sometimes the model gets it right (especially with more examples in the prompt), sometimes it doesn't. The training data contains these patterns so the model knows the mapping format, but generalizing to unseen words requires capacity that a 128-dim model barely has. With GPT-3 scale (175B params), this kind of in-context pattern recognition works consistently. The point is that ICL improves with both model scale and number of in-context examples.

On to today's episode

Here we go! Last episode we went deep into the GPT architecture -- the decoder-only transformer that turned next-token prediction into one of the most consequential technologies of the decade. We built the causal self-attention mechanism, the full GPT model with weight tying, traced the scaling story from GPT-1's 117M parameters to GPT-4's reported trillion+, and explored the surprising emergent abilities that appear at scale. The decoder-only approach won the generation race decisively.

But here's the thing about GPT's causal mask: every token can only look left. Position 5 sees tokens 0 through 5, never tokens 6, 7, 8. For generation, this is exactly what you want -- you can't peek at the future when you're writing it. But for understanding tasks? Looking only left is a genuine handicap.

Consider this sentence: "I went to the bank to deposit money." When you (a human) read the word "bank," you instantly understand it means a financial institution -- because "deposit money" comes after it and resolves the ambiguity. A left-to-right model has to make its best guess about "bank" before it ever sees "deposit." Maybe it guesses correctly from the general context, maybe it doesn't. But it never gets to use the right-side context that a human reader uses without even thinking about it.

In October 2018 -- just four months after GPT-1 -- a team at Google published BERT: Bidirectional Encoder Representations from Transformers. The idea was almost confrontationally simple: drop the causal mask entirely. Let every token attend to every other token, both left and right. The result was a model that set new state-of-the-art results on all eleven NLP tasks it was evaluated on, and it changed the field overnight ;-)

Why bidirectional matters

Let me be very concrete about why this matters, because it connects directly to the transformer architecture we built in episodes #52-53.

In GPT, the attention mask is a lower-triangular matrix -- position i can attend to positions 0 through i. In BERT, there's no mask at all (well, except for padding). Every position attends to every other position. The attention computation is identical -- same Q, K, V projections, same scaled dot-product attention, same multi-head setup. The only difference is whether you apply a causal mask or not.

import torch
import torch.nn as nn
import torch.nn.functional as F
import math

def attention_compare(seq_len=6, d_model=32, n_heads=2):
    """Compare GPT (causal) vs BERT (bidirectional) attention."""
    d_k = d_model // n_heads
    torch.manual_seed(42)

    # Same input, same weights
    x = torch.randn(1, seq_len, d_model)
    W_q = nn.Linear(d_model, d_model, bias=False)
    W_k = nn.Linear(d_model, d_model, bias=False)
    W_v = nn.Linear(d_model, d_model, bias=False)

    Q = W_q(x).view(1, seq_len, n_heads, d_k).transpose(1, 2)
    K = W_k(x).view(1, seq_len, n_heads, d_k).transpose(1, 2)
    V = W_v(x).view(1, seq_len, n_heads, d_k).transpose(1, 2)

    scores = (Q @ K.transpose(-2, -1)) / math.sqrt(d_k)

    # GPT: causal mask
    causal_mask = torch.tril(torch.ones(seq_len, seq_len))
    gpt_scores = scores.masked_fill(causal_mask == 0, float('-inf'))
    gpt_attn = F.softmax(gpt_scores, dim=-1)
    gpt_out = (gpt_attn @ V).transpose(1, 2).contiguous().view(1, seq_len, d_model)

    # BERT: no mask (full bidirectional)
    bert_attn = F.softmax(scores, dim=-1)
    bert_out = (bert_attn @ V).transpose(1, 2).contiguous().view(1, seq_len, d_model)

    print("GPT output for position 3 (sees tokens 0-3 only):")
    print(f"  {gpt_out[0, 3, :8].tolist()}")
    print("BERT output for position 3 (sees ALL tokens 0-5):")
    print(f"  {bert_out[0, 3, :8].tolist()}")
    print(f"\nDifference: {(gpt_out - bert_out).abs().mean():.4f}")
    print("Different outputs from identical weights -- only the mask changes!")

attention_compare()

Same weights, same input, different mask -- completely different representations. BERT's representation of position 3 incorporates information from positions 4 and 5 (the future), while GPT's doesn't. For a task like sentiment classification where you have the full review text available, BERT can build a richer representation because it sees everything at once.
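The one mask BERT does keep is the padding mask mentioned above. Here is a minimal single-head sketch of how that could look: padded key positions get a score of -inf so no token attends to them. The function name `bidirectional_attention` and the `pad_mask` convention (True = real token) are my own choices for this example, not something fixed by BERT itself.

```python
import math
import torch
import torch.nn.functional as F

def bidirectional_attention(Q, K, V, pad_mask):
    """Full (non-causal) attention that ignores padded key positions.

    Q, K, V: (batch, seq_len, d_k); pad_mask: (batch, seq_len), True = real token.
    """
    scores = Q @ K.transpose(-2, -1) / math.sqrt(Q.size(-1))
    # Broadcast the key mask over the query dimension: (batch, 1, seq_len)
    scores = scores.masked_fill(~pad_mask.unsqueeze(1), float('-inf'))
    attn = F.softmax(scores, dim=-1)
    return attn @ V, attn

torch.manual_seed(0)
x = torch.randn(1, 5, 8)
pad_mask = torch.tensor([[True, True, True, False, False]])  # last two are padding
out, attn = bidirectional_attention(x, x, x, pad_mask)
print(attn[0])  # the columns for the two padded positions are all zero
```

Note that there is no triangular structure anywhere: the mask depends only on which keys are padding, not on the relative order of query and key.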

Masked language modeling

But wait -- if BERT sees the entire sequence, how do you train it? You can't use next-token prediction (like GPT) because the model already sees every token. If you ask it to predict the token at position 5 and it can attend to position 5... well, it just copies the answer. Zero learning happens.

BERT's solution is masked language modeling (MLM). During training, 15% of input tokens are randomly selected for masking. Of those selected tokens:

80% are replaced with a special [MASK] token
10% are replaced with a random token from the vocabulary
10% are left unchanged

The model then has to predict the original token at each masked position. Only the predictions at masked positions contribute to the loss -- the rest are ignored.

import torch
import random

def apply_bert_masking(tokens, vocab_size, mask_id, mask_prob=0.15,
                       special_ids=()):
    """BERT-style masking: 80% [MASK], 10% random, 10% unchanged.
    Ids in special_ids (e.g. [CLS], [SEP]) are never selected."""
    masked = tokens.clone()
    labels = torch.full_like(tokens, -100)  # -100 = ignore in CrossEntropyLoss

    for i in range(len(tokens)):
        if tokens[i].item() in special_ids:
            continue  # special tokens are never masked or predicted
        if random.random() < mask_prob:
            labels[i] = tokens[i]  # remember the original
            r = random.random()
            if r < 0.8:
                masked[i] = mask_id          # 80%: replace with [MASK]
            elif r < 0.9:
                masked[i] = random.randint(0, vocab_size - 1)  # 10%: random
            # else: 10%: keep the original token (masked[i] stays as is)
    return masked, labels

# Example with a simple sentence
vocab = {"[CLS]": 101, "[SEP]": 102, "[MASK]": 103,
         "the": 2, "cat": 3, "sat": 4, "on": 5,
         "a": 6, "mat": 7, "big": 8, "red": 9}
tokens = torch.tensor([101, 2, 3, 4, 5, 6, 7, 102])
#                      CLS  the cat sat on  a  mat SEP

random.seed(42)
masked, labels = apply_bert_masking(tokens, vocab_size=30000, mask_id=103,
                                    special_ids={101, 102})
print(f"Original: {tokens.tolist()}")
print(f"Masked:   {masked.tolist()}")
print(f"Labels:   {labels.tolist()}")
print("(-100 means 'don't compute loss here')")

Why the 80/10/10 split? This is a genuinely clever design choice. If BERT only ever saw [MASK] tokens during training, there would be a mismatch between pre-training and downstream use: during fine-tuning and inference, there are no [MASK] tokens in the input. The model's representations would be calibrated for inputs containing [MASK] tokens but would have to process inputs without them. The 10% random replacement and 10% unchanged cases teach the model to produce good representations regardless of whether a token was masked or not. It's a form of regularization that bridges the gap between pre-training and downstream use.

Having said that, the 15% masking rate is itself a tradeoff. Only 15% of tokens contribute to the training loss per example. GPT computes a loss at every single position (predicting the next token at each step). This makes GPT roughly 6-7x more sample-efficient per token of training data. BERT compensates by using bidirectional context (arguably learning more per masked token than GPT learns per next-token prediction), but the sample efficiency difference is real and motivated later work like ELECTRA.
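To make the "only masked positions contribute" part concrete, here is a small sketch with random logits standing in for a model's output. It shows that `ignore_index=-100` gives the same loss as averaging over the selected positions only, and how few positions that is. The toy vocabulary size and sequence length are arbitrary assumptions.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
vocab_size, seq_len = 50, 1000
tokens = torch.randint(0, vocab_size, (seq_len,))

# Select ~15% of positions; every other label is -100 (ignored by the loss)
selected = torch.rand(seq_len) < 0.15
labels = torch.where(selected, tokens, torch.full_like(tokens, -100))

logits = torch.randn(seq_len, vocab_size)  # stand-in for model output
loss = F.cross_entropy(logits, labels, ignore_index=-100)

# Same value as averaging over the selected positions only
loss_selected_only = F.cross_entropy(logits[selected], tokens[selected])

frac = selected.float().mean().item()
print(f"supervised positions: {frac:.1%}")
print(f"loss (ignore_index):  {loss.item():.4f}")
print(f"loss (selected only): {loss_selected_only.item():.4f}")
print(f"a loss at every position gives ~{1 / 0.15:.1f}x more targets per sequence")
```

The last line is where the "6-7x" figure comes from: 1 / 0.15 is about 6.7.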

The BERT architecture

BERT is a standard transformer encoder -- exactly what we built in episode #52 -- with no decoder and no causal mask. If you understood the encoder from that episode, you already understand 90% of BERT. The remaining 10% is BERT-specific input formatting:

import torch
import torch.nn as nn

class BERTEmbedding(nn.Module):
    """BERT uses three kinds of embeddings, summed together."""
    def __init__(self, vocab_size, d_model, max_len, n_segments=2):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)    # token identity
        self.pos_emb = nn.Embedding(max_len, d_model)       # position
        self.seg_emb = nn.Embedding(n_segments, d_model)    # segment (A or B)
        self.norm = nn.LayerNorm(d_model)
        self.drop = nn.Dropout(0.1)

    def forward(self, tokens, segments):
        pos_ids = torch.arange(tokens.size(1), device=tokens.device)
        x = self.tok_emb(tokens) + self.pos_emb(pos_ids) + self.seg_emb(segments)
        return self.drop(self.norm(x))

class BERTModel(nn.Module):
    """BERT: transformer encoder + MLM prediction head."""
    def __init__(self, vocab_size, d_model=768, n_heads=12,
                 n_layers=12, d_ff=3072, max_len=512):
        super().__init__()
        self.embedding = BERTEmbedding(vocab_size, d_model, max_len)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model, n_heads, d_ff, batch_first=True, activation='gelu')
        self.encoder = nn.TransformerEncoder(encoder_layer, n_layers)
        # MLM head: project back to vocabulary
        self.mlm_head = nn.Sequential(
            nn.Linear(d_model, d_model), nn.GELU(),
            nn.LayerNorm(d_model),
            nn.Linear(d_model, vocab_size)
        )

    def forward(self, tokens, segments):
        x = self.embedding(tokens, segments)
        x = self.encoder(x)
        return self.mlm_head(x)

    def get_hidden(self, tokens, segments):
        """Get encoder output without the MLM head (for fine-tuning)."""
        x = self.embedding(tokens, segments)
        return self.encoder(x)

# BERT-base configuration
bert = BERTModel(vocab_size=30522, d_model=768, n_heads=12,
                 n_layers=12, d_ff=3072)
n_params = sum(p.numel() for p in bert.parameters())
print(f"BERT-base parameters: {n_params:,}")  # ~133M here: the untied MLM head adds ~24M on top of the ~109M encoder

# Dummy forward pass
tokens = torch.randint(0, 30522, (2, 128))
segments = torch.zeros_like(tokens)
logits = bert(tokens, segments)
print(f"Input: {tokens.shape}")
print(f"MLM logits: {logits.shape}")  # (2, 128, 30522)

One detail unique to BERT that GPT doesn't have: segment embeddings. BERT was designed to process pairs of sentences (for tasks like "does sentence B follow sentence A?"). Each token gets a segment embedding indicating whether it belongs to sentence A (segment 0) or sentence B (segment 1). The input format is [CLS] sentence_A [SEP] sentence_B [SEP], where [CLS] is a special classification token and [SEP] separates the sentences.
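Here is a minimal sketch of how that input layout and the matching segment ids could be assembled. It works on plain word strings to keep things readable (real BERT uses WordPiece token ids), and `build_pair_input` is a hypothetical helper name.

```python
def build_pair_input(tokens_a, tokens_b=None):
    """Return (tokens, segment_ids) in BERT's [CLS] A [SEP] B [SEP] layout."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"]
    segments = [0] * len(tokens)               # [CLS], sentence A and its [SEP]
    if tokens_b is not None:
        tokens += tokens_b + ["[SEP]"]
        segments += [1] * (len(tokens_b) + 1)  # sentence B and the final [SEP]
    return tokens, segments

toks, segs = build_pair_input(["the", "cat", "sat"], ["it", "purred"])
for t, s in zip(toks, segs):
    print(f"{t:>8}  segment {s}")
```

For single-sentence tasks you leave out `tokens_b` and every position gets segment 0, which is what the dummy forward pass above does with `torch.zeros_like(tokens)`.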

BERT-base has 110 million parameters (12 layers, 768 dimensions, 12 attention heads). BERT-large has 340 million parameters (24 layers, 1024 dimensions, 16 heads). Both were trained on BookCorpus + English Wikipedia -- roughly 3.3 billion words total. Compared to GPT-3's 300 billion training tokens, that's tiny. But because every masked prediction draws on context from both sides, BERT arguably extracts more information per supervised token.
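Those parameter counts can be reproduced with plain arithmetic. The sketch below counts the weights of a BERT-style encoder from its configuration; I'm assuming a 30,522-token vocabulary, 512 positions, and a pooler layer on top of [CLS] (the function name `bert_param_count` is mine). The number of heads doesn't appear because splitting into heads doesn't change the size of the projections.

```python
def bert_param_count(vocab_size=30522, d_model=768, n_layers=12,
                     d_ff=3072, max_len=512, n_segments=2):
    """Parameter count of a BERT-style encoder (embeddings + layers + pooler)."""
    layer_norm = 2 * d_model                               # gain + bias
    embeddings = (vocab_size + max_len + n_segments) * d_model + layer_norm
    attention = 4 * (d_model * d_model + d_model)          # Q, K, V, output
    feed_forward = (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)
    per_layer = attention + feed_forward + 2 * layer_norm
    pooler = d_model * d_model + d_model                   # dense layer on [CLS]
    return embeddings + n_layers * per_layer + pooler

base = bert_param_count()
large = bert_param_count(d_model=1024, n_layers=24, d_ff=4096)
print(f"BERT-base:  {base:,}")
print(f"BERT-large: {large:,}")
```

Both results land close to the rounded 110M and 340M figures. Most of the weights sit in the transformer layers, with the token embedding table as the second-largest chunk.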

Pre-training and fine-tuning: the paradigm shift

BERT's training has two distinct phases, and understanding this split is critical because it became the dominant paradigm for NLP (and later, for practically all of deep learning):

Phase 1 -- Pre-training: train the model on masked language modeling using a massive unlabeled text corpus. This is expensive (the original BERT-large took 4 days on 64 TPU v2 chips) but you do it only once. The result is a model that produces rich, contextual word representations. It "understands" language in the sense that it can fill in blanks correctly, which requires knowing grammar, semantics, and a fair amount of world knowledge.

Phase 2 -- Fine-tuning: take the pre-trained model, add a small task-specific head on top, and train on a much smaller labeled dataset for your specific task. This is cheap -- a few hours on a single GPU. The pre-trained model already understands language; fine-tuning just teaches it the task format.

import torch
import torch.nn as nn

class BERTForClassification(nn.Module):
    """Fine-tune BERT for text classification.
    Uses the [CLS] token's output as the sentence representation."""
    def __init__(self, bert_model, n_classes, d_model=768):
        super().__init__()
        self.bert = bert_model
        self.dropout = nn.Dropout(0.1)
        self.classifier = nn.Linear(d_model, n_classes)

    def forward(self, tokens, segments):
        hidden = self.bert.get_hidden(tokens, segments)
        cls_output = hidden[:, 0, :]  # [CLS] token = position 0
        return self.classifier(self.dropout(cls_output))

class BERTForNER(nn.Module):
    """Fine-tune BERT for Named Entity Recognition.
    Predicts a tag for EVERY token in the sequence."""
    def __init__(self, bert_model, n_tags, d_model=768):
        super().__init__()
        self.bert = bert_model
        self.dropout = nn.Dropout(0.1)
        self.tag_head = nn.Linear(d_model, n_tags)

    def forward(self, tokens, segments):
        hidden = self.bert.get_hidden(tokens, segments)
        return self.tag_head(self.dropout(hidden))

class BERTForQA(nn.Module):
    """Fine-tune BERT for extractive question answering.
    Predicts start and end positions of the answer span."""
    def __init__(self, bert_model, d_model=768):
        super().__init__()
        self.bert = bert_model
        self.qa_head = nn.Linear(d_model, 2)  # start + end logits

    def forward(self, tokens, segments):
        hidden = self.bert.get_hidden(tokens, segments)
        logits = self.qa_head(hidden)
        start_logits = logits[:, :, 0]
        end_logits = logits[:, :, 1]
        return start_logits, end_logits

# All three tasks use the SAME pre-trained BERT
bert = BERTModel(vocab_size=30522)
clf = BERTForClassification(bert, n_classes=2)
ner = BERTForNER(bert, n_tags=9)  # BIO tagging: B-PER, I-PER, B-ORG, ...
qa = BERTForQA(bert)

tokens = torch.randint(0, 30522, (2, 64))
segs = torch.zeros_like(tokens)

clf_out = clf(tokens, segs)
ner_out = ner(tokens, segs)
qa_start, qa_end = qa(tokens, segs)

print(f"Classification: {clf_out.shape}")    # (2, 2) -- 2 classes
print(f"NER tags:       {ner_out.shape}")    # (2, 64, 9) -- tag per token
print(f"QA start:       {qa_start.shape}")   # (2, 64) -- start position
print(f"QA end:         {qa_end.shape}")     # (2, 64) -- end position

The [CLS] token is the key design pattern here. Its final hidden state serves as a sequence-level representation -- a single vector that summarizes the entire input. For classification, you put a linear layer on top of [CLS]. For token-level tasks (NER, POS tagging), you put a head on top of every token's output. For question answering, you predict start and end positions of the answer span.
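For question answering, the two logit vectors still have to be turned into an actual answer span. A common way to do that (sketched here in plain Python, with made-up logits) is to pick the pair with the highest start + end score, subject to start <= end and a length limit. The `max_answer_len` parameter and the function name `best_span` are assumptions for this illustration.

```python
def best_span(start_logits, end_logits, max_answer_len=30):
    """Return (start, end, score) maximizing start + end logit with start <= end."""
    best = (0, 0, float("-inf"))
    for s, s_logit in enumerate(start_logits):
        last = min(len(end_logits), s + max_answer_len)
        for e in range(s, last):
            score = s_logit + end_logits[e]
            if score > best[2]:
                best = (s, e, score)
    return best

tokens = ["[CLS]", "where", "?", "[SEP]", "in", "cupertino", ",", "california", "[SEP]"]
start_logits = [0.1, 0.0, 0.0, 0.0, 1.0, 4.0, 0.0, 0.5, 0.0]
end_logits   = [0.1, 0.0, 0.0, 0.0, 0.0, 1.5, 0.0, 3.5, 0.0]

s, e, score = best_span(start_logits, end_logits)
print(tokens[s:e + 1], score)
```

Taking the argmax of each head independently can give an end position before the start position; searching over valid pairs avoids that.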

This paradigm -- pre-train once on a general objective, then fine-tune cheaply for dozens of different tasks -- is what made BERT so genuinely impactful. Before BERT, each NLP task required its own architecture, its own training procedure, and its own large labeled dataset. After BERT, you start with a powerful general-purpose language encoder and adapt it with minimal task-specific data. A paradigm shift in the truest sense of the word.
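How "cheap" the adaptation is depends on what you let train. Full fine-tuning updates every weight. An even cheaper variant freezes the encoder and trains only the new head. The sketch below uses a tiny stand-in encoder (not a real pre-trained BERT) just to show the mechanics of freezing and how the trainable parameter count changes; all sizes are arbitrary.

```python
import torch
import torch.nn as nn

# Tiny stand-in encoder (assumption: any module mapping (B, T, d) -> (B, T, d))
d_model = 32
layer = nn.TransformerEncoderLayer(d_model, nhead=4, dim_feedforward=64,
                                   batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=2)
head = nn.Linear(d_model, 2)  # new task-specific classifier

def n_trainable(*modules):
    """Count parameters that will receive gradient updates."""
    return sum(p.numel() for m in modules for p in m.parameters()
               if p.requires_grad)

full = n_trainable(encoder, head)   # full fine-tuning: everything updates

for p in encoder.parameters():      # feature extraction: freeze the encoder
    p.requires_grad = False
frozen = n_trainable(encoder, head)

print(f"full fine-tuning: {full:,} trainable parameters")
print(f"frozen encoder:   {frozen:,} trainable parameters (head only)")

# Only parameters that still require gradients go to the optimizer
opt = torch.optim.AdamW(head.parameters(), lr=1e-3)
```

Freezing usually costs some accuracy compared to full fine-tuning, so treat it as a budget option rather than the default.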

BERT for specific tasks

Let me walk through the main task types to make the input/output format concrete:

Sentiment classification: feed [CLS] This movie was terrible and boring [SEP], take the [CLS] output, classify as positive/negative/neutral. Fine-tune on a few thousand labeled reviews. BERT-base achieves ~93% accuracy on SST-2 (Stanford Sentiment Treebank) with minimal effort -- a HUGE improvement over previous approaches that needed task-specific architectures.

Named Entity Recognition: feed [CLS] Apple is headquartered in Cupertino [SEP], predict a BIO tag for each token. "Apple" -> B-ORG, "is" -> O, "headquartered" -> O, "in" -> O, "Cupertino" -> B-LOC. Each token's hidden state goes through the tag head independently. The bidirectional context helps enormously here -- knowing what comes after an entity often disambiguates its type.

Extractive question answering: feed [CLS] Where is Apple headquartered? [SEP] Apple Inc. is headquartered in Cupertino, California. [SEP]. Two output heads predict the start position and end position of the answer span within the passage. The model learns to output high logits at "Cupertino" (start) and "California" (end). The segment embeddings tell the model which part is the question (segment 0) and which is the passage (segment 1).

Sentence similarity: feed [CLS] A dog runs through the park [SEP] A puppy is playing outside [SEP], use [CLS] to predict a similarity score. Both sentences are encoded together, and self-attention across the two sentences (tokens in sentence A attend to tokens in sentence B and vice versa, which happens automatically in the bidirectional encoder) captures semantic relationships.

# Demonstrating the input format for each task type
task_examples = {
    "Classification": {
        "input": "[CLS] This movie was fantastic! [SEP]",
        "segments": [0]*7,
        "output": "positive (from [CLS] hidden state)"
    },
    "NER": {
        "input": "[CLS] Apple acquired Beats in California [SEP]",
        "segments": [0]*7,
        "output": "Apple=B-ORG, Beats=B-ORG, California=B-LOC (per-token)"
    },
    "QA": {
        "input": "[CLS] Where was it founded? [SEP] The company was founded in Cupertino. [SEP]",
        "segments": [0]*7 + [1]*8,  # punctuation counted as separate tokens, as in the classification example
        "output": "start=Cupertino, end=Cupertino (span extraction)"
    },
    "Similarity": {
        "input": "[CLS] The cat sat on the mat [SEP] A feline rested on the rug [SEP]",
        "segments": [0]*8 + [1]*7,
        "output": "0.87 similarity score (from [CLS])"
    }
}

for task, ex in task_examples.items():
    print(f"\n{task}:")
    print(f"  Input: {ex['input']}")
    print(f"  Segments: {ex['segments'][:5]}...")
    print(f"  Output: {ex['output']}")

All four tasks use the same pre-trained BERT -- only the small head on top differs. The vast majority of parameters (the 12 or 24 transformer layers) are shared. This is quite elegant from an engineering perspective: you maintain one pre-trained model and swap task heads as needed.
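One practical detail for the NER case: the head gives you one tag per token, and you still have to group those BIO tags into entities. A minimal sketch of that post-processing step, with a made-up example sentence and a hypothetical helper name `bio_to_spans`:

```python
def bio_to_spans(tokens, tags):
    """Group BIO tags into (entity_text, entity_type) pairs."""
    spans, current, current_type = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                spans.append((" ".join(current), current_type))
            current, current_type = [tok], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == current_type:
            current.append(tok)
        else:  # "O", or an I- tag that does not continue the open entity
            if current:
                spans.append((" ".join(current), current_type))
            current, current_type = [], None
    if current:
        spans.append((" ".join(current), current_type))
    return spans

tokens = ["Tim", "Cook", "leads", "Apple", "in", "Cupertino"]
tags   = ["B-PER", "I-PER", "O", "B-ORG", "O", "B-LOC"]
print(bio_to_spans(tokens, tags))
```

B- opens a new entity, I- continues the open one, and O (or a mismatched I- tag) closes it.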

The BERT family

BERT spawned a family of variants, each addressing a specific limitation. Understanding these is important because you'll encounter them constantly in practice:

RoBERTa (2019, Facebook AI, now Meta): "A Robustly Optimized BERT Pretraining Approach." Same architecture as BERT, zero structural changes. But trained better: far more training compute (500K steps at a batch size of 8K, vs BERT's 1M steps at a batch size of 256 -- roughly 16x more sequences seen), more data (160GB vs 16GB), and dynamic masking (re-randomize the mask each epoch instead of using the same fixed masks). They also dropped Next Sentence Prediction (NSP) -- BERT's second pre-training objective where the model predicted whether two sentences were consecutive -- because it turned out to not help and possibly hurt. RoBERTa outperformed BERT across GLUE, SQuAD, and RACE, simply through better training recipes. The lesson, once again: training recipe matters as much as architecture (we saw this same story with DeiT vs ViT in episode #54).

# RoBERTa vs BERT: same architecture, different training
comparison = {
    # BERT takes more steps, but at batch 256 vs 8K it sees ~16x fewer sequences
    "Training steps": ("1M", "500K"),
    "Batch size": ("256", "8K"),
    "Training data": ("16GB", "160GB"),
    "Masking": ("Static", "Dynamic"),
    "NSP task": ("Yes", "Removed"),
    "MNLI accuracy": ("86.6%", "90.2%"),
    "SQuAD F1": ("90.9", "94.6"),
}

print(f"{'':>20} {'BERT':>12} {'RoBERTa':>12}")
print("-" * 46)
for key, (bert_val, roberta_val) in comparison.items():
    print(f"{key:>20} {bert_val:>12} {roberta_val:>12}")
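The static vs dynamic masking difference is easy to see in a toy example. The sketch below is a simplification: it only tracks which positions are selected, and it treats "static" as one mask reused every epoch. The helper name `sample_mask` and the sizes are assumptions.

```python
import random

def sample_mask(seq_len, mask_prob, rng):
    """Pick which positions are selected for masking."""
    return [rng.random() < mask_prob for _ in range(seq_len)]

seq_len, mask_prob, n_epochs = 20, 0.15, 4

# Static: the mask is drawn once during preprocessing and reused every epoch
rng = random.Random(0)
static = sample_mask(seq_len, mask_prob, rng)
static_masks = [static for _ in range(n_epochs)]

# Dynamic: a fresh mask is drawn each time the sequence is fed to the model
rng = random.Random(0)
dynamic_masks = [sample_mask(seq_len, mask_prob, rng) for _ in range(n_epochs)]

def show(masks):
    for epoch, m in enumerate(masks):
        print(f"  epoch {epoch}: " + "".join("M" if x else "." for x in m))

print("static:")
show(static_masks)
print("dynamic:")
show(dynamic_masks)
```

With dynamic masking the model gets to predict different tokens of the same sentence over time, which matters more the longer you train.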

ALBERT (2020, Google): "A Lite BERT." Reduced parameters through two tricks. First, cross-layer parameter sharing: all 12 transformer layers share the same weights. Conceptually this is like applying the same transformer block 12 times (similar to a recurrent computation). Second, factorized embeddings: separate the vocabulary embedding dimension (128) from the hidden dimension (768), with a small projection layer between them. ALBERT-xxlarge achieves BERT-large level performance with far fewer unique parameters.
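The savings from both tricks are simple arithmetic. A small sketch, assuming a 30,000-token vocabulary and ignoring the projection's bias term (the function name `embedding_params` is mine):

```python
def embedding_params(vocab_size, hidden, emb_dim=None):
    """Embedding parameters, optionally factorized through a smaller emb_dim."""
    if emb_dim is None:
        return vocab_size * hidden                  # BERT: V x H
    return vocab_size * emb_dim + emb_dim * hidden  # ALBERT: V x E + E x H

V, H, E = 30000, 768, 128
full = embedding_params(V, H)
factorized = embedding_params(V, H, E)
print(f"V x H:         {full:,}")
print(f"V x E + E x H: {factorized:,}  ({full / factorized:.1f}x smaller)")

# Cross-layer sharing: one block's weights are reused at every depth
per_layer = 7_087_872  # one BERT-base-sized encoder layer, biases and LayerNorm included
print(f"12 separate layers: {12 * per_layer:,}")
print(f"1 shared layer:     {per_layer:,}")
```

Keep in mind that sharing weights reduces the number of unique parameters, not the compute: the forward pass still runs the block 12 times.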

DistilBERT (2019, Hugging Face): knowledge distillation applied to BERT. A 6-layer student model is trained to match the 12-layer teacher's output distributions (not just the hard labels). 40% smaller, 60% faster, retains 97% of BERT's performance on downstream tasks. DistilBERT is the go-to choice when you need a language encoder in production and inference cost matters -- which is most of the time, honestly.
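"Matching the teacher's output distributions" translates to a soft-target loss. Below is a minimal sketch of that term: a KL divergence between temperature-softened teacher and student distributions. The temperature of 2.0 is an arbitrary assumption, and the actual DistilBERT training objective combines a term like this with additional losses, so treat this as the core idea only.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence between temperature-softened teacher and student distributions."""
    t = temperature
    student_log_probs = F.log_softmax(student_logits / t, dim=-1)
    teacher_probs = F.softmax(teacher_logits / t, dim=-1)
    # The t**2 factor keeps gradient magnitudes comparable across temperatures
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * t * t

torch.manual_seed(0)
teacher_logits = torch.randn(4, 10)
student_logits = torch.randn(4, 10)

print(f"random student:  {distillation_loss(student_logits, teacher_logits).item():.4f}")
print(f"perfect student: {distillation_loss(teacher_logits, teacher_logits).item():.4f}")
```

Soft targets carry more information than hard labels: the teacher's probabilities over wrong-but-plausible tokens tell the student which mistakes are "almost right".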

ELECTRA (2020, Google): a fundamentally different pre-training approach. Instead of training the final model to fill in masks, ELECTRA uses a small generator network to produce corrupted tokens and a discriminator network to detect which tokens were replaced. The discriminator (which becomes the final model) learns from every token position, not just the 15% that were masked. This makes ELECTRA significantly more sample-efficient -- it reaches BERT-level performance with roughly 1/4 the compute.

import torch
import torch.nn as nn

def electra_pretraining_step(original_tokens, generator, discriminator,
                              mask_prob=0.15, vocab_size=30000, mask_id=103):
    """ELECTRA: generator creates fakes, discriminator detects them."""
    batch_size, seq_len = original_tokens.shape

    # Step 1: mask some tokens (same as BERT)
    mask = torch.rand(batch_size, seq_len, device=original_tokens.device) < mask_prob
    masked_tokens = original_tokens.clone()
    masked_tokens[mask] = mask_id

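    # Step 2: generator proposes replacements at masked positions.
    # ELECTRA samples from the generator's distribution rather than taking the
    # argmax, so the discriminator sees varied, plausible fakes.
    gen_logits = generator(masked_tokens)
    gen_preds = torch.distributions.Categorical(logits=gen_logits).sample()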

    # Step 3: create "corrupted" input (original with generator's guesses)
    corrupted = original_tokens.clone()
    corrupted[mask] = gen_preds[mask]

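    # Step 4: discriminator predicts which tokens were replaced.
    # If the generator happens to produce the original token, that position
    # counts as "not replaced" (label 0).
    disc_logits = discriminator(corrupted)  # (batch, seq_len, 2)
    is_replaced = (corrupted != original_tokens).long()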

    # Key difference: discriminator loss at EVERY position
    disc_loss = nn.CrossEntropyLoss()(
        disc_logits.view(-1, 2), is_replaced.view(-1))

    # Generator loss: MLM at masked positions only
    gen_loss = nn.CrossEntropyLoss()(
        gen_logits[mask].view(-1, vocab_size),
        original_tokens[mask].view(-1))

    print(f"Generator loss (masked only):     {gen_loss.item():.4f}")
    print(f"Discriminator loss (ALL tokens):  {disc_loss.item():.4f}")
    print(f"Tokens contributing to gen loss:  {mask.sum().item()}")
    print(f"Tokens contributing to disc loss: {mask.numel()}")
    return gen_loss, disc_loss

The efficiency gain is substantial. BERT learns from ~15% of tokens per example (the masked ones). ELECTRA's discriminator learns from 100% of tokens, so roughly 1 / 0.15 ≈ 6.7x more positions contribute to the loss for each example. Each of those positions carries a simpler binary signal than a full vocabulary prediction, but the denser supervision helps explain why ELECTRA outperforms BERT with much less compute.
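In training, the two losses are typically added into one objective and optimized jointly. The sketch below is a self-contained toy, not ELECTRA itself: the generator and discriminator are hypothetical stand-ins (an embedding plus a linear head in place of transformers), all sizes are made up, and the discriminator weight of 50 follows the value reported in the ELECTRA paper but should be treated as a tunable assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
vocab_size, d_model, mask_id = 100, 16, 0   # toy values, not ELECTRA's real sizes

# Hypothetical stand-ins for the two networks
generator = nn.Sequential(nn.Embedding(vocab_size, d_model),
                          nn.Linear(d_model, vocab_size))
discriminator = nn.Sequential(nn.Embedding(vocab_size, d_model),
                              nn.Linear(d_model, 2))

tokens = torch.randint(1, vocab_size, (4, 20))
mask = torch.rand(tokens.shape) < 0.15
mask[:, 0] = True   # guarantee at least one masked position in this toy batch

# Generator sees [MASK] at the chosen positions and proposes replacements
masked = tokens.clone()
masked[mask] = mask_id
gen_logits = generator(masked)
sampled = torch.distributions.Categorical(logits=gen_logits).sample()

# Corrupted input: generator samples at masked positions, originals elsewhere
corrupted = torch.where(mask, sampled, tokens)
is_replaced = (corrupted != tokens).long()

# Generator loss uses masked positions only; discriminator loss uses every position
gen_loss = F.cross_entropy(gen_logits[mask], tokens[mask])
disc_logits = discriminator(corrupted)
disc_loss = F.cross_entropy(disc_logits.reshape(-1, 2), is_replaced.reshape(-1))

disc_weight = 50.0  # assumption: weight reported in the ELECTRA paper
total_loss = gen_loss + disc_weight * disc_loss
total_loss.backward()

print(f"Positions in generator loss:     {int(mask.sum())}")
print(f"Positions in discriminator loss: {mask.numel()}")
print(f"Combined loss: {total_loss.item():.4f}")
```

Sampling is not differentiable, so no gradient flows from the discriminator back into the generator. Each network is trained only by its own loss term, even though both terms sit in one sum.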

Encoder vs decoder: when to use which

This is the practical question that matters most when you're actually building something. The landscape as of 2026:

Encoder models (BERT, RoBERTa, DistilBERT) are best for:

Classification -- sentiment, topic, spam detection, content moderation
Token classification -- named entity recognition, POS tagging
Extractive QA -- finding answer spans in a passage
Sentence embeddings -- converting text to vectors for search and similarity
Any task where the output is a label or a selection from the input

Decoder models (GPT, LLaMA, Mistral) are best for:

Text generation -- stories, code, conversations, translations
Open-ended QA -- generating answers not limited to the input
Instruction following -- general-purpose assistants
Any task where the output is new text

Encoder-decoder models (T5, BART) sit in between:

Summarization -- compress long text into short summaries
Translation -- structured input-to-output mapping
Tasks naturally framed as "input text -> output text"

# Decision tree for choosing a model type
print("=== Which model type do I need? ===\n")
print("Q: Does my task require generating NEW text?")
print("  YES -> Use a decoder model (GPT, LLaMA, Mistral)")
print("  NO  -> Continue...\n")
print("Q: Is my output a label, a score, or a span from the input?")
print("  YES -> Use an encoder model (BERT, RoBERTa, DistilBERT)")
print("  NO  -> Continue...\n")
print("Q: Is my task 'input text -> output text' (summarization, translation)?")
print("  YES -> Use an encoder-decoder (T5, BART)")
print("  NO  -> An encoder model probably works. Try BERT first.\n")
print("Q: But GPT-4 can do classification too, right?")
print("  YES, but a fine-tuned DistilBERT is 100x cheaper to run")
print("  and often just as accurate for specific tasks.")

Now, I want to be honest about the current state of things. In practice, the decoder-only approach has mostly won the generality race. Modern LLMs can do classification (just prompt "Is this positive or negative?"), NER (just ask "What entities are mentioned?"), and QA (just ask the question directly). They can do these things without any fine-tuning at all -- just in-context learning, which we explored in the previous episode.

But -- and this is a big but -- efficiency matters enormously in production. A fine-tuned 66-million parameter DistilBERT processes a sentiment classification request in ~2ms on a CPU. A 70-billion parameter LLM doing the same task through prompting takes ~500ms on an expensive GPU. If you're processing 10 million reviews per day, the DistilBERT approach costs a few dollars. The LLM approach costs thousands. For specific, well-defined tasks with labeled data available, encoder models remain the economically rational choice ;-)
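The gap is easier to appreciate as device-hours. The arithmetic below uses only the latency and volume figures quoted above. It assumes requests are processed one at a time with no batching and deliberately leaves out hourly prices, since those vary widely by provider.

```python
requests_per_day = 10_000_000

# Per-request latencies quoted in the paragraph above (seconds)
latency_s = {
    "DistilBERT on CPU": 0.002,
    "70B LLM on GPU": 0.500,
}

# Total compute time per day, assuming strictly sequential processing
device_hours = {
    name: requests_per_day * seconds / 3600
    for name, seconds in latency_s.items()
}

for name, hours in device_hours.items():
    print(f"{name:>20}: {hours:,.1f} device-hours per day")

ratio = device_hours["70B LLM on GPU"] / device_hours["DistilBERT on CPU"]
print(f"{'Ratio':>20}: {ratio:.0f}x")
```

That is about 5.6 CPU-hours versus about 1,389 GPU-hours per day, a 250x difference in device time before accounting for the fact that a GPU-hour costs considerably more than a CPU-hour.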

BERT-style models also remain a standard choice for text embeddings -- converting text into dense vectors for semantic search, similarity matching, and retrieval systems. Sentence-BERT and many of the embedding models on Hugging Face are encoder-based. Because every token attends to the full sentence, bidirectional context tends to produce strong sentence representations for this purpose, and we'll explore embeddings and vector search in more detail later in the series.
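A common way to turn per-token encoder outputs into one sentence vector is mean pooling over the non-padding positions, followed by cosine similarity for comparison. The sketch below uses random tensors as a stand-in for real encoder outputs, so the similarity value itself is meaningless; `mean_pool` is a hypothetical helper name and the shapes are assumptions.

```python
import torch
import torch.nn.functional as F

def mean_pool(token_embeddings, attention_mask):
    """Average token vectors over real (non-padding) positions."""
    mask = attention_mask.unsqueeze(-1).to(token_embeddings.dtype)  # (B, T, 1)
    summed = (token_embeddings * mask).sum(dim=1)                   # (B, D)
    counts = mask.sum(dim=1).clamp(min=1.0)                         # avoid divide-by-zero
    return summed / counts

torch.manual_seed(0)
# Stand-in for encoder outputs: 2 sentences, 6 token positions, 32 dimensions
token_embeddings = torch.randn(2, 6, 32)
# First sentence has 4 real tokens and 2 padding positions
attention_mask = torch.tensor([[1, 1, 1, 1, 0, 0],
                               [1, 1, 1, 1, 1, 1]])

sentence_vectors = mean_pool(token_embeddings, attention_mask)
similarity = F.cosine_similarity(sentence_vectors[0], sentence_vectors[1], dim=0)

print(f"Sentence vector shape: {tuple(sentence_vectors.shape)}")
print(f"Cosine similarity:     {similarity.item():.4f}")
```

Masking out padding matters: without it, short sentences in a padded batch would have their vectors diluted by padding positions.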

What BERT taught us

BERT's lasting contribution isn't the specific architecture (which is just a transformer encoder, nothing new) or the specific training objective (masked language modeling was inspired by older techniques like the Cloze test from the 1950s). Its lasting contribution is the pre-training + fine-tuning paradigm itself.

Before BERT, the workflow for a new NLP task was: collect labeled data, design a task-specific architecture, train from scratch, evaluate, iterate. Each task was an isolated engineering effort.

After BERT, the workflow became: download a pre-trained model (5 seconds), add a small head, fine-tune on your labeled data (a few hours), deploy. The pre-trained model did the heavy lifting -- it already understood language, word relationships, syntax, semantics, and a lot of world knowledge. Fine-tuning just aligned that understanding with your specific task.

This paradigm now dominates deep learning well beyond NLP. In computer vision, you fine-tune pre-trained ViTs (we covered this in episode #54). In audio, you fine-tune pre-trained Whisper models. In robotics, you fine-tune foundation models. BERT didn't invent transfer learning (we saw early forms of it back when we discussed feature extraction in episode #46), but it demonstrated its power so convincingly that the entire field pivoted almost overnight. That's quite remarkable for a model that's "just" a transformer encoder with no causal mask.

The bottom line

BERT uses masked language modeling: randomly mask 15% of tokens (80% [MASK], 10% random, 10% unchanged) and predict the originals. This allows bidirectional attention because the model can't cheat by looking at the token it's supposed to predict;
Bidirectional context produces richer representations for understanding tasks than left-to-right models like GPT, because every token aggregates information from both directions;
The pre-training + fine-tuning paradigm changed NLP: pre-train once on a massive unlabeled corpus, then adapt cheaply to dozens of downstream tasks by adding small task-specific heads;
The [CLS] token serves as a sequence-level representation for classification tasks; individual token outputs serve for token-level tasks like NER and extractive QA;
RoBERTa showed that the training recipe matters more than the architecture (same model, better training, much better results). DistilBERT compressed BERT to 40% smaller / 60% faster while keeping 97% of its downstream performance. ELECTRA improved sample efficiency by training on all tokens, not just 15%;
Decoder-only LLMs can handle most BERT tasks through prompting, but encoder models remain the rational choice for production systems where cost and latency matter -- a fine-tuned DistilBERT is 100x cheaper to run than prompting a large LLM for classification.

Exercises

Exercise 1: Build a complete BERT-style masked language model and train it on a small text corpus. Implement the BERTEmbedding and BERTModel classes from this episode (use d_model=128, 4 heads, 4 layers, d_ff=512, vocab_size=5000, max_len=128). Create a simple word-level vocabulary from your training text. Implement the apply_bert_masking function with the 80/10/10 split. Train for 50 epochs on sequences of length 32, using CrossEntropyLoss on only the masked positions (labels=-100 means ignore). Print the training loss every 10 epochs. After training, mask a single word in a test sentence and check the model's top-5 predictions -- does it predict reasonable words?

Exercise 2: Compare bidirectional vs causal representations for a fill-in-the-blank task. Build two small transformer encoders with identical architectures (d_model=128, 4 layers, 4 heads). Train both on the same text data with MLM, but one uses bidirectional attention (no mask -- BERT-style) and the other uses causal attention (lower-triangular mask -- GPT-style, but still trained with MLM). After training, mask the same word in 20 test sentences and compare: (a) how often the correct word appears in each model's top-5 predictions, and (b) the average rank of the correct word. The bidirectional model should perform better because it can use both left and right context to resolve ambiguity.

Exercise 3: Implement fine-tuning for text classification. Using your pre-trained BERT model from Exercise 1, add a classification head on top of the [CLS] token output. Create a simple synthetic classification dataset (e.g., sentences containing "good/great/excellent" are positive, sentences containing "bad/terrible/awful" are negative). Fine-tune the full model (BERT encoder + classification head) for 20 epochs. Then freeze the BERT encoder and train only the classification head for another 20 epochs with a fresh head. Compare the two approaches: which reaches higher accuracy, and how fast? Print per-epoch accuracy for both. The full fine-tuning approach should converge faster because the encoder can adapt its representations.

Thanks, and until next time!

Hive account: @scipio
