Learn AI Series (#61) - Instruction Tuning and Alignment
What will I learn
- You will learn the gap between a pre-trained language model and a useful assistant;
- supervised fine-tuning (SFT) on instruction-response pairs;
- Reinforcement Learning from Human Feedback (RLHF) -- training models to follow human preferences;
- Direct Preference Optimization (DPO) -- achieving RLHF results without reinforcement learning;
- Constitutional AI -- self-supervised alignment through principles;
- the alignment tax: capability vs safety tradeoffs.
Requirements
- A working modern computer running macOS, Windows or Ubuntu;
- An installed Python 3(.11+) distribution;
- The ambition to learn AI and machine learning.
Difficulty
- Beginner
Curriculum (of the Learn AI Series):
- Learn AI Series (#1) - What Machine Learning Actually Is
- Learn AI Series (#2) - Setting Up Your AI Workbench - Python and NumPy
- Learn AI Series (#3) - Your Data Is Just Numbers - How Machines See the World
- Learn AI Series (#4) - Your First Prediction - No Math, Just Intuition
- Learn AI Series (#5) - Patterns in Data - What "Learning" Actually Looks Like
- Learn AI Series (#6) - From Intuition to Math - Why We Need Formulas
- Learn AI Series (#7) - The Training Loop - See It Work Step by Step
- Learn AI Series (#8) - The Math You Actually Need (Part 1) - Linear Algebra
- Learn AI Series (#9) - The Math You Actually Need (Part 2) - Calculus and Probability
- Learn AI Series (#10) - Your First ML Model - Linear Regression From Scratch
- Learn AI Series (#11) - Making Linear Regression Real
- Learn AI Series (#12) - Classification - Logistic Regression From Scratch
- Learn AI Series (#13) - Evaluation - How to Know If Your Model Actually Works
- Learn AI Series (#14) - Data Preparation - The 80% Nobody Talks About
- Learn AI Series (#15) - Feature Engineering and Selection
- Learn AI Series (#16) - Scikit-Learn - The Standard Library of ML
- Learn AI Series (#17) - Decision Trees - How Machines Make Decisions
- Learn AI Series (#18) - Random Forests - Wisdom of Crowds
- Learn AI Series (#19) - Gradient Boosting - The Kaggle Champion
- Learn AI Series (#20) - Support Vector Machines - Drawing the Perfect Boundary
- Learn AI Series (#21) - Mini Project - Predicting Crypto Market Regimes
- Learn AI Series (#22) - K-Means Clustering - Finding Groups
- Learn AI Series (#23) - Advanced Clustering - Beyond K-Means
- Learn AI Series (#24) - Dimensionality Reduction - PCA
- Learn AI Series (#25) - Advanced Dimensionality Reduction - t-SNE and UMAP
- Learn AI Series (#26) - Anomaly Detection - Finding What Doesn't Belong
- Learn AI Series (#27) - Recommendation Systems - "Users Like You Also Liked..."
- Learn AI Series (#28) - Time Series Fundamentals - When Order Matters
- Learn AI Series (#29) - Time Series Forecasting - Predicting What Comes Next
- Learn AI Series (#30) - Natural Language Processing - Text as Data
- Learn AI Series (#31) - Word Embeddings - Meaning in Numbers
- Learn AI Series (#32) - Bayesian Methods - Thinking in Probabilities
- Learn AI Series (#33) - Ensemble Methods Deep Dive - Stacking and Blending
- Learn AI Series (#34) - ML Engineering - From Notebook to Production
- Learn AI Series (#35) - Data Ethics and Bias in ML
- Learn AI Series (#36) - Mini Project - Complete ML Pipeline
- Learn AI Series (#37) - The Perceptron - Where It All Started
- Learn AI Series (#38) - Neural Networks From Scratch - Forward Pass
- Learn AI Series (#39) - Neural Networks From Scratch - Backpropagation
- Learn AI Series (#40) - Training Neural Networks - Practical Challenges
- Learn AI Series (#41) - Optimization Algorithms - SGD, Momentum, Adam
- Learn AI Series (#42) - PyTorch Fundamentals - Tensors and Autograd
- Learn AI Series (#43) - PyTorch Data and Training
- Learn AI Series (#44) - PyTorch nn.Module - Building Real Networks
- Learn AI Series (#45) - Convolutional Neural Networks - Theory
- Learn AI Series (#46) - CNNs in Practice - Classic to Modern Architectures
- Learn AI Series (#47) - CNN Applications - Detection, Segmentation, Style Transfer
- Learn AI Series (#48) - Recurrent Neural Networks - Sequences
- Learn AI Series (#49) - LSTM and GRU - Solving the Memory Problem
- Learn AI Series (#50) - Sequence-to-Sequence Models
- Learn AI Series (#51) - Attention Mechanisms
- Learn AI Series (#52) - The Transformer Architecture (Part 1)
- Learn AI Series (#53) - The Transformer Architecture (Part 2)
- Learn AI Series (#54) - Vision Transformers
- Learn AI Series (#55) - Generative Adversarial Networks
- Learn AI Series (#56) - Mini Project - Building a Transformer From Scratch
- Learn AI Series (#57) - Language Modeling - Predicting the Next Word
- Learn AI Series (#58) - GPT Architecture - Decoder-Only Transformers
- Learn AI Series (#59) - BERT and Encoder Models
- Learn AI Series (#60) - Training Large Language Models
- Learn AI Series (#61) - Instruction Tuning and Alignment (this post)
Solutions to Episode #60 Exercises
Exercise 1: Build a data quality classifier that distinguishes high-quality from low-quality text.
import torch
import torch.nn as nn
import torch.nn.functional as F
import random
import math

random.seed(42)
torch.manual_seed(42)

def make_high_quality():
    templates = [
        "The transformer architecture processes input sequences using "
        "multi-head self-attention, allowing each position to attend to "
        "all other positions in the sequence simultaneously.",
        "Natural language processing has advanced significantly since the "
        "introduction of pre-trained language models. These models learn "
        "rich representations from large text corpora.",
        "Gradient descent optimizes the loss function by computing partial "
        "derivatives with respect to each parameter and updating weights "
        "in the direction of steepest descent.",
        "Convolutional neural networks extract hierarchical features from "
        "images by applying learned filters at multiple scales. Each layer "
        "captures increasingly abstract patterns.",
        "Reinforcement learning agents learn optimal policies by interacting "
        "with an environment and receiving reward signals that indicate "
        "the quality of their actions over time.",
    ]
    t = random.choice(templates)
    words = t.split()
    start = random.randint(0, max(0, len(words) - 20))
    return ' '.join(words[start:start + random.randint(15, len(words) - start)])

def make_low_quality():
    kind = random.choice(["spam", "repetitive", "boilerplate", "salad"])
    if kind == "spam":
        return "BUY NOW!!! BEST DEAL!! Click here!! Limited offer!! " * 3
    elif kind == "repetitive":
        word = random.choice(["hello", "test", "spam", "buy"])
        return (word + " ") * random.randint(20, 40)
    elif kind == "boilerplate":
        return ("Cookie Policy | Privacy Policy | Terms of Service | "
                "All Rights Reserved | Subscribe to our newsletter | "
                "Click here to accept cookies")
    else:
        vocab = ["xyz", "qqq", "!!!", "...", "###", "aaa", "bbb"]
        return ' '.join(random.choices(vocab, k=25))

def extract_features(text):
    words = text.split()
    if len(words) == 0:
        return [0.0] * 6
    chars = list(text)
    avg_word_len = sum(len(w) for w in words) / len(words)
    unique_ratio = len(set(words)) / len(words)
    punct_count = sum(1 for c in chars if c in '!?.,:;')
    punct_density = punct_count / len(chars) if chars else 0
    upper_ratio = sum(1 for c in chars if c.isupper()) / len(chars)
    line_breaks = text.count('\n') / (len(chars) + 1)
    word_count_norm = min(len(words) / 50.0, 1.0)
    return [avg_word_len / 10.0, unique_ratio, punct_density,
            upper_ratio, line_breaks, word_count_norm]

# Build dataset
X_train, y_train = [], []
for _ in range(250):
    X_train.append(extract_features(make_high_quality()))
    y_train.append(1)
    X_train.append(extract_features(make_low_quality()))
    y_train.append(0)

X_test, y_test = [], []
for _ in range(50):
    X_test.append(extract_features(make_high_quality()))
    y_test.append(1)
    X_test.append(extract_features(make_low_quality()))
    y_test.append(0)

X_tr = torch.tensor(X_train, dtype=torch.float32)
y_tr = torch.tensor(y_train, dtype=torch.long)
X_te = torch.tensor(X_test, dtype=torch.float32)
y_te = torch.tensor(y_test, dtype=torch.long)

model = nn.Sequential(nn.Linear(6, 32), nn.ReLU(), nn.Linear(32, 2))
opt = torch.optim.Adam(model.parameters(), lr=0.01)

for epoch in range(30):
    logits = model(X_tr)
    loss = F.cross_entropy(logits, y_tr)
    opt.zero_grad()
    loss.backward()
    opt.step()
    if (epoch + 1) % 10 == 0:
        acc = (model(X_te).argmax(1) == y_te).float().mean()
        print(f"Epoch {epoch+1}: loss={loss.item():.3f}, test_acc={acc:.3f}")

# Score 10 new snippets
snippets = [
    "Machine learning models require careful hyperparameter tuning.",
    "BUY CHEAP LAPTOPS NOW!!! CLICK HERE!!!",
    "The attention mechanism computes weighted sums of value vectors.",
    "test test test test test test test test test test",
    "Neural networks learn hierarchical feature representations.",
    "Cookie Policy Privacy Terms Subscribe Click Accept",
    "Backpropagation computes gradients through the chain rule.",
    "aaa bbb ccc !!! ??? ### $$$ &&& @@@ ~~~",
    "Transfer learning allows models pre-trained on large corpora.",
    "!!!!! AMAZING DEAL BUY NOW SALE SALE SALE !!!!!",
]

model.eval()
print("\nQuality scores:")
for s in snippets:
    feats = torch.tensor([extract_features(s)], dtype=torch.float32)
    with torch.no_grad():
        score = F.softmax(model(feats), dim=1)[0, 1].item()
    label = "HIGH" if score > 0.5 else "LOW"
    print(f" [{label}] {score:.3f}: '{s[:55]}...'")
The classifier picks up on features that distinguish real writing from spam: unique word ratio (spam repeats), punctuation density (spam overuses !), and average word length (real writing uses varied, longer words). Test accuracy should be high (>90%) because the quality signals are strong even with simple features. Real production classifiers use TF-IDF or small language models, but the principle is the same.
Exercise 2: Gradient accumulation comparison.
import torch
import torch.nn as nn
import torch.nn.functional as F
import math

torch.manual_seed(42)

class TinyTransformer(nn.Module):
    def __init__(self, vocab_size=100, d_model=128, n_heads=4,
                 n_layers=2, max_len=64):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, d_model)
        self.pos = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, d_model * 4, batch_first=True)
        self.enc = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, vocab_size)
        self.max_len = max_len

    def forward(self, x):
        T = x.size(1)
        mask = nn.Transformer.generate_square_subsequent_mask(T, device=x.device)
        h = self.tok(x) + self.pos(torch.arange(T, device=x.device))
        h = self.enc(h, mask=mask)
        return self.head(h)

# Synthetic data
data = torch.randint(0, 100, (10000,))
seq_len = 32

def get_batch(batch_size):
    ix = torch.randint(len(data) - seq_len - 1, (batch_size,))
    x = torch.stack([data[i:i+seq_len] for i in ix])
    y = torch.stack([data[i+1:i+seq_len+1] for i in ix])
    return x, y

# Model A: batch=128, no accumulation
torch.manual_seed(42)
model_a = TinyTransformer()
opt_a = torch.optim.AdamW(model_a.parameters(), lr=3e-4)
fwd_a = 0
for step in range(500):
    x, y = get_batch(128)
    logits = model_a(x)
    loss = F.cross_entropy(logits.view(-1, 100), y.view(-1))
    opt_a.zero_grad()
    loss.backward()
    opt_a.step()
    fwd_a += 1
loss_a = loss.item()

# Model B: batch=16, accumulate 8 steps
torch.manual_seed(42)
model_b = TinyTransformer()
opt_b = torch.optim.AdamW(model_b.parameters(), lr=3e-4)
fwd_b = 0
for step in range(500):
    opt_b.zero_grad()
    for acc_step in range(8):
        x, y = get_batch(16)
        logits = model_b(x)
        loss = F.cross_entropy(logits.view(-1, 100), y.view(-1))
        # Divide by the number of accumulation steps so the summed
        # gradients equal the gradient of the mean over all 128 examples
        (loss / 8).backward()
        fwd_b += 1
    opt_b.step()
loss_b = loss.item()

mem_a = 128 * seq_len  # proportional to batch activations
mem_b = 16 * seq_len

# Side-by-side comparison: one column per model
print(f"{'Metric':<26} {'Model A':>12} {'Model B':>12}")
print("-" * 52)
print(f"{'Actual batch size':<26} {'128':>12} {'16':>12}")
print(f"{'Accumulation steps':<26} {'1':>12} {'8':>12}")
print(f"{'Effective batch':<26} {'128':>12} {'128':>12}")
print(f"{'Final loss':<26} {loss_a:>12.4f} {loss_b:>12.4f}")
print(f"{'Total forward passes':<26} {fwd_a:>12} {fwd_b:>12}")
print(f"{'Peak memory (relative)':<26} {mem_a:>12} {mem_b:>12}")
print(f"{'Memory reduction':<26} {'--':>12} {f'{mem_a/mem_b:.0f}x smaller':>12}")
Both models end up at a similar final loss because they use the same effective batch size (128). The key difference: model B uses 8x less memory per forward pass but requires 8x more forward passes total. This is exactly the tradeoff in practice -- gradient accumulation lets you simulate large batches when GPU memory is the bottleneck.
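The equivalence behind that claim can be checked directly. The following is a minimal sketch, not the exercise's transformer: the linear model and random data are illustrative stand-ins, chosen because they have no dropout, so the two gradients should agree up to floating-point error. For a loss that is a mean over examples, eight micro-batches of 16 with each loss divided by 8 should reproduce the gradient of one batch of 128.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Illustrative stand-in model and data (not the exercise's transformer)
model = nn.Linear(10, 1)
x = torch.randn(128, 10)
y = torch.randn(128, 1)

# One full batch of 128
model.zero_grad()
F.mse_loss(model(x), y).backward()
full_grad = model.weight.grad.clone()

# 8 micro-batches of 16; dividing each loss by 8 turns the sum into a mean
model.zero_grad()
for xb, yb in zip(x.chunk(8), y.chunk(8)):
    (F.mse_loss(model(xb), yb) / 8).backward()
accum_grad = model.weight.grad.clone()

max_diff = (full_grad - accum_grad).abs().max().item()
grads_match = torch.allclose(full_grad, accum_grad, atol=1e-5)
print(f"max abs difference: {max_diff:.2e}, match: {grads_match}")
```

Forgetting the division by 8 is the usual mistake here: the accumulated gradient would then be 8x too large, which acts like an 8x higher learning rate.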
Exercise 3: Pipeline parallelism simulation.
def pipeline_schedule(n_stages, n_microbatches):
    total_time = n_stages + n_microbatches - 1
    active = 0
    total = 0
    print(f"\nPipeline: {n_stages} stages, {n_microbatches} micro-batches")
    print(f"{'Time':>5}", end="")
    for s in range(n_stages):
        print(f" GPU{s:>2}", end="")
    print()
    for t in range(total_time):
        print(f"{t:>5}", end="")
        for s in range(n_stages):
            # Stage s works on micro-batch (t - s) at time t, if it exists
            mb = t - s
            total += 1
            if 0 <= mb < n_microbatches:
                print(f" mb{mb:>1}", end="")
                active += 1
            else:
                print(" --", end="")
        print()
    util = active / total
    idle = total - active
    expected = n_microbatches / (n_stages + n_microbatches - 1)
    return n_stages, n_microbatches, total_time, active, idle, util, expected

configs = [
    (4, 4), (4, 8), (4, 16), (8, 16),
]

# Run all schedules first so the summary table is not interleaved
# with the per-schedule printouts
results = [pipeline_schedule(n_s, n_mb) for n_s, n_mb in configs]

print(f"\n{'Stages':>7} {'MBatch':>7} {'Steps':>7} {'Active':>7} "
      f"{'Idle':>7} {'Util%':>7} {'Formula':>8}")
print("-" * 58)
for n_s, n_mb, steps, act, idle, util, exp in results:
    match = "MATCH" if abs(util - exp) < 0.001 else "MISMATCH"
    print(f"{n_s:>7} {n_mb:>7} {steps:>7} {act:>7} "
          f"{idle:>7} {util:>6.1%} {exp:>7.1%} {match}")
The formula utilization = n_microbatches / (n_stages + n_microbatches - 1) holds exactly for all configurations. More micro-batches means higher utilization: 4/7 = 57% with 4 micro-batches jumps to 16/19 = 84% with 16. But adding more stages (8 stages, 16 micro-batches = 16/23 = 70%) shows that deeper pipelines need proportionally more micro-batches to maintain high utilization. This is why production training runs use large batch sizes split into many micro-batches -- it's not just for gradient statistics, it's for pipeline efficiency.
On to today's episode
Here we go! In episode #58 we built the GPT architecture -- the decoder-only transformer that predicts the next token. In episode #59 we explored BERT and encoder models, the other side of the coin. And in episode #60 we covered the entire industrial pipeline for training LLMs at scale -- the data, the distributed computing, the mind-boggling compute budgets.
So now you have a pre-trained model. It's been trained on trillions of tokens, it's seen the entire internet (more or less), and it's excellent at one thing: predicting what text comes next. Ask it "What is the capital of France?" and it might continue with "... is a common geography question found in many textbooks" instead of just answering "Paris."
This is not a bug. The model is doing exactly what it was trained to do -- predict the most likely continuation of text on the internet. And on the internet, questions are often followed by more discussion, not concise answers.
The gap between "predicts text" and "follows instructions" is the subject of this episode. Bridging that gap is what transforms a raw language model into something you'd actually want to use. And in my opinion, this is one of the most elegant pieces of the entire modern AI stack -- because it turns out you can teach a model to be helpful with surprisingly little additional training, relative to the massive pre-training cost ;-)
The problem: completion vs conversation
Let me be concrete about what the problem looks like. Here's a pre-trained GPT model doing what it was trained to do:
# What a pre-trained model does (next-token prediction)
prompt_and_completions = {
    "What is the capital of France?": [
        "What is the capital of France? It is a question that...",
        "What is the capital of France? I need this for my homework...",
        "What is the capital of France?\nThe capital of France is Paris.",
        # ^-- this one is correct but the model doesn't prefer it
    ],
    "Write a Python function to reverse a string.": [
        "Write a Python function to reverse a string.\n\n"
        "This is a common interview question that tests...",
        "Write a Python function to reverse a string.\n"
        "There are many ways to do this in Python...",
    ],
    "Explain quantum entanglement simply.": [
        "Explain quantum entanglement simply.\n\n"
        "Quantum entanglement is one of the most discussed...",
    ],
}

for prompt, completions in prompt_and_completions.items():
    print(f"Prompt: '{prompt}'")
    for i, c in enumerate(completions):
        print(f" Completion {i}: '{c[:70]}...'")
    print()

print("Notice: the model continues the text, it doesn't ANSWER the question.")
print("This is correct behavior for a next-token predictor!")
print("The internet has more discussion-about-questions than direct answers.")
The model "knows" the answer -- it has seen enough text containing "The capital of France is Paris" to have that knowledge encoded in its weights. The problem is that it doesn't know the format you want. It's been trained to predict what comes next on the internet, not to be a helpful assistant.
Three stages fix this: Supervised Fine-Tuning (SFT), Reward Modeling, and Reinforcement Learning from Human Feedback (RLHF). Together they transform the raw predictor into something that actually follows instructions and produces useful responses. Having said that, there are also newer approaches (like DPO) that achieve similar results more efficiently, and we'll cover those too.
Stage 1: Supervised Fine-Tuning (SFT)
The first stage is conceptually simple: show the model thousands of examples of "human asks something, assistant responds helpfully" and fine-tune on those. Standard supervised learning -- the same training loop we've been using since episode #7.
# SFT training data format
sft_examples = [
    {
        "instruction": "What is the capital of France?",
        "response": "The capital of France is Paris."
    },
    {
        "instruction": "Write a Python function that checks if a number is prime.",
        "response": (
            "def is_prime(n):\n"
            "    if n < 2:\n"
            "        return False\n"
            "    for i in range(2, int(n**0.5) + 1):\n"
            "        if n % i == 0:\n"
            "            return False\n"
            "    return True"
        )
    },
    {
        "instruction": "Explain quantum entanglement to a 10-year-old.",
        "response": "Imagine you have two magic coins..."
    }
]

# The actual training format uses chat templates, for example:
# <|user|>\nWhat is the capital of France?\n<|assistant|>\nThe capital is Paris.
# (the role markers shown here are illustrative; each model family defines its own)
# During training, the loss is computed ONLY on the response tokens.
# The instruction tokens are in the context but masked from the loss.

def format_sft_example(example, tokenizer=None):
    """Format an instruction-response pair for SFT training."""
    text = (f"<|user|>\n{example['instruction']}\n"
            f"<|assistant|>\n{example['response']}")
    # In real training:
    # tokens = tokenizer.encode(text)
    # labels = tokens.copy()
    # labels[:instruction_end] = [-100] * instruction_end  # mask instruction
    return text

for ex in sft_examples:
    print(format_sft_example(ex))
    print("---")
The model is fine-tuned with the standard language modeling objective (next-token prediction), but only the response tokens contribute to the loss. The instruction tokens are included in the context (the model reads them) but their loss is masked -- we don't want the model to learn to write instructions, only to respond to them.
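To make the masking concrete, here is a minimal sketch using PyTorch's `ignore_index` convention for `F.cross_entropy`. The token IDs, vocabulary size and random logits are made up for illustration and stand in for a real tokenizer and model. The next-token shift is done by hand here; many training libraries do this shift for you.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

vocab_size = 50
# Toy sequence: 4 instruction tokens followed by 3 response tokens (made-up IDs)
tokens = torch.tensor([[5, 12, 7, 3, 21, 9, 2]])
instruction_len = 4

# Next-token setup: position i predicts tokens[i + 1]
targets = tokens[:, 1:].clone()

# The first instruction_len - 1 targets are still instruction tokens,
# so they get the ignore index and contribute nothing to the loss
labels = targets.clone()
labels[:, :instruction_len - 1] = -100

# Stand-in for model output: random logits of shape (batch, seq - 1, vocab)
logits = torch.randn(1, tokens.size(1) - 1, vocab_size)

masked_loss = F.cross_entropy(
    logits.reshape(-1, vocab_size), labels.reshape(-1), ignore_index=-100)

# The same value, computed by hand over the response positions only
response_loss = F.cross_entropy(
    logits[:, instruction_len - 1:].reshape(-1, vocab_size),
    targets[:, instruction_len - 1:].reshape(-1))

n_contributing = (labels != -100).sum().item()
print(f"masked loss:         {masked_loss.item():.4f}")
print(f"response-only loss:  {response_loss.item():.4f}")
print(f"tokens contributing: {n_contributing}")
```

The two loss values agree because `ignore_index` drops the masked positions from both the sum and the averaging denominator.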
The data matters enormously here. OpenAI's InstructGPT paper (2022) used roughly 13,000 high-quality human-written demonstrations -- comparatively tiny next to the trillions of pre-training tokens. Meta's Llama 2-Chat stopped collecting SFT annotations at roughly 27,500 examples, after finding that tens of thousands of high-quality examples were enough. Quality trumps quantity by a wide margin: a small dataset of expert-written responses tends to outperform a large dataset of mediocre ones.
# Dataset quality experiment (conceptual)
datasets = {
    "13K expert-written (InstructGPT)": {
        "size": 13000,
        "quality": "Expert annotators, guidelines, review",
        "result": "Strong instruction following",
    },
    "1M crowd-sourced (noisy)": {
        "size": 1000000,
        "quality": "MTurk, minimal review, mixed quality",
        "result": "Follows format but quality inconsistent",
    },
    "52K GPT-4 generated (Alpaca-style)": {
        "size": 52000,
        "quality": "Machine-generated, filtered",
        "result": "Decent but ceiling limited by generator",
    },
}

print(f"{'Dataset':<40} {'Size':>9} {'Outcome':>35}")
print("-" * 86)
for name, info in datasets.items():
    print(f"{name:<40} {info['size']:>9,} {info['result']:>35}")

print("\nKey insight: 13K expert examples beat 1M noisy examples.")
print("The model already KNOWS language -- it just needs format guidance.")
After SFT, the model understands the instruction-following format and produces reasonable responses. But "reasonable" isn't the same as "good". The model doesn't know which of several valid responses a human would actually prefer. It might be too verbose, or too terse, or it might give a technically correct but unhelpful answer. It follows instructions, but it doesn't optimise for human satisfaction. That's where the next two stages come in.
Stage 2: Reward Modeling
To teach the model what "good" means in a way that goes beyond "grammatically correct and on-topic", you need a signal for quality. This is where human preferences enter the picture.
The process: present human raters with a prompt and two (or more) model responses. Ask them one simple question: "Which response is better?" Collect thousands of these pairwise comparisons. Then train a separate neural network -- the reward model -- that takes a (prompt, response) pair and outputs a scalar score predicting how much a human would prefer this response.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    """A reward model that scores (prompt, response) pairs.
    Built on top of the same architecture as the LLM."""
    def __init__(self, base_model, d_model):
        super().__init__()
        self.base = base_model  # same architecture as the LLM
        self.reward_head = nn.Linear(d_model, 1)

    def forward(self, input_ids):
        hidden = self.base(input_ids)  # (batch, seq, d_model)
        # Use the last token's hidden state as sequence representation
        last_hidden = hidden[:, -1, :]
        reward = self.reward_head(last_hidden)
        return reward  # scalar score per example, shape (batch, 1)

# Training uses the Bradley-Terry loss
def reward_loss(reward_chosen, reward_rejected):
    """The chosen response should score higher than the rejected one.
    Bradley-Terry model: P(chosen > rejected) = sigmoid(r_chosen - r_rejected)"""
    # logsigmoid is the numerically stable form of log(sigmoid(x))
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Example preference data
preferences = [
    {
        "prompt": "Explain gravity simply.",
        "chosen": "Gravity is a force that pulls objects toward each other. "
                  "The more massive an object, the stronger its pull.",
        "rejected": "Gravity is described by Einstein's general theory of "
                    "relativity as the curvature of spacetime caused by "
                    "mass-energy, governed by the field equations..."
    },
    {
        "prompt": "How do I make pasta?",
        "chosen": "Boil water, add salt, cook pasta 8-10 minutes until "
                  "al dente, drain, and add your sauce.",
        "rejected": "Pasta, derived from the Italian word for 'paste', "
                    "has a rich culinary history dating back to..."
    },
]

for pref in preferences:
    print(f"Prompt: {pref['prompt']}")
    print(f" Chosen: {pref['chosen'][:60]}...")
    print(f" Rejected: {pref['rejected'][:60]}...")
    print()
The Bradley-Terry loss is elegant in its simplicity: it says the probability that response A is preferred over response B equals sigmoid(reward_A - reward_B). So maximizing the log probability of the observed preferences means pushing the chosen response's score up and the rejected response's score down. The reward model learns relative preferences -- it doesn't need absolute quality ratings, just which one is better.
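A few numbers make the behaviour of this loss easier to see. The sketch below uses plain Python and made-up reward scores (assumptions for illustration, not output from a trained reward model). It shows the loss for a correct ranking, a tie and an inverted ranking, and checks that only the difference between the two scores matters.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def bt_loss(r_chosen, r_rejected):
    """Negative log-likelihood of the observed preference."""
    return -math.log(sigmoid(r_chosen - r_rejected))

# Hypothetical reward scores (made-up numbers for illustration)
cases = [
    ("chosen clearly ahead", 2.0, -1.0),
    ("tie", 0.5, 0.5),
    ("ranking inverted", -1.0, 2.0),
]

for name, r_c, r_r in cases:
    p = sigmoid(r_c - r_r)
    print(f"{name:<22} P(chosen wins)={p:.3f}  loss={bt_loss(r_c, r_r):.3f}")

# Only the difference matters: shifting both scores changes nothing
shift_invariant = math.isclose(bt_loss(2.0, -1.0), bt_loss(102.0, 99.0))
print(f"shift invariant: {shift_invariant}")
```

A tie costs exactly ln 2 (about 0.693), a correct ranking costs close to zero, and an inverted ranking is penalized heavily. The shift invariance is why reward scores have no meaningful absolute scale.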
In practice, human raters disagree a LOT. One person's "helpful" is another's "too verbose". What one rater considers appropriately cautious, another considers annoyingly evasive. The reward model learns the average preference, smoothing out individual variation. Careful rater selection, clear guidelines, and inter-annotator agreement checks are all critical to making this work. NB: the quality of the reward model directly limits the quality of the final aligned model -- garbage preferences in, garbage alignment out.
Stage 3: RLHF -- Reinforcement Learning from Human Feedback
With a reward model in hand, you can now optimise the language model to produce responses that score highly. This is where reinforcement learning enters the picture, and it previews concepts we'll explore in more depth later in the series.
The setup: the language model is the policy (an RL agent that takes actions -- generating tokens). The reward model provides the reward signal after the full response is generated. The optimization algorithm is PPO (Proximal Policy Optimization).
import random
import torch

def rlhf_training_step(policy_model, reward_model, ref_model,
                       prompts, kl_coeff=0.1):
    """Conceptual RLHF training step.
    policy_model: the LLM being optimized
    reward_model: scores (prompt, response) pairs
    ref_model: frozen copy of the SFT model (anchor)
    kl_coeff: how strongly to penalize deviation from ref_model
    """
    # Step 1: generate responses from the current policy
    responses = policy_model.generate(prompts)
    # Step 2: score responses with the reward model
    rewards = reward_model.score(prompts, responses)
    # Step 3: compute KL divergence from the reference model
    # This prevents the policy from drifting too far
    log_probs_policy = policy_model.log_prob(responses, prompts)
    log_probs_ref = ref_model.log_prob(responses, prompts)
    kl_penalty = log_probs_policy - log_probs_ref
    # Step 4: total reward = reward score - KL penalty
    total_reward = rewards - kl_coeff * kl_penalty
    # Step 5: PPO update
    # Adjust policy to increase total_reward
    # (uses clipped surrogate objective -- episode #109 covers PPO in detail)
    # ppo_update is a placeholder here, not a defined function
    ppo_update(policy_model, total_reward, log_probs_policy)
    return {
        "mean_reward": rewards.mean().item(),
        "mean_kl": kl_penalty.mean().item(),
        "mean_total": total_reward.mean().item(),
    }

# Simulate multiple RLHF steps
print("RLHF Training Progress (simulated):")
print(f"{'Step':>6} {'Reward':>8} {'KL':>8} {'Total':>8}")
print("-" * 34)
# In reality these improve over thousands of steps
random.seed(42)
for step in range(10):
    reward = 0.3 + step * 0.05 + random.gauss(0, 0.02)
    kl = 0.1 + step * 0.02 + random.gauss(0, 0.01)
    total = reward - 0.1 * kl
    print(f"{step:>6} {reward:>8.3f} {kl:>8.3f} {total:>8.3f}")
The KL penalty is the crucial ingredient. Without it, the model would quickly find degenerate responses that game the reward model -- repetitive text, sycophantic agreement, or responses that exploit blind spots in the reward model's training data. The reward model is an imperfect proxy for human preferences, and an unconstrained RL agent will exploit every imperfection it can find. This is the classic reward hacking problem from RL research, and it's a real issue in practice.
The KL penalty keeps the fine-tuned model close to the SFT model (the "reference" model). It says: "you can improve, but don't deviate too far from the behavior you learned during supervised fine-tuning." The kl_coeff hyperparameter controls this tradeoff -- too low and the model hacks the reward; too high and it barely changes from the SFT baseline.
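The tradeoff can be illustrated with a toy calculation. The three candidate behaviours and their numbers below are hypothetical, picked only to show the three regimes. Each has a reward model score and a log-ratio against the reference model, combined the same way as `total_reward` in `rlhf_training_step`.

```python
# Hypothetical candidates: (reward model score, log pi_policy - log pi_ref in nats)
candidates = {
    "unchanged from SFT": (0.40, 0.0),
    "genuinely better": (0.60, 0.5),
    "reward hack": (0.90, 8.0),
}

def total_reward(reward, log_ratio, kl_coeff):
    """Same shaping as rlhf_training_step: reward minus scaled KL term."""
    return reward - kl_coeff * log_ratio

def best_candidate(kl_coeff):
    """Name of the candidate with the highest shaped reward."""
    return max(candidates,
               key=lambda name: total_reward(*candidates[name], kl_coeff))

for kl_coeff in (0.0, 0.1, 1.0):
    print(f"kl_coeff={kl_coeff:<4} -> favours: {best_candidate(kl_coeff)}")
```

With no penalty the reward hack wins, a moderate coefficient favours the genuine improvement, and a large coefficient makes staying at the SFT baseline the best option.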
RLHF is expensive and finicky. You need a reward model (which requires human annotations), a PPO implementation (notoriously hard to get right -- lots of numerical tricks, clipping, value function baselines), and you're training the language model with RL (which is fundamentally less stable than supervised learning). Any of these components can fail silently -- the model achieves a high reward score but produces worse output because the reward model has a blind spot the RL agent learned to exploit.
DPO: achieving RLHF results without the RL
In 2023, Rafailov et al. published Direct Preference Optimization (DPO), which showed that you can skip the reward model and PPO entirely. Instead of the multi-step RLHF pipeline (train reward model -> run PPO against it), DPO directly optimizes the language model on human preference pairs.
The key insight: the optimal policy under the RLHF objective has a closed-form solution. This means you can derive a loss function that directly trains the language model on preference data, with no intermediate reward model and no reinforcement learning.
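For readers who want to see where that loss comes from, here is a condensed sketch of the derivation. The notation is an assumption for this summary: $\pi_{\mathrm{ref}}$ is the frozen SFT model, $\beta$ plays the role of the KL coefficient, $y_w$ and $y_l$ are the chosen and rejected responses, and $Z(x)$ is a normalizing constant that depends only on the prompt.

```latex
\begin{aligned}
&\text{RLHF objective:} && \max_{\pi}\; \mathbb{E}_{x,\, y \sim \pi(\cdot \mid x)}\big[r(x, y)\big] - \beta\, \mathrm{KL}\big(\pi(\cdot \mid x)\,\|\,\pi_{\mathrm{ref}}(\cdot \mid x)\big) \\
&\text{Closed-form optimum:} && \pi^{*}(y \mid x) = \frac{1}{Z(x)}\, \pi_{\mathrm{ref}}(y \mid x)\, \exp\!\Big(\tfrac{1}{\beta}\, r(x, y)\Big) \\
&\text{Solve for the reward:} && r(x, y) = \beta \log \frac{\pi^{*}(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} + \beta \log Z(x) \\
&\text{Bradley-Terry:} && P(y_w \succ y_l \mid x) = \sigma\big(r(x, y_w) - r(x, y_l)\big) \\
&\text{DPO loss:} && \mathcal{L}_{\mathrm{DPO}}(\theta) = -\,\mathbb{E}\Big[\log \sigma\Big(\beta \log \frac{\pi_{\theta}(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_{\theta}(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\Big)\Big]
\end{aligned}
```

The $\beta \log Z(x)$ term is the same for both responses to a prompt, so it cancels in the reward difference. That cancellation is what removes the need for an explicit reward model: the policy's own log-ratio against the reference acts as the reward.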
import torch
import torch.nn.functional as F

def dpo_loss(policy_model, ref_model, chosen_ids, rejected_ids,
             chosen_mask, rejected_mask, beta=0.1):
    """Direct Preference Optimization loss.
    policy_model: the model being trained
    ref_model: frozen SFT model (reference)
    chosen_ids: token IDs of prompt + preferred response
    rejected_ids: token IDs of prompt + rejected response
    chosen_mask, rejected_mask: 1 on response tokens, 0 on prompt and padding
    beta: temperature parameter (controls deviation from ref)
    """
    # Log probabilities under current policy
    pi_chosen = compute_log_prob(policy_model, chosen_ids, chosen_mask)
    pi_rejected = compute_log_prob(policy_model, rejected_ids, rejected_mask)
    # Log probabilities under reference (SFT) model -- frozen, no gradient
    with torch.no_grad():
        ref_chosen = compute_log_prob(ref_model, chosen_ids, chosen_mask)
        ref_rejected = compute_log_prob(ref_model, rejected_ids, rejected_mask)
    # DPO loss: push up the log-ratio for chosen, push down for rejected
    log_ratio_chosen = pi_chosen - ref_chosen
    log_ratio_rejected = pi_rejected - ref_rejected
    loss = -F.logsigmoid(beta * (log_ratio_chosen - log_ratio_rejected)).mean()
    return loss

def compute_log_prob(model, token_ids, mask):
    """Compute the summed log probability of the masked-in tokens."""
    logits = model(token_ids[:, :-1])
    log_probs = F.log_softmax(logits, dim=-1)
    target_log_probs = log_probs.gather(2, token_ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    # Sum over non-masked tokens: the DPO loss is defined on the full
    # sequence log-probability (averaging per token is a length-normalized variant)
    return (target_log_probs * mask[:, 1:]).sum(1)

# Compare RLHF vs DPO pipeline complexity
print("RLHF Pipeline:")
print(" 1. Collect preference data (human annotations)")
print(" 2. Train reward model on preferences")
print(" 3. Run PPO with reward model + KL constraint")
print(" 4. Tune PPO hyperparameters (clip ratio, value coeff, etc.)")
print(" 5. Monitor for reward hacking")
print()
print("DPO Pipeline:")
print(" 1. Collect preference data (same human annotations)")
print(" 2. Run supervised training with DPO loss")
print(" 3. Done.")
print()
print("Same input data. Similar output quality. Much simpler ;-)")
DPO is simpler (no reward model to train separately, no PPO implementation), more stable (it's a standard supervised training loop -- the same kind we've been using since episode #7), and cheaper (no generation step during training, which is the most expensive part of RLHF). It produces comparable results to RLHF on most benchmarks.
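To see that DPO really is an ordinary gradient-descent loop with no sampling step, here is a toy sketch. The setup is an assumption for illustration only: one prompt, three candidate responses, and a "policy" that is just a learnable categorical distribution over them, standing in for a language model.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Toy setup: the policy is a learnable categorical over three responses
ref_logits = torch.tensor([0.0, 0.0, 0.0])           # frozen reference: uniform
policy_logits = ref_logits.clone().requires_grad_()  # policy starts at the reference
chosen, rejected = 0, 1                              # response 0 preferred over 1
beta = 0.1

opt = torch.optim.SGD([policy_logits], lr=1.0)
ref_logp = F.log_softmax(ref_logits, dim=0)

losses = []
for step in range(200):
    logp = F.log_softmax(policy_logits, dim=0)
    log_ratio_chosen = logp[chosen] - ref_logp[chosen]
    log_ratio_rejected = logp[rejected] - ref_logp[rejected]
    loss = -F.logsigmoid(beta * (log_ratio_chosen - log_ratio_rejected))
    opt.zero_grad()
    loss.backward()
    opt.step()
    losses.append(loss.item())

probs = F.softmax(policy_logits.detach(), dim=0)
print(f"loss: {losses[0]:.4f} -> {losses[-1]:.4f}")
print(f"policy probs: {[round(p, 3) for p in probs.tolist()]}")
```

The loss starts at ln 2 because policy and reference are identical at step 0, then falls as probability mass moves from the rejected response to the chosen one. The loop never calls a generate step and never queries a reward model.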
The tradeoff: DPO is a one-shot optimization. You train on a fixed preference dataset. RLHF can iteratively generate new responses, collect fresh preference data on those responses, and keep improving in an online learning loop. In practice, you can run DPO multiple times with new data, but RLHF's online loop is theoretically more powerful for continuous improvement.
As of 2025, DPO and its variants (IPO -- Identity Preference Optimization, KTO -- Kahneman-Tversky Optimization, ORPO -- Odds Ratio Preference Optimization) have become the dominant approach for preference alignment. The simplicity wins in practice.
Constitutional AI: self-supervised alignment
Anthropic introduced Constitutional AI (CAI), which takes a different approach to the annotation bottleneck. Instead of collecting thousands of human preference pairs (expensive, slow, requires careful rater management), CAI has the model critique and revise its own responses based on a set of written principles -- "the constitution."
# Constitutional AI workflow
constitution = [
    "Choose the response that would be least harmful if shared widely.",
    "Choose the response that is most helpful while being honest.",
    "Choose the response that best respects individual autonomy.",
    "Choose the response that is most accurate and truthful.",
]

cai_steps = [
    {
        "step": "1. Generate",
        "description": "Start with a helpful-only model (SFT, no safety training)",
        "example": {
            "prompt": "How do I pick a lock?",
            "response": "Here's a step-by-step guide to picking locks: First...",
        }
    },
    {
        "step": "2. Critique",
        "description": "Ask the model to evaluate its own response against principles",
        "example": {
            "critique": (
                "This response provides detailed instructions for an activity "
                "that could be used for breaking and entering. According to the "
                "principle 'Choose the response that would be least harmful "
                "if shared widely', this should be revised."
            ),
        }
    },
    {
        "step": "3. Revise",
        "description": "Ask the model to produce a better response",
        "example": {
            "revision": (
                "Lock picking is a legitimate skill used by locksmiths and "
                "security professionals. I'd recommend taking a certified "
                "locksmith course if you're interested professionally. "
                "For personal lockouts, contact a licensed locksmith."
            ),
        }
    },
    {
        "step": "4. Train",
        "description": "Fine-tune on the revisions, then train on AI-labeled preference pairs",
    },
]

print("Constitutional AI Process:")
for step_info in cai_steps:
    print(f"\n {step_info['step']}: {step_info['description']}")
    if 'example' in step_info:
        for k, v in step_info['example'].items():
            print(f" {k}: {v[:70]}...")

print(f"\nConstitution ({len(constitution)} principles):")
for i, p in enumerate(constitution):
    print(f" {i+1}. {p}")
The big advantage: you can align the model to nuanced, multi-dimensional principles without hand-labeling thousands of preference pairs. The constitution spells out what "good" means in plain written language that the model can apply, and the model bootstraps its own training data through self-critique.
The step from human feedback to AI feedback is called RLAIF (Reinforcement Learning from AI Feedback). Instead of humans ranking responses, the model itself (or a separate judge model) does the ranking based on the constitutional principles. This scales much better -- you can generate millions of preference pairs at the cost of inference compute, which is far cheaper than hiring human annotators.
The limitation is real though: the model's ability to critique itself is bounded by its own capabilities. If it can't recognise a subtle harm, it can't revise against it. CAI works well on clear-cut cases (explicit harmful content, obvious factual errors) but struggles with genuinely nuanced ethical dilemmas where reasonable people disagree. It's a complement to human feedback, not a replacement.
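The final "Train" step of the CAI workflow boils down to turning (original, revised) responses into preference pairs. The sketch below shows one way that conversion could look. The helper name `build_preference_pairs`, the record layout, and the rule that the revision is always the "chosen" side are assumptions made for illustration, not a description of Anthropic's actual pipeline.

```python
def build_preference_pairs(records):
    """Turn CAI records into {'prompt', 'chosen', 'rejected'} dicts.

    Assumption: the revised response is treated as "chosen" and the
    original as "rejected". Records whose revision is missing or identical
    to the original carry no preference signal and are skipped.
    """
    pairs = []
    for rec in records:
        original = rec["response"].strip()
        revision = rec.get("revision", "").strip()
        if not revision or revision == original:
            continue
        pairs.append({
            "prompt": rec["prompt"],
            "chosen": revision,
            "rejected": original,
        })
    return pairs

# Hypothetical records, e.g. collected from a critique-and-revise loop
records = [
    {"prompt": "How do I pick a lock?",
     "response": "Here's a step-by-step guide to picking locks: First...",
     "revision": "For personal lockouts, contact a licensed locksmith."},
    {"prompt": "What is 2+2?", "response": "4", "revision": "4"},  # unchanged
]

pairs = build_preference_pairs(records)
print(f"{len(pairs)} preference pair(s) from {len(records)} record(s)")
```

The resulting list has the same shape as human-labelled preference data, so it can feed the same DPO or reward-model training loop.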
The alignment tax
Every alignment technique involves a tradeoff. The pre-trained model has raw capability -- it's seen the most text, has the broadest knowledge base, and can generate the most diverse outputs. Each alignment step constrains it:
# The alignment tax: what each stage costs in capability
stages = [
    {
        "name": "Base model (pre-trained)",
        "capability": "Maximum (all internet knowledge)",
        "constraint": "None -- generates anything",
        "lost": "Nothing yet",
    },
    {
        "name": "After SFT",
        "capability": "Narrowed to assistant-style responses",
        "constraint": "Must follow instruction-response format",
        "lost": "Some creative/diverse generation modes",
    },
    {
        "name": "After RLHF/DPO",
        "capability": "Optimized for human-preferred responses",
        "constraint": "Must produce responses humans rate highly",
        "lost": "Brevity (humans prefer longer answers in ratings), "
                "uncertainty expression (confident answers rate higher)",
    },
    {
        "name": "After safety training",
        "capability": "Refuses harmful requests",
        "constraint": "Must decline dangerous/unethical tasks",
        "lost": "Some legitimate use cases caught in the safety net",
    },
]

print("The Alignment Tax:")
print()
for stage in stages:
    print(f" {stage['name']}:")
    print(f" Capability: {stage['capability']}")
    print(f" Constraint: {stage['constraint']}")
    print(f" Lost: {stage['lost']}")
    print()
SFT narrows the output distribution from "everything on the internet" to "assistant-style responses." The model becomes less willing to write in unusual formats, generate creative fiction in niche styles, or produce text that doesn't look like a conversation.
RLHF/DPO further narrows the distribution to "responses humans rate highly." This can make the model more verbose (humans tend to prefer longer answers in A/B comparisons, even when shorter would be better), more cautious, and less willing to express genuine uncertainty (confident-sounding answers rate higher than honest "I'm not sure" answers -- even when the model should be uncertain).
Safety training adds refusals. The model declines to help with tasks deemed harmful, which inevitably catches some legitimate use cases. A security researcher asking about vulnerabilities, a novelist writing a villain's dialogue, a medical professional asking about drug interactions -- these can all trigger overly broad safety filters.
This is the alignment tax: the cost in raw capability paid for making the model helpful, harmless, and honest. The goal of ongoing alignment research is to minimise this tax -- being safe and useful without unnecessarily sacrificing capability.
In practice, the alignment tax is shrinking. Modern techniques are more surgical: safety training that targets specific harms without broadly reducing helpfulness, preference optimization that improves both helpfulness and safety simultaneously, and evaluation frameworks that detect both unsafe behaviour AND unnecessary refusals. The conversation has shifted from "safe vs capable" to "how do we get both" ;-)
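Measuring "both unsafe behaviour AND unnecessary refusals" comes down to tracking two rates side by side. Here is a minimal sketch under the assumption that each test prompt has already been labelled as harmful or benign and that each model response has been marked as a refusal or not. The function name `refusal_metrics` and the toy results are hypothetical.

```python
def refusal_metrics(results):
    """Compute two complementary rates from labelled test results.

    results: list of dicts with boolean keys 'harmful' (is the prompt
    harmful?) and 'refused' (did the model decline?).
    """
    harmful = [r for r in results if r["harmful"]]
    benign = [r for r in results if not r["harmful"]]
    unsafe = sum(1 for r in harmful if not r["refused"])  # should have refused
    over = sum(1 for r in benign if r["refused"])         # should have helped
    return {
        "unsafe_compliance_rate": unsafe / len(harmful) if harmful else 0.0,
        "over_refusal_rate": over / len(benign) if benign else 0.0,
    }

# Toy, hand-made results (not real evaluation data)
results = (
    [{"harmful": True, "refused": True}] * 3
    + [{"harmful": True, "refused": False}] * 1
    + [{"harmful": False, "refused": True}] * 3
    + [{"harmful": False, "refused": False}] * 3
)

metrics = refusal_metrics(results)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

A model that refuses everything scores a perfect zero on unsafe compliance and a terrible 1.0 on over-refusal, which is why reporting only one of the two numbers hides the alignment tax.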
The full pipeline
Putting it all together, here's what training a modern AI assistant actually looks like end to end:
pipeline = [
    ("Pre-training", "months", "$millions",
     "Train LM on trillions of tokens"),
    ("Supervised Fine-Tuning", "days", "$thousands",
     "Fine-tune on instruction-response demos"),
    ("Preference Alignment", "days", "$thousands",
     "DPO or RLHF on human preference pairs"),
    ("Safety Training", "ongoing", "varies",
     "Red-teaming, CAI, safety-specific data"),
    ("Evaluation", "continuous", "varies",
     "Benchmarks + human eval + adversarial tests"),
]

# Left-align the stage name in a column wide enough for the longest label,
# and print the duration column as well
header = f"{'Stage':<24} {'Time':>12} {'Cost':>10} {'What':>45}"
print(header)
print("-" * len(header))
for name, duration, cost, what in pipeline:
    print(f"{name:<24} {duration:>12} {cost:>10} {what:>45}")

print()
print("Pre-training gives knowledge and capability.")
print("Alignment stages make that capability accessible and safe.")
print("Without alignment: a model that CAN answer anything but WON'T")
print("reliably do so in a useful way.")
The pre-training gives the model knowledge and raw capability. The alignment stages make that capability accessible -- they teach the model to respond in formats humans find useful, to prefer outputs humans prefer, and to decline requests that could cause harm.
Without alignment, you have a model that can complete any text -- but might answer your question, or might write an essay about your question, or might generate something completely off-topic. With alignment, you have a model that reliably responds to what you actually asked, in the format you'd expect, at a quality level that matches human preferences. That transformation -- from text predictor to useful tool -- is what makes the difference between a research artefact and a product millions of people use.
Practical implementation: building a simple preference trainer
Let me show you a complete, runnable example that demonstrates the core DPO concept on a small scale. This won't produce GPT-4, but it illustrates the mechanics:
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyLM(nn.Module):
    """Minimal language model for demonstrating DPO."""
    def __init__(self, vocab_size=100, d_model=64, n_layers=2):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, d_model)
        self.layers = nn.ModuleList([
            nn.TransformerEncoderLayer(d_model, 4, d_model * 4,
                                       batch_first=True)
            for _ in range(n_layers)])
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, x):
        T = x.size(1)
        # Causal mask: each position attends only to itself and earlier tokens
        mask = nn.Transformer.generate_square_subsequent_mask(T, device=x.device)
        h = self.emb(x)
        for layer in self.layers:
            h = layer(h, src_mask=mask)
        return self.head(h)

    def log_prob_sequence(self, ids):
        """Average log probability of the sequence."""
        logits = self.forward(ids[:, :-1])
        log_probs = F.log_softmax(logits, dim=-1)
        targets = ids[:, 1:]
        token_log_probs = log_probs.gather(2, targets.unsqueeze(-1)).squeeze(-1)
        return token_log_probs.mean(dim=1)

# Create policy and reference (frozen copy)
torch.manual_seed(42)
vocab_size = 50
policy = TinyLM(vocab_size)
ref = TinyLM(vocab_size)
ref.load_state_dict(policy.state_dict())
for p in ref.parameters():
    p.requires_grad = False
ref.eval()  # disable dropout so the reference log-probs are deterministic

# Synthetic preference data: pairs of sequences
# "chosen" sequences have a specific pattern, "rejected" don't
n_pairs = 200
seq_len = 16
batch_size = 32
chosen_data = torch.randint(0, vocab_size // 2, (n_pairs, seq_len))
rejected_data = torch.randint(vocab_size // 2, vocab_size, (n_pairs, seq_len))

# DPO training
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-4)
beta = 0.1
print(f"{'Epoch':>6} {'DPO Loss':>10} {'Chosen logr':>12} {'Rejected logr':>14}")
print("-" * 45)
for epoch in range(20):
    total_loss = 0.0
    n_batches = 0
    for i in range(0, n_pairs, batch_size):
        chosen = chosen_data[i:i+batch_size]
        rejected = rejected_data[i:i+batch_size]
        pi_chosen = policy.log_prob_sequence(chosen)
        pi_rejected = policy.log_prob_sequence(rejected)
        with torch.no_grad():
            ref_chosen = ref.log_prob_sequence(chosen)
            ref_rejected = ref.log_prob_sequence(rejected)
        # Log-ratios of policy vs reference for each side of the pair
        logr_chosen = pi_chosen - ref_chosen
        logr_rejected = pi_rejected - ref_rejected
        loss = -F.logsigmoid(beta * (logr_chosen - logr_rejected)).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
        n_batches += 1
    # Divide by the real batch count (7 here: the last batch is partial)
    avg_loss = total_loss / n_batches
    policy.eval()  # measure log-ratios without dropout noise
    with torch.no_grad():
        lrc = (policy.log_prob_sequence(chosen_data[:32]) -
               ref.log_prob_sequence(chosen_data[:32])).mean().item()
        lrr = (policy.log_prob_sequence(rejected_data[:32]) -
               ref.log_prob_sequence(rejected_data[:32])).mean().item()
    policy.train()
    if (epoch + 1) % 5 == 0:
        print(f"{epoch+1:>6} {avg_loss:>10.4f} {lrc:>12.4f} {lrr:>14.4f}")

print("\nAfter DPO training:")
print(" Chosen log-ratio should be POSITIVE (policy prefers chosen more)")
print(" Rejected log-ratio should be NEGATIVE (policy prefers rejected less)")
The log-ratio tells you how much the policy has shifted relative to the reference. Positive log-ratio for chosen responses means the policy assigns them higher probability than the reference does. Negative log-ratio for rejected responses means the policy assigns them lower probability. That's exactly the behavior we want -- the model has learned to prefer the chosen responses over the rejected ones, all through a simple supervised loss function with no RL involved.
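The log-ratios can also be read as rewards: DPO implicitly defines the reward of a response as beta times its log-ratio, so a pair counts as "correct" when the chosen response gets the higher implicit reward. The sketch below turns that into a preference accuracy. The helper names and the example numbers are made up for illustration; in a real run you would plug in the per-pair log-ratios from your own model.

```python
def implicit_rewards(log_ratios, beta=0.1):
    """DPO's implicit reward per response: beta * log(pi / pi_ref)."""
    return [beta * lr for lr in log_ratios]

def preference_accuracy(chosen_logr, rejected_logr, beta=0.1):
    """Fraction of pairs where chosen gets a higher implicit reward."""
    r_chosen = implicit_rewards(chosen_logr, beta)
    r_rejected = implicit_rewards(rejected_logr, beta)
    wins = sum(c > r for c, r in zip(r_chosen, r_rejected))
    return wins / len(r_chosen)

# Made-up per-pair log-ratios (policy vs reference), one entry per pair
chosen_logr = [0.8, 0.3, -0.1, 0.5]
rejected_logr = [-0.6, -0.2, 0.2, -0.4]

acc = preference_accuracy(chosen_logr, rejected_logr)
print(f"Preference accuracy: {acc:.2f}")
```

Accuracy is a useful sanity check next to the loss: the loss can keep falling while accuracy stays flat if the model only widens margins on pairs it already ranks correctly.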
SFT data formats in practice
One more practical detail that's worth covering: how instruction data is actually formatted. Different model families use different chat templates, and getting this wrong can silently degrade performance:
# Common chat template formats
templates = {
    "ChatML (OpenAI)": (
        "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n"
        "<|im_start|>user\nWhat is 2+2?<|im_end|>\n"
        "<|im_start|>assistant\n4<|im_end|>\n"
    ),
    "LLaMA-2 Chat": (
        "[INST] <<SYS>>\nYou are a helpful assistant.\n<</SYS>>\n\n"
        "What is 2+2? [/INST] 4 "
    ),
    "Mistral Instruct": (
        "[INST] What is 2+2? [/INST] 4"
    ),
    "Alpaca (simple)": (
        "### Instruction:\nWhat is 2+2?\n\n"
        "### Response:\n4"
    ),
}

print("Common SFT chat templates:\n")
for name, template in templates.items():
    print(f" {name}:")
    for line in template.split('\n'):
        print(f" {line}")
    print()

print("Important: use the EXACT template the model was fine-tuned with.")
print("Wrong delimiters = the model doesn't recognise the format.")
Using the wrong template is a common source of poor performance when fine-tuning. The model learns to associate specific tokens ([INST], <|im_start|>, etc.) with the instruction-following behavior during SFT. If you use different delimiters at inference time, the model doesn't "see" the instruction format and reverts to base-model behavior (text completion instead of instruction following).
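To avoid hand-typing delimiters, it helps to render prompts from a list of messages in one place. Below is a minimal sketch of a ChatML-style renderer. The function name `render_chatml` and its `add_generation_prompt` flag are my own illustrative choices; in real projects you would normally rely on the template shipped with the model (for example via `tokenizer.apply_chat_template` in Hugging Face transformers) rather than a hand-written one.

```python
def render_chatml(messages, add_generation_prompt=False):
    """Render [{'role': ..., 'content': ...}] dicts as a ChatML-style string."""
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
        for m in messages
    ]
    if add_generation_prompt:
        # Open an assistant turn so the model continues with its reply
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is 2+2?"},
]

prompt = render_chatml(messages, add_generation_prompt=True)
print(prompt)
```

Keeping the template in a single function means training and inference code cannot drift apart, which is exactly the failure mode described above.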
Evaluating alignment: how do you know it worked?
Measuring alignment quality is harder than measuring pre-training quality. Pre-training has clear metrics: perplexity on held-out text. Alignment quality is multi-dimensional -- you need to measure helpfulness, harmlessness, and honesty simultaneously.
# Alignment evaluation dimensions
eval_dimensions = {
    "Helpfulness": {
        "measures": "Does the model answer the question correctly and usefully?",
        "metrics": ["MT-Bench scores", "AlpacaEval win rate",
                    "Human preference ratings"],
        "pitfall": "Models that always give long, confident answers "
                   "score well even when they're wrong",
    },
    "Harmlessness": {
        "measures": "Does the model refuse harmful requests?",
        "metrics": ["Red-team attack success rate", "ToxiGen scores",
                    "BBQ bias benchmark"],
        "pitfall": "Over-refusal: declining legitimate requests that "
                   "trigger broad safety filters",
    },
    "Honesty": {
        "measures": "Does the model express uncertainty when appropriate?",
        "metrics": ["Calibration (confidence vs accuracy)", "TruthfulQA",
                    "Hallucination rate"],
        "pitfall": "RLHF rewards confident answers, pushing models to "
                   "sound certain even when they shouldn't be",
    },
}

for dim, info in eval_dimensions.items():
    print(f"\n{dim}:")
    print(f" What: {info['measures']}")
    print(f" How: {', '.join(info['metrics'])}")
    print(f" Risk: {info['pitfall']}")

print("\n\nThe challenge: improving one dimension can hurt another.")
print("Making a model more helpful can make it less safe.")
print("Making it safer can make it less helpful (over-refusal).")
print("Good alignment optimises ALL dimensions simultaneously.")
The tricky part is that these dimensions can conflict. Making a model maximally helpful (always attempts to answer) can make it less safe (answers harmful queries too). Making it maximally safe (refuses anything potentially harmful) makes it less helpful (refuses legitimate queries). Good alignment research optimises all dimensions simultaneously, minimising the tradeoffs. This is an active area of research and, honestly, one of the most important problems in AI right now.
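Of the three dimensions, honesty is the easiest to put a number on through calibration: when the model says it is 90% sure, is it right about 90% of the time? A common summary is the expected calibration error (ECE). The sketch below is a simple equal-width-bin version; the function name, the bin count, and the toy data are assumptions for illustration.

```python
def expected_calibration_error(confidences, correct, n_bins=5):
    """ECE: weighted average of |accuracy - mean confidence| per bin.

    confidences: floats in [0, 1]; correct: 1/0 (or True/False) per answer.
    """
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 -> last bin
        bins[idx].append((conf, ok))

    n = len(confidences)
    ece = 0.0
    for items in bins:
        if not items:
            continue
        avg_conf = sum(c for c, _ in items) / len(items)
        accuracy = sum(ok for _, ok in items) / len(items)
        ece += (len(items) / n) * abs(accuracy - avg_conf)
    return ece

# Toy data: the model claims 90% confidence but is right only half the time
confidences = [0.9, 0.9, 0.9, 0.9, 0.6, 0.6]
correct = [1, 1, 0, 0, 1, 0]

ece = expected_calibration_error(confidences, correct)
print(f"ECE: {ece:.3f}")
```

An ECE of zero means stated confidence matches observed accuracy in every bin. Comparing this number before and after preference training is one way to check whether the model has been pushed towards sounding more certain than it should.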
What to remember from this one
- Pre-trained LLMs predict text but don't follow instructions -- SFT teaches the instruction-response format using 10K-100K demonstration examples, training only on response tokens;
- Reward models score responses based on human pairwise preferences (Bradley-Terry loss), learning what "good" output looks like without absolute quality ratings;
- RLHF optimizes the LLM to produce high-reward responses using PPO, with a KL penalty that prevents the model from drifting too far from the SFT baseline and hacking the reward;
- DPO achieves similar results to RLHF without a separate reward model or RL -- it directly optimises on preference pairs using a clean supervised loss. Simpler, more stable, and cheaper;
- Constitutional AI reduces the need for human annotations by having the model critique and revise its own responses based on written principles (RLAIF);
- The alignment tax trades some raw capability for helpfulness and safety -- modern techniques are getting better at minimising this tradeoff;
- The full pipeline is: pre-train -> SFT -> preference alignment -> safety training -> evaluation. Each stage builds on the previous one.
Exercises
Exercise 1: Build a complete preference data collection simulator. Create a function generate_preference_pair(prompt) that takes a prompt and returns two responses: a "chosen" (well-formatted, concise, directly answers the question) and a "rejected" (either too verbose, off-topic, or unhelpful). Generate 100 preference pairs for 10 different prompts. Then implement the Bradley-Terry reward model training loop: build a small reward model (2-layer MLP that takes a feature vector extracted from text -- use simple features like response length, question-word overlap, and vocabulary diversity), train it on your 100 pairs using the pairwise loss function from this episode for 50 epochs, and report the training accuracy (fraction of pairs where the model correctly assigns a higher score to "chosen" than "rejected"). Print accuracy every 10 epochs.
Exercise 2: Implement a minimal DPO trainer from scratch. Create two tiny language models (2-layer transformer, d_model=64, vocab_size=50) -- one is the policy, the other is the frozen reference. Create synthetic preference data: 200 pairs of token sequences where "chosen" sequences use tokens 0-24 and "rejected" sequences use tokens 25-49. Train the policy model using the DPO loss from this episode for 20 epochs (beta=0.1, lr=1e-4, batch_size=32). After training, measure and print: (a) the average log-probability ratio (policy vs reference) for chosen sequences, (b) the same ratio for rejected sequences, (c) the "accuracy" -- fraction of pairs where the policy assigns higher log-prob to chosen than rejected. The chosen log-ratio should be positive and the rejected log-ratio should be negative.
Exercise 3: Simulate the Constitutional AI self-critique loop. Write a function constitutional_critique(response, principles) that takes a response string and a list of principle strings, and returns a "critique" identifying which principles the response might violate (use simple keyword matching: if the response contains words like "hack", "exploit", "steal", check the safety principle; if it's very short, check the helpfulness principle; if it contains "definitely" or "certainly" without evidence, check the honesty principle). Generate 20 responses (mix of helpful, harmful, and uncertain ones), run each through the critique function with 4 principles, and print a report showing: for each response, which principles were flagged and a suggested revision direction. Count how many responses were flagged by each principle. This simulates the CAI pipeline at a conceptual level -- real CAI uses the model itself for critique, but the evaluation logic is the same.