Learn AI Series (#69) - Fine-Tuning Language Models

What will I learn

You will learn when fine-tuning beats prompting and when it doesn't;
full fine-tuning vs parameter-efficient methods;
LoRA: low-rank adaptation, the current standard for efficient fine-tuning;
QLoRA: quantized LoRA that works on consumer hardware;
dataset preparation: what good fine-tuning data looks like;
evaluation: how to know if fine-tuning actually helped.

Requirements

A working modern computer running macOS, Windows or Ubuntu;
An installed Python 3(.11+) distribution;
The ambition to learn AI and machine learning.

Difficulty

Beginner

Curriculum (of the `Learn AI Series`):

Learn AI Series (#69) - Fine-Tuning Language Models

Solutions to Episode #68 Exercises

Exercise 1: Two-level hierarchical agent system with manager and 3 specialized workers.

class WorkerAgent:
    """Specialized worker with a focused skill set."""

    def __init__(self, name, skill):
        self.name = name
        self.skill = skill
        self.tasks_done = 0

    def execute(self, task):
        """Execute a subtask within this worker's specialty."""
        self.tasks_done += 1
        # Simulated tool execution
        if self.skill == "research":
            return self._search(task)
        elif self.skill == "code":
            return self._run_code(task)
        elif self.skill == "analysis":
            return self._calculate(task)
        return {"error": f"Unknown skill: {self.skill}"}

    def _search(self, query):
        data = {
            "berlin": 3_645_000, "paris": 2_161_000,
            "madrid": 3_223_000, "rome": 2_873_000,
            "london": 8_982_000,
        }
        for city, pop in data.items():
            if city in query.lower():
                return {"city": city.title(), "population": pop}
        return {"results": "No match found"}

    def _run_code(self, task):
        # Simulated code execution
        return {"output": f"Code executed: {task[:50]}",
                "exit_code": 0}

    def _calculate(self, task):
        if "average" in task.lower():
            nums = [3_645_000, 2_161_000, 3_223_000,
                    2_873_000, 8_982_000]
            avg = sum(nums) / len(nums)
            return {"result": avg, "count": len(nums)}
        return {"result": f"Calculated: {task[:40]}"}


class ManagerAgent:
    """Decomposes tasks and delegates to workers."""

    def __init__(self, workers):
        self.workers = {w.name: w for w in workers}
        self.trace = []

    def run(self, task):
        """Plan, delegate, synthesize."""
        print(f"[Manager] Task: {task[:60]}...")

        # Decompose
        plan = self._plan(task)
        self.trace.append({"phase": "plan", "steps": plan})
        print(f"[Manager] Plan: {len(plan)} subtasks")

        # Delegate
        results = {}
        workers_used = set()
        for worker_name, subtask in plan:
            worker = self.workers.get(worker_name)
            if not worker:
                print(f"  [!] No worker '{worker_name}'")
                continue
            print(f"  [{worker_name}] {subtask[:50]}...")
            result = worker.execute(subtask)
            results[f"{worker_name}:{subtask[:30]}"] = result
            workers_used.add(worker_name)
            self.trace.append({
                "phase": "execute",
                "worker": worker_name,
                "subtask": subtask[:40],
                "result": result,
            })

        # Synthesize
        synthesis = self._synthesize(task, results)
        self.trace.append({"phase": "synthesis", "output": synthesis})
        print(f"[Manager] Workers used: {workers_used}")
        assert len(workers_used) >= 2, "Must use 2+ workers"
        return synthesis

    def _plan(self, task):
        return [
            ("researcher", "Search population of Berlin"),
            ("researcher", "Search population of Paris"),
            ("researcher", "Search population of Madrid"),
            ("researcher", "Search population of Rome"),
            ("researcher", "Search population of London"),
            ("analyst", "Calculate average of populations"),
        ]

    def _synthesize(self, task, results):
        pops = {}
        avg = None
        for key, val in results.items():
            if "city" in val:
                pops[val["city"]] = val["population"]
            if "result" in val and isinstance(val["result"], float):
                avg = val["result"]
        return {
            "populations": pops,
            "average": avg,
            "summary": (f"Average population of "
                        f"{len(pops)} capitals: "
                        f"{avg:,.0f}" if avg else "N/A"),
        }


# Build the team
workers = [
    WorkerAgent("researcher", "research"),
    WorkerAgent("coder", "code"),
    WorkerAgent("analyst", "analysis"),
]
manager = ManagerAgent(workers)
result = manager.run(
    "Find the population of 5 European capitals "
    "and calculate the average")

print(f"\nFinal result:")
print(f"  Populations: {result['populations']}")
print(f"  {result['summary']}")
print(f"\nExecution trace ({len(manager.trace)} entries):")
for entry in manager.trace:
    print(f"  [{entry['phase']}] "
          f"{str(entry.get('worker', entry.get('output', '')))[:50]}")

The manager never touches data directly -- it plans, delegates, and synthesizes. This separation keeps the manager's context clean (remember the context dilution problem from episode #68?) while each worker operates in a focused scope. The assertion at the end is a sanity check: if you're building a hierarchical system and only using one worker, you probably don't need the hierarchy.

Exercise 2: Agent pipeline with quality gates and retry logic.

class PipelineAgent:
    """Agent stage in a pipeline."""

    def __init__(self, name, transform_fn):
        self.name = name
        self.transform = transform_fn

    def run(self, input_text, extra_prompt=""):
        """Process input, optionally with retry context."""
        if extra_prompt:
            return self.transform(
                f"[RETRY: {extra_prompt}] {input_text}")
        return self.transform(input_text)


def quality_gate(input_text, output_text):
    """Check output quality. Returns (passed, reasons)."""
    reasons = []
    if not output_text or not output_text.strip():
        reasons.append("Output is empty")
    if len(output_text) < len(input_text) * 0.8:
        reasons.append(
            f"Too short: {len(output_text)} < 80% of "
            f"{len(input_text)}")
    if "[TODO]" in output_text or "[PLACEHOLDER]" in output_text:
        reasons.append("Contains [TODO] or [PLACEHOLDER]")
    return len(reasons) == 0, reasons


def run_pipeline(agents, initial_input, max_retries=2):
    """Run pipeline with quality gates between stages."""
    current = initial_input
    trace = []

    for agent in agents:
        for attempt in range(max_retries + 1):
            extra = ""
            if attempt > 0:
                extra = (f"Your previous output failed quality "
                         f"check: {'; '.join(fail_reasons)}. "
                         f"Try again.")

            output = agent.run(current, extra)
            passed, fail_reasons = quality_gate(current, output)

            trace.append({
                "stage": agent.name,
                "attempt": attempt + 1,
                "input_len": len(current),
                "output_len": len(output),
                "passed": passed,
                "reasons": fail_reasons,
            })

            if passed:
                print(f"  [{agent.name}] attempt {attempt+1}: "
                      f"PASSED ({len(output)} chars)")
                current = output
                break
            else:
                print(f"  [{agent.name}] attempt {attempt+1}: "
                      f"FAILED - {fail_reasons}")
                if attempt == max_retries:
                    print(f"  [{agent.name}] Max retries, "
                          f"using last output")
                    current = output
    return current, trace


# Stage transforms
def draft_fn(text):
    return (f"DRAFT: {text} -- This is a comprehensive "
            f"first draft covering the key points.")

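def review_fn(text):
    if not text.startswith("[RETRY"):
        # First attempt deliberately fails (has TODO)
        return text + " [TODO] needs more detail."
    # Retry: drop the "[RETRY: ...] " prefix. It quotes the gate's
    # failure reasons (including "[TODO]"), so returning it unchanged
    # would fail the same check again on every retry.
    return text.split("Try again.] ", 1)[-1].strip()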

def polish_fn(text):
    return text.replace("DRAFT: ", "FINAL: ")

agents = [
    PipelineAgent("drafter", draft_fn),
    PipelineAgent("reviewer", review_fn),
    PipelineAgent("polisher", polish_fn),
]

# 3 test inputs
test_inputs = [
    "Explain gradient descent in simple terms",
    "Describe the transformer architecture",
    "Compare CNN and RNN approaches",
]

for i, inp in enumerate(test_inputs, 1):
    print(f"\nPipeline run {i}: '{inp[:40]}...'")
    result, trace = run_pipeline(agents, inp)
    print(f"  Result: {result[:60]}...")
    retries = sum(1 for t in trace if t["attempt"] > 1)
    print(f"  Total retries: {retries}")

Quality gates between pipeline stages catch problems early. The reviewer's first attempt deliberately inserts a [TODO] marker -- the gate catches it, and the retry mechanism feeds the failure reason back to the stage. In production, that retry prompt goes to the LLM with context about what went wrong, and the model usually fixes it on the second try.

Exercise 3: Memory-augmented agent with forgetting strategies.

from sentence_transformers import SentenceTransformer
import numpy as np
import time

class VectorMemory:
    """Memory with semantic retrieval and forgetting."""

    def __init__(self, model_name='all-MiniLM-L6-v2'):
        self.model = SentenceTransformer(model_name)
        self.entries = []
        self.embeddings = None

    def store(self, content, metadata=None):
        """Store with timestamp."""
        self.entries.append({
            "content": content,
            "metadata": metadata or {},
            "timestamp": time.time(),
        })
        self._rebuild_index()

    def _rebuild_index(self):
        if not self.entries:
            self.embeddings = None
            return
        texts = [e["content"] for e in self.entries]
        self.embeddings = self.model.encode(
            texts, normalize_embeddings=True)

    def recall(self, query, top_k=3):
        """Retrieve top-k memories by cosine similarity."""
        if not self.entries:
            return []
        q_emb = self.model.encode(
            [query], normalize_embeddings=True)
        scores = (self.embeddings @ q_emb.T).flatten()
        top_idx = np.argsort(scores)[::-1][:top_k]
        return [(self.entries[i]["content"],
                 float(scores[i])) for i in top_idx]

    def forget_old(self, max_age_seconds):
        """Remove memories older than threshold."""
        cutoff = time.time() - max_age_seconds
        before = len(self.entries)
        self.entries = [e for e in self.entries
                        if e["timestamp"] > cutoff]
        if len(self.entries) != before:
            self._rebuild_index()
        return before - len(self.entries)

    def forget_irrelevant(self, query, threshold=0.3):
        """Remove memories below similarity threshold."""
        if not self.entries:
            return 0
        q_emb = self.model.encode(
            [query], normalize_embeddings=True)
        scores = (self.embeddings @ q_emb.T).flatten()
        before = len(self.entries)
        self.entries = [e for e, s in zip(self.entries, scores)
                        if s >= threshold]
        if len(self.entries) != before:
            self._rebuild_index()
        return before - len(self.entries)

    def count(self):
        return len(self.entries)


# Build memory with 20 diverse entries
memory = VectorMemory()

entries = [
    # Python debugging (5)
    "Fixed ImportError in Python by adding __init__.py",
    "Python debugging: TypeError from mixing str and int",
    "Resolved Python circular import with lazy loading",
    "Python traceback: KeyError in dict, added .get()",
    "Fixed Python async bug: missing await on coroutine",
    # ML experiments (5)
    "ML experiment: RandomForest achieved 94% accuracy",
    "Trained BERT model on sentiment data, F1=0.89",
    "Hyperparameter tuning: learning rate 3e-4 works best",
    "ML pipeline: added cross-validation, scores stable",
    "Fine-tuned GPT on custom Q&A dataset, perplexity 8.2",
    # Database issues (5)
    "PostgreSQL slow query fixed with composite index",
    "Database migration failed: column type mismatch",
    "Redis cache eviction policy changed to allkeys-lru",
    "MongoDB aggregation pipeline for sales reports",
    "SQLAlchemy N+1 query detected, added joinedload",
    # General questions (5)
    "How does DNS resolution work step by step",
    "Compared Docker and Podman for container runtime",
    "Git rebase vs merge: when to use which strategy",
    "Setup CI/CD pipeline with Github Actions",
    "Explained REST vs GraphQL tradeoffs to the team",
]

# Store with staggered timestamps
base_time = time.time()
for i, entry in enumerate(entries):
    memory.entries.append({
        "content": entry,
        "metadata": {},
        "timestamp": base_time - (20 - i) * 10,
    })
memory._rebuild_index()

print(f"Memory count: {memory.count()}")

# (a) Query for Python debugging
print("\n--- Query: 'Python debugging' ---")
results = memory.recall("Python debugging", top_k=3)
all_python = True
for content, score in results:
    is_py = "python" in content.lower()
    if not is_py:
        all_python = False
    print(f"  [{score:.3f}] {content[:60]}...")
print(f"  All Python-related: {all_python}")

# (b) Forget old -- remove first 10
print(f"\n--- Before forget_old: {memory.count()} ---")
# The first 10 entries have timestamps 200-110 seconds ago
removed = memory.forget_old(max_age_seconds=105)
print(f"  Removed: {removed}")
print(f"  After forget_old: {memory.count()}")

# (c) Forget irrelevant for ML context
print(f"\n--- Before forget_irrelevant: {memory.count()} ---")
removed = memory.forget_irrelevant(
    "machine learning experiments", threshold=0.3)
print(f"  Removed: {removed}")
print(f"  After forget_irrelevant: {memory.count()}")
print(f"  Remaining:")
for e in memory.entries:
    print(f"    {e['content'][:60]}...")

Forgetting is just as important as remembering. Without pruning, your memory store grows without bound -- retrieval quality degrades as irrelevant noise accumulates, and storage costs climb. The two strategies complement each other: forget_old handles temporal decay (old debugging sessions from last month probably aren't relevant), while forget_irrelevant does context-aware pruning (when working on ML, database memories are just noise). A real production system would combine both in a periodic maintenance pass.
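The "periodic maintenance pass" mentioned above could look roughly like the sketch below. It is a simplified stand-in, not part of the exercise solution: `maintenance_pass` is a hypothetical helper, and the relevance scores are assumed to be precomputed (in the real VectorMemory they would come from the embedding similarity against the current working context).

```python
import time


def maintenance_pass(entries, relevance, max_age_seconds,
                     threshold, now=None):
    """Apply age-based, then relevance-based pruning in one pass.

    entries:   list of dicts with "content" and "timestamp" keys
    relevance: dict mapping content -> similarity score for the
               current context (assumed precomputed for this sketch)
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_seconds
    kept, dropped_old, dropped_irrelevant = [], 0, 0
    for e in entries:
        if e["timestamp"] <= cutoff:
            dropped_old += 1            # same rule as forget_old
        elif relevance.get(e["content"], 0.0) < threshold:
            dropped_irrelevant += 1     # same rule as forget_irrelevant
        else:
            kept.append(e)
    return kept, dropped_old, dropped_irrelevant


# Toy data: timestamps are relative to a fixed, made-up "now"
now = 1_000.0
entries = [
    {"content": "old python fix", "timestamp": now - 500},
    {"content": "recent db index", "timestamp": now - 50},
    {"content": "recent ml run", "timestamp": now - 20},
]
relevance = {"old python fix": 0.9, "recent db index": 0.1,
             "recent ml run": 0.8}

kept, n_old, n_irrelevant = maintenance_pass(
    entries, relevance, max_age_seconds=100, threshold=0.3, now=now)
print([e["content"] for e in kept], n_old, n_irrelevant)
```

Note that the old Python fix is dropped even though it scores high on relevance: age is checked first, which is a design choice you may want to flip depending on the application.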

On to today's episode

Here we go! Over the past two episodes we built AI agents from scratch -- single agents with tools, multi-agent systems with hierarchies, memory, error recovery, guardrails. All of that assumed you're working with a pre-trained model that already knows how to do things. But what happens when the base model doesn't know your things?

Your company's coding standards, your medical terminology, your legal document format, the way your customers phrase support tickets -- none of that is in the base model's training data. You've got two options: elaborate prompt engineering (episode #62) with examples and instructions crammed into the system prompt, or actually teaching the model your domain through fine-tuning. Both have their place, and picking the wrong one wastes time and money.

The question isn't "can I fine-tune?" -- it's "should I?" ;-)

When fine-tuning beats prompting (and when it doesn't)

This is the decision framework I use, and I'd argue it's the most important thing in this entire episode. Getting this wrong means you either burn weeks fine-tuning when a 5-line system prompt would have worked, or you keep jamming 4000-token prompts into every API call when a small fine-tuned model would be cheaper and faster.

Choose prompting when: you have fewer than 50 examples, your task can be described in a few sentences, the base model already performs reasonably, or you need to iterate quickly. Prompting is cheap, fast, and flexible. You can change the behavior in seconds by editing the prompt.

Choose fine-tuning when: you have hundreds or thousands of examples, you need consistent formatting or style, the task requires specialized knowledge the base model lacks, you need lower latency (because you can use shorter prompts), or you're spending too much on long system prompts that get sent with every single request. Fine-tuning bakes knowledge into the model weights -- it doesn't need to be reminded every time.

Choose RAG (episode #64) when: the knowledge changes frequently, you need citations and source attribution, or the information volume exceeds what fine-tuning can absorb. RAG retrieves current information; fine-tuning remembers static knowledge.

# Decision matrix -- when to use what
decisions = [
    {
        "scenario": "Customer support bot, 30 example conversations",
        "approach": "Prompting",
        "reason": "Too few examples for fine-tuning. Few-shot "
                  "prompting with 5-10 examples works here.",
    },
    {
        "scenario": "Medical report formatter, 2000 labeled examples",
        "approach": "Fine-tuning",
        "reason": "Specialized formatting + domain knowledge + "
                  "consistent output = fine-tuning territory.",
    },
    {
        "scenario": "Company knowledge base Q&A, docs change weekly",
        "approach": "RAG",
        "reason": "Changing knowledge. Fine-tuning would be stale "
                  "by next week. RAG retrieves current docs.",
    },
    {
        "scenario": "Code review assistant, 5000 PR review examples",
        "approach": "Fine-tuning + RAG",
        "reason": "Fine-tune for your team's style and standards. "
                  "RAG for current codebase context.",
    },
    {
        "scenario": "Translating UI strings to 3 languages",
        "approach": "Prompting (or just use an API)",
        "reason": "Base models already translate well. Don't "
                  "fine-tune what the model already handles.",
    },
]

print(f"{'Scenario':<52} {'Approach':<20}")
print("-" * 74)
for d in decisions:
    print(f"{d['scenario']:<52} {d['approach']:<20}")
    print(f"  -> {d['reason'][:68]}")

In practice, many production systems combine all three. A fine-tuned model with domain knowledge, a RAG system for current information, and prompt engineering for task-specific instructions layered on top. Each technique addresses a different axis: prompting controls how the model responds, fine-tuning controls what the model knows, RAG controls what information is available at query time.
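To make the rules of thumb above concrete, here is a small heuristic sketch. `recommend_approach` and its parameters are hypothetical names, and the 200-example cut-off is my own stand-in for "hundreds of examples", so treat it as a starting point for discussion rather than a rule.

```python
def recommend_approach(n_examples, knowledge_changes_often=False,
                       needs_citations=False,
                       base_model_good_enough=False,
                       min_finetune_examples=200):
    """Rough heuristic: prompting vs fine-tuning, plus RAG if needed.

    min_finetune_examples is an assumed threshold, not a hard limit.
    """
    if base_model_good_enough or n_examples < min_finetune_examples:
        approaches = ["Prompting"]
    else:
        approaches = ["Fine-tuning"]
    # RAG is orthogonal: it is about how fresh and traceable the
    # knowledge must be, not about how many examples you have
    if knowledge_changes_often or needs_citations:
        approaches.append("RAG")
    return approaches


scenarios = [
    ("Support bot, 30 conversations", dict(n_examples=30)),
    ("Medical formatter, 2000 examples", dict(n_examples=2000)),
    ("Code review, 5000 PRs, live codebase",
     dict(n_examples=5000, knowledge_changes_often=True)),
    ("UI translation, base model already fine",
     dict(n_examples=5000, base_model_good_enough=True)),
]
for name, kwargs in scenarios:
    print(f"{name:<42} -> {' + '.join(recommend_approach(**kwargs))}")
```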

Full fine-tuning: the expensive option

Full fine-tuning updates every parameter in the model. For a 7B parameter model, that means adjusting 7 billion numbers simultaneously. The math is straightforward but the resource requirements are not.

A 7B model in float16 takes roughly 14GB of VRAM just to load. During training, you also need: optimizer states (another 28GB for Adam, which maintains both first and second moment estimates per parameter, and more if those are kept in float32), gradients (14GB), and activations (variable, often 10-20GB depending on batch size and sequence length). That adds up to roughly 65-75GB -- nowhere near what a single high-end consumer GPU (24GB) can hold.

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.1-8B"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Full fine-tuning: every parameter is trainable
trainable = sum(p.numel() for p in model.parameters()
                if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"Trainable: {trainable:,} / {total:,} "
      f"({100 * trainable / total:.1f}%)")
# Trainable: 8,030,261,248 / 8,030,261,248 (100.0%)

# Memory estimate (rough)
bytes_per_param = 2  # float16
model_size_gb = total * bytes_per_param / 1e9
optimizer_gb = model_size_gb * 2  # Adam: 2x for moments
grad_gb = model_size_gb
print(f"\nMemory estimates:")
print(f"  Model weights:    {model_size_gb:.1f} GB")
print(f"  Optimizer states: {optimizer_gb:.1f} GB")
print(f"  Gradients:        {grad_gb:.1f} GB")
print(f"  Activations:      ~10-20 GB (varies)")
print(f"  Total:            ~{model_size_gb + optimizer_gb + grad_gb + 15:.0f} GB")

Full fine-tuning gives you maximum flexibility -- the model can learn entirely new behaviors, pick up new languages, or shift its personality completely. But the downsides are serious: catastrophic forgetting (the model loses general capabilities as it specializes -- it gets really good at your task but forgets how to do other things), high compute cost, and the need to store a complete copy of the entire model for each fine-tuned variant. If you need 10 domain-specific models, that's 10 x 14GB = 140GB of storage just for the weights.

Is there a way to get most of the benefit with a fraction of the cost? Glad you asked ;-)

LoRA: the revolution in efficient fine-tuning

Low-Rank Adaptation (Hu et al., 2021) is one of those papers that changed everything with a deceptively simple insight: when you fine-tune a large model, the weight updates have low intrinsic rank. You don't need to modify all 7 billion parameters -- most of the "movement" during fine-tuning can be captured by a much smaller set of parameters.

The idea: instead of updating a weight matrix W directly, decompose the update into two small matrices. If W is a d x d matrix (say 4096 x 4096), the update delta-W can be approximated as the product BA of two matrices: B (d x r) and A (r x d), where r is the "rank" -- typically 8, 16, or 32. The original weights stay frozen. Only A and B get trained.

Original forward pass:  y = Wx
LoRA forward pass:      y = Wx + BAx     (B is d x r, A is r x d)

The parameter savings are dramatic:

import torch
import torch.nn as nn

# Parameter comparison
d = 4096  # typical hidden dimension for 7B model

# Full fine-tuning
full_params = d * d
print(f"Full fine-tuning: {full_params:,} parameters per layer")

# LoRA at various ranks
for rank in [4, 8, 16, 32, 64]:
    lora_params = d * rank + rank * d  # A + B matrices
    pct = lora_params / full_params * 100
    print(f"LoRA rank {rank:>2}: {lora_params:>10,} params "
          f"({pct:.2f}% of full)")

Now let's build a LoRA layer from scratch so you can see exactly what's going on under the hood (this is what we do in this series -- build first, use libraries second):

class LoRALayer(nn.Module):
    """Low-Rank Adaptation layer."""

    def __init__(self, original_layer, rank=16, alpha=32):
        super().__init__()
        self.original = original_layer
        for p in self.original.parameters():
            p.requires_grad = False  # freeze weight (and bias)!

        d_in = original_layer.in_features
        d_out = original_layer.out_features

        # Low-rank update, stored for row-vector inputs: x @ A @ B
        # (the transpose of the B @ A product in y = Wx + BAx)
        # A: initialized with small random Gaussian values
        # B: initialized to zeros (so the LoRA update starts at zero)
        self.A = nn.Parameter(
            torch.randn(d_in, rank) * 0.01)
        self.B = nn.Parameter(
            torch.zeros(rank, d_out))

        # Scaling factor controls adaptation strength
        self.scaling = alpha / rank

    def forward(self, x):
        # Original (frozen) output
        original_out = self.original(x)
        # Low-rank adaptation
        lora_out = (x @ self.A @ self.B) * self.scaling
        return original_out + lora_out


# Demo: wrap a linear layer with LoRA
original = nn.Linear(4096, 4096, bias=False)
lora = LoRALayer(original, rank=16, alpha=32)

# Count parameters
frozen = sum(p.numel() for p in lora.parameters()
             if not p.requires_grad)
trainable = sum(p.numel() for p in lora.parameters()
                if p.requires_grad)
print(f"Frozen:    {frozen:>12,}")
print(f"Trainable: {trainable:>12,}")
print(f"Ratio:     {trainable / (frozen + trainable) * 100:.3f}%")

# Verify output shape
x = torch.randn(2, 128, 4096)  # batch=2, seq=128, dim=4096
y = lora(x)
print(f"\nInput:  {x.shape}")
print(f"Output: {y.shape}")

Notice that B is initialized to zeros. This is critical -- it means the LoRA layer starts with zero modification to the original model. At the very beginning of training, the output is exactly what the pre-trained model would produce. Training then gradually adjusts A and B to adapt the model's behavior. The alpha/rank scaling controls how much influence the LoRA adaptation has on the output.
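The zero-initialization argument is easy to check by hand on a toy example. The sketch below uses plain Python lists and made-up 3 x 3 numbers, following the y = Wx + BAx convention (B is d x r, A is r x d); it only illustrates the arithmetic and is not how you would implement LoRA in practice.

```python
def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def lora_forward(W, A, B, x, alpha, rank):
    """y = Wx + (alpha / rank) * B(Ax), with W kept frozen."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))   # passes through an r-dim bottleneck
    scale = alpha / rank
    return [b + scale * u for b, u in zip(base, update)]


# Toy sizes: d = 3, rank = 2 (hypothetical values)
W = [[1.0, 0.0, 2.0],
     [0.0, 1.0, 0.0],
     [3.0, 0.0, 1.0]]
A = [[0.1, -0.2, 0.3],      # r x d, small non-zero values
     [0.0, 0.5, -0.1]]
B_zero = [[0.0, 0.0],       # d x r, zero-initialized
          [0.0, 0.0],
          [0.0, 0.0]]
x = [1.0, 2.0, 3.0]

y_base = matvec(W, x)
y_init = lora_forward(W, A, B_zero, x, alpha=4, rank=2)
print("Same as base model at init:", y_base == y_init)

# Once training has moved B away from zero, the output shifts
B_trained = [[0.5, 0.0],
             [0.0, 0.0],
             [0.0, 1.0]]
y_trained = lora_forward(W, A, B_trained, x, alpha=4, rank=2)
print("Base output:   ", y_base)
print("Adapted output:", y_trained)
```

Because B is all zeros at the start, A can hold any values without changing the output, which is why only one of the two matrices needs to start at zero.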

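In practice, LoRA is most often applied to the attention projection matrices (query, key, value, and output projections), which is what the original LoRA paper did. The feed-forward layers can be adapted as well: that costs more trainable parameters, and the QLoRA paper reports that adapting all linear layers was needed to match full fine-tuning, so treat attention-only as a cheap starting point rather than a rule.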

from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=16,                          # rank
    lora_alpha=32,                 # scaling (alpha/r)
    target_modules=[
        "q_proj", "k_proj",       # attention projections
        "v_proj", "o_proj",
    ],
    lora_dropout=0.05,             # regularization
    bias="none",                   # don't train biases
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# trainable params: 13,631,488 || all params: 8,043,892,736
# || trainable%: 0.1695

0.17% of parameters. Training requires a fraction of the memory, runs significantly faster, and the LoRA weights themselves are tiny -- a few dozen MB instead of several GB per fine-tuned variant. You can store hundreds of domain-specific adaptations and hot-swap them at inference time by loading different LoRA weights on top of the same frozen base model. That's a massive practical advantage.
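The storage claim follows directly from the parameter counts printed earlier. A quick back-of-the-envelope check, assuming 16-bit storage (2 bytes per parameter) for both the base model and the adapters:

```python
BASE_PARAMS = 8_030_261_248    # from the full fine-tuning printout
ADAPTER_PARAMS = 13_631_488    # from print_trainable_parameters()
BYTES_PER_PARAM = 2            # assumption: 16-bit storage


def size_gb(n_params, bytes_per_param=BYTES_PER_PARAM):
    return n_params * bytes_per_param / 1e9


def storage_gb(n_variants, use_lora):
    """Total storage for n fine-tuned variants."""
    if use_lora:
        # one shared frozen base + one small adapter per variant
        return (size_gb(BASE_PARAMS)
                + n_variants * size_gb(ADAPTER_PARAMS))
    return n_variants * size_gb(BASE_PARAMS)


print(f"One adapter: {size_gb(ADAPTER_PARAMS) * 1000:.1f} MB")
for n in (1, 10, 100):
    print(f"{n:>3} variants: full = {storage_gb(n, False):8.1f} GB, "
          f"LoRA = {storage_gb(n, True):6.1f} GB")
```

With these numbers, a hundred LoRA variants plus the shared base take less space than two full copies of the model.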

QLoRA: fine-tuning on consumer hardware

QLoRA (Dettmers et al., 2023) combines LoRA with quantization. The base model is loaded in 4-bit precision, drastically reducing memory requirements. A 7B model that normally needs 14GB in float16 fits in roughly 4GB in 4-bit. Add LoRA adapters (which are still trained in 16-bit precision for numerical stability), and you can fine-tune a 7B model on a single GPU with 8GB VRAM.

from transformers import BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",            # normalized float4
    bnb_4bit_compute_dtype=torch.bfloat16, # compute in bf16
    bnb_4bit_use_double_quant=True,        # quantize the
                                           # quantization
                                           # constants too
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    quantization_config=bnb_config,
    device_map="auto",
)

# Apply LoRA on top of the quantized model
model = get_peft_model(model, lora_config)

# Memory comparison
print("QLoRA memory breakdown (approximate):")
print(f"  Base model (4-bit):  ~4 GB")
print(f"  LoRA adapters (fp16): ~0.03 GB")
print(f"  Optimizer (Adam):     ~0.06 GB")
print(f"  Gradients + acts:     ~2-4 GB")
print(f"  Total:                ~6-8 GB")
print(f"\nCompare to full fine-tuning: ~70 GB")
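The weight-memory figures for the quantized base model can be sanity-checked with simple arithmetic. This sketch counts the weights only; `weight_memory_gb` is a hypothetical helper, and it deliberately ignores quantization constants and runtime overhead, which is why real numbers land a bit higher.

```python
def weight_memory_gb(n_params, bits_per_param):
    """Memory for the weights alone (no constants, no overhead)."""
    return n_params * bits_per_param / 8 / 1e9


models = [("7B (generic)", 7e9),
          ("Llama-3.1-8B", 8_030_261_248)]
for label, n_params in models:
    fp16 = weight_memory_gb(n_params, 16)
    four_bit = weight_memory_gb(n_params, 4)
    print(f"{label:<14} 16-bit: {fp16:5.1f} GB   "
          f"4-bit: {four_bit:4.1f} GB")
```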

The "double quantization" deserves a moment: even the quantization scaling factors (which are normally stored in float32) get quantized to 8-bit floats, squeezing out extra memory savings. It sounds like an accounting trick, but at scale it adds up -- the QLoRA paper puts the saving at about 0.37 bits per parameter, which is a few hundred MB for a 7B model and meaningful on a machine with 8GB total.

The NF4 (Normalized Float 4-bit) data type is specifically designed for neural network weights, which tend to follow a normal distribution. NF4 allocates more quantization levels near zero (where most weights cluster) and fewer at the tails, giving better precision than a uniform 4-bit quantization would.
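To see what "more levels near zero" means, the sketch below places 16 levels at evenly spaced quantiles of a standard normal distribution and compares their spacing with a uniform 4-bit grid. This is a simplified illustration of the idea only: it is not the exact NF4 code book used by bitsandbytes, which is constructed somewhat differently (for one, it contains an exact zero).

```python
from statistics import NormalDist


def quantile_levels(n_levels=16):
    """Levels at evenly spaced quantiles of N(0, 1), scaled to [-1, 1]."""
    nd = NormalDist()
    raw = [nd.inv_cdf((i + 0.5) / n_levels) for i in range(n_levels)]
    top = max(abs(v) for v in raw)
    return [v / top for v in raw]


def uniform_levels(n_levels=16):
    """Evenly spaced levels on [-1, 1]."""
    step = 2 / (n_levels - 1)
    return [-1 + i * step for i in range(n_levels)]


def gaps(levels):
    """Distance between each pair of neighbouring levels."""
    return [b - a for a, b in zip(levels, levels[1:])]


q, u = quantile_levels(), uniform_levels()
q_gaps, u_gaps = gaps(q), gaps(u)
print("quantile levels:", [round(v, 3) for v in q])
print(f"gap around zero: {q_gaps[7]:.3f}")
print(f"gap at the tail: {q_gaps[-1]:.3f}")
print(f"uniform gap:     {u_gaps[0]:.3f}")
```

The quantile-based grid is finer than the uniform one around zero and coarser at the tails, which matches where normally distributed weights actually sit.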

In practice, QLoRA results are very close to full 16-bit fine-tuning quality. The Dettmers et al. paper reported that 4-bit QLoRA matched its 16-bit fine-tuning baselines on the benchmarks it tested, and that the memory savings were enough to fine-tune a 65B model on a single 48GB GPU. That's wild: you're training adapters that amount to roughly 0.17% of the parameters on top of a frozen 4-bit base and getting essentially the same result. This is what made fine-tuning accessible to individual practitioners with a single gaming GPU, rather than something only organizations with clusters of A100s could do.

Dataset preparation: garbage in, garbage out

Your fine-tuning dataset matters more than any hyperparameter. More than the learning rate, more than the rank, more than the number of epochs. A model trained on poor data will confidently produce poor outputs -- and it'll do so in the exact style of your poor data, which makes it even harder to diagnose.

For instruction fine-tuning (teaching the model to follow specific kinds of instructions):

# Standard instruction format
training_examples = [
    {
        "instruction": "Summarize this legal contract clause "
                       "in plain English.",
        "input": "The Licensee shall indemnify and hold harmless "
                 "the Licensor from any claims arising from...",
        "output": "You (the licensee) agree to protect the "
                  "licensor from any legal claims that come "
                  "from your use of the licensed material."
    },
    {
        "instruction": "Convert this Python function to use "
                       "type hints.",
        "input": ("def calculate_total(items, tax_rate):\n"
                  "    subtotal = sum(i['price'] for i in items)\n"
                  "    return subtotal * (1 + tax_rate)"),
        "output": ("def calculate_total(\n"
                   "    items: list[dict[str, float]],\n"
                   "    tax_rate: float,\n"
                   ") -> float:\n"
                   "    subtotal = sum(i['price'] for i in items)\n"
                   "    return subtotal * (1 + tax_rate)")
    },
]

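def format_for_training(example, tokenizer):
    """Convert instruction/input/output to training format."""
    prompt = f"### Instruction:\n{example['instruction']}\n\n"
    if example.get("input"):
        prompt += f"### Input:\n{example['input']}\n\n"
    # End with EOS so the model learns where a response stops
    prompt += (f"### Response:\n{example['output']}"
               f"{tokenizer.eos_token}")

    # Llama tokenizers ship without a pad token; reuse EOS so that
    # padding="max_length" does not raise an error
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token

    tokens = tokenizer(
        prompt,
        truncation=True,
        max_length=2048,
        padding="max_length",
    )
    # Labels = input_ids (the model learns to predict the full
    # sequence, including the response). Padding positions are set
    # to -100 so the loss ignores them.
    tokens["labels"] = [
        tok if mask == 1 else -100
        for tok, mask in zip(tokens["input_ids"],
                             tokens["attention_mask"])
    ]
    return tokens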

# Show what the training data looks like
for ex in training_examples:
    print(f"Instruction: {ex['instruction'][:50]}...")
    print(f"Input:       {ex.get('input', 'N/A')[:50]}...")
    print(f"Output:      {ex['output'][:50]}...")
    print()

Quality rules (these are NOT optional -- I've seen fine-tuning runs fail because of sloppy data more times than I can count):

Consistency: all examples should follow the same format. If some have trailing whitespace, others have special tokens, and a few use different instruction phrasing, the model learns noise instead of patterns.
Diversity: cover the full range of inputs the model will see in production. If you only include easy examples, the model won't know what to do with hard ones. Include edge cases.
Correctness: every output must be exactly what you want the model to produce. One wrong example teaches a persistent bad habit. With 500 training examples, each one has outsized influence.
Volume: for LoRA fine-tuning, 500-1000 high-quality examples often suffice. More data helps, but quality trumps quantity by a dramatic margin. 300 perfect examples beat 5000 noisy ones.
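Some of these rules can be checked mechanically before you spend any GPU time. The sketch below is a minimal, hypothetical linter (`validate_examples` is not from any library) that flags empty fields, stray whitespace, and duplicate instruction/input pairs. It says nothing about correctness or diversity, which still need a human eye.

```python
def validate_examples(examples, required=("instruction", "output")):
    """Return (index, problem) pairs for suspicious examples."""
    problems = []
    seen = set()
    for i, ex in enumerate(examples):
        for field in required:
            value = ex.get(field) or ""
            if not value.strip():
                problems.append((i, f"missing or empty '{field}'"))
            elif value != value.strip():
                problems.append((i, f"stray whitespace in '{field}'"))
        # Duplicates are compared after stripping whitespace
        key = ((ex.get("instruction") or "").strip(),
               (ex.get("input") or "").strip())
        if key in seen:
            problems.append((i, "duplicate instruction/input pair"))
        seen.add(key)
    return problems


sample = [
    {"instruction": "Summarize this clause.",
     "input": "The Licensee shall...",
     "output": "You agree to..."},
    {"instruction": "Summarize this clause. ",   # trailing space
     "input": "The Licensee shall...",           # and a duplicate
     "output": "You agree to..."},
    {"instruction": "Add type hints.",
     "input": "def f(x): return x",
     "output": ""},                              # empty output
]
found = validate_examples(sample)
for idx, problem in found:
    print(f"example {idx}: {problem}")
```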

A common mistake: fine-tuning on data that the base model already handles well. If GPT-4 already writes good Python, fine-tuning a smaller model on GPT-4-generated Python code mostly teaches the smaller model to imitate GPT-4's formatting quirks, not to actually write better code. Fine-tune on data that represents the gap between what the model can do now and what you need it to do.

The training loop

With LoRA configured and a clean dataset, the actual training is surprisingly straightforward. We've built training loops from scratch in earlier episodes (remember episode #7?), and the Hugging Face Trainer wraps all of that into a few lines:

from transformers import TrainingArguments, Trainer
from datasets import load_dataset

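# Load and split dataset
dataset = load_dataset("json", data_files="training_data.json")
dataset = dataset["train"].train_test_split(test_size=0.1)
# Note: tokenize both splits with the formatting function from the
# previous section (via dataset.map) before handing them to Trainer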

training_args = TrainingArguments(
    output_dir="./lora-output",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,    # effective batch = 16
    learning_rate=2e-4,               # higher than full FT!
    warmup_ratio=0.03,
    lr_scheduler_type="cosine",
    logging_steps=10,
    eval_strategy="steps",
    eval_steps=50,
    save_strategy="steps",
    save_steps=50,
    bf16=True,                        # bfloat16 training
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
)

trainer.train()

Key hyperparameters for LoRA specifically:

Learning rate: 1e-4 to 3e-4. Higher than full fine-tuning (which typically uses 1e-5 to 5e-5) because you're updating fewer parameters. Each parameter update needs to "count for more" since there are so few of them.
Epochs: 2-5. More epochs with less data, fewer with more. Watch the eval loss -- if it starts climbing while train loss keeps dropping, you're overfitting.
Gradient accumulation: simulates larger batch sizes when GPU memory is limited. batch_size=4 with accumulation_steps=4 behaves like batch_size=16 for gradient updates.
Warmup: 3-5% of total steps. The learning rate ramps up gradually, so the first updates to the freshly initialized adapters stay small and don't abruptly pull the model's outputs away from what the (frozen) base model produces.
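
To make the batch-size and warmup numbers concrete, here is a small sketch of the arithmetic behind the TrainingArguments shown earlier. The dataset size of 900 training examples is a made-up figure (think 1000 examples with 10% held out), and the helper name training_schedule is hypothetical. The Trainer does its own rounding internally, so treat the result as an approximation that may be off by a step.

```python
import math


def training_schedule(num_examples, per_device_batch_size,
                      accumulation_steps, epochs, warmup_ratio,
                      num_devices=1):
    """Approximate optimizer-step counts for a training run."""
    # One optimizer step consumes this many examples
    effective_batch = (per_device_batch_size
                       * accumulation_steps * num_devices)
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    total_steps = steps_per_epoch * epochs
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    return {
        "effective_batch": effective_batch,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "warmup_steps": warmup_steps,
    }


# Assumed: 900 training examples, other values as in TrainingArguments
schedule = training_schedule(
    num_examples=900, per_device_batch_size=4,
    accumulation_steps=4, epochs=3, warmup_ratio=0.03)

for name, value in schedule.items():
    print(f"{name}: {value}")
```

With a dataset this small the whole run is well under 200 optimizer steps, which is why eval_steps=50 only gives you a handful of evaluation points to judge overfitting from.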

Saving and loading LoRA adapters

One of the practical advantages of LoRA: the adapters are tiny and self-contained. You save just the trained A and B matrices, not the entire model.

# Save the LoRA adapter (just the trained params)
model.save_pretrained("./my-lora-adapter")

# What got saved?
import os
adapter_dir = "./my-lora-adapter"
for f in os.listdir(adapter_dir):
    size = os.path.getsize(os.path.join(adapter_dir, f))
    print(f"  {f}: {size / 1e6:.1f} MB")
# adapter_model.safetensors: ~55 MB (vs 16 GB for full model)
# adapter_config.json: tiny config file

# Load base model + adapter for inference
from peft import PeftModel
from transformers import AutoModelForCausalLM

base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    device_map="auto",
)
model_with_adapter = PeftModel.from_pretrained(
    base_model, "./my-lora-adapter")

# Switch between adapters on the same base model
# (hot-swapping for multi-tenant serving)
model_with_adapter.load_adapter("./legal-adapter", "legal")
model_with_adapter.load_adapter("./medical-adapter", "medical")
model_with_adapter.set_adapter("legal")   # switch to legal
model_with_adapter.set_adapter("medical") # switch to medical

This is a huge deal for serving: one base model in memory, multiple LoRA adapters loaded on demand. A multi-tenant SaaS application can serve 50 different customers, each with their own fine-tuned behavior, using only slightly more memory than serving a single model. Compare that to full fine-tuning where each customer needs their own complete copy of the model.
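
A rough back-of-the-envelope sketch of that comparison, using the figures from the save example above (about 55 MB per adapter, about 16 GB for the full model). It only counts weights and ignores activations and the KV cache, so it is an estimate rather than a measurement. The function name serving_memory_gb is made up for illustration.

```python
def serving_memory_gb(num_tenants, base_model_gb, adapter_mb):
    """Weight memory for shared-base serving vs one full copy each."""
    # One base model plus one small adapter per tenant
    shared = base_model_gb + num_tenants * adapter_mb / 1000
    # Full fine-tuning: every tenant needs a complete model copy
    separate = num_tenants * base_model_gb
    return shared, separate


shared_gb, separate_gb = serving_memory_gb(
    num_tenants=50, base_model_gb=16, adapter_mb=55)

print(f"Base + 50 adapters:   {shared_gb:.2f} GB")
print(f"50 full model copies: {separate_gb:.0f} GB")
```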

Evaluation: did it actually help?

Fine-tuning without evaluation is guesswork. You need to measure whether the fine-tuned model actually improved on your target task -- and whether it regressed on general tasks you still care about.

def evaluate_fine_tuned(model, tokenizer, test_set):
    """Score one model on the test examples.

    Run it on the base model and on the fine-tuned model,
    then compare the two summaries.
    """
    results = []

    for example in test_set:
        # Same prompt format as in training, including the
        # optional input field
        prompt = f"### Instruction:\n{example['instruction']}\n\n"
        if example.get("input"):
            prompt += f"### Input:\n{example['input']}\n\n"
        prompt += "### Response:\n"

        # Move the inputs to the device the model lives on
        inputs = tokenizer(
            prompt, return_tensors="pt").to(model.device)
        # Greedy decoding keeps the scores reproducible
        outputs = model.generate(
            **inputs, max_new_tokens=256, do_sample=False,
        )
        generated = tokenizer.decode(
            outputs[0][inputs.input_ids.shape[1]:],
            skip_special_tokens=True,
        )

        # Simple exact-match and length-based scoring
        expected = example["output"]
        exact_match = generated.strip() == expected.strip()
        len_ratio = min(len(generated), len(expected)) / \
                    max(len(generated), len(expected), 1)

        results.append({
            "instruction": example["instruction"][:50],
            "exact_match": exact_match,
            "length_ratio": len_ratio,
            "generated_preview": generated[:80],
        })

    # Summary
    n = len(results)
    if n == 0:
        print("No test examples to evaluate.")
        return results
    exact = sum(r["exact_match"] for r in results)
    avg_len = sum(r["length_ratio"] for r in results) / n
    print(f"Evaluation on {n} examples:")
    print(f"  Exact match: {exact}/{n} ({exact/n*100:.1f}%)")
    print(f"  Avg length ratio: {avg_len:.2f}")
    return results

LLM-as-judge: use a stronger model to evaluate the fine-tuned model's outputs against reference answers. This tends to correlate well with human judgment and scales much better than manual evaluation, though it is still worth spot-checking a sample of the judge's ratings by hand. You can set up a "judge prompt" that asks the evaluator to rate responses on a 1-5 scale across dimensions like accuracy, completeness, and formatting.

def llm_judge_eval(outputs, references, judge_fn):
    """Use a stronger LLM to judge fine-tuned model outputs."""
    scores = []

    for output, reference in zip(outputs, references):
        judge_prompt = f"""Rate this model output from 1-5 on:
- Accuracy (does it match the reference?)
- Completeness (does it cover all points?)
- Format (does it follow the expected structure?)

Reference answer: {reference}
Model output: {output}

Return 3 scores as: accuracy,completeness,format"""

        # judge_fn wraps the call to the judge model
        # (GPT-4 / Claude in production, a stub when testing)
        # and returns its raw text reply
        judgment = judge_fn(judge_prompt)
        scores.append(judgment)

    return scores

# The before-and-after comparison is the most important test
def before_after_comparison(
    base_model, finetuned_model, test_prompts
):
    """Run same prompts on both models, compare."""
    print(f"{'Prompt':<40} {'Base':>8} {'FT':>8} {'Delta':>8}")
    print("-" * 68)

    for prompt_text, metric_fn in test_prompts:
        base_score = metric_fn(base_model, prompt_text)
        ft_score = metric_fn(finetuned_model, prompt_text)
        delta = ft_score - base_score
        # Width 8 matches the header; "+" always shows the sign
        print(f"{prompt_text[:40]:<40} {base_score:>8.2f} "
              f"{ft_score:>8.2f} {delta:>+8.2f}")
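
The judge prompt above asks for three comma-separated scores, but llm_judge_eval stores whatever text comes back. Judges do not always follow the requested format, so a small parser helps before averaging anything. This is a minimal sketch; the name parse_judge_scores and the choice to return None on malformed replies are assumptions for illustration, not part of any library.

```python
def parse_judge_scores(reply):
    """Parse 'accuracy,completeness,format' into a dict.

    Returns None when the reply does not contain exactly three
    integer scores between 1 and 5.
    """
    valid = {"1", "2", "3", "4", "5"}
    parts = [p.strip() for p in reply.strip().split(",")]
    if len(parts) != 3 or any(p not in valid for p in parts):
        return None
    names = ("accuracy", "completeness", "format")
    return dict(zip(names, (int(p) for p in parts)))


# Hypothetical judge replies: one well-formed, two malformed
for reply in ["4, 5,3", "Looks good to me!", "4,6,3"]:
    print(repr(reply), "->", parse_judge_scores(reply))
```

Counting how often the parser returns None is itself a useful signal: a high failure rate usually means the judge prompt needs tightening.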

Watch for regression: fine-tuning improves target task performance but can degrade general capabilities. Always test on BOTH your target task AND a set of general-purpose prompts. If the model now writes perfect legal documents but can't hold a basic conversation or do simple math anymore, you've overcorrected. This is especially common with small datasets and too many epochs -- the model memorizes your data at the expense of everything else it used to know.
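
One way to make the regression check routine is to score both models on two suites (target task and general prompts) and flag any suite where the mean score drops by more than a tolerance. The sketch below assumes you already have per-prompt scores in the range 0 to 1. The scores, the 0.05 tolerance, and the name check_regression are all made up for illustration.

```python
def check_regression(base_scores, ft_scores, tolerance=0.05):
    """Compare mean scores per suite and flag suites that got worse."""
    report = {}
    for suite in base_scores:
        base_mean = sum(base_scores[suite]) / len(base_scores[suite])
        ft_mean = sum(ft_scores[suite]) / len(ft_scores[suite])
        delta = ft_mean - base_mean
        report[suite] = {
            "delta": delta,
            "regressed": delta < -tolerance,
        }
    return report


# Hypothetical per-prompt scores for each model and suite
base_scores = {
    "target_task": [0.40, 0.50, 0.45],
    "general": [0.80, 0.90, 0.85],
}
ft_scores = {
    "target_task": [0.85, 0.90, 0.80],
    "general": [0.60, 0.70, 0.65],
}

report = check_regression(base_scores, ft_scores)
for suite, info in report.items():
    status = "REGRESSED" if info["regressed"] else "ok"
    print(f"{suite:<12} delta={info['delta']:+.2f}  {status}")
```

In this made-up run the target task improves while the general suite drops well past the tolerance, which is the overcorrection pattern described above.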

Adapter merging and deployment

Once your LoRA adapter is trained and evaluated, you have two deployment options: serve the base model + adapter separately (flexible, supports hot-swapping), or merge the adapter into the base model for a single model file (simpler deployment, slightly faster inference).

# Merge LoRA weights into base model permanently
merged_model = model_with_adapter.merge_and_unload()

# Now it's a regular model -- no adapter overhead
merged_model.save_pretrained("./merged-model")
tokenizer.save_pretrained("./merged-model")

# At inference time, load like any other model
# No PEFT library needed
final_model = AutoModelForCausalLM.from_pretrained(
    "./merged-model", device_map="auto")

Merging is mathematically exact -- the adapter update is simply added to the base weights, so the merged model produces the same outputs as the base + adapter setup (up to floating-point rounding). The trade-off: you lose the ability to hot-swap adapters, but you gain simpler deployment (just one set of model files) and marginally faster inference (no adapter math at runtime). For production serving where you've settled on one fine-tuned variant, merging is usually the right choice.
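
Why merging is exact can be seen on a toy example. The sketch below uses plain Python lists and tiny hand-picked matrices, and assumes the standard LoRA formulation in which the adapter contributes (alpha / rank) * B * A on top of the frozen weight W. All the numbers are made up for illustration; a real layer does the same thing at much larger dimensions.

```python
def matvec(M, v):
    """Matrix-vector product for lists of rows."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def matmul(X, Y):
    """Matrix-matrix product for lists of rows."""
    cols = list(zip(*Y))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols]
            for row in X]


W = [[1.0, 2.0, 0.0],
     [0.0, 1.0, -1.0]]      # frozen base weight (2x3)
A = [[0.5, -1.0, 2.0]]      # LoRA down-projection (rank 1, 1x3)
B = [[2.0], [-0.5]]         # LoRA up-projection (2x1)
alpha, rank = 2, 1
scaling = alpha / rank
x = [1.0, 2.0, 3.0]

# Base + adapter: W x + scaling * B (A x)
low_rank = matvec(B, matvec(A, x))
adapter_out = [w + scaling * l
               for w, l in zip(matvec(W, x), low_rank)]

# Merged: (W + scaling * B A) x
BA = matmul(B, A)
W_merged = [[w + scaling * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, BA)]
merged_out = matvec(W_merged, x)

print(adapter_out)   # [23.0, -5.5]
print(merged_out)    # [23.0, -5.5]
```

The two paths are the same linear map written two ways, which is all that merge_and_unload relies on. With real float16 or bfloat16 weights the results can differ in the last few bits because of rounding.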

The bottom line

Fine-tune when prompting isn't enough: consistent style, domain knowledge, lower latency, or cost reduction on long system prompts. But start with prompting and only fine-tune when you have evidence the base model can't handle it;
Full fine-tuning updates all parameters (expensive, risk of catastrophic forgetting) -- rarely necessary in practice now that parameter-efficient methods exist;
LoRA decomposes weight updates into small low-rank matrices, training 0.1-0.2% of parameters with minimal quality loss. The adapters are tiny, storable, and hot-swappable;
QLoRA adds 4-bit quantization, making 7B+ model fine-tuning possible on consumer GPUs with 8GB VRAM. This is what democratized fine-tuning;
Dataset quality is everything: consistent formatting, diverse inputs, correct outputs, and focus on the gap between base model capability and your specific needs;
Always evaluate before and after: automated metrics, LLM-as-judge, and regression testing on general capabilities. Fine-tuning that improves your task but breaks everything else is not a win;
LoRA adapters can be merged into the base model for simpler deployment, or kept separate for multi-tenant serving with hot-swapping.

We've now covered the full spectrum of working with language models: using them via APIs (episode #66), building agent systems on top of them (#67-68), and customizing them through fine-tuning (today). The next logical step is running models locally on your own hardware -- which opens up a whole new set of considerations around model selection, quantization for inference (as opposed to training), and the tradeoffs between running your own models vs using hosted APIs.

Exercises

Exercise 1: Build a LoRA parameter calculator and comparison tool. Create a function lora_analysis(hidden_dim, num_layers, ranks, target_modules) that calculates for each rank: total LoRA parameters, percentage of full model parameters, estimated memory savings in GB (assume float16 for full, float16 for LoRA), and estimated adapter file size in MB. Test it with a realistic config: hidden_dim=4096, 32 layers, ranks=[4, 8, 16, 32, 64], target_modules=["q_proj", "k_proj", "v_proj", "o_proj"]. Print a formatted comparison table. Then add a find_optimal_rank(budget_mb) function that returns the highest rank that fits within a given storage budget. Test with budgets of 10MB, 50MB, 100MB, and 500MB.

Exercise 2: Implement a dataset quality checker for fine-tuning. Create a DatasetValidator class that takes a list of instruction/input/output examples and runs these checks: (a) all examples have required fields, (b) no duplicate instructions, (c) output is not empty for any example, (d) instruction length distribution is reasonable (flag outliers beyond 2 std devs), (e) output length distribution is reasonable (flag extremely short or long), (f) no format inconsistencies (e.g., some outputs end with period and some don't, some have trailing whitespace). Generate 30 test examples where 5 deliberately have quality issues (missing fields, duplicates, empty outputs, extreme lengths, format inconsistencies). Print a quality report showing issues found, with severity (warning vs error) and the affected example indices.

Exercise 3: Build a fine-tuning experiment tracker. Create a class FTExperimentTracker that logs: hyperparameters (rank, alpha, lr, epochs, batch_size), training metrics over time (loss, eval_loss at each logging step), final evaluation scores (accuracy, F1, or custom metrics), and compute usage (simulated GPU hours, peak memory). Simulate 4 experiments with different LoRA ranks (4, 16, 32, 64) and learning rates. Each experiment should generate 50 training steps with realistic-looking decreasing loss curves (add some noise). After all experiments, print a comparison table ranked by final eval loss, and identify the best configuration. Include a recommend() method that picks the best experiment considering both performance and efficiency (penalize configs that use >2x the parameters of the simplest config for <5% improvement).

Thanks, and until next time!

Hive account: @scipio
