Learn AI Series (#70) - Running Local Models
What will I learn
- You will learn why running models locally matters: privacy, cost, control, and offline capability;
- the local inference stack: Ollama, llama.cpp, and vLLM;
- quantization formats: GGUF, GPTQ, AWQ and what they trade off;
- model selection: which local model for which task;
- hardware realities: what actually matters (VRAM is king);
- running and comparing local models in practice.
Requirements
- A working modern computer running macOS, Windows or Ubuntu;
- An installed Python 3(.11+) distribution;
- The ambition to learn AI and machine learning.
Difficulty
- Beginner
Curriculum (of the Learn AI Series):
- Learn AI Series (#1) - What Machine Learning Actually Is
- Learn AI Series (#2) - Setting Up Your AI Workbench - Python and NumPy
- Learn AI Series (#3) - Your Data Is Just Numbers - How Machines See the World
- Learn AI Series (#4) - Your First Prediction - No Math, Just Intuition
- Learn AI Series (#5) - Patterns in Data - What "Learning" Actually Looks Like
- Learn AI Series (#6) - From Intuition to Math - Why We Need Formulas
- Learn AI Series (#7) - The Training Loop - See It Work Step by Step
- Learn AI Series (#8) - The Math You Actually Need (Part 1) - Linear Algebra
- Learn AI Series (#9) - The Math You Actually Need (Part 2) - Calculus and Probability
- Learn AI Series (#10) - Your First ML Model - Linear Regression From Scratch
- Learn AI Series (#11) - Making Linear Regression Real
- Learn AI Series (#12) - Classification - Logistic Regression From Scratch
- Learn AI Series (#13) - Evaluation - How to Know If Your Model Actually Works
- Learn AI Series (#14) - Data Preparation - The 80% Nobody Talks About
- Learn AI Series (#15) - Feature Engineering and Selection
- Learn AI Series (#16) - Scikit-Learn - The Standard Library of ML
- Learn AI Series (#17) - Decision Trees - How Machines Make Decisions
- Learn AI Series (#18) - Random Forests - Wisdom of Crowds
- Learn AI Series (#19) - Gradient Boosting - The Kaggle Champion
- Learn AI Series (#20) - Support Vector Machines - Drawing the Perfect Boundary
- Learn AI Series (#21) - Mini Project - Predicting Crypto Market Regimes
- Learn AI Series (#22) - K-Means Clustering - Finding Groups
- Learn AI Series (#23) - Advanced Clustering - Beyond K-Means
- Learn AI Series (#24) - Dimensionality Reduction - PCA
- Learn AI Series (#25) - Advanced Dimensionality Reduction - t-SNE and UMAP
- Learn AI Series (#26) - Anomaly Detection - Finding What Doesn't Belong
- Learn AI Series (#27) - Recommendation Systems - "Users Like You Also Liked..."
- Learn AI Series (#28) - Time Series Fundamentals - When Order Matters
- Learn AI Series (#29) - Time Series Forecasting - Predicting What Comes Next
- Learn AI Series (#30) - Natural Language Processing - Text as Data
- Learn AI Series (#31) - Word Embeddings - Meaning in Numbers
- Learn AI Series (#32) - Bayesian Methods - Thinking in Probabilities
- Learn AI Series (#33) - Ensemble Methods Deep Dive - Stacking and Blending
- Learn AI Series (#34) - ML Engineering - From Notebook to Production
- Learn AI Series (#35) - Data Ethics and Bias in ML
- Learn AI Series (#36) - Mini Project - Complete ML Pipeline
- Learn AI Series (#37) - The Perceptron - Where It All Started
- Learn AI Series (#38) - Neural Networks From Scratch - Forward Pass
- Learn AI Series (#39) - Neural Networks From Scratch - Backpropagation
- Learn AI Series (#40) - Training Neural Networks - Practical Challenges
- Learn AI Series (#41) - Optimization Algorithms - SGD, Momentum, Adam
- Learn AI Series (#42) - PyTorch Fundamentals - Tensors and Autograd
- Learn AI Series (#43) - PyTorch Data and Training
- Learn AI Series (#44) - PyTorch nn.Module - Building Real Networks
- Learn AI Series (#45) - Convolutional Neural Networks - Theory
- Learn AI Series (#46) - CNNs in Practice - Classic to Modern Architectures
- Learn AI Series (#47) - CNN Applications - Detection, Segmentation, Style Transfer
- Learn AI Series (#48) - Recurrent Neural Networks - Sequences
- Learn AI Series (#49) - LSTM and GRU - Solving the Memory Problem
- Learn AI Series (#50) - Sequence-to-Sequence Models
- Learn AI Series (#51) - Attention Mechanisms
- Learn AI Series (#52) - The Transformer Architecture (Part 1)
- Learn AI Series (#53) - The Transformer Architecture (Part 2)
- Learn AI Series (#54) - Vision Transformers
- Learn AI Series (#55) - Generative Adversarial Networks
- Learn AI Series (#56) - Mini Project - Building a Transformer From Scratch
- Learn AI Series (#57) - Language Modeling - Predicting the Next Word
- Learn AI Series (#58) - GPT Architecture - Decoder-Only Transformers
- Learn AI Series (#59) - BERT and Encoder Models
- Learn AI Series (#60) - Training Large Language Models
- Learn AI Series (#61) - Instruction Tuning and Alignment
- Learn AI Series (#62) - Prompt Engineering - Getting the Most from LLMs
- Learn AI Series (#63) - Embeddings and Vector Search
- Learn AI Series (#64) - Retrieval-Augmented Generation (RAG) - Basics
- Learn AI Series (#65) - RAG - Advanced Techniques
- Learn AI Series (#66) - Working with LLM APIs
- Learn AI Series (#67) - Building AI Agents (Part 1) - Foundations
- Learn AI Series (#68) - Building AI Agents (Part 2) - Advanced Patterns
- Learn AI Series (#69) - Fine-Tuning Language Models
- Learn AI Series (#70) - Running Local Models (this post)
Solutions to Episode #69 Exercises
Exercise 1: LoRA parameter calculator and comparison tool.
def lora_analysis(hidden_dim, num_layers, ranks, target_modules):
    """Calculate LoRA params, memory, and file size per rank."""
    modules_per_layer = len(target_modules)
    full_params_per_module = hidden_dim * hidden_dim
    total_full = full_params_per_module * modules_per_layer * num_layers

    print(f"Model config: d={hidden_dim}, {num_layers} layers, "
          f"{modules_per_layer} modules/layer")
    print(f"Full model params: {total_full:,}")
    print(f"\n{'Rank':>6} {'LoRA Params':>14} {'% of Full':>10} "
          f"{'Savings (GB)':>13} {'Adapter (MB)':>13}")
    print("-" * 60)

    results = {}
    for r in ranks:
        # Each LoRA module: A is (d_in x r) + B is (r x d_out)
        lora_per_module = hidden_dim * r + r * hidden_dim
        total_lora = lora_per_module * modules_per_layer * num_layers
        pct = total_lora / total_full * 100

        # Memory: full model in fp16 vs LoRA in fp16
        full_gb = total_full * 2 / 1e9
        lora_gb = total_lora * 2 / 1e9
        savings_gb = full_gb - lora_gb

        # Adapter file size (just the LoRA weights)
        adapter_mb = total_lora * 2 / 1e6

        results[r] = {
            "params": total_lora,
            "pct": pct,
            "savings_gb": savings_gb,
            "adapter_mb": adapter_mb,
        }
        print(f"{r:>6} {total_lora:>14,} {pct:>9.3f}% "
              f"{savings_gb:>12.2f} {adapter_mb:>12.1f}")
    return results


def find_optimal_rank(results, budget_mb):
    """Find highest rank that fits within storage budget."""
    best_rank = None
    for rank in sorted(results.keys()):
        if results[rank]["adapter_mb"] <= budget_mb:
            best_rank = rank
    return best_rank


# Test with realistic config
ranks = [4, 8, 16, 32, 64]
target_modules = ["q_proj", "k_proj", "v_proj", "o_proj"]
results = lora_analysis(4096, 32, ranks, target_modules)

# Budget search
print("\nOptimal rank for budget:")
for budget in [10, 50, 100, 500]:
    optimal = find_optimal_rank(results, budget)
    if optimal:
        mb = results[optimal]["adapter_mb"]
        print(f"  {budget:>4} MB budget -> rank {optimal} "
              f"({mb:.1f} MB adapter)")
    else:
        print(f"  {budget:>4} MB budget -> no rank fits")
The key insight is how dramatically LoRA reduces storage. With the config above (four attention projections, d=4096, 32 layers), a rank-16 adapter comes to about 34MB in fp16 -- compare that to roughly 14GB for a full 7B model in fp16. That reduction of around 400x is what makes serving dozens of fine-tuned variants practical. The find_optimal_rank function is a simple greedy search: pick the biggest rank your storage can handle, since higher rank generally means better adaptation quality (up to a point of diminishing returns, usually around rank 32-64).
Exercise 2: Dataset quality checker for fine-tuning.
import statistics


class DatasetValidator:
    """Validate fine-tuning dataset quality."""

    def __init__(self, examples):
        self.examples = examples
        self.issues = []

    def validate(self):
        """Run all quality checks."""
        self.issues = []
        self._check_required_fields()
        self._check_duplicates()
        self._check_empty_outputs()
        self._check_length_outliers("instruction")
        self._check_length_outliers("output")
        self._check_format_consistency()
        return self.issues

    def _check_required_fields(self):
        required = {"instruction", "output"}
        for i, ex in enumerate(self.examples):
            missing = required - set(ex.keys())
            if missing:
                self.issues.append({
                    "severity": "error",
                    "check": "required_fields",
                    "index": i,
                    "detail": f"Missing: {missing}",
                })

    def _check_duplicates(self):
        seen = {}
        for i, ex in enumerate(self.examples):
            inst = ex.get("instruction", "")
            if inst in seen:
                self.issues.append({
                    "severity": "error",
                    "check": "duplicate",
                    "index": i,
                    "detail": f"Duplicate of index {seen[inst]}",
                })
            else:
                seen[inst] = i

    def _check_empty_outputs(self):
        for i, ex in enumerate(self.examples):
            out = ex.get("output", "")
            if not out or not out.strip():
                self.issues.append({
                    "severity": "error",
                    "check": "empty_output",
                    "index": i,
                    "detail": "Output is empty or whitespace",
                })

    def _check_length_outliers(self, field):
        lengths = []
        for ex in self.examples:
            val = ex.get(field, "")
            if val:
                lengths.append(len(val))
        if len(lengths) < 3:
            return
        mean = statistics.mean(lengths)
        stdev = statistics.stdev(lengths)
        if stdev == 0:
            return
        for i, ex in enumerate(self.examples):
            val = ex.get(field, "")
            if val and abs(len(val) - mean) > 2 * stdev:
                self.issues.append({
                    "severity": "warning",
                    "check": f"{field}_length_outlier",
                    "index": i,
                    "detail": (f"Length {len(val)} is >2 std devs "
                               f"from mean {mean:.0f}"),
                })

    def _check_format_consistency(self):
        endings = {"period": 0, "no_period": 0}
        trailing_ws = 0
        for ex in self.examples:
            out = ex.get("output", "")
            if not out:
                continue
            if out.rstrip() != out:
                trailing_ws += 1
            if out.rstrip().endswith("."):
                endings["period"] += 1
            else:
                endings["no_period"] += 1

        total = endings["period"] + endings["no_period"]
        if total > 0:
            minority = min(endings.values())
            if 0 < minority < total * 0.3:
                for i, ex in enumerate(self.examples):
                    out = ex.get("output", "").rstrip()
                    has_period = out.endswith(".")
                    is_minority = (
                        (has_period and endings["period"] < endings["no_period"])
                        or (not has_period and endings["no_period"] < endings["period"])
                    )
                    if is_minority and out:
                        self.issues.append({
                            "severity": "warning",
                            "check": "format_inconsistency",
                            "index": i,
                            "detail": "Ending punctuation differs from majority",
                        })

        if trailing_ws > 0:
            for i, ex in enumerate(self.examples):
                out = ex.get("output", "")
                if out and out.rstrip() != out:
                    self.issues.append({
                        "severity": "warning",
                        "check": "trailing_whitespace",
                        "index": i,
                        "detail": "Output has trailing whitespace",
                    })

    def report(self):
        """Print quality report."""
        issues = self.validate()
        errors = [i for i in issues if i["severity"] == "error"]
        warnings = [i for i in issues if i["severity"] == "warning"]
        print("Dataset Quality Report")
        print(f"  Total examples: {len(self.examples)}")
        print(f"  Errors: {len(errors)}")
        print(f"  Warnings: {len(warnings)}")
        if errors:
            print("\n  ERRORS:")
            for e in errors:
                print(f"    [{e['index']:>3}] {e['check']}: "
                      f"{e['detail']}")
        if warnings:
            print("\n  WARNINGS:")
            for w in warnings:
                print(f"    [{w['index']:>3}] {w['check']}: "
                      f"{w['detail']}")
        return issues


# Generate 30 test examples (5 with deliberate issues)
examples = []
for i in range(25):
    examples.append({
        "instruction": f"Summarize the concept of topic_{i}.",
        "output": f"Topic_{i} is a concept that involves "
                  f"specific principles and applications.",
    })

# Issue 1: missing field
examples.append({"instruction": "Explain gravity."})

# Issue 2: duplicate instruction
examples.append({
    "instruction": "Summarize the concept of topic_0.",
    "output": "Duplicate entry here.",
})

# Issue 3: empty output
examples.append({
    "instruction": "What is entropy?",
    "output": " ",
})

# Issue 4: extreme length
examples.append({
    "instruction": "A" * 2000,
    "output": "Very long instruction above.",
})

# Issue 5: format inconsistency (no period)
examples.append({
    "instruction": "Define neural networks.",
    "output": "Neural networks are computing systems "
              "inspired by biological brains",
})

validator = DatasetValidator(examples)
validator.report()
Garbage in, garbage out. This validator catches the five most common dataset problems before they corrupt your fine-tuning run. The severity distinction matters: errors (missing fields, duplicates, empty outputs) should block training entirely, while warnings (length outliers, format inconsistencies) are worth reviewing but might be intentional. I've seen people burn 4 hours of GPU time on a dataset with duplicate entries that taught the model to repeat itself -- 30 seconds of validation would have caught it.
Exercise 3: Fine-tuning experiment tracker.
import random
import math


class FTExperimentTracker:
    """Track and compare fine-tuning experiments."""

    def __init__(self):
        self.experiments = []

    def log_experiment(self, name, hyperparams, train_log,
                       eval_scores, compute):
        """Log a complete experiment."""
        self.experiments.append({
            "name": name,
            "hyperparams": hyperparams,
            "train_log": train_log,
            "eval_scores": eval_scores,
            "compute": compute,
        })

    def compare(self):
        """Print comparison table ranked by eval loss."""
        sorted_exps = sorted(
            self.experiments,
            key=lambda e: e["train_log"][-1]["eval_loss"])
        print(f"{'Name':<18} {'Rank':>4} {'LR':>8} "
              f"{'Alpha':>6} {'Final Loss':>11} "
              f"{'Eval Loss':>10} {'Params':>10} "
              f"{'GPU-hrs':>8}")
        print("-" * 90)
        for exp in sorted_exps:
            hp = exp["hyperparams"]
            last = exp["train_log"][-1]
            print(f"{exp['name']:<18} {hp['rank']:>4} "
                  f"{hp['lr']:>8.1e} {hp['alpha']:>6} "
                  f"{last['train_loss']:>11.4f} "
                  f"{last['eval_loss']:>10.4f} "
                  f"{hp['total_params']:>10,} "
                  f"{exp['compute']['gpu_hours']:>8.2f}")

    def recommend(self):
        """Pick best experiment considering perf + efficiency."""
        if not self.experiments:
            return None
        sorted_exps = sorted(
            self.experiments,
            key=lambda e: e["train_log"][-1]["eval_loss"])
        best = sorted_exps[0]
        simplest_params = min(
            e["hyperparams"]["total_params"]
            for e in self.experiments)

        # Penalize configs using >2x params of simplest
        # for <5% improvement over next-simplest
        for i, exp in enumerate(sorted_exps):
            params = exp["hyperparams"]["total_params"]
            eval_loss = exp["train_log"][-1]["eval_loss"]
            if params > 2 * simplest_params:
                # Check if the improvement is worth it
                simpler = [e for e in sorted_exps
                           if e["hyperparams"]["total_params"]
                           <= 2 * simplest_params]
                if simpler:
                    simpler_loss = simpler[0]["train_log"][-1]["eval_loss"]
                    improvement = (simpler_loss - eval_loss) / simpler_loss
                    if improvement < 0.05:
                        best = simpler[0]
                        break

        hp = best["hyperparams"]
        last = best["train_log"][-1]
        print(f"\nRecommended: {best['name']}")
        print(f"  Rank: {hp['rank']}, LR: {hp['lr']:.1e}, "
              f"Alpha: {hp['alpha']}")
        print(f"  Eval loss: {last['eval_loss']:.4f}, "
              f"Params: {hp['total_params']:,}")
        return best


# Simulate 4 experiments
tracker = FTExperimentTracker()
random.seed(42)
configs = [
    {"rank": 4, "lr": 2e-4, "alpha": 8, "epochs": 3},
    {"rank": 16, "lr": 2e-4, "alpha": 32, "epochs": 3},
    {"rank": 32, "lr": 1e-4, "alpha": 64, "epochs": 3},
    {"rank": 64, "lr": 1e-4, "alpha": 128, "epochs": 3},
]

for cfg in configs:
    d = 4096
    modules = 4  # q, k, v, o
    layers = 32
    lora_params = 2 * d * cfg["rank"] * modules * layers
    cfg["total_params"] = lora_params

    # Simulate training: decreasing loss with noise
    train_log = []
    base_loss = 2.5 - (cfg["rank"] / 100)
    for step in range(50):
        t = (step + 1) / 50
        decay = base_loss * math.exp(-3 * t)
        noise = random.gauss(0, 0.02)
        train_loss = max(0.1, decay + 0.15 + noise)
        eval_noise = random.gauss(0, 0.03)
        # Higher rank = slightly lower eval loss
        rank_bonus = cfg["rank"] * 0.0003
        eval_loss = max(0.12, train_loss + 0.05
                        - rank_bonus + eval_noise)
        train_log.append({
            "step": step + 1,
            "train_loss": train_loss,
            "eval_loss": eval_loss,
        })

    # Simulated compute
    gpu_hours = 0.5 + cfg["rank"] * 0.03
    tracker.log_experiment(
        name=f"lora_r{cfg['rank']}_lr{cfg['lr']:.0e}",
        hyperparams=cfg,
        train_log=train_log,
        eval_scores={"final_eval_loss": train_log[-1]["eval_loss"]},
        compute={"gpu_hours": gpu_hours, "peak_mem_gb": 6 + cfg["rank"] * 0.1},
    )

tracker.compare()
tracker.recommend()
The recommend() method is where the practical wisdom lives. Raw performance ranking would always pick the biggest model -- more parameters almost always means slightly better eval loss. But the marginal gain from rank 32 to rank 64 is often tiny (less than 5% improvement) while doubling the parameter count, training time, and adapter storage. The penalty function catches this: if a complex config doesn't meaningfully outperform a simpler one, pick the simpler one. In production, "good enough with half the resources" beats "marginally better at double the cost" almost every time.
On to today's episode
Here we go! Over the last few episodes we've gone deep on working with language models from the outside -- API calls (episode #66), building agent systems on top of them (#67, #68), and customizing them through fine-tuning (#69). But in every single one of those scenarios, the model lives on someone else's server. Your prompts travel over the network to a provider, get processed on their hardware, and the response comes back. You're renting intelligence by the token.
That works for a lot of use cases. But every API call is a request you don't fully control. The provider can change pricing tomorrow, add rate limits, modify the model's behavior, discontinue the endpoint, or (depending on their terms of service) peek at your data. For many applications that's an acceptable trade-off. For others -- medical records, proprietary code, financial data, air-gapped environments, or simply keeping your monthly bill predictable -- you want the model running on YOUR hardware, under YOUR control.
And here's what's remarkable: local inference has gotten really good. Models that needed a data center two years ago now run on a laptop. Let me show you how ;-)
Why local?
Four reasons keep coming up when people move to local inference, and they're all legitimate:
Privacy. Your data never leaves your machine. No terms of service granting the provider some vague rights to your inputs. No wondering whether your prompts end up in a training dataset somewhere. For regulated industries (healthcare, finance, legal), local inference can be a hard compliance requirement rather than a preference.
Cost. API calls add up fast. A busy application making thousands of calls per day can run into hundreds or thousands of dollars monthly. A local model has a one-time hardware cost and near-zero marginal cost per query. Once you own the GPU, every inference is essentially free (minus electricity, which is pennies compared to API pricing).
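To make the cost argument concrete, here is a minimal break-even sketch. Every number in it (hardware price, price per call, power draw, electricity rate) is a hypothetical placeholder rather than a quote from any provider, so plug in your own figures.

```python
def breakeven_days(hardware_cost, calls_per_day, api_cost_per_call,
                   power_watts=300.0, hours_per_day=8.0,
                   electricity_per_kwh=0.30):
    """Days until a one-time hardware purchase beats per-call API pricing.

    All defaults are illustrative assumptions, not measured values.
    Returns None when local running costs exceed the API bill.
    """
    api_per_day = calls_per_day * api_cost_per_call
    power_per_day = power_watts / 1000 * hours_per_day * electricity_per_kwh
    daily_saving = api_per_day - power_per_day
    if daily_saving <= 0:
        return None
    return hardware_cost / daily_saving


# Hypothetical scenario: $2,000 machine, 5,000 calls/day at $0.01 per call
days = breakeven_days(2000, 5000, 0.01)
print(f"Break-even after ~{days:.0f} days")

# Low-volume scenario: the API stays cheaper, so there is no break-even
print(breakeven_days(2000, 10, 0.01))
```

The second call is the important sanity check: at low volume the electricity alone can cost more than the API, and the "one-time hardware cost" never pays itself back.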
Control. You pick the model, the version, the quantization level, the serving configuration. No surprise model updates that change output behavior. No dependency on an external service's uptime. Your system works during internet outages. If you need to reproduce results from three months ago, the model file hasn't changed.
Latency. No network round trip. For applications where response time matters (code completion, real-time assistants, interactive tools), local inference on good hardware can be faster than API calls. No waiting for network hops, no queuing behind other users, no provider-side rate limiting.
The tradeoff is capability. Local models are smaller and less capable than the frontier API models. A 7B parameter model running locally won't match GPT-4 or Claude on complex multi-step reasoning tasks. But for focused applications -- code completion, summarization, classification, entity extraction, simple Q&A -- smaller models are often good enough. And "good enough with no network latency and near-zero cost per query" beats "slightly better at $0.01 per call" for many production scenarios.
The inference stack
Three tools dominate local inference right now, and each occupies a different niche:
Ollama is the "just works" option. Install it, pull a model, run it. It handles quantization, memory management, model downloading, and exposes an OpenAI-compatible API. That last part is huge -- if you built API clients in episode #66, your existing code works with Ollama by changing the base URL and the model name. No other code changes are needed to switch between cloud and local inference.
# Install Ollama (macOS/Linux)
curl -fsSL https://ollama.ai/install.sh | sh
# Pull and run a model
ollama pull llama3.1:8b
ollama run llama3.1:8b "Explain gradient descent in one paragraph"
# Or use the API (OpenAI-compatible!)
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.1:8b",
    "messages": [{"role": "user", "content": "Hello"}]
  }'
# Your existing OpenAI client code works unchanged
from openai import OpenAI
# Just point it at Ollama instead of OpenAI
client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="not-needed",  # Ollama doesn't require a key
)
response = client.chat.completions.create(
    model="llama3.1:8b",
    messages=[{"role": "user",
               "content": "What is backpropagation?"}],
    max_tokens=256,
)
print(response.choices[0].message.content)
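If you want to flip between cloud and local without touching the call sites, one option is to build the client settings from an environment variable. This is only a sketch: the variable name LLM_BACKEND and the helper client_settings are made up for illustration; neither Ollama nor the OpenAI SDK defines them.

```python
import os


def client_settings(env=None):
    """Return base_url/api_key kwargs for a local or cloud OpenAI-style client.

    LLM_BACKEND is a hypothetical variable name chosen for this sketch.
    """
    env = os.environ if env is None else env
    if env.get("LLM_BACKEND", "local") == "local":
        # Ollama's OpenAI-compatible endpoint; the key is ignored
        return {"base_url": "http://localhost:11434/v1",
                "api_key": "not-needed"}
    return {"base_url": "https://api.openai.com/v1",
            "api_key": env.get("OPENAI_API_KEY", "")}


settings = client_settings({"LLM_BACKEND": "local"})
print(settings["base_url"])
```

With the openai package installed, the client is then created as OpenAI(**client_settings()). The model name still has to match the backend, so keep that in configuration as well.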
llama.cpp is the engine underneath Ollama (and many other tools). Written in C/C++, it provides maximum performance and flexibility. If you need custom quantization options, batch processing, fine-grained memory allocation control, or embedding generation with specific parameters, llama.cpp is where you go. It's lower-level -- you manage model files and configuration directly.
# Build from source
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp && make -j
# Run inference directly
./llama-cli -m models/llama-3.1-8b-q4_k_m.gguf \
  -p "The transformer architecture" -n 256
# Start an API server
./llama-server -m models/llama-3.1-8b-q4_k_m.gguf --port 8080
vLLM is optimized for throughput -- serving many requests concurrently. It uses PagedAttention to manage GPU memory efficiently, achieving 2-4x higher throughput than naive implementations. The key idea: instead of pre-allocating contiguous memory for each request's KV cache (which wastes memory on padding), PagedAttention allocates memory in pages, much like a modern operating system manages RAM. Use vLLM when you're serving a model to multiple users simultaneously or processing large batches.
pip install vllm
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --dtype float16 \
  --max-model-len 4096
The practical guidance: Ollama for individual use and development. llama.cpp for maximum control and custom setups. vLLM for production serving with multiple concurrent users.
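The paging idea behind vLLM is easier to see with numbers. The sketch below is a toy accounting exercise, not vLLM's actual allocator: it only counts how many token slots get reserved under the two strategies. The request lengths and the page size of 16 tokens are assumptions made for the example.

```python
import math


def contiguous_slots(request_lengths, max_len):
    """Token slots reserved when every request pre-allocates max_len."""
    return len(request_lengths) * max_len


def paged_slots(request_lengths, page_size=16):
    """Token slots reserved when memory is handed out in fixed-size pages."""
    return sum(math.ceil(n / page_size) * page_size
               for n in request_lengths)


lengths = [120, 900, 35, 2048, 410]  # hypothetical tokens actually used
used = sum(lengths)
naive = contiguous_slots(lengths, max_len=4096)
paged = paged_slots(lengths, page_size=16)

print(f"used={used}, contiguous={naive}, paged={paged}")
print(f"utilization: contiguous {used / naive:.0%}, paged {used / paged:.0%}")
```

With paging, the only waste is the unfilled tail of each request's last page, which is why many more concurrent requests fit in the same VRAM.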
Quantization: making big models fit
Here's the core problem. A 7B parameter model in float16 (2 bytes per parameter) is about 14GB. A 70B model is 140GB. Most consumer GPUs have 8-24GB of VRAM. The math doesn't work -- the models are simply too big to fit.
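The arithmetic is simple enough to put in a function. A minimal sketch, counting weights only (no KV cache, no runtime buffers) and using 1 GB = 1e9 bytes:

```python
def weight_memory_gb(params_billion, bits_per_weight):
    """Approximate weight storage in GB (1 GB = 1e9 bytes), weights only."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9


for params in (7, 13, 70):
    row = ", ".join(f"{bits}-bit: {weight_memory_gb(params, bits):.1f} GB"
                    for bits in (16, 8, 4))
    print(f"{params}B -> {row}")
```

Real quantized files come out somewhat larger than the pure bits-per-weight figure, because they also store scaling factors and keep some tensors at higher precision.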
Quantization compresses model weights by using fewer bits per number. Instead of storing each weight as a 16-bit float, you store it as an 8-bit, 4-bit, or even 2-bit integer (plus some scaling factors). This trades a small amount of output quality for dramatically less memory. And the quality tradeoff is surprisingly small -- the model's actual reasoning ability degrades much less than you'd expect from cutting the precision in half or more.
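To see what "an integer plus a scaling factor" means, here is a toy round trip using plain symmetric absmax quantization. This is the simplest textbook scheme, chosen for clarity; it is not what the K-quant formats do internally (those work block-wise with more elaborate scaling). The weight values are made up.

```python
def quantize_absmax(weights, bits):
    """Symmetric absmax quantization: one float scale plus small signed ints."""
    qmax = 2 ** (bits - 1) - 1  # 127 for 8-bit, 7 for 4-bit, 1 for 2-bit
    scale = max(abs(w) for w in weights) / qmax  # assumes a non-zero weight
    return [round(w / scale) for w in weights], scale


def dequantize(q, scale):
    """Map the integers back to approximate float weights."""
    return [v * scale for v in q]


weights = [0.42, -1.27, 0.05, 0.88, -0.31, 0.0, 1.1, -0.64]  # toy values

for bits in (8, 4, 2):
    q, scale = quantize_absmax(weights, bits)
    restored = dequantize(q, scale)
    max_err = max(abs(a - b) for a, b in zip(weights, restored))
    print(f"{bits}-bit: ints={q}  max abs error={max_err:.3f}")
```

The error grows as the bit width shrinks, and at 2 bits most small weights collapse to zero. That is the degradation the lower quantization levels pay for their size.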
The key formats you need to know:
GGUF (GPT-Generated Unified Format) is the standard for llama.cpp and Ollama. It supports mixed quantization -- different layers can use different precision levels. Attention layers (where the "reasoning" happens) can stay at higher precision while less critical layers get compressed more aggressively.
Common GGUF quantization levels and what they mean in practice:
# Quantization comparison for a 7B parameter model
quant_levels = [
    ("Q8_0", 8, 8.0, "Negligible quality loss. Use if it fits."),
    ("Q6_K", 6, 6.0, "Very close to full precision."),
    ("Q5_K_M", 5, 5.0, "Good balance of size and quality."),
    ("Q4_K_M", 4, 4.5, "The sweet spot for most users."),
    ("Q3_K_M", 3, 3.5, "Noticeable degradation on complex tasks."),
    ("Q2_K", 2, 3.0, "Significant quality loss. Emergency option."),
]
print(f"{'Format':<10} {'Bits':>5} {'~Size (GB)':>11}  Note")
print("-" * 65)
for name, bits, size, note in quant_levels:
    reduction = (1 - size / 14) * 100  # relative to 14 GB in float16
    print(f"{name:<10} {bits:>5} {size:>10.1f}  "
          f"({reduction:.0f}% smaller) {note}")
# Using llama-cpp-python to load quantized models
from llama_cpp import Llama
llm = Llama(
    model_path="models/llama-3.1-8b-q4_k_m.gguf",
    n_ctx=4096,        # context window size
    n_gpu_layers=-1,   # offload all layers to GPU (-1 = all)
    verbose=False
)
response = llm.create_chat_completion(
    messages=[{
        "role": "user",
        "content": "What is attention in transformers?"
    }],
    max_tokens=512,
    temperature=0.7
)
print(response["choices"][0]["message"]["content"])
GPTQ (Frantar et al., 2022) is a GPU-optimized quantization method. It doesn't just naively round weights to fewer bits -- it runs a calibration pass using real data samples and adjusts the quantized weights to minimize the output error across the entire layer. This compensates for quantization mistakes in one weight by slightly adjusting neighboring weights. GPTQ models are fast on NVIDIA GPUs but require GPU inference -- they don't run on CPU.
AWQ (Activation-aware Weight Quantization) takes a different approach: it analyzes which weights have the biggest impact on activations and preserves those at higher precision. The insight is that a small fraction of weights (roughly 1%) are disproportionately important -- quantization errors in those weights cause much larger output errors than errors in the remaining 99%. AWQ often achieves better quality than GPTQ at the same bit width by being smarter about which weights to protect.
The practical rule: start with Q4_K_M in GGUF format. If quality isn't good enough, move up to Q5 or Q6. If you need to squeeze a bigger model into limited VRAM, try Q3. Below Q3, you're usually better off switching to a smaller model at higher quantization -- a 7B model at Q5 will almost always outperform a 13B model at Q2.
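That rule of thumb can be turned into a small helper. This is a rough sketch under stated assumptions: it scales the approximate 7B sizes from the table above linearly with parameter count, reserves an assumed 1.5 GB for KV cache and runtime buffers, and refuses to go below Q3. The function name and defaults are illustrative only.

```python
# Approximate GB per 7B parameters, taken from the table above
SIZE_7B_GB = {"Q8_0": 8.0, "Q6_K": 6.0, "Q5_K_M": 5.0,
              "Q4_K_M": 4.5, "Q3_K_M": 3.5, "Q2_K": 3.0}


def pick_quant(params_billion, vram_gb, overhead_gb=1.5, floor="Q3_K_M"):
    """Return the highest-quality level that fits on paper, or None.

    overhead_gb is an assumed allowance for KV cache and buffers.
    None means: pick a smaller model rather than going below `floor`.
    """
    budget = vram_gb - overhead_gb
    for name, size_7b in SIZE_7B_GB.items():  # dict order: best quality first
        if size_7b * params_billion / 7 <= budget:
            return name
        if name == floor:
            break
    return None


for params, vram in [(7, 8), (13, 10), (13, 16), (70, 24)]:
    print(f"{params}B on {vram} GB -> {pick_quant(params, vram)}")
```

Fitting on paper is not the same as running well. Long contexts eat into the overhead allowance quickly, so treat the result as a starting point and then test.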
Model selection: which model for what
The local model landscape changes fast. New models appear almost weekly. But some selection principles hold steady regardless of which specific models are trending this month:
For general chat and instruction following: Llama 3.1 (8B, 70B), Mistral (7B), and Qwen 2.5 are strong choices at the time of writing. The 8B class runs comfortably on consumer hardware with 8-16GB VRAM. The 70B class needs 40-48GB VRAM (quantized) or CPU offloading with plenty of RAM (slow but functional).
For code generation: Code Llama, DeepSeek Coder, and StarCoder2 are purpose-built for code. They outperform general models on coding tasks despite being smaller, because their training data is heavily skewed toward code repositories.
For embedding and retrieval: nomic-embed-text, all-MiniLM, and bge models are small, fast, and designed specifically for generating embeddings. These are directly relevant to the RAG systems we built in episodes #63-65 -- you can run your entire retrieval pipeline locally.
For constrained environments: Phi-3 (3.8B) and Gemma 2 (2B) punch well above their weight class. If you're deploying on edge devices or have very limited VRAM, evaluate these first. Their quality-to-size ratio is impressively high.
import ollama
# Compare models on the same task
models = ["llama3.1:8b", "mistral:7b", "qwen2.5:7b", "phi3:3.8b"]
prompt = ("Explain the difference between LoRA and full "
          "fine-tuning in 3 sentences.")
for model_name in models:
    try:
        response = ollama.chat(
            model=model_name,
            messages=[{"role": "user", "content": prompt}]
        )
        content = response["message"]["content"]
        words = len(content.split())
        print(f"\n--- {model_name} ({words} words) ---")
        print(content[:200])
    except Exception as e:
        print(f"\n--- {model_name}: not available ({e}) ---")
Don't just trust benchmarks. Seriously. A model that scores highest on MMLU or HumanEval might not be the best for YOUR specific classification task or YOUR specific document summarization pipeline. Always evaluate on your actual use case with your actual data. Benchmarks tell you about general capability; your deployment requires specific capability ;-)
Hardware: what actually matters
VRAM is king. This is the single most important factor for local inference. More VRAM means bigger models, longer context windows, and faster token generation. Everything else -- clock speed, CUDA cores, memory bandwidth -- is secondary to raw VRAM capacity.
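Why does context length cost VRAM? Because on top of the weights, the model keeps a key and a value vector per token, per layer, in the KV cache. A minimal estimate, assuming a Llama-3.1-8B-style shape (32 layers, 8 KV heads, head dimension 128) and fp16 cache values; check the config of whatever model you actually run:

```python
def kv_cache_gb(n_layers, n_kv_heads, head_dim, context_len,
                bytes_per_value=2):
    """KV cache size for one sequence: keys + values for every layer."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return per_token * context_len / 1e9


# Assumed shape: 32 layers, 8 KV heads, head_dim 128, fp16 cache
for ctx in (4096, 32768, 131072):
    print(f"context {ctx:>6}: {kv_cache_gb(32, 8, 128, ctx):.2f} GB")
```

The cache grows linearly with context length, so a long context window can end up needing more memory than the quantized weights themselves.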
Here are the practical hardware tiers and what you can actually run on them:
# Hardware tiers and what fits
tiers = [
    {
        "vram": "8 GB",
        "fits": "7B models at Q4, some 13B at Q2",
        "gpus": "RTX 4060, M1/M2 (shared memory)",
        "tok_per_sec": "20-40",
    },
    {
        "vram": "16 GB",
        "fits": "7B at Q8, 13B at Q4, some 30B at Q2",
        "gpus": "RTX 4070 Ti, M2 Pro",
        "tok_per_sec": "30-60",
    },
    {
        "vram": "24 GB",
        "fits": "13B at Q8, 30B at Q4, 70B at Q2-Q3",
        "gpus": "RTX 4090, RTX 3090, M2 Max",
        "tok_per_sec": "40-100",
    },
    {
        "vram": "48 GB+",
        "fits": "70B at Q4-Q5, multiple smaller models",
        "gpus": "2x RTX 4090, M2 Ultra, A6000",
        "tok_per_sec": "50-120+",
    },
]
print(f"{'VRAM':<10} {'What Fits':<42} {'Example GPUs':<30}")
print("-" * 85)
for t in tiers:
    print(f"{t['vram']:<10} {t['fits']:<42} {t['gpus']:<30}")
    print(f"{'':>10} ~{t['tok_per_sec']} tokens/sec")
Apple Silicon deserves special mention. M1/M2/M3/M4 chips use unified memory -- the CPU and GPU share the same RAM pool. A MacBook Pro with 32GB unified memory can run models that would need a discrete GPU on other platforms. A Mac Mini with an M4 Pro and 64GB can run 70B models quantized to Q4. Performance is competitive with mid-range NVIDIA GPUs, though a high-end RTX 4090 still wins on raw throughput.
CPU inference works but is significantly slower. llama.cpp and Ollama support CPU-only inference, and it's viable for models up to about 13B if you have enough RAM. Expect 5-15 tokens per second on a modern CPU versus 50-100+ on a good GPU. Usable for batch processing overnight; too slow for interactive chat with larger models. Having said that, CPU inference on a beefy server with 256GB RAM and many cores can serve a 70B model -- just don't expect interactive speeds.
Memory bandwidth is the second most important factor after VRAM capacity. This explains why Apple Silicon performs surprisingly well despite lower raw compute -- its unified memory architecture has high bandwidth (200+ GB/s on higher-end chips). On desktop systems, DDR5 RAM helps CPU inference meaningfully compared to DDR4.
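A common back-of-the-envelope way to see why bandwidth matters: when generating one token at a time, a dense model has to read essentially all of its weights from memory for every token, so bandwidth divided by model size gives a rough ceiling on tokens per second. The sketch below applies that rule of thumb. Apart from the 200 GB/s figure mentioned above, the bandwidth numbers are rough assumptions for illustration, so check your own hardware's spec sheet.

```python
def max_tokens_per_sec(bandwidth_gb_s, model_size_gb):
    """Rough upper bound on decode speed if each token reads all weights once."""
    return bandwidth_gb_s / model_size_gb


# Illustrative bandwidth figures (assumptions, except the 200 GB/s one)
systems = {
    "dual-channel DDR4 (assumed ~50 GB/s)": 50,
    "dual-channel DDR5 (assumed ~90 GB/s)": 90,
    "unified memory (200 GB/s)": 200,
    "high-end discrete GPU (assumed ~1000 GB/s)": 1000,
}
model_gb = 4.5  # 7B at Q4_K_M, from the quantization table

for name, bw in systems.items():
    print(f"{name:<45} <= {max_tokens_per_sec(bw, model_gb):.0f} tok/s")
```

Real throughput lands below these ceilings, but the ordering usually holds. It also shows why a smaller quantization level speeds things up: fewer bytes to read per token.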
# Check what your system can handle
import torch
import platform
print(f"Platform: {platform.system()} {platform.machine()}")
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    vram = props.total_memory / 1e9  # total_memory is reported in bytes
    print(f"GPU: {props.name}")
    print(f"VRAM: {vram:.1f} GB")
    print(f"Compute capability: {props.major}.{props.minor}")
    # Estimate what fits
    if vram >= 24:
        print("Can run: 13B at Q8, 70B at Q3")
    elif vram >= 16:
        print("Can run: 7B at Q8, 13B at Q4")
    elif vram >= 8:
        print("Can run: 7B at Q4")
    else:
        print("Limited to very small models or CPU inference")
elif hasattr(torch.backends, "mps") and torch.backends.mps.is_available():
    import subprocess
    # macOS: ask the kernel for total physical memory in bytes
    result = subprocess.run(
        ["sysctl", "hw.memsize"],
        capture_output=True, text=True)
    total_ram = int(result.stdout.split(":")[1].strip()) / 1e9
    print(f"Apple Silicon -- Unified memory: {total_ram:.0f} GB")
    print(f"Usable for models: ~{total_ram * 0.7:.0f} GB "
          f"(leave ~30% for OS)")
else:
    import psutil  # third-party: pip install psutil
    ram = psutil.virtual_memory().total / 1e9
    print(f"CPU only -- RAM: {ram:.0f} GB")
    print("CPU inference: viable but slow for interactive use")
Practical workflow: from download to deployment
Here's the workflow I actually recommend for getting started with local models. Don't overthink it -- pick a model, run it, benchmark it on your tasks, and decide if it's good enough.
import ollama
import time
def benchmark_model(model_name, prompts):
    """Test a model's speed and quality on actual tasks."""
    results = []
    for prompt in prompts:
        start = time.time()
        response = ollama.chat(
            model=model_name,
            messages=[{"role": "user", "content": prompt}]
        )
        elapsed = time.time() - start
        content = response["message"]["content"]
        # eval_count comes from Ollama's response metadata
        tokens = response.get("eval_count", len(content.split()))
        results.append({
            "prompt": prompt[:50],
            "tokens": tokens,
            "time_sec": elapsed,
            "tok_per_sec": tokens / elapsed if elapsed > 0 else 0,
            "response_preview": content[:100],
        })
    return results
# YOUR actual test prompts -- not benchmarks, your real work
test_prompts = [
    "Classify this text as positive, negative, or neutral: "
    "'The API was decent but documentation was lacking'",
    "Extract the key entities from: 'Apple announced the "
    "M4 chip at WWDC in Cupertino'",
    "Summarize in one sentence: Gradient descent is an "
    "optimization algorithm that iteratively adjusts "
    "parameters by moving in the direction of steepest "
    "descent of the loss function.",
]
# Compare two models
for model in ["llama3.1:8b", "phi3:3.8b"]:
    print(f"\n=== {model} ===")
    try:
        results = benchmark_model(model, test_prompts)
        for r in results:
            print(f"  {r['tok_per_sec']:.1f} tok/s | "
                  f"{r['prompt'][:45]}...")
    except Exception as e:
        print(f"  Not available: {e}")
Test on your actual tasks. Measure tokens per second. Check output quality by reading the responses. Then decide: is the local model good enough for this use case, or do you need the API? Often the answer is "the 8B model handles 80% of my use cases perfectly fine, and I only call the API for the complex 20%." That hybrid approach -- local for the bulk, API for the hard stuff -- gives you the best of both worlds in terms of cost, latency, and quality.
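One caveat when measuring tokens per second: wall-clock time also covers model loading and prompt processing, which can make a model look slower than it really is, especially on the first call. Ollama's response metadata reports `eval_count` (generated tokens) and `eval_duration` (generation time in nanoseconds), so a sketch like the following can separate decode speed from end-to-end latency. The helper name `decode_speed` is hypothetical, and the sample dictionary is made-up data standing in for a real response.

```python
def decode_speed(response, wall_clock_sec):
    """Return (decode tok/s, end-to-end tok/s) for one Ollama response.

    `response` is anything with a dict-style .get(); falls back to
    wall-clock time when eval_duration is missing.
    """
    tokens = response.get("eval_count") or 0
    eval_ns = response.get("eval_duration") or 0  # nanoseconds
    end_to_end = tokens / wall_clock_sec if wall_clock_sec > 0 else 0.0
    decode = tokens / (eval_ns / 1e9) if eval_ns > 0 else end_to_end
    return decode, end_to_end


# Made-up metadata for illustration: 120 tokens generated in 3 s,
# but the whole call took 5 s because the model had to load first.
fake_response = {"eval_count": 120, "eval_duration": 3_000_000_000}
decode, overall = decode_speed(fake_response, wall_clock_sec=5.0)
print(f"decode: {decode:.1f} tok/s, end-to-end: {overall:.1f} tok/s")
```

Which number matters depends on the use case: decode speed is the fairer way to compare models, while end-to-end speed is closer to what a user waiting on a single request experiences.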
Building a local model comparison pipeline
Let's put it all together into something you can actually use. A structured comparison pipeline that tests multiple models on your specific tasks and produces a clear recommendation:
import json
import time
class LocalModelEvaluator:
    """Compare local models on your actual tasks."""
    def __init__(self):
        self.results = {}
    def evaluate(self, model_name, test_cases):
        """Run a model through all test cases."""
        model_results = []
        for case in test_cases:
            start = time.time()
            try:
                # Imported here so a missing package is recorded as a task error
                import ollama
                response = ollama.chat(
                    model=model_name,
                    messages=[{
                        "role": "user",
                        "content": case["prompt"]
                    }]
                )
                elapsed = time.time() - start
                output = response["message"]["content"]
                tokens = response.get("eval_count",
                                      len(output.split()))
                # Score against expected output if provided
                score = self._score(output, case.get("expected"))
                model_results.append({
                    "task": case["name"],
                    "output": output[:200],
                    "tokens": tokens,
                    "time_sec": elapsed,
                    "tok_per_sec": tokens / max(elapsed, 0.001),
                    "score": score,
                })
            except Exception as e:
                model_results.append({
                    "task": case["name"],
                    "error": str(e),
                    "score": 0,
                })
        self.results[model_name] = model_results
        return model_results
    def _score(self, output, expected):
        """Simple keyword-based scoring."""
        if not expected:
            return 1.0 if len(output.strip()) > 10 else 0.0
        # Check if expected keywords appear in output
        keywords = expected.lower().split(",")
        found = sum(1 for kw in keywords
                    if kw.strip() in output.lower())
        return found / len(keywords) if keywords else 0.0
    def summary(self):
        """Print comparison summary."""
        print(f"\n{'Model':<20} {'Avg Score':>10} "
              f"{'Avg tok/s':>10} {'Tasks OK':>10}")
        print("-" * 55)
        for model_name, results in self.results.items():
            scores = [r["score"] for r in results
                      if "error" not in r]
            speeds = [r["tok_per_sec"] for r in results
                      if "error" not in r]
            ok = len(scores)
            avg_score = sum(scores) / len(scores) if scores else 0
            avg_speed = sum(speeds) / len(speeds) if speeds else 0
            print(f"{model_name:<20} {avg_score:>9.2f} "
                  f"{avg_speed:>9.1f} {ok:>6}/{len(results)}")
# Define YOUR test cases
test_cases = [
    {
        "name": "classification",
        "prompt": "Classify as positive/negative/neutral: "
                  "'The new update broke my workflow'",
        "expected": "negative",
    },
    {
        "name": "extraction",
        "prompt": "Extract entities (person, org, location): "
                  "'Satoshi Nakamoto created Bitcoin in Japan'",
        "expected": "satoshi,bitcoin,japan",
    },
    {
        "name": "summarization",
        "prompt": "One-sentence summary: Transformers replaced "
                  "RNNs because self-attention processes all "
                  "tokens in parallel rather than sequentially, "
                  "enabling much faster training on long sequences.",
        "expected": "transformers,attention,parallel",
    },
    {
        "name": "code_generation",
        "prompt": "Write a Python function that checks if a "
                  "string is a palindrome. Return True or False.",
        "expected": "def,palindrome,return,true,false",
    },
]
evaluator = LocalModelEvaluator()
# In practice: run this for each model you're considering
# evaluator.evaluate("llama3.1:8b", test_cases)
# evaluator.evaluate("phi3:3.8b", test_cases)
# evaluator.summary()
# Demo output (simulated for the tutorial)
print("Local Model Comparison Pipeline")
print("================================")
print(f"Test cases defined: {len(test_cases)}")
for tc in test_cases:
    print(f"  - {tc['name']}: {tc['prompt'][:50]}...")
print("\nRun evaluator.evaluate('model', test_cases) "
      "for each model")
print("Then evaluator.summary() for the comparison table")
This is the kind of tooling I wish I had when I started working with local models. You define your tasks ONCE, run every candidate model through them, and get a clear apples-to-apples comparison. No more "I think model X felt better than model Y" -- you have numbers. Numbers you can compare, numbers you can track over time as new models come out.
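To actually track those numbers over time, one option is to append each run to a JSON Lines file with a timestamp. This is a minimal sketch and not part of the evaluator itself: the function names and the `model_runs.jsonl` filename are hypothetical, and it assumes results shaped like `LocalModelEvaluator.results` (a dict mapping model name to a list of per-task dicts).

```python
import json
import time
from pathlib import Path


def append_run(results, path="model_runs.jsonl"):
    """Append one evaluation run (all models) as a single JSON line."""
    record = {"timestamp": time.time(), "results": results}
    with Path(path).open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
    return record


def load_runs(path="model_runs.jsonl"):
    """Read back every stored run, oldest first."""
    p = Path(path)
    if not p.exists():
        return []
    with p.open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

Calling `append_run(evaluator.results)` after each evaluation would build up a history that `load_runs()` can read back when a new model comes out and you want to compare it against earlier candidates.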
The API vs local decision framework
One more thing before we wrap up. The question isn't "API or local?" -- it's "which tasks go where?" Almost every production system I've seen that uses local models also uses APIs. The right architecture is a hybrid where you route each request to the most cost-effective backend that meets your quality requirements.
# Routing logic for hybrid local + API setup
class InferenceRouter:
    """Route requests to local or API based on task complexity."""
    def __init__(self, local_models, api_threshold=0.7):
        self.local_models = local_models
        self.api_threshold = api_threshold
        self.stats = {"local": 0, "api": 0}
    def route(self, task_type, complexity_score):
        """Decide where to send this request.
        complexity_score: 0.0 (trivial) to 1.0 (very complex)
        """
        if complexity_score < self.api_threshold:
            self.stats["local"] += 1
            return "local", self.local_models.get(
                task_type, "llama3.1:8b")
        else:
            self.stats["api"] += 1
            return "api", "gpt-4"
    def cost_report(self, local_cost_per_query=0.0001,
                    api_cost_per_query=0.03):
        """Estimate cost savings from hybrid routing."""
        local_cost = self.stats["local"] * local_cost_per_query
        api_cost = self.stats["api"] * api_cost_per_query
        all_api = ((self.stats["local"] + self.stats["api"])
                   * api_cost_per_query)
        savings = all_api - (local_cost + api_cost)
        return {
            "total_queries": self.stats["local"] + self.stats["api"],
            "local_queries": self.stats["local"],
            "api_queries": self.stats["api"],
            "hybrid_cost": local_cost + api_cost,
            "all_api_cost": all_api,
            "savings": savings,
            "savings_pct": (savings / all_api * 100
                            if all_api > 0 else 0),
        }
# Simulate 100 requests with varying complexity
router = InferenceRouter(
    local_models={
        "classification": "phi3:3.8b",
        "extraction": "llama3.1:8b",
        "summarization": "llama3.1:8b",
        "reasoning": "llama3.1:8b",
    }
)
import random
random.seed(42)
tasks = ["classification", "extraction", "summarization",
         "reasoning"]
for _ in range(100):
    task = random.choice(tasks)
    # Classification/extraction tend to be simpler
    if task in ("classification", "extraction"):
        complexity = random.uniform(0.1, 0.6)
    else:
        complexity = random.uniform(0.3, 0.95)
    router.route(task, complexity)
report = router.cost_report()
print("Hybrid Routing Report (100 queries)")
print(f"  Local: {report['local_queries']} queries")
print(f"  API: {report['api_queries']} queries")
print(f"  Hybrid cost: ${report['hybrid_cost']:.2f}")
print(f"  All-API cost: ${report['all_api_cost']:.2f}")
print(f"  Savings: ${report['savings']:.2f} "
      f"({report['savings_pct']:.0f}%)")
If that routing sends 70% of your queries to local models, you've just cut your inference bill by roughly 70% while maintaining the same quality on the complex 30% that goes to the API. That's the real power of running local models -- not replacing the API entirely, but dramatically reducing how much you depend on it.
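The "roughly 70%" figure follows from simple arithmetic: if a fraction f of queries goes local, the savings relative to an all-API setup are f * (1 - local_cost / api_cost). A small sketch of that calculation follows, using the same default per-query costs as `cost_report` (which are illustrative assumptions rather than real prices). The function name is hypothetical.

```python
def hybrid_savings_pct(local_fraction, local_cost=0.0001, api_cost=0.03):
    """Percent saved versus sending every query to the API."""
    if not 0.0 <= local_fraction <= 1.0:
        raise ValueError("local_fraction must be between 0 and 1")
    if api_cost <= 0:
        raise ValueError("api_cost must be positive")
    # Blended per-query cost under hybrid routing
    hybrid = local_fraction * local_cost + (1 - local_fraction) * api_cost
    return (api_cost - hybrid) / api_cost * 100


for frac in (0.5, 0.7, 0.9):
    print(f"{frac:.0%} local -> {hybrid_savings_pct(frac):.1f}% saved")
```

Because the assumed local cost is so small next to the API cost, the savings percentage ends up just under the local fraction itself. If your local cost per query is higher (electricity, hardware amortization), the gap widens accordingly.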
Summary
- Local models give you privacy, cost control, and independence from API providers, but they trade off capability compared to frontier models -- choose based on your specific task requirements, not hype;
- Ollama for ease of use and OpenAI-compatible APIs, llama.cpp for maximum control over memory and quantization, vLLM for production serving with high concurrent throughput;
- Quantization (GGUF Q4_K_M as the default sweet spot) makes 7B+ models practical on consumer hardware by storing weights in 4 bits instead of 16, with surprisingly small quality degradation;
- Model selection depends entirely on your task: general chat, code generation, embedding/retrieval, and edge deployment each have specialist models that outperform generalists at a fraction of the size;
- VRAM is king for hardware -- everything else is secondary. Apple Silicon's unified memory makes it surprisingly competitive for local inference. CPU inference works but is 5-10x slower;
- Always benchmark on YOUR actual use case rather than trusting leaderboards. Build a structured comparison pipeline (like the one above) and let the numbers decide;
- The smartest architecture is usually hybrid: route simple tasks to local models (cheap, fast, private) and complex tasks to APIs (powerful, expensive). This gets you 70-80% cost reduction while maintaining quality where it matters.
Exercises
Exercise 1: Build a model memory calculator. Create a function estimate_memory(num_params_billions, quantization_bits, context_length, batch_size) that estimates total GPU memory needed for inference. It should account for: (a) model weights at the given quantization level (params * bits / 8), (b) KV cache memory (2 * num_layers * hidden_dim * context_length * batch_size * 2 bytes, where num_layers and hidden_dim are estimated from the param count using standard ratios: 7B -> 32 layers, 4096 dim; 13B -> 40 layers, 5120 dim; 70B -> 80 layers, 8192 dim), (c) activation memory overhead (~10% of model weights). Test with: 7B at Q4 with 4096 context, 13B at Q4 with 2048 context, 70B at Q4 with 4096 context. Print a table showing each component and the total. Add a fits_in_gpu(vram_gb) method that returns True/False and prints what you'd need to change (lower quantization, shorter context, or smaller model) if it doesn't fit.
Exercise 2: Build an Ollama model manager (no actual Ollama required -- simulate the API). Create a class ModelManager that tracks: which models are "downloaded" (stored in a dict with name, size_gb, quantization, and capabilities list), total "disk" used, and a usage log. Implement: pull(model_name, size_gb, quant, capabilities) to add a model, remove(model_name) to delete one, list_models() to show all with sizes, find_best(task) that picks the smallest model whose capabilities include the requested task, and usage_report() that shows which models were queried most often. Pre-populate with 5 models (e.g., llama3.1:8b, phi3:3.8b, codellama:7b, nomic-embed, mistral:7b) with different capabilities (chat, code, embedding, reasoning). Simulate 50 queries across different task types and print the usage report showing which model handled the most queries and total "disk" usage.
Exercise 3: Build a quantization quality simulator. Create a function simulate_quantization(weights, bits) that takes a NumPy array of float32 weights and quantizes them to the specified bit width using uniform quantization (map the full range to 2^bits levels, then dequantize back to floats). Measure the mean squared error between original and dequantized weights, the max absolute error, and the signal-to-noise ratio in dB (10 * log10(signal_power / noise_power)). Generate a test weight array of 10,000 values drawn from a normal distribution (mean=0, std=0.02 -- realistic for neural network weights). Test with 2, 3, 4, 5, 6, and 8 bits. Print a comparison table. Then implement awq_simulate(weights, bits, importance_scores) that gives higher precision to the top 1% most "important" weights (keep those at float32, quantize the rest). Compare AWQ-simulated vs uniform quantization at 4-bit and show the MSE improvement.