Learn AI Series (#74) - The Hugging Face Ecosystem
What will I learn
- You will learn the Transformers library: models, tokenizers, and pipelines for running inference with one line of code;
- the Model Hub: finding, downloading, and sharing models among 500K+ available checkpoints;
- the Datasets library: loading and processing data at any scale through Arrow-backed memory mapping;
- the Trainer API: simplified fine-tuning with sensible defaults that handles the training loop boilerplate;
- Accelerate: distributed training across multiple GPUs with minimal code changes;
- Spaces: deploying ML demos and applications with Gradio or Streamlit.
Requirements
- A working modern computer running macOS, Windows or Ubuntu;
- An installed Python 3(.11+) distribution;
- The ambition to learn AI and machine learning.
Difficulty
- Beginner
Curriculum (of the Learn AI Series):
- Learn AI Series (#1) - What Machine Learning Actually Is
- Learn AI Series (#2) - Setting Up Your AI Workbench - Python and NumPy
- Learn AI Series (#3) - Your Data Is Just Numbers - How Machines See the World
- Learn AI Series (#4) - Your First Prediction - No Math, Just Intuition
- Learn AI Series (#5) - Patterns in Data - What "Learning" Actually Looks Like
- Learn AI Series (#6) - From Intuition to Math - Why We Need Formulas
- Learn AI Series (#7) - The Training Loop - See It Work Step by Step
- Learn AI Series (#8) - The Math You Actually Need (Part 1) - Linear Algebra
- Learn AI Series (#9) - The Math You Actually Need (Part 2) - Calculus and Probability
- Learn AI Series (#10) - Your First ML Model - Linear Regression From Scratch
- Learn AI Series (#11) - Making Linear Regression Real
- Learn AI Series (#12) - Classification - Logistic Regression From Scratch
- Learn AI Series (#13) - Evaluation - How to Know If Your Model Actually Works
- Learn AI Series (#14) - Data Preparation - The 80% Nobody Talks About
- Learn AI Series (#15) - Feature Engineering and Selection
- Learn AI Series (#16) - Scikit-Learn - The Standard Library of ML
- Learn AI Series (#17) - Decision Trees - How Machines Make Decisions
- Learn AI Series (#18) - Random Forests - Wisdom of Crowds
- Learn AI Series (#19) - Gradient Boosting - The Kaggle Champion
- Learn AI Series (#20) - Support Vector Machines - Drawing the Perfect Boundary
- Learn AI Series (#21) - Mini Project - Predicting Crypto Market Regimes
- Learn AI Series (#22) - K-Means Clustering - Finding Groups
- Learn AI Series (#23) - Advanced Clustering - Beyond K-Means
- Learn AI Series (#24) - Dimensionality Reduction - PCA
- Learn AI Series (#25) - Advanced Dimensionality Reduction - t-SNE and UMAP
- Learn AI Series (#26) - Anomaly Detection - Finding What Doesn't Belong
- Learn AI Series (#27) - Recommendation Systems - "Users Like You Also Liked..."
- Learn AI Series (#28) - Time Series Fundamentals - When Order Matters
- Learn AI Series (#29) - Time Series Forecasting - Predicting What Comes Next
- Learn AI Series (#30) - Natural Language Processing - Text as Data
- Learn AI Series (#31) - Word Embeddings - Meaning in Numbers
- Learn AI Series (#32) - Bayesian Methods - Thinking in Probabilities
- Learn AI Series (#33) - Ensemble Methods Deep Dive - Stacking and Blending
- Learn AI Series (#34) - ML Engineering - From Notebook to Production
- Learn AI Series (#35) - Data Ethics and Bias in ML
- Learn AI Series (#36) - Mini Project - Complete ML Pipeline
- Learn AI Series (#37) - The Perceptron - Where It All Started
- Learn AI Series (#38) - Neural Networks From Scratch - Forward Pass
- Learn AI Series (#39) - Neural Networks From Scratch - Backpropagation
- Learn AI Series (#40) - Training Neural Networks - Practical Challenges
- Learn AI Series (#41) - Optimization Algorithms - SGD, Momentum, Adam
- Learn AI Series (#42) - PyTorch Fundamentals - Tensors and Autograd
- Learn AI Series (#43) - PyTorch Data and Training
- Learn AI Series (#44) - PyTorch nn.Module - Building Real Networks
- Learn AI Series (#45) - Convolutional Neural Networks - Theory
- Learn AI Series (#46) - CNNs in Practice - Classic to Modern Architectures
- Learn AI Series (#47) - CNN Applications - Detection, Segmentation, Style Transfer
- Learn AI Series (#48) - Recurrent Neural Networks - Sequences
- Learn AI Series (#49) - LSTM and GRU - Solving the Memory Problem
- Learn AI Series (#50) - Sequence-to-Sequence Models
- Learn AI Series (#51) - Attention Mechanisms
- Learn AI Series (#52) - The Transformer Architecture (Part 1)
- Learn AI Series (#53) - The Transformer Architecture (Part 2)
- Learn AI Series (#54) - Vision Transformers
- Learn AI Series (#55) - Generative Adversarial Networks
- Learn AI Series (#56) - Mini Project - Building a Transformer From Scratch
- Learn AI Series (#57) - Language Modeling - Predicting the Next Word
- Learn AI Series (#58) - GPT Architecture - Decoder-Only Transformers
- Learn AI Series (#59) - BERT and Encoder Models
- Learn AI Series (#60) - Training Large Language Models
- Learn AI Series (#61) - Instruction Tuning and Alignment
- Learn AI Series (#62) - Prompt Engineering - Getting the Most from LLMs
- Learn AI Series (#63) - Embeddings and Vector Search
- Learn AI Series (#64) - Retrieval-Augmented Generation (RAG) - Basics
- Learn AI Series (#65) - RAG - Advanced Techniques
- Learn AI Series (#66) - Working with LLM APIs
- Learn AI Series (#67) - Building AI Agents (Part 1) - Foundations
- Learn AI Series (#68) - Building AI Agents (Part 2) - Advanced Patterns
- Learn AI Series (#69) - Fine-Tuning Language Models
- Learn AI Series (#70) - Running Local Models
- Learn AI Series (#71) - Text Generation Techniques
- Learn AI Series (#72) - Tokenization Deep Dive
- Learn AI Series (#73) - LLM Evaluation
- Learn AI Series (#74) - The Hugging Face Ecosystem (this post)
Solutions to Episode #73 Exercises
Exercise 1: Benchmark suite runner.
class BenchmarkRunner:
    """Manages multi-format evaluation tasks."""
    def __init__(self):
        self.mc_tasks = []
        self.gen_tasks = []
    def add_mc(self, question, choices, answer, subject):
        self.mc_tasks.append({
            "question": question,
            "choices": choices,
            "answer": answer.upper(),
            "subject": subject,
        })
    def add_gen(self, name, prompt, checks, category):
        self.gen_tasks.append({
            "name": name,
            "prompt": prompt,
            "checks": checks,
            "category": category,
        })
    def run(self, model_fn):
        mc_results = self._run_mc(model_fn)
        gen_results = self._run_gen(model_fn)
        self._report(mc_results, gen_results)
        return mc_results, gen_results
    def _run_mc(self, model_fn):
        results = []
        for task in self.mc_tasks:
            prompt = task["question"] + "\n"
            for i, c in enumerate(task["choices"]):
                prompt += f"({chr(65 + i)}) {c}\n"
            prompt += "Answer: ("
            raw = model_fn(prompt).strip()
            predicted = raw[0].upper() if raw else ""
            correct = predicted == task["answer"]
            results.append({
                "subject": task["subject"],
                "correct": correct,
                "predicted": predicted,
                "expected": task["answer"],
            })
        return results
    def _run_gen(self, model_fn):
        results = []
        for task in self.gen_tasks:
            response = model_fn(task["prompt"])
            check_results = {}
            for name, fn in task["checks"].items():
                try:
                    check_results[name] = fn(response)
                except Exception:
                    check_results[name] = False
            results.append({
                "name": task["name"],
                "category": task["category"],
                "passed": all(check_results.values()),
                "checks": check_results,
            })
        return results
    def _report(self, mc_results, gen_results):
        print("=" * 55)
        print("BENCHMARK REPORT")
        print("=" * 55)
        # MC accuracy per subject
        subjects = {}
        for r in mc_results:
            s = r["subject"]
            if s not in subjects:
                subjects[s] = {"correct": 0, "total": 0}
            subjects[s]["total"] += 1
            if r["correct"]:
                subjects[s]["correct"] += 1
        mc_total = len(mc_results)
        mc_correct = sum(1 for r in mc_results if r["correct"])
        print(f"\nMultiple Choice: {mc_correct}/{mc_total} "
              f"({mc_correct / mc_total:.0%})")
        for subj, stats in sorted(subjects.items()):
            acc = stats["correct"] / stats["total"]
            print(f" {subj:<15} {stats['correct']}/"
                  f"{stats['total']} ({acc:.0%})")
        # Generation pass rate per category
        categories = {}
        for r in gen_results:
            cat = r["category"]
            if cat not in categories:
                categories[cat] = {"passed": 0, "total": 0}
            categories[cat]["total"] += 1
            if r["passed"]:
                categories[cat]["passed"] += 1
        gen_total = len(gen_results)
        gen_passed = sum(1 for r in gen_results if r["passed"])
        print(f"\nGeneration: {gen_passed}/{gen_total} "
              f"({gen_passed / gen_total:.0%})")
        for cat, stats in sorted(categories.items()):
            rate = stats["passed"] / stats["total"]
            print(f" {cat:<15} {stats['passed']}/"
                  f"{stats['total']} ({rate:.0%})")
        overall = (mc_correct + gen_passed) / (mc_total + gen_total)
        print(f"\nOverall: {overall:.0%}")
# Populate with 15 MC questions (5 per subject)
runner = BenchmarkRunner()
math_qs = [
    ("What is 7 * 8?", ["54", "56", "58", "64"], "B"),
    ("Square root of 144?", ["10", "11", "12", "14"], "C"),
    ("What is 15% of 200?", ["25", "30", "35", "40"], "B"),
    ("2^10 equals?", ["512", "1000", "1024", "2048"], "C"),
    ("What is the derivative of x^3?", ["x^2", "2x^2", "3x^2", "3x"], "C"),
]
science_qs = [
    ("What gas do plants absorb?", ["O2", "CO2", "N2", "H2"], "B"),
    ("Speed of light (approx)?",
     ["300 km/s", "300K km/s", "3M km/s", "30K km/s"], "B"),
    ("Water's chemical formula?", ["HO2", "H2O", "H2O2", "OH"], "B"),
    ("Smallest unit of matter?", ["Cell", "Molecule", "Atom", "Quark"], "C"),
    ("Newton's second law?", ["F=mv", "F=ma", "F=mg", "E=mc2"], "B"),
]
history_qs = [
    ("Year WW2 ended?", ["1943", "1944", "1945", "1946"], "C"),
    ("First moon landing year?",
     ["1967", "1968", "1969", "1970"], "C"),
    ("Berlin Wall fell in?", ["1987", "1988", "1989", "1991"], "C"),
    ("French Revolution started?",
     ["1776", "1789", "1799", "1804"], "B"),
    ("Who wrote The Republic?",
     ["Aristotle", "Plato", "Socrates", "Homer"], "B"),
]
for q, choices, ans in math_qs:
    runner.add_mc(q, choices, ans, "math")
for q, choices, ans in science_qs:
    runner.add_mc(q, choices, ans, "science")
for q, choices, ans in history_qs:
    runner.add_mc(q, choices, ans, "history")
# 5 generation tasks
runner.add_gen(
    "greeting", "Write a professional greeting email",
    {"has_subject": lambda r: "subject" in r.lower()
                              or "dear" in r.lower(),
     "polite": lambda r: any(w in r.lower()
                             for w in ["please", "thank",
                                       "regards"]),
     "length_ok": lambda r: 20 < len(r.split()) < 200},
    "writing",
)
runner.add_gen(
    "code_sort", "Write a Python sort function",
    {"has_def": lambda r: "def " in r,
     "has_return": lambda r: "return" in r,
     "has_sort": lambda r: "sort" in r.lower()},
    "coding",
)
runner.add_gen(
    "explain_gravity", "Explain gravity to a 10-year-old",
    {"simple_words": lambda r: len(r.split()) > 10,
     "mentions_force": lambda r: any(
         w in r.lower()
         for w in ["pull", "force", "attract", "fall"])},
    "explanation",
)
runner.add_gen(
    "code_fib", "Write a fibonacci function in Python",
    {"has_def": lambda r: "def " in r,
     "has_fib": lambda r: "fib" in r.lower()},
    "coding",
)
runner.add_gen(
    "summarize", "Summarize the benefits of exercise",
    {"mentions_health": lambda r: "health" in r.lower(),
     "length_ok": lambda r: 10 < len(r.split()) < 300},
    "writing",
)
# Simulated model with predetermined answers
answers = {
    "7 * 8": "B", "144": "C", "15%": "B",
    "2^10": "C", "derivative": "C",
    "plants": "B", "light": "B", "Water": "B",
    "Smallest": "C", "Newton": "B",
    "WW2": "C", "moon": "C", "Berlin": "C",
    "French": "B", "Republic": "B",
}
def simulated_model(prompt):
    for key, ans in answers.items():
        if key in prompt:
            return ans
    if "greeting" in prompt.lower() or "email" in prompt.lower():
        return ("Dear colleague,\nPlease find the details. "
                "Thank you and best regards.")
    if "sort" in prompt.lower():
        return "def sort_list(lst):\n return sorted(lst)"
    if "gravity" in prompt.lower():
        return ("Gravity is a force that pulls things "
                "toward each other like a magnet.")
    if "fib" in prompt.lower():
        return ("def fib(n):\n if n <= 1: return n\n"
                " return fib(n-1) + fib(n-2)")
    if "exercise" in prompt.lower():
        return ("Exercise improves your health by "
                "strengthening muscles and heart.")
    return "I don't know."
runner.run(simulated_model)
The key design insight is supporting two fundamentally different evaluation modes in one runner. Multiple-choice tasks have a definitive correct answer you can check mechanically (string comparison). Generation tasks have no single correct answer, so you define check functions that verify desirable properties of the output instead. This mirrors how the real evaluation landscape works: MMLU uses multiple-choice, HumanEval uses functional tests, and both get aggregated into a single model scorecard.
Exercise 2: Pairwise tournament evaluator with Elo ratings.
import math
import random
class TournamentEvaluator:
    """Elo-rated pairwise tournament between models."""
    def __init__(self, models, prompts, judge_fn, k=32):
        self.models = models
        self.prompts = prompts
        self.judge_fn = judge_fn
        self.k = k
        self.elo = {name: 1000.0 for name in models}
        self.wins = {}
        for a in models:
            self.wins[a] = {}
            for b in models:
                self.wins[a][b] = {"w": 0, "l": 0, "d": 0}
    def expected(self, ra, rb):
        return 1.0 / (1.0 + 10 ** ((rb - ra) / 400))
    def update_elo(self, winner, loser):
        ea = self.expected(self.elo[winner], self.elo[loser])
        eb = self.expected(self.elo[loser], self.elo[winner])
        self.elo[winner] += self.k * (1 - ea)
        self.elo[loser] += self.k * (0 - eb)
    def update_draw(self, a, b):
        ea = self.expected(self.elo[a], self.elo[b])
        eb = self.expected(self.elo[b], self.elo[a])
        self.elo[a] += self.k * (0.5 - ea)
        self.elo[b] += self.k * (0.5 - eb)
    def run(self):
        names = list(self.models.keys())
        for prompt in self.prompts:
            # Generate all responses
            responses = {}
            for name, fn in self.models.items():
                responses[name] = fn(prompt)
            # Pairwise comparison with position-bias mitigation
            for i in range(len(names)):
                for j in range(i + 1, len(names)):
                    a, b = names[i], names[j]
                    ra, rb = responses[a], responses[b]
                    # Eval 1: a first, b second
                    j1 = self.judge_fn(prompt, ra, rb)
                    # Eval 2: b first, a second (swapped)
                    j2 = self.judge_fn(prompt, rb, ra)
                    # j1: "A" means a wins, "B" means b wins
                    # j2: "A" means b wins (swapped), "B" means a
                    a_score = 0
                    b_score = 0
                    if j1 == "A":
                        a_score += 1
                    elif j1 == "B":
                        b_score += 1
                    if j2 == "A":
                        b_score += 1
                    elif j2 == "B":
                        a_score += 1
                    if a_score > b_score:
                        self.update_elo(a, b)
                        self.wins[a][b]["w"] += 1
                        self.wins[b][a]["l"] += 1
                    elif b_score > a_score:
                        self.update_elo(b, a)
                        self.wins[b][a]["w"] += 1
                        self.wins[a][b]["l"] += 1
                    else:
                        self.update_draw(a, b)
                        self.wins[a][b]["d"] += 1
                        self.wins[b][a]["d"] += 1
        self._report(names)
    def _report(self, names):
        print("ELO RANKINGS")
        print("=" * 40)
        ranked = sorted(self.elo.items(),
                        key=lambda x: x[1], reverse=True)
        for rank, (name, elo) in enumerate(ranked, 1):
            print(f" {rank}. {name:<20} {elo:.0f}")
        print(f"\nWIN/LOSS/DRAW MATRIX")
        header = f"{'':>20}"
        for n in names:
            header += f" {n:>12}"
        print(header)
        for a in names:
            row = f"{a:>20}"
            for b in names:
                if a == b:
                    row += f" {'---':>12}"
                else:
                    w = self.wins[a][b]
                    row += f" {w['w']}W/{w['l']}L/{w['d']}D".rjust(12)
            print(row)
# 4 simulated models with different quality levels
def model_verbose(prompt):
    words = prompt.lower().split()
    return ("Here is a very detailed and comprehensive "
            "response that covers all aspects of " +
            " ".join(words[:5]) +
            " with extensive explanation and thorough "
            "discussion of every single detail involved "
            "in the topic at hand, leaving no stone "
            "unturned in our analysis.")
def model_concise(prompt):
    words = prompt.lower().split()
    return " ".join(words[:3]) + ": done correctly."
def model_accurate(prompt):
    words = prompt.lower().split()
    keywords = [w for w in words
                if len(w) > 3 and w not in
                ("what", "does", "that", "this", "with")]
    return "The answer involves " + ", ".join(keywords[:4]) + "."
def model_inaccurate(prompt):
    return ("Regarding your question, the answer is "
            "complex and multifaceted. There are many "
            "perspectives to consider. Some experts "
            "disagree. The evidence is unclear. More "
            "research is needed to fully understand "
            "this important matter.")
# Judge: prefers responses that mention question keywords
# and are reasonably long but not excessively so
def judge(prompt, resp_a, resp_b):
    keywords = set(w.lower() for w in prompt.split()
                   if len(w) > 3)
    def score(resp):
        words = resp.lower().split()
        kw_hits = sum(1 for w in words if w in keywords)
        length_score = min(len(words), 50) / 50
        return kw_hits * 2 + length_score
    sa = score(resp_a)
    sb = score(resp_b)
    if sa > sb * 1.1:
        return "A"
    elif sb > sa * 1.1:
        return "B"
    return "D"
models = {
    "verbose": model_verbose,
    "concise": model_concise,
    "accurate": model_accurate,
    "inaccurate": model_inaccurate,
}
prompts = [
    "Explain how gradient descent optimizes a loss function",
    "What is the difference between supervised and unsupervised learning",
    "How does a convolutional neural network process images",
    "Describe the attention mechanism in transformers",
    "What is overfitting and how do you prevent it",
    "Explain the bias-variance tradeoff",
    "How does backpropagation compute gradients",
    "What is transfer learning and when is it useful",
    "Describe how word embeddings capture meaning",
    "What is the vanishing gradient problem",
]
random.seed(42)
tourney = TournamentEvaluator(models, prompts, judge)
tourney.run()
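The position-bias mitigation is the most important detail here. Without swapping, whatever response appears first has an advantage. By running each comparison twice with positions swapped, each model collects one point for every round in which the judge prefers it, and the pairing goes to whichever model scores higher across the two rounds. If the two orderings contradict each other (the judge is simply favoring a position), the points cancel and the pairing is recorded as a draw, just as it is when both rounds are ties. A model can therefore only win a pairing if neither ordering goes against it. The Elo system then translates pairwise outcomes into a global ranking -- the same idea that Chatbot Arena popularized for ranking chat models.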
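To make the Elo arithmetic concrete, here is a small worked sketch using the same expected-score formula and k=32 as the solution above. The helper names (expected_score, elo_update) are only for illustration; they are not part of the exercise code.

```python
def expected_score(ra: float, rb: float) -> float:
    """Probability that a model rated ra beats a model rated rb."""
    return 1.0 / (1.0 + 10 ** ((rb - ra) / 400))


def elo_update(ra: float, rb: float, score_a: float, k: float = 32.0):
    """Return new (ra, rb) after one pairing; score_a is 1, 0.5, or 0."""
    ea = expected_score(ra, rb)
    eb = expected_score(rb, ra)
    return ra + k * (score_a - ea), rb + k * ((1.0 - score_a) - eb)


# Two evenly rated models: the winner gains k/2 = 16 points.
even_win = elo_update(1000.0, 1000.0, score_a=1.0)
# An upset: the lower-rated model wins and gains more than 16 points.
upset_win = elo_update(1000.0, 1200.0, score_a=1.0)
# A draw between equals leaves both ratings unchanged.
even_draw = elo_update(1000.0, 1000.0, score_a=0.5)

print("even win: ", even_win)
print("upset win:", upset_win)
print("even draw:", even_draw)
```

Note that whatever one model gains, the other loses, so the total rating in the pool stays constant. That is why beating a much stronger model moves you further than beating an equal.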
Exercise 3: Contamination detector.
def get_ngrams(text, n=4):
    """Extract n-grams from text."""
    words = text.lower().split()
    return set(tuple(words[i:i + n])
               for i in range(len(words) - n + 1))
def ngram_overlap(text_a, text_b, n=4):
    """Compute n-gram overlap ratio."""
    if not text_a.strip() or not text_b.strip():
        return 0.0
    a = get_ngrams(text_a, n)
    b = get_ngrams(text_b, n)
    if not a or not b:
        return 0.0
    shared = a & b
    return len(shared) / min(len(a), len(b))
def detect_contamination(model_fn, test_examples):
    """Detect benchmark contamination via continuation.
    Gives model a running start (first 30 chars of answer)
    and checks if it can reproduce the rest.
    """
    categories = {"likely": [], "possibly": [], "clean": []}
    for ex in test_examples:
        question = ex["question"]
        answer = ex["answer"]
        prefix = answer[:30]
        prompt = f"{question}\n{prefix}"
        continuation = model_fn(prompt)
        remainder = answer[30:]
        overlap = ngram_overlap(continuation, remainder)
        entry = {
            "question": question[:60],
            "overlap": overlap,
            "continuation_preview": continuation[:80],
        }
        if overlap > 0.8:
            categories["likely"].append(entry)
        elif overlap > 0.5:
            categories["possibly"].append(entry)
        else:
            categories["clean"].append(entry)
    # Report
    total = len(test_examples)
    print("CONTAMINATION REPORT")
    print("=" * 55)
    print(f"Total examples: {total}")
    print(f" Likely memorized (>80%): "
          f"{len(categories['likely'])}")
    print(f" Possibly memorized (50-80%): "
          f"{len(categories['possibly'])}")
    print(f" Clean (<50%): {len(categories['clean'])}")
    contamination = (len(categories["likely"])
                     + 0.5 * len(categories["possibly"]))
    rate = contamination / total if total > 0 else 0
    print(f"\nEstimated contamination rate: {rate:.0%}")
    if categories["likely"]:
        print(f"\nMost suspicious cases:")
        for entry in categories["likely"][:3]:
            print(f" Q: {entry['question']}...")
            print(f" Overlap: {entry['overlap']:.1%}")
            print(f" Cont: {entry['continuation_preview']}...")
    return categories
# Test examples (simulated benchmark)
examples = [
    {"question": "What is the capital of France?",
     "answer": "The capital of France is Paris which has "
               "been the capital since the 10th century "
               "and serves as the political economic and "
               "cultural center of the nation"},
    {"question": "Define photosynthesis.",
     "answer": "Photosynthesis is the process by which "
               "green plants convert sunlight water and "
               "carbon dioxide into glucose and oxygen "
               "using chlorophyll in their leaves"},
    {"question": "What is the Pythagorean theorem?",
     "answer": "The Pythagorean theorem states that in a "
               "right triangle the square of the hypotenuse "
               "equals the sum of the squares of the other "
               "two sides expressed as a squared plus b "
               "squared equals c squared"},
    {"question": "Explain Newton's first law.",
     "answer": "Newton first law states that an object at "
               "rest stays at rest and an object in motion "
               "stays in motion with the same speed and "
               "direction unless acted upon by an "
               "unbalanced force"},
    {"question": "What is DNA?",
     "answer": "DNA or deoxyribonucleic acid is a molecule "
               "that carries the genetic instructions used "
               "in growth development functioning and "
               "reproduction of all known living organisms"},
    {"question": "What is the speed of light?",
     "answer": "The speed of light in a vacuum is "
               "approximately 299792458 meters per second "
               "or about 186000 miles per second making it "
               "the fastest speed in the universe"},
    {"question": "Define machine learning.",
     "answer": "Machine learning is a subset of artificial "
               "intelligence that enables systems to learn "
               "and improve from experience without being "
               "explicitly programmed using statistical "
               "techniques to find patterns in data"},
    {"question": "What is inflation in economics?",
     "answer": "Inflation is the rate at which the general "
               "level of prices for goods and services "
               "rises causing purchasing power to fall "
               "central banks attempt to limit inflation "
               "through monetary policy"},
    {"question": "Explain the water cycle.",
     "answer": "The water cycle describes continuous "
               "movement of water through evaporation "
               "condensation precipitation and collection "
               "water evaporates from surfaces rises to "
               "form clouds then falls as precipitation"},
    {"question": "What is quantum computing?",
     "answer": "Quantum computing uses quantum mechanical "
               "phenomena such as superposition and "
               "entanglement to perform computation "
               "quantum computers use qubits instead of "
               "classical bits enabling parallel processing"},
]
# Model 1: memorized (returns near-exact canonical answers)
def model_memorized(prompt):
    for ex in examples:
        if ex["question"][:20] in prompt:
            return ex["answer"][30:]
    return "I do not know the answer to this question."
# Model 2: clean (generates original responses)
def model_clean(prompt):
    if "capital" in prompt.lower():
        return "Paris is widely known as the City of Light."
    if "photosynthesis" in prompt.lower():
        return "Plants use sunlight for energy production."
    if "pythagorean" in prompt.lower():
        return "Right triangles follow a special rule."
    if "newton" in prompt.lower():
        return "Objects keep doing what they are doing."
    if "dna" in prompt.lower():
        return "The blueprint of life stored in cells."
    if "light" in prompt.lower():
        return "Very fast, nothing travels faster."
    if "machine learning" in prompt.lower():
        return "Computers finding patterns without rules."
    if "inflation" in prompt.lower():
        return "When money buys less stuff over time."
    if "water cycle" in prompt.lower():
        return "Rain falls and water goes back up again."
    if "quantum" in prompt.lower():
        return "Using physics to compute differently."
    return "Unknown topic."
print("=== TESTING MEMORIZED MODEL ===\n")
detect_contamination(model_memorized, examples)
print("\n\n=== TESTING CLEAN MODEL ===\n")
detect_contamination(model_clean, examples)
The running-start technique is what makes this work. Giving the model the first 30 characters of the canonical answer provides enough context that a memorized model will reproduce the rest verbatim -- but a model that genuinely understands the topic will generate its own phrasing. The 4-gram overlap metric is particularly revealing because while individual words will naturally overlap (both responses mention "capital" and "France"), matching sequences of four consecutive words are extremely unlikely unless the text was memorized. This is the same principle behind plagiarism detection tools.
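A quick way to see why 4-grams separate memorization from ordinary topical overlap is to compare the same pair of sentences at n=1 and n=4. The sketch below re-implements the overlap ratio compactly; the sentences are made up for illustration.

```python
def ngrams(text: str, n: int = 4) -> set:
    """Set of lowercase word n-grams in text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def overlap(a: str, b: str, n: int = 4) -> float:
    """Shared n-grams divided by the smaller n-gram set."""
    ga, gb = ngrams(a, n), ngrams(b, n)
    if not ga or not gb:
        return 0.0
    return len(ga & gb) / min(len(ga), len(gb))


reference = "which has been the capital since the 10th century"
verbatim = "which has been the capital since the 10th century"
paraphrase = "the capital has been Paris since roughly the 10th century"

# Same topic, different phrasing: most single words are shared...
word_overlap = overlap(reference, paraphrase, n=1)
# ...but no run of four consecutive words survives the rewording.
fourgram_overlap = overlap(reference, paraphrase, n=4)
# An exact reproduction shares every 4-gram.
verbatim_overlap = overlap(reference, verbatim, n=4)

print(f"unigram overlap (paraphrase): {word_overlap:.2f}")
print(f"4-gram overlap (paraphrase):  {fourgram_overlap:.2f}")
print(f"4-gram overlap (verbatim):    {verbatim_overlap:.2f}")
```

The paraphrase shares nearly all of its vocabulary with the reference, yet its 4-gram overlap is zero. Only the verbatim copy crosses the "likely memorized" threshold.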
On to today's episode
Here we go ;-) Over the last few episodes we've been deep in the trenches of working with language models. Fine-tuning (#69), running models locally (#70), text generation techniques (#71), tokenization internals (#72), and evaluation strategies (#73). All that knowledge is about to pay off in a big way, because today we're looking at the platform that ties it all together.
Hugging Face has become the central hub of the ML community. If you're doing anything with machine learning in 2025 (and beyond), you will interact with Hugging Face infrastructure whether you realize it or not. The transformers library, the Model Hub, the datasets library, the Trainer API, Accelerate, Spaces -- these are the building blocks that the entire ecosystem runs on.
Think of it as npm for machine learning: a package manager, a registry, and a set of standard APIs that make different models interchangeable. The analogy is apt because just like npm transformed JavaScript from a browser toy into a serious ecosystem, Hugging Face transformed ML from "download weights from some random Google Drive link" into a proper software engineering discipline.
Throughout this series we've been building things from scratch, then switching to libraries. We built linear regression before using scikit-learn (episodes #10 and #16). We built neural networks in NumPy before switching to PyTorch (#38-39, then #42-44). We built a transformer from scratch before using pretrained models (#56). That pattern was intentional -- you understand what libraries do when you've done it yourself first. Now we're going to look at the ecosystem that puts all those library-level pieces together into a coherent workflow.
Transformers: the universal model API
The transformers library provides a consistent interface to thousands of model architectures. The core abstractions are AutoModel, AutoTokenizer, and pipeline.
Pipelines are the highest-level API. One line to run inference:
from transformers import pipeline
# Sentiment analysis
classifier = pipeline("sentiment-analysis")
result = classifier(
    "This series has been incredibly helpful for "
    "understanding AI"
)
print(result)
# [{'label': 'POSITIVE', 'score': 0.9998}]
# Text generation
generator = pipeline(
    "text-generation",
    model="meta-llama/Llama-3.1-8B-Instruct"
)
output = generator(
    "The key insight behind attention mechanisms is",
    max_new_tokens=100
)
print(output[0]["generated_text"])
# Question answering
qa = pipeline("question-answering")
result = qa(
    question="What does LoRA stand for?",
    context="LoRA, or Low-Rank Adaptation, is a "
            "parameter-efficient fine-tuning method."
)
print(result)
# {'answer': 'Low-Rank Adaptation', 'score': 0.98, ...}
# Named entity recognition
# aggregation_strategy="simple" merges sub-word tokens into whole
# entities (it replaces the deprecated grouped_entities=True)
ner = pipeline("ner", aggregation_strategy="simple")
entities = ner("Scipio wrote this tutorial in Amsterdam")
for ent in entities:
    print(f" {ent['word']:>20} -> {ent['entity_group']}")
# Summarization
summarizer = pipeline(
    "summarization",
    model="facebook/bart-large-cnn"
)
summary = summarizer(
    "Your long article text here...",
    max_length=130,
    min_length=30
)
Pipelines handle tokenization, model loading, inference, and post-processing all in one go. They're perfect for prototyping and simple applications. But for production use or custom workflows, you'll want the lower-level APIs.
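To make those four stages less abstract, here is a toy sketch of what a classification pipeline does conceptually. Everything in it is hypothetical (the whitespace tokenizer, the keyword-counting "model", the ID2LABEL map); a real pipeline uses a trained tokenizer and network, but the final step of turning logits into a label and a score works along the same lines.

```python
import math

# Hypothetical label map; real models ship theirs in the model config.
ID2LABEL = {0: "NEGATIVE", 1: "POSITIVE"}


def toy_tokenize(text):
    """Stage 1: split raw text into tokens (real tokenizers use sub-words)."""
    return [t.strip(".,!?") for t in text.lower().split()]


def toy_model(tokens):
    """Stage 2: stand-in for the network, returns one logit per label."""
    positive = {"helpful", "great", "good"}
    negative = {"confusing", "bad", "useless"}
    pos = sum(t in positive for t in tokens)
    neg = sum(t in negative for t in tokens)
    return [float(neg), float(pos)]


def softmax(logits):
    """Stage 3a: convert logits into probabilities that sum to 1."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def toy_pipeline(text):
    """Chain the stages and return pipeline-style output."""
    probs = softmax(toy_model(toy_tokenize(text)))
    best = max(range(len(probs)), key=probs.__getitem__)
    # Stage 3b: map the winning index back to a human-readable label.
    return [{"label": ID2LABEL[best], "score": probs[best]}]


result = toy_pipeline("This series has been incredibly helpful")
print(result)
```

The output has the same list-of-dicts shape as the sentiment example above, which is the point: the pipeline hides the tokenizer, the forward pass, and the softmax behind one call.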
AutoModel and AutoTokenizer provide that middle layer -- full control over every step:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
# Load any model with the right architecture auto-detected
model_name = "meta-llama/Llama-3.1-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
# Manual inference with full control
messages = [
    {"role": "user",
     "content": "Explain backpropagation in one sentence."}
]
# add_generation_prompt=True appends the assistant header so the
# model answers instead of continuing the user's turn
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt"
).to(model.device)
with torch.no_grad():
    output = model.generate(
        input_ids,
        max_new_tokens=100,
        temperature=0.7,
        top_p=0.9,
        do_sample=True
    )
# Slice off the prompt tokens so only the new text is decoded
response = tokenizer.decode(
    output[0][input_ids.shape[-1]:],
    skip_special_tokens=True
)
print(response)
The Auto prefix means the library inspects the model's config and loads the right architecture class automatically. AutoModelForCausalLM loads the model with a language modeling head. AutoModelForSequenceClassification adds a classification head. Same base model, different task heads. This was one of those design decisions that seems obvious in hindsight but made a huge difference in practice -- before the Auto classes, you had to know the exact architecture class name for every model you wanted to load (was it GPT2LMHeadModel? LlamaForCausalLM? MistralForCausalLM?). Now you just say "give me whatever this model is" and the library figures it out.
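The dispatch idea behind the Auto classes can be sketched as a lookup from the config's model_type field to a concrete class. This is a toy illustration, not the library's actual implementation: the Toy* classes and the CAUSAL_LM_REGISTRY dict are invented here, and only the model_type key mirrors what real config.json files contain.

```python
class ToyGPT2LM:
    """Stand-in for an architecture-specific model class."""
    def __init__(self, config):
        self.config = config


class ToyLlamaLM:
    """Another stand-in architecture."""
    def __init__(self, config):
        self.config = config


# Hypothetical registry: model_type string -> class for this task head.
CAUSAL_LM_REGISTRY = {
    "gpt2": ToyGPT2LM,
    "llama": ToyLlamaLM,
}


class ToyAutoModelForCausalLM:
    """Picks the concrete class from the config instead of the caller."""

    @classmethod
    def from_config(cls, config):
        model_type = config["model_type"]
        try:
            model_cls = CAUSAL_LM_REGISTRY[model_type]
        except KeyError:
            raise ValueError(
                f"Unknown model_type: {model_type!r}"
            ) from None
        return model_cls(config)


# The caller never names the architecture; the config does.
model = ToyAutoModelForCausalLM.from_config(
    {"model_type": "llama", "hidden_size": 4096}
)
print(type(model).__name__)
```

A different task head would simply be a second registry keyed by the same model_type strings, which is how one base architecture can back several Auto classes.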
The apply_chat_template method is worth a specific mention. Remember the chat formatting mess we discussed in episode #71? Different models expect different prompt formats -- Llama 2 and Mistral wrap user turns in [INST] tags, Llama 3 uses <|start_header_id|> role headers, ChatML uses <|im_start|>. apply_chat_template reads the chat template stored in the model's tokenizer config and formats your messages correctly. No more fragile manual prompt construction.
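To see why hand-rolled formatting is fragile, compare how the same conversation renders under two conventions. The two functions below are simplified, hand-written approximations for illustration only; real chat templates are Jinja templates shipped with the tokenizer and also handle system prompts and special tokens.

```python
def render_chatml(messages, add_generation_prompt=True):
    """Simplified ChatML-style rendering."""
    text = ""
    for m in messages:
        text += f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
    if add_generation_prompt:
        # Open the assistant turn so the model knows it should answer.
        text += "<|im_start|>assistant\n"
    return text


def render_inst(messages):
    """Simplified [INST]-style rendering (user/assistant turns only)."""
    text = ""
    for m in messages:
        if m["role"] == "user":
            text += f"[INST] {m['content']} [/INST]"
        else:
            text += f" {m['content']} "
    return text


messages = [
    {"role": "user",
     "content": "Explain backpropagation in one sentence."}
]

chatml_prompt = render_chatml(messages)
inst_prompt = render_inst(messages)
print(chatml_prompt)
print(inst_prompt)
```

Same messages, two incompatible strings. Feed a model the wrong one and output quality tends to degrade quietly rather than fail loudly, which is the reason to let the tokenizer's own template do the formatting.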
The Model Hub: finding and sharing models
The Hub hosts over 500,000 models now. Navigating it effectively is a genuine skill, and one you'll use constantly.
Filtering by task: text-generation, text-classification, token-classification (NER), question-answering, summarization, translation, image-classification, object-detection, audio-classification, and dozens more. The task taxonomy keeps growing as new model types emerge.
Filtering by library: models are tagged with compatible frameworks -- transformers, diffusers, sentence-transformers, spaCy, and others. Not every model on the Hub works with transformers. Some are GGUF files for llama.cpp (remember episode #70?), some are ONNX exports for production inference, some are custom architectures with their own loading code.
Model cards are how you evaluate whether a model is worth using. Every good model has a card documenting: what it was trained on, how it was evaluated, known limitations, intended use cases, and licensing. Read the model card before using any model. A model without a card is a model you shouldn't trust in production.
from huggingface_hub import HfApi
api = HfApi()
# Search for popular text generation models.
# Tags passed to `filter` are combined: a model must carry both
# the task tag and the library tag. (The older ModelFilter helper
# has been removed from recent huggingface_hub releases.)
models = api.list_models(
    filter=["text-generation", "transformers"],
    sort="downloads",
    direction=-1,
    limit=10,
)
for m in models:
    print(f"{m.id:50s} downloads: {m.downloads:>12,}")
# Get detailed model info
info = api.model_info("meta-llama/Llama-3.1-8B-Instruct")
print(f"\nModel: {info.id}")
print(f"Downloads last month: {info.downloads:,}")
print(f"Likes: {info.likes}")
print(f"Tags: {info.tags[:10]}")
# Search by specific criteria
embedding_models = api.list_models(
    filter=["sentence-similarity", "sentence-transformers"],
    sort="downloads",
    direction=-1,
    limit=5,
)
print("\nTop embedding models:")
for m in embedding_models:
    print(f" {m.id}")
Gated models: some models (Llama, Gemma, and others) require you to accept a license agreement before downloading. You'll need a Hugging Face account and an access token. Set it up once and forget about it:
# Login once (saves token to ~/.cache/huggingface/)
# huggingface-cli login
# Or set in code
from huggingface_hub import login
from transformers import AutoModelForCausalLM
login(token="hf_your_token_here")
# Now gated models work transparently
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct"
)
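Hardcoding a token in source code is fine for a quick experiment but easy to leak through version control. A common alternative is to read it from the HF_TOKEN environment variable, which huggingface_hub also picks up on its own. The helper below is a minimal sketch; resolve_hf_token is a made-up name, not a library function.

```python
import os


def resolve_hf_token(env=None):
    """Return the token from HF_TOKEN, or fail with a clear message.

    `env` defaults to os.environ; passing a dict makes this testable.
    """
    env = os.environ if env is None else env
    token = env.get("HF_TOKEN", "").strip()
    if not token:
        raise RuntimeError(
            "HF_TOKEN is not set; export it or run `huggingface-cli login`"
        )
    return token


# Demo with an explicit mapping instead of the real environment.
demo_token = resolve_hf_token({"HF_TOKEN": " hf_example_token "})
print(demo_token)

# In a real script you would then do something like:
#   from huggingface_hub import login
#   login(token=resolve_hf_token())
```

Failing early with a readable message is friendlier than the authorization error you would otherwise get halfway through a multi-gigabyte download.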
Uploading models: after fine-tuning (episode #69), you can share your model on the Hub. This is one of the things that makes the ecosystem work -- every fine-tuned model becomes available to everyone else:
# After training
model.push_to_hub(
"your-username/my-fine-tuned-model",
private=True
)
tokenizer.push_to_hub(
"your-username/my-fine-tuned-model",
private=True
)
# Push with a model card
from huggingface_hub import ModelCard, ModelCardData

# from_template expects a ModelCardData object for the metadata, not a plain dict
card_data = ModelCardData(
    language="en",
    license="mit",
    base_model="meta-llama/Llama-3.1-8B-Instruct",
    tags=["text-generation", "fine-tuned"],
)
card = ModelCard.from_template(
    card_data,
    model_description=(
        "Fine-tuned Llama 3.1 for customer support. "
        "Trained on 10K examples from internal dataset."
    ),
)
card.push_to_hub("your-username/my-fine-tuned-model")
Datasets: loading data at any scale
The datasets library is to data what transformers is to models. It provides a consistent API for loading, processing, and sharing datasets -- from tiny evaluation sets to terabyte-scale training corpora. If you've been writing custom data loading code with pandas and manual train/test splits, this library will change your workflow (and I mean that in a very practical way, not as a sales pitch).
from datasets import load_dataset

# Load a dataset from the Hub
squad = load_dataset("squad")
print(squad)
# DatasetDict({
#     train: Dataset({
#         features: ['id', 'title', 'context',
#                    'question', 'answers'],
#         num_rows: 87599
#     }),
#     validation: Dataset({
#         features: [...], num_rows: 10570
#     })
# })

# Access like a list or dict
print(squad["train"][0])
print(squad["train"]["question"][:3])

# Load local data
my_data = load_dataset(
    "json",
    data_files="my_training_data.json"
)
my_data = load_dataset(
    "csv",
    data_files={
        "train": "train.csv",
        "test": "test.csv"
    }
)

# Stream large datasets without downloading everything.
# The older script-based "wikipedia" dataset no longer loads in recent
# datasets releases; the Parquet-based wikimedia/wikipedia repo replaces it.
wiki = load_dataset(
    "wikimedia/wikipedia", "20231101.en",
    streaming=True
)
for i, example in enumerate(wiki["train"]):
    print(example["title"])
    if i >= 4:
        break
The key feature is memory mapping. Datasets aren't loaded entirely into RAM. They're stored on disk in Apache Arrow format and loaded lazily. This means you can work with datasets larger than your available memory. I can't stress this enough -- before Arrow-backed datasets, trying to load a 50GB training corpus on a machine with 16GB RAM was a recipe for swap thrashing and eventual OOM kills.
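To make the memory-mapping idea concrete, here is a small standard-library sketch. It is not how `datasets` is implemented (that builds on Apache Arrow); it only illustrates the principle that a memory-mapped file lets you read one record by offset without loading the whole file. The fixed-width record layout and the function names are assumptions made up for this example.

```python
import mmap
import os
import struct
import tempfile

RECORD = struct.Struct("<q")  # one little-endian 64-bit integer per row


def write_rows(path: str, n_rows: int) -> None:
    """Write n_rows fixed-width records (row i stores i * i) to disk."""
    with open(path, "wb") as f:
        for i in range(n_rows):
            f.write(RECORD.pack(i * i))


def read_row(path: str, index: int) -> int:
    """Read a single row through a memory map.

    Only the pages that are actually touched get pulled into RAM
    by the operating system, not the whole file.
    """
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            (value,) = RECORD.unpack_from(mm, index * RECORD.size)
            return value


path = os.path.join(tempfile.mkdtemp(), "rows.bin")
write_rows(path, 100_000)
print(read_row(path, 12_345))  # 12345 * 12345 = 152399025
```

The lookup cost does not depend on file size, which is the property that lets a 50GB corpus behave like an indexable list on a 16GB machine.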
Processing is done with .map(), which applies a function to every example:
def tokenize_function(examples):
    return tokenizer(
        examples["text"],
        truncation=True,
        max_length=512,
        padding="max_length"
    )

# batched=True processes multiple examples at once (faster)
# num_proc=4 uses multiprocessing for parallelism
tokenized = dataset.map(
    tokenize_function,
    batched=True,
    num_proc=4,
    remove_columns=["text"],  # drop raw text after tokenizing
)

# Filtering, shuffling, splitting -- all efficient
long_texts = dataset.filter(lambda x: len(x["text"]) > 1000)
shuffled = dataset.shuffle(seed=42)
splits = dataset.train_test_split(test_size=0.1, seed=42)
The remove_columns parameter is a subtle but important detail. After tokenizing, you don't need the raw text anymore -- keeping it wastes memory and can cause collation issues during training. Always remove columns you've finished processing.
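If it is unclear what a batched function receives, the following pure-Python sketch may help. It mimics the shape of the data only: with batching, the function gets a dict that maps each column name to a list of values, and the columns it returns are added to the existing ones. `batched_map` and `toy_tokenize` are hypothetical stand-ins written for this example, not the library's implementation.

```python
from typing import Callable, Dict, List

Batch = Dict[str, List]


def batched_map(
    columns: Batch,
    fn: Callable[[Batch], Batch],
    batch_size: int = 2,
    remove_columns: tuple = (),
) -> Batch:
    """Apply fn to slices of a column-oriented table and merge the results."""
    n_rows = len(next(iter(columns.values())))
    out: Batch = {}
    for start in range(0, n_rows, batch_size):
        batch = {k: v[start:start + batch_size] for k, v in columns.items()}
        # Columns returned by fn are added to (or overwrite) existing ones
        result = {**batch, **fn(batch)}
        for name, values in result.items():
            if name not in remove_columns:
                out.setdefault(name, []).extend(values)
    return out


def toy_tokenize(batch: Batch) -> Batch:
    # Word lengths stand in for real token ids
    return {
        "input_ids": [[len(w) for w in text.split()] for text in batch["text"]]
    }


data = {"text": ["hello world", "arrow is fast", "hi"], "label": [0, 1, 0]}
tokenized = batched_map(
    data, toy_tokenize, batch_size=2, remove_columns=("text",)
)
print(tokenized)
# {'label': [0, 1, 0], 'input_ids': [[5, 5], [5, 2, 4], [2]]}
```

Note that the raw `text` column is gone from the result while `label` survives, which is the effect `remove_columns` has in the real call.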
The Trainer API: simplified fine-tuning
We wrote the training loop manually in episode #69. You now understand exactly what happens inside: forward pass, loss computation, backward pass, optimizer step, gradient clipping, learning rate scheduling, checkpointing. The Trainer API handles all that boilerplate while letting you customize the important parts.
from transformers import (
    TrainingArguments, Trainer,
    DataCollatorForLanguageModeling,
    EarlyStoppingCallback,
)

data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=False
)

training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=8,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    warmup_ratio=0.03,
    lr_scheduler_type="cosine",
    logging_dir="./logs",
    logging_steps=10,
    eval_strategy="steps",
    eval_steps=100,
    save_strategy="steps",
    save_steps=100,
    save_total_limit=3,
    load_best_model_at_end=True,
    bf16=True,
    report_to="tensorboard",
    gradient_checkpointing=True,
    optim="adamw_torch",
    hub_model_id="my-username/my-model",  # target repo for push_to_hub
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    data_collator=data_collator,
    callbacks=[EarlyStoppingCallback(
        early_stopping_patience=3
    )],
)

# Train
trainer.train()

# Evaluate
metrics = trainer.evaluate()
print(metrics)

# Save and push. The repo name comes from hub_model_id above;
# the first argument of Trainer.push_to_hub is the commit message.
trainer.save_model("./final_model")
trainer.push_to_hub(commit_message="End of training")
What Trainer handles for you: gradient accumulation, mixed precision training (bf16/fp16), gradient clipping, learning rate scheduling with warmup, checkpoint management (keeps only the last N checkpoints so your disk doesn't fill up), logging to TensorBoard or Weights & Biases, early stopping, and distributed training across GPUs. That's a LOT of code you don't have to write yourself.
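Two of those items are plain arithmetic, and it can help to see the numbers behind the TrainingArguments used here. The sketch below computes the effective batch size and a simplified warmup-plus-cosine learning rate. The device count and the total step count are assumptions for the example, and `lr_at_step` is a hand-written approximation of the schedule's shape, not the library's own scheduler.

```python
import math


def effective_batch_size(
    per_device: int, accumulation_steps: int, num_devices: int
) -> int:
    """Examples that contribute to one optimizer step."""
    return per_device * accumulation_steps * num_devices


def lr_at_step(
    step: int,
    total_steps: int,
    base_lr: float = 2e-4,
    warmup_ratio: float = 0.03,
) -> float:
    """Linear warmup from 0 to base_lr, then cosine decay back to 0."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# per_device_train_batch_size=4, gradient_accumulation_steps=4, one GPU (assumed)
print(effective_batch_size(4, 4, 1))  # 16

# Assuming 1000 optimizer steps in total: 30 warmup steps, peak at step 30
for step in (0, 15, 30, 500, 1000):
    print(step, f"{lr_at_step(step, 1000):.2e}")
```

Gradient accumulation is why a batch size of 4 per device still trains with 16 examples per update: memory use follows the small number, optimization behaviour follows the large one.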
But the really clever part is customization through subclassing. Need a custom loss function? Override compute_loss. Need special evaluation logic? Override evaluate. Need to modify what happens at each training step? Override training_step. The rest stays the same:
class CustomTrainer(Trainer):
    def compute_loss(self, model, inputs,
                     return_outputs=False, **kwargs):
        outputs = model(**inputs)
        loss = outputs.loss
        # Add L2 regularization
        l2_reg = sum(p.pow(2).sum()
                     for p in model.parameters())
        loss = loss + 1e-5 * l2_reg
        return (loss, outputs) if return_outputs else loss


# compute_metrics is not a method you override. It is a plain function
# passed to the constructor: CustomTrainer(..., compute_metrics=compute_metrics)
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = logits.argmax(axis=-1)
    accuracy = (predictions == labels).mean()
    return {"accuracy": float(accuracy)}
This is good API design -- the common case is simple (just use Trainer with TrainingArguments), and the custom case is possible without rewriting everything from scratch.
Accelerate: distributed training made simple
When one GPU isn't enough, Accelerate handles distributed training with minimal code changes. The philosophy is brilliant: write your training loop as if it runs on a single device, then Accelerate handles the distribution. No torch.distributed boilerplate, no DistributedDataParallel wrapping, no manual gradient synchronization.
import torch
from torch.utils.data import DataLoader
from transformers import AutoModelForCausalLM
from accelerate import Accelerator

accelerator = Accelerator(mixed_precision="bf16")

# Your normal PyTorch objects
# (model_name and dataset are assumed to be defined earlier)
model = AutoModelForCausalLM.from_pretrained(model_name)
optimizer = torch.optim.AdamW(
    model.parameters(), lr=2e-4
)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=1000
)
dataloader = DataLoader(dataset, batch_size=4)

# Accelerate wraps them for distributed execution
model, optimizer, dataloader, scheduler = accelerator.prepare(
    model, optimizer, dataloader, scheduler
)

# Training loop looks almost identical to single-GPU
for epoch in range(3):
    model.train()
    for batch in dataloader:
        outputs = model(**batch)
        loss = outputs.loss
        accelerator.backward(loss)
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()
        if accelerator.is_main_process:
            print(f"Loss: {loss.item():.4f}")

    # Save checkpoint at the end of each epoch (only on main process)
    accelerator.wait_for_everyone()
    if accelerator.is_main_process:
        accelerator.save_model(model, f"checkpoint-{epoch}")
Launch it:
# 4 GPUs on one machine
accelerate launch --num_processes=4 train.py
# Or configure interactively
accelerate config
accelerate launch train.py
# With DeepSpeed (for very large models)
accelerate launch --use_deepspeed train.py
The key thing to notice: the training loop itself didn't change much. You replaced loss.backward() with accelerator.backward(loss) and wrapped your objects with accelerator.prepare(). That's basically it. Accelerate handles figuring out which GPU each process runs on, splitting the data, synchronizing gradients, and gathering metrics. It supports data parallelism (same model on each GPU, different data batches), model parallelism (model split across GPUs), and FSDP (Fully Sharded Data Parallel -- the current best practice for training models that don't fit on a single GPU).
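Why does data parallelism give the same result as one big batch? A toy simulation in plain Python shows the idea. It assumes a one-parameter linear model with a mean-squared-error loss and equally sized shards; the "workers" are just list slices, and the averaging line stands in for the gradient synchronization that happens between real GPUs.

```python
def mse_gradient(w: float, xs: list, ys: list) -> float:
    """d/dw of mean((w * x - y) ** 2) over the given examples."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


w = 0.5
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0 * x for x in xs]

# Single device: gradient over the whole batch
full_grad = mse_gradient(w, xs, ys)

# Data parallel: 4 simulated workers, each with an equal shard of the batch
num_workers = 4
shard = len(xs) // num_workers
local_grads = [
    mse_gradient(
        w,
        xs[i * shard:(i + 1) * shard],
        ys[i * shard:(i + 1) * shard],
    )
    for i in range(num_workers)
]

# Synchronization step: average the per-worker gradients
synced_grad = sum(local_grads) / num_workers

print(full_grad, synced_grad)  # -76.5 -76.5
```

Because every worker ends up with the same averaged gradient, every copy of the model takes the same optimizer step and the replicas stay identical.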
For most fine-tuning tasks, plain data parallelism with gradient accumulation is sufficient. You only need FSDP or model parallelism when the model itself doesn't fit in one GPU's memory -- which is increasingly the case with 70B+ parameter models.
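A quick back-of-the-envelope check makes the "doesn't fit" threshold tangible. The sketch below counts only the memory needed to hold the weights, assuming 2 bytes per parameter (bf16) and a hypothetical 80 GiB GPU. Training needs considerably more on top of that for gradients, optimizer state, and activations, so treat it as a lower bound.

```python
def weights_gib(num_params: float, bytes_per_param: int = 2) -> float:
    """Memory needed just to hold the weights, in GiB (bf16 = 2 bytes)."""
    return num_params * bytes_per_param / 1024**3


def fits_on_gpu(
    num_params: float, gpu_gib: float, bytes_per_param: int = 2
) -> bool:
    """Lower-bound check: do the weights alone fit in GPU memory?"""
    return weights_gib(num_params, bytes_per_param) <= gpu_gib


for name, params in [("8B", 8e9), ("70B", 70e9)]:
    size = weights_gib(params)
    verdict = "fits" if fits_on_gpu(params, 80) else "does not fit"
    print(f"{name}: {size:.1f} GiB of weights, {verdict} on an 80 GiB GPU")
```

An 8B model comes out at roughly 15 GiB and a 70B model at roughly 130 GiB, which is why the larger one forces you into sharding before training has even started.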
Spaces: deploying demos
Hugging Face Spaces lets you deploy ML applications with a git push. It supports Gradio (for ML demos) and Streamlit (for data apps). This is where the ecosystem really comes full circle -- you've trained a model, now you want someone (a colleague, a user, a potential investor) to try it without installing anything.
# app.py for a Gradio-based Space
import gradio as gr
from transformers import pipeline

# Load every pipeline once at startup, not on each request
classifier = pipeline("sentiment-analysis")
pipe_a = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english"
)
pipe_b = pipeline(
    "sentiment-analysis",
    model="cardiffnlp/twitter-roberta-base-sentiment-latest"
)


def analyze(text):
    if not text.strip():
        return "Please enter some text."
    result = classifier(text)[0]
    label = result["label"]
    score = result["score"]
    return f"{label} (confidence: {score:.2%})"


def compare_models(text):
    """Compare sentiment across two models."""
    ra = pipe_a(text)[0]
    rb = pipe_b(text)[0]
    return (
        f"DistilBERT: {ra['label']} ({ra['score']:.2%})\n"
        f"RoBERTa: {rb['label']} ({rb['score']:.2%})"
    )


with gr.Blocks() as demo:
    gr.Markdown("# Sentiment Analyzer")
    with gr.Tab("Single Model"):
        text_input = gr.Textbox(label="Enter text")
        output = gr.Textbox(label="Sentiment")
        btn = gr.Button("Analyze")
        btn.click(analyze, inputs=text_input,
                  outputs=output)
    with gr.Tab("Compare Models"):
        text_input2 = gr.Textbox(label="Enter text")
        output2 = gr.Textbox(label="Results")
        btn2 = gr.Button("Compare")
        btn2.click(compare_models, inputs=text_input2,
                   outputs=output2)
    gr.Examples(
        examples=[
            "This AI series has taught me so much!",
            "The documentation is confusing and incomplete.",
            "Not bad, could be better though.",
        ],
        inputs=text_input,
    )

demo.launch()
Push this to a Space and it's live on the web. The free tier includes CPU-only instances. Paid tiers add GPU support (T4, A10G, A100). Spaces is useful for: demoing your fine-tuned models to stakeholders who don't have Python installed, creating interactive tutorials, building simple tools for non-technical colleagues, and prototyping before building a full production application.
How the pieces fit together
The Hugging Face ecosystem works because the pieces compose into a natural workflow:
- Find a model on the Hub (or discover one doesn't exist for your use case)
- Load it with Transformers (AutoModel.from_pretrained)
- Load training data with Datasets (load_dataset)
- Fine-tune with Trainer (or write a custom loop with Accelerate for more control)
- Push the fine-tuned model back to the Hub
- Deploy a demo with Spaces
- Someone else finds YOUR model on the Hub -- cycle continues
Each piece is useful independently, but together they form a workflow that covers the entire ML lifecycle. And because everything uses the same model format and the same tokenizer configs, switching between steps is frictionless. A model trained with Trainer can be loaded with pipeline. A model pushed to the Hub can be downloaded by someone using Accelerate for distributed inference. A dataset processed with datasets works seamlessly with DataLoader.
Throughout this series we've used Hugging Face components whenever we moved from scratch implementations to library-based code. In episodes #42-44 we used PyTorch directly. In episode #69 we fine-tuned with a manual training loop. Now you know the ecosystem that sits on top of all that -- and more importantly, you know what each piece does under the hood because you built it yourself first. That's the whole point of the "scratch first, library second" philosophy.
Summary
- The Transformers library gives you a consistent API (pipeline for quick prototyping, AutoModel + AutoTokenizer for fine-grained control) across thousands of model architectures;
- the Model Hub hosts 500K+ models with model cards, licensing info, and gated access -- always read the model card before trusting a checkpoint;
- the Datasets library handles loading and processing data at any scale through Arrow-backed memory mapping, efficient .map() and .filter() operations, and streaming for datasets larger than your disk;
- the Trainer API wraps the training loop boilerplate (checkpointing, logging, distributed training, mixed precision) while remaining customizable through subclassing;
- Accelerate bridges single-GPU and multi-GPU training with minimal code changes -- wrap your objects, replace loss.backward(), and launch;
- Spaces provides free deployment for ML demos using Gradio or Streamlit -- your fine-tuned model goes from weights on disk to a live web app with a git push.
Exercises
Exercise 1: Build a model comparison pipeline. Create a class ModelComparer that: (a) takes a list of model names (use simulated model functions if you don't have GPU access), (b) runs each model on 10 test prompts (mix of sentiment analysis, question answering, and text generation tasks), (c) collects timing information (use time.time() around each call), (d) computes average response length, average latency, and a quality score per model (define quality as: contains relevant keywords from the prompt + stays under 200 words + is grammatically a complete sentence ending with punctuation), (e) produces a formatted comparison table showing all metrics side by side. Test with 3 simulated models: a "fast but dumb" model (returns short generic answers quickly), a "slow but smart" model (returns detailed keyword-rich answers after a sleep), and a "balanced" model (moderate speed and quality). Print the full comparison table and declare a winner based on a weighted score (40% quality, 30% latency, 30% length appropriateness).
Exercise 2: Build a dataset processing pipeline. Create a class DataPipeline that demonstrates the core operations from the datasets library using pure Python (no library imports needed). It should: (a) accept a list of dictionaries as input data (simulate a dataset), (b) implement .map(fn, batched=False) that applies a function to each example and returns a new dataset, (c) implement .filter(fn) that keeps only examples where fn returns True, (d) implement .train_test_split(test_size, seed) that shuffles and splits the data, (e) implement .select(indices) that picks specific rows, (f) track and display processing statistics (rows in, rows out, time per operation). Create a sample dataset of 100 text examples (generate them programmatically -- "Sample text number N about topic X" patterns), then run a pipeline: map to add word counts, filter to keep texts over 5 words, map to tokenize (split on spaces), train/test split at 80/20. Print statistics at each step and final dataset info.
Exercise 3: Build a Hub search simulator. Create a class ModelHub that simulates the Hugging Face Hub's search and filtering functionality. It should: (a) store models as dicts with fields: id, task, library, downloads, likes, license, tags, parameters (model size), (b) implement .search(query) that matches against model id and tags (case-insensitive substring matching), (c) implement .filter(task=None, library=None, license=None, min_downloads=None, max_parameters=None) with combinable filters, (d) implement .sort(field, descending=True), (e) implement .model_card(model_id) that returns a formatted string with all model info. Pre-populate with 20 models across 4 tasks (text-generation, text-classification, question-answering, sentence-similarity) and 3 libraries (transformers, sentence-transformers, spacy). Include realistic download counts ranging from 100 to 10M. Demonstrate: search for "llama", filter by task + minimum downloads, sort by likes, and print a model card for the top result.