Learn AI Series (#79) - Object Detection (Part 2) - Modern Approaches
What will I learn
- You will learn how YOLO reframed detection as a single regression problem and why that was revolutionary;
- the YOLO evolution from v1 through v8: anchor boxes, multi-scale prediction, and the move to anchor-free;
- SSD and multi-scale single-shot detection from different feature map levels;
- anchor-free detectors: FCOS and CenterNet -- simpler architectures with fewer hyperparameters;
- training a custom object detector on your own dataset with transfer learning;
- evaluation metrics: mAP, precision-recall curves, and what mAP50-95 actually measures.
Requirements
- A working modern computer running macOS, Windows or Ubuntu;
- An installed Python 3(.11+) distribution;
- The ambition to learn AI and machine learning.
Difficulty
- Beginner
Curriculum (of the Learn AI Series):
- Learn AI Series (#1) - What Machine Learning Actually Is
- Learn AI Series (#2) - Setting Up Your AI Workbench - Python and NumPy
- Learn AI Series (#3) - Your Data Is Just Numbers - How Machines See the World
- Learn AI Series (#4) - Your First Prediction - No Math, Just Intuition
- Learn AI Series (#5) - Patterns in Data - What "Learning" Actually Looks Like
- Learn AI Series (#6) - From Intuition to Math - Why We Need Formulas
- Learn AI Series (#7) - The Training Loop - See It Work Step by Step
- Learn AI Series (#8) - The Math You Actually Need (Part 1) - Linear Algebra
- Learn AI Series (#9) - The Math You Actually Need (Part 2) - Calculus and Probability
- Learn AI Series (#10) - Your First ML Model - Linear Regression From Scratch
- Learn AI Series (#11) - Making Linear Regression Real
- Learn AI Series (#12) - Classification - Logistic Regression From Scratch
- Learn AI Series (#13) - Evaluation - How to Know If Your Model Actually Works
- Learn AI Series (#14) - Data Preparation - The 80% Nobody Talks About
- Learn AI Series (#15) - Feature Engineering and Selection
- Learn AI Series (#16) - Scikit-Learn - The Standard Library of ML
- Learn AI Series (#17) - Decision Trees - How Machines Make Decisions
- Learn AI Series (#18) - Random Forests - Wisdom of Crowds
- Learn AI Series (#19) - Gradient Boosting - The Kaggle Champion
- Learn AI Series (#20) - Support Vector Machines - Drawing the Perfect Boundary
- Learn AI Series (#21) - Mini Project - Predicting Crypto Market Regimes
- Learn AI Series (#22) - K-Means Clustering - Finding Groups
- Learn AI Series (#23) - Advanced Clustering - Beyond K-Means
- Learn AI Series (#24) - Dimensionality Reduction - PCA
- Learn AI Series (#25) - Advanced Dimensionality Reduction - t-SNE and UMAP
- Learn AI Series (#26) - Anomaly Detection - Finding What Doesn't Belong
- Learn AI Series (#27) - Recommendation Systems - "Users Like You Also Liked..."
- Learn AI Series (#28) - Time Series Fundamentals - When Order Matters
- Learn AI Series (#29) - Time Series Forecasting - Predicting What Comes Next
- Learn AI Series (#30) - Natural Language Processing - Text as Data
- Learn AI Series (#31) - Word Embeddings - Meaning in Numbers
- Learn AI Series (#32) - Bayesian Methods - Thinking in Probabilities
- Learn AI Series (#33) - Ensemble Methods Deep Dive - Stacking and Blending
- Learn AI Series (#34) - ML Engineering - From Notebook to Production
- Learn AI Series (#35) - Data Ethics and Bias in ML
- Learn AI Series (#36) - Mini Project - Complete ML Pipeline
- Learn AI Series (#37) - The Perceptron - Where It All Started
- Learn AI Series (#38) - Neural Networks From Scratch - Forward Pass
- Learn AI Series (#39) - Neural Networks From Scratch - Backpropagation
- Learn AI Series (#40) - Training Neural Networks - Practical Challenges
- Learn AI Series (#41) - Optimization Algorithms - SGD, Momentum, Adam
- Learn AI Series (#42) - PyTorch Fundamentals - Tensors and Autograd
- Learn AI Series (#43) - PyTorch Data and Training
- Learn AI Series (#44) - PyTorch nn.Module - Building Real Networks
- Learn AI Series (#45) - Convolutional Neural Networks - Theory
- Learn AI Series (#46) - CNNs in Practice - Classic to Modern Architectures
- Learn AI Series (#47) - CNN Applications - Detection, Segmentation, Style Transfer
- Learn AI Series (#48) - Recurrent Neural Networks - Sequences
- Learn AI Series (#49) - LSTM and GRU - Solving the Memory Problem
- Learn AI Series (#50) - Sequence-to-Sequence Models
- Learn AI Series (#51) - Attention Mechanisms
- Learn AI Series (#52) - The Transformer Architecture (Part 1)
- Learn AI Series (#53) - The Transformer Architecture (Part 2)
- Learn AI Series (#54) - Vision Transformers
- Learn AI Series (#55) - Generative Adversarial Networks
- Learn AI Series (#56) - Mini Project - Building a Transformer From Scratch
- Learn AI Series (#57) - Language Modeling - Predicting the Next Word
- Learn AI Series (#58) - GPT Architecture - Decoder-Only Transformers
- Learn AI Series (#59) - BERT and Encoder Models
- Learn AI Series (#60) - Training Large Language Models
- Learn AI Series (#61) - Instruction Tuning and Alignment
- Learn AI Series (#62) - Prompt Engineering - Getting the Most from LLMs
- Learn AI Series (#63) - Embeddings and Vector Search
- Learn AI Series (#64) - Retrieval-Augmented Generation (RAG) - Basics
- Learn AI Series (#65) - RAG - Advanced Techniques
- Learn AI Series (#66) - Working with LLM APIs
- Learn AI Series (#67) - Building AI Agents (Part 1) - Foundations
- Learn AI Series (#68) - Building AI Agents (Part 2) - Advanced Patterns
- Learn AI Series (#69) - Fine-Tuning Language Models
- Learn AI Series (#70) - Running Local Models
- Learn AI Series (#71) - Text Generation Techniques
- Learn AI Series (#72) - Tokenization Deep Dive
- Learn AI Series (#73) - LLM Evaluation
- Learn AI Series (#74) - The Hugging Face Ecosystem
- Learn AI Series (#75) - Multimodal Models - Text Meets Vision
- Learn AI Series (#76) - Mini Project - Your Own AI Assistant
- Learn AI Series (#77) - Image Processing Fundamentals
- Learn AI Series (#78) - Object Detection (Part 1) - Foundations
- Learn AI Series (#79) - Object Detection (Part 2) - Modern Approaches (this post)
Solutions to Episode #78 Exercises
Exercise 1: Detection dataset simulator.
import numpy as np
import cv2
import random

class DetectionDataset:
    """Generate synthetic detection datasets."""
    def __init__(self, num_images=50, img_size=300, seed=42):
        self.num_images = num_images
        self.img_size = img_size
        self.rng = random.Random(seed)
        self.np_rng = np.random.RandomState(seed)
        self.classes = {"red": (0, 0, 255),
                        "green": (0, 255, 0),
                        "blue": (255, 0, 0)}  # BGR
        self.data = self._generate()

    def _generate(self):
        dataset = []
        for _ in range(self.num_images):
            img = np.zeros(
                (self.img_size, self.img_size, 3),
                dtype=np.uint8
            )
            num_objects = self.rng.randint(1, 5)
            annotations = []
            placed_boxes = []
            for _ in range(num_objects):
                cls_name = self.rng.choice(
                    list(self.classes.keys())
                )
                color = self.classes[cls_name]
                for attempt in range(50):
                    w = self.rng.randint(30, 100)
                    h = self.rng.randint(30, 100)
                    x1 = self.rng.randint(
                        0, self.img_size - w
                    )
                    y1 = self.rng.randint(
                        0, self.img_size - h
                    )
                    x2, y2 = x1 + w, y1 + h
                    overlap = False
                    for bx1, by1, bx2, by2 in placed_boxes:
                        if not (x2 <= bx1 or x1 >= bx2
                                or y2 <= by1 or y1 >= by2):
                            overlap = True
                            break
                    if not overlap:
                        img[y1:y2, x1:x2] = color
                        annotations.append({
                            "class": cls_name,
                            "box": [x1, y1, x2, y2],
                        })
                        placed_boxes.append(
                            (x1, y1, x2, y2)
                        )
                        break
            dataset.append((img, annotations))
        return dataset

    def get_sample(self, index):
        return self.data[index]

    def visualize(self, index, save_path="det_vis.png"):
        img, anns = self.data[index]
        vis = img.copy()
        for ann in anns:
            x1, y1, x2, y2 = ann["box"]
            cv2.rectangle(vis, (x1, y1), (x2, y2),
                          (255, 255, 255), 2)
            cv2.putText(vis, ann["class"], (x1, y1 - 5),
                        cv2.FONT_HERSHEY_SIMPLEX, 0.5,
                        (255, 255, 255), 1)
        cv2.imwrite(save_path, vis)
        return save_path

    def statistics(self):
        total_objects = 0
        class_counts = {c: 0 for c in self.classes}
        total_area = 0
        for _, anns in self.data:
            total_objects += len(anns)
            for ann in anns:
                class_counts[ann["class"]] += 1
                x1, y1, x2, y2 = ann["box"]
                total_area += (x2 - x1) * (y2 - y1)
        img_area = self.img_size ** 2
        avg_objs = total_objects / len(self.data)
        avg_pct = (total_area / total_objects
                   / img_area * 100)
        print(f"Total images: {len(self.data)}")
        print(f"Avg objects/img: {avg_objs:.1f}")
        print("Class distribution:")
        for cls, cnt in class_counts.items():
            print(f"  {cls}: {cnt}")
        print(f"Avg object area: {avg_pct:.1f}% of image")

ds = DetectionDataset(num_images=50)
ds.statistics()
The non-overlap constraint is the interesting part. In real datasets, objects overlap constantly (people in a crowd, cars on a highway). Our simulator avoids it for simplicity, but that actually makes detection easier than real-world scenarios. If you extend this simulator to allow overlap, you'll immediately see why NMS (which we covered last episode) becomes so critical -- overlapping ground truth objects generate ambiguous training signals.
Exercise 2: NMS benchmarking suite.
import numpy as np
import random

def compute_iou(box1, box2):
    x1 = max(box1[0], box2[0])
    y1 = max(box1[1], box2[1])
    x2 = min(box1[2], box2[2])
    y2 = min(box1[3], box2[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    a1 = (box1[2] - box1[0]) * (box1[3] - box1[1])
    a2 = (box2[2] - box2[0]) * (box2[3] - box2[1])
    union = a1 + a2 - inter
    return inter / max(union, 1e-6)

class NMSBenchmark:
    def __init__(self, seed=42):
        self.rng = random.Random(seed)

    def _make_scenario(self, mode):
        boxes, scores, gt = [], [], []
        if mode == "easy":
            for i in range(5):
                x = 50 + i * 120
                gt.append([x, 50, x + 80, 130])
                boxes.append([x + 2, 48, x + 82, 132])
                scores.append(0.9 - i * 0.05)
        elif mode == "moderate":
            for i in range(5):
                x = 30 + i * 100
                gt.append([x, 30, x + 70, 110])
                for j in range(3):
                    dx = self.rng.randint(-10, 10)
                    dy = self.rng.randint(-10, 10)
                    boxes.append([x + dx, 30 + dy,
                                  x + 70 + dx, 110 + dy])
                    scores.append(
                        0.9 - j * 0.15
                        + self.rng.uniform(-0.05, 0.05)
                    )
        else:  # hard
            positions = [
                (30, 30), (60, 40), (200, 30),
                (210, 50), (350, 100)
            ]
            for px, py in positions:
                gt.append([px, py, px + 60, py + 80])
            for gx1, gy1, gx2, gy2 in gt:
                for _ in range(10):
                    dx = self.rng.randint(-15, 15)
                    dy = self.rng.randint(-15, 15)
                    boxes.append([gx1 + dx, gy1 + dy,
                                  gx2 + dx, gy2 + dy])
                    scores.append(
                        self.rng.uniform(0.3, 0.95)
                    )
        return boxes, scores, gt

    def standard_nms(self, boxes, scores, thresh=0.5):
        order = sorted(range(len(scores)),
                       key=lambda i: scores[i],
                       reverse=True)
        keep = []
        while order:
            best = order[0]
            keep.append(best)
            order = [
                i for i in order[1:]
                if compute_iou(boxes[best],
                               boxes[i]) < thresh
            ]
        return keep

    def soft_nms(self, boxes, scores, sigma=0.5,
                 thresh=0.01):
        boxes = [list(b) for b in boxes]
        scores = list(scores)
        # Track original indices: the working lists shrink,
        # but the caller indexes into the original boxes
        indices = list(range(len(scores)))
        kept = []
        while scores:
            pos = scores.index(max(scores))
            kept.append(indices.pop(pos))
            ref = boxes.pop(pos)
            scores.pop(pos)
            # Gaussian decay of the remaining scores
            for i in range(len(scores)):
                iou = compute_iou(ref, boxes[i])
                scores[i] *= np.exp(
                    -(iou ** 2) / sigma
                )
            surviving = [
                (b, s, k) for b, s, k in
                zip(boxes, scores, indices)
                if s >= thresh
            ]
            boxes = [x[0] for x in surviving]
            scores = [x[1] for x in surviving]
            indices = [x[2] for x in surviving]
        return kept

    def _detection_rate(self, kept_boxes, gt_boxes):
        detected = 0
        for g in gt_boxes:
            for k in kept_boxes:
                if compute_iou(k, g) >= 0.5:
                    detected += 1
                    break
        return detected / max(len(gt_boxes), 1)

    def run(self):
        print(f"{'Scenario':<10} {'Method':<10} "
              f"{'Kept':>5} {'DetRate':>8} {'FP':>5}")
        print("-" * 42)
        for mode in ["easy", "moderate", "hard"]:
            boxes, scores, gt = self._make_scenario(mode)
            for name, fn in [
                ("NMS", self.standard_nms),
                ("SoftNMS", self.soft_nms)
            ]:
                kept_idx = fn(list(boxes), list(scores))
                kept_b = [boxes[i] for i in kept_idx]
                dr = self._detection_rate(kept_b, gt)
                matched = sum(
                    1 for k in kept_b
                    if any(compute_iou(k, g) >= 0.5
                           for g in gt)
                )
                fp = len(kept_b) - matched
                print(f"{mode:<10} {name:<10} "
                      f"{len(kept_b):>5} {dr:>8.1%} "
                      f"{fp:>5}")

bench = NMSBenchmark()
bench.run()
In the "hard" scenario with closely-placed objects, Soft-NMS should retain more correct detections because it decays scores gradually instead of hard-deleting overlapping boxes. The two objects at positions (30,30) and (60,40) have substantial overlap -- standard NMS will likely suppress one of them, while Soft-NMS keeps both (with a reduced score for the secondary one).
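To make the difference concrete, here is a minimal sketch of the two score updates applied to a single neighbouring box. The function names are illustrative (not from any library), and the Gaussian decay mirrors the `exp(-(iou ** 2) / sigma)` rule used in the solution above.

```python
import math

def hard_nms_score(score, iou, thresh=0.5):
    """Standard NMS: drop the box once IoU reaches the threshold."""
    return 0.0 if iou >= thresh else score

def soft_nms_score(score, iou, sigma=0.5):
    """Gaussian Soft-NMS: shrink the score smoothly as overlap grows."""
    return score * math.exp(-(iou ** 2) / sigma)

# Compare both rules for a neighbour that starts at score 0.9
for iou in (0.1, 0.3, 0.5, 0.7, 0.9):
    print(f"IoU={iou:.1f}  "
          f"hard={hard_nms_score(0.9, iou):.3f}  "
          f"soft={soft_nms_score(0.9, iou):.3f}")
```

The hard rule is a cliff at the threshold, while the soft rule lets a heavily overlapping box survive with a lower score, which is what helps when two real objects sit close together.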
Exercise 3: Simplified Faster R-CNN forward pass simulator.
import numpy as np
import random

def compute_iou(box1, box2):
    x1 = max(box1[0], box2[0])
    y1 = max(box1[1], box2[1])
    x2 = min(box1[2], box2[2])
    y2 = min(box1[3], box2[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    a1 = (box1[2] - box1[0]) * (box1[3] - box1[1])
    a2 = (box2[2] - box2[0]) * (box2[3] - box2[1])
    union = a1 + a2 - inter
    return inter / max(union, 1e-6)

class SimplifiedFasterRCNN:
    def __init__(self, grid=7, num_classes=5, seed=42):
        self.grid = grid
        self.num_classes = num_classes
        self.rng = np.random.RandomState(seed)
        self.scales = [32, 64, 128]
        self.ratios = [0.5, 1.0, 2.0]

    def generate_anchors(self):
        anchors = []
        cell_size = 448 / self.grid  # assume 448px image
        for row in range(self.grid):
            for col in range(self.grid):
                cx = (col + 0.5) * cell_size
                cy = (row + 0.5) * cell_size
                for s in self.scales:
                    for r in self.ratios:
                        w = s * np.sqrt(r)
                        h = s / np.sqrt(r)
                        anchors.append([
                            cx - w / 2, cy - h / 2,
                            cx + w / 2, cy + h / 2
                        ])
        return anchors

    def rpn_scores(self, anchors):
        return self.rng.uniform(0, 1, len(anchors))

    def rpn_offsets(self, anchors):
        return self.rng.uniform(-5, 5,
                                (len(anchors), 4))

    def apply_offsets(self, anchors, offsets):
        proposals = []
        for a, o in zip(anchors, offsets):
            proposals.append([
                a[0] + o[0], a[1] + o[1],
                a[2] + o[2], a[3] + o[3]
            ])
        return proposals

    def top_k(self, proposals, scores, k=300):
        order = np.argsort(scores)[::-1][:k]
        return ([proposals[i] for i in order],
                [scores[i] for i in order])

    def nms(self, boxes, scores, thresh=0.7):
        order = sorted(range(len(scores)),
                       key=lambda i: scores[i],
                       reverse=True)
        keep = []
        while order:
            best = order[0]
            keep.append(best)
            order = [
                i for i in order[1:]
                if compute_iou(boxes[best],
                               boxes[i]) < thresh
            ]
        return keep

    def forward(self):
        # Stage 1: generate anchors
        anchors = self.generate_anchors()
        print(f"Stage 1 - Anchors: {len(anchors)}")
        # Stage 2: RPN scoring + offset
        obj_scores = self.rpn_scores(anchors)
        offsets = self.rpn_offsets(anchors)
        proposals = self.apply_offsets(anchors, offsets)
        # Stage 3: top-K proposals
        proposals, scores = self.top_k(
            proposals, obj_scores, k=300
        )
        print(f"Stage 2 - After top-K: {len(proposals)}")
        # Stage 4: NMS on proposals
        kept = self.nms(proposals, scores, thresh=0.7)
        proposals = [proposals[i] for i in kept]
        print(f"Stage 3 - After NMS: {len(proposals)}")
        # Stage 5: detection head
        detections = []
        for p in proposals:
            cls = self.rng.randint(0, self.num_classes)
            conf = self.rng.uniform(0.1, 1.0)
            detections.append({
                "box": p, "class": cls,
                "score": conf
            })
        # Stage 6: per-class NMS
        final = []
        for c in range(self.num_classes):
            class_dets = [
                d for d in detections if d["class"] == c
            ]
            if not class_dets:
                continue
            boxes_c = [d["box"] for d in class_dets]
            scores_c = [d["score"] for d in class_dets]
            kept_c = self.nms(boxes_c, scores_c,
                              thresh=0.3)
            for i in kept_c:
                final.append(class_dets[i])
        print(f"Stage 4 - Final detections: {len(final)}")
        return final

model = SimplifiedFasterRCNN()
results = model.forward()
The key takeaway is that each stage acts as a filter: 441 anchors (7 x 7 cells x 9 shapes) -> 300 top-K -> proposals surviving NMS -> final detections after per-class NMS. The RPN is basically saying "there might be objects here" for 300 of the 441 anchors, then NMS suppresses near-duplicate proposals, and the detection head assigns classes and confidence scores. How much NMS removes depends on how densely the boxes overlap: with anchors spaced this far apart, the 0.7 proposal threshold prunes far less than the stricter 0.3 per-class threshold. Real Faster R-CNN uses learned weights instead of random scores, but the data flow pattern is identical.
On to today's episode
Here we go! Last episode we traced the evolution of two-stage detectors from the brute-force sliding window all the way through to Faster R-CNN. We built IoU from scratch, implemented NMS, walked through the R-CNN family's systematic bottleneck removal, and ended up at ~5 FPS on a 2015-era GPU. Good, but not real-time.
The two-stage pipeline -- first propose regions, then classify them -- is inherently limited in speed. You're doing two separate jobs sequentially. In 2015, a paper came along with a title that said it all: "You Only Look Once." What if you could skip the proposal stage entirely and predict everything in a single forward pass? That's exactly what YOLO did, and it changed the entire field ;-)
YOLO: detection as regression
YOLO (Redmon et al., 2015) reframed detection as a single regression problem. Instead of the propose-then-classify pipeline, YOLO divides the image into an S x S grid (typically 7x7) and predicts bounding boxes and class probabilities directly from each grid cell in one forward pass through the network.
Each grid cell predicts B bounding boxes (each with 5 values: x, y, w, h, confidence) and C class probabilities. The output tensor has shape S x S x (B*5 + C). For PASCAL VOC with 20 classes, S=7, B=2, that's 7 x 7 x 30 -- a single tensor that encodes every detection in the image.
import torch
import torch.nn as nn

class SimpleYOLOHead(nn.Module):
    """Simplified YOLO detection head."""
    def __init__(self, in_channels, grid_size=7,
                 num_boxes=2, num_classes=20):
        super().__init__()
        self.S = grid_size
        self.B = num_boxes
        self.C = num_classes
        # Each cell: B boxes (x,y,w,h,conf) + C class probs
        out_features = (self.S * self.S
                        * (self.B * 5 + self.C))
        self.fc = nn.Sequential(
            nn.Flatten(),
            nn.Linear(in_channels * (grid_size ** 2),
                      4096),
            nn.LeakyReLU(0.1),
            nn.Dropout(0.5),
            nn.Linear(4096, out_features),
        )

    def forward(self, x):
        return self.fc(x).view(
            -1, self.S, self.S, self.B * 5 + self.C
        )

# For PASCAL VOC: 7x7 grid, 2 boxes, 20 classes
head = SimpleYOLOHead(512, grid_size=7,
                      num_boxes=2, num_classes=20)
fake_features = torch.randn(1, 512, 7, 7)
output = head(fake_features)
print(f"Output shape: {output.shape}")
# torch.Size([1, 7, 7, 30])
# 30 = 2 boxes * 5 values + 20 classes
The box coordinates (x, y) are relative to the grid cell, and (w, h) are relative to the whole image. The confidence score represents both the probability that a box contains an object AND how good the box actually is: P(object) * IoU(predicted, truth).
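Turning one of those cell-relative predictions back into pixel coordinates is just a bit of arithmetic. The sketch below assumes a 448 x 448 input and a 7 x 7 grid; the function name `decode_yolo_v1_box` and its arguments are made up for illustration.

```python
def decode_yolo_v1_box(row, col, x, y, w, h, S=7, img_size=448):
    """Turn one cell-relative prediction into pixel corner coordinates."""
    cell = img_size / S
    cx = (col + x) * cell   # (x, y) are offsets inside the grid cell
    cy = (row + y) * cell
    bw = w * img_size       # (w, h) are fractions of the whole image
    bh = h * img_size
    return [cx - bw / 2, cy - bh / 2, cx + bw / 2, cy + bh / 2]

# A box centred in the middle cell, a quarter of the image wide
# and half the image tall
box = decode_yolo_v1_box(row=3, col=3, x=0.5, y=0.5, w=0.25, h=0.5)
print(box)  # [168.0, 112.0, 280.0, 336.0]
```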
YOLO's speed was revolutionary: 45 FPS on a GPU, compared to Faster R-CNN's ~5 FPS. The tradeoff? Lower accuracy, especially on small objects and objects that appear in groups. The coarse 7x7 grid means each cell can only predict a limited number of objects. Two small birds sitting in the same grid cell? YOLO v1 can only detect one of them.
Having said that, for most practical applications (security cameras, robotics, autonomous driving), that speed advantage was worth more than marginal accuracy improvements. A detector that runs at 5 FPS can't power a real-time system. One that runs at 45+ FPS absolutely can.
The YOLO evolution
YOLOv1 had clear limitations, and subsequent versions addressed them systematically. This progression is worth understanding because it shows how the field iterated toward the modern detectors we use today.
YOLOv2 (also called YOLO9000) added batch normalization everywhere, borrowed anchor boxes from Faster R-CNN (instead of free-form box prediction, start with predefined shapes and learn to refine them), introduced multi-scale training (randomly resize the input during training to handle different object sizes), and used a better backbone called Darknet-19.
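The "start with predefined shapes and learn to refine them" idea fits in a few lines. This is a minimal sketch of the YOLOv2-style decoding rule (sigmoid for the centre offset, exponential for the size), working in grid-cell units. The function names and the example anchor shape are illustrative assumptions.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def decode_anchor_box(tx, ty, tw, th, col, row, anchor_w, anchor_h):
    """Refine one anchor (prior) into a box, all in grid-cell units."""
    bx = col + sigmoid(tx)         # centre is kept inside its own cell
    by = row + sigmoid(ty)
    bw = anchor_w * math.exp(tw)   # rescale the predefined shape
    bh = anchor_h * math.exp(th)
    return bx, by, bw, bh

# All-zero network outputs return the anchor itself,
# centred in the cell
print(decode_anchor_box(0.0, 0.0, 0.0, 0.0, col=4, row=2,
                        anchor_w=3.0, anchor_h=1.5))
# (4.5, 2.5, 3.0, 1.5)
```

Because the network only has to predict small corrections to a sensible starting shape, training tends to be more stable than regressing free-form boxes.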
YOLOv3 brought multi-scale detection: predictions at three different feature map scales (like a Feature Pyramid Network). This was a big deal for small objects -- the fine-grained feature map catches small things that the coarse grid missed. It also switched to Darknet-53 with residual connections (sound familiar from episode #46?).
YOLOv5 (Ultralytics) was a PyTorch reimplementation that prioritized usability over novelty. Easy training, ONNX/TensorRT export, excellent documentation, pip-installable. It became the de facto standard for anyone who needed to actually ship a detection system rather than publish a paper.
YOLOv8 (also Ultralytics, 2023) is the current practical standard. It's anchor-free (no predefined box shapes -- full circle back to the original YOLO philosophy of simplicity), uses a decoupled detection head (separate branches for classification and localization), and supports detection, segmentation, pose estimation, and classification all in one unified framework.
from ultralytics import YOLO

# Load pretrained model (COCO, 80 classes)
model = YOLO("yolov8n.pt")  # nano: fast, smaller

# Inference on an image
results = model("street_photo.jpg")

# Process results
for result in results:
    boxes = result.boxes
    for box in boxes:
        x1, y1, x2, y2 = box.xyxy[0].tolist()
        confidence = box.conf[0].item()
        class_id = int(box.cls[0].item())
        class_name = model.names[class_id]
        print(f"{class_name}: {confidence:.2f} "
              f"at [{x1:.0f},{y1:.0f},"
              f"{x2:.0f},{y2:.0f}]")
    # Visualize with boxes drawn on the image
    result.save("output_detections.jpg")
A handful of lines of code to go from an image to detected objects with bounding boxes, confidence scores, and class labels. That's the power of modern frameworks built on top of a decade of architectural iteration ;-)
The YOLO model sizes range from nano (3.2M parameters, ~640 FPS on GPU) through small, medium, large, to extra-large (68.2M parameters, highest accuracy). Pick based on your deployment constraint: edge device with limited compute? Use nano. Server with a beefy GPU? Use large or extra-large.
SSD: multi-scale single-shot detection
SSD (Single Shot MultiBox Detector, Liu et al., 2016) was published around the same time as YOLO and took a different approach to single-shot detection. Instead of predicting from a single grid, SSD predicts from multiple feature maps at different scales.
Input Image (300x300)
|
VGG-16 Backbone
|
Feature Map 38x38 -> predictions (small objects)
|
Feature Map 19x19 -> predictions (medium objects)
|
Feature Map 10x10 -> predictions
|
Feature Map 5x5 -> predictions
|
Feature Map 3x3 -> predictions (large objects)
|
Feature Map 1x1 -> predictions (very large objects)
The insight is elegant: small feature maps (3x3) have large receptive fields and detect large objects. Large feature maps (38x38) have small receptive fields and detect small objects. By predicting from all levels simultaneously, SSD handles objects of varying sizes much better than original YOLO.
At each spatial position in each feature map, SSD predicts: class scores for each of several default boxes (anchor boxes) with different aspect ratios, and offset adjustments to refine those default boxes.
import torch
import torch.nn as nn

class SSDPredictionHead(nn.Module):
    """Prediction head for one SSD feature level."""
    def __init__(self, in_channels, num_anchors,
                 num_classes):
        super().__init__()
        # Classification: anchors * classes per position
        self.cls = nn.Conv2d(
            in_channels, num_anchors * num_classes,
            3, padding=1
        )
        # Localization: anchors * 4 coords per position
        self.loc = nn.Conv2d(
            in_channels, num_anchors * 4,
            3, padding=1
        )

    def forward(self, feature_map):
        cls_pred = self.cls(feature_map)
        loc_pred = self.loc(feature_map)
        return cls_pred, loc_pred

# Each feature level gets its own prediction head
# with the appropriate number of input channels
heads = nn.ModuleList([
    SSDPredictionHead(512, num_anchors=4,
                      num_classes=21),   # 38x38
    SSDPredictionHead(1024, num_anchors=6,
                      num_classes=21),   # 19x19
    SSDPredictionHead(512, num_anchors=6,
                      num_classes=21),   # 10x10
    SSDPredictionHead(256, num_anchors=6,
                      num_classes=21),   # 5x5
    SSDPredictionHead(256, num_anchors=4,
                      num_classes=21),   # 3x3
    SSDPredictionHead(256, num_anchors=4,
                      num_classes=21),   # 1x1
])

# Total anchors: 38*38*4 + 19*19*6 + 10*10*6
# + 5*5*6 + 3*3*4 + 1*1*4 = 8732
# (computed outside the f-string: an expression cannot
# span two adjacent string literals)
total_anchors = (38*38*4 + 19*19*6 + 10*10*6
                 + 5*5*6 + 3*3*4 + 1*1*4)
print(f"Total anchor boxes: {total_anchors}")
8,732 anchor boxes total across all feature levels. That sounds like a lot, but the vast majority get classified as "background" very quickly. The network learns to only activate the anchors that actually overlap with objects.
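Where do the default box shapes come from? A rough sketch, assuming the linear scale rule described in the SSD paper (scales spread evenly between 0.2 and 0.9 of the image size, plus one extra square box per level). The helper names are hypothetical, and real implementations tweak these numbers per dataset.

```python
import math

def ssd_scales(num_levels=6, s_min=0.2, s_max=0.9):
    """Linearly spaced box scales, one per feature level."""
    step = (s_max - s_min) / (num_levels - 1)
    return [s_min + step * k for k in range(num_levels)]

def default_box_shapes(scale, next_scale, ratios):
    """(w, h) pairs for one level, as fractions of the image size."""
    shapes = [(scale * math.sqrt(r), scale / math.sqrt(r))
              for r in ratios]
    extra = math.sqrt(scale * next_scale)  # extra, slightly larger square
    shapes.append((extra, extra))
    return shapes

scales = ssd_scales()
# 3 aspect ratios + 1 extra square = 4 default boxes per position,
# matching num_anchors=4 on the 38x38 level
shapes = default_box_shapes(scales[0], scales[1],
                            ratios=[1.0, 2.0, 0.5])
print([round(s, 2) for s in scales])
print(len(shapes))
```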
The multi-scale prediction idea from SSD became foundational. Feature Pyramid Networks (FPN, Lin et al., 2017) formalized this approach by adding top-down connections that pass semantic information from deeper layers back to shallower layers. Virtually every modern detector -- including recent YOLO versions -- uses FPN or something similar. The principle that different scales need different feature maps has become one of those things nobody questions anymore.
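The top-down connection is easier to see in code than in words. Below is a deliberately tiny two-level sketch, not a full FPN: the class name `TinyFPN`, the channel counts, and the feature map sizes are all assumptions chosen to line up with the SSD example above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyFPN(nn.Module):
    """Two-level top-down pathway (illustrative only)."""
    def __init__(self, c_shallow, c_deep, out_channels=256):
        super().__init__()
        # 1x1 lateral convs bring both levels to the same channel count
        self.lat_shallow = nn.Conv2d(c_shallow, out_channels, 1)
        self.lat_deep = nn.Conv2d(c_deep, out_channels, 1)
        # 3x3 conv smooths the merged map
        self.smooth = nn.Conv2d(out_channels, out_channels,
                                3, padding=1)

    def forward(self, shallow, deep):
        p_deep = self.lat_deep(deep)
        # Upsample the deep (semantic) map to the shallow resolution
        up = F.interpolate(p_deep, size=shallow.shape[-2:],
                           mode="nearest")
        # Merge: fine spatial detail + upsampled semantics
        p_shallow = self.smooth(self.lat_shallow(shallow) + up)
        return p_shallow, p_deep

fpn = TinyFPN(c_shallow=512, c_deep=1024)
p3, p4 = fpn(torch.randn(1, 512, 38, 38),
             torch.randn(1, 1024, 19, 19))
print(p3.shape, p4.shape)
```

Both outputs end up with the same channel count, so one shared detection head can run on every level.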
Anchor-free detection: simpler is better
Anchor-based detectors (Faster R-CNN, SSD, YOLO v2-v5) all require careful anchor design. What sizes? What aspect ratios? How many per position? Get the anchors wrong and performance suffers. There's even an entire research subfield about "anchor optimization" -- which is a sign that maybe the whole anchor concept is more trouble than it's worth.
FCOS (Fully Convolutional One-Stage, Tian et al., 2019) eliminates anchors entirely. For each position in the feature map, it directly predicts: the distances to the four sides of the bounding box (left, top, right, bottom), a classification score, and a centerness score that downweights predictions far from object centers.
import torch
import torch.nn as nn

class FCOSHead(nn.Module):
    """Anchor-free detection: predict distances
    to box edges from every feature position."""
    def __init__(self, in_channels, num_classes):
        super().__init__()
        # Classification branch
        self.cls_conv = nn.Sequential(
            nn.Conv2d(in_channels, 256, 3, padding=1),
            nn.GroupNorm(32, 256),
            nn.ReLU(),
            nn.Conv2d(256, 256, 3, padding=1),
            nn.GroupNorm(32, 256),
            nn.ReLU(),
        )
        self.cls_score = nn.Conv2d(
            256, num_classes, 3, padding=1
        )
        # Regression branch
        self.reg_conv = nn.Sequential(
            nn.Conv2d(in_channels, 256, 3, padding=1),
            nn.GroupNorm(32, 256),
            nn.ReLU(),
            nn.Conv2d(256, 256, 3, padding=1),
            nn.GroupNorm(32, 256),
            nn.ReLU(),
        )
        # 4 distances: left, top, right, bottom
        self.reg_pred = nn.Conv2d(256, 4, 3, padding=1)
        self.centerness = nn.Conv2d(256, 1, 3, padding=1)

    def forward(self, feature_map):
        cls_feat = self.cls_conv(feature_map)
        reg_feat = self.reg_conv(feature_map)
        cls_score = self.cls_score(cls_feat)
        # exp() ensures distances are positive
        reg_pred = torch.exp(self.reg_pred(reg_feat))
        centerness = self.centerness(reg_feat)
        return cls_score, reg_pred, centerness

# Test with a 32x32 feature map, 80 COCO classes
head = FCOSHead(256, num_classes=80)
feat = torch.randn(1, 256, 32, 32)
cls_out, reg_out, center_out = head(feat)
print(f"Classification: {cls_out.shape}")
# (1, 80, 32, 32) -- per-position class scores
print(f"Regression: {reg_out.shape}")
# (1, 4, 32, 32) -- l,t,r,b distances
print(f"Centerness: {center_out.shape}")
# (1, 1, 32, 32) -- how close to object center
The centerness trick is clever. Without it, positions at the edge of an object produce poor boxes (because the distances to opposite sides are very unequal). The centerness score is defined as sqrt(min(l,r)/max(l,r) * min(t,b)/max(t,b)) -- it's 1.0 at the exact center of an object and approaches 0 at the edges. Multiplying the classification score by centerness during inference naturally suppresses low-quality detections from edge positions.
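Plugging a few numbers into that formula shows how quickly the score drops off. A minimal sketch, with the distances given in pixels for a 100 x 100 box (the function name is just for illustration):

```python
import math

def centerness(l, t, r, b):
    """FCOS centerness target from distances to the four box sides."""
    return math.sqrt((min(l, r) / max(l, r))
                     * (min(t, b) / max(t, b)))

# Exact centre: all distances equal
print(centerness(50, 50, 50, 50))            # 1.0
# Close to the left edge, vertically centred
print(round(centerness(10, 50, 90, 50), 3))  # 0.333
# Close to the top-left corner
print(round(centerness(10, 10, 90, 90), 3))  # 0.111
```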
CenterNet (Zhou et al., 2019) takes the simplification even further. It detects objects as center points. The model produces a heatmap where each peak represents an object center, then predicts width and height at each peak location. No anchors, no NMS needed (peaks in the heatmap naturally separate because each object produces exactly one peak).
import torch

def centernet_decode(heatmap, wh_pred, top_k=100):
    """Extract detections from CenterNet outputs.
    heatmap: (1, C, H, W) -- per-class center heatmaps
    wh_pred: (1, 2, H, W) -- width,height at each pos
    """
    batch, num_classes, h, w = heatmap.shape
    # Find local maxima (peaks) in the heatmap
    # using max-pooling with kernel 3
    pooled = torch.nn.functional.max_pool2d(
        heatmap, 3, stride=1, padding=1
    )
    # A position is a peak if it equals the pooled value
    peaks = (heatmap == pooled).float() * heatmap
    # Get top-K peaks across all classes
    flat = peaks.view(batch, -1)
    # Guard against asking for more peaks than positions
    top_k = min(top_k, flat.shape[1])
    top_scores, top_indices = flat.topk(top_k)
    # Convert flat indices back to (class, y, x)
    top_classes = top_indices // (h * w)
    positions = top_indices % (h * w)
    top_y = positions // w
    top_x = positions % w
    # Look up width/height at each peak position
    detections = []
    for i in range(top_k):
        score = top_scores[0, i].item()
        if score < 0.3:
            break
        cx = top_x[0, i].item()
        cy = top_y[0, i].item()
        cls = top_classes[0, i].item()
        width = wh_pred[0, 0, cy, cx].item()
        height = wh_pred[0, 1, cy, cx].item()
        detections.append({
            "class": cls,
            "score": score,
            "box": [cx - width / 2, cy - height / 2,
                    cx + width / 2, cy + height / 2],
        })
    return detections
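The max-pooling trick is what stands in for NMS here, so it's worth seeing on a toy example. The heatmap values below are invented for illustration: one object with a strong centre and a weaker neighbouring response, plus a second object elsewhere.

```python
import torch
import torch.nn.functional as F

heatmap = torch.zeros(1, 1, 5, 5)
heatmap[0, 0, 1, 1] = 0.9  # object centre
heatmap[0, 0, 1, 2] = 0.6  # weaker response next to it (same object)
heatmap[0, 0, 3, 4] = 0.8  # second object

# Each position gets the max of its 3x3 neighbourhood
pooled = F.max_pool2d(heatmap, 3, stride=1, padding=1)
# Keep positions that are their own neighbourhood maximum
# and clear the score threshold
peaks = (heatmap == pooled) & (heatmap > 0.3)
peak_positions = peaks[0, 0].nonzero().tolist()
print(peak_positions)  # [[1, 1], [3, 4]] as (y, x) pairs
```

The 0.6 response is dropped because a stronger value sits in its 3x3 neighbourhood, which is the same job NMS does for overlapping boxes.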
The trend in detection architectures is unmistakable: simpler is better. Fewer hyperparameters, fewer special components, more straightforward training. YOLOv8 adopted anchor-free detection, and DETR (Detection Transformer) formulates detection as a set prediction problem. The whole field is moving away from the complex multi-stage pipelines that dominated five years ago.
Training a custom detector
All these architectures are great for understanding the concepts, but the real power of modern detection is that you can train on your own objects with surprisingly little data and effort. Want to detect specific products on store shelves? Defects on a manufacturing line? Species of birds in a forest? The workflow is the same.
Step 1: annotate your data. You need images with bounding box annotations. Each box has coordinates and a class label. Tools like Label Studio, CVAT, or Roboflow provide annotation interfaces. For YOLO format, each image gets a text file with one line per object:
# class_id center_x center_y width height
# all values normalized to 0-1 relative to image size
0 0.45 0.38 0.12 0.25
2 0.73 0.62 0.08 0.15
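If your annotation tool exports pixel-coordinate corner boxes instead, converting them to this format is straightforward. A minimal sketch; the helper name `to_yolo_label` and the six-decimal formatting are assumptions, not part of any official tooling.

```python
def to_yolo_label(class_id, x1, y1, x2, y2, img_w, img_h):
    """Convert a pixel-space corner box to one YOLO label line."""
    cx = (x1 + x2) / 2 / img_w   # normalized box centre
    cy = (y1 + y2) / 2 / img_h
    w = (x2 - x1) / img_w        # normalized box size
    h = (y2 - y1) / img_h
    return f"{class_id} {cx:.6f} {cy:.6f} {w:.6f} {h:.6f}"

# A box covering the central half of a 640x480 image
line = to_yolo_label(0, 160, 120, 480, 360, img_w=640, img_h=480)
print(line)  # 0 0.500000 0.500000 0.500000 0.500000
```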
Step 2: organize your dataset in the standard YOLO directory structure:
dataset/
    images/
        train/
            img001.jpg
            img002.jpg
        val/
            img003.jpg
            img004.jpg
    labels/
        train/
            img001.txt
            img002.txt
        val/
            img003.txt
            img004.txt
    data.yaml
# data.yaml -- tells YOLO where to find everything
path: ./dataset
train: images/train
val: images/val
names:
  0: product_a
  1: product_b
  2: defect
Step 3: train.
from ultralytics import YOLO

# Start from pretrained COCO weights (transfer learning)
model = YOLO("yolov8n.pt")

# Train on your custom dataset
results = model.train(
    data="dataset/data.yaml",
    epochs=100,
    imgsz=640,
    batch=16,
    patience=20,  # early stopping if no improvement
    lr0=0.01,     # initial learning rate
    # Built-in training augmentations are on by default;
    # tune them with arguments such as mosaic, fliplr,
    # hsv_h (`augment` is a predict-time flag)
)
# Training outputs: best.pt and last.pt in
# runs/detect/train*/weights/
The built-in augmentations include mosaic (stitching four training images together -- a trick introduced in YOLOv4 that dramatically improves small object detection), random flip, random rotation, HSV jitter, and scale variation. You get a solid data augmentation pipeline without writing any extra code.
Step 4: evaluate and iterate.
# Evaluate on the validation set
metrics = model.val()
print(f"mAP50: {metrics.box.map50:.3f}")
print(f"mAP50-95: {metrics.box.map:.3f}")
print(f"Precision: {metrics.box.mp:.3f}")
print(f"Recall: {metrics.box.mr:.3f}")
# Run inference on a new image
results = model("new_test_image.jpg")
results[0].save("prediction_result.jpg")
# Export for deployment
model.export(format="onnx") # ONNX for TensorRT
model.export(format="torchscript") # mobile
For most custom detection tasks, 200-500 annotated images with transfer learning from a COCO-pretrained model is enough to get decent results. The pretrained backbone already knows how to extract visual features (edges, textures, shapes): that's the CNN theory from episode #45 in practice. Your training mainly teaches the detection head to recognize your specific objects. More data generally helps, but the returns diminish quickly with transfer learning.
A practical tip from experience: spend more time on annotation quality than on model tuning. Inconsistent annotations (different annotators drawing boxes at different tightness levels, missing objects in some images, mislabeled classes) hurt performance more than any hyperparameter choice. Clean data beats a bigger model every time -- the same lesson we learned back in episode #14 about data preparation.
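Some annotation problems can be caught automatically. The sketch below checks the lines of one detection label file, assuming the usual YOLO layout of `class_id center_x center_y width height` with coordinates normalized to 0-1. The function name and the problem messages are made up for this example, and it only catches format errors; inconsistent box tightness or missing objects still need a human review.

```python
def check_yolo_label_lines(lines, num_classes):
    """Return a list of (line_number, problem) tuples for one label file."""
    problems = []
    for number, line in enumerate(lines, start=1):
        parts = line.split()
        if not parts:
            continue  # ignore blank lines
        if len(parts) != 5:
            problems.append((number, "expected 5 values"))
            continue
        try:
            class_id = int(parts[0])
            cx, cy, w, h = (float(v) for v in parts[1:])
        except ValueError:
            problems.append((number, "non-numeric value"))
            continue
        if not 0 <= class_id < num_classes:
            problems.append((number, "class id out of range"))
        if not all(0.0 <= v <= 1.0 for v in (cx, cy, w, h)):
            problems.append((number, "coordinate outside 0-1"))
        elif w == 0 or h == 0:
            problems.append((number, "zero-size box"))
    return problems
```

Running this over every file in `labels/train` and `labels/val` before training is cheap, and an empty result for each file means the labels are at least well-formed.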
Evaluation: mean Average Precision
Detection evaluation is more involved than classification because you need to match predicted boxes to ground truth boxes. We introduced IoU in episode #78. Now we build the full evaluation pipeline on top of it.
The standard metric is mAP (mean Average Precision). For each class:
- Sort all detections by confidence score (descending)
- For each detection, check if it matches a ground truth box (IoU >= threshold)
- Mark each detection as true positive (matched) or false positive (unmatched or duplicate)
- Compute precision and recall at each step
- Plot the precision-recall curve
- Average Precision = area under this curve
import numpy as np
def compute_iou(box1, box2):
    # Boxes are [x1, y1, x2, y2]; first find the intersection rectangle
    x1 = max(box1[0], box2[0])
    y1 = max(box1[1], box2[1])
    x2 = min(box1[2], box2[2])
    y2 = min(box1[3], box2[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    a1 = (box1[2] - box1[0]) * (box1[3] - box1[1])
    a2 = (box2[2] - box2[0]) * (box2[3] - box2[1])
    union = a1 + a2 - inter
    return inter / max(union, 1e-6)
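def compute_ap(predictions, ground_truths,
               iou_threshold=0.5):
    """Average Precision for a single class."""
    preds = sorted(predictions,
                   key=lambda p: p["score"],
                   reverse=True)
    if len(ground_truths) == 0:
        return 0.0
    tp = np.zeros(len(preds))
    fp = np.zeros(len(preds))
    matched_gt = set()
    for i, pred in enumerate(preds):
        # Find the ground truth box with the highest IoU
        best_iou = 0
        best_gt = -1
        for j, gt in enumerate(ground_truths):
            iou = compute_iou(pred["box"], gt["box"])
            if iou > best_iou:
                best_iou = iou
                best_gt = j
        # Each ground truth box can be matched only once
        if (best_iou >= iou_threshold
                and best_gt not in matched_gt):
            tp[i] = 1
            matched_gt.add(best_gt)
        else:
            fp[i] = 1
    # Cumulative precision and recall
    tp_cum = np.cumsum(tp)
    fp_cum = np.cumsum(fp)
    precision = tp_cum / (tp_cum + fp_cum)
    recall = tp_cum / len(ground_truths)
    # Area under the P-R curve (all-point interpolation):
    # pad the curve so it spans recall 0 to 1, make precision
    # non-increasing, then sum the rectangles where recall changes
    mrec = np.concatenate(([0.0], recall, [1.0]))
    mpre = np.concatenate(([0.0], precision, [0.0]))
    mpre = np.maximum.accumulate(mpre[::-1])[::-1]
    idx = np.where(mrec[1:] != mrec[:-1])[0]
    ap = np.sum((mrec[idx + 1] - mrec[idx]) * mpre[idx + 1])
    return float(ap)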
# Example: 4 predictions, 2 ground truth objects
predictions = [
    {"box": [100, 100, 200, 200], "score": 0.95},
    {"box": [102, 98, 205, 203], "score": 0.90},
    {"box": [300, 300, 400, 400], "score": 0.80},
    {"box": [50, 50, 100, 100], "score": 0.70},
]
ground_truths = [
    {"box": [105, 95, 205, 205]},
    {"box": [295, 295, 405, 405]},
]
ap = compute_ap(predictions, ground_truths)
print(f"Average Precision: {ap:.3f}")
# First pred matches GT[0] (TP)
# Second pred overlaps GT[0] too but it's already matched (FP)
# Third pred matches GT[1] (TP)
# Fourth pred matches nothing (FP)
mAP is just the mean AP across all object classes. But which IoU threshold do you use?
mAP50 (PASCAL VOC metric): AP at IoU threshold 0.5. Relatively lenient -- a prediction whose intersection with the ground truth is only half of their union still counts as correct. This is fine for "is the car roughly there" but doesn't reward precise localization.
mAP50-95 (COCO metric): the mean of AP values computed at ten IoU thresholds from 0.5 to 0.95, in steps of 0.05. Much stricter. A model that localizes objects precisely (IoU > 0.9) scores significantly higher than one that draws sloppy boxes (IoU around 0.5). This is the standard metric today, and it's what you'll see reported in most modern detection papers.
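A toy example shows how much the threshold sweep matters. The sketch below looks at a single prediction and a single ground truth box: in that case AP at a given threshold is 1.0 if the IoU clears it and 0.0 otherwise, so averaging over the ten thresholds gives the fraction of thresholds passed. The function names and the example boxes are made up for illustration; a full evaluation would run the complete AP computation at each threshold and average over classes.

```python
import numpy as np

# The ten COCO-style IoU thresholds: 0.50, 0.55, ..., 0.95
IOU_THRESHOLDS = np.linspace(0.5, 0.95, 10)


def box_iou(a, b):
    """IoU of two [x1, y1, x2, y2] boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / max(area_a + area_b - inter, 1e-6)


def single_box_ap_sweep(pred_box, gt_box):
    """Toy case with one prediction and one ground truth box.

    Returns the IoU and the mean of the per-threshold AP values.
    """
    overlap = box_iou(pred_box, gt_box)
    ap_values = [1.0 if overlap >= t else 0.0 for t in IOU_THRESHOLDS]
    return overlap, float(np.mean(ap_values))


gt = [105, 95, 205, 205]
tight = [100, 100, 200, 200]    # close to the ground truth
sloppy = [135, 95, 235, 205]    # shifted 30 pixels to the right

for name, box in (("tight", tight), ("sloppy", sloppy)):
    overlap, score = single_box_ap_sweep(box, gt)
    print(f"{name}: IoU={overlap:.3f}, AP50-95={score:.2f}")
```

Both boxes count as correct at IoU 0.5, so they look identical under mAP50. Under the sweep, the tight box (IoU 0.826) passes seven of the ten thresholds and scores 0.70, while the sloppy box (IoU 0.538) passes only the first and scores 0.10.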
Choosing your detector
If you're starting a new detection project, here's the practical decision framework:
| Constraint | Recommendation |
|---|---|
| Real-time on edge (phone, Jetson) | YOLOv8n or YOLOv8s |
| Real-time on server GPU | YOLOv8m or YOLOv8l |
| Maximum accuracy, speed secondary | YOLOv8x or Faster R-CNN + ResNeXt |
| Small objects dominate | Faster R-CNN with FPN (two-stage still wins here) |
| Custom classes, few images | YOLOv8 + transfer learning from COCO |
| Quick prototype / demo | YOLOv8 pretrained on COCO (80 classes) |
For the vast majority of real-world projects, YOLOv8 with transfer learning is the right starting point. Fast to train, easy to deploy (ONNX, TensorRT, CoreML, TFLite -- all supported), well-documented, and competitive with anything more complex. Start there, measure your performance, and only switch to a heavier architecture if you have a specific reason.
Two-stage detectors like Faster R-CNN still have an edge on small object detection because the RPN can propose very small regions that single-shot detectors might miss at their coarsest feature level. But that gap is narrowing with every new YOLO release.
Summary
- YOLO reframed detection as single-pass regression: divide the image into a grid, predict boxes and classes directly, achieve real-time speeds (45+ FPS) -- a fundamentally different approach from the two-stage proposal-then-classify pipeline we covered in episode #78;
- the YOLO evolution (v1 through v8) systematically added anchor boxes, multi-scale prediction, better backbones, and eventually circled back to anchor-free design -- YOLOv8 is the current practical standard for real-world detection;
- SSD introduced multi-scale prediction from different feature map levels, handling objects of varying sizes by using small feature maps for large objects and large feature maps for small ones;
- anchor-free detectors (FCOS, CenterNet) simplify architecture by eliminating anchor design entirely -- predict distances to box edges or detect center points directly, with fewer hyperparameters to tune;
- custom detection with transfer learning requires as few as 200-500 annotated images: structure your dataset in YOLO format, start from COCO-pretrained weights, and the backbone's existing feature knowledge does most of the heavy lifting;
- mAP (mean Average Precision) measures detection quality across the full precision-recall curve, with mAP50-95 as the strict modern standard that rewards precise localization at high IoU thresholds.
With the detection foundations and modern approaches both covered, we have the complete picture of how to find objects in images. But detection only draws rectangles around things. What if you need pixel-level precision -- knowing exactly which pixels belong to each object? That's a different problem with its own set of architectures, and it builds directly on the feature pyramid and multi-scale concepts we've been working with here.
Exercises
Exercise 1: Build a YOLO annotation format converter. Create a class AnnotationConverter that: (a) loads annotations from PASCAL VOC XML format (each XML file has <object> tags with <name>, <bndbox> containing <xmin>, <ymin>, <xmax>, <ymax>), (b) converts to YOLO format (class_id, center_x, center_y, width, height -- all normalized 0-1), (c) converts to COCO JSON format (image_id, category_id, bbox as [x, y, width, height] in pixels, area), (d) can convert in all three directions (VOC -> YOLO, YOLO -> COCO, COCO -> VOC). Simulate with 5 test images (hardcoded annotations) and verify round-trip conversion: VOC -> YOLO -> COCO -> VOC produces the same bounding boxes (within floating point tolerance).
Exercise 2: Implement a multi-scale detection simulator that demonstrates the SSD principle. Create a class MultiScaleDetector that: (a) generates a synthetic 300x300 image with 8 objects at three scales -- 2 large (100x100+), 3 medium (40x80), and 3 small (15x25), (b) creates three simulated feature maps: 38x38 (stride 8), 19x19 (stride 16), and 10x10 (stride 30), (c) at each feature map level, iterates over spatial positions and checks which ground truth objects have their center within that cell, (d) assigns each object to the feature level whose stride best matches the object size (large objects -> 10x10 map, small objects -> 38x38 map), (e) prints statistics showing how many objects each level detects and which objects would be missed if only a single scale were used. Demonstrate that no single scale catches all objects, but all three together provide complete coverage.
Exercise 3: Build a detection model benchmarking framework. Create a class DetectorBenchmark that: (a) generates a synthetic test set of 20 images with known ground truth (using the DetectionDataset class from Exercise 1 of episode #78, or a simplified version), (b) simulates three "detectors" with different accuracy profiles -- "perfect" (exact ground truth boxes with score 1.0), "noisy" (ground truth boxes with random IoU perturbation between 0.5-0.95 and random scores), and "poor" (50% of objects detected with high noise plus 30% false positives), (c) runs each detector through the full compute_ap evaluation pipeline, (d) prints a comparison table showing mAP50 and mAP75 for each detector, plus per-class AP. Verify that the "perfect" detector gets AP=1.0 and that the "poor" detector scores significantly lower at the stricter IoU=0.75 threshold than at IoU=0.5.