Learn AI Series (#78) - Object Detection (Part 1) - Foundations
What will I learn
- You will learn the leap from classification to localization: not just what, but where;
- sliding window: the brute force approach and why it's too slow;
- region proposals: selective search and its clever shortcuts;
- IoU (Intersection over Union): the fundamental detection metric;
- Non-Maximum Suppression: eliminating duplicate detections;
- the R-CNN family: from R-CNN to Fast R-CNN to Faster R-CNN.
Requirements
- A working modern computer running macOS, Windows or Ubuntu;
- An installed Python 3(.11+) distribution;
- The ambition to learn AI and machine learning.
Difficulty
- Beginner
Curriculum (of the Learn AI Series):
- Learn AI Series (#1) - What Machine Learning Actually Is
- Learn AI Series (#2) - Setting Up Your AI Workbench - Python and NumPy
- Learn AI Series (#3) - Your Data Is Just Numbers - How Machines See the World
- Learn AI Series (#4) - Your First Prediction - No Math, Just Intuition
- Learn AI Series (#5) - Patterns in Data - What "Learning" Actually Looks Like
- Learn AI Series (#6) - From Intuition to Math - Why We Need Formulas
- Learn AI Series (#7) - The Training Loop - See It Work Step by Step
- Learn AI Series (#8) - The Math You Actually Need (Part 1) - Linear Algebra
- Learn AI Series (#9) - The Math You Actually Need (Part 2) - Calculus and Probability
- Learn AI Series (#10) - Your First ML Model - Linear Regression From Scratch
- Learn AI Series (#11) - Making Linear Regression Real
- Learn AI Series (#12) - Classification - Logistic Regression From Scratch
- Learn AI Series (#13) - Evaluation - How to Know If Your Model Actually Works
- Learn AI Series (#14) - Data Preparation - The 80% Nobody Talks About
- Learn AI Series (#15) - Feature Engineering and Selection
- Learn AI Series (#16) - Scikit-Learn - The Standard Library of ML
- Learn AI Series (#17) - Decision Trees - How Machines Make Decisions
- Learn AI Series (#18) - Random Forests - Wisdom of Crowds
- Learn AI Series (#19) - Gradient Boosting - The Kaggle Champion
- Learn AI Series (#20) - Support Vector Machines - Drawing the Perfect Boundary
- Learn AI Series (#21) - Mini Project - Predicting Crypto Market Regimes
- Learn AI Series (#22) - K-Means Clustering - Finding Groups
- Learn AI Series (#23) - Advanced Clustering - Beyond K-Means
- Learn AI Series (#24) - Dimensionality Reduction - PCA
- Learn AI Series (#25) - Advanced Dimensionality Reduction - t-SNE and UMAP
- Learn AI Series (#26) - Anomaly Detection - Finding What Doesn't Belong
- Learn AI Series (#27) - Recommendation Systems - "Users Like You Also Liked..."
- Learn AI Series (#28) - Time Series Fundamentals - When Order Matters
- Learn AI Series (#29) - Time Series Forecasting - Predicting What Comes Next
- Learn AI Series (#30) - Natural Language Processing - Text as Data
- Learn AI Series (#31) - Word Embeddings - Meaning in Numbers
- Learn AI Series (#32) - Bayesian Methods - Thinking in Probabilities
- Learn AI Series (#33) - Ensemble Methods Deep Dive - Stacking and Blending
- Learn AI Series (#34) - ML Engineering - From Notebook to Production
- Learn AI Series (#35) - Data Ethics and Bias in ML
- Learn AI Series (#36) - Mini Project - Complete ML Pipeline
- Learn AI Series (#37) - The Perceptron - Where It All Started
- Learn AI Series (#38) - Neural Networks From Scratch - Forward Pass
- Learn AI Series (#39) - Neural Networks From Scratch - Backpropagation
- Learn AI Series (#40) - Training Neural Networks - Practical Challenges
- Learn AI Series (#41) - Optimization Algorithms - SGD, Momentum, Adam
- Learn AI Series (#42) - PyTorch Fundamentals - Tensors and Autograd
- Learn AI Series (#43) - PyTorch Data and Training
- Learn AI Series (#44) - PyTorch nn.Module - Building Real Networks
- Learn AI Series (#45) - Convolutional Neural Networks - Theory
- Learn AI Series (#46) - CNNs in Practice - Classic to Modern Architectures
- Learn AI Series (#47) - CNN Applications - Detection, Segmentation, Style Transfer
- Learn AI Series (#48) - Recurrent Neural Networks - Sequences
- Learn AI Series (#49) - LSTM and GRU - Solving the Memory Problem
- Learn AI Series (#50) - Sequence-to-Sequence Models
- Learn AI Series (#51) - Attention Mechanisms
- Learn AI Series (#52) - The Transformer Architecture (Part 1)
- Learn AI Series (#53) - The Transformer Architecture (Part 2)
- Learn AI Series (#54) - Vision Transformers
- Learn AI Series (#55) - Generative Adversarial Networks
- Learn AI Series (#56) - Mini Project - Building a Transformer From Scratch
- Learn AI Series (#57) - Language Modeling - Predicting the Next Word
- Learn AI Series (#58) - GPT Architecture - Decoder-Only Transformers
- Learn AI Series (#59) - BERT and Encoder Models
- Learn AI Series (#60) - Training Large Language Models
- Learn AI Series (#61) - Instruction Tuning and Alignment
- Learn AI Series (#62) - Prompt Engineering - Getting the Most from LLMs
- Learn AI Series (#63) - Embeddings and Vector Search
- Learn AI Series (#64) - Retrieval-Augmented Generation (RAG) - Basics
- Learn AI Series (#65) - RAG - Advanced Techniques
- Learn AI Series (#66) - Working with LLM APIs
- Learn AI Series (#67) - Building AI Agents (Part 1) - Foundations
- Learn AI Series (#68) - Building AI Agents (Part 2) - Advanced Patterns
- Learn AI Series (#69) - Fine-Tuning Language Models
- Learn AI Series (#70) - Running Local Models
- Learn AI Series (#71) - Text Generation Techniques
- Learn AI Series (#72) - Tokenization Deep Dive
- Learn AI Series (#73) - LLM Evaluation
- Learn AI Series (#74) - The Hugging Face Ecosystem
- Learn AI Series (#75) - Multimodal Models - Text Meets Vision
- Learn AI Series (#76) - Mini Project - Your Own AI Assistant
- Learn AI Series (#77) - Image Processing Fundamentals
- Learn AI Series (#78) - Object Detection (Part 1) - Foundations (this post)
Learn AI Series (#78) - Object Detection (Part 1) - Foundations
Solutions to Episode #77 Exercises
Exercise 1: Color histogram analyzer.
import numpy as np
import cv2
class HistogramAnalyzer:
"""Analyze and compare color histograms."""
def __init__(self, bins=256):
self.bins = bins
def compute(self, image_path):
img = cv2.imread(image_path)
if img is None:
raise ValueError(f"Cannot load: {image_path}")
rgb = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
histograms = {}
# Per-channel histograms
for i, name in enumerate(["red", "green", "blue"]):
hist = cv2.calcHist(
[rgb], [i], None,
[self.bins], [0, 256]
)
histograms[name] = hist.flatten()
# Combined intensity histogram (grayscale)
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
hist = cv2.calcHist(
[gray], [0], None,
[self.bins], [0, 256]
)
histograms["intensity"] = hist.flatten()
return histograms
def compare(self, path_a, path_b):
hist_a = self.compute(path_a)
hist_b = self.compute(path_b)
results = {}
for channel in ["red", "green", "blue", "intensity"]:
corr = cv2.compareHist(
hist_a[channel].astype(np.float32),
hist_b[channel].astype(np.float32),
cv2.HISTCMP_CORREL
)
results[channel] = corr
results["average"] = np.mean(
[results[c] for c in
["red", "green", "blue", "intensity"]]
)
return results
def report(self, path_a, path_b):
results = self.compare(path_a, path_b)
print(f"Histogram comparison:")
print(f" Red: {results['red']:.4f}")
print(f" Green: {results['green']:.4f}")
print(f" Blue: {results['blue']:.4f}")
print(f" Intensity: {results['intensity']:.4f}")
print(f" Average: {results['average']:.4f}")
if results["average"] > 0.8:
print(" Verdict: very similar")
elif results["average"] > 0.5:
print(" Verdict: somewhat similar")
else:
print(" Verdict: quite different")
The correlation metric ranges from -1 to 1, where 1 means the histograms are identical in shape. Similar scenes score high even if the exact pixels differ. This is the basis of content-based image retrieval -- searching databases by histogram similarity rather than pixel comparison.
Exercise 2: Multi-kernel edge detector.
import numpy as np
import cv2
def multi_kernel_edges(image_path):
"""Apply four edge kernels and combine results."""
img = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
if img is None:
raise ValueError(f"Cannot load: {image_path}")
sobel_x = np.array([
[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]
], dtype=np.float64)
sobel_y = np.array([
[-1, -2, -1], [0, 0, 0], [1, 2, 1]
], dtype=np.float64)
laplacian = np.array([
[0, 1, 0], [1, -4, 1], [0, 1, 0]
], dtype=np.float64)
diagonal = np.array([
[-1, -1, 2], [-1, 2, -1], [2, -1, -1]
], dtype=np.float64)
resp_sx = np.abs(cv2.filter2D(img, cv2.CV_64F, sobel_x))
resp_sy = np.abs(cv2.filter2D(img, cv2.CV_64F, sobel_y))
resp_lap = np.abs(cv2.filter2D(img, cv2.CV_64F, laplacian))
resp_diag = np.abs(cv2.filter2D(img, cv2.CV_64F, diagonal))
combined = np.maximum(resp_sx, resp_sy)
combined = np.maximum(combined, resp_lap)
combined = np.maximum(combined, resp_diag)
combined = np.clip(combined, 0, 255).astype(np.uint8)
canny = cv2.Canny(img, 50, 150)
return combined, canny
Canny produces cleaner single-pixel-wide edges because it includes non-maximum suppression and hysteresis thresholding as built-in post-processing. The raw kernel combination gives thicker, noisier edges but catches diagonal features that Canny might miss.
Exercise 3: Image augmentation pipeline.
import numpy as np
import cv2
import random
class Augmentor:
"""Random augmentation pipeline for training data."""
def __init__(self, seed=None):
self.rng = random.Random(seed)
def random_flip(self, img):
if self.rng.random() < 0.5:
return cv2.flip(img, 1)
return img.copy()
def random_rotate(self, img, max_angle=15):
angle = self.rng.uniform(-max_angle, max_angle)
h, w = img.shape[:2]
center = (w // 2, h // 2)
M = cv2.getRotationMatrix2D(center, angle, 1.0)
return cv2.warpAffine(img, M, (w, h))
def random_brightness(self, img, low=0.7, high=1.3):
factor = self.rng.uniform(low, high)
adjusted = img.astype(np.float32) * factor
return np.clip(adjusted, 0, 255).astype(np.uint8)
def random_crop_resize(self, img, min_area=0.8):
h, w = img.shape[:2]
area_frac = self.rng.uniform(min_area, 1.0)
crop_h = int(h * np.sqrt(area_frac))
crop_w = int(w * np.sqrt(area_frac))
y = self.rng.randint(0, h - crop_h)
x = self.rng.randint(0, w - crop_w)
cropped = img[y:y + crop_h, x:x + crop_w]
return cv2.resize(cropped, (w, h))
def augment(self, img):
img = self.random_flip(img)
img = self.random_rotate(img)
img = self.random_brightness(img)
img = self.random_crop_resize(img)
return img
def generate_batch(self, img, count=10):
results = []
for _ in range(count):
results.append(self.augment(img))
means = [r.mean() for r in results]
stds = [r.std() for r in results]
print(f"Source - mean: {img.mean():.1f}, "
f"std: {img.std():.1f}")
print(f"Augmented ({count} versions):")
print(f" Mean range: {min(means):.1f} - "
f"{max(means):.1f}")
print(f" Std range: {min(stds):.1f} - "
f"{max(stds):.1f}")
return results
If your augmented images have wildly different statistics from the source (mean drops to 10 or std explodes to 200), something in the pipeline is probably broken. Augmentations should change how images look without changing what they are.
On to today's episode
Here we go. Last episode we built up the fundamentals of image processing -- loading pixels, color spaces, convolution kernels, histogram equalization, geometric transforms, the whole preprocessing pipeline. All of that was about getting images ready for models. Now we're going to use those images for something genuinely more ambitious.
Image classification answers "what is this?" We covered that extensively in episodes #45-47 when we built CNNs. A classifier looks at an entire image and outputs a single label: cat, dog, car. But here's the thing -- real images don't contain one thing. A street scene has cars, pedestrians, traffic lights, buildings, bicycles, maybe a stray dog. A classifier just says "street scene" and calls it a day. That's not very useful for a self-driving car that needs to know exactly where each pedestrian is ;-)
Object detection answers both questions simultaneously: what is in the image AND where is it. For every object, the model outputs a class label and a bounding box -- four coordinates that draw a rectangle around it. One image might produce dozens of detections, each with its own label and location.
This is the capability behind autonomous driving, security camera systems, medical imaging (finding tumours on X-rays), warehouse robotics, augmented reality filters, and basically any application where knowing that an object exists isn't enough -- you also need to know where it is.
From classification to localization
Let's be precise about terminology, because these three terms get mixed up constantly:
Classification: input is an image, output is a single class label.
Localization: input is an image, output is a class label AND a bounding box for ONE object.
Detection: input is an image, output is MULTIPLE class labels AND bounding boxes.
A bounding box is defined by four numbers. Two common formats:
(x, y, width, height)-- top-left corner plus dimensions(x1, y1, x2, y2)-- top-left corner and bottom-right corner
Both encode the same information. Different frameworks use different conventions (and mixing them up is a classic source of bugs -- ask me how I know).
For single-object localization, the solution is surprisingly elegant: take a CNN classifier and add a second output head. One head does classification (what is it?), the other does regression (where is it?):
import torch
import torch.nn as nn
from torchvision import models
class LocalizationNet(nn.Module):
def __init__(self, num_classes):
super().__init__()
# Use pretrained ResNet as feature extractor
backbone = models.resnet50(weights="IMAGENET1K_V1")
self.features = nn.Sequential(
*list(backbone.children())[:-1]
)
# Classification head: what is it?
self.classifier = nn.Linear(2048, num_classes)
# Regression head: where is it? (x, y, w, h)
self.box_regressor = nn.Linear(2048, 4)
def forward(self, x):
features = self.features(x).flatten(1)
class_logits = self.classifier(features)
# Sigmoid keeps coordinates in [0, 1] range
# (normalized relative to image dimensions)
box_coords = torch.sigmoid(
self.box_regressor(features)
)
return class_logits, box_coords
This is really just the multi-head pattern we've seen before (episode #44 covered nn.Module design patterns). The backbone extracts features. Two separate linear layers interpret those features differently -- one for classification, one for regression.
The loss function combines both objectives:
def localization_loss(class_pred, class_target,
box_pred, box_target,
lambda_loc=5.0):
cls_loss = nn.CrossEntropyLoss()(
class_pred, class_target
)
loc_loss = nn.SmoothL1Loss()(
box_pred, box_target
)
# Weight localization higher -- getting the box right
# matters more than classification confidence
return cls_loss + lambda_loc * loc_loss
SmoothL1Loss (also called Huber loss) is preferred over plain MSE for bounding box regression because it's less sensitive to outliers. If one training example has an extremely wrong box prediction, L1 loss won't blow up the gradients the way L2 would. That lambda_loc=5.0 weight is a hyperparameter -- you need localization loss to dominate or the network will get lazy and just learn the class without caring about the box position.
But this only works for one object. What if there are three cars and two people? You can't predict a variable number of outputs from a fixed-size network. That's the fundamental detection challenge, and it took the field several generations of architectures to solve it cleanly.
Sliding window: the brute force approach
The most intuitive approach to finding multiple objects: slide a window across the image at every position and every scale, classify each window independently.
import cv2
def sliding_window_detect(image, classifier,
window_sizes, stride=32):
"""Brute-force detection via sliding windows."""
detections = []
h, w = image.shape[:2]
for win_h, win_w in window_sizes:
for y in range(0, h - win_h, stride):
for x in range(0, w - win_w, stride):
# Extract and resize window
window = image[y:y + win_h, x:x + win_w]
resized = cv2.resize(window, (224, 224))
# Run classifier
class_id, confidence = classifier(resized)
if confidence > 0.5:
detections.append({
"box": [x, y, x + win_w,
y + win_h],
"class": class_id,
"score": confidence,
})
return detections
Why is this terrible? Let's count. For a 640x480 image with 5 window sizes and stride 32, you're running the classifier roughly 5 * 15 * 20 = 1,500 times. If each CNN forward pass takes 10ms, that's 15 seconds per image. Real-time detection needs 30+ FPS. So you're off by about three orders of magnitude.
It gets worse. Adjacent windows overlap heavily, so the same object generates dozens of detections at slightly different positions and scales. You end up with a mess of overlapping boxes that somehow need to be merged back into single per-object detections.
The sliding window approach was dominant in the pre-deep-learning era (think HOG + SVM detectors like the Dalal-Triggs pedestrian detector from 2005). It actualy worked reasonably well for single-class detection with hand-crafted features -- but the moment you want to detect 80+ classes at interactive speeds, it falls apart completely.
Region proposals: being smart about where to look
Instead of examining every possible window position, what if we could quickly identify the regions most likely to contain objects? That's the insight behind region proposals.
Selective Search (Uijlings et al., 2013) generates roughly 2,000 region proposals per image using image segmentation heuristics. The algorithm starts with a fine-grained oversegmentation (many tiny regions), then iteratively merges similar regions based on four criteria: color similarity, texture similarity, size compatibility, and shape compatibility.
import cv2
# OpenCV includes selective search
ss = cv2.ximgproc.segmentation \
.createSelectiveSearchSegmentation()
ss.setBaseImage(img)
ss.switchToSelectiveSearchFast()
# Generate region proposals
rects = ss.process()
print(f"Generated {len(rects)} proposals")
# Typically 1000-2000 proposals
# Each rect is (x, y, w, h)
for x, y, w, h in rects[:10]:
cv2.rectangle(
img, (x, y), (x + w, y + h),
(0, 255, 0), 2
)
Selective search reduces 100,000+ possible sliding window positions down to roughly 2,000 high-quality proposals. That's a 50x reduction. Still not real-time (selective search itself takes 1-2 seconds), but it made deep learning-based detection practical when R-CNN appeared in 2014.
The cleverness of selective search is that it uses cheap image-level cues (color gradients, texture patterns) to propose where objects might be, and then you only run the expensive CNN classifier on those proposals. Separation of concerns: fast but imprecise proposal generation, then slow but accurate classification.
IoU: measuring detection quality
So your detector predicts a bounding box. How do you know if it's "correct"? You can't demand pixel-perfect alignment -- even human annotators disagree on exactly where an object boundary is. The field standardized on Intersection over Union (IoU): the area of overlap between predicted and ground-truth boxes, divided by the area of their union.
def compute_iou(box1, box2):
"""IoU between two boxes in [x1, y1, x2, y2] format."""
# Intersection coordinates
x1 = max(box1[0], box2[0])
y1 = max(box1[1], box2[1])
x2 = min(box1[2], box2[2])
y2 = min(box1[3], box2[3])
# Intersection area (0 if boxes don't overlap)
inter_w = max(0, x2 - x1)
inter_h = max(0, y2 - y1)
intersection = inter_w * inter_h
# Individual areas
area1 = ((box1[2] - box1[0])
* (box1[3] - box1[1]))
area2 = ((box2[2] - box2[0])
* (box2[3] - box2[1]))
# Union = total area minus overlap (counted twice)
union = area1 + area2 - intersection
if union == 0:
return 0.0
return intersection / union
IoU = 1.0 means perfect overlap. IoU = 0.0 means no overlap at all. The standard threshold for a "correct" detection is IoU >= 0.5 -- your predicted box overlaps with the ground truth by at least 50%. Some stricter benchmarks use 0.75 or even 0.9 for applications where precise localization matters (medical imaging, for instance).
Let's verify this works:
# Perfect overlap
print(compute_iou(
[100, 100, 200, 200],
[100, 100, 200, 200]
)) # 1.0
# Partial overlap
print(compute_iou(
[100, 100, 200, 200],
[150, 150, 250, 250]
)) # 0.142 -- boxes barely touch
# No overlap
print(compute_iou(
[100, 100, 200, 200],
[300, 300, 400, 400]
)) # 0.0
# Close but not perfect
print(compute_iou(
[100, 100, 200, 200],
[110, 110, 210, 210]
)) # 0.68 -- decent detection
IoU is the universal language of object detection evaluation. Every metric -- mAP (mean Average Precision), AP50, AP75 -- is built on top of IoU. Every paper reports results using IoU thresholds. Every benchmark (PASCAL VOC, COCO, Open Images) uses IoU to decide whether a prediction counts as correct.
Vectorized IoU for batches
In practice, you often need to compute IoU between many pairs of boxes simultaneously (during NMS, during training, during evaluation). Here's a vectorized version using NumPy:
import numpy as np
def batch_iou(boxes_a, boxes_b):
"""Compute IoU between every pair.
boxes_a: (N, 4) array in [x1, y1, x2, y2]
boxes_b: (M, 4) array in [x1, y1, x2, y2]
Returns: (N, M) IoU matrix
"""
boxes_a = np.array(boxes_a, dtype=np.float32)
boxes_b = np.array(boxes_b, dtype=np.float32)
# Intersection
x1 = np.maximum(
boxes_a[:, 0:1], boxes_b[:, 0:1].T
)
y1 = np.maximum(
boxes_a[:, 1:2], boxes_b[:, 1:2].T
)
x2 = np.minimum(
boxes_a[:, 2:3], boxes_b[:, 2:3].T
)
y2 = np.minimum(
boxes_a[:, 3:4], boxes_b[:, 3:4].T
)
inter = (np.maximum(0, x2 - x1)
* np.maximum(0, y2 - y1))
# Areas
area_a = ((boxes_a[:, 2] - boxes_a[:, 0])
* (boxes_a[:, 3] - boxes_a[:, 1]))
area_b = ((boxes_b[:, 2] - boxes_b[:, 0])
* (boxes_b[:, 3] - boxes_b[:, 1]))
union = area_a[:, None] + area_b[None, :] - inter
return inter / np.maximum(union, 1e-6)
# Test: 3 predicted boxes vs 2 ground truth boxes
preds = [[100, 100, 200, 200],
[150, 150, 250, 250],
[300, 300, 400, 400]]
gt = [[105, 95, 205, 205],
[290, 290, 410, 410]]
iou_matrix = batch_iou(preds, gt)
print("IoU matrix (preds x gt):")
print(np.round(iou_matrix, 3))
Broadcasting tricks like boxes_a[:, 0:1] (shape (N, 1)) versus boxes_b[:, 0:1].T (shape (1, M)) let NumPy compute all N*M pairs without Python loops. This matters when N and M are in the thousands.
Non-Maximum Suppression: eliminating duplicates
A detector typically produces many overlapping boxes for the same object. One obvious car might generate 15 bounding boxes at slighty different positions, all with high confidence scores. Non-Maximum Suppression (NMS) selects the best box and removes the rest.
The algorithm is dead simple:
- Sort all detections by confidence score (highest first)
- Take the top detection, add it to the "keep" list
- Remove all remaining detections that overlap with it above the IoU threshold
- Repeat from step 2 with the remaining detections
def non_max_suppression(boxes, scores,
iou_threshold=0.5):
"""Keep the best box, remove overlapping ones."""
if len(boxes) == 0:
return []
# Sort by confidence descending
order = sorted(
range(len(scores)),
key=lambda i: scores[i],
reverse=True
)
keep = []
while order:
# Keep the highest-scoring box
best = order[0]
keep.append(best)
order = order[1:]
# Remove overlapping boxes
remaining = []
for idx in order:
iou = compute_iou(boxes[best], boxes[idx])
if iou < iou_threshold:
remaining.append(idx)
order = remaining
return keep
# Example: three detections, two overlap
boxes = [
[100, 100, 200, 200], # score 0.9
[105, 95, 210, 205], # score 0.85 (overlaps box 0)
[300, 300, 400, 400], # score 0.7 (no overlap)
]
scores = [0.9, 0.85, 0.7]
kept = non_max_suppression(boxes, scores,
iou_threshold=0.5)
print(f"Kept indices: {kept}")
# [0, 2] -- keeps box 0 (highest score) and box 2
# (no overlap). Removes box 1 (overlaps box 0).
NMS has a known weakness: when two objects of the same class genuinely overlap (two people standing close together, two cars parked bumper to bumper), NMS might incorrectly suppress one of them. The algorithm can't distinguish "two overlapping detections of the same object" from "two correct detections of two different objects that happen to be nearby."
Soft-NMS (Bodla et al., 2017) addresses this by gradually reducing the confidence score of overlapping detections instead of hard-removing them:
def soft_nms(boxes, scores, sigma=0.5,
score_threshold=0.01):
"""Soft-NMS: decay scores instead of removing."""
boxes = [list(b) for b in boxes]
scores = list(scores)
order_kept = []
while scores:
max_idx = scores.index(max(scores))
order_kept.append(max_idx)
max_box = boxes[max_idx]
boxes.pop(max_idx)
scores.pop(max_idx)
# Decay remaining scores based on overlap
for i in range(len(scores)):
iou = compute_iou(max_box, boxes[i])
# Gaussian decay: high IoU = big decay
scores[i] *= np.exp(-(iou ** 2) / sigma)
# Remove below-threshold detections
surviving = [
(b, s) for b, s in zip(boxes, scores)
if s >= score_threshold
]
if surviving:
boxes = [x[0] for x in surviving]
scores = [x[1] for x in surviving]
else:
break
return order_kept
The Gaussian decay means that a box with IoU 0.3 (moderate overlap) keeps most of its score, while a box with IoU 0.9 (near-duplicate) gets heavily penalized. This preserves detections of nearby-but-distinct objects while still suppressing redundant boxes. In the COCO benchmark, Soft-NMS consistently improves mAP by 1-2% over standard NMS -- small but meaningful.
The R-CNN family
R-CNN (Girshick et al., 2014) was the breakthrough that married region proposals with deep learning for detection:
- Generate ~2,000 region proposals with selective search
- Warp each proposal to 227x227 pixels
- Run each warped region through a CNN (AlexNet) to extract features
- Classify features with per-class SVMs
- Refine bounding boxes with linear regression
Results were impressive (53.7% mAP on PASCAL VOC 2012), but the speed was brutal: running the CNN 2,000 times per image took about 47 seconds. Not exactly production-ready.
Fast R-CNN (Girshick, 2015) had the crucial insight: stop running the CNN 2,000 times. Run it ONCE on the entire image to get a shared feature map, then extract features for each proposal from that shared map. This is called RoI Pooling (Region of Interest Pooling):
class RoIPool(nn.Module):
"""Extract fixed-size features from
variable-sized regions."""
def __init__(self, output_size=7):
super().__init__()
self.pool = nn.AdaptiveMaxPool2d(output_size)
def forward(self, feature_map, rois):
"""
feature_map: (1, C, H, W) -- shared CNN output
rois: list of [x1, y1, x2, y2] in feature
map coordinates
"""
pooled_features = []
for roi in rois:
x1, y1, x2, y2 = [int(c) for c in roi]
# Crop the region from the feature map
region = feature_map[:, :, y1:y2, x1:x2]
# Pool to fixed size regardless of
# region dimensions
pooled = self.pool(region)
pooled_features.append(pooled)
return torch.cat(pooled_features, dim=0)
The key idea: the CNN backbone (VGG-16, in the original paper) processes the image exactly once. Then for each of the 2,000 proposals, we just do a cheap crop-and-pool on the already-computed feature map. Instead of 2,000 expensive CNN forward passes, we do ONE forward pass plus 2,000 cheap pooling operations. Speed improved from 47 seconds to 0.3 seconds per image -- a 150x speedup.
Fast R-CNN also replaced the separate SVMs with a single softmax classifier and integrated the bounding box regressor into the same network, training the whole thing end-to-end with a multi-task loss. Much cleaner architecture.
Faster R-CNN (Ren et al., 2015) eliminated the last bottleneck: selective search. Instead of an external proposal algorithm, Faster R-CNN introduced the Region Proposal Network (RPN) -- a small neural network that generates proposals directly from the feature map:
class RegionProposalNetwork(nn.Module):
"""Generate object proposals from feature map."""
def __init__(self, in_channels=512, num_anchors=9):
super().__init__()
# Shared convolution
self.conv = nn.Conv2d(
in_channels, 512, 3, padding=1
)
# Two outputs per anchor:
# objectness (object or background)
self.cls_score = nn.Conv2d(
512, num_anchors * 2, 1
)
# Four outputs per anchor:
# bounding box adjustments (dx, dy, dw, dh)
self.bbox_pred = nn.Conv2d(
512, num_anchors * 4, 1
)
def forward(self, feature_map):
x = torch.relu(self.conv(feature_map))
# Is there an object at each position?
objectness = self.cls_score(x)
# How should each anchor box be adjusted?
box_deltas = self.bbox_pred(x)
return objectness, box_deltas
The RPN uses anchor boxes: predefined bounding boxes at each position in the feature map. At every spatial location, you place 9 anchors (3 scales x 3 aspect ratios -- small/medium/large crossed with tall/square/wide). For each anchor, the RPN predicts two things: (1) is there an object here? (binary classification) and (2) how should this anchor be adjusted to better fit the actual object? (four regression values: offsets for center position and scale adjustments for width/height).
Why 9 anchors? Because objects come in different shapes. A standing person is tall and narrow. A car is wide and short. A ball is roughly square. By placing anchors of varying aspect ratios at every position, you guarantee that at least one anchor has reasonable overlap with whatever object might be there. The RPN just needs to figure out which anchor to activate and how to nudge it into better alignment.
The complete Faster R-CNN pipeline:
Image -> CNN backbone -> feature map
|
+-> RPN -> ~300 proposals
|
+-> RoI Pooling (proposals
on feature map)
|
+-> classifier
| + box regressor
|
+-> NMS -> final
detections
All of it end-to-end trainable. About 5 FPS on a 2015-era GPU -- not quite real-time, but a massive improvement over the original R-CNN's 47-second inference time.
The evolution tells a clear story
The R-CNN family's progression is a case study in systematic bottleneck removal:
| Model | Year | Innovation | Speed |
|---|---|---|---|
| R-CNN | 2014 | CNN features for detection | 47 sec/img |
| Fast R-CNN | 2015 | Shared feature map, RoI Pool | 0.3 sec/img |
| Faster R-CNN | 2015 | Learned proposals (RPN) | 0.2 sec/img |
Each generation identified the single biggest bottleneck and removed it. R-CNN's bottleneck was running the CNN per-region -- solved by sharing features. Fast R-CNN's bottleneck was selective search -- solved by learning proposals. This kind of iterative optimization is how most breakthroughs actually happen: not a single radical idea, but a series of targeted improvements.
Detection evaluation: mAP
We discussed IoU for individual detections, but how do you evaluate a detector's overall performance across an entire dataset with thousands of images and tens of thousands of objects? That's where mean Average Precision (mAP) comes in.
def compute_ap(precisions, recalls):
"""Compute Average Precision from
precision-recall curve."""
precisions = [0.0] + list(precisions) + [0.0]
recalls = [0.0] + list(recalls) + [1.0]
# Make precision monotonically decreasing
for i in range(len(precisions) - 2, -1, -1):
precisions[i] = max(
precisions[i], precisions[i + 1]
)
# Integrate under the curve
ap = 0.0
for i in range(1, len(recalls)):
if recalls[i] != recalls[i - 1]:
ap += ((recalls[i] - recalls[i - 1])
* precisions[i])
return ap
def evaluate_detections(predictions, ground_truths,
iou_threshold=0.5):
"""Compute AP for a single class."""
preds = sorted(predictions,
key=lambda p: p["score"],
reverse=True)
total_gt = len(ground_truths)
if total_gt == 0:
return 0.0
matched_gt = set()
tp_list = []
fp_list = []
for pred in preds:
best_iou = 0.0
best_gt_idx = -1
for gt_idx, gt in enumerate(ground_truths):
iou = compute_iou(pred["box"], gt["box"])
if iou > best_iou:
best_iou = iou
best_gt_idx = gt_idx
if (best_iou >= iou_threshold
and best_gt_idx not in matched_gt):
tp_list.append(1)
fp_list.append(0)
matched_gt.add(best_gt_idx)
else:
tp_list.append(0)
fp_list.append(1)
tp_cumsum = np.cumsum(tp_list)
fp_cumsum = np.cumsum(fp_list)
precisions = tp_cumsum / (tp_cumsum + fp_cumsum)
recalls = tp_cumsum / total_gt
return compute_ap(precisions, recalls)
# Test with sample data
predictions = [
{"box": [100, 100, 200, 200], "score": 0.95},
{"box": [102, 98, 205, 203], "score": 0.90},
{"box": [300, 300, 400, 400], "score": 0.80},
{"box": [50, 50, 100, 100], "score": 0.70},
]
ground_truths = [
{"box": [105, 95, 205, 205]},
{"box": [295, 295, 405, 405]},
]
ap = evaluate_detections(predictions, ground_truths)
print(f"Average Precision: {ap:.3f}")
The logic: sort all detections by confidence. Go through them one by one. If a detection matches an unmatched ground truth box (IoU >= threshold), it's a true positive. Otherwise it's a false positive. At each step, compute precision (TP / (TP + FP)) and recall (TP / total GT). The area under the resulting precision-recall curve is the Average Precision for that class.
mAP is just the mean AP across all object classes. PASCAL VOC uses AP@50 (IoU threshold of 0.5). COCO uses the average of AP at IoU thresholds from 0.5 to 0.95 in steps of 0.05, which is stricter and rewards precise localization.
Samengevat
- Object detection predicts both class labels and bounding box coordinates for every object in an image -- going beyond classification's "what" to also answer "where";
- sliding window is intuitive but impractical: checking every position at every scale means thousands of CNN forward passes per image, way too slow for real applications;
- region proposals (selective search) reduce 100K+ possible windows to ~2,000 candidates using cheap image segmentation heuristics, making CNN-based detection feasable;
- IoU (Intersection over Union) is the universal metric for measuring bounding box overlap -- IoU >= 0.5 is the standard threshold for a "correct" detection, and every detection benchmark builds on it;
- Non-Maximum Suppression keeps the highest-confidence box and removes overlapping duplicates, with Soft-NMS as the gentler alternative for crowded scenes;
- R-CNN -> Fast R-CNN -> Faster R-CNN progressively removed bottlenecks: shared feature computation (150x speedup), end-to-end training, and learned region proposals via the RPN;
- mAP (mean Average Precision) evaluates detector performance across entire datasets by computing the area under per-class precision-recall curves.
The R-CNN family established the fundamental concepts that every modern detector builds on: anchor boxes, feature sharing, multi-task losses, and the proposal-then-classify pipeline. But they're still "two-stage" detectors -- first propose regions, then classify them. In the next episode we'll look at the architectures that skip the proposal stage entirely and predict everything in a single pass. That's where real-time detection finaly became practical.
Exercises
Exercise 1: Build a detection dataset simulator. Create a class DetectionDataset that: (a) generates synthetic images (300x300, black background) containing 1-5 randomly placed colored rectangles (red, green, blue -- each color is a class), with random sizes between 30x30 and 100x100, no overlapping objects, (b) for each image, records the ground truth annotations as a list of {"class": str, "box": [x1, y1, x2, y2]} dicts, (c) implements get_sample(index) returning (image, annotations), (d) implements visualize(index) that draws the ground truth boxes with class labels on the image and saves to a file, (e) implements statistics() that prints: total images, average objects per image, class distribution, average object area as percentage of image area. Generate 50 images and print the statistics.
Exercise 2: Build a complete NMS benchmarking suite. Create a class NMSBenchmark that: (a) generates test scenarios with controlled overlap patterns -- "easy" (5 boxes, no overlap), "moderate" (15 boxes, some clusters), "hard" (50 boxes, heavy overlap including genuine close objects), (b) implements both standard NMS and Soft-NMS, (c) for each scenario, runs both algorithms and records: how many boxes were kept, which ground-truth objects were correctly detected (IoU >= 0.5 with at least one kept box), and which were missed, (d) prints a comparison table showing the number of kept boxes, detection rate, and false positive rate for each algorithm and scenario. Demonstrate that Soft-NMS preserves more correct detections in the "hard" scenario with closely-placed objects.
Exercise 3: Implement a simplified Faster R-CNN forward pass simulator. Create a class SimplifiedFasterRCNN that: (a) generates a fake 7x7 feature map (simulating CNN backbone output), (b) places 9 anchor boxes (3 scales x 3 aspect ratios) at each of the 49 feature map positions (441 anchors total), (c) implements an RPN that scores each anchor with a random objectness score and generates random box offsets, (d) selects the top 300 proposals by objectness score, (e) applies NMS to reduce to ~50 proposals, (f) for each surviving proposal, assigns a random class label and confidence score (simulating the detection head), (g) applies final NMS per class. Print statistics at each stage: total anchors, proposals after top-K, proposals after NMS, final detections. The goal is to understand the data flow through the pipeline, not to train anything.