Learn AI Series (#78) - Object Detection (Part 1) - Foundations

What will I learn

You will learn the leap from classification to localization: not just what, but where;
sliding window: the brute force approach and why it's too slow;
region proposals: selective search and its clever shortcuts;
IoU (Intersection over Union): the fundamental detection metric;
Non-Maximum Suppression: eliminating duplicate detections;
the R-CNN family: from R-CNN to Fast R-CNN to Faster R-CNN.

Requirements

A working modern computer running macOS, Windows or Ubuntu;
An installed Python 3(.11+) distribution;
The ambition to learn AI and machine learning.

Difficulty

Beginner

Curriculum (of the `Learn AI Series`):

Learn AI Series (#78) - Object Detection (Part 1) - Foundations

Solutions to Episode #77 Exercises

Exercise 1: Color histogram analyzer.

import numpy as np
import cv2


class HistogramAnalyzer:
    """Analyze and compare color histograms."""

    def __init__(self, bins=256):
        self.bins = bins

    def compute(self, image_path):
        img = cv2.imread(image_path)
        if img is None:
            raise ValueError(f"Cannot load: {image_path}")

        rgb = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
        histograms = {}

        # Per-channel histograms
        for i, name in enumerate(["red", "green", "blue"]):
            hist = cv2.calcHist(
                [rgb], [i], None,
                [self.bins], [0, 256]
            )
            histograms[name] = hist.flatten()

        # Combined intensity histogram (grayscale)
        gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
        hist = cv2.calcHist(
            [gray], [0], None,
            [self.bins], [0, 256]
        )
        histograms["intensity"] = hist.flatten()

        return histograms

    def compare(self, path_a, path_b):
        hist_a = self.compute(path_a)
        hist_b = self.compute(path_b)

        results = {}
        for channel in ["red", "green", "blue", "intensity"]:
            corr = cv2.compareHist(
                hist_a[channel].astype(np.float32),
                hist_b[channel].astype(np.float32),
                cv2.HISTCMP_CORREL
            )
            results[channel] = corr

        results["average"] = np.mean(
            [results[c] for c in
             ["red", "green", "blue", "intensity"]]
        )
        return results

    def report(self, path_a, path_b):
        results = self.compare(path_a, path_b)
        print(f"Histogram comparison:")
        print(f"  Red:       {results['red']:.4f}")
        print(f"  Green:     {results['green']:.4f}")
        print(f"  Blue:      {results['blue']:.4f}")
        print(f"  Intensity: {results['intensity']:.4f}")
        print(f"  Average:   {results['average']:.4f}")
        if results["average"] > 0.8:
            print("  Verdict: very similar")
        elif results["average"] > 0.5:
            print("  Verdict: somewhat similar")
        else:
            print("  Verdict: quite different")

The correlation metric ranges from -1 to 1, where 1 means the histograms are identical in shape. Similar scenes score high even if the exact pixels differ. This is the basis of content-based image retrieval -- searching databases by histogram similarity rather than pixel comparison.

Exercise 2: Multi-kernel edge detector.

import numpy as np
import cv2


def multi_kernel_edges(image_path):
    """Apply four edge kernels and combine results."""
    img = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    if img is None:
        raise ValueError(f"Cannot load: {image_path}")

    sobel_x = np.array([
        [-1, 0, 1], [-2, 0, 2], [-1, 0, 1]
    ], dtype=np.float64)

    sobel_y = np.array([
        [-1, -2, -1], [0, 0, 0], [1, 2, 1]
    ], dtype=np.float64)

    laplacian = np.array([
        [0, 1, 0], [1, -4, 1], [0, 1, 0]
    ], dtype=np.float64)

    diagonal = np.array([
        [-1, -1, 2], [-1, 2, -1], [2, -1, -1]
    ], dtype=np.float64)

    resp_sx = np.abs(cv2.filter2D(img, cv2.CV_64F, sobel_x))
    resp_sy = np.abs(cv2.filter2D(img, cv2.CV_64F, sobel_y))
    resp_lap = np.abs(cv2.filter2D(img, cv2.CV_64F, laplacian))
    resp_diag = np.abs(cv2.filter2D(img, cv2.CV_64F, diagonal))

    combined = np.maximum(resp_sx, resp_sy)
    combined = np.maximum(combined, resp_lap)
    combined = np.maximum(combined, resp_diag)
    combined = np.clip(combined, 0, 255).astype(np.uint8)

    canny = cv2.Canny(img, 50, 150)
    return combined, canny

Canny produces cleaner single-pixel-wide edges because it includes non-maximum suppression and hysteresis thresholding as built-in post-processing. The raw kernel combination gives thicker, noisier edges but catches diagonal features that Canny might miss.

Exercise 3: Image augmentation pipeline.

import numpy as np
import cv2
import random


class Augmentor:
    """Random augmentation pipeline for training data."""

    def __init__(self, seed=None):
        self.rng = random.Random(seed)

    def random_flip(self, img):
        if self.rng.random() < 0.5:
            return cv2.flip(img, 1)
        return img.copy()

    def random_rotate(self, img, max_angle=15):
        angle = self.rng.uniform(-max_angle, max_angle)
        h, w = img.shape[:2]
        center = (w // 2, h // 2)
        M = cv2.getRotationMatrix2D(center, angle, 1.0)
        return cv2.warpAffine(img, M, (w, h))

    def random_brightness(self, img, low=0.7, high=1.3):
        factor = self.rng.uniform(low, high)
        adjusted = img.astype(np.float32) * factor
        return np.clip(adjusted, 0, 255).astype(np.uint8)

    def random_crop_resize(self, img, min_area=0.8):
        h, w = img.shape[:2]
        area_frac = self.rng.uniform(min_area, 1.0)
        crop_h = int(h * np.sqrt(area_frac))
        crop_w = int(w * np.sqrt(area_frac))
        y = self.rng.randint(0, h - crop_h)
        x = self.rng.randint(0, w - crop_w)
        cropped = img[y:y + crop_h, x:x + crop_w]
        return cv2.resize(cropped, (w, h))

    def augment(self, img):
        img = self.random_flip(img)
        img = self.random_rotate(img)
        img = self.random_brightness(img)
        img = self.random_crop_resize(img)
        return img

    def generate_batch(self, img, count=10):
        results = []
        for _ in range(count):
            results.append(self.augment(img))
        means = [r.mean() for r in results]
        stds = [r.std() for r in results]
        print(f"Source  - mean: {img.mean():.1f}, "
              f"std: {img.std():.1f}")
        print(f"Augmented ({count} versions):")
        print(f"  Mean range: {min(means):.1f} - "
              f"{max(means):.1f}")
        print(f"  Std range:  {min(stds):.1f} - "
              f"{max(stds):.1f}")
        return results

If your augmented images have wildly different statistics from the source (mean drops to 10 or std explodes to 200), something in the pipeline is probably broken. Augmentations should change how images look without changing what they are.

On to today's episode

Here we go. Last episode we built up the fundamentals of image processing -- loading pixels, color spaces, convolution kernels, histogram equalization, geometric transforms, the whole preprocessing pipeline. All of that was about getting images ready for models. Now we're going to use those images for something genuinely more ambitious.

Image classification answers "what is this?" We covered that extensively in episodes #45-47 when we built CNNs. A classifier looks at an entire image and outputs a single label: cat, dog, car. But here's the thing -- real images don't contain one thing. A street scene has cars, pedestrians, traffic lights, buildings, bicycles, maybe a stray dog. A classifier just says "street scene" and calls it a day. That's not very useful for a self-driving car that needs to know exactly where each pedestrian is ;-)

Object detection answers both questions simultaneously: what is in the image AND where is it. For every object, the model outputs a class label and a bounding box -- four coordinates that draw a rectangle around it. One image might produce dozens of detections, each with its own label and location.

This is the capability behind autonomous driving, security camera systems, medical imaging (finding tumours on X-rays), warehouse robotics, augmented reality filters, and basically any application where knowing that an object exists isn't enough -- you also need to know where it is.

From classification to localization

Let's be precise about terminology, because these three terms get mixed up constantly:

Classification: input is an image, output is a single class label.
Localization: input is an image, output is a class label AND a bounding box for ONE object.
Detection: input is an image, output is MULTIPLE class labels AND bounding boxes.

A bounding box is defined by four numbers. Two common formats:

(x, y, width, height) -- top-left corner plus dimensions
(x1, y1, x2, y2) -- top-left corner and bottom-right corner

Both encode the same information. Different frameworks use different conventions (and mixing them up is a classic source of bugs -- ask me how I know).

For single-object localization, the solution is surprisingly elegant: take a CNN classifier and add a second output head. One head does classification (what is it?), the other does regression (where is it?):

import torch
import torch.nn as nn
from torchvision import models


class LocalizationNet(nn.Module):
    def __init__(self, num_classes):
        super().__init__()
        # Use pretrained ResNet as feature extractor
        backbone = models.resnet50(weights="IMAGENET1K_V1")
        self.features = nn.Sequential(
            *list(backbone.children())[:-1]
        )

        # Classification head: what is it?
        self.classifier = nn.Linear(2048, num_classes)

        # Regression head: where is it? (x, y, w, h)
        self.box_regressor = nn.Linear(2048, 4)

    def forward(self, x):
        features = self.features(x).flatten(1)
        class_logits = self.classifier(features)
        # Sigmoid keeps coordinates in [0, 1] range
        # (normalized relative to image dimensions)
        box_coords = torch.sigmoid(
            self.box_regressor(features)
        )
        return class_logits, box_coords

This is really just the multi-head pattern we've seen before (episode #44 covered nn.Module design patterns). The backbone extracts features. Two separate linear layers interpret those features differently -- one for classification, one for regression.

The loss function combines both objectives:

def localization_loss(class_pred, class_target,
                      box_pred, box_target,
                      lambda_loc=5.0):
    cls_loss = nn.CrossEntropyLoss()(
        class_pred, class_target
    )
    loc_loss = nn.SmoothL1Loss()(
        box_pred, box_target
    )
    # Weight localization higher -- getting the box right
    # matters more than classification confidence
    return cls_loss + lambda_loc * loc_loss

SmoothL1Loss (also called Huber loss) is preferred over plain MSE for bounding box regression because it's less sensitive to outliers. If one training example has an extremely wrong box prediction, L1 loss won't blow up the gradients the way L2 would. That lambda_loc=5.0 weight is a hyperparameter -- you need localization loss to dominate or the network will get lazy and just learn the class without caring about the box position.

But this only works for one object. What if there are three cars and two people? You can't predict a variable number of outputs from a fixed-size network. That's the fundamental detection challenge, and it took the field several generations of architectures to solve it cleanly.

Sliding window: the brute force approach

The most intuitive approach to finding multiple objects: slide a window across the image at every position and every scale, classify each window independently.

import cv2


def sliding_window_detect(image, classifier,
                           window_sizes, stride=32):
    """Brute-force detection via sliding windows."""
    detections = []
    h, w = image.shape[:2]

    for win_h, win_w in window_sizes:
        for y in range(0, h - win_h, stride):
            for x in range(0, w - win_w, stride):
                # Extract and resize window
                window = image[y:y + win_h, x:x + win_w]
                resized = cv2.resize(window, (224, 224))

                # Run classifier
                class_id, confidence = classifier(resized)

                if confidence > 0.5:
                    detections.append({
                        "box": [x, y, x + win_w,
                                y + win_h],
                        "class": class_id,
                        "score": confidence,
                    })

    return detections

Why is this terrible? Let's count. For a 640x480 image with 5 window sizes and stride 32, you're running the classifier roughly 5 * 15 * 20 = 1,500 times. If each CNN forward pass takes 10ms, that's 15 seconds per image. Real-time detection needs 30+ FPS. So you're off by about three orders of magnitude.

It gets worse. Adjacent windows overlap heavily, so the same object generates dozens of detections at slightly different positions and scales. You end up with a mess of overlapping boxes that somehow need to be merged back into single per-object detections.

The sliding window approach was dominant in the pre-deep-learning era (think HOG + SVM detectors like the Dalal-Triggs pedestrian detector from 2005). It actualy worked reasonably well for single-class detection with hand-crafted features -- but the moment you want to detect 80+ classes at interactive speeds, it falls apart completely.

Region proposals: being smart about where to look

Instead of examining every possible window position, what if we could quickly identify the regions most likely to contain objects? That's the insight behind region proposals.

Selective Search (Uijlings et al., 2013) generates roughly 2,000 region proposals per image using image segmentation heuristics. The algorithm starts with a fine-grained oversegmentation (many tiny regions), then iteratively merges similar regions based on four criteria: color similarity, texture similarity, size compatibility, and shape compatibility.

import cv2

# OpenCV includes selective search
ss = cv2.ximgproc.segmentation \
    .createSelectiveSearchSegmentation()
ss.setBaseImage(img)
ss.switchToSelectiveSearchFast()

# Generate region proposals
rects = ss.process()
print(f"Generated {len(rects)} proposals")
# Typically 1000-2000 proposals

# Each rect is (x, y, w, h)
for x, y, w, h in rects[:10]:
    cv2.rectangle(
        img, (x, y), (x + w, y + h),
        (0, 255, 0), 2
    )

Selective search reduces 100,000+ possible sliding window positions down to roughly 2,000 high-quality proposals. That's a 50x reduction. Still not real-time (selective search itself takes 1-2 seconds), but it made deep learning-based detection practical when R-CNN appeared in 2014.

The cleverness of selective search is that it uses cheap image-level cues (color gradients, texture patterns) to propose where objects might be, and then you only run the expensive CNN classifier on those proposals. Separation of concerns: fast but imprecise proposal generation, then slow but accurate classification.

IoU: measuring detection quality

So your detector predicts a bounding box. How do you know if it's "correct"? You can't demand pixel-perfect alignment -- even human annotators disagree on exactly where an object boundary is. The field standardized on Intersection over Union (IoU): the area of overlap between predicted and ground-truth boxes, divided by the area of their union.

def compute_iou(box1, box2):
    """IoU between two boxes in [x1, y1, x2, y2] format."""
    # Intersection coordinates
    x1 = max(box1[0], box2[0])
    y1 = max(box1[1], box2[1])
    x2 = min(box1[2], box2[2])
    y2 = min(box1[3], box2[3])

    # Intersection area (0 if boxes don't overlap)
    inter_w = max(0, x2 - x1)
    inter_h = max(0, y2 - y1)
    intersection = inter_w * inter_h

    # Individual areas
    area1 = ((box1[2] - box1[0])
             * (box1[3] - box1[1]))
    area2 = ((box2[2] - box2[0])
             * (box2[3] - box2[1]))

    # Union = total area minus overlap (counted twice)
    union = area1 + area2 - intersection

    if union == 0:
        return 0.0
    return intersection / union

IoU = 1.0 means perfect overlap. IoU = 0.0 means no overlap at all. The standard threshold for a "correct" detection is IoU >= 0.5 -- your predicted box overlaps with the ground truth by at least 50%. Some stricter benchmarks use 0.75 or even 0.9 for applications where precise localization matters (medical imaging, for instance).

Let's verify this works:

# Perfect overlap
print(compute_iou(
    [100, 100, 200, 200],
    [100, 100, 200, 200]
))  # 1.0

# Partial overlap
print(compute_iou(
    [100, 100, 200, 200],
    [150, 150, 250, 250]
))  # 0.142 -- boxes barely touch

# No overlap
print(compute_iou(
    [100, 100, 200, 200],
    [300, 300, 400, 400]
))  # 0.0

# Close but not perfect
print(compute_iou(
    [100, 100, 200, 200],
    [110, 110, 210, 210]
))  # 0.68 -- decent detection

IoU is the universal language of object detection evaluation. Every metric -- mAP (mean Average Precision), AP50, AP75 -- is built on top of IoU. Every paper reports results using IoU thresholds. Every benchmark (PASCAL VOC, COCO, Open Images) uses IoU to decide whether a prediction counts as correct.

Vectorized IoU for batches

In practice, you often need to compute IoU between many pairs of boxes simultaneously (during NMS, during training, during evaluation). Here's a vectorized version using NumPy:

import numpy as np


def batch_iou(boxes_a, boxes_b):
    """Compute IoU between every pair.

    boxes_a: (N, 4) array in [x1, y1, x2, y2]
    boxes_b: (M, 4) array in [x1, y1, x2, y2]
    Returns: (N, M) IoU matrix
    """
    boxes_a = np.array(boxes_a, dtype=np.float32)
    boxes_b = np.array(boxes_b, dtype=np.float32)

    # Intersection
    x1 = np.maximum(
        boxes_a[:, 0:1], boxes_b[:, 0:1].T
    )
    y1 = np.maximum(
        boxes_a[:, 1:2], boxes_b[:, 1:2].T
    )
    x2 = np.minimum(
        boxes_a[:, 2:3], boxes_b[:, 2:3].T
    )
    y2 = np.minimum(
        boxes_a[:, 3:4], boxes_b[:, 3:4].T
    )

    inter = (np.maximum(0, x2 - x1)
             * np.maximum(0, y2 - y1))

    # Areas
    area_a = ((boxes_a[:, 2] - boxes_a[:, 0])
              * (boxes_a[:, 3] - boxes_a[:, 1]))
    area_b = ((boxes_b[:, 2] - boxes_b[:, 0])
              * (boxes_b[:, 3] - boxes_b[:, 1]))

    union = area_a[:, None] + area_b[None, :] - inter

    return inter / np.maximum(union, 1e-6)


# Test: 3 predicted boxes vs 2 ground truth boxes
preds = [[100, 100, 200, 200],
         [150, 150, 250, 250],
         [300, 300, 400, 400]]
gt = [[105, 95, 205, 205],
      [290, 290, 410, 410]]

iou_matrix = batch_iou(preds, gt)
print("IoU matrix (preds x gt):")
print(np.round(iou_matrix, 3))

Broadcasting tricks like boxes_a[:, 0:1] (shape (N, 1)) versus boxes_b[:, 0:1].T (shape (1, M)) let NumPy compute all N*M pairs without Python loops. This matters when N and M are in the thousands.

Non-Maximum Suppression: eliminating duplicates

A detector typically produces many overlapping boxes for the same object. One obvious car might generate 15 bounding boxes at slighty different positions, all with high confidence scores. Non-Maximum Suppression (NMS) selects the best box and removes the rest.

The algorithm is dead simple:

Sort all detections by confidence score (highest first)
Take the top detection, add it to the "keep" list
Remove all remaining detections that overlap with it above the IoU threshold
Repeat from step 2 with the remaining detections

def non_max_suppression(boxes, scores,
                         iou_threshold=0.5):
    """Keep the best box, remove overlapping ones."""
    if len(boxes) == 0:
        return []

    # Sort by confidence descending
    order = sorted(
        range(len(scores)),
        key=lambda i: scores[i],
        reverse=True
    )
    keep = []

    while order:
        # Keep the highest-scoring box
        best = order[0]
        keep.append(best)
        order = order[1:]

        # Remove overlapping boxes
        remaining = []
        for idx in order:
            iou = compute_iou(boxes[best], boxes[idx])
            if iou < iou_threshold:
                remaining.append(idx)
        order = remaining

    return keep


# Example: three detections, two overlap
boxes = [
    [100, 100, 200, 200],  # score 0.9
    [105, 95, 210, 205],   # score 0.85 (overlaps box 0)
    [300, 300, 400, 400],  # score 0.7 (no overlap)
]
scores = [0.9, 0.85, 0.7]

kept = non_max_suppression(boxes, scores,
                            iou_threshold=0.5)
print(f"Kept indices: {kept}")
# [0, 2] -- keeps box 0 (highest score) and box 2
# (no overlap). Removes box 1 (overlaps box 0).

NMS has a known weakness: when two objects of the same class genuinely overlap (two people standing close together, two cars parked bumper to bumper), NMS might incorrectly suppress one of them. The algorithm can't distinguish "two overlapping detections of the same object" from "two correct detections of two different objects that happen to be nearby."

Soft-NMS (Bodla et al., 2017) addresses this by gradually reducing the confidence score of overlapping detections instead of hard-removing them:

def soft_nms(boxes, scores, sigma=0.5,
             score_threshold=0.01):
    """Soft-NMS: decay scores instead of removing."""
    boxes = [list(b) for b in boxes]
    scores = list(scores)
    order_kept = []

    while scores:
        max_idx = scores.index(max(scores))
        order_kept.append(max_idx)
        max_box = boxes[max_idx]

        boxes.pop(max_idx)
        scores.pop(max_idx)

        # Decay remaining scores based on overlap
        for i in range(len(scores)):
            iou = compute_iou(max_box, boxes[i])
            # Gaussian decay: high IoU = big decay
            scores[i] *= np.exp(-(iou ** 2) / sigma)

        # Remove below-threshold detections
        surviving = [
            (b, s) for b, s in zip(boxes, scores)
            if s >= score_threshold
        ]
        if surviving:
            boxes = [x[0] for x in surviving]
            scores = [x[1] for x in surviving]
        else:
            break

    return order_kept

The Gaussian decay means that a box with IoU 0.3 (moderate overlap) keeps most of its score, while a box with IoU 0.9 (near-duplicate) gets heavily penalized. This preserves detections of nearby-but-distinct objects while still suppressing redundant boxes. In the COCO benchmark, Soft-NMS consistently improves mAP by 1-2% over standard NMS -- small but meaningful.

The R-CNN family

R-CNN (Girshick et al., 2014) was the breakthrough that married region proposals with deep learning for detection:

Generate ~2,000 region proposals with selective search
Warp each proposal to 227x227 pixels
Run each warped region through a CNN (AlexNet) to extract features
Classify features with per-class SVMs
Refine bounding boxes with linear regression

Results were impressive (53.7% mAP on PASCAL VOC 2012), but the speed was brutal: running the CNN 2,000 times per image took about 47 seconds. Not exactly production-ready.

Fast R-CNN (Girshick, 2015) had the crucial insight: stop running the CNN 2,000 times. Run it ONCE on the entire image to get a shared feature map, then extract features for each proposal from that shared map. This is called RoI Pooling (Region of Interest Pooling):

class RoIPool(nn.Module):
    """Extract fixed-size features from
    variable-sized regions."""

    def __init__(self, output_size=7):
        super().__init__()
        self.pool = nn.AdaptiveMaxPool2d(output_size)

    def forward(self, feature_map, rois):
        """
        feature_map: (1, C, H, W) -- shared CNN output
        rois: list of [x1, y1, x2, y2] in feature
              map coordinates
        """
        pooled_features = []
        for roi in rois:
            x1, y1, x2, y2 = [int(c) for c in roi]
            # Crop the region from the feature map
            region = feature_map[:, :, y1:y2, x1:x2]
            # Pool to fixed size regardless of
            # region dimensions
            pooled = self.pool(region)
            pooled_features.append(pooled)
        return torch.cat(pooled_features, dim=0)

The key idea: the CNN backbone (VGG-16, in the original paper) processes the image exactly once. Then for each of the 2,000 proposals, we just do a cheap crop-and-pool on the already-computed feature map. Instead of 2,000 expensive CNN forward passes, we do ONE forward pass plus 2,000 cheap pooling operations. Speed improved from 47 seconds to 0.3 seconds per image -- a 150x speedup.

Fast R-CNN also replaced the separate SVMs with a single softmax classifier and integrated the bounding box regressor into the same network, training the whole thing end-to-end with a multi-task loss. Much cleaner architecture.

Faster R-CNN (Ren et al., 2015) eliminated the last bottleneck: selective search. Instead of an external proposal algorithm, Faster R-CNN introduced the Region Proposal Network (RPN) -- a small neural network that generates proposals directly from the feature map:

class RegionProposalNetwork(nn.Module):
    """Generate object proposals from feature map."""

    def __init__(self, in_channels=512, num_anchors=9):
        super().__init__()
        # Shared convolution
        self.conv = nn.Conv2d(
            in_channels, 512, 3, padding=1
        )
        # Two outputs per anchor:
        # objectness (object or background)
        self.cls_score = nn.Conv2d(
            512, num_anchors * 2, 1
        )
        # Four outputs per anchor:
        # bounding box adjustments (dx, dy, dw, dh)
        self.bbox_pred = nn.Conv2d(
            512, num_anchors * 4, 1
        )

    def forward(self, feature_map):
        x = torch.relu(self.conv(feature_map))
        # Is there an object at each position?
        objectness = self.cls_score(x)
        # How should each anchor box be adjusted?
        box_deltas = self.bbox_pred(x)
        return objectness, box_deltas

The RPN uses anchor boxes: predefined bounding boxes at each position in the feature map. At every spatial location, you place 9 anchors (3 scales x 3 aspect ratios -- small/medium/large crossed with tall/square/wide). For each anchor, the RPN predicts two things: (1) is there an object here? (binary classification) and (2) how should this anchor be adjusted to better fit the actual object? (four regression values: offsets for center position and scale adjustments for width/height).

Why 9 anchors? Because objects come in different shapes. A standing person is tall and narrow. A car is wide and short. A ball is roughly square. By placing anchors of varying aspect ratios at every position, you guarantee that at least one anchor has reasonable overlap with whatever object might be there. The RPN just needs to figure out which anchor to activate and how to nudge it into better alignment.

The complete Faster R-CNN pipeline:

Image -> CNN backbone -> feature map
                            |
                            +-> RPN -> ~300 proposals
                            |
                            +-> RoI Pooling (proposals
                                 on feature map)
                                    |
                                    +-> classifier
                                    |   + box regressor
                                    |
                                    +-> NMS -> final
                                         detections

All of it end-to-end trainable. About 5 FPS on a 2015-era GPU -- not quite real-time, but a massive improvement over the original R-CNN's 47-second inference time.

The evolution tells a clear story

The R-CNN family's progression is a case study in systematic bottleneck removal:

Model	Year	Innovation	Speed
R-CNN	2014	CNN features for detection	47 sec/img
Fast R-CNN	2015	Shared feature map, RoI Pool	0.3 sec/img
Faster R-CNN	2015	Learned proposals (RPN)	0.2 sec/img

Each generation identified the single biggest bottleneck and removed it. R-CNN's bottleneck was running the CNN per-region -- solved by sharing features. Fast R-CNN's bottleneck was selective search -- solved by learning proposals. This kind of iterative optimization is how most breakthroughs actually happen: not a single radical idea, but a series of targeted improvements.

Detection evaluation: mAP

We discussed IoU for individual detections, but how do you evaluate a detector's overall performance across an entire dataset with thousands of images and tens of thousands of objects? That's where mean Average Precision (mAP) comes in.

def compute_ap(precisions, recalls):
    """Compute Average Precision from
    precision-recall curve."""
    precisions = [0.0] + list(precisions) + [0.0]
    recalls = [0.0] + list(recalls) + [1.0]

    # Make precision monotonically decreasing
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(
            precisions[i], precisions[i + 1]
        )

    # Integrate under the curve
    ap = 0.0
    for i in range(1, len(recalls)):
        if recalls[i] != recalls[i - 1]:
            ap += ((recalls[i] - recalls[i - 1])
                   * precisions[i])

    return ap


def evaluate_detections(predictions, ground_truths,
                         iou_threshold=0.5):
    """Compute AP for a single class."""
    preds = sorted(predictions,
                    key=lambda p: p["score"],
                    reverse=True)

    total_gt = len(ground_truths)
    if total_gt == 0:
        return 0.0

    matched_gt = set()
    tp_list = []
    fp_list = []

    for pred in preds:
        best_iou = 0.0
        best_gt_idx = -1

        for gt_idx, gt in enumerate(ground_truths):
            iou = compute_iou(pred["box"], gt["box"])
            if iou > best_iou:
                best_iou = iou
                best_gt_idx = gt_idx

        if (best_iou >= iou_threshold
                and best_gt_idx not in matched_gt):
            tp_list.append(1)
            fp_list.append(0)
            matched_gt.add(best_gt_idx)
        else:
            tp_list.append(0)
            fp_list.append(1)

    tp_cumsum = np.cumsum(tp_list)
    fp_cumsum = np.cumsum(fp_list)

    precisions = tp_cumsum / (tp_cumsum + fp_cumsum)
    recalls = tp_cumsum / total_gt

    return compute_ap(precisions, recalls)


# Test with sample data
predictions = [
    {"box": [100, 100, 200, 200], "score": 0.95},
    {"box": [102, 98, 205, 203], "score": 0.90},
    {"box": [300, 300, 400, 400], "score": 0.80},
    {"box": [50, 50, 100, 100], "score": 0.70},
]
ground_truths = [
    {"box": [105, 95, 205, 205]},
    {"box": [295, 295, 405, 405]},
]

ap = evaluate_detections(predictions, ground_truths)
print(f"Average Precision: {ap:.3f}")

The logic: sort all detections by confidence. Go through them one by one. If a detection matches an unmatched ground truth box (IoU >= threshold), it's a true positive. Otherwise it's a false positive. At each step, compute precision (TP / (TP + FP)) and recall (TP / total GT). The area under the resulting precision-recall curve is the Average Precision for that class.

mAP is just the mean AP across all object classes. PASCAL VOC uses AP@50 (IoU threshold of 0.5). COCO uses the average of AP at IoU thresholds from 0.5 to 0.95 in steps of 0.05, which is stricter and rewards precise localization.

Samengevat

Object detection predicts both class labels and bounding box coordinates for every object in an image -- going beyond classification's "what" to also answer "where";
sliding window is intuitive but impractical: checking every position at every scale means thousands of CNN forward passes per image, way too slow for real applications;
region proposals (selective search) reduce 100K+ possible windows to ~2,000 candidates using cheap image segmentation heuristics, making CNN-based detection feasable;
IoU (Intersection over Union) is the universal metric for measuring bounding box overlap -- IoU >= 0.5 is the standard threshold for a "correct" detection, and every detection benchmark builds on it;
Non-Maximum Suppression keeps the highest-confidence box and removes overlapping duplicates, with Soft-NMS as the gentler alternative for crowded scenes;
R-CNN -> Fast R-CNN -> Faster R-CNN progressively removed bottlenecks: shared feature computation (150x speedup), end-to-end training, and learned region proposals via the RPN;
mAP (mean Average Precision) evaluates detector performance across entire datasets by computing the area under per-class precision-recall curves.

The R-CNN family established the fundamental concepts that every modern detector builds on: anchor boxes, feature sharing, multi-task losses, and the proposal-then-classify pipeline. But they're still "two-stage" detectors -- first propose regions, then classify them. In the next episode we'll look at the architectures that skip the proposal stage entirely and predict everything in a single pass. That's where real-time detection finaly became practical.

Exercises

Exercise 1: Build a detection dataset simulator. Create a class DetectionDataset that: (a) generates synthetic images (300x300, black background) containing 1-5 randomly placed colored rectangles (red, green, blue -- each color is a class), with random sizes between 30x30 and 100x100, no overlapping objects, (b) for each image, records the ground truth annotations as a list of {"class": str, "box": [x1, y1, x2, y2]} dicts, (c) implements get_sample(index) returning (image, annotations), (d) implements visualize(index) that draws the ground truth boxes with class labels on the image and saves to a file, (e) implements statistics() that prints: total images, average objects per image, class distribution, average object area as percentage of image area. Generate 50 images and print the statistics.

Exercise 2: Build a complete NMS benchmarking suite. Create a class NMSBenchmark that: (a) generates test scenarios with controlled overlap patterns -- "easy" (5 boxes, no overlap), "moderate" (15 boxes, some clusters), "hard" (50 boxes, heavy overlap including genuine close objects), (b) implements both standard NMS and Soft-NMS, (c) for each scenario, runs both algorithms and records: how many boxes were kept, which ground-truth objects were correctly detected (IoU >= 0.5 with at least one kept box), and which were missed, (d) prints a comparison table showing the number of kept boxes, detection rate, and false positive rate for each algorithm and scenario. Demonstrate that Soft-NMS preserves more correct detections in the "hard" scenario with closely-placed objects.

Exercise 3: Implement a simplified Faster R-CNN forward pass simulator. Create a class SimplifiedFasterRCNN that: (a) generates a fake 7x7 feature map (simulating CNN backbone output), (b) places 9 anchor boxes (3 scales x 3 aspect ratios) at each of the 49 feature map positions (441 anchors total), (c) implements an RPN that scores each anchor with a random objectness score and generates random box offsets, (d) selects the top 300 proposals by objectness score, (e) applies NMS to reduce to ~50 proposals, (f) for each surviving proposal, assigns a random class label and confidence score (simulating the detection head), (g) applies final NMS per class. Print statistics at each stage: total anchors, proposals after top-K, proposals after NMS, final detections. The goal is to understand the data flow through the pipeline, not to train anything.

De groeten!

Hive account@scipio

Learn AI Series (#78) - Object Detection (Part 1) - Foundations

Learn AI Series (#78) - Object Detection (Part 1) - Foundations

What will I learn

Requirements

Difficulty

Curriculum (of the Learn AI Series):

Learn AI Series (#78) - Object Detection (Part 1) - Foundations

Solutions to Episode #77 Exercises

On to today's episode

From classification to localization

Sliding window: the brute force approach

Region proposals: being smart about where to look

IoU: measuring detection quality

Vectorized IoU for batches

Non-Maximum Suppression: eliminating duplicates

The R-CNN family

The evolution tells a clear story

Detection evaluation: mAP

Samengevat

Exercises

De groeten!

Curriculum (of the `Learn AI Series`):