Learn AI Series (#77) - Image Processing Fundamentals

What will I learn

You will learn how digital images are represented as numerical arrays: pixels, channels, and data types;
color spaces beyond RGB: HSV for color detection, LAB for perceptual operations, grayscale for speed;
convolution as image filtering: blur, sharpen, and edge detection with hand-crafted kernels;
histogram equalization and adaptive contrast enhancement with CLAHE;
geometric transformations: resize, crop, rotate, flip, and perspective warping;
building a complete image preprocessing pipeline for deep learning models.

Requirements

A working modern computer running macOS, Windows or Ubuntu;
An installed Python 3(.11+) distribution;
The ambition to learn AI and machine learning.

Difficulty

Beginner

Curriculum (of the `Learn AI Series`):

Learn AI Series (#77) - Image Processing Fundamentals

Welcome to Arc 5: Computer Vision. We've spent the last twenty episodes deep in LLM territory -- language models, transformers, RAG, agents, evaluation, the whole works. And in episode #76 we capped that arc off by building a complete AI assistant that ties it all together. Now we're shifting focus. From text to pixels.

Here's the thing though -- we haven't been completely ignoring images. Back in episodes #45-47 we built CNNs and used them for classification, detection, and style transfer. In episode #54 we covered Vision Transformers. And in #75 we looked at multimodal models that bridge text and vision. But in all of those episodes we sort of... assumed the images were already preprocessed and ready to go. We loaded them, resized them, maybe normalized them, and moved on.

That's backwards. In actual computer vision work, image preprocessing is where quite a lot of the battle is won or lost. A model trained on well-preprocessed images typically outperforms the same architecture trained on raw, noisy, inconsistently-sized data. So before we go deeper into detection, segmentation, and generative vision models in the coming episodes, we need to build a proper foundation. Starting from the literal pixels.

Digital images: just arrays of numbers

Way back in episode #3 we established that all data is numbers. Images are no exception. A grayscale image of size 480x640 is a 2D NumPy array with shape (480, 640). Each element is a pixel intensity value: 0 means black, 255 means white, everything in between is a shade of gray.

A color image adds a third dimension: channels. An RGB image of size 480x640 has shape (480, 640, 3) -- one channel each for red, green, and blue. Each pixel is a triplet like (142, 87, 213), meaning "this much red, this much green, this much blue, mix them together."

import numpy as np
import cv2

# Load an image from disk
img = cv2.imread("photo.jpg")
print(f"Shape: {img.shape}")       # (480, 640, 3) = H x W x C
print(f"Dtype: {img.dtype}")       # uint8 -- values 0 to 255
print(f"Size in bytes: {img.nbytes}")  # 480 * 640 * 3 = 921,600
print(f"Pixel at (100, 200): {img[100, 200]}")  # [B, G, R] array

# Create a grayscale gradient from scratch
gradient = np.zeros((256, 256), dtype=np.uint8)
for i in range(256):
    gradient[i, :] = i  # each row is a different brightness

# Create a color image from scratch
color_img = np.zeros((200, 300, 3), dtype=np.uint8)
color_img[:, :100] = [255, 0, 0]    # left third: blue (BGR!)
color_img[:, 100:200] = [0, 255, 0]  # middle third: green
color_img[:, 200:] = [0, 0, 255]    # right third: red

Now, here is an important OpenCV quirk that trips up literally everyone at some point: OpenCV loads images in BGR order, not RGB. This is a historical artifact from when BGR was common in camera hardware. It means that if you load an image with cv2.imread and display it with matplotlib (which expects RGB), the colors will be swapped -- reds become blues and vice versa. When passing images to PyTorch or any other framework that expects RGB, you need to convert:

# OpenCV BGR to standard RGB
rgb_img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)

# Or skip the problem entirely: use PIL for loading
from PIL import Image
pil_img = Image.open("photo.jpg")   # loads as RGB directly
img_array = np.array(pil_img)       # (H, W, 3) in RGB order

I've wasted hours debugging color issues before realizing the BGR/RGB mismatch was the culprit. Consider this your warning ;-)
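For 3-channel images, the BGR/RGB swap is nothing more than reversing the channel axis, which you can do with plain NumPy slicing. A minimal sketch on a hand-made two-pixel image (the variable names are just for illustration):

```python
import numpy as np

# A 1x2 image in BGR order: one pure-blue pixel, one pure-red pixel
bgr = np.array([[[255, 0, 0], [0, 0, 255]]], dtype=np.uint8)

# Reversing the last axis swaps B and R and leaves G in place
rgb = bgr[..., ::-1]

print(rgb[0, 0])  # the blue pixel, now written as (R, G, B)
print(rgb[0, 1])  # the red pixel, now written as (R, G, B)

# The slice is a view with a negative stride; make a contiguous
# copy if a downstream library insists on contiguous memory
rgb_contiguous = np.ascontiguousarray(rgb)
```

Seeing it this way also makes clear that the conversion is lossless: apply it twice and you are back where you started.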

Color spaces: beyond RGB

RGB is intuitive (we can reason about "more red" or "less blue") but it's not always the best representation for computer vision tasks. Different color spaces separate different kinds of information, and choosing the right one for your task can simplify everything.

HSV -- Hue, Saturation, Value

HSV separates what color it is (hue) from how vivid (saturation) and how bright (value). For 8-bit images the hue channel runs from 0 to 179 in OpenCV (degrees divided by two so the value fits in a byte, not 0 to 360 -- another quirk), saturation and value from 0 to 255.

Why HSV matters: suppose you want to find all the red objects in a scene. In RGB, "red" means high R, low G, low B -- but a dark red, a bright red, and a slightly orange red have very different RGB values. In HSV, all reds cluster around the same hue value regardless of lighting. You just filter on hue:

hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV)

# Red wraps around hue 0/180 in OpenCV, so we need two ranges
lower_red1 = np.array([0, 100, 80])
upper_red1 = np.array([10, 255, 255])
lower_red2 = np.array([170, 100, 80])
upper_red2 = np.array([180, 255, 255])

mask1 = cv2.inRange(hsv, lower_red1, upper_red1)
mask2 = cv2.inRange(hsv, lower_red2, upper_red2)
red_mask = mask1 | mask2
# red_mask: 255 where red, 0 elsewhere

# Count red pixels
red_pixels = np.count_nonzero(red_mask)
total_pixels = red_mask.shape[0] * red_mask.shape[1]
print(f"Red coverage: {red_pixels / total_pixels:.1%}")
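To see why the hue channel groups reds together, and why red needs two ranges, here is a small sketch using Python's standard-library `colorsys` module instead of OpenCV. The helper `opencv_hue` and the sample colors are my own; the only assumption is that OpenCV's 8-bit hue equals the hue angle in degrees divided by two.

```python
import colorsys

def opencv_hue(r, g, b):
    """Hue of an 8-bit RGB color on OpenCV's half-degree scale."""
    h, _s, _v = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
    return h * 180.0

samples = {
    "dark red": (120, 0, 0),
    "bright red": (255, 0, 0),
    "orange-ish red": (255, 40, 0),
    "pink-ish red": (255, 0, 40),
}
hues = {name: opencv_hue(*rgb) for name, rgb in samples.items()}
for name, hue in hues.items():
    print(f"{name:>15}: hue = {hue:6.1f}")
```

Dark and bright red share the same hue, the orange-ish red sits just above 0, and the pink-ish red lands just below 180. That last one is the wraparound the two `inRange` calls are covering.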

LAB -- Perceptually uniform

LAB (also called CIELAB) splits an image into L (lightness), A (green-to-red axis), and B (blue-to-yellow axis). The key property is that it's designed to be (approximately) perceptually uniform: a numerical distance of 10 in LAB space looks like roughly the same amount of visual difference no matter where in the color space you are. That's NOT true for RGB (a change of 10 in the blue channel looks very different at low brightness vs high brightness).

LAB is useful for any operation where you need to reason about "how different do these colors look to a human" -- things like color-based clustering, image quality metrics, and contrast enhancement.
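The simplest way to put a number on "how different do these colors look" is the CIE76 color difference, which is just the Euclidean distance between two LAB triplets. The sketch below assumes the pixels come from an 8-bit OpenCV image, where L is stored scaled to 0-255 and A/B are offset by 128; the two pixel values are made up, and both helper names are mine.

```python
import math

def unscale_opencv_lab(l8, a8, b8):
    """Map OpenCV's 8-bit LAB encoding back to CIELAB units.
    For uint8 images OpenCV stores L*255/100, a+128, b+128."""
    return (l8 * 100.0 / 255.0, a8 - 128.0, b8 - 128.0)

def delta_e76(lab1, lab2):
    """CIE76 color difference: Euclidean distance in CIELAB."""
    return math.dist(lab1, lab2)

# Hypothetical 8-bit LAB pixels, as they would come out of cv2.cvtColor
pixel_a = unscale_opencv_lab(128, 128, 128)
pixel_b = unscale_opencv_lab(128, 131, 132)
difference = delta_e76(pixel_a, pixel_b)
print(f"Delta E: {difference:.1f}")
```

CIE76 is the oldest and crudest of the Delta E formulas (later ones such as CIEDE2000 correct for remaining non-uniformities), but it is often good enough for clustering or "are these two swatches basically the same" checks.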

Grayscale -- When color doesn't matter

For many tasks -- edge detection, text recognition, feature matching -- color is irrelevant. Converting to grayscale reduces your data from 3 channels to 1, which cuts memory to a third and speeds up most per-pixel operations by roughly the same factor:

gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
lab = cv2.cvtColor(img, cv2.COLOR_BGR2LAB)

# Grayscale conversion uses a weighted average, not a simple mean:
# gray = 0.299*R + 0.587*G + 0.114*B
# These weights match human perception (we're most sensitive to green)
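The weighted average is easy to reproduce with NumPy, which makes the effect of the weights visible. This is a sketch of the formula in the comment above, not OpenCV's exact implementation (rounding may differ by a level here and there); `to_gray` and `BGR_WEIGHTS` are illustrative names.

```python
import numpy as np

# Luma weights in BGR order, matching OpenCV's channel layout
BGR_WEIGHTS = np.array([0.114, 0.587, 0.299])

def to_gray(bgr_img):
    """Weighted channel sum, rounded back to uint8."""
    gray = bgr_img.astype(np.float64) @ BGR_WEIGHTS
    return np.rint(gray).astype(np.uint8)

# One pixel each of pure blue, green, red, and white (BGR order)
swatches = np.array([[[255, 0, 0], [0, 255, 0],
                      [0, 0, 255], [255, 255, 255]]], dtype=np.uint8)
gray_swatches = to_gray(swatches)
print(gray_swatches)
```

Pure green comes out far brighter than pure blue even though both have one channel at 255, which is exactly the perceptual weighting at work.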

Choosing the right color space for the job is one of those things that separates a thoughtful pipeline from a naive one. For color-based detection: HSV. For perceptual operations (contrast, color distance): LAB. For speed and when color doesn't matter: grayscale. For everything else: RGB (or BGR if you're stuck in OpenCV land).

Convolution as image filtering

We covered convolution mathematically in episode #45 when we built CNNs. The same operation -- sliding a small matrix (the kernel) across the image and computing weighted sums -- was used in image processing for decades before CNNs took over. CNNs just learn which kernels to use automatically. Today we're designing them by hand.

def convolve2d(image, kernel):
    """Manual 2D convolution to understand the mechanics."""
    h, w = image.shape[:2]
    kh, kw = kernel.shape
    pad_h, pad_w = kh // 2, kw // 2

    # Pad edges using reflection (mirror the boundary pixels)
    padded = np.pad(image, ((pad_h, pad_h), (pad_w, pad_w)),
                    mode='reflect')
    output = np.zeros_like(image, dtype=np.float64)

    for i in range(h):
        for j in range(w):
            region = padded[i:i + kh, j:j + kw]
            output[i, j] = np.sum(region * kernel)

    return output

This is deliberately slow (pure Python loops over every pixel) but it shows exactly what happens: at each position, extract a region the size of the kernel, multiply element-wise, sum. The kernel values determine the effect.
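If you want the same mechanics without the Python loops, NumPy's `sliding_window_view` can build all the neighborhoods at once. This is a sketch under a couple of assumptions: a 2D (grayscale) image, an odd-sized kernel, and NumPy 1.20 or newer. Like the loop version, it does not flip the kernel, so strictly speaking it computes a cross-correlation; for symmetric kernels the result is identical.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def filter2d_fast(image, kernel):
    """Same sliding weighted sum as the loop version, vectorized.
    Assumes a 2D image and an odd-sized kernel."""
    kh, kw = kernel.shape
    padded = np.pad(image.astype(np.float64),
                    ((kh // 2, kh // 2), (kw // 2, kw // 2)),
                    mode="reflect")
    # windows[i, j] is the (kh, kw) neighborhood centered on pixel (i, j)
    windows = sliding_window_view(padded, (kh, kw))
    return np.einsum("ijkl,kl->ij", windows, kernel)

image = np.arange(16, dtype=np.float64).reshape(4, 4)
identity = np.zeros((3, 3))
identity[1, 1] = 1.0
box = np.ones((3, 3)) / 9.0

print(filter2d_fast(image, identity))   # unchanged image
print(filter2d_fast(image, box)[1, 1])  # mean of the 3x3 block around (1, 1)
```

The identity kernel (a single 1 in the center) is a handy sanity check for any filtering code: if the output differs from the input, the indexing is off.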

Different kernels produce fundamentally different results:

# Box blur: average neighboring pixels (smoothing)
blur_kernel = np.ones((5, 5)) / 25.0

# Sharpen: enhance edges and fine details
sharpen_kernel = np.array([
    [ 0, -1,  0],
    [-1,  5, -1],
    [ 0, -1,  0]
], dtype=np.float64)

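# Sobel X: gradient along x (left-to-right change),
# so it responds to VERTICAL edges
sobel_x = np.array([
    [-1, 0, 1],
    [-2, 0, 2],
    [-1, 0, 1]
], dtype=np.float64)

# Sobel Y: gradient along y (top-to-bottom change),
# so it responds to HORIZONTAL edges
sobel_y = np.array([
    [-1, -2, -1],
    [ 0,  0,  0],
    [ 1,  2,  1]
], dtype=np.float64)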

# Apply using OpenCV (orders of magnitude faster than our loop)
blurred = cv2.filter2D(gray, -1, blur_kernel)
sharpened = cv2.filter2D(gray, -1, sharpen_kernel)
edges_x = cv2.filter2D(gray, cv2.CV_64F, sobel_x)
edges_y = cv2.filter2D(gray, cv2.CV_64F, sobel_y)

# Combine edge directions for overall edge magnitude
edge_magnitude = np.sqrt(edges_x**2 + edges_y**2)
edge_magnitude = np.clip(edge_magnitude, 0, 255).astype(np.uint8)

Why does this matter for deep learning? Because the first layers of a trained CNN learn edge-detection kernels that look remarkably similar to Sobel filters. The network independently discovers through backpropagation what image processing engineers designed by hand in the 1960s. That's not a coincidence -- edges are the most informative low-level features in an image.
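It's easy to mix up which Sobel kernel reacts to which edge orientation, so here is a quick check on a synthetic image. `sobel_x` measures change along the x direction, which means it fires on a vertical edge, while `sobel_y` stays silent. The `respond` helper is a throwaway written for this test and skips padding entirely.

```python
import numpy as np

sobel_x = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=np.float64)
sobel_y = np.array([[-1, -2, -1], [0, 0, 0], [1, 2, 1]], dtype=np.float64)

def respond(image, kernel):
    """Kernel response at every interior pixel (no padding)."""
    h, w = image.shape
    out = np.zeros((h - 2, w - 2))
    for i in range(h - 2):
        for j in range(w - 2):
            out[i, j] = np.sum(image[i:i + 3, j:j + 3] * kernel)
    return out

# Synthetic test image: dark left half, bright right half,
# i.e. a single vertical edge down the middle
step = np.zeros((6, 6))
step[:, 3:] = 255.0

max_x = np.abs(respond(step, sobel_x)).max()
max_y = np.abs(respond(step, sobel_y)).max()
print(max_x)  # strong response to the vertical edge
print(max_y)  # no response at all
```

Transpose the test image (`step.T`) and the roles swap, which is a nice way to convince yourself the kernels are each other's rotations.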

Gaussian blur

The box blur above applies equal weight to all neighbors. Gaussian blur weights center pixels more than edge pixels, following a bell curve distribution. This produces a more natural smoothing effect:

# Gaussian blur: kernel size must be odd, sigma controls spread
blurred = cv2.GaussianBlur(img, (5, 5), sigmaX=1.0)
very_blurred = cv2.GaussianBlur(img, (15, 15), sigmaX=5.0)

# Bilateral filter: blurs while preserving edges
# (slow but powerful for noise reduction)
denoised = cv2.bilateralFilter(img, d=9,
                                sigmaColor=75,
                                sigmaSpace=75)

The bilateral filter is worth highlighting. Regular Gaussian blur smooths everything -- edges included. The bilateral filter adds a second condition: only average pixels that are also similar in color. This preserves sharp edges while smoothing flat regions. It's computationally expensive but produces dramatically better results for noise reduction.
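Coming back to the plain Gaussian blur for a moment: the "bell curve" weighting is easy to build by hand, which shows what the kernel actually looks like. This is a minimal sketch that samples the Gaussian at integer offsets and normalizes it; OpenCV may compute its kernel coefficients slightly differently, so treat the numbers as illustrative.

```python
import numpy as np

def gaussian_kernel(size, sigma):
    """2D Gaussian kernel as the outer product of a normalized 1D bell curve."""
    offsets = np.arange(size) - (size - 1) / 2.0
    g1d = np.exp(-(offsets ** 2) / (2.0 * sigma ** 2))
    g1d /= g1d.sum()   # weights sum to 1, so overall brightness is preserved
    return np.outer(g1d, g1d)

kernel = gaussian_kernel(5, sigma=1.0)
print(np.round(kernel, 3))
print(f"sum = {kernel.sum():.6f}")
```

The center weight is the largest and the corners are the smallest, in contrast to the box blur where all 25 entries are equal. The outer-product construction also hints at why Gaussian blur is cheap: it can be applied as two 1D passes instead of one 2D pass.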

Canny edge detection

Canny is the go-to edge detection algorithm. It combines Gaussian blur (noise reduction), gradient computation (finding edges), non-maximum suppression (thinning edges to single-pixel width), and hysteresis thresholding (connecting strong edges to weak ones) into a single pipeline:

# Canny edge detection
edges = cv2.Canny(gray, threshold1=50, threshold2=150)
# Result: binary image, 255 at edges, 0 elsewhere

# The two thresholds control sensitivity:
# - Below threshold1: definitely NOT an edge
# - Above threshold2: definitely an edge
# - Between: only an edge if connected to a strong edge
# This hysteresis approach produces cleaner results than a single threshold

Having said that, tuning those two thresholds is something of an art. Too low and you get noise edges everywhere. Too high and you miss real structure. A common strategy is to compute the median pixel intensity and set thresholds relative to it:

median_val = np.median(gray)
lower = int(max(0, 0.7 * median_val))
upper = int(min(255, 1.3 * median_val))
edges = cv2.Canny(gray, lower, upper)
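The hysteresis rule behind the two thresholds is easier to grasp in one dimension. The toy function below is my own simplification, not how OpenCV implements it: it works on a single row of made-up gradient magnitudes and treats only immediate left/right neighbors as "connected".

```python
def hysteresis_1d(magnitudes, low, high):
    """Keep strong responses, plus weak ones that touch a kept neighbor."""
    keep = [m >= high for m in magnitudes]
    changed = True
    while changed:  # repeat until no more weak pixels get attached
        changed = False
        for i, m in enumerate(magnitudes):
            if keep[i] or m < low:
                continue
            left = i > 0 and keep[i - 1]
            right = i + 1 < len(magnitudes) and keep[i + 1]
            if left or right:
                keep[i] = True
                changed = True
    return keep

# Made-up gradient magnitudes along one row of pixels
row = [20, 80, 200, 90, 30, 100, 40]
kept = hysteresis_1d(row, low=50, high=150)
print(kept)
```

The values 80 and 90 survive because they sit next to the strong 200. The 100 is above the low threshold too, but it is isolated, so it gets dropped. A single threshold at 50 would have kept it as noise; a single threshold at 150 would have thrown away the 80 and 90.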

Histogram equalization and contrast

An image histogram shows the distribution of pixel intensities. A dark underexposed photo has most of its pixels clustered near 0. A washed-out overexposed photo has pixels bunched around 200-255. Histogram equalization remaps the distribution so it spans the full 0-255 range, improving contrast automatically.

# Basic histogram equalization on grayscale
equalized = cv2.equalizeHist(gray)

# Compute histograms to see the difference
hist_before = cv2.calcHist([gray], [0], None, [256], [0, 256])
hist_after = cv2.calcHist([equalized], [0], None, [256], [0, 256])
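Under the hood, equalization is a lookup table built from the cumulative histogram. Here is a NumPy sketch of that idea on a tiny low-contrast image. It follows the textbook CDF formula; `cv2.equalizeHist` works along the same lines, but I haven't matched its rounding exactly, so consider this an approximation.

```python
import numpy as np

def equalize(gray):
    """Histogram equalization via the cumulative distribution (CDF)."""
    hist = np.bincount(gray.ravel(), minlength=256)
    cdf = hist.cumsum()
    cdf_min = cdf[hist > 0][0]   # CDF at the darkest value present
    total = gray.size
    if total == cdf_min:         # constant image: nothing to stretch
        return gray.copy()
    lut = (cdf - cdf_min) * 255.0 / (total - cdf_min)
    lut = np.clip(np.rint(lut), 0, 255).astype(np.uint8)
    return lut[gray]             # remap every pixel through the table

# Low-contrast toy image: only four adjacent gray levels
flat = np.array([[100, 100, 101, 101],
                 [102, 102, 103, 103]], dtype=np.uint8)
stretched = equalize(flat)
print(stretched)
```

Four gray levels that were one step apart end up spread evenly across the full 0-255 range. The ordering of the pixels is preserved; only the spacing changes.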

For color images, you can't just equalize each RGB channel independently -- that creates weird color shifts because the channels are correlated. Instead, convert to LAB and equalize only the L (lightness) channel:

lab = cv2.cvtColor(img, cv2.COLOR_BGR2LAB)
lab[:, :, 0] = cv2.equalizeHist(lab[:, :, 0])
equalized_color = cv2.cvtColor(lab, cv2.COLOR_LAB2BGR)

CLAHE: adaptive histogram equalization

Global equalization has a problem: it applies the same transformation everywhere. A photo with a bright window and a dark corner needs different adjustments in different regions. CLAHE (Contrast Limited Adaptive Histogram Equalization) solves this by dividing the image into tiles and equalizing each tile independently:

clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))

# Apply to grayscale
enhanced_gray = clahe.apply(gray)

# Apply to color via LAB
lab = cv2.cvtColor(img, cv2.COLOR_BGR2LAB)
lab[:, :, 0] = clahe.apply(lab[:, :, 0])
enhanced_color = cv2.cvtColor(lab, cv2.COLOR_LAB2BGR)

The clipLimit parameter controls how much contrast enhancement is allowed per tile. Higher values mean more contrast but also more noise amplification. The tileGridSize controls how many tiles the image is divided into -- (8, 8) means 64 tiles, each equalized independently (with interpolation at tile borders so you don't see seams).

CLAHE is used extensively in medical imaging where lighting conditions are variable and subtle details in dark regions matter. It's also standard in autonomous driving pipelines where the camera might see bright sky and dark shadows in the same frame.
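The "contrast limited" part of CLAHE boils down to clipping each tile's histogram before equalizing it. The sketch below shows just that step on a toy 4-bin histogram. It is a simplification: the limit here is an absolute count, whereas OpenCV (as far as I understand it) scales `clipLimit` by the tile size internally, and real implementations distribute the leftover counts a bit more carefully.

```python
import numpy as np

def clip_and_redistribute(hist, clip_limit):
    """Cap each bin at clip_limit and spread the excess evenly."""
    hist = np.asarray(hist, dtype=np.float64)
    clipped = np.minimum(hist, clip_limit)
    excess = hist.sum() - clipped.sum()
    return clipped + excess / hist.size

# Toy 4-bin histogram with one dominant bin
tile_hist = [10, 0, 2, 0]
limited = clip_and_redistribute(tile_hist, clip_limit=4)
print(limited)
```

The dominant bin shrinks and the total count stays the same, so the cumulative distribution becomes less steep. A less steep CDF means a gentler remapping, which is why a lower clip limit gives less contrast boost and less noise amplification.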

Image transformations

Resizing, cropping, rotating, and warping are the workhorses of every image pipeline. Every model expects a fixed input size. Every dataset has images of different dimensions. Transformations bridge that gap.

# Resize to exact dimensions
resized = cv2.resize(img, (224, 224))

# Resize by scale factor
half = cv2.resize(img, None, fx=0.5, fy=0.5)
double = cv2.resize(img, None, fx=2.0, fy=2.0)

# Resize with high-quality interpolation
resized_hq = cv2.resize(img, (224, 224),
                          interpolation=cv2.INTER_LANCZOS4)

# Crop is just NumPy slicing
cropped = img[50:300, 100:400]  # rows 50-299, cols 100-399 (end index is exclusive)

# Rotate around center
h, w = img.shape[:2]
center = (w // 2, h // 2)
matrix = cv2.getRotationMatrix2D(center, angle=30, scale=1.0)
rotated = cv2.warpAffine(img, matrix, (w, h))

# Flip
flipped_h = cv2.flip(img, 1)   # horizontal mirror
flipped_v = cv2.flip(img, 0)   # vertical mirror
flipped_both = cv2.flip(img, -1)  # both axes
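The rotation matrix is less mysterious than it looks. The sketch below rebuilds the 2x3 matrix in NumPy using the layout given in the OpenCV documentation for `getRotationMatrix2D`, and pushes two points through it. The helper names and the 640x480 center are my own choices for the example.

```python
import math
import numpy as np

def rotation_matrix(center, angle_deg, scale=1.0):
    """2x3 affine matrix for rotating about `center` (x, y).
    Positive angles rotate counter-clockwise on screen."""
    cx, cy = center
    alpha = scale * math.cos(math.radians(angle_deg))
    beta = scale * math.sin(math.radians(angle_deg))
    return np.array([
        [alpha,  beta, (1 - alpha) * cx - beta * cy],
        [-beta, alpha, beta * cx + (1 - alpha) * cy],
    ])

def apply_affine(matrix, point):
    """Map one (x, y) point through a 2x3 affine matrix."""
    x, y = point
    return matrix @ np.array([x, y, 1.0])

M = rotation_matrix(center=(320, 240), angle_deg=90)
print(apply_affine(M, (320, 240)))  # the center stays put
print(apply_affine(M, (321, 240)))  # one pixel right of center moves up
```

Note that "up" means a smaller y value, because image coordinates have y growing downward. That flipped axis is why the signs in the matrix look different from the rotation matrix in a math textbook.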

Interpolation matters more than most people realize when resizing. Here's a quick reference:

INTER_NEAREST: fastest, but pixelated. Only use for label masks (where you don't want to blend values)
INTER_LINEAR: good default for upscaling, fast
INTER_AREA: best for downscaling (preserves information by averaging)
INTER_LANCZOS4: highest quality for upscaling, slower

A common mistake is using INTER_LINEAR (the default) for downscaling. When shrinking an image significantly, INTER_AREA produces noticeably better results because it averages all the source pixels that map to each destination pixel rather than just sampling a few.
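A checkerboard makes the difference between sampling and averaging painfully obvious. The NumPy sketch below mimics the two strategies for an exact 2x downscale: picking every second pixel (roughly what nearest-neighbor does) versus averaging each 2x2 block (which is what area interpolation amounts to for integer factors). It doesn't call OpenCV, so it illustrates the principle rather than the library's exact output.

```python
import numpy as np

# 4x4 checkerboard of alternating black (0) and white (255) pixels
rows, cols = np.indices((4, 4))
checker = (((rows + cols) % 2) * 255).astype(np.float64)

# Sampling-style downscale by 2: keep every second pixel, discard the rest
nearest = checker[::2, ::2]

# Area-style downscale by 2: average each 2x2 block
area = checker.reshape(2, 2, 2, 2).mean(axis=(1, 3))

print(nearest)  # every sample lands on the same color
print(area)     # mid-gray, the true average
```

The sampled version claims the image is solid black, which is simply wrong. The averaged version reports mid-gray, which is what you'd see if you stepped back from the checkerboard. On real photos the same effect shows up as jagged edges and moire patterns.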

Perspective transforms

Sometimes you need to correct perspective distortion -- like straightening a photo of a document taken at an angle. This requires a 4-point perspective transform:

# Source points (corners of the tilted document in the photo)
src_pts = np.float32([
    [56, 65], [368, 52], [28, 387], [389, 390]
])
# Destination points (where we want those corners to end up)
dst_pts = np.float32([
    [0, 0], [300, 0], [0, 400], [300, 400]
])

M = cv2.getPerspectiveTransform(src_pts, dst_pts)
straightened = cv2.warpPerspective(img, M, (300, 400))

This is the same math behind every "scan document with phone camera" app. Detect the document corners, compute the perspective transform, warp to a flat rectangle. Simple in principle, and the OpenCV implementation is very efficient.
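For the curious: the 3x3 matrix can be found by solving a small linear system, two equations per point pair with the bottom-right entry fixed to 1. The sketch below does that with `np.linalg.solve` and then checks that each source corner lands on its destination. It is a plain textbook formulation written for this example, not a copy of OpenCV's code.

```python
import numpy as np

def perspective_matrix(src, dst):
    """Solve for the 3x3 homography mapping 4 src points onto 4 dst points."""
    A, b = [], []
    for (x, y), (u, v) in zip(src, dst):
        A.append([x, y, 1, 0, 0, 0, -u * x, -u * y])
        A.append([0, 0, 0, x, y, 1, -v * x, -v * y])
        b.extend([u, v])
    h = np.linalg.solve(np.array(A, dtype=np.float64),
                        np.array(b, dtype=np.float64))
    return np.append(h, 1.0).reshape(3, 3)

def warp_point(M, point):
    """Apply a homography to one (x, y) point."""
    x, y, w = M @ np.array([point[0], point[1], 1.0])
    return x / w, y / w   # divide out the homogeneous coordinate

src = [(56, 65), (368, 52), (28, 387), (389, 390)]
dst = [(0, 0), (300, 0), (0, 400), (300, 400)]
H = perspective_matrix(src, dst)
for s in src:
    print(s, "->", np.round(warp_point(H, s), 3))
```

The division by `w` is what separates a perspective transform from an affine one: it lets parallel lines in the source converge or diverge in the output, which is exactly what you need to undo camera tilt.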

Building a complete preprocessing pipeline

Real computer vision projects chain multiple operations into a consistent pipeline. Here's a class that takes raw images and produces model-ready tensors:

class ImagePreprocessor:
    """Complete pipeline: load, convert, resize, enhance,
    normalize, and reformat for PyTorch."""

    def __init__(self, target_size=(224, 224), normalize=True):
        self.target_size = target_size
        self.normalize = normalize
        # ImageNet statistics (used by virtually all
        # pretrained vision models)
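        # float32 on purpose: float64 constants would silently
        # promote the normalized image to float64
        self.mean = np.array([0.485, 0.456, 0.406], dtype=np.float32)
        self.std = np.array([0.229, 0.224, 0.225], dtype=np.float32)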
        self.clahe = cv2.createCLAHE(
            clipLimit=2.0, tileGridSize=(8, 8)
        )

    def __call__(self, image_path):
        """Process a single image file."""
        # Step 1: Load
        img = cv2.imread(str(image_path))
        if img is None:
            raise ValueError(
                f"Failed to load image: {image_path}"
            )

        # Step 2: BGR to RGB
        img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)

        # Step 3: Resize preserving aspect ratio
        img = self._resize_with_padding(img)

        # Step 4: Contrast enhancement if needed
        if self._is_low_contrast(img):
            img = self._enhance_contrast(img)

        # Step 5: Float conversion [0, 1]
        img = img.astype(np.float32) / 255.0

        # Step 6: Normalize with ImageNet stats
        if self.normalize:
            img = (img - self.mean) / self.std

        # Step 7: HWC to CHW for PyTorch
        img = np.transpose(img, (2, 0, 1))

        return img

    def _resize_with_padding(self, img):
        """Resize to target size, maintaining aspect ratio,
        with zero-padding for the remainder."""
        h, w = img.shape[:2]
        th, tw = self.target_size
        scale = min(tw / w, th / h)
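        # round() avoids off-by-one truncation; max(1, ...) guards
        # against a zero-sized resize for extreme aspect ratios
        new_w = max(1, round(w * scale))
        new_h = max(1, round(h * scale))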

        resized = cv2.resize(
            img, (new_w, new_h),
            interpolation=cv2.INTER_AREA
            if scale < 1.0 else cv2.INTER_LINEAR
        )

        # Center the resized image on a black canvas
        pad_top = (th - new_h) // 2
        pad_left = (tw - new_w) // 2
        padded = np.zeros((th, tw, 3), dtype=np.uint8)
        padded[pad_top:pad_top + new_h,
               pad_left:pad_left + new_w] = resized

        return padded

    def _is_low_contrast(self, img):
        """Detect images that need contrast enhancement."""
        gray = cv2.cvtColor(img, cv2.COLOR_RGB2GRAY)
        return gray.std() < 40.0

    def _enhance_contrast(self, img):
        """Apply CLAHE on the L channel in LAB space."""
        lab = cv2.cvtColor(img, cv2.COLOR_RGB2LAB)
        lab[:, :, 0] = self.clahe.apply(lab[:, :, 0])
        return cv2.cvtColor(lab, cv2.COLOR_LAB2RGB)

    def process_batch(self, image_paths):
        """Process multiple images into a batch tensor."""
        batch = []
        for path in image_paths:
            try:
                processed = self(path)
                batch.append(processed)
            except ValueError as e:
                print(f"Skipping {path}: {e}")
        return np.stack(batch) if batch else np.array([])

Several design decisions here are worth noting:

Aspect ratio preservation with padding instead of stretching. Stretching distorts features -- a circle becomes an oval, a square becomes a rectangle. For models pretrained on ImageNet (typically on square crops that keep the original proportions), this distortion can hurt accuracy. Padding maintains the original proportions.

Automatic interpolation selection: INTER_AREA when downscaling, INTER_LINEAR when upscaling. This gives the best quality for each direction without the user having to think about it.

Conditional CLAHE: only applied to low-contrast images (standard deviation below 40). Applying histogram equalization to an image that already has good contrast wastes computation and can actually degrade quality by amplifying noise in already-visible regions.

ImageNet normalization values (mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]). These numbers are worth memorizing if you work with vision models. Most ImageNet-pretrained models you'll download (the torchvision ResNet, VGG, and EfficientNet weights, for example) were trained with these statistics. Some checkpoints, including a number of ViT and CLIP variants, use different values, so check the model's documentation. If you normalize differently from how the model was trained, the first layer receives input it wasn't calibrated for and performance drops. It's a small thing that makes a big difference.

The HWC to CHW transpose at the end is because PyTorch expects channel-first format (C, H, W) while OpenCV and NumPy use channel-last (H, W, C). This is another constant source of shape-mismatch bugs. Get it right in the preprocessing pipeline and you never have to think about it downstream.
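The normalization and the transpose are both simple enough to test in isolation, including the way back (which you need whenever you want to look at what the model actually received). A minimal NumPy sketch, with `to_model_input` and `to_display` as hypothetical helper names and a random image standing in for real data:

```python
import numpy as np

MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)

def to_model_input(rgb_uint8):
    """uint8 HWC image -> normalized float32 CHW array."""
    x = rgb_uint8.astype(np.float32) / 255.0
    x = (x - MEAN) / STD            # broadcasts over the last (channel) axis
    return np.transpose(x, (2, 0, 1))

def to_display(chw):
    """Invert the steps above to get a viewable uint8 HWC image back."""
    x = np.transpose(chw, (1, 2, 0)) * STD + MEAN
    return np.clip(np.rint(x * 255.0), 0, 255).astype(np.uint8)

rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(2, 4, 3), dtype=np.uint8)
tensor = to_model_input(image)
print(tensor.shape, tensor.dtype)
print(np.array_equal(to_display(tensor), image))
```

Keeping the constants in float32 is a deliberate choice in this sketch: subtracting a float64 array from a float32 image promotes the result to float64, and PyTorch models generally expect float32 input.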

Putting it all together: a practical workflow

Let's see the pipeline in action with a batch of images:

from pathlib import Path

preprocessor = ImagePreprocessor(
    target_size=(224, 224), normalize=True
)

# Process a directory of images
image_dir = Path("./dataset/images")
image_paths = sorted(image_dir.glob("*.jpg"))

# Single image
single = preprocessor(image_paths[0])
print(f"Single image shape: {single.shape}")
# (3, 224, 224) -- CHW format, ready for PyTorch

# Batch
batch = preprocessor.process_batch(image_paths[:32])
print(f"Batch shape: {batch.shape}")
# (32, 3, 224, 224) -- (N, C, H, W)

# Quick stats to verify normalization
print(f"Channel means: {batch.mean(axis=(0, 2, 3))}")
print(f"Channel stds:  {batch.std(axis=(0, 2, 3))}")
# Expect roughly 0 mean and 1 std only if the images resemble
# ImageNet; black letterbox padding drags the means below 0
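When reading those stats, keep in mind that letterbox padding is part of the tensor too. The back-of-the-envelope sketch below works out how much of a 224x224 canvas is padding for a 640x480 source, and what a black padding pixel turns into after normalization. `letterbox_geometry` is a hypothetical stand-alone helper written for this calculation, and the 640x480 input is just an example size.

```python
def letterbox_geometry(w, h, target=(224, 224)):
    """Resized size and padding offsets for an aspect-preserving fit.
    Uses round() to sidestep floating-point truncation."""
    th, tw = target
    scale = min(tw / w, th / h)
    new_w, new_h = round(w * scale), round(h * scale)
    return new_w, new_h, (th - new_h) // 2, (tw - new_w) // 2

new_w, new_h, pad_top, pad_left = letterbox_geometry(640, 480)
padding_fraction = 1 - (new_w * new_h) / (224 * 224)

# A black padding pixel (0.0) after ImageNet normalization, red channel
black_normalized = (0.0 - 0.485) / 0.229

print(new_w, new_h, pad_top, pad_left)
print(f"{padding_fraction:.0%} of the canvas is padding")
print(f"normalized black = {black_normalized:.3f}")
```

So for a 4:3 photo, a quarter of every tensor is a strongly negative constant. That's harmless for the model, but it does mean the batch means will sit noticeably below zero even when the image content itself is perfectly ordinary.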

To summarize

Digital images are NumPy arrays: grayscale is (H, W) with values 0-255, color is (H, W, 3) -- and OpenCV loads BGR, not RGB, so always convert;
color spaces (HSV, LAB, grayscale) separate different kinds of information: HSV for color detection, LAB for perceptual operations, grayscale when color is irrelevant;
convolution kernels perform image filtering: blur smooths noise, sharpen enhances detail, Sobel and Canny detect edges -- CNNs discover these same operations automatically through training;
histogram equalization improves contrast globally, while CLAHE adapts locally to handle uneven lighting -- critical for medical imaging and autonomous driving;
geometric transforms (resize, crop, rotate, perspective warp) with proper interpolation choices bridge the gap between raw images and fixed-size model input;
a complete preprocessing pipeline chains all these steps (load, color convert, resize with padding, contrast enhance, normalize, transpose) into a reusable class that produces model-ready tensors.

This is the groundwork for everything we'll build in the coming episodes. Once you can reliably load, transform, and normalize images, the next step is teaching models to find specific things within those images -- not just "this is a cat" but "the cat is here, at these coordinates, with this bounding box." That's a fundamentally different problem, and it needs its own set of architectures and techniques.

Exercises

Exercise 1: Build a color histogram analyzer. Write a class that loads an image, computes histograms for each channel (R, G, B) independently, and also computes a combined intensity histogram. Add a method that compares two images by computing the correlation between their histograms (use cv2.compareHist with cv2.HISTCMP_CORREL). Test it by comparing photos of similar scenes versus completely different scenes.

Exercise 2: Implement a multi-kernel edge detector. Write a function that applies four different edge-detection kernels to the same image: Sobel X, Sobel Y, Laplacian, and a custom diagonal kernel. Combine the results into a single edge map by taking the maximum response at each pixel. Compare your combined result against cv2.Canny -- which produces cleaner edges and why?

Exercise 3: Build an image augmentation pipeline. Create a class with methods for random horizontal flip (50% chance), random rotation (between -15 and +15 degrees), random brightness adjustment (multiply by a factor between 0.7 and 1.3), and random crop-and-resize (crop 80-100% of the area, then resize back to original dimensions). Apply all augmentations sequentially to produce varied training examples from a single source image. Generate 10 augmented versions and verify that pixel statistics (mean, std) remain reasonable after augmentation.

Thanks, and until next time!

Hive account: @scipio
