Learn AI Series (#77) - Image Processing Fundamentals
What will I learn
- You will learn how digital images are represented as numerical arrays: pixels, channels, and data types;
- color spaces beyond RGB: HSV for color detection, LAB for perceptual operations, grayscale for speed;
- convolution as image filtering: blur, sharpen, and edge detection with hand-crafted kernels;
- histogram equalization and adaptive contrast enhancement with CLAHE;
- geometric transformations: resize, crop, rotate, flip, and perspective warping;
- building a complete image preprocessing pipeline for deep learning models.
Requirements
- A working modern computer running macOS, Windows or Ubuntu;
- An installed Python 3(.11+) distribution;
- The ambition to learn AI and machine learning.
Difficulty
- Beginner
Curriculum (of the Learn AI Series):
- Learn AI Series (#1) - What Machine Learning Actually Is
- Learn AI Series (#2) - Setting Up Your AI Workbench - Python and NumPy
- Learn AI Series (#3) - Your Data Is Just Numbers - How Machines See the World
- Learn AI Series (#4) - Your First Prediction - No Math, Just Intuition
- Learn AI Series (#5) - Patterns in Data - What "Learning" Actually Looks Like
- Learn AI Series (#6) - From Intuition to Math - Why We Need Formulas
- Learn AI Series (#7) - The Training Loop - See It Work Step by Step
- Learn AI Series (#8) - The Math You Actually Need (Part 1) - Linear Algebra
- Learn AI Series (#9) - The Math You Actually Need (Part 2) - Calculus and Probability
- Learn AI Series (#10) - Your First ML Model - Linear Regression From Scratch
- Learn AI Series (#11) - Making Linear Regression Real
- Learn AI Series (#12) - Classification - Logistic Regression From Scratch
- Learn AI Series (#13) - Evaluation - How to Know If Your Model Actually Works
- Learn AI Series (#14) - Data Preparation - The 80% Nobody Talks About
- Learn AI Series (#15) - Feature Engineering and Selection
- Learn AI Series (#16) - Scikit-Learn - The Standard Library of ML
- Learn AI Series (#17) - Decision Trees - How Machines Make Decisions
- Learn AI Series (#18) - Random Forests - Wisdom of Crowds
- Learn AI Series (#19) - Gradient Boosting - The Kaggle Champion
- Learn AI Series (#20) - Support Vector Machines - Drawing the Perfect Boundary
- Learn AI Series (#21) - Mini Project - Predicting Crypto Market Regimes
- Learn AI Series (#22) - K-Means Clustering - Finding Groups
- Learn AI Series (#23) - Advanced Clustering - Beyond K-Means
- Learn AI Series (#24) - Dimensionality Reduction - PCA
- Learn AI Series (#25) - Advanced Dimensionality Reduction - t-SNE and UMAP
- Learn AI Series (#26) - Anomaly Detection - Finding What Doesn't Belong
- Learn AI Series (#27) - Recommendation Systems - "Users Like You Also Liked..."
- Learn AI Series (#28) - Time Series Fundamentals - When Order Matters
- Learn AI Series (#29) - Time Series Forecasting - Predicting What Comes Next
- Learn AI Series (#30) - Natural Language Processing - Text as Data
- Learn AI Series (#31) - Word Embeddings - Meaning in Numbers
- Learn AI Series (#32) - Bayesian Methods - Thinking in Probabilities
- Learn AI Series (#33) - Ensemble Methods Deep Dive - Stacking and Blending
- Learn AI Series (#34) - ML Engineering - From Notebook to Production
- Learn AI Series (#35) - Data Ethics and Bias in ML
- Learn AI Series (#36) - Mini Project - Complete ML Pipeline
- Learn AI Series (#37) - The Perceptron - Where It All Started
- Learn AI Series (#38) - Neural Networks From Scratch - Forward Pass
- Learn AI Series (#39) - Neural Networks From Scratch - Backpropagation
- Learn AI Series (#40) - Training Neural Networks - Practical Challenges
- Learn AI Series (#41) - Optimization Algorithms - SGD, Momentum, Adam
- Learn AI Series (#42) - PyTorch Fundamentals - Tensors and Autograd
- Learn AI Series (#43) - PyTorch Data and Training
- Learn AI Series (#44) - PyTorch nn.Module - Building Real Networks
- Learn AI Series (#45) - Convolutional Neural Networks - Theory
- Learn AI Series (#46) - CNNs in Practice - Classic to Modern Architectures
- Learn AI Series (#47) - CNN Applications - Detection, Segmentation, Style Transfer
- Learn AI Series (#48) - Recurrent Neural Networks - Sequences
- Learn AI Series (#49) - LSTM and GRU - Solving the Memory Problem
- Learn AI Series (#50) - Sequence-to-Sequence Models
- Learn AI Series (#51) - Attention Mechanisms
- Learn AI Series (#52) - The Transformer Architecture (Part 1)
- Learn AI Series (#53) - The Transformer Architecture (Part 2)
- Learn AI Series (#54) - Vision Transformers
- Learn AI Series (#55) - Generative Adversarial Networks
- Learn AI Series (#56) - Mini Project - Building a Transformer From Scratch
- Learn AI Series (#57) - Language Modeling - Predicting the Next Word
- Learn AI Series (#58) - GPT Architecture - Decoder-Only Transformers
- Learn AI Series (#59) - BERT and Encoder Models
- Learn AI Series (#60) - Training Large Language Models
- Learn AI Series (#61) - Instruction Tuning and Alignment
- Learn AI Series (#62) - Prompt Engineering - Getting the Most from LLMs
- Learn AI Series (#63) - Embeddings and Vector Search
- Learn AI Series (#64) - Retrieval-Augmented Generation (RAG) - Basics
- Learn AI Series (#65) - RAG - Advanced Techniques
- Learn AI Series (#66) - Working with LLM APIs
- Learn AI Series (#67) - Building AI Agents (Part 1) - Foundations
- Learn AI Series (#68) - Building AI Agents (Part 2) - Advanced Patterns
- Learn AI Series (#69) - Fine-Tuning Language Models
- Learn AI Series (#70) - Running Local Models
- Learn AI Series (#71) - Text Generation Techniques
- Learn AI Series (#72) - Tokenization Deep Dive
- Learn AI Series (#73) - LLM Evaluation
- Learn AI Series (#74) - The Hugging Face Ecosystem
- Learn AI Series (#75) - Multimodal Models - Text Meets Vision
- Learn AI Series (#76) - Mini Project - Your Own AI Assistant
- Learn AI Series (#77) - Image Processing Fundamentals (this post)
Learn AI Series (#77) - Image Processing Fundamentals
Welcome to Arc 5: Computer Vision. We've spent the last twenty episodes deep in LLM territory -- language models, transformers, RAG, agents, evaluation, the whole works. And in episode #76 we capped that arc off by building a complete AI assistant that ties it all together. Now we're shifting focus. From text to pixels.
Here's the thing though -- we haven't been completely ignoring images. Back in episodes #45-47 we built CNNs and used them for classification, detection, and style transfer. In episode #54 we covered Vision Transformers. And in #75 we looked at multimodal models that bridge text and vision. But in all of those episodes we sort of... assumed the images were already preprocessed and ready to go. We loaded them, resized them, maybe normalized them, and moved on.
That's backwards. In actual computer vision work, image preprocessing is where quite a lot of the battle is won or lost. A model trained on well-preprocessed images typically outperforms the same architecture trained on raw, noisy, inconsistently-sized data. So before we go deeper into detection, segmentation, and generative vision models in the coming episodes, we need to build a proper foundation. Starting from the literal pixels.
Digital images: just arrays of numbers
Way back in episode #3 we established that all data is numbers. Images are no exception. A grayscale image of size 480x640 is a 2D NumPy array with shape (480, 640). Each element is a pixel intensity value: 0 means black, 255 means white, everything in between is a shade of gray.
A color image adds a third dimension: channels. An RGB image of size 480x640 has shape (480, 640, 3) -- one channel each for red, green, and blue. Each pixel is a triplet like (142, 87, 213), meaning "this much red, this much green, this much blue, mix them together."
import numpy as np
import cv2
# Load an image from disk
img = cv2.imread("photo.jpg")
print(f"Shape: {img.shape}") # (480, 640, 3) = H x W x C
print(f"Dtype: {img.dtype}") # uint8 -- values 0 to 255
print(f"Size in bytes: {img.nbytes}") # 480 * 640 * 3 = 921,600
print(f"Pixel at (100, 200): {img[100, 200]}") # [B, G, R] array
# Create a grayscale gradient from scratch
gradient = np.zeros((256, 256), dtype=np.uint8)
for i in range(256):
    gradient[i, :] = i  # each row is a different brightness
# Create a color image from scratch
color_img = np.zeros((200, 300, 3), dtype=np.uint8)
color_img[:, :100] = [255, 0, 0] # left third: blue (BGR!)
color_img[:, 100:200] = [0, 255, 0] # middle third: green
color_img[:, 200:] = [0, 0, 255] # right third: red
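One consequence of the uint8 dtype is easy to miss: arithmetic on uint8 arrays wraps around modulo 256 instead of stopping at 255. The sketch below is a minimal illustration of that behavior; `brighten` is a hypothetical helper name chosen for this example, not an OpenCV or NumPy function.

```python
import numpy as np

def brighten(img, delta):
    """Add delta to every pixel, saturating at 0 and 255 (hypothetical helper)."""
    # Widen to a signed type first so the sum cannot wrap around
    wide = img.astype(np.int16) + delta
    return np.clip(wide, 0, 255).astype(np.uint8)

pixels = np.array([[100, 200, 250]], dtype=np.uint8)

# Naive uint8 addition wraps: 200 + 60 = 260 -> 260 - 256 = 4
wrapped = pixels + np.uint8(60)

# Widening and clipping saturates instead
safe = brighten(pixels, 60)

print(wrapped)  # [[160   4  54]]
print(safe)     # [[160 255 255]]
```

OpenCV's own arithmetic functions such as cv2.add saturate on uint8 inputs, so the wraparound mostly bites when you mix in plain NumPy operators.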
Now, here is an important OpenCV quirk that trips up literally everyone at some point: OpenCV loads images in BGR order, not RGB. This is a historical artifact from when BGR was common in camera hardware. It means that if you load an image with cv2.imread and display it with matplotlib (which expects RGB), the colors will be swapped -- reds become blues and vice versa. When passing images to PyTorch or any other framework that expects RGB, you need to convert:
# OpenCV BGR to standard RGB
rgb_img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
# Or skip the problem entirely: use PIL for loading
from PIL import Image
pil_img = Image.open("photo.jpg") # loads as RGB directly
img_array = np.array(pil_img) # (H, W, 3) in RGB order
I've wasted hours debugging color issues before realizing the BGR/RGB mismatch was the culprit. Consider this your warning ;-)
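If it helps to see what the conversion amounts to, swapping BGR and RGB is just a reversal of the channel axis. The snippet below is a small NumPy-only sketch on a made-up two-pixel image, not a replacement for cv2.cvtColor.

```python
import numpy as np

# A tiny 1x2 "image" in OpenCV's BGR order: one pure-blue and one pure-red pixel
bgr = np.array([[[255, 0, 0], [0, 0, 255]]], dtype=np.uint8)

# Reversing the last axis swaps B and R; G stays in the middle
rgb_view = bgr[..., ::-1]

# The slice is a view with a negative stride; copy it into a contiguous
# array before handing it to code that requires contiguous memory
rgb = np.ascontiguousarray(rgb_view)

print(rgb[0, 0])  # [  0   0 255] -> blue, now in RGB order
print(rgb[0, 1])  # [255   0   0] -> red, now in RGB order
```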
Color spaces: beyond RGB
RGB is intuitive (we can reason about "more red" or "less blue") but it's not always the best representation for computer vision tasks. Different color spaces separate different kinds of information, and choosing the right one for your task can simplify everything.
HSV -- Hue, Saturation, Value
HSV separates what color it is (hue) from how vivid (saturation) and how bright (value). For 8-bit images OpenCV stores hue as 0 to 179 (degrees halved so the value fits in a byte, not 0 to 360 -- another quirk), and saturation and value as 0 to 255.
Why HSV matters: suppose you want to find all the red objects in a scene. In RGB, "red" means high R, low G, low B -- but a dark red, a bright red, and a slightly orange red have very different RGB values. In HSV, all reds cluster around the same hue value regardless of lighting. You just filter on hue:
hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV)
# Red wraps around hue 0/180 in OpenCV, so we need two ranges
lower_red1 = np.array([0, 100, 80])
upper_red1 = np.array([10, 255, 255])
lower_red2 = np.array([170, 100, 80])
upper_red2 = np.array([180, 255, 255])
mask1 = cv2.inRange(hsv, lower_red1, upper_red1)
mask2 = cv2.inRange(hsv, lower_red2, upper_red2)
red_mask = mask1 | mask2
# red_mask: 255 where red, 0 elsewhere
# Count red pixels
red_pixels = np.count_nonzero(red_mask)
total_pixels = red_mask.shape[0] * red_mask.shape[1]
print(f"Red coverage: {red_pixels / total_pixels:.1%}")
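To see why hue stays put when lighting changes, here is a minimal sketch using Python's standard-library colorsys module rather than OpenCV. The helper name `opencv_style_hue` is an assumption made for this example; it simply rescales the hue to the halved-degrees convention.

```python
import colorsys

def opencv_style_hue(r, g, b):
    """Hue of an 8-bit RGB color on the halved-degrees scale (0-179)."""
    h, _s, _v = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
    return round(h * 180) % 180

bright_red = opencv_style_hue(255, 0, 0)
dark_red = opencv_style_hue(128, 0, 0)
orange_red = opencv_style_hue(255, 40, 0)
pure_green = opencv_style_hue(0, 255, 0)

print(bright_red, dark_red, orange_red, pure_green)  # 0 0 5 60
```

The bright and dark reds have very different RGB values but the same hue, and the slightly orange red still lands inside the 0-10 range used in the mask above.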
LAB -- Perceptually uniform
LAB (also called CIELAB) splits an image into L (lightness), A (green-to-red axis), and B (blue-to-yellow axis). The key property is that it was designed to be approximately perceptually uniform: a numerical distance of 10 in LAB space looks like roughly the same amount of visual difference no matter where in the color space you are. That's NOT true for RGB (a change of 10 in the blue channel looks very different at low brightness vs high brightness).
LAB is useful for any operation where you need to reason about "how different do these colors look to a human" -- things like color-based clustering, image quality metrics, and contrast enhancement.
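As a rough illustration of "color distance in LAB", the sketch below converts sRGB triplets to CIELAB with the standard D65 formulas and measures the plain Euclidean distance (the CIE76 color difference). The function names are made up for this example. Note that it produces L in 0-100 and signed a/b values, whereas OpenCV's 8-bit LAB output rescales L to 0-255 and offsets a and b by 128, so the numbers are not directly interchangeable.

```python
import numpy as np

# sRGB (D65) to XYZ matrix and D65 reference white
_RGB_TO_XYZ = np.array([
    [0.4124564, 0.3575761, 0.1804375],
    [0.2126729, 0.7151522, 0.0721750],
    [0.0193339, 0.1191920, 0.9503041],
])
_WHITE_D65 = np.array([0.95047, 1.0, 1.08883])

def srgb_to_lab(rgb):
    """Convert one 8-bit sRGB triplet to CIELAB (L in 0-100)."""
    c = np.asarray(rgb, dtype=np.float64) / 255.0
    # Undo the sRGB gamma curve to get linear light
    linear = np.where(c <= 0.04045, c / 12.92, ((c + 0.055) / 1.055) ** 2.4)
    xyz = _RGB_TO_XYZ @ linear / _WHITE_D65
    # CIELAB's cube-root compression with a linear segment near black
    f = np.where(xyz > 0.008856, np.cbrt(xyz), 7.787 * xyz + 16.0 / 116.0)
    return np.array([116 * f[1] - 16, 500 * (f[0] - f[1]), 200 * (f[1] - f[2])])

def delta_e76(rgb1, rgb2):
    """CIE76 color difference: Euclidean distance in LAB space."""
    return float(np.linalg.norm(srgb_to_lab(rgb1) - srgb_to_lab(rgb2)))

# The same +10 step in the blue channel, once in the dark and once in the bright
dark_pair = delta_e76((0, 0, 10), (0, 0, 20))
bright_pair = delta_e76((0, 0, 240), (0, 0, 250))
print(f"dark step: {dark_pair:.2f}, bright step: {bright_pair:.2f}")
```

CIE76 is the simplest of several color-difference formulas; later refinements exist, but the Euclidean version is enough to show the idea.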
Grayscale -- When color doesn't matter
For many tasks -- edge detection, text recognition, feature matching -- color is irrelevant. Converting to grayscale reduces your data from 3 channels to 1, so there is a third as much data to process and most per-pixel operations get correspondingly cheaper:
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
lab = cv2.cvtColor(img, cv2.COLOR_BGR2LAB)
# Grayscale conversion uses a weighted average, not a simple mean:
# gray = 0.299*R + 0.587*G + 0.114*B
# These weights match human perception (we're most sensitive to green)
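The weighted average from the comment above can be reproduced in a couple of lines of NumPy. This is a minimal sketch that assumes RGB channel order (not OpenCV's BGR); `rgb_to_gray` is a hypothetical helper, and OpenCV's own rounding may differ slightly.

```python
import numpy as np

# Luma weights in R, G, B order, as quoted in the comment above
LUMA_WEIGHTS = np.array([0.299, 0.587, 0.114])

def rgb_to_gray(rgb):
    """Weighted-average grayscale for an (H, W, 3) RGB uint8 array."""
    gray = rgb.astype(np.float64) @ LUMA_WEIGHTS
    return np.round(gray).astype(np.uint8)

# One row with pure red, pure green, pure blue, and white
swatches = np.array([[[255, 0, 0], [0, 255, 0], [0, 0, 255], [255, 255, 255]]],
                    dtype=np.uint8)
gray = rgb_to_gray(swatches)
print(gray)  # [[ 76 150  29 255]]
```

Pure green comes out about five times brighter than pure blue, which is exactly the perceptual weighting at work.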
Choosing the right color space for the job is one of those things that separates a thoughtful pipeline from a naive one. For color-based detection: HSV. For perceptual operations (contrast, color distance): LAB. For speed and when color doesn't matter: grayscale. For everything else: RGB (or BGR if you're stuck in OpenCV land).
Convolution as image filtering
We covered convolution mathematically in episode #45 when we built CNNs. The same operation -- sliding a small matrix (the kernel) across the image and computing weighted sums -- has been used in image processing for decades before neural networks existed. CNNs just learn which kernels to use automatically. Today we're designing them by hand.
def convolve2d(image, kernel):
    """Manual 2D convolution to understand the mechanics."""
    h, w = image.shape[:2]
    kh, kw = kernel.shape
    pad_h, pad_w = kh // 2, kw // 2
    # Pad edges using reflection (mirror the boundary pixels)
    padded = np.pad(image, ((pad_h, pad_h), (pad_w, pad_w)),
                    mode='reflect')
    output = np.zeros_like(image, dtype=np.float64)
    for i in range(h):
        for j in range(w):
            region = padded[i:i + kh, j:j + kw]
            output[i, j] = np.sum(region * kernel)
    return output
This is deliberately slow (pure Python loops over every pixel) but it shows exactly what happens: at each position, extract a region the size of the kernel, multiply element-wise, sum. The kernel values determine the effect.
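The same sliding weighted sum can be written without Python loops. The sketch below uses NumPy's sliding_window_view (available in NumPy 1.20 and later) and feeds it an impulse image, which makes one detail visible: sliding the kernel without flipping it is strictly speaking cross-correlation, which is also what cv2.filter2D computes. The function name `correlate2d_fast` is an assumption made for this example.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def correlate2d_fast(image, kernel):
    """Same sliding weighted sum as the loop version, without Python loops."""
    kh, kw = kernel.shape
    padded = np.pad(image.astype(np.float64),
                    ((kh // 2, kh // 2), (kw // 2, kw // 2)), mode="reflect")
    # windows[i, j] is the kh x kw neighborhood centered on pixel (i, j)
    windows = sliding_window_view(padded, (kh, kw))
    return np.einsum("ijkl,kl->ij", windows, kernel)

# An impulse image: a single bright pixel in the center
impulse = np.zeros((5, 5))
impulse[2, 2] = 1.0

kernel = np.array([[-1, 0, 1],
                   [-2, 0, 2],
                   [-1, 0, 1]], dtype=np.float64)

response = correlate2d_fast(impulse, kernel)
# The 3x3 patch around the impulse is the kernel rotated by 180 degrees
patch = response[1:4, 1:4]
print(patch)
```

For symmetric kernels such as a box or Gaussian blur the flip makes no difference; for directional kernels it only changes the sign of the response.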
Different kernels produce fundamentally different results:
# Box blur: average neighboring pixels (smoothing)
blur_kernel = np.ones((5, 5)) / 25.0
# Sharpen: enhance edges and fine details
sharpen_kernel = np.array([
[ 0, -1, 0],
[-1, 5, -1],
[ 0, -1, 0]
], dtype=np.float64)
# Sobel X: horizontal gradient (responds to vertical edges)
sobel_x = np.array([
    [-1, 0, 1],
    [-2, 0, 2],
    [-1, 0, 1]
], dtype=np.float64)
# Sobel Y: vertical gradient (responds to horizontal edges)
sobel_y = np.array([
[-1, -2, -1],
[ 0, 0, 0],
[ 1, 2, 1]
], dtype=np.float64)
# Apply using OpenCV (orders of magnitude faster than our loop)
blurred = cv2.filter2D(gray, -1, blur_kernel)
sharpened = cv2.filter2D(gray, -1, sharpen_kernel)
edges_x = cv2.filter2D(gray, cv2.CV_64F, sobel_x)
edges_y = cv2.filter2D(gray, cv2.CV_64F, sobel_y)
# Combine edge directions for overall edge magnitude
edge_magnitude = np.sqrt(edges_x**2 + edges_y**2)
edge_magnitude = np.clip(edge_magnitude, 0, 255).astype(np.uint8)
Why does this matter for deep learning? Because the first layers of a trained CNN learn edge-detection kernels that look remarkably similar to Sobel filters. The network independently discovers through backpropagation what image processing engineers designed by hand in the 1960s. That's not a coincidence -- edges are the most informative low-level features in an image.
Gaussian blur
The box blur above applies equal weight to all neighbors. Gaussian blur weights center pixels more than edge pixels, following a bell curve distribution. This produces a more natural smoothing effect:
# Gaussian blur: kernel size must be odd, sigma controls spread
blurred = cv2.GaussianBlur(img, (5, 5), sigmaX=1.0)
very_blurred = cv2.GaussianBlur(img, (15, 15), sigmaX=5.0)
# Bilateral filter: blurs while preserving edges
# (slow but powerful for noise reduction)
denoised = cv2.bilateralFilter(img, d=9,
sigmaColor=75,
sigmaSpace=75)
The bilateral filter is worth highlighting. Regular Gaussian blur smooths everything -- edges included. The bilateral filter adds a second condition: only average pixels that are also similar in color. This preserves sharp edges while smoothing flat regions. It's computationally expensive but produces dramatically better results for noise reduction.
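The difference between the two filters is easiest to see in one dimension. The following is a simplified 1D sketch, not OpenCV's implementation: the function names and the sigma values are assumptions chosen for illustration. It applies a Gaussian blur and a bilateral filter to a clean step edge and compares how much of the step survives.

```python
import numpy as np

def gaussian_kernel_1d(size, sigma):
    """Bell-curve weights centered on the middle element, normalized to sum to 1."""
    offsets = np.arange(size) - size // 2
    weights = np.exp(-(offsets ** 2) / (2 * sigma ** 2))
    return weights / weights.sum()

def gaussian_blur_1d(signal, size=5, sigma=1.0):
    sig = signal.astype(np.float64)
    padded = np.pad(sig, size // 2, mode="edge")
    kernel = gaussian_kernel_1d(size, sigma)
    return np.array([np.sum(padded[i:i + size] * kernel) for i in range(len(sig))])

def bilateral_1d(signal, size=5, sigma_space=1.0, sigma_value=10.0):
    sig = signal.astype(np.float64)
    padded = np.pad(sig, size // 2, mode="edge")
    spatial = gaussian_kernel_1d(size, sigma_space)
    out = np.empty_like(sig)
    for i in range(len(sig)):
        window = padded[i:i + size]
        # Second condition: down-weight neighbors whose value differs from the center
        value_w = np.exp(-((window - sig[i]) ** 2) / (2 * sigma_value ** 2))
        weights = spatial * value_w
        out[i] = np.sum(weights * window) / np.sum(weights)
    return out

step = np.array([0, 0, 0, 0, 100, 100, 100, 100])
gauss_jump = gaussian_blur_1d(step)[4] - gaussian_blur_1d(step)[3]
bilateral_jump = bilateral_1d(step)[4] - bilateral_1d(step)[3]
print(f"jump after Gaussian: {gauss_jump:.1f}, after bilateral: {bilateral_jump:.1f}")
```

The Gaussian blur smears the step of 100 down to roughly 40 between the two neighboring samples, while the bilateral filter leaves it essentially intact.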
Canny edge detection
Canny is the go-to edge detection algorithm. It combines Gaussian blur (noise reduction), gradient computation (finding edges), non-maximum suppression (thinning edges to single-pixel width), and hysteresis thresholding (connecting strong edges to weak ones) into a single pipeline:
# Canny edge detection
edges = cv2.Canny(gray, threshold1=50, threshold2=150)
# Result: binary image, 255 at edges, 0 elsewhere
# The two thresholds control sensitivity:
# - Below threshold1: definitely NOT an edge
# - Above threshold2: definitely an edge
# - Between: only an edge if connected to a strong edge
# This hysteresis approach produces cleaner results than a single threshold
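The hysteresis step is the least obvious part, so here is a minimal 1D sketch of the idea under simplifying assumptions: a made-up array of gradient magnitudes, connectivity defined as "directly adjacent", and a hypothetical function name. Real Canny works on a 2D gradient image after non-maximum suppression.

```python
import numpy as np

def hysteresis_1d(gradient, low, high):
    """Keep strong responses plus any weak ones connected to them."""
    strong = gradient >= high
    candidate = gradient >= low          # strong or weak
    keep = strong.copy()
    while True:
        # Grow the kept set by one neighbor in each direction
        grown = keep.copy()
        grown[1:] |= keep[:-1]
        grown[:-1] |= keep[1:]
        grown &= candidate               # never grow into sub-threshold pixels
        if np.array_equal(grown, keep):
            return keep
        keep = grown

gradient = np.array([10, 80, 160, 90, 20, 100, 30])
edges = hysteresis_1d(gradient, low=50, high=150)
print(edges)  # [False  True  True  True False False False]
```

The values 80 and 90 survive because they touch the strong response of 160, while the isolated 100 is discarded even though it is above the lower threshold.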
Having said that, tuning those two thresholds is something of an art. Too low and you get noise edges everywhere. Too high and you miss real structure. A common strategy is to compute the median pixel intensity and set thresholds relative to it:
median_val = np.median(gray)
lower = int(max(0, 0.7 * median_val))
upper = int(min(255, 1.3 * median_val))
edges = cv2.Canny(gray, lower, upper)
Histogram equalization and contrast
An image histogram shows the distribution of pixel intensities. A dark underexposed photo has most of its pixels clustered near 0. A washed-out overexposed photo has pixels bunched around 200-255. Histogram equalization remaps the distribution so it spans the full 0-255 range, improving contrast automatically.
# Basic histogram equalization on grayscale
equalized = cv2.equalizeHist(gray)
# Compute histograms to see the difference
hist_before = cv2.calcHist([gray], [0], None, [256], [0, 256])
hist_after = cv2.calcHist([equalized], [0], None, [256], [0, 256])
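Under the hood, equalization is a lookup table built from the cumulative histogram. The sketch below follows the common textbook formula in plain NumPy; it is an illustration of the principle, and OpenCV's exact rounding may differ. The name `equalize_hist` is chosen for this example.

```python
import numpy as np

def equalize_hist(gray):
    """Histogram equalization for a 2D uint8 array via the cumulative histogram."""
    hist = np.bincount(gray.ravel(), minlength=256)
    cdf = hist.cumsum()
    cdf_min = cdf[cdf > 0][0]          # CDF value at the darkest occupied bin
    total = gray.size
    if total == cdf_min:               # constant image: nothing to stretch
        return gray.copy()
    # Lookup table: stretch the CDF so it spans 0..255
    lut = np.round((cdf - cdf_min) / (total - cdf_min) * 255.0)
    lut = np.clip(lut, 0, 255).astype(np.uint8)
    return lut[gray]

# A very low-contrast patch: every value sits between 100 and 103
low_contrast = np.array([[100, 100, 101, 102],
                         [102, 103, 103, 103]], dtype=np.uint8)
stretched = equalize_hist(low_contrast)
print(stretched.min(), stretched.max())  # 0 255
```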
For color images, you can't just equalize each RGB channel independently -- that creates weird color shifts because the channels are correlated. Instead, convert to LAB and equalize only the L (lightness) channel:
lab = cv2.cvtColor(img, cv2.COLOR_BGR2LAB)
lab[:, :, 0] = cv2.equalizeHist(lab[:, :, 0])
equalized_color = cv2.cvtColor(lab, cv2.COLOR_LAB2BGR)
CLAHE: adaptive histogram equalization
Global equalization has a problem: it applies the same transformation everywhere. A photo with a bright window and a dark corner needs different adjustments in different regions. CLAHE (Contrast Limited Adaptive Histogram Equalization) solves this by dividing the image into tiles and equalizing each tile independently:
clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
# Apply to grayscale
enhanced_gray = clahe.apply(gray)
# Apply to color via LAB
lab = cv2.cvtColor(img, cv2.COLOR_BGR2LAB)
lab[:, :, 0] = clahe.apply(lab[:, :, 0])
enhanced_color = cv2.cvtColor(lab, cv2.COLOR_LAB2BGR)
The clipLimit parameter controls how much contrast enhancement is allowed per tile. Higher values mean more contrast but also more noise amplification. The tileGridSize controls how many tiles the image is divided into -- (8, 8) means 64 tiles, each equalized independently (with interpolation at tile borders so you don't see seams).
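The "contrast limited" part can be sketched as clipping each tile's histogram before equalizing it. The code below is a simplified illustration with a made-up 8-bin histogram and a hypothetical function name; how OpenCV converts clipLimit into an absolute bin count and redistributes the excess is not reproduced here.

```python
import numpy as np

def clip_histogram(hist, limit):
    """Cap each bin at `limit` and spread the clipped excess over all bins."""
    hist = hist.astype(np.float64)
    excess = np.maximum(hist - limit, 0.0).sum()
    clipped = np.minimum(hist, limit)
    return clipped + excess / hist.size

# A tile dominated by one intensity: a big spike in bin 2
tile_hist = np.array([2, 3, 50, 4, 1, 0, 0, 4])
clipped = clip_histogram(tile_hist, limit=10)
print(clipped)  # [ 7.  8. 15.  9.  6.  5.  5.  9.]
```

Flattening the spike limits how steep the cumulative histogram can get, and that slope is what determines how strongly contrast (and noise) gets amplified in the tile. The total pixel count stays the same.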
CLAHE is used extensively in medical imaging where lighting conditions are variable and subtle details in dark regions matter. It's also standard in autonomous driving pipelines where the camera might see bright sky and dark shadows in the same frame.
Image transformations
Resizing, cropping, rotating, and warping are the workhorses of every image pipeline. Every model expects a fixed input size. Every dataset has images of different dimensions. Transformations bridge that gap.
# Resize to exact dimensions
resized = cv2.resize(img, (224, 224))
# Resize by scale factor
half = cv2.resize(img, None, fx=0.5, fy=0.5)
double = cv2.resize(img, None, fx=2.0, fy=2.0)
# Resize with high-quality interpolation
resized_hq = cv2.resize(img, (224, 224),
interpolation=cv2.INTER_LANCZOS4)
# Crop is just NumPy slicing
cropped = img[50:300, 100:400] # rows 50-299, cols 100-399 (end index is exclusive)
# Rotate around center
h, w = img.shape[:2]
center = (w // 2, h // 2)
matrix = cv2.getRotationMatrix2D(center, angle=30, scale=1.0)
rotated = cv2.warpAffine(img, matrix, (w, h))
# Flip
flipped_h = cv2.flip(img, 1) # horizontal mirror
flipped_v = cv2.flip(img, 0) # vertical mirror
flipped_both = cv2.flip(img, -1) # both axes
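The rotation matrix itself is a small 2x3 array, and building it by hand shows what the center and angle arguments do. The sketch below follows the formula given in the OpenCV documentation for getRotationMatrix2D; the helper names are assumptions for this example.

```python
import numpy as np

def rotation_matrix_2d(center, angle_deg, scale=1.0):
    """2x3 affine matrix for rotating about `center` by `angle_deg` degrees."""
    cx, cy = center
    theta = np.deg2rad(angle_deg)
    alpha = scale * np.cos(theta)
    beta = scale * np.sin(theta)
    return np.array([
        [alpha,  beta, (1 - alpha) * cx - beta * cy],
        [-beta, alpha, beta * cx + (1 - alpha) * cy],
    ])

def apply_affine(matrix, point):
    """Map one (x, y) point through a 2x3 affine matrix."""
    x, y = point
    return matrix @ np.array([x, y, 1.0])

M = rotation_matrix_2d(center=(320, 240), angle_deg=90)
center_mapped = apply_affine(M, (320, 240))   # the center stays where it is
right_mapped = apply_affine(M, (330, 240))    # 10 px right of center
print(np.round(center_mapped, 6), np.round(right_mapped, 6))
```

A point 10 pixels to the right of the center ends up 10 pixels above it (smaller y), which is why a positive angle reads as counter-clockwise in image coordinates.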
Interpolation matters more than most people realize when resizing. Here's a quick reference:
- INTER_NEAREST: fastest, but pixelated. Only use for label masks (where you don't want to blend values);
- INTER_LINEAR: good default for upscaling, fast;
- INTER_AREA: best for downscaling (preserves information by averaging);
- INTER_LANCZOS4: highest quality for upscaling, slower.
A common mistake is using INTER_LINEAR (the default) for downscaling. When shrinking an image significantly, INTER_AREA produces noticeably better results because it averages all the source pixels that map to each destination pixel rather than just sampling a few.
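A worst-case pattern makes the difference concrete. The sketch below imitates the two strategies in plain NumPy on a tiny checkerboard: keeping every second pixel versus averaging each 2x2 block. It is a simplified model of sampling versus area averaging, not a reimplementation of OpenCV's interpolation modes.

```python
import numpy as np

# 4x4 checkerboard of 0 and 255: the finest pattern a pixel grid can hold
rows, cols = np.indices((4, 4))
checker = (((rows + cols) % 2) * 255).astype(np.uint8)

# Sampling-style downscale by 2: keep every second pixel, discard the rest
sampled_half = checker[::2, ::2]

# Area-style downscale by 2: average each 2x2 block of source pixels
area_half = checker.reshape(2, 2, 2, 2).mean(axis=(1, 3))

print(sampled_half)  # all 0: the white squares vanished entirely
print(area_half)     # all 127.5: the average gray the pattern really has
```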
Perspective transforms
Sometimes you need to correct perspective distortion -- like straightening a photo of a document taken at an angle. This requires a 4-point perspective transform:
# Source points (corners of the tilted document in the photo)
src_pts = np.float32([
[56, 65], [368, 52], [28, 387], [389, 390]
])
# Destination points (where we want those corners to end up)
dst_pts = np.float32([
[0, 0], [300, 0], [0, 400], [300, 400]
])
M = cv2.getPerspectiveTransform(src_pts, dst_pts)
straightened = cv2.warpPerspective(img, M, (300, 400))
This is the same math behind every "scan document with phone camera" app. Detect the document corners, compute the perspective transform, warp to a flat rectangle. Simple in principle, and the OpenCV implementation is very efficient.
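For the curious, the 3x3 matrix comes from a small linear system: each of the four point pairs contributes two equations, giving eight equations for the eight unknown entries (the ninth is fixed to 1). The following NumPy sketch solves that system directly for the corner points used above; the function names are assumptions for this example, and OpenCV's own solver may be implemented differently.

```python
import numpy as np

def perspective_matrix(src, dst):
    """Solve for the 3x3 homography mapping four src points onto four dst points."""
    A, b = [], []
    for (x, y), (u, v) in zip(src, dst):
        # From u = (h0*x + h1*y + h2) / (h6*x + h7*y + 1), and likewise for v
        A.append([x, y, 1, 0, 0, 0, -u * x, -u * y])
        A.append([0, 0, 0, x, y, 1, -v * x, -v * y])
        b.extend([u, v])
    h = np.linalg.solve(np.array(A, dtype=np.float64),
                        np.array(b, dtype=np.float64))
    return np.append(h, 1.0).reshape(3, 3)

def warp_point(H, point):
    """Apply a homography to one (x, y) point, including the perspective divide."""
    x, y = point
    px, py, pw = H @ np.array([x, y, 1.0])
    return np.array([px / pw, py / pw])

src = [(56, 65), (368, 52), (28, 387), (389, 390)]
dst = [(0, 0), (300, 0), (0, 400), (300, 400)]
H = perspective_matrix(src, dst)
mapped = np.array([warp_point(H, p) for p in src])
print(np.round(mapped, 3))  # each source corner lands on its destination corner
```

The division by the third coordinate is what distinguishes a perspective transform from the affine transforms used for rotation and scaling.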
Building a complete preprocessing pipeline
Real computer vision projects chain multiple operations into a consistent pipeline. Here's a class that takes raw images and produces model-ready tensors:
class ImagePreprocessor:
    """Complete pipeline: load, convert, resize, enhance,
    normalize, and reformat for PyTorch."""

    def __init__(self, target_size=(224, 224), normalize=True):
        self.target_size = target_size
        self.normalize = normalize
        # ImageNet statistics (used by most pretrained vision
        # models); float32 keeps the normalized output float32
        self.mean = np.array([0.485, 0.456, 0.406], dtype=np.float32)
        self.std = np.array([0.229, 0.224, 0.225], dtype=np.float32)
        self.clahe = cv2.createCLAHE(
            clipLimit=2.0, tileGridSize=(8, 8)
        )

    def __call__(self, image_path):
        """Process a single image file."""
        # Step 1: Load
        img = cv2.imread(str(image_path))
        if img is None:
            raise ValueError(
                f"Failed to load image: {image_path}"
            )
        # Step 2: BGR to RGB
        img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
        # Step 3: Resize preserving aspect ratio
        img = self._resize_with_padding(img)
        # Step 4: Contrast enhancement if needed
        if self._is_low_contrast(img):
            img = self._enhance_contrast(img)
        # Step 5: Float conversion [0, 1]
        img = img.astype(np.float32) / 255.0
        # Step 6: Normalize with ImageNet stats
        if self.normalize:
            img = (img - self.mean) / self.std
        # Step 7: HWC to CHW for PyTorch
        img = np.transpose(img, (2, 0, 1))
        return img

    def _resize_with_padding(self, img):
        """Resize to target size, maintaining aspect ratio,
        with zero-padding for the remainder."""
        h, w = img.shape[:2]
        th, tw = self.target_size
        scale = min(tw / w, th / h)
        # round() avoids losing a pixel to floating-point truncation;
        # max(1, ...) guards against extreme aspect ratios
        new_w = max(1, round(w * scale))
        new_h = max(1, round(h * scale))
        interpolation = (cv2.INTER_AREA if scale < 1.0
                         else cv2.INTER_LINEAR)
        resized = cv2.resize(
            img, (new_w, new_h), interpolation=interpolation
        )
        # Center the resized image on a black canvas
        pad_top = (th - new_h) // 2
        pad_left = (tw - new_w) // 2
        padded = np.zeros((th, tw, 3), dtype=np.uint8)
        padded[pad_top:pad_top + new_h,
               pad_left:pad_left + new_w] = resized
        return padded

    def _is_low_contrast(self, img):
        """Detect images that need contrast enhancement."""
        gray = cv2.cvtColor(img, cv2.COLOR_RGB2GRAY)
        return gray.std() < 40.0

    def _enhance_contrast(self, img):
        """Apply CLAHE on the L channel in LAB space."""
        lab = cv2.cvtColor(img, cv2.COLOR_RGB2LAB)
        lab[:, :, 0] = self.clahe.apply(lab[:, :, 0])
        return cv2.cvtColor(lab, cv2.COLOR_LAB2RGB)

    def process_batch(self, image_paths):
        """Process multiple images into a batch tensor."""
        batch = []
        for path in image_paths:
            try:
                processed = self(path)
                batch.append(processed)
            except ValueError as e:
                print(f"Skipping {path}: {e}")
        return np.stack(batch) if batch else np.array([])
Several design decisions here are worth noting:
Aspect ratio preservation with padding instead of stretching. Stretching distorts features -- a circle becomes an oval, a square becomes a rectangle. For models pretrained on ImageNet (where training pipelines typically feed square crops such as 224x224), this distortion can hurt accuracy. Padding maintains the original proportions.
Automatic interpolation selection: INTER_AREA when downscaling, INTER_LINEAR when upscaling. This gives the best quality for each direction without the user having to think about it.
Conditional CLAHE: only applied to low-contrast images (standard deviation below 40). Applying histogram equalization to an image that already has good contrast wastes computation and can actually degrade quality by amplifying noise in already-visible regions.
ImageNet normalization values (mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]). These numbers are worth memorizing if you work with vision models. Most pretrained models you will download (ResNet, VGG, EfficientNet, and many ViT checkpoints) were trained with these statistics, though some checkpoints use different values, so check the model's documentation. If you normalize differently, the first layer receives input it wasn't calibrated for and performance drops. It's a small thing that makes a big difference.
The HWC to CHW transpose at the end is because PyTorch expects channel-first format (C, H, W) while OpenCV and NumPy use channel-last (H, W, C). This is another constant source of shape-mismatch bugs. Get it right in the preprocessing pipeline and you never have to think about it downstream.
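When you want to look at what the model actually receives, it helps to be able to undo the last steps. The sketch below is a NumPy-only round trip for the normalize and transpose stages; the two function names are hypothetical, and the random test image stands in for real data.

```python
import numpy as np

MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)

def to_model_input(rgb_uint8):
    """(H, W, 3) uint8 RGB -> normalized (3, H, W) float32."""
    x = rgb_uint8.astype(np.float32) / 255.0
    x = (x - MEAN) / STD                      # broadcasts over the channel axis
    return np.ascontiguousarray(x.transpose(2, 0, 1))

def to_displayable(chw):
    """Invert to_model_input: (3, H, W) float32 -> (H, W, 3) uint8 RGB."""
    x = chw.transpose(1, 2, 0) * STD + MEAN
    return np.clip(np.round(x * 255.0), 0, 255).astype(np.uint8)

rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(4, 6, 3), dtype=np.uint8)

tensor = to_model_input(image)
restored = to_displayable(tensor)
print(tensor.shape, tensor.dtype)            # (3, 4, 6) float32
print(np.array_equal(restored, image))       # True
```

Keeping the mean and std as float32 matters here: with float64 statistics the result would silently be promoted to float64, which does not match the float32 weights of a typical PyTorch model.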
Putting it all together: a practical workflow
Let's see the pipeline in action with a batch of images:
from pathlib import Path
preprocessor = ImagePreprocessor(
target_size=(224, 224), normalize=True
)
# Process a directory of images
image_dir = Path("./dataset/images")
image_paths = sorted(image_dir.glob("*.jpg"))
# Single image
single = preprocessor(image_paths[0])
print(f"Single image shape: {single.shape}")
# (3, 224, 224) -- CHW format, ready for PyTorch
# Batch
batch = preprocessor.process_batch(image_paths[:32])
print(f"Batch shape: {batch.shape}")
# (32, 3, 224, 224) -- (N, C, H, W)
# Quick stats to verify normalization
print(f"Channel means: {batch.mean(axis=(0, 2, 3))}")
print(f"Channel stds: {batch.std(axis=(0, 2, 3))}")
# Should be roughly 0 mean and 1 std if images are
# representative of the ImageNet distribution; note that
# black padding pulls the channel means below 0
In summary
- Digital images are NumPy arrays: grayscale is (H, W) with values 0-255, color is (H, W, 3) -- and OpenCV loads BGR, not RGB, so always convert;
- color spaces (HSV, LAB, grayscale) separate different kinds of information: HSV for color detection, LAB for perceptual operations, grayscale when color is irrelevant;
- convolution kernels perform image filtering: blur smooths noise, sharpen enhances detail, Sobel and Canny detect edges -- CNNs discover these same operations automatically through training;
- histogram equalization improves contrast globally, while CLAHE adapts locally to handle uneven lighting -- critical for medical imaging and autonomous driving;
- geometric transforms (resize, crop, rotate, perspective warp) with proper interpolation choices bridge the gap between raw images and fixed-size model input;
- a complete preprocessing pipeline chains all these steps (load, color convert, resize with padding, contrast enhance, normalize, transpose) into a reusable class that produces model-ready tensors.
This is the groundwork for everything we'll build in the coming episodes. Once you can reliably load, transform, and normalize images, the next step is teaching models to find specific things within those images -- not just "this is a cat" but "the cat is here, at these coordinates, with this bounding box." That's a fundamentally different problem, and it needs its own set of architectures and techniques.
Exercises
Exercise 1: Build a color histogram analyzer. Write a class that loads an image, computes histograms for each channel (R, G, B) independently, and also computes a combined intensity histogram. Add a method that compares two images by computing the correlation between their histograms (use cv2.compareHist with cv2.HISTCMP_CORREL). Test it by comparing photos of similar scenes versus completely different scenes.
Exercise 2: Implement a multi-kernel edge detector. Write a function that applies four different edge-detection kernels to the same image: Sobel X, Sobel Y, Laplacian, and a custom diagonal kernel. Combine the results into a single edge map by taking the maximum response at each pixel. Compare your combined result against cv2.Canny -- which produces cleaner edges and why?
Exercise 3: Build an image augmentation pipeline. Create a class with methods for random horizontal flip (50% chance), random rotation (between -15 and +15 degrees), random brightness adjustment (multiply by a factor between 0.7 and 1.3), and random crop-and-resize (crop 80-100% of the area, then resize back to original dimensions). Apply all augmentations sequentially to produce varied training examples from a single source image. Generate 10 augmented versions and verify that pixel statistics (mean, std) remain reasonable after augmentation.