Learn Zig Series (#45) - Image Tool: Pixel Operations
Project C: Image Manipulation Tool (2/3)
What will I learn
- You will learn grayscale conversion using the weighted luminance formula (0.299R + 0.587G + 0.114B);
- You will learn pixel inversion by subtracting each channel from 255;
- You will learn brightness and contrast adjustment with clamping;
- You will learn box blur by averaging neighboring pixels with a kernel;
- You will learn edge detection using Sobel-like gradient operators;
- You will learn sepia tone and color channel manipulation;
- You will learn when to operate on pixels in-place versus allocating new buffers;
- You will learn SIMD-accelerated pixel operations using Zig's @Vector from episode 19.
Requirements
- A working modern computer running macOS, Windows or Ubuntu;
- An installed Zig 0.14+ distribution (download from ziglang.org);
- The ambition to learn Zig programming.
Difficulty
- Intermediate
Curriculum (of the Learn Zig Series):
- Zig Programming Tutorial - ep001 - Intro
- Learn Zig Series (#2) - Hello Zig, Variables and Types
- Learn Zig Series (#3) - Functions and Control Flow
- Learn Zig Series (#4) - Error Handling (Zig's Best Feature)
- Learn Zig Series (#5) - Arrays, Slices, and Strings
- Learn Zig Series (#6) - Structs, Enums, and Tagged Unions
- Learn Zig Series (#7) - Memory Management and Allocators
- Learn Zig Series (#8) - Pointers and Memory Layout
- Learn Zig Series (#9) - Comptime (Zig's Superpower)
- Learn Zig Series (#10) - Project Structure, Modules, and File I/O
- Learn Zig Series (#11) - Mini Project: Building a Step Sequencer
- Learn Zig Series (#12) - Testing and Test-Driven Development
- Learn Zig Series (#13) - Interfaces via Type Erasure
- Learn Zig Series (#14) - Generics with Comptime Parameters
- Learn Zig Series (#15) - The Build System (build.zig)
- Learn Zig Series (#16) - Sentinel-Terminated Types and C Strings
- Learn Zig Series (#17) - Packed Structs and Bit Manipulation
- Learn Zig Series (#18) - Async Concepts and Event Loops
- Learn Zig Series (#18b) - Addendum: Async Returns in Zig 0.16
- Learn Zig Series (#19) - SIMD with @Vector
- Learn Zig Series (#20) - Working with JSON
- Learn Zig Series (#21) - Networking and TCP Sockets
- Learn Zig Series (#22) - Hash Maps and Data Structures
- Learn Zig Series (#23) - Iterators and Lazy Evaluation
- Learn Zig Series (#24) - Logging, Formatting, and Debug Output
- Learn Zig Series (#25) - Mini Project: HTTP Status Checker
- Learn Zig Series (#26) - Writing a Custom Allocator
- Learn Zig Series (#27) - C Interop: Calling C from Zig
- Learn Zig Series (#28) - C Interop: Exposing Zig to C
- Learn Zig Series (#29) - Inline Assembly and Low-Level Control
- Learn Zig Series (#30) - Thread Safety and Atomics
- Learn Zig Series (#31) - Memory-Mapped I/O and Files
- Learn Zig Series (#32) - Compile-Time Reflection with @typeInfo
- Learn Zig Series (#33) - Building a State Machine with Tagged Unions
- Learn Zig Series (#34) - Performance Profiling and Optimization
- Learn Zig Series (#35) - Cross-Compilation and Target Triples
- Learn Zig Series (#36) - Mini Project: CLI Task Runner
- Learn Zig Series (#37) - Markdown to HTML: Tokenizer and Lexer
- Learn Zig Series (#38) - Markdown to HTML: Parser and AST
- Learn Zig Series (#39) - Markdown to HTML: Renderer and CLI
- Learn Zig Series (#40) - Key-Value Store: In-Memory Store
- Learn Zig Series (#41) - Key-Value Store: Write-Ahead Log
- Learn Zig Series (#42) - Key-Value Store: TCP Server
- Learn Zig Series (#43) - Key-Value Store: Client Library and Benchmarks
- Learn Zig Series (#44) - Image Tool: Reading and Writing PPM/BMP
- Learn Zig Series (#45) - Image Tool: Pixel Operations (this post)
Last episode we built the I/O layer for our image tool -- reading and writing PPM and BMP files, with an Image struct that holds raw RGB pixel data in a flat []u8 buffer. Now comes the fun part: actually doing things with those pixels. Brightness, contrast, grayscale, blur, edge detection, sepia -- the classic image processing operations that turn a boring photo into something interesting (or at least differently boring).
This episode is all about the math and memory patterns behind pixel manipulation. Every operation here boils down to the same fundamental thing: iterate over pixels, read channel values, apply a formula, write the results. But the details matter a lot. Some operations can modify pixels in-place. Some need a second buffer because reading neighbors while simultaneously writing to those neighbors would corrupt the result. And some operations are embarrassingly parallel -- perfect candidates for the SIMD @Vector tricks we learned back in episode 19.
Let's start with the simplest operations and build up to spatial filters.
Per-pixel operations: the simple ones
The easiest category of image operations are point operations -- transformations where each output pixel depends only on the corresponding input pixel. No neighbors, no kernel, no spatial context. Just a function f(r, g, b) -> (r', g', b') applied to every pixel independently.
These are all safe to do in-place: since we only read a pixel once and write it once, we can overwrite the input buffer without corrupting anything.
Grayscale
Converting a color image to grayscale means reducing three channels (R, G, B) to a single luminance value, then storing that value in all three channels. The naive approach would be to average them: (r + g + b) / 3. That works, but it looks wrong -- human eyes are much more sensitive to green light than red, and much more sensitive to red than blue. A pure green pixel and a pure blue pixel have the same average, but green looks WAY brighter to us.
The standard fix is the ITU-R BT.601 luminance formula: Y = 0.299 * R + 0.587 * G + 0.114 * B. These weights come from how CRT phosphors and human cone cells interact (it's a long story involving television standards from the 1950s). The point is: green gets the most weight because the eye's brightness response peaks in the green part of the spectrum.
In Zig, we need to be careful about the arithmetic. Our channel values are u8 (0-255), the weights are fractional, and we want integer arithmetic for speed. The standard trick is to multiply by 256-scaled integer weights and then shift right by 8:
pub fn grayscale(img: *Image) void {
const pixel_count = img.pixelCount();
var i: usize = 0;
while (i < pixel_count) : (i += 1) {
const idx = i * 3;
const r = @as(u16, img.pixels[idx]);
const g = @as(u16, img.pixels[idx + 1]);
const b = @as(u16, img.pixels[idx + 2]);
// BT.601 luminance: 0.299*R + 0.587*G + 0.114*B
// Scaled by 256: 77*R + 150*G + 29*B, then >> 8
const lum: u8 = @intCast((r * 77 + g * 150 + b * 29) >> 8);
img.pixels[idx] = lum;
img.pixels[idx + 1] = lum;
img.pixels[idx + 2] = lum;
}
}
Why u16 for the intermediates? Because 255 * 77 = 19,635 which fits in a u16 (max 65,535). The sum of all three channels times their weights is at most 255 * (77 + 150 + 29) = 255 * 256 = 65,280, which also fits in u16. After the right shift by 8 we're back in u8 range. No overflow, no floating point, no rounding surprises.
The weights 77, 150, 29 sum to 256, which is exactly 2^8. This is intentional -- it means our integer approximation produces the same result as the floating-point formula (within +/- 1 from rounding). If they summed to 255 or 257 you'd get a consistent brightness bias across the entire image.
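The "within +/- 1" claim is easy to check for yourself. Here's a minimal sketch that puts the integer formula next to a truncated f32 reference. The names lumInt and lumFloat are made up for this example; they are not part of the series' operations.zig.

```zig
const std = @import("std");

/// Integer BT.601 luminance: weights scaled by 256 (77 + 150 + 29 = 256).
fn lumInt(r: u8, g: u8, b: u8) u8 {
    const sum = @as(u16, r) * 77 + @as(u16, g) * 150 + @as(u16, b) * 29;
    return @intCast(sum >> 8);
}

/// Floating-point reference, truncated the same way the shift truncates.
fn lumFloat(r: u8, g: u8, b: u8) u8 {
    const y = 0.299 * @as(f32, @floatFromInt(r)) +
        0.587 * @as(f32, @floatFromInt(g)) +
        0.114 * @as(f32, @floatFromInt(b));
    // Guard against float noise pushing pure white slightly above 255.
    return @intFromFloat(@min(y, 255.0));
}

pub fn main() void {
    // Pure green and pure blue have the same plain average (85),
    // but very different luminance.
    std.debug.print("green: int={d} float={d}\n", .{ lumInt(0, 255, 0), lumFloat(0, 255, 0) });
    std.debug.print("blue:  int={d} float={d}\n", .{ lumInt(0, 0, 255), lumFloat(0, 0, 255) });
}
```

Pure green comes out at 149 and pure blue at 28 with the integer weights, which is exactly the brightness gap the plain average throws away.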
Invert
Inverting is the simplest possible pixel operation: output = 255 - input for each channel. It flips the color space so black becomes white, red becomes cyan, and so on. Photographic negatives, basically.
pub fn invert(img: *Image) void {
for (img.pixels) |*pixel| {
pixel.* = 255 - pixel.*;
}
}
That's it. Three lines of actual logic. We iterate over every byte in the pixel buffer (not every pixel -- every byte) because the operation is identical for all three channels. Zig's for loop with pointer capture (|*pixel|) gives us a mutable reference to each element, so pixel.* = 255 - pixel.* reads the value, subtracts from 255, and writes it back in place.
This is one of those cases where working with a flat []u8 buffer pays off. We don't need getPixel/setPixel or any coordinate math. The buffer is just bytes, the operation works on bytes, done.
Brightness
Brightness adjustment multiplies each channel by a factor. A factor of 1.0 means no change, less than 1.0 darkens, greater than 1.0 brightens. The tricky part is clamping: if a channel value of 200 gets multiplied by 1.5, the result is 300 -- which doesn't fit in a u8. We need to clamp to 255.
pub fn brightness(img: *Image, factor: f32) void {
for (img.pixels) |*pixel| {
const val = @as(f32, @floatFromInt(pixel.*)) * factor;
pixel.* = if (val > 255.0) 255 else if (val < 0.0) 0 else @intFromFloat(val);
}
}
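We convert to f32 for the multiplication, then clamp and convert back. The val < 0.0 check handles the case where someone passes a negative factor: every product is then zero or negative, so the whole image clamps to black. That's rarely what anyone wants, but the clamp keeps the result well-defined. Without it, @intFromFloat on a value outside the u8 range is illegal behavior (a panic in safe builds, undefined behavior in ReleaseFast).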
Could we do this with integer math? Yes -- if the factor is always a simple fraction like 0.5 or 2.0. But for arbitrary floating-point factors like 1.37 you'd need fixed-point arithmetic which is more code for no real gain. The f32 conversion is cheap on any modern CPU.
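For the curious, this is roughly what the fixed-point route could look like. It's a sketch under assumptions: brightnessFixed and toFixed are hypothetical helpers (not part of operations.zig), the factor is stored in 8.8 fixed point (256 means 1.0), and negative factors are simply treated as 0.

```zig
const std = @import("std");

/// Convert a float factor to 8.8 fixed point once, outside the pixel loop.
/// Factors are limited to [0, 255] so the result fits in a u16.
fn toFixed(factor: f32) u16 {
    const clamped = std.math.clamp(factor, 0.0, 255.0);
    return @intFromFloat(@round(clamped * 256.0));
}

/// Brightness with an 8.8 fixed-point factor: 256 = 1.0, 384 = 1.5, 128 = 0.5.
fn brightnessFixed(pixels: []u8, factor_fp: u16) void {
    for (pixels) |*pixel| {
        // u32 is wide enough: 255 * 65535 is far below 2^32.
        const scaled = (@as(u32, pixel.*) * factor_fp) >> 8;
        pixel.* = @intCast(@min(scaled, 255));
    }
}

pub fn main() void {
    var px = [_]u8{ 0, 100, 200, 255 };
    brightnessFixed(&px, toFixed(1.5));
    std.debug.print("{any}\n", .{px});
}
```

The inner loop is integer-only, but you pay for it with a conversion step and a precision limit of 1/256 on the factor. For a learning tool the f32 version stays the simpler choice.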
Contrast
Contrast adjustment scales the distance of each channel from the midpoint (128). Low contrast pulls everything towards gray. High contrast pushes everything towards black or white. The formula: output = 128 + factor * (input - 128).
pub fn contrast(img: *Image, factor: f32) void {
for (img.pixels) |*pixel| {
const val = 128.0 + (@as(f32, @floatFromInt(pixel.*)) - 128.0) * factor;
pixel.* = if (val > 255.0) 255 else if (val < 0.0) 0 else @intFromFloat(val);
}
}
Same pattern as brightness: convert, compute, clamp, convert back. A factor of 0 collapses everything to 128 (flat gray). A factor of 1 does nothing. A factor of 2 doubles the distance from midpoint, making darks darker and lights lighter. Large factors create a posterization effect where most pixels clip to either 0 or 255.
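To make those factor values concrete, here's a small sketch that applies the contrast formula to a handful of sample values. contrastValue is an illustrative single-channel helper (an assumption for this example), using std.math.clamp instead of the inline if/else chain.

```zig
const std = @import("std");

/// Contrast for one channel value: 128 + factor * (value - 128), clamped to 0..255.
fn contrastValue(value: u8, factor: f32) u8 {
    const v = 128.0 + (@as(f32, @floatFromInt(value)) - 128.0) * factor;
    return @intFromFloat(std.math.clamp(v, 0.0, 255.0));
}

pub fn main() void {
    const samples = [_]u8{ 0, 64, 128, 200, 255 };
    const factors = [_]f32{ 0.0, 0.5, 1.0, 2.0 };
    for (factors) |f| {
        std.debug.print("factor {d:.1}:", .{f});
        for (samples) |s| std.debug.print(" {d}", .{contrastValue(s, f)});
        std.debug.print("\n", .{});
    }
}
```

With factor 2.0 the value 200 would land on 272 and clips to 255, while 100 drops to 72. That clipping at both ends is where the harsh look of very high contrast comes from.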
Sepia
Sepia gives photos that old-timey brownish-yellow tint. It's implemented as a matrix transformation on the RGB channels -- each output channel is a weighted sum of all three input channels:
new_r = 0.393*R + 0.769*G + 0.189*B
new_g = 0.349*R + 0.686*G + 0.168*B
new_b = 0.272*R + 0.534*G + 0.131*B
These coefficients are the standard sepia matrix from the W3C CSS filter specification. The red channel gets the highest weights (warmest), blue gets the lowest (coolest), giving that characteristic warm brown.
pub fn sepia(img: *Image) void {
const pixel_count = img.pixelCount();
var i: usize = 0;
while (i < pixel_count) : (i += 1) {
const idx = i * 3;
const r = @as(f32, @floatFromInt(img.pixels[idx]));
const g = @as(f32, @floatFromInt(img.pixels[idx + 1]));
const b = @as(f32, @floatFromInt(img.pixels[idx + 2]));
const new_r = r * 0.393 + g * 0.769 + b * 0.189;
const new_g = r * 0.349 + g * 0.686 + b * 0.168;
const new_b = r * 0.272 + g * 0.534 + b * 0.131;
img.pixels[idx] = clampU8(new_r);
img.pixels[idx + 1] = clampU8(new_g);
img.pixels[idx + 2] = clampU8(new_b);
}
}
fn clampU8(val: f32) u8 {
if (val > 255.0) return 255;
if (val < 0.0) return 0;
return @intFromFloat(val);
}
I pulled out the clamping into a helper function because we're using it on three channels per pixel now and the inline if/else chain was getting messy. The clampU8 function is small enough that the optimizer will almost certainly inline it in release builds anyway.
Notice the pattern: we read all three input channels first, compute all three outputs, then write all three back. This is important because sepia is NOT channel-independent -- new_r depends on input g and b. If we wrote new_r back to pixels[idx] before computing new_g, we'd be using the wrong red value. For in-place point operations, always read all inputs before writing any outputs when channels are mixed.
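Here's a small sketch of what goes wrong if you ignore that rule. Both functions operate on a single [3]u8 pixel; sepiaPixelBuggy is deliberately broken and exists only for this comparison (neither function is part of operations.zig).

```zig
const std = @import("std");

fn clampU8(val: f32) u8 {
    return @intFromFloat(std.math.clamp(val, 0.0, 255.0));
}

fn toF32(v: u8) f32 {
    return @floatFromInt(v);
}

/// Correct: read all three inputs first, then write all three outputs.
fn sepiaPixel(p: *[3]u8) void {
    const r = toF32(p[0]);
    const g = toF32(p[1]);
    const b = toF32(p[2]);
    p[0] = clampU8(r * 0.393 + g * 0.769 + b * 0.189);
    p[1] = clampU8(r * 0.349 + g * 0.686 + b * 0.168);
    p[2] = clampU8(r * 0.272 + g * 0.534 + b * 0.131);
}

/// Deliberately wrong: red is overwritten, then re-read for green and blue.
fn sepiaPixelBuggy(p: *[3]u8) void {
    p[0] = clampU8(toF32(p[0]) * 0.393 + toF32(p[1]) * 0.769 + toF32(p[2]) * 0.189);
    p[1] = clampU8(toF32(p[0]) * 0.349 + toF32(p[1]) * 0.686 + toF32(p[2]) * 0.168);
    p[2] = clampU8(toF32(p[0]) * 0.272 + toF32(p[1]) * 0.534 + toF32(p[2]) * 0.131);
}

pub fn main() void {
    var good = [3]u8{ 100, 150, 200 };
    var bad = good;
    sepiaPixel(&good);
    sepiaPixelBuggy(&bad);
    std.debug.print("correct: {any}\nbuggy:   {any}\n", .{ good, bad });
}
```

The red channel matches in both versions, but green and blue drift noticeably brighter in the buggy one because they are computed from the already-sepia'd red value.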
Color channel manipulation
Sometimes you want to isolate or boost individual color channels. Red channel only, drop the blue, double the green -- these are useful for debugging (seeing what's actually in each channel) and for artistic effects.
pub fn adjustChannels(img: *Image, r_factor: f32, g_factor: f32, b_factor: f32) void {
const pixel_count = img.pixelCount();
var i: usize = 0;
while (i < pixel_count) : (i += 1) {
const idx = i * 3;
const r = @as(f32, @floatFromInt(img.pixels[idx])) * r_factor;
const g = @as(f32, @floatFromInt(img.pixels[idx + 1])) * g_factor;
const b = @as(f32, @floatFromInt(img.pixels[idx + 2])) * b_factor;
img.pixels[idx] = clampU8(r);
img.pixels[idx + 1] = clampU8(g);
img.pixels[idx + 2] = clampU8(b);
}
}
Call adjustChannels(img, 1.0, 0.0, 0.0) to extract the red channel only. Call adjustChannels(img, 1.0, 1.0, 0.5) to halve the blue. Call adjustChannels(img, 0.0, 1.2, 0.0) to boost the green and kill everything else. It's a general-purpose tool and its simplicity is the whole point.
Spatial operations: when neighbors matter
Now things get more interesting. Spatial filters (also called kernel operations or convolution filters) compute each output pixel from a neighborhood of input pixels -- typically a 3x3 or 5x5 window centered on the target pixel. This is where blur, sharpen, and edge detection come from.
The critical difference from point operations: spatial filters CANNOT safely operate in-place. Think about it -- if you're computing the output for pixel (5, 3) by averaging its neighbors (4,2), (5,2), (6,2), (4,3), (5,3), (6,3), (4,4), (5,4), (6,4), and earlier iterations have already written new values to the whole row above, (4,2), (5,2), (6,2), and to the left neighbor (4,3), you're mixing old and new values. The result is garbage.
The standard solution is double-buffering: read from the source image, write to a separate output buffer. After the operation completes, either swap the buffers or copy the output back to the source.
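The corruption is easiest to see in one dimension. The sketch below runs a 3-tap mean filter over a tiny row of bytes, once in-place and once with a separate output buffer. mean3, blurInPlace and blurDoubleBuffered are illustrative names for this example only.

```zig
const std = @import("std");

/// 3-tap mean of src around index i, shrinking the window at both ends.
/// Assumes src is not empty.
fn mean3(src: []const u8, i: usize) u8 {
    const lo = if (i == 0) 0 else i - 1;
    const hi = @min(i + 1, src.len - 1);
    var sum: u32 = 0;
    for (src[lo .. hi + 1]) |v| sum += v;
    return @intCast(sum / (hi - lo + 1));
}

/// Wrong for spatial filters: later outputs read already-blurred neighbors.
fn blurInPlace(pixels: []u8) void {
    for (0..pixels.len) |i| pixels[i] = mean3(pixels, i);
}

/// Correct: every output is computed from the untouched source.
fn blurDoubleBuffered(src: []const u8, dst: []u8) void {
    for (0..src.len) |i| dst[i] = mean3(src, i);
}

pub fn main() void {
    const src = [_]u8{ 0, 90, 0, 0 };
    var in_place = src;
    var out: [src.len]u8 = undefined;
    blurInPlace(&in_place);
    blurDoubleBuffered(&src, &out);
    std.debug.print("in-place: {any}\nbuffered: {any}\n", .{ in_place, out });
}
```

The buffered version spreads the bright pixel symmetrically. The in-place version smears it to the right, because every step reads the blurred value the previous step just wrote.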
Box blur
A box blur (also called a mean filter) replaces each pixel with the average of itself and its neighbors. For a 3x3 kernel, that's 9 pixels averaged together. For a 5x5 kernel, 25 pixels. Larger kernels produce more blur.
pub fn boxBlur(img: *Image, radius: u32) !void {
if (radius == 0) return;
const width = img.width;
const height = img.height;
const allocator = img.allocator;
// Allocate output buffer
const buf = try allocator.alloc(u8, img.pixels.len);
defer allocator.free(buf);
// A full kernel covers (2 * radius + 1)^2 samples (the kernel_area), but we
// divide by the number of in-bounds samples (count) so edges work too.
for (0..height) |y_usize| {
for (0..width) |x_usize| {
var sum_r: u32 = 0;
var sum_g: u32 = 0;
var sum_b: u32 = 0;
var count: u32 = 0;
// Sample the kernel neighborhood
var ky: i64 = -@as(i64, @intCast(radius));
while (ky <= @as(i64, @intCast(radius))) : (ky += 1) {
var kx: i64 = -@as(i64, @intCast(radius));
while (kx <= @as(i64, @intCast(radius))) : (kx += 1) {
const sx = @as(i64, @intCast(x_usize)) + kx;
const sy = @as(i64, @intCast(y_usize)) + ky;
// Bounds check -- skip pixels outside the image
if (sx >= 0 and sx < @as(i64, @intCast(width)) and
sy >= 0 and sy < @as(i64, @intCast(height)))
{
const src_x: usize = @intCast(sx);
const src_y: usize = @intCast(sy);
const src_idx = (src_y * @as(usize, width) + src_x) * 3;
sum_r += img.pixels[src_idx];
sum_g += img.pixels[src_idx + 1];
sum_b += img.pixels[src_idx + 2];
count += 1;
}
}
}
const dst_idx = (y_usize * @as(usize, width) + x_usize) * 3;
buf[dst_idx] = @intCast(sum_r / count);
buf[dst_idx + 1] = @intCast(sum_g / count);
buf[dst_idx + 2] = @intCast(sum_b / count);
}
}
// Copy result back to source
@memcpy(img.pixels, buf);
}
A few things worth calling out here.
The bounds checking strategy: at image edges, the kernel extends beyond the image boundary. We handle this by simply skipping out-of-bounds samples and dividing by the actual number of samples used (count) instead of the theoretical kernel_area. This means edge pixels get averaged over fewer samples, which is the "shrink kernel at edges" approach. Alternatives include clamping coordinates to the nearest edge pixel (extending the border), wrapping around (tileable textures), or reflecting the image at boundaries. All are valid -- we picked the simplest one.
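For comparison, here's a rough sketch of the three alternative border strategies as coordinate-mapping helpers. The names clampCoord, wrapCoord and reflectCoord are assumptions for this example. Each one takes a possibly out-of-range signed coordinate and returns a valid index, so the kernel loop could sample every position without a bounds branch.

```zig
const std = @import("std");

/// Clamp-to-edge: out-of-range coordinates snap to the nearest valid index.
/// Assumes size >= 1.
fn clampCoord(c: i64, size: usize) usize {
    const max: i64 = @intCast(size - 1);
    return @intCast(std.math.clamp(c, 0, max));
}

/// Wrap-around: coordinates repeat with period `size` (tileable textures).
fn wrapCoord(c: i64, size: usize) usize {
    return @intCast(@mod(c, @as(i64, @intCast(size))));
}

/// Reflect without repeating the edge pixel: -1 maps to 1, size maps to size - 2.
fn reflectCoord(c: i64, size: usize) usize {
    if (size == 1) return 0;
    const period: i64 = @intCast(2 * (size - 1));
    const n: i64 = @intCast(size);
    const m = @mod(c, period);
    return @intCast(if (m < n) m else period - m);
}

pub fn main() void {
    const coords = [_]i64{ -2, -1, 0, 4, 5, 6 };
    for (coords) |c| {
        std.debug.print("{d}: clamp={d} wrap={d} reflect={d}\n", .{
            c, clampCoord(c, 5), wrapCoord(c, 5), reflectCoord(c, 5),
        });
    }
}
```

With any of these, every pixel averages the full kernel, so the divisor is constant. The price is that edge pixels get counted more than once (clamp, reflect) or pull in data from the opposite side of the image (wrap).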
The signed arithmetic: pixel coordinates are usize (unsigned) but kernel offsets are negative. We cast everything to i64 for the arithmetic and then bounds-check before casting back. This is one of those places where Zig's strict integer types force you to think about the signedness, which is annoying but correct. In C you'd just use int for everything and hope nothing wraps -- in Zig you explicitly handle it.
The double buffer: we allocate buf at the start, write all results there, then @memcpy back to img.pixels at the end. The defer allocator.free(buf) ensures the temporary buffer gets freed even if something goes wrong. This is classic Zig resource management -- we saw the same pattern in episode 7.
Performance note: this naive box blur is O(width * height * kernel_size^2). For a radius-10 blur on a 1920x1080 image, that's 1920 * 1080 * 21 * 21 ≈ 914 million operations. Not great. The standard optimization is a separable filter: blur horizontally first (1D kernel), then blur vertically on the result. This reduces the complexity from O(n * k^2) to O(n * 2k). We won't implement that here because the naive version is clearer for learning, but if you ever need to blur large images in production -- use the separable approach, it's dramatically faster.
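To give an idea of the shape of the separable approach, here's a minimal sketch on a single-channel (one byte per pixel) buffer. It's not a drop-in replacement for boxBlur: blurPass and boxBlurSeparable are hypothetical names, the Image struct is left out to keep the example self-contained, and because each pass truncates its own integer division the output can differ from the naive 2D version by 1 in some pixels.

```zig
const std = @import("std");

/// One 1D box-blur pass. `horizontal` selects whether the window runs along x or y.
/// The window shrinks at the borders, just like the naive version.
fn blurPass(src: []const u8, dst: []u8, width: usize, height: usize, radius: usize, horizontal: bool) void {
    for (0..height) |y| {
        for (0..width) |x| {
            const pos = if (horizontal) x else y;
            const limit = if (horizontal) width else height;
            const lo = pos -| radius; // saturating subtraction stops at 0
            const hi = @min(pos + radius, limit - 1);
            var sum: u32 = 0;
            for (lo..hi + 1) |k| {
                const idx = if (horizontal) y * width + k else k * width + x;
                sum += src[idx];
            }
            dst[y * width + x] = @intCast(sum / (hi - lo + 1));
        }
    }
}

/// Horizontal pass into a temporary buffer, vertical pass back into `pixels`.
fn boxBlurSeparable(allocator: std.mem.Allocator, pixels: []u8, width: usize, height: usize, radius: usize) !void {
    const tmp = try allocator.alloc(u8, pixels.len);
    defer allocator.free(tmp);
    blurPass(pixels, tmp, width, height, radius, true);
    blurPass(tmp, pixels, width, height, radius, false);
}

pub fn main() !void {
    // 3x3 image with a single bright pixel in the middle.
    var img = [_]u8{ 0, 0, 0, 0, 90, 0, 0, 0, 0 };
    try boxBlurSeparable(std.heap.page_allocator, &img, 3, 3, 1);
    std.debug.print("{any}\n", .{img});
}
```

Each pixel now costs two passes of 2 * radius + 1 samples instead of one pass of (2 * radius + 1)^2. For radius 10 that's 42 samples per pixel instead of 441.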
Edge detection with Sobel operators
Edge detection finds boundaries in images -- places where pixel values change rapidly. The Sobel operator does this by computing two directional gradients (horizontal and vertical) and combining them.
The Sobel kernels are:
Horizontal (Gx): Vertical (Gy):
-1 0 +1 -1 -2 -1
-2 0 +2 0 0 0
-1 0 +1 +1 +2 +1
For each pixel, we convolve the neighborhood with both kernels to get gx and gy, then compute the gradient magnitude: magnitude = sqrt(gx^2 + gy^2). The magnitude tells us how "edgy" the pixel is -- high magnitude means a strong edge.
pub fn sobelEdge(img: *Image) !void {
const width = img.width;
const height = img.height;
const allocator = img.allocator;
// First convert to grayscale -- edge detection works on luminance
grayscale(img);
const buf = try allocator.alloc(u8, img.pixels.len);
defer allocator.free(buf);
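@memset(buf, 0);
// Images smaller than 3x3 have no interior pixels, and `height - 1` or
// `width - 1` below would underflow for empty images: return all black.
if (width < 3 or height < 3) {
@memcpy(img.pixels, buf);
return;
}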
// Skip border pixels (1px border)
for (1..height - 1) |y| {
for (1..width - 1) |x| {
// Sample 3x3 neighborhood (grayscale, so just read R channel)
const tl = @as(i32, img.pixels[((y - 1) * @as(usize, width) + (x - 1)) * 3]);
const tc = @as(i32, img.pixels[((y - 1) * @as(usize, width) + x) * 3]);
const tr = @as(i32, img.pixels[((y - 1) * @as(usize, width) + (x + 1)) * 3]);
const ml = @as(i32, img.pixels[(y * @as(usize, width) + (x - 1)) * 3]);
const mr = @as(i32, img.pixels[(y * @as(usize, width) + (x + 1)) * 3]);
const bl = @as(i32, img.pixels[((y + 1) * @as(usize, width) + (x - 1)) * 3]);
const bc = @as(i32, img.pixels[((y + 1) * @as(usize, width) + x) * 3]);
const br = @as(i32, img.pixels[((y + 1) * @as(usize, width) + (x + 1)) * 3]);
// Sobel X gradient
const gx = (-tl + tr) + (-2 * ml + 2 * mr) + (-bl + br);
// Sobel Y gradient
const gy = (-tl - 2 * tc - tr) + (bl + 2 * bc + br);
// Magnitude (using absolute sum instead of sqrt for speed)
var mag = @abs(gx) + @abs(gy);
if (mag > 255) mag = 255;
const dst_idx = (y * @as(usize, width) + x) * 3;
const edge_val: u8 = @intCast(mag);
buf[dst_idx] = edge_val;
buf[dst_idx + 1] = edge_val;
buf[dst_idx + 2] = edge_val;
}
}
@memcpy(img.pixels, buf);
}
A few design decisions here:
Grayscale first: we convert the entire image to grayscale before running edge detection. You could run Sobel on each color channel independently and combine the results, but for most use cases grayscale-first produces better results with less computation. The edges we care about are luminance edges, not color edges.
Approximate magnitude: instead of sqrt(gx*gx + gy*gy) we use |gx| + |gy|. This is a well-known approximation called the "Manhattan distance" or "L1 norm" of the gradient. It's not geometrically correct (it overestimates diagonal edges by up to a factor of sqrt(2)), but it's fast and visually pretty close. Some libraries offer both: OpenCV's Canny detector, for example, uses the L1 norm unless you explicitly ask for the Euclidean (L2) norm.
Border handling: we skip the 1-pixel border entirely (leaving it black). For a 1920x1080 image you lose 1 pixel on each edge (two rows and two columns in total) -- nobody notices. The alternative is bounds-checking every sample (like the blur does), which adds branching to the inner loop. For an edge detector where performance matters and borders are negligible, just skip them.
i32 intermediates: Sobel gradients can be negative (that's the whole point -- they detect transitions in both directions). We need signed arithmetic. The maximum possible gx value is 255 * 4 = 1020 and minimum is -1020, well within i32 range.
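To see the numbers behind these decisions, here's a sketch that applies both Sobel kernels to a single 3x3 grayscale patch and compares the two magnitude formulas. The Gradient struct and the sobel, magnitudeL1 and magnitudeL2 functions are illustrative names, not part of operations.zig.

```zig
const std = @import("std");

const Gradient = struct { gx: i32, gy: i32 };

/// Apply both Sobel kernels to a 3x3 patch stored row by row:
/// indices 0 1 2 / 3 4 5 / 6 7 8.
fn sobel(patch: [9]u8) Gradient {
    var v: [9]i32 = undefined;
    for (patch, 0..) |byte, i| v[i] = byte; // widen to signed before subtracting
    return .{
        .gx = (v[2] - v[0]) + 2 * (v[5] - v[3]) + (v[8] - v[6]),
        .gy = (v[6] - v[0]) + 2 * (v[7] - v[1]) + (v[8] - v[2]),
    };
}

/// Fast approximation: |gx| + |gy|.
fn magnitudeL1(g: Gradient) u32 {
    return @abs(g.gx) + @abs(g.gy);
}

/// Geometrically correct magnitude: sqrt(gx^2 + gy^2).
fn magnitudeL2(g: Gradient) f32 {
    const x: f32 = @floatFromInt(g.gx);
    const y: f32 = @floatFromInt(g.gy);
    return @sqrt(x * x + y * y);
}

pub fn main() void {
    // Hard vertical edge: black on the left, white on the right.
    const edge = sobel(.{ 0, 0, 255, 0, 0, 255, 0, 0, 255 });
    std.debug.print("edge: gx={d} gy={d}\n", .{ edge.gx, edge.gy });

    // A perfectly diagonal gradient is where L1 and L2 disagree the most.
    const diag = Gradient{ .gx = 100, .gy = 100 };
    std.debug.print("diag: L1={d} L2={d:.2}\n", .{ magnitudeL1(diag), magnitudeL2(diag) });
}
```

The hard vertical edge hits the maximum gx of 1020 with gy at 0, which is why the result has to be clamped before it goes back into a u8.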
In-place vs new buffer: the decision framework
By now you've seen both patterns. Here's the rule of thumb:
In-place is safe when: each output pixel depends ONLY on the corresponding input pixel. Point operations like grayscale, invert, brightness, contrast, sepia are all safe in-place. You read pixel (x,y), compute, write pixel (x,y). No other pixel is touched.
New buffer is required when: output pixels depend on neighboring input pixels. Blur, sharpen, edge detection, any convolution -- you need a separate output buffer. If you write to the source while reading neighbors from it, you'll read a mix of old and new values.
There's a middle ground too: separable operations where you can process rows independently. Horizontal blur, for example, only reads from the same row -- so you could process rows one at a time, using a single row buffer instead of a full image buffer. This saves memory for large images.
Here's the memory impact. For a 4K image (3840 x 2160 x 3 channels = 24,883,200 bytes ≈ 24.9 MB, or about 23.7 MiB):
| Strategy | Extra memory | Use case |
|---|---|---|
| In-place | 0 bytes | Point operations |
| Full copy | ~24.9 MB (24,883,200 bytes) | General spatial filters |
| Row buffer | 11,520 bytes | Separable filters |
| Ring buffer (3 rows) | 34,560 bytes | 3x3 kernels processed row-by-row |
For our learning tool, full copy is fine. But in production code on embedded systems or when processing video frames at 60fps, the memory strategy matters a lot. This is exactly the kind of tradeoff Zig was designed for -- you choose the strategy, the language doesn't impose one.
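The numbers in that table are plain arithmetic on the image dimensions. A small sketch, assuming 3 bytes per pixel and using a made-up Strategy enum and extraBytes function, lets you redo them for any resolution:

```zig
const std = @import("std");

const Strategy = enum { in_place, full_copy, row_buffer, ring_buffer_3 };

/// Extra bytes needed on top of the image itself, for 3-byte RGB pixels.
fn extraBytes(strategy: Strategy, width: usize, height: usize) usize {
    const row = width * 3;
    return switch (strategy) {
        .in_place => 0,
        .full_copy => row * height,
        .row_buffer => row,
        .ring_buffer_3 => row * 3,
    };
}

pub fn main() void {
    const strategies = [_]Strategy{ .in_place, .full_copy, .row_buffer, .ring_buffer_3 };
    for (strategies) |s| {
        std.debug.print("{s}: {d} bytes\n", .{ @tagName(s), extraBytes(s, 3840, 2160) });
    }
}
```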
SIMD-accelerated pixel operations
Remember episode 19 where we covered SIMD with @Vector? Image processing is one of the best use cases for SIMD. Pixel operations are embarrassingly parallel -- every pixel gets the same operation, and modern CPUs have 128-bit or 256-bit SIMD registers that can process 16 or 32 bytes simultaneously.
Let's SIMD-accelerate our invert function first, since it's the simplest:
pub fn invertSimd(img: *Image) void {
const pixels = img.pixels;
const len = pixels.len;
// Process 16 bytes at a time using @Vector
const vec_len = 16;
const full_chunks = len / vec_len;
var i: usize = 0;
while (i < full_chunks * vec_len) : (i += vec_len) {
const chunk: @Vector(vec_len, u8) = pixels[i..][0..vec_len].*;
const inverted = @as(@Vector(vec_len, u8), @splat(255)) - chunk;
pixels[i..][0..vec_len].* = inverted;
}
// Handle remaining bytes (tail that doesn't fill a full vector)
while (i < len) : (i += 1) {
pixels[i] = 255 - pixels[i];
}
}
The @Vector(16, u8) type represents 16 bytes packed into a single SIMD register. On x86 this maps to an SSE register (128 bits = 16 bytes). On ARM it maps to a NEON register. The subtraction @splat(255) - chunk becomes a single SIMD instruction that subtracts all 16 bytes simultaneously.
@splat(255) creates a vector with all 16 lanes set to 255. The subtraction happens in parallel across all lanes. One instruction instead of 16 scalar subtractions. That's the SIMD advantage.
The pixels[i..][0..vec_len].* syntax might look weird. It first slices from index i to the end, then takes exactly vec_len elements from that slice, and the .* dereferences to copy the values. Zig optimizes this into a single aligned (or unaligned, depending on the pointer) vector load. We covered this slicing pattern in episode 5.
The tail loop handles the remaining bytes that don't fill a complete 16-byte vector. For a typical image this is at most 15 bytes out of millions -- negligible.
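Hardcoding 16 lanes is fine for a tutorial, but the standard library can suggest a lane count for the build target. The sketch below is a variation on invertSimd that works on a plain []u8 instead of the Image struct and uses std.simd.suggestVectorLength, falling back to 16 when no suggestion is available. invertScalar and invertVector are names invented for this example.

```zig
const std = @import("std");

// Lane count suggested for the build target, or 16 if there is no suggestion.
const lanes = std.simd.suggestVectorLength(u8) orelse 16;
const Vec = @Vector(lanes, u8);

/// Scalar reference: one byte at a time.
fn invertScalar(pixels: []u8) void {
    for (pixels) |*p| p.* = 255 - p.*;
}

/// Vector version: `lanes` bytes per iteration, scalar loop for the tail.
fn invertVector(pixels: []u8) void {
    const all_255: Vec = @splat(255);
    var i: usize = 0;
    while (i + lanes <= pixels.len) : (i += lanes) {
        const chunk: Vec = pixels[i..][0..lanes].*;
        pixels[i..][0..lanes].* = all_255 - chunk;
    }
    // Remaining bytes that don't fill a whole vector.
    invertScalar(pixels[i..]);
}

pub fn main() void {
    var px = [_]u8{ 0, 1, 2, 253, 254, 255 };
    invertVector(&px);
    std.debug.print("lanes={d} result={any}\n", .{ lanes, px });
}
```

Keeping the scalar version around pays off: comparing both on buffers of every length from 0 up to a few vector widths is a cheap way to catch off-by-one mistakes in the chunk and tail handling.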
Now let's do SIMD grayscale. This is trickier because the luminance formula mixes channels, and our pixel data has interleaved R, G, B bytes:
pub fn grayscaleSimd(img: *Image) void {
const pixel_count = img.pixelCount();
const pixels = img.pixels;
// Process 4 pixels at a time (12 bytes = 4 RGB triplets)
const batch = 4;
const full_batches = pixel_count / batch;
var p: usize = 0;
while (p < full_batches * batch) : (p += batch) {
const base = p * 3;
// Load 4 pixels worth of R, G, B into separate vectors
var r_vec: @Vector(batch, u16) = undefined;
var g_vec: @Vector(batch, u16) = undefined;
var b_vec: @Vector(batch, u16) = undefined;
inline for (0..batch) |j| {
r_vec[j] = pixels[base + j * 3];
g_vec[j] = pixels[base + j * 3 + 1];
b_vec[j] = pixels[base + j * 3 + 2];
}
// BT.601 luminance: 77*R + 150*G + 29*B >> 8
const r_weight: @Vector(batch, u16) = @splat(77);
const g_weight: @Vector(batch, u16) = @splat(150);
const b_weight: @Vector(batch, u16) = @splat(29);
const lum = (r_vec * r_weight + g_vec * g_weight + b_vec * b_weight) >> @splat(8);
// Write back
inline for (0..batch) |j| {
const val: u8 = @intCast(lum[j]);
pixels[base + j * 3] = val;
pixels[base + j * 3 + 1] = val;
pixels[base + j * 3 + 2] = val;
}
}
// Tail: remaining pixels
while (p < pixel_count) : (p += 1) {
const idx = p * 3;
const r = @as(u16, pixels[idx]);
const g = @as(u16, pixels[idx + 1]);
const b = @as(u16, pixels[idx + 2]);
const lum: u8 = @intCast((r * 77 + g * 150 + b * 29) >> 8);
pixels[idx] = lum;
pixels[idx + 1] = lum;
pixels[idx + 2] = lum;
}
}
The complication here is structure of arrays vs array of structures. Our pixel buffer is stored as RGBRGBRGB... (array of structures -- each pixel is a 3-byte struct). SIMD wants data in parallel lanes: RRRR, GGGG, BBBB (structure of arrays). So we have to "de-interleave" the data: extract R from pixel 0, R from pixel 1, R from pixel 2, R from pixel 3 into one vector, same for G and B. Then we do the vectorized computation. Then we "re-interleave" and write back.
This gather/scatter overhead eats into the SIMD speedup. For 4-wide vectors the win is modest (four u16 lanes fill only 64 bits of a register). With wider vectors (a 256-bit AVX2 register holds 16 u16 lanes, a 512-bit AVX-512 register holds 32) the amortized cost of the gather/scatter drops and the throughput gain dominates. On a real project you might consider storing pixels in planar format (all Rs, then all Gs, then all Bs) if you're doing heavy processing -- but that makes I/O harder. Tradeoffs everywhere ;-)
The inline for is important here. Regular for (0..batch) would be a runtime loop. inline for unrolls the loop at compile time, so the compiler sees four explicit load/store operations and can optimize them together. Without inline, you'd get a loop variable, a branch, and the compiler might not vectorize the gather pattern. We touched on inline for back in episode 9 (comptime).
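Whenever you write a batched variant like this, it's worth checking it against the scalar version. Here's a compact sketch of that idea on a plain []u8 of RGB triplets. grayScalar and grayBatched are stand-ins for grayscale and grayscaleSimd (without the Image struct), and the shift amount is spelled out as an explicit vector of u4 rather than relying on type inference.

```zig
const std = @import("std");

/// BT.601 luminance with 256-scaled integer weights.
fn lum(r: u16, g: u16, b: u16) u8 {
    return @intCast((r * 77 + g * 150 + b * 29) >> 8);
}

/// Scalar reference: one RGB triplet at a time.
fn grayScalar(pixels: []u8) void {
    var i: usize = 0;
    while (i + 3 <= pixels.len) : (i += 3) {
        const y = lum(pixels[i], pixels[i + 1], pixels[i + 2]);
        pixels[i] = y;
        pixels[i + 1] = y;
        pixels[i + 2] = y;
    }
}

/// Batched version: de-interleave 4 pixels, compute in vectors, re-interleave.
fn grayBatched(pixels: []u8) void {
    const batch = 4;
    const V = @Vector(batch, u16);
    var i: usize = 0;
    while (i + batch * 3 <= pixels.len) : (i += batch * 3) {
        var r: V = undefined;
        var g: V = undefined;
        var b: V = undefined;
        inline for (0..batch) |j| {
            r[j] = pixels[i + j * 3];
            g[j] = pixels[i + j * 3 + 1];
            b[j] = pixels[i + j * 3 + 2];
        }
        const sum = r * @as(V, @splat(77)) + g * @as(V, @splat(150)) + b * @as(V, @splat(29));
        const y = sum >> @as(@Vector(batch, u4), @splat(8));
        inline for (0..batch) |j| {
            const val: u8 = @intCast(y[j]);
            pixels[i + j * 3] = val;
            pixels[i + j * 3 + 1] = val;
            pixels[i + j * 3 + 2] = val;
        }
    }
    grayScalar(pixels[i..]); // tail: fewer than 4 pixels left
}

pub fn main() void {
    var px = [_]u8{ 255, 0, 0, 0, 255, 0, 0, 0, 255, 255, 255, 255, 10, 20, 30 };
    grayBatched(&px);
    std.debug.print("{any}\n", .{px});
}
```

Using a pixel count that isn't a multiple of 4 (five pixels in main) exercises both the vector path and the scalar tail in one go.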
Putting it all together
Let's write a demo that loads an image, applies a pipeline of operations, and saves the results. This previews what the next episode's CLI will do -- but for now we'll hardcode the operations:
const std = @import("std");
const image_mod = @import("image.zig");
const Image = image_mod.Image;
const ops = @import("operations.zig");
const fmt = @import("format.zig");
pub fn main() !void {
var gpa = std.heap.GeneralPurposeAllocator(.{}){};
defer {
const check = gpa.deinit();
if (check == .leak) std.debug.print("WARNING: memory leak detected\n", .{});
}
const allocator = gpa.allocator();
// Create a test image with a gradient
var img = try Image.init(allocator, 256, 256);
defer img.deinit();
for (0..256) |y| {
for (0..256) |x| {
img.setPixel(
@intCast(x),
@intCast(y),
@intCast(x),
@intCast(y),
128,
);
}
}
// Save original
try fmt.writeImage(&img, "original.ppm");
std.debug.print("Saved original.ppm\n", .{});
// Grayscale
var gray = try Image.init(allocator, img.width, img.height);
defer gray.deinit();
@memcpy(gray.pixels, img.pixels);
ops.grayscale(&gray);
try fmt.writeImage(&gray, "grayscale.ppm");
std.debug.print("Saved grayscale.ppm\n", .{});
// Sepia
var sep = try Image.init(allocator, img.width, img.height);
defer sep.deinit();
@memcpy(sep.pixels, img.pixels);
ops.sepia(&sep);
try fmt.writeImage(&sep, "sepia.ppm");
std.debug.print("Saved sepia.ppm\n", .{});
// Blur
var blurred = try Image.init(allocator, img.width, img.height);
defer blurred.deinit();
@memcpy(blurred.pixels, img.pixels);
try ops.boxBlur(&blurred, 3);
try fmt.writeImage(&blurred, "blurred.ppm");
std.debug.print("Saved blurred.ppm\n", .{});
// Edge detection
var edges = try Image.init(allocator, img.width, img.height);
defer edges.deinit();
@memcpy(edges.pixels, img.pixels);
try ops.sobelEdge(&edges);
try fmt.writeImage(&edges, "edges.ppm");
std.debug.print("Saved edges.ppm\n", .{});
// Inverted (SIMD)
var inv = try Image.init(allocator, img.width, img.height);
defer inv.deinit();
@memcpy(inv.pixels, img.pixels);
ops.invertSimd(&inv);
try fmt.writeImage(&inv, "inverted.ppm");
std.debug.print("Saved inverted.ppm\n", .{});
std.debug.print("All operations completed. Open the .ppm files to see results.\n", .{});
}
Every operation gets a fresh copy of the original image (via @memcpy) so we can see the individual effects without them stacking on top of each other. In the final CLI (next episode) the user will chain operations sequentially on the same image -- but for testing, independent copies are clearer.
Notice how the GPA (GeneralPurposeAllocator) leak check at the end catches any forgotten deinit calls. If you add a new image copy and forget its defer deinit(), the program tells you. This is one of Zig's biggest practical advantages over C -- not that it prevents leaks (it doesn't), but that it makes them loud and obvious during development.
Updated project structure
img-tool/
src/
image.zig -- Image struct (ep044)
ppm.zig -- PPM reader/writer (ep044)
bmp.zig -- BMP reader/writer (ep044)
format.zig -- detectFormat, readImage, writeImage (ep044)
operations.zig -- grayscale, invert, brightness, contrast, sepia,
adjustChannels, boxBlur, sobelEdge,
invertSimd, grayscaleSimd (this episode)
main.zig -- demo pipeline (this episode)
image_test.zig -- round-trip tests (ep044)
build.zig
Next time we'll add pipeline.zig and cli.zig to tie everything together with command-line argument parsing, operation chaining, and proper error reporting. We'll build the final tool so you can do things like img-tool input.bmp --grayscale --blur 3 --brightness 1.2 output.ppm -- reading one format, piping through a sequence of operations, and writing a different format.
What we learned
- Grayscale conversion using the BT.601 luminance formula (0.299R + 0.587G + 0.114B), implemented with integer math (77R + 150G + 29B >> 8) to avoid floating-point overhead
- Pixel inversion by iterating the flat byte buffer directly, without needing coordinate math
- Brightness (multiply by factor) and contrast (scale distance from midpoint 128) with clamping to the 0-255 range
- Sepia as a 3x3 color matrix transformation where each output channel depends on all three input channels -- must read all inputs before writing any outputs
- Color channel manipulation with per-channel scaling factors for isolation and artistic effects
- Box blur as a spatial filter that averages a kernel neighborhood, requiring a separate output buffer because reading from and writing to the same buffer would corrupt neighbor values
- Sobel edge detection: two directional gradient kernels (horizontal and vertical) convolved over grayscale pixels, combined with approximate magnitude (|gx| + |gy|)
- The in-place vs new buffer decision framework: point operations are safe in-place, spatial operations need double-buffering
- SIMD acceleration of pixel operations using @Vector with @splat for broadcasting constants, inline for for compile-time loop unrolling, and the structure-of-arrays gather/scatter pattern
Thanks for reading!