Learn Zig Series (#60) - Assembler: Two-Pass Assembly

Project G: Assembler/Disassembler (2/3)

What will I learn

How assembly language text syntax works: MOV R0, 42, label: ADD R1, R2, comments, and blank lines;
Why a two-pass approach is necessary: forward references to labels that haven't been defined yet;
Building pass 1: scanning source lines, recording label addresses in a symbol table using a hash map;
Building pass 2: parsing instructions, resolving label references to concrete addresses, encoding to binary;
Error reporting with line numbers: undefined labels, invalid registers, malformed syntax;
Producing flat binary output: a byte array of encoded instructions ready for the VM;
Reading assembly source from a file and writing a binary output file;
Testing the full pipeline: assemble a program, load it into the VM from episode 59, verify register state after execution.

Requirements

A working modern computer running macOS, Windows or Ubuntu;
An installed Zig 0.14+ distribution (download from ziglang.org);
The ambition to learn Zig programming.

Difficulty

Advanced

In episode 59 we built the foundations of our assembler project: an instruction set with 16 opcodes, a packed struct encoder, and a virtual machine that fetches-decodes-executes 16-bit instructions. We wrote programs by calling helper functions like asm_builder.mov_imm(0, 42) and manually computing branch target addresses. That worked, but it's the programming equivalent of writing machine code on paper and toggling it into a front panel one switch at a time. Painful, error-prone, and nobody does it by choice.

Today we fix that. We're building a two-pass assembler that reads human-readable text -- MOV R0, 42, loop: ADD R0, R1, JNE loop -- and produces the same binary output our VM already knows how to execute. The key challenge? Forward references. When the assembler sees JNE loop on line 3 but loop: isn't defined until line 7, it can't encode the jump address yet because it doesn't know it. The classic solution is two passes: the first pass collects all label addresses, the second pass encodes all instructions using those addresses. This technique dates back to the 1950s and most assemblers since have used some variation of it.

We'll reuse the Opcode, Instruction, and encoding functions from episode 59 without modification. Everything today builds on top of that layer. Here we go!

Assembly language syntax

Our assembler needs to understand a simple text format. Each line is one of: an instruction, a label, a label followed by an instruction, a comment, or blank. Here's what a complete program looks like:

; Sum numbers 1 through 10
    MOV R0, 0       ; accumulator
    MOV R1, 1       ; counter
    MOV R2, 10      ; limit
loop:
    ADD R0, R1      ; accumulate
    ADD R1, 1       ; increment counter
    CMP R1, R2      ; reached limit?
    JNE loop        ; if not, keep going
    ADD R0, R1      ; add the final value
    HLT

The rules: labels end with a colon and must start with a letter or underscore. Instruction mnemonics are case-insensitive (MOV, mov, Mov all work). Register operands are R0 through R7. Immediate values are plain decimal numbers. Comments start with ; and extend to end of line. Whitespace is flexible -- leading spaces/tabs are ignored.
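To make these rules concrete, here is a minimal sketch of two helpers that apply them. The names stripComment and isLabelName are hypothetical and are not part of the assembler we build below (the tokenizer applies the same rules inline); they only illustrate the comment and label-name rules.

```zig
const std = @import("std");

// Hypothetical helper: drop a trailing ';' comment and surrounding blanks.
fn stripComment(line: []const u8) []const u8 {
    const end = std.mem.indexOfScalar(u8, line, ';') orelse line.len;
    return std.mem.trim(u8, line[0..end], " \t");
}

// Hypothetical helper: a label name starts with a letter or underscore,
// followed by letters, digits, or underscores.
fn isLabelName(text: []const u8) bool {
    if (text.len == 0) return false;
    if (!std.ascii.isAlphabetic(text[0]) and text[0] != '_') return false;
    for (text[1..]) |c| {
        if (!std.ascii.isAlphanumeric(c) and c != '_') return false;
    }
    return true;
}

test "syntax rules" {
    try std.testing.expectEqualStrings("MOV R0, 0", stripComment("    MOV R0, 0       ; accumulator"));
    try std.testing.expectEqualStrings("", stripComment("; Sum numbers 1 through 10"));
    try std.testing.expect(isLabelName("loop"));
    try std.testing.expect(isLabelName("_start2"));
    try std.testing.expect(!isLabelName("2fast"));
}
```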

This is basically the same syntax as the assembly listings in episode 59's comments, just formalized into something a parser can handle. Let's define the data structures:

const std = @import("std");
const Allocator = std.mem.Allocator;

const TokenKind = enum {
    label_def,    // "loop:"
    mnemonic,     // "MOV", "ADD", etc.
    register,     // "R0" through "R7"
    immediate,    // numeric literal like 42
    identifier,   // label reference like "loop" (in operand position)
    comma,        // ","
    eof,
};

const Token = struct {
    kind: TokenKind,
    text: []const u8,
    line: usize,
};

Each token knows what line it came from. That line number is critical for error reporting -- when something goes wrong (and it will, trust me), you need to tell the user exactly where.

The line tokenizer

We tokenize one line at a time rather than the whole file. This keeps the tokenizer simple and gives us natural line-number tracking:

const Tokenizer = struct {
    tokens: std.ArrayList(Token),
    allocator: Allocator,

    fn init(allocator: Allocator) Tokenizer {
        return .{
            .tokens = std.ArrayList(Token).init(allocator),
            .allocator = allocator,
        };
    }

    fn deinit(self: *Tokenizer) void {
        self.tokens.deinit();
    }

    fn reset(self: *Tokenizer) void {
        self.tokens.clearRetainingCapacity();
    }

    fn tokenizeLine(self: *Tokenizer, line: []const u8, line_num: usize) !void {
        self.reset();
        var i: usize = 0;

        while (i < line.len) {
            // skip whitespace
            if (line[i] == ' ' or line[i] == '\t') {
                i += 1;
                continue;
            }

            // comment -- rest of line is ignored
            if (line[i] == ';') break;

            // comma
            if (line[i] == ',') {
                try self.tokens.append(.{
                    .kind = .comma,
                    .text = line[i .. i + 1],
                    .line = line_num,
                });
                i += 1;
                continue;
            }

            // number (immediate value)
            if (std.ascii.isDigit(line[i])) {
                const start = i;
                while (i < line.len and std.ascii.isDigit(line[i])) : (i += 1) {}
                try self.tokens.append(.{
                    .kind = .immediate,
                    .text = line[start..i],
                    .line = line_num,
                });
                continue;
            }

            // identifier or keyword
            if (std.ascii.isAlphabetic(line[i]) or line[i] == '_') {
                const start = i;
                while (i < line.len and (std.ascii.isAlphanumeric(line[i]) or line[i] == '_')) : (i += 1) {}

                // check for label definition (ends with ':')
                if (i < line.len and line[i] == ':') {
                    try self.tokens.append(.{
                        .kind = .label_def,
                        .text = line[start..i],
                        .line = line_num,
                    });
                    i += 1; // skip the colon
                    continue;
                }

                const word = line[start..i];

                // check if it's a register (R0-R7)
                if (word.len == 2 and (word[0] == 'R' or word[0] == 'r') and
                    word[1] >= '0' and word[1] <= '7')
                {
                    try self.tokens.append(.{
                        .kind = .register,
                        .text = word,
                        .line = line_num,
                    });
                    continue;
                }

                // check if it's a known mnemonic
                if (isMnemonic(word)) {
                    try self.tokens.append(.{
                        .kind = .mnemonic,
                        .text = word,
                        .line = line_num,
                    });
                    continue;
                }

                // otherwise it's a label reference
                try self.tokens.append(.{
                    .kind = .identifier,
                    .text = word,
                    .line = line_num,
                });
                continue;
            }

            // unknown character -- skip it (the parser only notices if this leaves the operands malformed)
            i += 1;
        }
    }

    fn isMnemonic(word: []const u8) bool {
        const mnemonics = [_][]const u8{
            "HLT", "MOV",  "ADD",   "SUB",  "MUL", "CMP",  "JMP", "JEQ",
            "JNE", "LOAD", "STORE", "PUSH", "POP", "CALL", "RET", "NOP",
        };
        for (mnemonics) |m| {
            // case-insensitive compare, so MOV, mov, and Mov all match
            if (std.ascii.eqlIgnoreCase(word, m)) return true;
        }
        return false;
    }
};

A few decisions worth explaining. First, we handle case-insensitive mnemonics by comparing each word against an uppercase table with std.ascii.eqlIgnoreCase, so MOV, mov, and Mov are all recognized. An alternative would be to toLower() the input first, but then error messages would show lowercased text which is less helpful. Second, the label definition detection is position-sensitive -- we only recognize foo: as a label when the colon immediately follows the identifier. foo : (with a space) would parse foo as an identifier and : as an unknown character. This is consistent with how most real assemblers work.

The reset() method lets us reuse the same tokenizer across lines without reallocating. The ArrayList keeps its capacity between calls thanks to clearRetainingCapacity() -- a pattern we've used in several previous projects. Memory-friendly and cache-friendly.

The symbol table

The symbol table maps label names to instruction addresses. It's populated during pass 1 and consumed during pass 2. We use a StringHashMap from Zig's standard library, which we covered in detail in episode 22:

const SymbolTable = struct {
    map: std.StringHashMap(u16),
    allocator: Allocator,

    fn init(allocator: Allocator) SymbolTable {
        return .{
            .map = std.StringHashMap(u16).init(allocator),
            .allocator = allocator,
        };
    }

    fn deinit(self: *SymbolTable) void {
        self.map.deinit();
    }

    fn define(self: *SymbolTable, name: []const u8, address: u16) !void {
        const result = try self.map.getOrPut(name);
        if (result.found_existing) {
            return error.DuplicateLabel;
        }
        result.value_ptr.* = address;
    }

    fn resolve(self: *const SymbolTable, name: []const u8) ?u16 {
        return self.map.get(name);
    }

    fn dump(self: *const SymbolTable) void {
        std.debug.print("\n--- Symbol Table ---\n", .{});
        var it = self.map.iterator();
        while (it.next()) |entry| {
            std.debug.print("  {s} = 0x{X:0>4}\n", .{ entry.key_ptr.*, entry.value_ptr.* });
        }
    }
};

The define function uses getOrPut which is one of the more elegant hash map operations. It performs a single lookup: if the key exists, it tells us (and we return a DuplicateLabel error because defining a label twice is always a bug). If it doesn't exist, it inserts an empty slot and we fill in the address. One hash computation instead of two (one for "check if exists" and another for "insert") -- a small optimization that matters when you're processing thousands of lines.

The symbol table keys are slices into the original source text. We don't copy the label names. This works because the source text outlives the assembler (we load the entire file into memory upfront and keep it around). The hash map still allocates its own storage, but there are no per-label string copies and no ownership headaches.
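The single-lookup behaviour of getOrPut is easy to check in isolation. The sketch below uses a free function, defineLabel, which is a hypothetical stand-in for SymbolTable.define so the example is self-contained. It shows that a second definition is rejected and the first address survives.

```zig
const std = @import("std");

// Hypothetical stand-in for SymbolTable.define: one lookup, insert only if new.
fn defineLabel(map: *std.StringHashMap(u16), name: []const u8, address: u16) !void {
    const slot = try map.getOrPut(name);
    if (slot.found_existing) return error.DuplicateLabel;
    slot.value_ptr.* = address;
}

test "second definition of the same label is rejected" {
    var map = std.StringHashMap(u16).init(std.testing.allocator);
    defer map.deinit();

    try defineLabel(&map, "loop", 0x0006);
    try std.testing.expectError(error.DuplicateLabel, defineLabel(&map, "loop", 0x0010));

    // the first definition is untouched
    try std.testing.expectEqual(@as(?u16, 0x0006), map.get("loop"));
}
```

Note that the map stores the name slice as-is, so the same lifetime rule applies here: the string must outlive the map.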

Error handling with line context

Before we build the assembler itself, let's set up proper error reporting. A useful error message needs the line number, the problem, and ideally the offending text:

const AsmError = struct {
    line: usize,
    message: []const u8,
    detail: ?[]const u8,

    fn format(self: AsmError, writer: anytype) !void {
        try writer.print("error at line {d}: {s}", .{ self.line, self.message });
        if (self.detail) |d| {
            try writer.print(" '{s}'", .{d});
        }
        try writer.print("\n", .{});
    }
};

const AssemblerResult = union(enum) {
    success: struct {
        binary: []u8,
        instruction_count: usize,
    },
    failure: struct {
        errors: []const AsmError,
    },
};

We use a tagged union (AssemblerResult) for the result instead of an error union. Why? Because we want to collect multiple errors before stopping. If your assembly file has three typos, a good assembler reports all three at once rather than making you fix one, re-run, fix the next, re-run. Real-world assemblers like NASM and GAS both work this way. Zig's error unions are great for single-error-then-bail, but for multi-error collection you need a different pattern.

The AsmError struct is intentionally simple. Line number, message, optional detail string. No fancy formatting, no ANSI colors. Those are presentation concerns that don't belong in the core assembler logic. If we want pretty output later, we format it at the call site.

Pass 1: collecting labels

Pass 1 scans every line but only cares about two things: labels and instructions. For each label it records the current address in the symbol table. For each instruction it advances the address counter by 2 (each instruction is one 16-bit word = 2 bytes). Pass 1 ignores operands entirely -- it doesn't care whether MOV R0, 42 or MOV R0, R1 is valid. That's pass 2's job.

const Assembler = struct {
    allocator: Allocator,
    symbols: SymbolTable,
    tokenizer: Tokenizer,
    errors: std.ArrayList(AsmError),
    source_lines: std.ArrayList([]const u8),
    output: std.ArrayList(u8),

    fn init(allocator: Allocator) Assembler {
        return .{
            .allocator = allocator,
            .symbols = SymbolTable.init(allocator),
            .tokenizer = Tokenizer.init(allocator),
            .errors = std.ArrayList(AsmError).init(allocator),
            .source_lines = std.ArrayList([]const u8).init(allocator),
            .output = std.ArrayList(u8).init(allocator),
        };
    }

    fn deinit(self: *Assembler) void {
        self.symbols.deinit();
        self.tokenizer.deinit();
        self.errors.deinit();
        self.source_lines.deinit();
        self.output.deinit();
    }

    fn addError(self: *Assembler, line: usize, message: []const u8, detail: ?[]const u8) void {
        self.errors.append(.{
            .line = line,
            .message = message,
            .detail = detail,
        }) catch {};
    }

    fn pass1(self: *Assembler, source: []const u8) !void {
        var address: u16 = 0;

        // split source into lines
        var line_iter = std.mem.splitScalar(u8, source, '\n');
        var line_num: usize = 1;

        while (line_iter.next()) |line| {
            try self.source_lines.append(line);
            try self.tokenizer.tokenizeLine(line, line_num);

            const tokens = self.tokenizer.tokens.items;

            if (tokens.len == 0) {
                line_num += 1;
                continue;
            }

            var ti: usize = 0;

            // handle label definition
            if (tokens[ti].kind == .label_def) {
                self.symbols.define(tokens[ti].text, address) catch |err| switch (err) {
                    error.DuplicateLabel => self.addError(line_num, "duplicate label", tokens[ti].text),
                    // out-of-memory is not a source error; propagate it
                    else => return err,
                };
                ti += 1;
            }

            // if there's a mnemonic after the label (or as the first token), count it
            if (ti < tokens.len and tokens[ti].kind == .mnemonic) {
                address += 2; // each instruction is 2 bytes
            }

            line_num += 1;
        }
    }

The address counter starts at 0 and increments by 2 for each instruction. Labels don't consume space -- loop: on its own line gets the address of whatever instruction comes next. A label on the same line as an instruction (loop: ADD R0, R1) gets the address of that instruction.
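The address bookkeeping can be sketched without any tokenizing. In the example below, Line and labelAddress are hypothetical simplifications (each line is reduced to "has a label?" and "has an instruction?"); the loop is the same one pass 1 runs.

```zig
const std = @import("std");

// Simplified view of a source line for this sketch: an optional label
// and a flag saying whether an instruction is present on the line.
const Line = struct {
    label: ?[]const u8 = null,
    has_instruction: bool = false,
};

// Walk the lines the way pass 1 does and return the address a label would get.
fn labelAddress(lines: []const Line, name: []const u8) ?u16 {
    var address: u16 = 0;
    for (lines) |line| {
        if (line.label) |label| {
            if (std.mem.eql(u8, label, name)) return address;
        }
        if (line.has_instruction) address += 2; // one 16-bit word
    }
    return null;
}

test "labels take the address of the next instruction" {
    const program = [_]Line{
        .{ .has_instruction = true }, // MOV R0, 0   @ 0x0000
        .{ .has_instruction = true }, // MOV R1, 1   @ 0x0002
        .{ .has_instruction = true }, // MOV R2, 10  @ 0x0004
        .{ .label = "loop" }, //          loop:       -> 0x0006
        .{ .has_instruction = true }, // ADD R0, R1  @ 0x0006
        .{ .label = "done", .has_instruction = true }, // done: HLT @ 0x0008
    };
    try std.testing.expectEqual(@as(?u16, 0x0006), labelAddress(&program, "loop"));
    try std.testing.expectEqual(@as(?u16, 0x0008), labelAddress(&program, "done"));
    try std.testing.expectEqual(@as(?u16, null), labelAddress(&program, "missing"));
}
```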

Notice how pass 1 completely ignores operands. It doesn't try to parse MOV R0, 42 -- it just sees "there's a mnemonic, so there's an instruction here, address += 2". This separation of concerns is the whole point of two passes. Pass 1 answers "where is everything?", pass 2 answers "what does everything mean?".

The addError function swallows allocation errors on the error list itself (the catch {} on append). This is a pragmatic choice: it lets the parse helpers report a problem and return null without also having to propagate an allocator error. The cost is that if we run out of memory while recording an error, that error is silently dropped, and if it was the only one the assembler would report success. In production assemblers this would be more sophisticated, but for a teaching project it's fine.

Pass 2: encoding instructions

Pass 2 re-scans every line, parses instruction operands, resolves label references through the symbol table, and encodes each instruction using the Instruction functions from episode 59. This is where the real work happens:

    fn pass2(self: *Assembler) !void {
        for (self.source_lines.items, 0..) |line, idx| {
            const line_num = idx + 1;
            try self.tokenizer.tokenizeLine(line, line_num);

            const tokens = self.tokenizer.tokens.items;
            if (tokens.len == 0) continue;

            var ti: usize = 0;

            // skip label definition
            if (tokens[ti].kind == .label_def) {
                ti += 1;
            }

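            // label-only line -- nothing to encode
            if (ti >= tokens.len) continue;

            // anything else in instruction position is a typo or a stray operand
            if (tokens[ti].kind != .mnemonic) {
                self.addError(line_num, "expected instruction mnemonic", tokens[ti].text);
                continue;
            }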

            const mnemonic = tokens[ti];
            ti += 1;

            const op = mnemonicToOpcode(mnemonic.text) orelse {
                self.addError(line_num, "unknown mnemonic", mnemonic.text);
                continue;
            };

            const word = switch (op) {
                .hlt, .nop => Instruction.encodeNoArgs(op).toU16(),
                .ret => Instruction.encodeNoArgs(op).toU16(),

                .jmp, .jeq, .jne, .call => blk: {
                    const target = self.parseJumpTarget(tokens[ti..], line_num) orelse {
                        break :blk @as(u16, 0);
                    };
                    break :blk Instruction.encodeImm(op, 0, target).toU16();
                },

                .push => blk: {
                    const reg = self.parseRegister(tokens[ti..], line_num) orelse {
                        break :blk @as(u16, 0);
                    };
                    break :blk Instruction.encodeReg(op, reg, 0).toU16();
                },

                .pop => blk: {
                    const reg = self.parseRegister(tokens[ti..], line_num) orelse {
                        break :blk @as(u16, 0);
                    };
                    break :blk Instruction.encodeReg(op, reg, 0).toU16();
                },

                .mov, .add, .sub, .mul, .cmp, .load, .store => blk: {
                    const result = self.parseDstSrc(tokens[ti..], line_num) orelse {
                        break :blk @as(u16, 0);
                    };
                    if (result.is_immediate) {
                        break :blk Instruction.encodeImm(op, result.dst, result.imm).toU16();
                    } else {
                        break :blk Instruction.encodeReg(op, result.dst, result.src).toU16();
                    }
                },
            };

            // emit little-endian
            try self.output.append(@truncate(word));
            try self.output.append(@truncate(word >> 8));
        }
    }

The switch on the opcode is the central dispatch. Each instruction category has different operand requirements: HLT/NOP/RET take no operands, JMP/JEQ/JNE/CALL take a jump target (label or immediate), PUSH/POP take a single register, and the arithmetic/memory instructions take destination + source (register or immediate).

The blk: labeled blocks with break :blk are something we haven't used much in the series. They let us compute a value inside a switch prong using multiple statements. The switch expects each prong to produce a u16 value, and the block's break :blk value provides that. If parsing fails, we emit 0 as a placeholder (the error is already recorded by the parse function).
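Here is the labeled-block pattern stripped down to its essentials. Kind and encode are made-up names for this sketch and have nothing to do with the real instruction encoding; the point is only how a prong runs several statements and still yields a u16, including the "emit 0 on failure" escape hatch.

```zig
const std = @import("std");

const Kind = enum { none, small, large };

// Each prong must yield a u16. The labeled block lets a prong run several
// statements first and then hand back a value with `break :blk`.
fn encode(kind: Kind, operand: ?u8) u16 {
    return switch (kind) {
        .none => 0,
        .small => blk: {
            // missing operand: leave the block early with a placeholder
            const value = operand orelse break :blk @as(u16, 0);
            break :blk @as(u16, value);
        },
        .large => blk: {
            const value = operand orelse break :blk @as(u16, 0);
            break :blk (@as(u16, value) << 8) | 0xFF;
        },
    };
}

test "labeled blocks produce the prong value" {
    try std.testing.expectEqual(@as(u16, 7), encode(.small, 7));
    try std.testing.expectEqual(@as(u16, 0), encode(.small, null));
    try std.testing.expectEqual(@as(u16, 0x01FF), encode(.large, 1));
}
```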

Let me show the operand parsers:

    fn parseRegister(self: *Assembler, tokens: []const Token, line_num: usize) ?u3 {
        if (tokens.len == 0 or tokens[0].kind != .register) {
            self.addError(line_num, "expected register", null);
            return null;
        }
        return @truncate(tokens[0].text[1] - '0');
    }

    const DstSrcResult = struct {
        dst: u3,
        is_immediate: bool,
        src: u3,
        imm: u8,
    };

    fn parseDstSrc(self: *Assembler, tokens: []const Token, line_num: usize) ?DstSrcResult {
        // expect: register, comma, (register | immediate | identifier)
        if (tokens.len < 3) {
            self.addError(line_num, "expected 'REG, operand'", null);
            return null;
        }

        if (tokens[0].kind != .register) {
            self.addError(line_num, "expected destination register", tokens[0].text);
            return null;
        }
        const dst: u3 = @truncate(tokens[0].text[1] - '0');

        if (tokens[1].kind != .comma) {
            self.addError(line_num, "expected comma after destination register", null);
            return null;
        }

        // source operand
        if (tokens[2].kind == .register) {
            const src: u3 = @truncate(tokens[2].text[1] - '0');
            return .{ .dst = dst, .is_immediate = false, .src = src, .imm = 0 };
        }

        if (tokens[2].kind == .immediate) {
            const val = std.fmt.parseInt(u8, tokens[2].text, 10) catch {
                self.addError(line_num, "immediate value out of range (0-255)", tokens[2].text);
                return null;
            };
            return .{ .dst = dst, .is_immediate = true, .src = 0, .imm = val };
        }

        if (tokens[2].kind == .identifier) {
            // label reference -- resolve through symbol table
            const addr = self.symbols.resolve(tokens[2].text) orelse {
                self.addError(line_num, "undefined label", tokens[2].text);
                return null;
            };
            if (addr > 255) {
                self.addError(line_num, "label address exceeds 8-bit immediate range", tokens[2].text);
                return null;
            }
            return .{ .dst = dst, .is_immediate = true, .src = 0, .imm = @truncate(addr) };
        }

        self.addError(line_num, "unexpected operand", tokens[2].text);
        return null;
    }

    fn parseJumpTarget(self: *Assembler, tokens: []const Token, line_num: usize) ?u8 {
        if (tokens.len == 0) {
            self.addError(line_num, "expected jump target", null);
            return null;
        }

        if (tokens[0].kind == .immediate) {
            return std.fmt.parseInt(u8, tokens[0].text, 10) catch {
                self.addError(line_num, "jump target out of range (0-255)", tokens[0].text);
                return null;
            };
        }

        if (tokens[0].kind == .identifier) {
            const addr = self.symbols.resolve(tokens[0].text) orelse {
                self.addError(line_num, "undefined label", tokens[0].text);
                return null;
            };
            if (addr > 255) {
                self.addError(line_num, "label address exceeds 8-bit range", tokens[0].text);
                return null;
            }
            return @truncate(addr);
        }

        self.addError(line_num, "expected label or address", tokens[0].text);
        return null;
    }

The parseDstSrc function handles the most complex case: instructions with two operands separated by a comma. The second operand can be a register (ADD R0, R1), an immediate (ADD R0, 5), or a label reference (MOV R0, data_addr). Label references get resolved through the symbol table, which was populated in pass 1. If the label doesn't exist, we report "undefined label" with the offending name.

There's an important constraint here: because our instruction format uses 8 bits for immediate values, label addresses must fit in 0-255. Since every instruction occupies 2 bytes, a label can only point at one of the first 128 instructions (byte addresses 0 through 254). A real assembler would use wider instruction formats or relative addressing to handle larger programs. For our teaching project, 128 instructions is more than enough for any example.
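The arithmetic behind that limit, as a small sketch. addressOf and fitsInImm8 are hypothetical helpers written for this illustration; the last assertion shows that std.fmt.parseInt enforces the same bound on numeric operands by returning error.Overflow.

```zig
const std = @import("std");

// Byte address of the instruction at a given index (2 bytes per instruction).
fn addressOf(instruction_index: u16) u16 {
    return instruction_index * 2;
}

// True when an address can be encoded in the 8-bit immediate field.
fn fitsInImm8(address: u16) bool {
    return address <= std.math.maxInt(u8);
}

test "only the first 128 instructions are reachable by label" {
    try std.testing.expect(fitsInImm8(addressOf(127))); // address 254
    try std.testing.expect(!fitsInImm8(addressOf(128))); // address 256
    // parseInt applies the same bound to numeric operands
    try std.testing.expectError(error.Overflow, std.fmt.parseInt(u8, "256", 10));
}
```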

The mnemonic-to-opcode lookup

This is straightforward -- map text to the opcode enum, case-insensitive:

    fn mnemonicToOpcode(text: []const u8) ?Opcode {
        const pairs = [_]struct { name: []const u8, op: Opcode }{
            .{ .name = "HLT", .op = .hlt },
            .{ .name = "MOV", .op = .mov },
            .{ .name = "ADD", .op = .add },
            .{ .name = "SUB", .op = .sub },
            .{ .name = "MUL", .op = .mul },
            .{ .name = "CMP", .op = .cmp },
            .{ .name = "JMP", .op = .jmp },
            .{ .name = "JEQ", .op = .jeq },
            .{ .name = "JNE", .op = .jne },
            .{ .name = "LOAD", .op = .load },
            .{ .name = "STORE", .op = .store },
            .{ .name = "PUSH", .op = .push },
            .{ .name = "POP", .op = .pop },
            .{ .name = "CALL", .op = .call },
            .{ .name = "RET", .op = .ret },
            .{ .name = "NOP", .op = .nop },
        };

        for (pairs) |pair| {
            if (std.ascii.eqlIgnoreCase(pair.name, text)) return pair.op;
        }
        return null;
    }

We use std.ascii.eqlIgnoreCase for the comparison so MOV, mov, and Mov all match. The linear scan through 16 entries is fast enough -- for a table this small, a hash map is unlikely to be any faster once you account for hashing overhead. The table would have to grow considerably before hashing becomes worthwhile.

Putting it together: the assemble function

The public API ties pass 1 and pass 2 together:

    fn assemble(self: *Assembler, source: []const u8) !AssemblerResult {
        // pass 1: collect labels
        try self.pass1(source);

        if (self.errors.items.len > 0) {
            return .{ .failure = .{ .errors = self.errors.items } };
        }

        // pass 2: encode instructions
        try self.pass2();

        if (self.errors.items.len > 0) {
            return .{ .failure = .{ .errors = self.errors.items } };
        }

        return .{
            .success = .{
                .binary = self.output.items,
                .instruction_count = self.output.items.len / 2,
            },
        };
    }
};

We check for errors after each pass. Pass 1 errors (like duplicate labels) mean pass 2 would operate on a broken symbol table, so we bail early. Pass 2 errors (undefined labels, bad operands) are collected and reported together.

The result includes both the raw binary and the instruction count. The binary is the output ArrayList's items slice -- valid as long as the Assembler hasn't been deinitialized. The caller should copy it if they need it to outlive the assembler.

Why two passes: forward references in action

Let me show concretely why one pass isn't enough. Consider this program:

    JNE skip       ; line 1: jump forward to 'skip'
    MOV R0, 99     ; line 2: this should be skipped
skip:              ; line 3: label only, emits no code
    MOV R0, 42     ; line 4: this is the target
    HLT            ; line 5

When a single-pass assembler hits line 1, it sees JNE skip. But skip: is defined on line 3 -- we haven't seen it yet. The assembler doesn't know what address to encode for the jump target. You could emit a placeholder and go back to fix it later (that's called "backpatching" and some assemblers do use it), but the two-pass approach is cleaner: pass 1 discovers that skip is at address 0x04 (two instructions before it, 2 bytes each = 4), and pass 2 uses that information to encode JNE 4.

With our assembler both forward and backward references work identically. The symbol table doesn't care about order -- by the time pass 2 runs, every label in the program has been resolved.

Having said that, backpatching has advantages in some contexts. If the program is very large, reading it twice is slower than reading it once. Linkers (which process object files from separate compilation units) typically take a similar patch-it-later approach, driven by relocation tables. But for an assembler processing a single source file, two passes are simple, correct, and fast enough.
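For comparison, here is a rough sketch of the backpatching alternative. This is not how our assembler works. Item, Fixup, and assembleTargets are all invented for this illustration, the "encoding" is just the target address itself, and the fixup list has a small fixed capacity to keep the example short.

```zig
const std = @import("std");

// Toy input for this sketch: a label definition, a jump to a label,
// or any other instruction (its encoding does not matter here).
const Item = union(enum) {
    label: []const u8,
    jump: []const u8,
    other,
};

const Fixup = struct { slot: usize, name: []const u8 };

// Single pass with backpatching: emit a placeholder for every jump,
// remember which slot it occupies, and patch the slots at the end.
// Writes one u16 per instruction into `out` and returns the count.
fn assembleTargets(allocator: std.mem.Allocator, items: []const Item, out: []u16) !usize {
    var labels = std.StringHashMap(u16).init(allocator);
    defer labels.deinit();

    var fixups: [16]Fixup = undefined; // fixed capacity keeps the sketch short
    var fixup_count: usize = 0;
    var count: usize = 0; // instructions emitted so far

    for (items) |item| {
        switch (item) {
            .label => |name| try labels.put(name, @intCast(count * 2)),
            .jump => |name| {
                if (count == out.len or fixup_count == fixups.len) return error.OutOfSpace;
                fixups[fixup_count] = .{ .slot = count, .name = name };
                fixup_count += 1;
                out[count] = 0; // placeholder, patched below
                count += 1;
            },
            .other => {
                if (count == out.len) return error.OutOfSpace;
                out[count] = 0xFFFF; // stand-in for a real encoding
                count += 1;
            },
        }
    }

    // Every label is known now, so forward references resolve like backward ones.
    for (fixups[0..fixup_count]) |fixup| {
        out[fixup.slot] = labels.get(fixup.name) orelse return error.UndefinedLabel;
    }
    return count;
}

test "forward jump is patched after the label is seen" {
    const items = [_]Item{ .{ .jump = "skip" }, .other, .{ .label = "skip" }, .other };
    var out: [8]u16 = undefined;
    const n = try assembleTargets(std.testing.allocator, &items, &out);
    try std.testing.expectEqual(@as(usize, 3), n);
    try std.testing.expectEqual(@as(u16, 4), out[0]); // skip is at byte address 4
}
```

The extra bookkeeping (the fixup list and the patch loop) is the price of reading the source only once.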

Reading a file and writing binary output

Let's wire up file I/O so we can assemble real .asm files into .bin files:

pub fn main() !void {
    var gpa = std.heap.GeneralPurposeAllocator(.{}){};
    defer {
        const check = gpa.deinit();
        if (check == .leak) @panic("memory leak detected");
    }
    const allocator = gpa.allocator();

    // get command-line arguments
    const args = try std.process.argsAlloc(allocator);
    defer std.process.argsFree(allocator, args);

    if (args.len < 2) {
        std.debug.print("Usage: assembler <input.asm> [output.bin]\n", .{});
        std.process.exit(1);
    }

    const input_path = args[1];
    const output_path = if (args.len >= 3) args[2] else "out.bin";

    // read input file
    const source = std.fs.cwd().readFileAlloc(
        allocator,
        input_path,
        1024 * 1024, // 1MB max source file
    ) catch |err| {
        std.debug.print("error: could not read '{s}': {}\n", .{ input_path, err });
        std.process.exit(1);
    };
    defer allocator.free(source);

    // assemble
    var asm_state = Assembler.init(allocator);
    defer asm_state.deinit();

    const result = try asm_state.assemble(source);

    switch (result) {
        .failure => |f| {
            std.debug.print("Assembly failed with {d} error(s):\n", .{f.errors.len});
            for (f.errors) |err| {
                err.format(std.io.getStdErr().writer()) catch {};
            }
            std.process.exit(1);
        },
        .success => |s| {
            // write binary output
            const out_file = try std.fs.cwd().createFile(output_path, .{});
            defer out_file.close();
            try out_file.writeAll(s.binary);

            std.debug.print("Assembled {d} instructions ({d} bytes) -> {s}\n", .{
                s.instruction_count,
                s.binary.len,
                output_path,
            });

            // also dump symbol table for debugging
            asm_state.symbols.dump();
        },
    }
}

The readFileAlloc function reads the whole file into a heap-allocated buffer. We set a 1MB limit -- more than enough for any assembly source file we'd ever write. The file I/O pattern here is what we established in episode 10, but now we're using it in a real tool.

The output is a flat binary blob. No headers, no sections, no metadata -- just raw instruction bytes. You load this into the VM's memory starting at address 0 and run it. Real object file formats (ELF, Mach-O, PE) add headers describing sections, symbols, and relocations, but those are complexities for a different project. Our flat binary is the simplest useful output format.
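Since the file is nothing but instruction words, reading it back is just byte arithmetic. The sketch below assumes the little-endian order that pass 2 emits; wordToBytes and wordAt are hypothetical helpers for this illustration, not functions from the VM.

```zig
const std = @import("std");

// Split a 16-bit word into two bytes, low byte first (the order pass 2 emits).
fn wordToBytes(word: u16) [2]u8 {
    const lo: u8 = @truncate(word);
    const hi: u8 = @truncate(word >> 8);
    return .{ lo, hi };
}

// Rebuild the instruction word at `index` from a flat little-endian binary.
fn wordAt(binary: []const u8, index: usize) u16 {
    const lo: u16 = binary[index * 2];
    const hi: u16 = binary[index * 2 + 1];
    return lo | (hi << 8);
}

test "words round-trip through the flat binary" {
    const bytes = wordToBytes(0x1234);
    try std.testing.expectEqual(@as(u8, 0x34), bytes[0]);
    try std.testing.expectEqual(@as(u8, 0x12), bytes[1]);

    const binary = [_]u8{ 0x34, 0x12, 0xCD, 0xAB };
    try std.testing.expectEqual(@as(u16, 0xABCD), wordAt(&binary, 1));
}
```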

Testing the assembler: end-to-end

The real test is: assemble source text, load the binary into the VM, run it, check register state. This validates the entire pipeline -- tokenizer, pass 1, pass 2, encoder, and the VM from episode 59:

const testing = std.testing;

test "assemble and run: sum 1 to 10" {
    const allocator = testing.allocator;

    const source =
        \\; Sum numbers 1 through 10
        \\    MOV R0, 0
        \\    MOV R1, 1
        \\    MOV R2, 10
        \\loop:
        \\    ADD R0, R1
        \\    ADD R1, 1
        \\    CMP R1, R2
        \\    JNE loop
        \\    ADD R0, R1
        \\    HLT
    ;

    var asm_state = Assembler.init(allocator);
    defer asm_state.deinit();

    const result = try asm_state.assemble(source);

    switch (result) {
        .failure => |f| {
            for (f.errors) |err| {
                std.debug.print("ASM ERROR: line {d}: {s}\n", .{ err.line, err.message });
            }
            return error.AssemblyFailed;
        },
        .success => |s| {
            // 3 setup MOVs + 4 loop instructions + final ADD + HLT = 9
            try testing.expectEqual(@as(usize, 9), s.instruction_count);

            // load into VM and run
            var vm = try VM.init(allocator);
            defer vm.deinit();

            @memcpy(vm.memory[0..s.binary.len], s.binary);
            vm.run();

            // 1+2+3+4+5+6+7+8+9+10 = 55
            try testing.expectEqual(@as(u16, 55), vm.regs.get(0));
        },
    }
}

test "assemble and run: forward reference" {
    const allocator = testing.allocator;

    const source =
        \\    JMP skip
        \\    MOV R0, 99
        \\skip:
        \\    MOV R0, 42
        \\    HLT
    ;

    var asm_state = Assembler.init(allocator);
    defer asm_state.deinit();

    const result = try asm_state.assemble(source);

    switch (result) {
        .failure => return error.AssemblyFailed,
        .success => |s| {
            var vm = try VM.init(allocator);
            defer vm.deinit();

            @memcpy(vm.memory[0..s.binary.len], s.binary);
            vm.run();

            // MOV R0, 99 should have been skipped
            try testing.expectEqual(@as(u16, 42), vm.regs.get(0));
        },
    }
}

test "assemble and run: subroutine with CALL/RET" {
    const allocator = testing.allocator;

    const source =
        \\    MOV R0, 7
        \\    MOV R1, 3
        \\    CALL multiply
        \\    HLT
        \\multiply:
        \\    MUL R0, R1
        \\    RET
    ;

    var asm_state = Assembler.init(allocator);
    defer asm_state.deinit();

    const result = try asm_state.assemble(source);

    switch (result) {
        .failure => return error.AssemblyFailed,
        .success => |s| {
            var vm = try VM.init(allocator);
            defer vm.deinit();

            @memcpy(vm.memory[0..s.binary.len], s.binary);
            vm.run();

            try testing.expectEqual(@as(u16, 21), vm.regs.get(0));
        },
    }
}

The first test assembles the same loop program from episode 59, but now written in proper assembly text instead of helper function calls. The result should be identical: R0 = 55. If the assembler produces even one wrong bit in any instruction, the VM will compute a different result and the test fails. This is the beauty of end-to-end testing -- it catches bugs anywhere in the chain.

The forward reference test is the one that would break a single-pass assembler. JMP skip on line 1 references skip:, which is not defined until line 3. If the symbol table wasn't populated in pass 1, pass 2 would report an "undefined label" error for skip and fail. The test proves our two-pass approach works.
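As a rough illustration of what pass 1 has to know before pass 2 can encode JMP skip, here is a stripped-down label collector. It is a sketch and not the assembler's real pass 1: the function collectLabels is hypothetical, and it assumes labels sit on their own line and that every other non-blank, non-comment line is exactly one 2-byte instruction.

```zig
const std = @import("std");

/// Simplified pass 1 (assumptions: labels stand alone on a line, and every
/// other non-blank, non-comment line is one 2-byte instruction).
fn collectLabels(source: []const u8, table: *std.StringHashMap(u16)) !void {
    var address: u16 = 0;
    var lines = std.mem.splitScalar(u8, source, '\n');
    while (lines.next()) |raw| {
        const line = std.mem.trim(u8, raw, " \t\r");
        if (line.len == 0 or line[0] == ';') continue;
        if (std.mem.endsWith(u8, line, ":")) {
            // The key is a slice of `source`: no copy is made, so `source`
            // must outlive the table.
            try table.put(line[0 .. line.len - 1], address);
        } else {
            address += 2; // one instruction = two bytes
        }
    }
}

test "skip resolves to byte address 4" {
    const source =
        \\    JMP skip
        \\    MOV R0, 99
        \\skip:
        \\    MOV R0, 42
        \\    HLT
    ;
    var table = std.StringHashMap(u16).init(std.testing.allocator);
    defer table.deinit();
    try collectLabels(source, &table);
    try std.testing.expectEqual(@as(?u16, 4), table.get("skip"));
    try std.testing.expectEqual(@as(?u16, null), table.get("nowhere"));
}
```

Two instructions come before skip:, so it lands at byte address 4. A lookup that returns null is the situation pass 2 turns into an undefined-label error.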

The subroutine test uses CALL multiply -- a forward reference to a label. The assembler resolves multiply to the address of the MUL instruction, encodes it as the CALL target, and the VM pushes the return address, jumps to the multiply routine, executes MUL, and RETs back. R0 = 7 * 3 = 21.

Testing error cases

Good assemblers produce helpful error messages. Let's verify ours does:

test "error: undefined label" {
    const allocator = testing.allocator;

    const source =
        \\    JMP nowhere
        \\    HLT
    ;

    var asm_state = Assembler.init(allocator);
    defer asm_state.deinit();

    const result = try asm_state.assemble(source);

    switch (result) {
        .failure => |f| {
            try testing.expectEqual(@as(usize, 1), f.errors.len);
            try testing.expectEqualStrings("undefined label", f.errors[0].message);
        },
        .success => return error.ExpectedFailure,
    }
}

test "error: duplicate label" {
    const allocator = testing.allocator;

    const source =
        \\foo:
        \\    MOV R0, 1
        \\foo:
        \\    MOV R1, 2
        \\    HLT
    ;

    var asm_state = Assembler.init(allocator);
    defer asm_state.deinit();

    const result = try asm_state.assemble(source);

    switch (result) {
        .failure => |f| {
            try testing.expectEqual(@as(usize, 1), f.errors.len);
            try testing.expectEqualStrings("duplicate label", f.errors[0].message);
        },
        .success => return error.ExpectedFailure,
    }
}

test "error: immediate out of range" {
    const allocator = testing.allocator;

    const source =
        \\    MOV R0, 300
        \\    HLT
    ;

    var asm_state = Assembler.init(allocator);
    defer asm_state.deinit();

    const result = try asm_state.assemble(source);

    switch (result) {
        .failure => |f| {
            try testing.expect(f.errors.len > 0);
        },
        .success => return error.ExpectedFailure,
    }
}

test "label on same line as instruction" {
    const allocator = testing.allocator;

    const source =
        \\start: MOV R0, 5
        \\       ADD R0, 10
        \\       HLT
    ;

    var asm_state = Assembler.init(allocator);
    defer asm_state.deinit();

    const result = try asm_state.assemble(source);

    switch (result) {
        .failure => return error.AssemblyFailed,
        .success => |s| {
            var vm = try VM.init(allocator);
            defer vm.deinit();

            @memcpy(vm.memory[0..s.binary.len], s.binary);
            vm.run();

            try testing.expectEqual(@as(u16, 15), vm.regs.get(0));
        },
    }
}

Error tests are just as important as success tests. The undefined label test ensures we catch references to labels that were never defined. The duplicate label test catches the case where someone accidentally reuses a label name. The out-of-range test verifies we don't silently truncate values that don't fit in 8 bits. And the "label on same line" test verifies that start: MOV R0, 5 works correctly -- the label gets the address of the MOV instruction, not the next one.
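The 8-bit range check can lean on the standard library. The sketch below uses a hypothetical helper, parseImm8, which may differ from how the assembler in this episode parses immediates; the point is only that std.fmt.parseInt already refuses values that do not fit the target type:

```zig
const std = @import("std");

/// Hypothetical helper: parse a decimal immediate that must fit in 8 bits.
/// std.fmt.parseInt returns error.Overflow when the value exceeds the type.
fn parseImm8(text: []const u8) !u8 {
    return std.fmt.parseInt(u8, text, 10);
}

test "immediates that do not fit in 8 bits are rejected" {
    try std.testing.expectEqual(@as(u8, 255), try parseImm8("255"));
    try std.testing.expectError(error.Overflow, parseImm8("300"));
    try std.testing.expectError(error.InvalidCharacter, parseImm8("12x"));
}
```

Parsing straight into u8 means there is no intermediate wider value that could be truncated by accident; the assembler only has to translate the error into a message with a line number.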

Comparing hand-assembled vs text-assembled output

Here's a satisfying verification: the assembler should produce the exact same bytes as the hand-built programs from episode 59. Let's test that:

test "assembler output matches hand-built program" {
    const allocator = testing.allocator;

    // This is the hand-built program from episode 59
    const hand_built = [_]u16{
        asm_builder.mov_imm(0, 5),
        asm_builder.mov_imm(1, 10),
        asm_builder.mov_imm(2, 20),
        asm_builder.add_reg(0, 1),
        asm_builder.add_reg(0, 2),
        asm_builder.halt(),
    };

    // Same program in assembly text
    const source =
        \\    MOV R0, 5
        \\    MOV R1, 10
        \\    MOV R2, 20
        \\    ADD R0, R1
        \\    ADD R0, R2
        \\    HLT
    ;

    var asm_state = Assembler.init(allocator);
    defer asm_state.deinit();

    const result = try asm_state.assemble(source);

    switch (result) {
        .failure => return error.AssemblyFailed,
        .success => |s| {
            try testing.expectEqual(hand_built.len * 2, s.binary.len);

            // compare word by word
            for (hand_built, 0..) |expected_word, i| {
                const offset = i * 2;
                const lo: u16 = s.binary[offset];
                const hi: u16 = s.binary[offset + 1];
                const actual_word = lo | (hi << 8);
                try testing.expectEqual(expected_word, actual_word);
            }
        },
    }
}

If this test passes, we know the assembler produces byte-identical output to the programmatic encoder. Same instruction encoding, same byte order, same everything. The assembler is just a friendlier interface to the same binary format.
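The manual lo | (hi << 8) reconstruction can also be written with std.mem.readInt. The helper wordAt below is hypothetical and assumes the same low-byte-first layout the test above relies on; it is shown only as an equivalent way to read a word back:

```zig
const std = @import("std");

/// Hypothetical helper: read the 16-bit word at instruction index `i`
/// from a flat binary, assuming low byte first.
fn wordAt(binary: []const u8, i: usize) u16 {
    return std.mem.readInt(u16, binary[i * 2 ..][0..2], .little);
}

test "wordAt agrees with manual shifting" {
    const binary = [_]u8{ 0x2A, 0x10, 0x00, 0xF0 };
    for (0..binary.len / 2) |i| {
        const lo: u16 = binary[i * 2];
        const hi: u16 = binary[i * 2 + 1];
        try std.testing.expectEqual(lo | (hi << 8), wordAt(&binary, i));
    }
}
```

Both forms do the same thing; readInt just makes the byte order explicit in the call instead of in the shift.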

A more complex program: the Fibonacci sequence

Let's assemble something non-trivial that exercises multiple features -- labels, forward and backward references, conditional branches, and register-to-register moves:

test "assemble and run: fibonacci" {
    const allocator = testing.allocator;

    const source =
        \\; Compute 8th Fibonacci number
        \\; R0 = current, R1 = previous, R2 = counter, R3 = limit, R4 = temp
        \\    MOV R0, 1       ; fib(1) = 1
        \\    MOV R1, 0       ; fib(0) = 0
        \\    MOV R2, 1       ; counter starts at 1
        \\    MOV R3, 8       ; compute fib(8)
        \\fib_loop:
        \\    CMP R2, R3
        \\    JEQ done
        \\    MOV R4, R0      ; temp = current
        \\    ADD R0, R1      ; current = current + previous
        \\    MOV R1, R4      ; previous = temp
        \\    ADD R2, 1       ; counter++
        \\    JMP fib_loop
        \\done:
        \\    HLT
    ;

    var asm_state = Assembler.init(allocator);
    defer asm_state.deinit();

    const result = try asm_state.assemble(source);

    switch (result) {
        .failure => |f| {
            for (f.errors) |err| {
                std.debug.print("ERROR line {d}: {s}\n", .{ err.line, err.message });
            }
            return error.AssemblyFailed;
        },
        .success => |s| {
            var vm = try VM.init(allocator);
            defer vm.deinit();

            @memcpy(vm.memory[0..s.binary.len], s.binary);
            vm.run();

            // fib(8) = 21
            try testing.expectEqual(@as(u16, 21), vm.regs.get(0));
        },
    }
}

This computes fib(8) = 21 using the iterative method. The program has both a forward reference (JEQ done jumps past the loop body to a label defined later) and a backward reference (JMP fib_loop jumps back to the top of the loop). Both get resolved correctly through the symbol table.

Five registers are in use simultaneously (R0-R4), the loop uses CMP/JEQ for the exit condition and JMP for the back-edge, and MOV shuffles values between registers for the swap. That is a lot of behaviour for 12 instructions, and it exercises the core of our instruction set: data movement, arithmetic, comparison, and both conditional and unconditional jumps. Kind of wild that we can compute Fibonacci numbers using nothing but the types and functions we wrote ourselves, from bit encoding all the way up ;-)
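A handy way to double-check the expected value is a reference model in plain Zig that mirrors the assembly loop variable for variable. The function fibIterative is a hypothetical stand-in written for this check; it assumes limit is at least 1, just like the assembly program does:

```zig
const std = @import("std");

/// Reference model of the assembly loop: same variables, same update order.
/// Assumes limit >= 1 (the counter starts at 1 and only counts up).
fn fibIterative(limit: u16) u16 {
    var current: u16 = 1; // R0, starts as fib(1)
    var previous: u16 = 0; // R1, starts as fib(0)
    var counter: u16 = 1; // R2
    while (counter != limit) : (counter += 1) { // CMP R2, R3 / JEQ done / ADD R2, 1
        const temp = current; // MOV R4, R0
        current += previous; // ADD R0, R1
        previous = temp; // MOV R1, R4
    }
    return current;
}

test "reference model agrees with the VM result" {
    try std.testing.expectEqual(@as(u16, 21), fibIterative(8));
    try std.testing.expectEqual(@as(u16, 1), fibIterative(1));
}
```

If the VM and the reference model ever disagree, the bug is in the assembler or the VM, not in the expected value written into the test.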

What we learned

A two-pass assembler solves the forward reference problem: pass 1 collects all label addresses into a symbol table, pass 2 uses that table to encode jump targets that reference labels defined later in the source
Assembly text syntax needs a tokenizer that handles labels (ending with :), mnemonics (case-insensitive), registers (R0-R7), immediates (decimal numbers), and identifiers (label references in operand position)
The symbol table maps label names to byte addresses using a StringHashMap -- keys are slices into the source text so no string copying is needed
Error collection (rather than fail-on-first-error) gives the programmer feedback on multiple problems at once, matching how real assemblers like NASM behave
Operand parsing dispatches by opcode category: no-arg (HLT, NOP, RET), single-target (JMP, CALL), single-register (PUSH, POP), and dst-source pairs (MOV, ADD, etc.)
The assembler output is a flat binary blob: raw encoded instructions in memory order, no headers, loadable directly into the VM at address 0
End-to-end tests (assemble text, load binary into VM, run, check registers) validate the entire pipeline and catch bugs in any layer
Comparing assembler output against hand-built programs from episode 59 proves byte-level correctness of the encoding

This is part 2 of Project G. We have a working two-pass assembler that turns human-readable text into binary programs our VM can execute. Next time we'll build the reverse: a disassembler that reads binary and produces readable assembly text, plus a binary inspector that shows you exactly what's encoded at each address. That closes the loop -- text to binary to text again.

Thanks for reading!

Hive account: @scipio
