Learn Zig Series (#60) - Assembler: Two-Pass Assembly
Project G: Assembler/Disassembler (2/3)
What will I learn
- How assembly language text syntax works: MOV R0, 42, label: ADD R1, R2, comments, and blank lines;
- Why a two-pass approach is necessary: forward references to labels that haven't been defined yet;
- Building pass 1: scanning source lines, recording label addresses in a symbol table using a hash map;
- Building pass 2: parsing instructions, resolving label references to concrete addresses, encoding to binary;
- Error reporting with line numbers: undefined labels, invalid registers, malformed syntax;
- Producing flat binary output: a byte array of encoded instructions ready for the VM;
- Reading assembly source from a file and writing a binary output file;
- Testing the full pipeline: assemble a program, load it into the VM from episode 59, verify register state after execution.
Requirements
- A working modern computer running macOS, Windows or Ubuntu;
- An installed Zig 0.14+ distribution (download from ziglang.org);
- The ambition to learn Zig programming.
Difficulty
- Advanced
Curriculum (of the Learn Zig Series):
- Zig Programming Tutorial - ep001 - Intro
- Learn Zig Series (#2) - Hello Zig, Variables and Types
- Learn Zig Series (#3) - Functions and Control Flow
- Learn Zig Series (#4) - Error Handling (Zig's Best Feature)
- Learn Zig Series (#5) - Arrays, Slices, and Strings
- Learn Zig Series (#6) - Structs, Enums, and Tagged Unions
- Learn Zig Series (#7) - Memory Management and Allocators
- Learn Zig Series (#8) - Pointers and Memory Layout
- Learn Zig Series (#9) - Comptime (Zig's Superpower)
- Learn Zig Series (#10) - Project Structure, Modules, and File I/O
- Learn Zig Series (#11) - Mini Project: Building a Step Sequencer
- Learn Zig Series (#12) - Testing and Test-Driven Development
- Learn Zig Series (#13) - Interfaces via Type Erasure
- Learn Zig Series (#14) - Generics with Comptime Parameters
- Learn Zig Series (#15) - The Build System (build.zig)
- Learn Zig Series (#16) - Sentinel-Terminated Types and C Strings
- Learn Zig Series (#17) - Packed Structs and Bit Manipulation
- Learn Zig Series (#18) - Async Concepts and Event Loops
- Learn Zig Series (#18b) - Addendum: Async Returns in Zig 0.16
- Learn Zig Series (#19) - SIMD with @Vector
- Learn Zig Series (#20) - Working with JSON
- Learn Zig Series (#21) - Networking and TCP Sockets
- Learn Zig Series (#22) - Hash Maps and Data Structures
- Learn Zig Series (#23) - Iterators and Lazy Evaluation
- Learn Zig Series (#24) - Logging, Formatting, and Debug Output
- Learn Zig Series (#25) - Mini Project: HTTP Status Checker
- Learn Zig Series (#26) - Writing a Custom Allocator
- Learn Zig Series (#27) - C Interop: Calling C from Zig
- Learn Zig Series (#28) - C Interop: Exposing Zig to C
- Learn Zig Series (#29) - Inline Assembly and Low-Level Control
- Learn Zig Series (#30) - Thread Safety and Atomics
- Learn Zig Series (#31) - Memory-Mapped I/O and Files
- Learn Zig Series (#32) - Compile-Time Reflection with @typeInfo
- Learn Zig Series (#33) - Building a State Machine with Tagged Unions
- Learn Zig Series (#34) - Performance Profiling and Optimization
- Learn Zig Series (#35) - Cross-Compilation and Target Triples
- Learn Zig Series (#36) - Mini Project: CLI Task Runner
- Learn Zig Series (#37) - Markdown to HTML: Tokenizer and Lexer
- Learn Zig Series (#38) - Markdown to HTML: Parser and AST
- Learn Zig Series (#39) - Markdown to HTML: Renderer and CLI
- Learn Zig Series (#40) - Key-Value Store: In-Memory Store
- Learn Zig Series (#41) - Key-Value Store: Write-Ahead Log
- Learn Zig Series (#42) - Key-Value Store: TCP Server
- Learn Zig Series (#43) - Key-Value Store: Client Library and Benchmarks
- Learn Zig Series (#44) - Image Tool: Reading and Writing PPM/BMP
- Learn Zig Series (#45) - Image Tool: Pixel Operations
- Learn Zig Series (#46) - Image Tool: CLI Pipeline
- Learn Zig Series (#47) - Build a Shell: Parsing Commands
- Learn Zig Series (#48) - Build a Shell: Process Spawning
- Learn Zig Series (#49) - Build a Shell: Built-in Commands
- Learn Zig Series (#50) - Build a Shell: Job Control and Signals
- Learn Zig Series (#51) - HTTP Server: Accept Loop and Parsing
- Learn Zig Series (#52) - HTTP Server: Router and Responses
- Learn Zig Series (#53) - HTTP Server: Static Files and MIME
- Learn Zig Series (#54) - HTTP Server: Middleware and Logging
- Learn Zig Series (#55) - ECS Game Engine: Architecture
- Learn Zig Series (#56) - ECS Game Engine: Component Storage
- Learn Zig Series (#57) - ECS Game Engine: Systems and Queries
- Learn Zig Series (#58) - ECS Game Engine: Terminal Rendering
- Learn Zig Series (#59) - Assembler: Instruction Encoding
- Learn Zig Series (#60) - Assembler: Two-Pass Assembly (this post)
In episode 59 we built the foundations of our assembler project: an instruction set with 16 opcodes, a packed struct encoder, and a virtual machine that fetches-decodes-executes 16-bit instructions. We wrote programs by calling helper functions like asm_builder.mov_imm(0, 42) and manually computing branch target addresses. That worked, but it's the programming equivalent of writing machine code on paper and toggling it into a front panel one switch at a time. Painful, error-prone, and nobody does it by choice.
Today we fix that. We're building a two-pass assembler that reads human-readable text -- MOV R0, 42, loop: ADD R0, R1, JNE loop -- and produces the same binary output our VM already knows how to execute. The key challenge? Forward references. When the assembler sees JNE loop on line 3 but loop: isn't defined until line 7, it can't encode the jump address yet because it doesn't know it. The classic solution is two passes: the first pass collects all label addresses, the second pass encodes all instructions using those addresses. This technique dates back to the 1950s, and most assemblers since have used some variation of it.
We'll reuse the Opcode, Instruction, and encoding functions from episode 59 without modification. Everything today builds on top of that layer. Here we go!
Assembly language syntax
Our assembler needs to understand a simple text format. Each line is one of: an instruction, a label, a label followed by an instruction, a comment, or blank. Here's what a complete program looks like:
; Sum numbers 1 through 10
MOV R0, 0 ; accumulator
MOV R1, 1 ; counter
MOV R2, 10 ; limit
loop:
ADD R0, R1 ; accumulate
ADD R1, 1 ; increment counter
CMP R1, R2 ; reached limit?
JNE loop ; if not, keep going
ADD R0, R1 ; add the final value
HLT
The rules: labels end with a colon and must start with a letter or underscore. Instruction mnemonics are case-insensitive (MOV, mov, Mov all work). Register operands are R0 through R7. Immediate values are plain decimal numbers. Comments start with ; and extend to end of line. Whitespace is flexible -- leading spaces/tabs are ignored.
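The label naming rule is easy to pin down in code. Here is a minimal sketch, assuming a hypothetical helper called isValidLabelName that is not part of the assembler itself (the tokenizer further down enforces the same rule inline while scanning):

```zig
const std = @import("std");

/// Hypothetical helper (not in the episode's code): true when `name`
/// starts with a letter or underscore and continues with letters,
/// digits, or underscores.
fn isValidLabelName(name: []const u8) bool {
    if (name.len == 0) return false;
    // first character: letter or underscore
    if (!(std.ascii.isAlphabetic(name[0]) or name[0] == '_')) return false;
    // remaining characters: letters, digits, or underscores
    for (name[1..]) |c| {
        if (!(std.ascii.isAlphanumeric(c) or c == '_')) return false;
    }
    return true;
}

test "label naming rule" {
    try std.testing.expect(isValidLabelName("loop"));
    try std.testing.expect(isValidLabelName("_start2"));
    try std.testing.expect(!isValidLabelName("2fast"));
    try std.testing.expect(!isValidLabelName(""));
}
```

Running it with zig test is enough to see the rule in action.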
This is basically the same syntax as the assembly listings in episode 59's comments, just formalized into something a parser can handle. Let's define the data structures:
const std = @import("std");
const Allocator = std.mem.Allocator;
const TokenKind = enum {
label_def, // "loop:"
mnemonic, // "MOV", "ADD", etc.
register, // "R0" through "R7"
immediate, // numeric literal like 42
identifier, // label reference like "loop" (in operand position)
comma, // ","
eof,
};
const Token = struct {
kind: TokenKind,
text: []const u8,
line: usize,
};
Each token knows what line it came from. That line number is critical for error reporting -- when something goes wrong (and it will, trust me), you need to tell the user exactly where.
The line tokenizer
We tokenize one line at a time rather than the whole file. This keeps the tokenizer simple and gives us natural line-number tracking:
const Tokenizer = struct {
tokens: std.ArrayList(Token),
allocator: Allocator,
fn init(allocator: Allocator) Tokenizer {
return .{
.tokens = std.ArrayList(Token).init(allocator),
.allocator = allocator,
};
}
fn deinit(self: *Tokenizer) void {
self.tokens.deinit();
}
fn reset(self: *Tokenizer) void {
self.tokens.clearRetainingCapacity();
}
fn tokenizeLine(self: *Tokenizer, line: []const u8, line_num: usize) !void {
self.reset();
var i: usize = 0;
while (i < line.len) {
// skip whitespace
if (line[i] == ' ' or line[i] == '\t') {
i += 1;
continue;
}
// comment -- rest of line is ignored
if (line[i] == ';') break;
// comma
if (line[i] == ',') {
try self.tokens.append(.{
.kind = .comma,
.text = line[i .. i + 1],
.line = line_num,
});
i += 1;
continue;
}
// number (immediate value)
if (std.ascii.isDigit(line[i])) {
const start = i;
while (i < line.len and std.ascii.isDigit(line[i])) : (i += 1) {}
try self.tokens.append(.{
.kind = .immediate,
.text = line[start..i],
.line = line_num,
});
continue;
}
// identifier or keyword
if (std.ascii.isAlphabetic(line[i]) or line[i] == '_') {
const start = i;
while (i < line.len and (std.ascii.isAlphanumeric(line[i]) or line[i] == '_')) : (i += 1) {}
// check for label definition (ends with ':')
if (i < line.len and line[i] == ':') {
try self.tokens.append(.{
.kind = .label_def,
.text = line[start..i],
.line = line_num,
});
i += 1; // skip the colon
continue;
}
const word = line[start..i];
// check if it's a register (R0-R7)
if (word.len == 2 and (word[0] == 'R' or word[0] == 'r') and
word[1] >= '0' and word[1] <= '7')
{
try self.tokens.append(.{
.kind = .register,
.text = word,
.line = line_num,
});
continue;
}
// check if it's a known mnemonic
if (isMnemonic(word)) {
try self.tokens.append(.{
.kind = .mnemonic,
.text = word,
.line = line_num,
});
continue;
}
// otherwise it's a label reference
try self.tokens.append(.{
.kind = .identifier,
.text = word,
.line = line_num,
});
continue;
}
// unknown character -- skipped silently here; a stricter tokenizer would report it
i += 1;
}
}
fn isMnemonic(word: []const u8) bool {
// uppercase spellings only; the comparison below ignores case,
// so "MOV", "mov", and "Mov" are all recognized
const mnemonics = [_][]const u8{
"HLT", "MOV", "ADD", "SUB", "MUL", "CMP", "JMP", "JEQ",
"JNE", "LOAD", "STORE", "PUSH", "POP", "CALL", "RET", "NOP",
};
for (mnemonics) |m| {
if (std.ascii.eqlIgnoreCase(word, m)) return true;
}
return false;
}
};
A few decisions worth explaining. First, we handle case-insensitive mnemonics by comparing each word against an uppercase table with std.ascii.eqlIgnoreCase, so MOV, mov, and Mov are all recognized. An alternative would be to lowercase the input first, but then error messages would show lowercased text instead of what the user actually typed, which is less helpful. Second, the label definition detection is position-sensitive -- we only recognize foo: as a label when the colon immediately follows the identifier. foo : (with a space) would parse foo as an identifier and : as an unknown character. This is consistent with how most real assemblers work.
The reset() method lets us reuse the same tokenizer across lines without reallocating. The ArrayList keeps its capacity between calls thanks to clearRetainingCapacity() -- a pattern we've used in several previous projects. Memory-friendly and cache-friendly.
The symbol table
The symbol table maps label names to instruction addresses. It's populated during pass 1 and consumed during pass 2. We use a StringHashMap from Zig's standard library, which we covered in detail in episode 22:
const SymbolTable = struct {
map: std.StringHashMap(u16),
allocator: Allocator,
fn init(allocator: Allocator) SymbolTable {
return .{
.map = std.StringHashMap(u16).init(allocator),
.allocator = allocator,
};
}
fn deinit(self: *SymbolTable) void {
self.map.deinit();
}
fn define(self: *SymbolTable, name: []const u8, address: u16) !void {
const result = try self.map.getOrPut(name);
if (result.found_existing) {
return error.DuplicateLabel;
}
result.value_ptr.* = address;
}
fn resolve(self: *const SymbolTable, name: []const u8) ?u16 {
return self.map.get(name);
}
fn dump(self: *const SymbolTable) void {
std.debug.print("\n--- Symbol Table ---\n", .{});
var it = self.map.iterator();
while (it.next()) |entry| {
std.debug.print(" {s} = 0x{X:0>4}\n", .{ entry.key_ptr.*, entry.value_ptr.* });
}
}
};
The define function uses getOrPut which is one of the more elegant hash map operations. It performs a single lookup: if the key exists, it tells us (and we return a DuplicateLabel error because defining a label twice is always a bug). If it doesn't exist, it inserts an empty slot and we fill in the address. One hash computation instead of two (one for "check if exists" and another for "insert") -- a small optimization that matters when you're processing thousands of lines.
The symbol table keys are slices into the original source text. We don't copy the label names. This works because the source text outlives the assembler (we load the entire file into memory upfront and keep it around). No allocations, no ownership headaches.
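To see the getOrPut behaviour in isolation, here is a small sketch on a bare StringHashMap. The function name defineOnce is a hypothetical stand-in for SymbolTable.define; the map calls themselves are the standard library ones used above:

```zig
const std = @import("std");

/// Sketch of the define-once rule. `defineOnce` is a hypothetical
/// stand-in for SymbolTable.define.
fn defineOnce(map: *std.StringHashMap(u16), name: []const u8, address: u16) !void {
    const slot = try map.getOrPut(name); // one lookup: find or reserve
    if (slot.found_existing) return error.DuplicateLabel;
    slot.value_ptr.* = address;
}

test "second definition of a label is rejected" {
    var map = std.StringHashMap(u16).init(std.testing.allocator);
    defer map.deinit();

    try defineOnce(&map, "loop", 6);
    try std.testing.expectError(error.DuplicateLabel, defineOnce(&map, "loop", 12));
    // the first address is left untouched
    try std.testing.expectEqual(@as(?u16, 6), map.get("loop"));
}
```

Note that the keys here are string literals, which live for the whole program. In the assembler the same role is played by slices into the source buffer.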
Error handling with line context
Before we build the assembler itself, let's set up proper error reporting. A useful error message needs the line number, the problem, and ideally the offending text:
const AsmError = struct {
line: usize,
message: []const u8,
detail: ?[]const u8,
fn format(self: AsmError, writer: anytype) !void {
try writer.print("error at line {d}: {s}", .{ self.line, self.message });
if (self.detail) |d| {
try writer.print(" '{s}'", .{d});
}
try writer.print("\n", .{});
}
};
const AssemblerResult = union(enum) {
success: struct {
binary: []u8,
instruction_count: usize,
},
failure: struct {
errors: []const AsmError,
},
};
We use a tagged union (AssemblerResult) for the result instead of an error union. Why? Because we want to collect multiple errors before stopping. If your assembly file has three typos, a good assembler reports all three at once rather than making you fix one, re-run, fix the next, re-run. Real-world assemblers like NASM and GAS both work this way. Zig's error unions are great for single-error-then-bail, but for multi-error collection you need a different pattern.
The AsmError struct is intentionally simple. Line number, message, optional detail string. No fancy formatting, no ANSI colors. Those are presentation concerns that don't belong in the core assembler logic. If we want pretty output later, we format it at the call site.
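Formatting at the call site could look something like the sketch below. It assumes a hypothetical helper renderError that writes into a caller-provided buffer with std.fmt.bufPrint, using the same message layout as AsmError.format (minus the trailing newline):

```zig
const std = @import("std");

/// Hypothetical call-site helper: renders one error into `buf` and
/// returns the written slice. Fails if the buffer is too small.
fn renderError(buf: []u8, line: usize, message: []const u8, detail: ?[]const u8) ![]u8 {
    if (detail) |d| {
        return std.fmt.bufPrint(buf, "error at line {d}: {s} '{s}'", .{ line, message, d });
    }
    return std.fmt.bufPrint(buf, "error at line {d}: {s}", .{ line, message });
}

test "render with and without detail" {
    var buf: [128]u8 = undefined;

    const with_detail = try renderError(&buf, 3, "undefined label", "nowhere");
    try std.testing.expectEqualStrings("error at line 3: undefined label 'nowhere'", with_detail);

    const without = try renderError(&buf, 7, "expected register", null);
    try std.testing.expectEqualStrings("error at line 7: expected register", without);
}
```

Because the core struct only carries data, a variant with colors or source excerpts can be added later without touching the assembler.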
Pass 1: collecting labels
Pass 1 scans every line but only cares about two things: labels and instructions. For each label it records the current address in the symbol table. For each instruction it advances the address counter by 2 (each instruction is one 16-bit word = 2 bytes). Pass 1 ignores operands entirely -- it doesn't care whether MOV R0, 42 or MOV R0, R1 is valid. That's pass 2's job.
const Assembler = struct {
allocator: Allocator,
symbols: SymbolTable,
tokenizer: Tokenizer,
errors: std.ArrayList(AsmError),
source_lines: std.ArrayList([]const u8),
output: std.ArrayList(u8),
fn init(allocator: Allocator) Assembler {
return .{
.allocator = allocator,
.symbols = SymbolTable.init(allocator),
.tokenizer = Tokenizer.init(allocator),
.errors = std.ArrayList(AsmError).init(allocator),
.source_lines = std.ArrayList([]const u8).init(allocator),
.output = std.ArrayList(u8).init(allocator),
};
}
fn deinit(self: *Assembler) void {
self.symbols.deinit();
self.tokenizer.deinit();
self.errors.deinit();
self.source_lines.deinit();
self.output.deinit();
}
fn addError(self: *Assembler, line: usize, message: []const u8, detail: ?[]const u8) void {
self.errors.append(.{
.line = line,
.message = message,
.detail = detail,
}) catch {};
}
fn pass1(self: *Assembler, source: []const u8) !void {
var address: u16 = 0;
// split source into lines
var line_iter = std.mem.splitScalar(u8, source, '\n');
var line_num: usize = 1;
while (line_iter.next()) |line| {
try self.source_lines.append(line);
try self.tokenizer.tokenizeLine(line, line_num);
const tokens = self.tokenizer.tokens.items;
if (tokens.len == 0) {
line_num += 1;
continue;
}
var ti: usize = 0;
// handle label definition
if (tokens[ti].kind == .label_def) {
self.symbols.define(tokens[ti].text, address) catch |err| switch (err) {
error.DuplicateLabel => self.addError(line_num, "duplicate label", tokens[ti].text),
else => return err, // out of memory: propagate instead of silently dropping the label
};
ti += 1;
}
// if there's a mnemonic after the label (or as the first token), count it
if (ti < tokens.len and tokens[ti].kind == .mnemonic) {
address += 2; // each instruction is 2 bytes
}
line_num += 1;
}
}
The address counter starts at 0 and increments by 2 for each instruction. Labels don't consume space -- loop: on its own line gets the address of whatever instruction comes next. A label on the same line as an instruction (loop: ADD R0, R1) gets the address of that instruction.
Notice how pass 1 completely ignores operands. It doesn't try to parse MOV R0, 42 -- it just sees "there's a mnemonic, so there's an instruction here, address += 2". This separation of concerns is the whole point of two passes. Pass 1 answers "where is everything?", pass 2 answers "what does everything mean?".
The addError function swallows allocation errors on the error list itself (the catch {} on append). This is a pragmatic choice: if we can't even allocate space for an error record, the process is in deep trouble anyway, and keeping addError infallible lets the parse functions return plain optionals instead of error unions. The cost is that a diagnostic can be lost silently under memory pressure. In production assemblers this would be more sophisticated, but for a teaching project it's fine.
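The address bookkeeping can be tried out without the rest of the assembler. The sketch below is a simplified stand-in for pass 1, with one loud assumption: any non-empty text that is not a label counts as one 2-byte instruction, whereas the real pass checks for a mnemonic token. The function name labelAddress is hypothetical:

```zig
const std = @import("std");

/// Simplified pass-1 sketch: returns the address of label `wanted`,
/// or null if it is never defined. Assumes every non-label,
/// non-comment line is exactly one 2-byte instruction.
fn labelAddress(source: []const u8, wanted: []const u8) ?u16 {
    var address: u16 = 0;
    var lines = std.mem.splitScalar(u8, source, '\n');
    while (lines.next()) |raw| {
        // drop the comment, then the surrounding whitespace
        const no_comment = if (std.mem.indexOfScalar(u8, raw, ';')) |i| raw[0..i] else raw;
        var rest = std.mem.trim(u8, no_comment, " \t\r");
        if (std.mem.indexOfScalar(u8, rest, ':')) |colon| {
            // a label takes the address of whatever comes next
            if (std.mem.eql(u8, rest[0..colon], wanted)) return address;
            rest = std.mem.trim(u8, rest[colon + 1 ..], " \t\r");
        }
        if (rest.len > 0) address += 2;
    }
    return null;
}

test "labels take the address of the next instruction" {
    const src =
        \\    MOV R0, 0   ; address 0
        \\    MOV R1, 1   ; address 2
        \\loop:
        \\    ADD R0, R1  ; address 4
        \\    JNE loop    ; address 6
        \\end: HLT        ; address 8
    ;
    try std.testing.expectEqual(@as(?u16, 4), labelAddress(src, "loop"));
    try std.testing.expectEqual(@as(?u16, 8), labelAddress(src, "end"));
    try std.testing.expectEqual(@as(?u16, null), labelAddress(src, "missing"));
}
```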
Pass 2: encoding instructions
Pass 2 re-scans every line, parses instruction operands, resolves label references through the symbol table, and encodes each instruction using the Instruction functions from episode 59. This is where the real work happens:
fn pass2(self: *Assembler) !void {
for (self.source_lines.items, 0..) |line, idx| {
const line_num = idx + 1;
try self.tokenizer.tokenizeLine(line, line_num);
const tokens = self.tokenizer.tokens.items;
if (tokens.len == 0) continue;
var ti: usize = 0;
// skip label definition
if (tokens[ti].kind == .label_def) {
ti += 1;
}
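// label-only line: nothing to encode
if (ti >= tokens.len) continue;
// anything else in instruction position is an error (e.g. a misspelled mnemonic)
if (tokens[ti].kind != .mnemonic) {
self.addError(line_num, "expected instruction mnemonic", tokens[ti].text);
continue;
}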
const mnemonic = tokens[ti];
ti += 1;
const op = mnemonicToOpcode(mnemonic.text) orelse {
self.addError(line_num, "unknown mnemonic", mnemonic.text);
continue;
};
const word = switch (op) {
.hlt, .nop => Instruction.encodeNoArgs(op).toU16(),
.ret => Instruction.encodeNoArgs(op).toU16(),
.jmp, .jeq, .jne, .call => blk: {
const target = self.parseJumpTarget(tokens[ti..], line_num) orelse {
break :blk @as(u16, 0);
};
break :blk Instruction.encodeImm(op, 0, target).toU16();
},
.push => blk: {
const reg = self.parseRegister(tokens[ti..], line_num) orelse {
break :blk @as(u16, 0);
};
break :blk Instruction.encodeReg(op, reg, 0).toU16();
},
.pop => blk: {
const reg = self.parseRegister(tokens[ti..], line_num) orelse {
break :blk @as(u16, 0);
};
break :blk Instruction.encodeReg(op, reg, 0).toU16();
},
.mov, .add, .sub, .mul, .cmp, .load, .store => blk: {
const result = self.parseDstSrc(tokens[ti..], line_num) orelse {
break :blk @as(u16, 0);
};
if (result.is_immediate) {
break :blk Instruction.encodeImm(op, result.dst, result.imm).toU16();
} else {
break :blk Instruction.encodeReg(op, result.dst, result.src).toU16();
}
},
};
// emit little-endian
try self.output.append(@truncate(word));
try self.output.append(@truncate(word >> 8));
}
}
The switch on the opcode is the central dispatch. Each instruction category has different operand requirements: HLT/NOP/RET take no operands, JMP/JEQ/JNE/CALL take a jump target (label or immediate), PUSH/POP take a single register, and the arithmetic/memory instructions take destination + source (register or immediate).
The blk: labeled blocks with break :blk are something we haven't used much in the series. They let us compute a value inside a switch prong using multiple statements. The switch expects each prong to produce a u16 value, and the block's break :blk value provides that. If parsing fails, we emit 0 as a placeholder (the error is already recorded by the parse function).
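If labeled blocks are new to you, here is the pattern stripped down to its core. Everything in this sketch (the Kind enum, encodeDemo) is made up for illustration and has nothing to do with the real instruction encoding:

```zig
const std = @import("std");

const Kind = enum { none, small, wide };

/// Hypothetical example: every prong must yield a u16. The labeled
/// block lets a prong run several statements before producing it.
fn encodeDemo(kind: Kind, operand: ?u8) u16 {
    return switch (kind) {
        .none => 0,
        .small => blk: {
            // bail out of the block early with a placeholder
            const value = operand orelse break :blk @as(u16, 0);
            break :blk value;
        },
        .wide => blk: {
            const value = operand orelse break :blk @as(u16, 0);
            break :blk @as(u16, value) << 8;
        },
    };
}

test "labeled blocks produce the prong value" {
    try std.testing.expectEqual(@as(u16, 7), encodeDemo(.small, 7));
    try std.testing.expectEqual(@as(u16, 256), encodeDemo(.wide, 1));
    try std.testing.expectEqual(@as(u16, 0), encodeDemo(.small, null));
}
```

The early break :blk with a placeholder is the same move pass 2 makes when an operand fails to parse.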
Let me show the operand parsers:
fn parseRegister(self: *Assembler, tokens: []const Token, line_num: usize) ?u3 {
if (tokens.len == 0 or tokens[0].kind != .register) {
self.addError(line_num, "expected register", null);
return null;
}
return @truncate(tokens[0].text[1] - '0');
}
const DstSrcResult = struct {
dst: u3,
is_immediate: bool,
src: u3,
imm: u8,
};
fn parseDstSrc(self: *Assembler, tokens: []const Token, line_num: usize) ?DstSrcResult {
// expect: register, comma, (register | immediate | identifier)
if (tokens.len < 3) {
self.addError(line_num, "expected 'REG, operand'", null);
return null;
}
if (tokens[0].kind != .register) {
self.addError(line_num, "expected destination register", tokens[0].text);
return null;
}
const dst: u3 = @truncate(tokens[0].text[1] - '0');
if (tokens[1].kind != .comma) {
self.addError(line_num, "expected comma after destination register", null);
return null;
}
// source operand
if (tokens[2].kind == .register) {
const src: u3 = @truncate(tokens[2].text[1] - '0');
return .{ .dst = dst, .is_immediate = false, .src = src, .imm = 0 };
}
if (tokens[2].kind == .immediate) {
const val = std.fmt.parseInt(u8, tokens[2].text, 10) catch {
self.addError(line_num, "immediate value out of range (0-255)", tokens[2].text);
return null;
};
return .{ .dst = dst, .is_immediate = true, .src = 0, .imm = val };
}
if (tokens[2].kind == .identifier) {
// label reference -- resolve through symbol table
const addr = self.symbols.resolve(tokens[2].text) orelse {
self.addError(line_num, "undefined label", tokens[2].text);
return null;
};
if (addr > 255) {
self.addError(line_num, "label address exceeds 8-bit immediate range", tokens[2].text);
return null;
}
return .{ .dst = dst, .is_immediate = true, .src = 0, .imm = @truncate(addr) };
}
self.addError(line_num, "unexpected operand", tokens[2].text);
return null;
}
fn parseJumpTarget(self: *Assembler, tokens: []const Token, line_num: usize) ?u8 {
if (tokens.len == 0) {
self.addError(line_num, "expected jump target", null);
return null;
}
if (tokens[0].kind == .immediate) {
return std.fmt.parseInt(u8, tokens[0].text, 10) catch {
self.addError(line_num, "jump target out of range (0-255)", tokens[0].text);
return null;
};
}
if (tokens[0].kind == .identifier) {
const addr = self.symbols.resolve(tokens[0].text) orelse {
self.addError(line_num, "undefined label", tokens[0].text);
return null;
};
if (addr > 255) {
self.addError(line_num, "label address exceeds 8-bit range", tokens[0].text);
return null;
}
return @truncate(addr);
}
self.addError(line_num, "expected label or address", tokens[0].text);
return null;
}
The parseDstSrc function handles the most complex case: instructions with two operands separated by a comma. The second operand can be a register (ADD R0, R1), an immediate (ADD R0, 5), or a label reference (MOV R0, data_addr). Label references get resolved through the symbol table, which was populated in pass 1. If the label doesn't exist, we report "undefined label" with the offending name.
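The register operands are the simplest piece of this, and the conversion from text to index can be sketched on its own. The helper name registerIndex is hypothetical; it combines the tokenizer's R0-R7 check and the digit subtraction from the parsers above into one function:

```zig
const std = @import("std");

/// Hypothetical standalone version of the register check: "R0".."R7"
/// in either case map to 0..7, anything else is not a register.
fn registerIndex(word: []const u8) ?u3 {
    if (word.len != 2) return null;
    if (word[0] != 'R' and word[0] != 'r') return null;
    if (word[1] < '0' or word[1] > '7') return null;
    const index: u3 = @intCast(word[1] - '0'); // safe: the digit is 0-7
    return index;
}

test "register names map to indices" {
    try std.testing.expectEqual(@as(?u3, 0), registerIndex("R0"));
    try std.testing.expectEqual(@as(?u3, 7), registerIndex("r7"));
    try std.testing.expectEqual(@as(?u3, null), registerIndex("R8"));
    try std.testing.expectEqual(@as(?u3, null), registerIndex("R10"));
}
```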
There's an important constraint here: because our instruction format uses 8 bits for immediate values, label addresses must fit in 0-255. That means every label has to sit within the first 128 instructions (256 bytes / 2 bytes per instruction), which in practice caps our programs at that size. A real assembler would use wider instruction formats or relative addressing to handle larger programs. For our teaching project, 128 instructions is more than enough for any example.
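The 8-bit limit shows up in two places: numeric literals and label addresses. A quick sketch of both checks, using hypothetical helper names parseImm8 and fitsImm8:

```zig
const std = @import("std");

/// Hypothetical helper: the u8 target type makes parseInt reject
/// anything above 255, so no separate range check is needed.
fn parseImm8(text: []const u8) ?u8 {
    return std.fmt.parseInt(u8, text, 10) catch null;
}

/// Label addresses are u16 in the symbol table but must squeeze
/// through the same 8-bit field.
fn fitsImm8(address: u16) bool {
    return address <= std.math.maxInt(u8);
}

test "8-bit operand field" {
    try std.testing.expectEqual(@as(?u8, 255), parseImm8("255"));
    try std.testing.expectEqual(@as(?u8, null), parseImm8("300"));
    try std.testing.expect(fitsImm8(254)); // address of the 128th instruction
    try std.testing.expect(!fitsImm8(256)); // the 129th is out of reach
}
```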
The mnemonic-to-opcode lookup
This is straightforward -- map text to the opcode enum, case-insensitive:
fn mnemonicToOpcode(text: []const u8) ?Opcode {
const pairs = [_]struct { name: []const u8, op: Opcode }{
.{ .name = "HLT", .op = .hlt },
.{ .name = "MOV", .op = .mov },
.{ .name = "ADD", .op = .add },
.{ .name = "SUB", .op = .sub },
.{ .name = "MUL", .op = .mul },
.{ .name = "CMP", .op = .cmp },
.{ .name = "JMP", .op = .jmp },
.{ .name = "JEQ", .op = .jeq },
.{ .name = "JNE", .op = .jne },
.{ .name = "LOAD", .op = .load },
.{ .name = "STORE", .op = .store },
.{ .name = "PUSH", .op = .push },
.{ .name = "POP", .op = .pop },
.{ .name = "CALL", .op = .call },
.{ .name = "RET", .op = .ret },
.{ .name = "NOP", .op = .nop },
};
for (pairs) |pair| {
if (std.ascii.eqlIgnoreCase(pair.name, text)) return pair.op;
}
return null;
}
We use std.ascii.eqlIgnoreCase for the comparison so MOV, mov, and Mov all match. The linear scan through 16 entries is fast enough -- for a table this small, a hash map is unlikely to be any faster once you account for hashing overhead. You'd need a much larger mnemonic set before hashing becomes clearly worthwhile.
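For comparison, here is a different way to do the same lookup, not the one used in this project: lowercase the word into a small stack buffer and let std.meta.stringToEnum match it against the enum's tag names. DemoOpcode is a cut-down stand-in for episode 59's Opcode, assumed here to have lowercase tags like the real one:

```zig
const std = @import("std");

// Assumption: a reduced stand-in for the real Opcode enum.
const DemoOpcode = enum { hlt, mov, add, jne, store };

/// Alternative sketch: lowercase a copy of the word, then match it
/// against the enum tag names. The caller's text is left untouched.
fn lookupMnemonic(text: []const u8) ?DemoOpcode {
    var buf: [8]u8 = undefined;
    if (text.len > buf.len) return null; // longer than any mnemonic
    const lowered = std.ascii.lowerString(&buf, text);
    return std.meta.stringToEnum(DemoOpcode, lowered);
}

test "case-insensitive lookup via stringToEnum" {
    try std.testing.expectEqual(@as(?DemoOpcode, .mov), lookupMnemonic("Mov"));
    try std.testing.expectEqual(@as(?DemoOpcode, .store), lookupMnemonic("STORE"));
    try std.testing.expectEqual(@as(?DemoOpcode, null), lookupMnemonic("MOVE"));
}
```

The upside is that the table can't drift out of sync with the enum. The downside is that mnemonic spellings become tied to tag names, which the explicit pair table avoids.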
Putting it together: the assemble function
The public API ties pass 1 and pass 2 together:
fn assemble(self: *Assembler, source: []const u8) !AssemblerResult {
// pass 1: collect labels
try self.pass1(source);
if (self.errors.items.len > 0) {
return .{ .failure = .{ .errors = self.errors.items } };
}
// pass 2: encode instructions
try self.pass2();
if (self.errors.items.len > 0) {
return .{ .failure = .{ .errors = self.errors.items } };
}
return .{
.success = .{
.binary = self.output.items,
.instruction_count = self.output.items.len / 2,
},
};
}
};
We check for errors after each pass. Pass 1 errors (like duplicate labels) mean pass 2 would operate on a broken symbol table, so we bail early. Pass 2 errors (undefined labels, bad operands) are collected and reported together.
The result includes both the raw binary and the instruction count. The binary is the output ArrayList's items slice -- valid as long as the Assembler hasn't been deinitialized. The caller should copy it if they need it to outlive the assembler.
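Copying the binary is a one-liner with the allocator's dupe. A minimal sketch, where keepBinary is a hypothetical helper name and the scratch array stands in for the assembler's output buffer:

```zig
const std = @import("std");

/// Sketch: copy a borrowed slice so it can outlive its owner.
/// The caller frees the result with the same allocator.
fn keepBinary(allocator: std.mem.Allocator, borrowed: []const u8) ![]u8 {
    return allocator.dupe(u8, borrowed);
}

test "copy survives the original buffer being overwritten" {
    var scratch = [_]u8{ 0x2A, 0x10, 0x00, 0x00 };
    const owned = try keepBinary(std.testing.allocator, &scratch);
    defer std.testing.allocator.free(owned);

    scratch[0] = 0xFF; // simulate the owner reusing its memory
    try std.testing.expectEqual(@as(u8, 0x2A), owned[0]);
    try std.testing.expectEqual(@as(usize, 4), owned.len);
}
```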
Why two passes: forward references in action
Let me show concretely why one pass isn't enough. Consider this program:
JNE skip ; line 1: jump forward to 'skip'
MOV R0, 99 ; line 2: this should be skipped
skip: ; line 3: label definition, address 0x04
MOV R0, 42 ; line 4: this is the target
HLT ; line 5
When a single-pass assembler hits line 1, it sees JNE skip. But skip: is defined on line 3 -- we haven't seen it yet. The assembler doesn't know what address to encode for the jump target. You could emit a placeholder and go back to fix it later (that's called "backpatching" and some assemblers do use it), but the two-pass approach is cleaner: pass 1 discovers that skip is at address 0x04 (two instructions before it, 2 bytes each = 4), and pass 2 uses that information to encode JNE 4.
With our assembler both forward and backward references work identically. The symbol table doesn't care about order -- by the time pass 2 runs, every label in the program has been resolved.
Having said that, backpatching has advantages in some contexts. If the program is very large, reading it twice is slower than reading it once. Linkers (which process object files from separate compilation units) almost always use backpatching with relocation tables. But for an assembler processing a single source file, two passes are simple, correct, and fast enough.
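For the curious, this is roughly what backpatching looks like. It is a sketch of the alternative technique, not something our assembler does. It works on a made-up, already-parsed program (the Item union) and writes one target byte per instruction, assuming the output buffer is large enough and every address fits in a u8:

```zig
const std = @import("std");

// Hypothetical pre-parsed program: label definitions, jumps, and
// "everything else".
const Item = union(enum) { label: []const u8, jump: []const u8, other };
const Fixup = struct { slot: usize, name: []const u8 };

/// Single pass with backpatching. Writes one target byte per
/// instruction into `out` (0 for non-jumps) and returns the used part.
fn assembleOnePass(allocator: std.mem.Allocator, items: []const Item, out: []u8) ![]u8 {
    var labels = std.StringHashMap(u8).init(allocator);
    defer labels.deinit();
    var fixups: [16]Fixup = undefined;
    var fixup_count: usize = 0;
    var n: usize = 0; // instructions emitted so far

    for (items) |item| {
        switch (item) {
            .label => |name| try labels.put(name, @intCast(n * 2)),
            .other => {
                out[n] = 0;
                n += 1;
            },
            .jump => |name| {
                if (labels.get(name)) |addr| {
                    out[n] = addr; // backward reference: already known
                } else {
                    if (fixup_count == fixups.len) return error.TooManyFixups;
                    out[n] = 0; // placeholder, patched after the loop
                    fixups[fixup_count] = .{ .slot = n, .name = name };
                    fixup_count += 1;
                }
                n += 1;
            },
        }
    }
    // every label has been seen now, so fill in the placeholders
    for (fixups[0..fixup_count]) |f| {
        out[f.slot] = labels.get(f.name) orelse return error.UndefinedLabel;
    }
    return out[0..n];
}

test "forward and backward jumps both resolve" {
    const program = [_]Item{
        .{ .jump = "skip" }, // forward reference, patched later
        .other,
        .{ .label = "skip" }, // address 4
        .other,
        .{ .jump = "skip" }, // backward reference, known immediately
    };
    var buf: [8]u8 = undefined;
    const targets = try assembleOnePass(std.testing.allocator, &program, &buf);
    try std.testing.expectEqualSlices(u8, &[_]u8{ 4, 0, 0, 4 }, targets);
}
```

The extra bookkeeping (the fixup list and the patch loop) is the price for reading the input only once.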
Reading a file and writing binary output
Let's wire up file I/O so we can assemble real .asm files into .bin files:
pub fn main() !void {
var gpa = std.heap.GeneralPurposeAllocator(.{}){};
defer {
const check = gpa.deinit();
if (check == .leak) @panic("memory leak detected");
}
const allocator = gpa.allocator();
// get command-line arguments
const args = try std.process.argsAlloc(allocator);
defer std.process.argsFree(allocator, args);
if (args.len < 2) {
std.debug.print("Usage: assembler <input.asm> [output.bin]\n", .{});
std.process.exit(1);
}
const input_path = args[1];
const output_path = if (args.len >= 3) args[2] else "out.bin";
// read input file
const source = std.fs.cwd().readFileAlloc(
allocator,
input_path,
1024 * 1024, // 1MB max source file
) catch |err| {
std.debug.print("error: could not read '{s}': {}\n", .{ input_path, err });
std.process.exit(1);
};
defer allocator.free(source);
// assemble
var asm_state = Assembler.init(allocator);
defer asm_state.deinit();
const result = try asm_state.assemble(source);
switch (result) {
.failure => |f| {
std.debug.print("Assembly failed with {d} error(s):\n", .{f.errors.len});
for (f.errors) |err| {
err.format(std.io.getStdErr().writer()) catch {};
}
std.process.exit(1);
},
.success => |s| {
// write binary output
const out_file = try std.fs.cwd().createFile(output_path, .{});
defer out_file.close();
try out_file.writeAll(s.binary);
std.debug.print("Assembled {d} instructions ({d} bytes) -> {s}\n", .{
s.instruction_count,
s.binary.len,
output_path,
});
// also dump symbol table for debugging
asm_state.symbols.dump();
},
}
}
The readFileAlloc function reads the whole file into a heap-allocated buffer. We set a 1MB limit -- more than enough for any assembly source file we'd ever write. The file I/O pattern here is what we established in episode 10, but now we're using it in a real tool.
The output is a flat binary blob. No headers, no sections, no metadata -- just raw instruction bytes. You load this into the VM's memory starting at address 0 and run it. Real object file formats (ELF, Mach-O, PE) add headers describing sections, symbols, and relocations, but those are complexities for a different project. Our flat binary is the simplest useful output format.
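Since the file is nothing but 16-bit words stored low byte first, the byte order is the one thing a loader has to agree on. A small sketch of the write side (the same two truncations pass 2 uses) together with the matching read, where putWordLe and getWordLe are hypothetical names and the read side is an assumption about how a loader would reassemble the word:

```zig
const std = @import("std");

/// Stores one 16-bit instruction word as two bytes, low byte first.
fn putWordLe(out: *[2]u8, word: u16) void {
    out[0] = @truncate(word); // low 8 bits
    out[1] = @truncate(word >> 8); // high 8 bits
}

/// Inverse operation: rebuilds the word from its two bytes.
fn getWordLe(bytes: *const [2]u8) u16 {
    return @as(u16, bytes[0]) | (@as(u16, bytes[1]) << 8);
}

test "little-endian round trip" {
    var bytes: [2]u8 = undefined;
    putWordLe(&bytes, 0x1234);
    try std.testing.expectEqualSlices(u8, &[_]u8{ 0x34, 0x12 }, &bytes);
    try std.testing.expectEqual(@as(u16, 0x1234), getWordLe(&bytes));
}
```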
Testing the assembler: end-to-end
The real test is: assemble source text, load the binary into the VM, run it, check register state. This validates the entire pipeline -- tokenizer, pass 1, pass 2, encoder, and the VM from episode 59:
const testing = std.testing;
// VM is the virtual machine from episode 59
test "assemble and run: sum 1 to 10" {
const allocator = testing.allocator;
const source =
\\; Sum numbers 1 through 10
\\ MOV R0, 0
\\ MOV R1, 1
\\ MOV R2, 10
\\loop:
\\ ADD R0, R1
\\ ADD R1, 1
\\ CMP R1, R2
\\ JNE loop
\\ ADD R0, R1
\\ HLT
;
var asm_state = Assembler.init(allocator);
defer asm_state.deinit();
const result = try asm_state.assemble(source);
switch (result) {
.failure => |f| {
for (f.errors) |err| {
std.debug.print("ASM ERROR: line {d}: {s}\n", .{ err.line, err.message });
}
return error.AssemblyFailed;
},
.success => |s| {
// 3x MOV + 2x ADD + CMP + JNE + ADD + HLT = 9 instructions
try testing.expectEqual(@as(usize, 9), s.instruction_count);
// load into VM and run
var vm = try VM.init(allocator);
defer vm.deinit();
@memcpy(vm.memory[0..s.binary.len], s.binary);
vm.run();
// 1+2+3+4+5+6+7+8+9+10 = 55
try testing.expectEqual(@as(u16, 55), vm.regs.get(0));
},
}
}
test "assemble and run: forward reference" {
const allocator = testing.allocator;
const source =
\\ JMP skip
\\ MOV R0, 99
\\skip:
\\ MOV R0, 42
\\ HLT
;
var asm_state = Assembler.init(allocator);
defer asm_state.deinit();
const result = try asm_state.assemble(source);
switch (result) {
.failure => return error.AssemblyFailed,
.success => |s| {
var vm = try VM.init(allocator);
defer vm.deinit();
@memcpy(vm.memory[0..s.binary.len], s.binary);
vm.run();
// MOV R0, 99 should have been skipped
try testing.expectEqual(@as(u16, 42), vm.regs.get(0));
},
}
}
test "assemble and run: subroutine with CALL/RET" {
const allocator = testing.allocator;
const source =
\\ MOV R0, 7
\\ MOV R1, 3
\\ CALL multiply
\\ HLT
\\multiply:
\\ MUL R0, R1
\\ RET
;
var asm_state = Assembler.init(allocator);
defer asm_state.deinit();
const result = try asm_state.assemble(source);
switch (result) {
.failure => return error.AssemblyFailed,
.success => |s| {
var vm = try VM.init(allocator);
defer vm.deinit();
@memcpy(vm.memory[0..s.binary.len], s.binary);
vm.run();
try testing.expectEqual(@as(u16, 21), vm.regs.get(0));
},
}
}
The first test assembles the same loop program from episode 59, but now written in proper assembly text instead of helper function calls. The result should be identical: R0 = 55. If the assembler produces even one wrong bit in any instruction, the VM will compute a different result and the test fails. This is the beauty of end-to-end testing -- it catches bugs anywhere in the chain.
The forward reference test is the one that would break a single-pass assembler. JMP skip on line 1 references skip: which is defined on line 3. If the symbol table wasn't populated in pass 1, pass 2 would report "undefined label 'skip'" and fail. The test proves our two-pass approach works.
The subroutine test uses CALL multiply -- a forward reference to a label. The assembler resolves multiply to the address of the MUL instruction, encodes it as the CALL target, and the VM pushes the return address, jumps to the multiply routine, executes MUL, and RETs back. R0 = 7 * 3 = 21.
Testing error cases
Good assemblers produce helpful error messages. Let's verify ours does:
test "error: undefined label" {
const allocator = testing.allocator;
const source =
\\ JMP nowhere
\\ HLT
;
var asm_state = Assembler.init(allocator);
defer asm_state.deinit();
const result = try asm_state.assemble(source);
switch (result) {
.failure => |f| {
try testing.expectEqual(@as(usize, 1), f.errors.len);
try testing.expectEqualStrings("undefined label", f.errors[0].message);
},
.success => return error.ExpectedFailure,
}
}
test "error: duplicate label" {
const allocator = testing.allocator;
const source =
\\foo:
\\ MOV R0, 1
\\foo:
\\ MOV R1, 2
\\ HLT
;
var asm_state = Assembler.init(allocator);
defer asm_state.deinit();
const result = try asm_state.assemble(source);
switch (result) {
.failure => |f| {
try testing.expectEqual(@as(usize, 1), f.errors.len);
try testing.expectEqualStrings("duplicate label", f.errors[0].message);
},
.success => return error.ExpectedFailure,
}
}
test "error: immediate out of range" {
const allocator = testing.allocator;
const source =
\\ MOV R0, 300
\\ HLT
;
var asm_state = Assembler.init(allocator);
defer asm_state.deinit();
const result = try asm_state.assemble(source);
switch (result) {
.failure => |f| {
try testing.expect(f.errors.len > 0);
},
.success => return error.ExpectedFailure,
}
}
test "label on same line as instruction" {
const allocator = testing.allocator;
const source =
\\start: MOV R0, 5
\\ ADD R0, 10
\\ HLT
;
var asm_state = Assembler.init(allocator);
defer asm_state.deinit();
const result = try asm_state.assemble(source);
switch (result) {
.failure => return error.AssemblyFailed,
.success => |s| {
var vm = try VM.init(allocator);
defer vm.deinit();
@memcpy(vm.memory[0..s.binary.len], s.binary);
vm.run();
try testing.expectEqual(@as(u16, 15), vm.regs.get(0));
},
}
}
Error tests are just as important as success tests. The undefined label test ensures we catch references to labels that were never defined. The duplicate label test catches the case where someone accidentally reuses a label name. The out-of-range test verifies we don't silently truncate values that don't fit in 8 bits. And the "label on same line" test verifies that start: MOV R0, 5 works correctly -- the label gets the address of the MOV instruction, not the next one.
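The out-of-range check can be sketched in isolation. The helper below is hypothetical (`parseImm8` and its error names are not taken from the assembler above); it only illustrates the idea of rejecting a value instead of truncating it to 8 bits.

```zig
const std = @import("std");

// Hypothetical error names, chosen for this sketch only.
const ImmediateError = error{ InvalidImmediate, ImmediateOutOfRange };

/// Hypothetical helper: parse a decimal immediate and refuse anything that
/// does not fit in an 8-bit field, rather than silently truncating it.
fn parseImm8(text: []const u8) ImmediateError!u8 {
    // Parse into a wider type first so that 300 is seen as 300, not as 44.
    const value = std.fmt.parseInt(u32, text, 10) catch |err| switch (err) {
        error.Overflow => return error.ImmediateOutOfRange,
        error.InvalidCharacter => return error.InvalidImmediate,
    };
    if (value > std.math.maxInt(u8)) return error.ImmediateOutOfRange;
    return @intCast(value);
}
```

Parsing into `u32` before the range check is the point of the sketch: the check happens on the value the programmer wrote, so `MOV R0, 300` becomes a reported error instead of a quietly wrong program.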
Comparing hand-assembled vs text-assembled output
Here's a satisfying verification: the assembler should produce the exact same bytes as the hand-built programs from episode 59. Let's test that:
test "assembler output matches hand-built program" {
const allocator = testing.allocator;
// This is the hand-built program from episode 59
const hand_built = [_]u16{
asm_builder.mov_imm(0, 5),
asm_builder.mov_imm(1, 10),
asm_builder.mov_imm(2, 20),
asm_builder.add_reg(0, 1),
asm_builder.add_reg(0, 2),
asm_builder.halt(),
};
// Same program in assembly text
const source =
\\ MOV R0, 5
\\ MOV R1, 10
\\ MOV R2, 20
\\ ADD R0, R1
\\ ADD R0, R2
\\ HLT
;
var asm_state = Assembler.init(allocator);
defer asm_state.deinit();
const result = try asm_state.assemble(source);
switch (result) {
.failure => return error.AssemblyFailed,
.success => |s| {
try testing.expectEqual(hand_built.len * 2, s.binary.len);
// compare word by word
for (hand_built, 0..) |expected_word, i| {
const offset = i * 2;
const lo: u16 = s.binary[offset];
const hi: u16 = s.binary[offset + 1];
const actual_word = lo | (hi << 8);
try testing.expectEqual(expected_word, actual_word);
}
},
}
}
If this test passes, we know the assembler produces byte-identical output to the programmatic encoder. Same instruction encoding, same byte order, same everything. The assembler is just a friendlier interface to the same binary format.
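The word-by-word comparison above reads the low byte first, which implies little-endian storage. As a small sketch of that byte order, the two hypothetical helpers below (`wordToBytes` and `bytesToWord` are names made up for illustration) round-trip a 16-bit word using `std.mem.writeInt` and `std.mem.readInt`.

```zig
const std = @import("std");

/// Hypothetical helper: split one 16-bit instruction word into two bytes,
/// low byte first (little-endian), matching the order the test above reads.
fn wordToBytes(word: u16) [2]u8 {
    var bytes: [2]u8 = undefined;
    std.mem.writeInt(u16, &bytes, word, .little);
    return bytes;
}

/// Hypothetical helper: rebuild the word from its two bytes.
/// Equivalent to `lo | (hi << 8)` in the test above.
fn bytesToWord(bytes: [2]u8) u16 {
    return std.mem.readInt(u16, &bytes, .little);
}
```

For example, the word `0x1234` is stored as the bytes `0x34, 0x12`, and reading those two bytes back yields `0x1234` again.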
A more complex program: the Fibonacci sequence
Let's assemble something non-trivial that exercises multiple features -- labels, forward and backward references, conditional branches, and a loop:
test "assemble and run: fibonacci" {
const allocator = testing.allocator;
const source =
\\; Compute 8th Fibonacci number
\\; R0 = current, R1 = previous, R2 = counter, R3 = limit, R4 = temp
\\ MOV R0, 1 ; fib(1) = 1
\\ MOV R1, 0 ; fib(0) = 0
\\ MOV R2, 1 ; counter starts at 1
\\ MOV R3, 8 ; compute fib(8)
\\fib_loop:
\\ CMP R2, R3
\\ JEQ done
\\ MOV R4, R0 ; temp = current
\\ ADD R0, R1 ; current = current + previous
\\ MOV R1, R4 ; previous = temp
\\ ADD R2, 1 ; counter++
\\ JMP fib_loop
\\done:
\\ HLT
;
var asm_state = Assembler.init(allocator);
defer asm_state.deinit();
const result = try asm_state.assemble(source);
switch (result) {
.failure => |f| {
for (f.errors) |err| {
std.debug.print("ERROR line {d}: {s}\n", .{ err.line, err.message });
}
return error.AssemblyFailed;
},
.success => |s| {
var vm = try VM.init(allocator);
defer vm.deinit();
@memcpy(vm.memory[0..s.binary.len], s.binary);
vm.run();
// fib(8) = 21
try testing.expectEqual(@as(u16, 21), vm.regs.get(0));
},
}
}
This computes fib(8) = 21 using the iterative method. The program has both a forward reference (JEQ done jumps past the loop body to a label defined later) and a backward reference (JMP fib_loop jumps back to the top of the loop). Both get resolved correctly through the symbol table.
Five registers are in use simultaneously (R0-R4), the loop uses CMP/JEQ for the exit condition and JMP for the back-edge, and MOV shuffles values between registers for the swap. This is about as complex as you can get in 12 instructions, and it exercises most of our instruction set. Kind of wild that we can compute Fibonacci numbers using nothing but the types and functions we wrote ourselves, from bit encoding all the way up ;-)
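If the register shuffling is hard to follow, it can help to read the same loop as plain Zig. The function below is a hypothetical reference model, not code from the series: each local variable stands in for one register, and the comments name the instruction each line mirrors. It is only meant for limits from 1 to 24, since larger results no longer fit in a `u16`.

```zig
const std = @import("std");

/// Hypothetical reference model of the assembly loop above.
/// Valid for limits 1..24; beyond that the result exceeds u16.
fn fibModel(limit: u16) u16 {
    var current: u16 = 1; // MOV R0, 1
    var previous: u16 = 0; // MOV R1, 0
    var counter: u16 = 1; // MOV R2, 1
    // CMP R2, R3 / JEQ done is the loop condition,
    // ADD R2, 1 / JMP fib_loop is the continue expression.
    while (counter != limit) : (counter += 1) {
        const temp = current; // MOV R4, R0
        current += previous; // ADD R0, R1
        previous = temp; // MOV R1, R4
    }
    return current;
}
```

Calling `fibModel(8)` runs the body seven times and returns 21, the same value the test expects in R0.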
What we learned
- A two-pass assembler solves the forward reference problem: pass 1 collects all label addresses into a symbol table, pass 2 uses that table to encode jump targets that reference labels defined later in the source
- Assembly text syntax needs a tokenizer that handles labels (ending with :), mnemonics (case-insensitive), registers (R0-R7), immediates (decimal numbers), and identifiers (label references in operand position)
- The symbol table maps label names to byte addresses using a StringHashMap -- keys are slices into the source text so no string copying is needed
- Error collection (rather than fail-on-first-error) gives the programmer feedback on multiple problems at once, matching how real assemblers like NASM behave
- Operand parsing dispatches by opcode category: no-arg (HLT, NOP, RET), single-target (JMP, CALL), single-register (PUSH, POP), and dst-source pairs (MOV, ADD, etc.)
- The assembler output is a flat binary blob: raw encoded instructions in memory order, no headers, loadable directly into the VM at address 0
- End-to-end tests (assemble text, load binary into VM, run, check registers) validate the entire pipeline and catch bugs in any layer
- Comparing assembler output against hand-built programs from episode 59 proves byte-level correctness of the encoding
This is part 2 of Project G. We have a working two-pass assembler that turns human-readable text into binary programs our VM can execute. Next time we'll build the reverse: a disassembler that reads binary and produces readable assembly text, plus a binary inspector that shows you exactly what's encoded at each address. That closes the loop -- text to binary to text again.
Thanks for reading!