Learn Zig Series (#38) - Markdown to HTML: Parser and AST
Project A: Markdown to HTML Converter (2/3)
What will I learn
- You will learn defining an AST (Abstract Syntax Tree) for Markdown documents;
- You will learn recursive descent parsing from the token stream;
- You will learn nesting: lists inside lists, bold inside links;
- You will learn paragraph detection: consecutive text lines merge into one paragraph;
- You will learn code block parsing: fenced blocks with language tags;
- You will learn building the tree with arena allocation for simple cleanup;
- You will learn visitor pattern for walking the AST;
- You will learn testing: parse known Markdown, verify AST structure.
Requirements
- A working modern computer running macOS, Windows or Ubuntu;
- An installed Zig 0.14+ distribution (download from ziglang.org);
- The ambition to learn Zig programming.
Difficulty
- Intermediate
Curriculum (of the Learn Zig Series):
- Zig Programming Tutorial - ep001 - Intro
- Learn Zig Series (#2) - Hello Zig, Variables and Types
- Learn Zig Series (#3) - Functions and Control Flow
- Learn Zig Series (#4) - Error Handling (Zig's Best Feature)
- Learn Zig Series (#5) - Arrays, Slices, and Strings
- Learn Zig Series (#6) - Structs, Enums, and Tagged Unions
- Learn Zig Series (#7) - Memory Management and Allocators
- Learn Zig Series (#8) - Pointers and Memory Layout
- Learn Zig Series (#9) - Comptime (Zig's Superpower)
- Learn Zig Series (#10) - Project Structure, Modules, and File I/O
- Learn Zig Series (#11) - Mini Project: Building a Step Sequencer
- Learn Zig Series (#12) - Testing and Test-Driven Development
- Learn Zig Series (#13) - Interfaces via Type Erasure
- Learn Zig Series (#14) - Generics with Comptime Parameters
- Learn Zig Series (#15) - The Build System (build.zig)
- Learn Zig Series (#16) - Sentinel-Terminated Types and C Strings
- Learn Zig Series (#17) - Packed Structs and Bit Manipulation
- Learn Zig Series (#18) - Async Concepts and Event Loops
- Learn Zig Series (#18b) - Addendum: Async Returns in Zig 0.16
- Learn Zig Series (#19) - SIMD with @Vector
- Learn Zig Series (#20) - Working with JSON
- Learn Zig Series (#21) - Networking and TCP Sockets
- Learn Zig Series (#22) - Hash Maps and Data Structures
- Learn Zig Series (#23) - Iterators and Lazy Evaluation
- Learn Zig Series (#24) - Logging, Formatting, and Debug Output
- Learn Zig Series (#25) - Mini Project: HTTP Status Checker
- Learn Zig Series (#26) - Writing a Custom Allocator
- Learn Zig Series (#27) - C Interop: Calling C from Zig
- Learn Zig Series (#28) - C Interop: Exposing Zig to C
- Learn Zig Series (#29) - Inline Assembly and Low-Level Control
- Learn Zig Series (#30) - Thread Safety and Atomics
- Learn Zig Series (#31) - Memory-Mapped I/O and Files
- Learn Zig Series (#32) - Compile-Time Reflection with @typeInfo
- Learn Zig Series (#33) - Building a State Machine with Tagged Unions
- Learn Zig Series (#34) - Performance Profiling and Optimization
- Learn Zig Series (#35) - Cross-Compilation and Target Triples
- Learn Zig Series (#36) - Mini Project: CLI Task Runner
- Learn Zig Series (#37) - Markdown to HTML: Tokenizer and Lexer
- Learn Zig Series (#38) - Markdown to HTML: Parser and AST (this post)
Learn Zig Series (#38) - Markdown to HTML: Parser and AST
Welcome back to part two of our Markdown to HTML converter! ;-)
Last episode we built a tokenizer that chews through raw Markdown text and spits out a flat stream of tokens -- headings, bold markers, links, code blocks, all neatly labeled. But a flat list of tokens doesn't know anything about structure. It can't tell you that "this bold text is inside a paragraph" or "these three list items belong to the same list". For that, we need a parser that takes the token stream and builds a tree.
That tree is called an AST -- Abstract Syntax Tree. If you've ever looked at how compilers work (or played with any language tooling at all, really), you've encountered ASTs. They're everywhere. The idea is simple: your document has a hierarchical structure, and a tree is the natural data structure to represent hierarchies. A Markdown document contains paragraphs. Paragraphs contain inline elements (bold, italic, links, code spans). Lists contain list items. List items contain inline elements too. It's trees all the way down.
In this episode we'll define the AST node types, write a recursive descent parser that consumes our token stream, handle all the tricky nesting cases, and test the whole thing. By the end, we'll have a structured tree that the renderer (next episode) can walk to produce clean HTML.
Defining the AST node types
First, what kinds of nodes does our tree need? Looking at the token types we defined in episode 37, we can group them into two categories: block nodes (things that occupy their own vertical space -- paragraphs, headings, lists, code blocks) and inline nodes (things that live within a block -- text, bold, italic, links, code spans).
Here's the node type enum:
const std = @import("std");
const Token = @import("tokenizer.zig").Token;
const TokenType = @import("tokenizer.zig").TokenType;
pub const NodeType = enum {
// Block nodes
document,
heading,
paragraph,
code_block,
unordered_list,
ordered_list,
list_item,
blockquote,
horizontal_rule,
// Inline nodes
text,
bold,
italic,
code_span,
link,
image,
line_break,
};
And the node struct itself. Each node has a type, optional content (for leaf nodes like text and code), optional metadata (heading level, code block language, link URL), and a list of children:
pub const Node = struct {
node_type: NodeType,
content: []const u8,
children: std.ArrayList(*Node),
// Metadata
level: u8, // heading level (1-6), list nesting depth
language: []const u8, // code block language tag
url: []const u8, // link/image URL
alt_text: []const u8, // image alt text
pub fn init(allocator: std.mem.Allocator, node_type: NodeType) !*Node {
const node = try allocator.create(Node);
node.* = .{
.node_type = node_type,
.content = "",
.children = std.ArrayList(*Node).init(allocator),
.level = 0,
.language = "",
.url = "",
.alt_text = "",
};
return node;
}
pub fn addChild(self: *Node, child: *Node) !void {
try self.children.append(child);
}
pub fn childCount(self: *const Node) usize {
return self.children.items.len;
}
};
Notice that we're using allocator.create(Node) to heap-allocate every node, and nodes hold *Node pointers to their children. This is a classic tree representation. Every node owns an ArrayList of child pointers, and the parent-child relationship is implicit through the list ordering.
Why not use a tagged union for the node types (like we did for the state machine in episode 33)? You could -- and for a production parser you might want the type safety of making it impossible to access .language on a text node. But for this project, the struct-with-optional-fields approach is simpler to work with and extend. Different tradeoff, both valid.
The Parser struct and recursive descent
The parser consumes tokens one at a time and builds up the AST. It's a recursive descent parser -- meaning each grammar rule (document, paragraph, inline content) is implemented as a function that may call other parsing functions recursively. This is the most intuitive parsing technique and works beautifully for Markdown because the format is fairly simple structuraly.
pub const Parser = struct {
tokens: []const Token,
pos: usize,
allocator: std.mem.Allocator,
pub fn init(allocator: std.mem.Allocator, tokens: []const Token) Parser {
return .{
.tokens = tokens,
.pos = 0,
.allocator = allocator,
};
}
fn peek(self: *const Parser) ?Token {
if (self.pos >= self.tokens.len) return null;
return self.tokens[self.pos];
}
fn advance(self: *Parser) ?Token {
if (self.pos >= self.tokens.len) return null;
const tok = self.tokens[self.pos];
self.pos += 1;
return tok;
}
fn expect(self: *Parser, expected: TokenType) ?Token {
const tok = self.peek() orelse return null;
if (tok.type == expected) {
return self.advance();
}
return null;
}
fn atEnd(self: *const Parser) bool {
if (self.pos >= self.tokens.len) return true;
return self.tokens[self.pos].type == .eof;
}
};
This should look familiar -- the pattern is almost identical to the tokenizer's peek/advance, just operating on tokens instead of characters. The expect function is a convenience: it peeks at the next token, checks if it matches what we want, and only consumes it if it does. If the token doesn't match, we return null and the caller knows to try something else.
The main parsing entry point builds a document node and then parses blocks until EOF:
pub fn parse(self: *Parser) !*Node {
const doc = try Node.init(self.allocator, .document);
while (!self.atEnd()) {
// Skip bare line breaks between blocks
if (self.peek()) |tok| {
if (tok.type == .line_break) {
_ = self.advance();
continue;
}
}
const block = try self.parseBlock();
if (block) |b| {
try doc.addChild(b);
}
}
return doc;
}
Parsing block-level elements
Each block type has its own parsing function. The parseBlock function looks at the current token and dispatches accordingly:
fn parseBlock(self: *Parser) !?*Node {
const tok = self.peek() orelse return null;
return switch (tok.type) {
.heading => try self.parseHeading(),
.code_block_start => try self.parseCodeBlock(),
.unordered_list_item => try self.parseUnorderedList(),
.ordered_list_item => try self.parseOrderedList(),
.blockquote => try self.parseBlockquote(),
.horizontal_rule => try self.parseHorizontalRule(),
// Anything else starts a paragraph
.text, .bold_start, .italic_start, .code_span,
.link_text_start, .image_start => try self.parseParagraph(),
// Skip tokens we don't handle at block level
else => blk: {
_ = self.advance();
break :blk null;
},
};
}
fn parseHeading(self: *Parser) !*Node {
const tok = self.advance().?; // consume heading token
const node = try Node.init(self.allocator, .heading);
node.level = @intCast(tok.text.len); // number of # chars = heading level
// Parse inline content until line break or EOF
try self.parseInlineContent(node);
return node;
}
fn parseCodeBlock(self: *Parser) !*Node {
const start_tok = self.advance().?; // consume code_block_start
const node = try Node.init(self.allocator, .code_block);
// Extract language from the opening fence
// The token text is something like "```zig" or "```"
const fence_text = start_tok.text;
var lang_start: usize = 0;
for (fence_text, 0..) |c, i| {
if (c != '`' and c != '~') {
lang_start = i;
break;
}
}
if (lang_start > 0) {
node.language = std.mem.trim(u8, fence_text[lang_start..], " \t\n");
}
// Grab the code content
if (self.peek()) |content_tok| {
if (content_tok.type == .code_block_content) {
node.content = content_tok.text;
_ = self.advance();
}
}
// Consume closing fence
_ = self.expect(.code_block_end);
return node;
}
fn parseHorizontalRule(self: *Parser) !*Node {
_ = self.advance(); // consume the horizontal_rule token
return try Node.init(self.allocator, .horizontal_rule);
}
The heading parser is straightforward: consume the heading token, check how many # characters it had (that's the level), then parse inline content until the end of the line. The code block parser pulls out the language tag from the opening fence and grabs the content as-is -- no inline parsing inside code blocks, everything is literal.
Parsing lists -- the nesting challenge
Lists are where things get genuinely tricky. In Markdown, consecutive - item lines form a single list. But lists can also nest: indented items create sub-lists inside parent items. Our tokenizer doesn't track indentation (we kept it simple in ep37), so for this version we'll handle flat lists -- consecutive list items grouped into one list node.
fn parseUnorderedList(self: *Parser) !*Node {
const list = try Node.init(self.allocator, .unordered_list);
while (self.peek()) |tok| {
if (tok.type != .unordered_list_item) break;
_ = self.advance(); // consume the list marker
const item = try Node.init(self.allocator, .list_item);
try self.parseInlineContent(item);
try list.addChild(item);
// Skip line break after list item
_ = self.expect(.line_break);
}
return list;
}
fn parseOrderedList(self: *Parser) !*Node {
const list = try Node.init(self.allocator, .ordered_list);
while (self.peek()) |tok| {
if (tok.type != .ordered_list_item) break;
_ = self.advance();
const item = try Node.init(self.allocator, .list_item);
try self.parseInlineContent(item);
try list.addChild(item);
_ = self.expect(.line_break);
}
return list;
}
fn parseBlockquote(self: *Parser) !*Node {
const bq = try Node.init(self.allocator, .blockquote);
while (self.peek()) |tok| {
if (tok.type != .blockquote) break;
_ = self.advance(); // consume > marker
// Parse the content as inline
const para = try Node.init(self.allocator, .paragraph);
try self.parseInlineContent(para);
try bq.addChild(para);
_ = self.expect(.line_break);
}
return bq;
}
The key insight: each parsing function consumes exactly the tokens it "owns" and stops when it sees something that belongs to a different block. The list parser keeps eating list items as long as it sees unordered_list_item (or ordered_list_item) tokens, then stops. The next call to parseBlock will handle whatever comes after.
Having said that, if you wanted proper nested lists you'd need to either track indentation in the tokenizer or do a second pass. Real Markdown parsers like CommonMark handle this by counting spaces before the list marker. We're keeping things simple here -- one nesting level is plenty for a learning project, and adding depth later is a matter of extending the tokenizer, not rewriting the parser.
Paragraph detection and inline content
Paragraphs are the "default" block -- if the current token doesn't match any other block type, it's the start of a paragraph. The tricky part is knowing when a paragraph ends. In Markdown, a paragraph ends at a blank line (two consecutive line breaks), at the start of another block element, or at EOF.
fn parseParagraph(self: *Parser) !*Node {
const para = try Node.init(self.allocator, .paragraph);
try self.parseInlineContent(para);
return para;
}
fn parseInlineContent(self: *Parser, parent: *Node) !void {
while (self.peek()) |tok| {
switch (tok.type) {
.text => {
const text_node = try Node.init(self.allocator, .text);
text_node.content = tok.text;
try parent.addChild(text_node);
_ = self.advance();
},
.bold_start => {
_ = self.advance();
const bold = try Node.init(self.allocator, .bold);
// Parse nested inline content until bold_end
try self.parseInlineUntil(bold, .bold_end);
try parent.addChild(bold);
},
.italic_start => {
_ = self.advance();
const italic = try Node.init(self.allocator, .italic);
try self.parseInlineUntil(italic, .italic_end);
try parent.addChild(italic);
},
.code_span => {
const cs = try Node.init(self.allocator, .code_span);
cs.content = tok.text;
try parent.addChild(cs);
_ = self.advance();
},
.link_text_start => {
try self.parseLink(parent);
},
.image_start => {
try self.parseImage(parent);
},
.line_break => {
// Check if next token starts a new block
// Two line breaks = paragraph end
_ = self.advance();
if (self.peek()) |next| {
if (next.type == .line_break or
next.type == .heading or
next.type == .code_block_start or
next.type == .unordered_list_item or
next.type == .ordered_list_item or
next.type == .horizontal_rule or
next.type == .blockquote or
next.type == .eof)
{
return; // end of paragraph
}
// Single line break within a paragraph -- insert soft break
const br = try Node.init(self.allocator, .line_break);
try parent.addChild(br);
} else {
return;
}
},
else => return, // any unrecognized token ends inline parsing
}
}
}
fn parseInlineUntil(self: *Parser, parent: *Node, end_token: TokenType) !void {
while (self.peek()) |tok| {
if (tok.type == end_token) {
_ = self.advance(); // consume the closing delimiter
return;
}
if (tok.type == .line_break or tok.type == .eof) {
return; // unclosed delimiter -- bail out
}
// Recursively parse inline content
switch (tok.type) {
.text => {
const text_node = try Node.init(self.allocator, .text);
text_node.content = tok.text;
try parent.addChild(text_node);
_ = self.advance();
},
.bold_start => {
_ = self.advance();
const bold = try Node.init(self.allocator, .bold);
try self.parseInlineUntil(bold, .bold_end);
try parent.addChild(bold);
},
.italic_start => {
_ = self.advance();
const italic = try Node.init(self.allocator, .italic);
try self.parseInlineUntil(italic, .italic_end);
try parent.addChild(italic);
},
.code_span => {
const cs = try Node.init(self.allocator, .code_span);
cs.content = tok.text;
try parent.addChild(cs);
_ = self.advance();
},
.link_text_start => {
try self.parseLink(parent);
},
else => {
_ = self.advance(); // skip unknown tokens
},
}
}
}
This is where the "recursive" in recursive descent becomes visible. parseInlineContent calls parseInlineUntil for bold/italic blocks, and parseInlineUntil can call itself recursively for nested formatting. So **bold and *italic* inside** creates a bold node whose children include a text node, an italic node (containing its own text node), and another text node. The tree captures the nesting perfectly.
The paragraph detection logic in parseInlineContent is worth studying. A single line break mid-paragraph becomes a soft break (which the HTML renderer will turn into a <br> or just a space, depending on your preference). A double line break or the start of a new block element ends the paragraph entirely. This matches how most Markdown renderers work -- consecutive text lines get merged into one paragraph, blank lines separate paragraphs.
Parsing links and images
Links and images follow a similar structure: [text](url) and ! [alt] (url) (without the spaces). The parser needs to collect the text content, then the URL:
fn parseLink(self: *Parser, parent: *Node) !void {
_ = self.advance(); // consume link_text_start
const lnk = try Node.init(self.allocator, .link);
// Parse link text (inline content until link_text_end)
while (self.peek()) |tok| {
if (tok.type == .link_text_end) {
_ = self.advance();
break;
}
if (tok.type == .text) {
const text_node = try Node.init(self.allocator, .text);
text_node.content = tok.text;
try lnk.addChild(text_node);
}
_ = self.advance();
}
// Parse URL
if (self.expect(.link_url_start)) |_| {
if (self.peek()) |url_tok| {
if (url_tok.type == .text) {
lnk.url = url_tok.text;
_ = self.advance();
}
}
_ = self.expect(.link_url_end);
}
try parent.addChild(lnk);
}
fn parseImage(self: *Parser, parent: *Node) !void {
_ = self.advance(); // consume image_start
const img = try Node.init(self.allocator, .image);
// Parse alt text until link_text_end
while (self.peek()) |tok| {
if (tok.type == .link_text_end) {
_ = self.advance();
break;
}
if (tok.type == .text) {
img.alt_text = tok.text;
}
_ = self.advance();
}
// Parse URL
if (self.expect(.link_url_start)) |_| {
if (self.peek()) |url_tok| {
if (url_tok.type == .text) {
img.url = url_tok.text;
_ = self.advance();
}
}
_ = self.expect(.link_url_end);
}
try parent.addChild(img);
}
The link parser extracts inline content between [ and ] as children of the link node (so the link text can contain bold, italic, etc. -- just like in real Markdown). The image parser is similar but stores the text between [ and ] as alt text rather than child nodes.
Building the tree with arena allocation
You may have noticed something: we're creating a LOT of nodes. Every text run, every bold span, every paragraph, every list item -- all separately heap-allocated. With the general purpose allocator, that means many individual malloc calls. And when we're done with the AST (after rendering to HTML), we need to free every single node individually.
This is the perfect use case for an arena allocator. We covered allocators in episode 7 and even wrote a custom one in episode 26. An arena allocates from a big chunk of memory and frees everything at once when you're done. No individual frees, no tracking, just one deinit and it's all gone.
pub fn parseMarkdown(source: []const u8, backing_allocator: std.mem.Allocator) !*Node {
// Create arena for all AST allocations
var arena = std.heap.ArenaAllocator.init(backing_allocator);
// NOTE: caller must keep arena alive until done with the AST
// In practice, wrap this in a Document struct that owns the arena
const allocator = arena.allocator();
// Tokenize
var tokenizer = @import("tokenizer.zig").Tokenizer.init(allocator, source);
const tokens = try tokenizer.tokenize();
// Parse
var parser = Parser.init(allocator, tokens);
return try parser.parse();
}
pub const Document = struct {
root: *Node,
arena: std.heap.ArenaAllocator,
pub fn init(source: []const u8, backing_allocator: std.mem.Allocator) !Document {
var arena = std.heap.ArenaAllocator.init(backing_allocator);
errdefer arena.deinit();
const allocator = arena.allocator();
var tokenizer = @import("tokenizer.zig").Tokenizer.init(allocator, source);
const tokens = try tokenizer.tokenize();
var parser = Parser.init(allocator, tokens);
const root = try parser.parse();
return .{
.root = root,
.arena = arena,
};
}
pub fn deinit(self: *Document) void {
self.arena.deinit();
}
};
The Document struct wraps the AST root and the arena together. When you're done with the document (after rendering or whatever), call deinit once and all nodes, all ArrayLists, all token storage -- everything allocated through the arena -- gets freed in one shot. No walking the tree to free individual nodes. No risk of forgeting to free a deeply nested child. Just arena.deinit() and you're done.
This is one of those patterns where Zig's explicit allocator design really shines. In C you'd have to write a custom arena or track every allocation manually. In garbage-collected languages you wouldn't think about it at all (but you'd pay the GC overhead). Zig gives you the arena pattern as a standard library primitive that costs nothing extra -- std.heap.ArenaAllocator is literally just bumping a pointer.
Visitor pattern for walking the AST
Once we have the tree, we need to walk it. The renderer in the next episode will walk the AST and emit HTML. But the walk logic itself is reusable -- we might also want to walk the tree for debugging output, for generating a table of contents, for counting words, etc.
The visitor pattern lets us separate "how we walk the tree" from "what we do at each node":
pub fn Visitor(comptime Context: type) type {
return struct {
const Self = @This();
enterFn: *const fn (ctx: *Context, node: *const Node, depth: usize) void,
leaveFn: *const fn (ctx: *Context, node: *const Node, depth: usize) void,
pub fn walk(self: Self, node: *const Node, ctx: *Context) void {
self.walkDepth(node, ctx, 0);
}
fn walkDepth(self: Self, node: *const Node, ctx: *Context, depth: usize) void {
self.enterFn(ctx, node, depth);
for (node.children.items) |child| {
self.walkDepth(child, ctx, depth + 1);
}
self.leaveFn(ctx, node, depth);
}
};
}
This uses comptime generics (from episode 14) to parameterize the visitor over a context type. The caller provides two function pointers: one called when entering a node (before processing children) and one called when leaving (after children). The depth parameter tracks how deep in the tree we are -- useful for indentation in debug output and for nesting HTML tags correctly.
Here's a debug printer using the visitor:
const DebugContext = struct {
const indent_str = " ";
fn enter(ctx: *DebugContext, node: *const Node, depth: usize) void {
_ = ctx;
// Print indentation
var i: usize = 0;
while (i < depth) : (i += 1) {
std.debug.print("{s}", .{indent_str});
}
std.debug.print("{s}", .{@tagName(node.node_type)});
if (node.content.len > 0) {
std.debug.print(" \"{s}\"", .{node.content});
}
if (node.level > 0) {
std.debug.print(" level={d}", .{node.level});
}
if (node.url.len > 0) {
std.debug.print(" url=\"{s}\"", .{node.url});
}
std.debug.print("\n", .{});
}
fn leave(_: *DebugContext, _: *const Node, _: usize) void {
// nothing on leave for debug output
}
};
pub fn debugPrint(root: *const Node) void {
var ctx = DebugContext{};
const visitor = Visitor(DebugContext){
.enterFn = DebugContext.enter,
.leaveFn = DebugContext.leave,
};
visitor.walk(root, &ctx);
}
Running this on a parsed document gives you output like:
document
heading level=1
text "Hello World"
paragraph
text "This is "
bold
text "important"
text " stuff."
code_block "const x = 42;"
unordered_list
list_item
text "first"
list_item
text "second"
Extremely handy for debugging parser issues. You can see at a glance whether nodes are nested correctly, whether text content ended up in the right places, whether paragraphs were split properly.
Testing the parser
Testing the parser builds on the tokenizer tests from last episode. The approach: feed known Markdown through both tokenizer and parser, then verify the AST structure:
test "parse heading" {
var doc = try Document.init("# Hello World\n", std.testing.allocator);
defer doc.deinit();
const root = doc.root;
try std.testing.expectEqual(NodeType.document, root.node_type);
try std.testing.expectEqual(@as(usize, 1), root.childCount());
const heading = root.children.items[0];
try std.testing.expectEqual(NodeType.heading, heading.node_type);
try std.testing.expectEqual(@as(u8, 1), heading.level);
try std.testing.expectEqual(@as(usize, 1), heading.childCount());
try std.testing.expectEqualStrings("Hello World", heading.children.items[0].content);
}
test "parse paragraph with bold" {
var doc = try Document.init("This is **bold** text\n", std.testing.allocator);
defer doc.deinit();
const root = doc.root;
const para = root.children.items[0];
try std.testing.expectEqual(NodeType.paragraph, para.node_type);
// Should have: text("This is "), bold(text("bold")), text(" text")
try std.testing.expectEqual(@as(usize, 3), para.childCount());
try std.testing.expectEqual(NodeType.text, para.children.items[0].node_type);
try std.testing.expectEqual(NodeType.bold, para.children.items[1].node_type);
try std.testing.expectEqual(NodeType.text, para.children.items[2].node_type);
// Check bold contains the right text
const bold = para.children.items[1];
try std.testing.expectEqual(@as(usize, 1), bold.childCount());
try std.testing.expectEqualStrings("bold", bold.children.items[0].content);
}
test "parse link" {
var doc = try Document.init("Click [here](https://example.com) now\n", std.testing.allocator);
defer doc.deinit();
const para = doc.root.children.items[0];
// text "Click " -> link -> text " now"
try std.testing.expectEqual(@as(usize, 3), para.childCount());
const link = para.children.items[1];
try std.testing.expectEqual(NodeType.link, link.node_type);
try std.testing.expectEqualStrings("https://example.com", link.url);
try std.testing.expectEqual(@as(usize, 1), link.childCount());
try std.testing.expectEqualStrings("here", link.children.items[0].content);
}
test "parse code block with language" {
const input = "```zig\nconst x = 42;\n```\n";
var doc = try Document.init(input, std.testing.allocator);
defer doc.deinit();
const cb = doc.root.children.items[0];
try std.testing.expectEqual(NodeType.code_block, cb.node_type);
try std.testing.expectEqualStrings("zig", cb.language);
try std.testing.expect(std.mem.indexOf(u8, cb.content, "const x = 42;") != null);
}
test "parse unordered list" {
const input = "- alpha\n- beta\n- gamma\n";
var doc = try Document.init(input, std.testing.allocator);
defer doc.deinit();
const list = doc.root.children.items[0];
try std.testing.expectEqual(NodeType.unordered_list, list.node_type);
try std.testing.expectEqual(@as(usize, 3), list.childCount());
for (list.children.items) |item| {
try std.testing.expectEqual(NodeType.list_item, item.node_type);
}
}
test "parse multiple paragraphs" {
const input = "First paragraph.\n\nSecond paragraph.\n";
var doc = try Document.init(input, std.testing.allocator);
defer doc.deinit();
// Should have two paragraph children
var para_count: usize = 0;
for (doc.root.children.items) |child| {
if (child.node_type == .paragraph) para_count += 1;
}
try std.testing.expectEqual(@as(usize, 2), para_count);
}
test "arena cleanup frees all nodes" {
// The testing allocator will detect leaks
var doc = try Document.init("# Title\n\nSome **bold** and *italic* text.\n\n- item one\n- item two\n", std.testing.allocator);
doc.deinit();
// If we get here without a leak report, arena cleanup works correctly
}
That last test is my favorite -- it doesn't check any specific AST structure, it just verifies that the arena cleanup actually frees everything. If we forgot to route some allocation through the arena (say, a stray std.heap.page_allocator.create() somewhere), std.testing.allocator would catch the leak. This is the Zig testing philosophy: let the tools do the checking for you.
Updating the build.zig
We need to add the parser as a test target alongside the tokenizer tests from last episode:
const std = @import("std");
pub fn build(b: *std.Build) void {
const target = b.standardTargetOptions(.{});
const optimize = b.standardOptimizeOption(.{});
const exe = b.addExecutable(.{
.name = "md2html",
.root_source_file = b.path("src/main.zig"),
.target = target,
.optimize = optimize,
});
b.installArtifact(exe);
const run_cmd = b.addRunArtifact(exe);
run_cmd.step.dependOn(b.getInstallStep());
if (b.args) |args| run_cmd.addArgs(args);
const run_step = b.step("run", "Run md2html");
run_step.dependOn(&run_cmd.step);
// Tokenizer tests
const tokenizer_tests = b.addTest(.{
.root_source_file = b.path("src/tokenizer.zig"),
.target = target,
.optimize = optimize,
});
const run_tok_tests = b.addRunArtifact(tokenizer_tests);
// Parser tests
const parser_tests = b.addTest(.{
.root_source_file = b.path("src/parser.zig"),
.target = target,
.optimize = optimize,
});
const run_parser_tests = b.addRunArtifact(parser_tests);
const test_step = b.step("test", "Run unit tests");
test_step.dependOn(&run_tok_tests.step);
test_step.dependOn(&run_parser_tests.step);
}
Now zig build test runs both tokenizer AND parser tests. The test step depends on both, so both suites run (and if either fails, you see exactly which one).
Wat we geleerd hebben
- How to define AST node types for a Markdown document using structs with child lists -- block nodes (document, heading, paragraph, code block, lists) and inline nodes (text, bold, italic, link, code span)
- Recursive descent parsing: each grammar rule becomes a function that consumes tokens and may call other parsing functions. The recursion handles nesting naturally -- bold inside links, italic inside bold, inline elements inside list items
- Paragraph detection by treating single line breaks as soft breaks and double line breaks (or block-start tokens) as paragraph boundaries
- Arena allocation for AST memory management -- all nodes allocated through a single arena, freed with one
deinitcall instead of walking the entire tree - The visitor pattern with comptime generics for separating tree traversal from processing logic, enabling debug printing, rendering, and analysis to share the same walk code
- Testing AST structure by feeding known Markdown through tokenizer + parser and verifying node types, child counts, content strings, and metadata. Plus leak detection via
std.testing.allocatorto ensure the arena cleanup is complete
We now have a complete pipeline from raw Markdown text to a structured tree. The tokenizer breaks text into tokens, the parser organizes tokens into a tree, and the arena makes cleanup trivial. What we don't have yet is the actual HTML output -- that's where the renderer and CLI come in. The visitor pattern we built today will be the foundation for that rendering step.
Thanks for reading!