Learn Zig Series (#38) - Markdown to HTML: Parser and AST

Project A: Markdown to HTML Converter (2/3)

What will I learn

You will learn defining an AST (Abstract Syntax Tree) for Markdown documents;
You will learn recursive descent parsing from the token stream;
You will learn nesting: lists inside lists, bold inside links;
You will learn paragraph detection: consecutive text lines merge into one paragraph;
You will learn code block parsing: fenced blocks with language tags;
You will learn building the tree with arena allocation for simple cleanup;
You will learn visitor pattern for walking the AST;
You will learn testing: parse known Markdown, verify AST structure.

Requirements

A working modern computer running macOS, Windows or Ubuntu;
An installed Zig 0.14+ distribution (download from ziglang.org);
The ambition to learn Zig programming.

Difficulty

Intermediate

Curriculum (of the `Learn Zig Series`):

Learn Zig Series (#38) - Markdown to HTML: Parser and AST

Welcome back to part two of our Markdown to HTML converter! ;-)

Last episode we built a tokenizer that chews through raw Markdown text and spits out a flat stream of tokens -- headings, bold markers, links, code blocks, all neatly labeled. But a flat list of tokens doesn't know anything about structure. It can't tell you that "this bold text is inside a paragraph" or "these three list items belong to the same list". For that, we need a parser that takes the token stream and builds a tree.

That tree is called an AST -- Abstract Syntax Tree. If you've ever looked at how compilers work (or played with any language tooling at all, really), you've encountered ASTs. They're everywhere. The idea is simple: your document has a hierarchical structure, and a tree is the natural data structure to represent hierarchies. A Markdown document contains paragraphs. Paragraphs contain inline elements (bold, italic, links, code spans). Lists contain list items. List items contain inline elements too. It's trees all the way down.

In this episode we'll define the AST node types, write a recursive descent parser that consumes our token stream, handle all the tricky nesting cases, and test the whole thing. By the end, we'll have a structured tree that the renderer (next episode) can walk to produce clean HTML.

Defining the AST node types

First, what kinds of nodes does our tree need? Looking at the token types we defined in episode 37, we can group them into two categories: block nodes (things that occupy their own vertical space -- paragraphs, headings, lists, code blocks) and inline nodes (things that live within a block -- text, bold, italic, links, code spans).

Here's the node type enum:

const std = @import("std");
const Token = @import("tokenizer.zig").Token;
const TokenType = @import("tokenizer.zig").TokenType;

pub const NodeType = enum {
    // Block nodes
    document,
    heading,
    paragraph,
    code_block,
    unordered_list,
    ordered_list,
    list_item,
    blockquote,
    horizontal_rule,

    // Inline nodes
    text,
    bold,
    italic,
    code_span,
    link,
    image,
    line_break,
};

And the node struct itself. Each node has a type, optional content (for leaf nodes like text and code), optional metadata (heading level, code block language, link URL), and a list of children:

pub const Node = struct {
    node_type: NodeType,
    content: []const u8,
    children: std.ArrayList(*Node),
    // Metadata
    level: u8,          // heading level (1-6), list nesting depth
    language: []const u8, // code block language tag
    url: []const u8,    // link/image URL
    alt_text: []const u8, // image alt text

    pub fn init(allocator: std.mem.Allocator, node_type: NodeType) !*Node {
        const node = try allocator.create(Node);
        node.* = .{
            .node_type = node_type,
            .content = "",
            .children = std.ArrayList(*Node).init(allocator),
            .level = 0,
            .language = "",
            .url = "",
            .alt_text = "",
        };
        return node;
    }

    pub fn addChild(self: *Node, child: *Node) !void {
        try self.children.append(child);
    }

    pub fn childCount(self: *const Node) usize {
        return self.children.items.len;
    }
};

Notice that we're using allocator.create(Node) to heap-allocate every node, and nodes hold *Node pointers to their children. This is a classic tree representation. Every node owns an ArrayList of child pointers, and the parent-child relationship is implicit through the list ordering.

Why not use a tagged union for the node types (like we did for the state machine in episode 33)? You could -- and for a production parser you might want the type safety of making it impossible to access .language on a text node. But for this project, the struct-with-optional-fields approach is simpler to work with and extend. Different tradeoff, both valid.

The Parser struct and recursive descent

The parser consumes tokens one at a time and builds up the AST. It's a recursive descent parser -- meaning each grammar rule (document, paragraph, inline content) is implemented as a function that may call other parsing functions recursively. This is the most intuitive parsing technique and works beautifully for Markdown because the format is fairly simple structuraly.

pub const Parser = struct {
    tokens: []const Token,
    pos: usize,
    allocator: std.mem.Allocator,

    pub fn init(allocator: std.mem.Allocator, tokens: []const Token) Parser {
        return .{
            .tokens = tokens,
            .pos = 0,
            .allocator = allocator,
        };
    }

    fn peek(self: *const Parser) ?Token {
        if (self.pos >= self.tokens.len) return null;
        return self.tokens[self.pos];
    }

    fn advance(self: *Parser) ?Token {
        if (self.pos >= self.tokens.len) return null;
        const tok = self.tokens[self.pos];
        self.pos += 1;
        return tok;
    }

    fn expect(self: *Parser, expected: TokenType) ?Token {
        const tok = self.peek() orelse return null;
        if (tok.type == expected) {
            return self.advance();
        }
        return null;
    }

    fn atEnd(self: *const Parser) bool {
        if (self.pos >= self.tokens.len) return true;
        return self.tokens[self.pos].type == .eof;
    }
};

This should look familiar -- the pattern is almost identical to the tokenizer's peek/advance, just operating on tokens instead of characters. The expect function is a convenience: it peeks at the next token, checks if it matches what we want, and only consumes it if it does. If the token doesn't match, we return null and the caller knows to try something else.

The main parsing entry point builds a document node and then parses blocks until EOF:

pub fn parse(self: *Parser) !*Node {
    const doc = try Node.init(self.allocator, .document);

    while (!self.atEnd()) {
        // Skip bare line breaks between blocks
        if (self.peek()) |tok| {
            if (tok.type == .line_break) {
                _ = self.advance();
                continue;
            }
        }

        const block = try self.parseBlock();
        if (block) |b| {
            try doc.addChild(b);
        }
    }

    return doc;
}

Parsing block-level elements

Each block type has its own parsing function. The parseBlock function looks at the current token and dispatches accordingly:

fn parseBlock(self: *Parser) !?*Node {
    const tok = self.peek() orelse return null;

    return switch (tok.type) {
        .heading => try self.parseHeading(),
        .code_block_start => try self.parseCodeBlock(),
        .unordered_list_item => try self.parseUnorderedList(),
        .ordered_list_item => try self.parseOrderedList(),
        .blockquote => try self.parseBlockquote(),
        .horizontal_rule => try self.parseHorizontalRule(),
        // Anything else starts a paragraph
        .text, .bold_start, .italic_start, .code_span,
        .link_text_start, .image_start => try self.parseParagraph(),
        // Skip tokens we don't handle at block level
        else => blk: {
            _ = self.advance();
            break :blk null;
        },
    };
}

fn parseHeading(self: *Parser) !*Node {
    const tok = self.advance().?; // consume heading token
    const node = try Node.init(self.allocator, .heading);
    node.level = @intCast(tok.text.len); // number of # chars = heading level

    // Parse inline content until line break or EOF
    try self.parseInlineContent(node);

    return node;
}

fn parseCodeBlock(self: *Parser) !*Node {
    const start_tok = self.advance().?; // consume code_block_start
    const node = try Node.init(self.allocator, .code_block);

    // Extract language from the opening fence
    // The token text is something like "```zig" or "```"
    const fence_text = start_tok.text;
    var lang_start: usize = 0;
    for (fence_text, 0..) |c, i| {
        if (c != '`' and c != '~') {
            lang_start = i;
            break;
        }
    }
    if (lang_start > 0) {
        node.language = std.mem.trim(u8, fence_text[lang_start..], " \t\n");
    }

    // Grab the code content
    if (self.peek()) |content_tok| {
        if (content_tok.type == .code_block_content) {
            node.content = content_tok.text;
            _ = self.advance();
        }
    }

    // Consume closing fence
    _ = self.expect(.code_block_end);

    return node;
}

fn parseHorizontalRule(self: *Parser) !*Node {
    _ = self.advance(); // consume the horizontal_rule token
    return try Node.init(self.allocator, .horizontal_rule);
}

The heading parser is straightforward: consume the heading token, check how many # characters it had (that's the level), then parse inline content until the end of the line. The code block parser pulls out the language tag from the opening fence and grabs the content as-is -- no inline parsing inside code blocks, everything is literal.

Parsing lists -- the nesting challenge

Lists are where things get genuinely tricky. In Markdown, consecutive - item lines form a single list. But lists can also nest: indented items create sub-lists inside parent items. Our tokenizer doesn't track indentation (we kept it simple in ep37), so for this version we'll handle flat lists -- consecutive list items grouped into one list node.

fn parseUnorderedList(self: *Parser) !*Node {
    const list = try Node.init(self.allocator, .unordered_list);

    while (self.peek()) |tok| {
        if (tok.type != .unordered_list_item) break;
        _ = self.advance(); // consume the list marker

        const item = try Node.init(self.allocator, .list_item);
        try self.parseInlineContent(item);
        try list.addChild(item);

        // Skip line break after list item
        _ = self.expect(.line_break);
    }

    return list;
}

fn parseOrderedList(self: *Parser) !*Node {
    const list = try Node.init(self.allocator, .ordered_list);

    while (self.peek()) |tok| {
        if (tok.type != .ordered_list_item) break;
        _ = self.advance();

        const item = try Node.init(self.allocator, .list_item);
        try self.parseInlineContent(item);
        try list.addChild(item);

        _ = self.expect(.line_break);
    }

    return list;
}

fn parseBlockquote(self: *Parser) !*Node {
    const bq = try Node.init(self.allocator, .blockquote);

    while (self.peek()) |tok| {
        if (tok.type != .blockquote) break;
        _ = self.advance(); // consume > marker

        // Parse the content as inline
        const para = try Node.init(self.allocator, .paragraph);
        try self.parseInlineContent(para);
        try bq.addChild(para);

        _ = self.expect(.line_break);
    }

    return bq;
}

The key insight: each parsing function consumes exactly the tokens it "owns" and stops when it sees something that belongs to a different block. The list parser keeps eating list items as long as it sees unordered_list_item (or ordered_list_item) tokens, then stops. The next call to parseBlock will handle whatever comes after.

Having said that, if you wanted proper nested lists you'd need to either track indentation in the tokenizer or do a second pass. Real Markdown parsers like CommonMark handle this by counting spaces before the list marker. We're keeping things simple here -- one nesting level is plenty for a learning project, and adding depth later is a matter of extending the tokenizer, not rewriting the parser.

Paragraph detection and inline content

Paragraphs are the "default" block -- if the current token doesn't match any other block type, it's the start of a paragraph. The tricky part is knowing when a paragraph ends. In Markdown, a paragraph ends at a blank line (two consecutive line breaks), at the start of another block element, or at EOF.

fn parseParagraph(self: *Parser) !*Node {
    const para = try Node.init(self.allocator, .paragraph);
    try self.parseInlineContent(para);
    return para;
}

fn parseInlineContent(self: *Parser, parent: *Node) !void {
    while (self.peek()) |tok| {
        switch (tok.type) {
            .text => {
                const text_node = try Node.init(self.allocator, .text);
                text_node.content = tok.text;
                try parent.addChild(text_node);
                _ = self.advance();
            },
            .bold_start => {
                _ = self.advance();
                const bold = try Node.init(self.allocator, .bold);
                // Parse nested inline content until bold_end
                try self.parseInlineUntil(bold, .bold_end);
                try parent.addChild(bold);
            },
            .italic_start => {
                _ = self.advance();
                const italic = try Node.init(self.allocator, .italic);
                try self.parseInlineUntil(italic, .italic_end);
                try parent.addChild(italic);
            },
            .code_span => {
                const cs = try Node.init(self.allocator, .code_span);
                cs.content = tok.text;
                try parent.addChild(cs);
                _ = self.advance();
            },
            .link_text_start => {
                try self.parseLink(parent);
            },
            .image_start => {
                try self.parseImage(parent);
            },
            .line_break => {
                // Check if next token starts a new block
                // Two line breaks = paragraph end
                _ = self.advance();
                if (self.peek()) |next| {
                    if (next.type == .line_break or
                        next.type == .heading or
                        next.type == .code_block_start or
                        next.type == .unordered_list_item or
                        next.type == .ordered_list_item or
                        next.type == .horizontal_rule or
                        next.type == .blockquote or
                        next.type == .eof)
                    {
                        return; // end of paragraph
                    }
                    // Single line break within a paragraph -- insert soft break
                    const br = try Node.init(self.allocator, .line_break);
                    try parent.addChild(br);
                } else {
                    return;
                }
            },
            else => return, // any unrecognized token ends inline parsing
        }
    }
}

fn parseInlineUntil(self: *Parser, parent: *Node, end_token: TokenType) !void {
    while (self.peek()) |tok| {
        if (tok.type == end_token) {
            _ = self.advance(); // consume the closing delimiter
            return;
        }
        if (tok.type == .line_break or tok.type == .eof) {
            return; // unclosed delimiter -- bail out
        }
        // Recursively parse inline content
        switch (tok.type) {
            .text => {
                const text_node = try Node.init(self.allocator, .text);
                text_node.content = tok.text;
                try parent.addChild(text_node);
                _ = self.advance();
            },
            .bold_start => {
                _ = self.advance();
                const bold = try Node.init(self.allocator, .bold);
                try self.parseInlineUntil(bold, .bold_end);
                try parent.addChild(bold);
            },
            .italic_start => {
                _ = self.advance();
                const italic = try Node.init(self.allocator, .italic);
                try self.parseInlineUntil(italic, .italic_end);
                try parent.addChild(italic);
            },
            .code_span => {
                const cs = try Node.init(self.allocator, .code_span);
                cs.content = tok.text;
                try parent.addChild(cs);
                _ = self.advance();
            },
            .link_text_start => {
                try self.parseLink(parent);
            },
            else => {
                _ = self.advance(); // skip unknown tokens
            },
        }
    }
}

This is where the "recursive" in recursive descent becomes visible. parseInlineContent calls parseInlineUntil for bold/italic blocks, and parseInlineUntil can call itself recursively for nested formatting. So **bold and *italic* inside** creates a bold node whose children include a text node, an italic node (containing its own text node), and another text node. The tree captures the nesting perfectly.

The paragraph detection logic in parseInlineContent is worth studying. A single line break mid-paragraph becomes a soft break (which the HTML renderer will turn into a <br> or just a space, depending on your preference). A double line break or the start of a new block element ends the paragraph entirely. This matches how most Markdown renderers work -- consecutive text lines get merged into one paragraph, blank lines separate paragraphs.

Parsing links and images

Links and images follow a similar structure: [text](url) and ! [alt] (url) (without the spaces). The parser needs to collect the text content, then the URL:

fn parseLink(self: *Parser, parent: *Node) !void {
    _ = self.advance(); // consume link_text_start

    const lnk = try Node.init(self.allocator, .link);

    // Parse link text (inline content until link_text_end)
    while (self.peek()) |tok| {
        if (tok.type == .link_text_end) {
            _ = self.advance();
            break;
        }
        if (tok.type == .text) {
            const text_node = try Node.init(self.allocator, .text);
            text_node.content = tok.text;
            try lnk.addChild(text_node);
        }
        _ = self.advance();
    }

    // Parse URL
    if (self.expect(.link_url_start)) |_| {
        if (self.peek()) |url_tok| {
            if (url_tok.type == .text) {
                lnk.url = url_tok.text;
                _ = self.advance();
            }
        }
        _ = self.expect(.link_url_end);
    }

    try parent.addChild(lnk);
}

fn parseImage(self: *Parser, parent: *Node) !void {
    _ = self.advance(); // consume image_start

    const img = try Node.init(self.allocator, .image);

    // Parse alt text until link_text_end
    while (self.peek()) |tok| {
        if (tok.type == .link_text_end) {
            _ = self.advance();
            break;
        }
        if (tok.type == .text) {
            img.alt_text = tok.text;
        }
        _ = self.advance();
    }

    // Parse URL
    if (self.expect(.link_url_start)) |_| {
        if (self.peek()) |url_tok| {
            if (url_tok.type == .text) {
                img.url = url_tok.text;
                _ = self.advance();
            }
        }
        _ = self.expect(.link_url_end);
    }

    try parent.addChild(img);
}

The link parser extracts inline content between [ and ] as children of the link node (so the link text can contain bold, italic, etc. -- just like in real Markdown). The image parser is similar but stores the text between [ and ] as alt text rather than child nodes.

Building the tree with arena allocation

You may have noticed something: we're creating a LOT of nodes. Every text run, every bold span, every paragraph, every list item -- all separately heap-allocated. With the general purpose allocator, that means many individual malloc calls. And when we're done with the AST (after rendering to HTML), we need to free every single node individually.

This is the perfect use case for an arena allocator. We covered allocators in episode 7 and even wrote a custom one in episode 26. An arena allocates from a big chunk of memory and frees everything at once when you're done. No individual frees, no tracking, just one deinit and it's all gone.

pub fn parseMarkdown(source: []const u8, backing_allocator: std.mem.Allocator) !*Node {
    // Create arena for all AST allocations
    var arena = std.heap.ArenaAllocator.init(backing_allocator);
    // NOTE: caller must keep arena alive until done with the AST
    // In practice, wrap this in a Document struct that owns the arena

    const allocator = arena.allocator();

    // Tokenize
    var tokenizer = @import("tokenizer.zig").Tokenizer.init(allocator, source);
    const tokens = try tokenizer.tokenize();

    // Parse
    var parser = Parser.init(allocator, tokens);
    return try parser.parse();
}

pub const Document = struct {
    root: *Node,
    arena: std.heap.ArenaAllocator,

    pub fn init(source: []const u8, backing_allocator: std.mem.Allocator) !Document {
        var arena = std.heap.ArenaAllocator.init(backing_allocator);
        errdefer arena.deinit();

        const allocator = arena.allocator();

        var tokenizer = @import("tokenizer.zig").Tokenizer.init(allocator, source);
        const tokens = try tokenizer.tokenize();

        var parser = Parser.init(allocator, tokens);
        const root = try parser.parse();

        return .{
            .root = root,
            .arena = arena,
        };
    }

    pub fn deinit(self: *Document) void {
        self.arena.deinit();
    }
};

The Document struct wraps the AST root and the arena together. When you're done with the document (after rendering or whatever), call deinit once and all nodes, all ArrayLists, all token storage -- everything allocated through the arena -- gets freed in one shot. No walking the tree to free individual nodes. No risk of forgeting to free a deeply nested child. Just arena.deinit() and you're done.

This is one of those patterns where Zig's explicit allocator design really shines. In C you'd have to write a custom arena or track every allocation manually. In garbage-collected languages you wouldn't think about it at all (but you'd pay the GC overhead). Zig gives you the arena pattern as a standard library primitive that costs nothing extra -- std.heap.ArenaAllocator is literally just bumping a pointer.

Visitor pattern for walking the AST

Once we have the tree, we need to walk it. The renderer in the next episode will walk the AST and emit HTML. But the walk logic itself is reusable -- we might also want to walk the tree for debugging output, for generating a table of contents, for counting words, etc.

The visitor pattern lets us separate "how we walk the tree" from "what we do at each node":

pub fn Visitor(comptime Context: type) type {
    return struct {
        const Self = @This();

        enterFn: *const fn (ctx: *Context, node: *const Node, depth: usize) void,
        leaveFn: *const fn (ctx: *Context, node: *const Node, depth: usize) void,

        pub fn walk(self: Self, node: *const Node, ctx: *Context) void {
            self.walkDepth(node, ctx, 0);
        }

        fn walkDepth(self: Self, node: *const Node, ctx: *Context, depth: usize) void {
            self.enterFn(ctx, node, depth);
            for (node.children.items) |child| {
                self.walkDepth(child, ctx, depth + 1);
            }
            self.leaveFn(ctx, node, depth);
        }
    };
}

This uses comptime generics (from episode 14) to parameterize the visitor over a context type. The caller provides two function pointers: one called when entering a node (before processing children) and one called when leaving (after children). The depth parameter tracks how deep in the tree we are -- useful for indentation in debug output and for nesting HTML tags correctly.

Here's a debug printer using the visitor:

const DebugContext = struct {
    const indent_str = "  ";

    fn enter(ctx: *DebugContext, node: *const Node, depth: usize) void {
        _ = ctx;
        // Print indentation
        var i: usize = 0;
        while (i < depth) : (i += 1) {
            std.debug.print("{s}", .{indent_str});
        }
        std.debug.print("{s}", .{@tagName(node.node_type)});
        if (node.content.len > 0) {
            std.debug.print(" \"{s}\"", .{node.content});
        }
        if (node.level > 0) {
            std.debug.print(" level={d}", .{node.level});
        }
        if (node.url.len > 0) {
            std.debug.print(" url=\"{s}\"", .{node.url});
        }
        std.debug.print("\n", .{});
    }

    fn leave(_: *DebugContext, _: *const Node, _: usize) void {
        // nothing on leave for debug output
    }
};

pub fn debugPrint(root: *const Node) void {
    var ctx = DebugContext{};
    const visitor = Visitor(DebugContext){
        .enterFn = DebugContext.enter,
        .leaveFn = DebugContext.leave,
    };
    visitor.walk(root, &ctx);
}

Running this on a parsed document gives you output like:

document
  heading level=1
    text "Hello World"
  paragraph
    text "This is "
    bold
      text "important"
    text " stuff."
  code_block "const x = 42;"
  unordered_list
    list_item
      text "first"
    list_item
      text "second"

Extremely handy for debugging parser issues. You can see at a glance whether nodes are nested correctly, whether text content ended up in the right places, whether paragraphs were split properly.

Testing the parser

Testing the parser builds on the tokenizer tests from last episode. The approach: feed known Markdown through both tokenizer and parser, then verify the AST structure:

test "parse heading" {
    var doc = try Document.init("# Hello World\n", std.testing.allocator);
    defer doc.deinit();

    const root = doc.root;
    try std.testing.expectEqual(NodeType.document, root.node_type);
    try std.testing.expectEqual(@as(usize, 1), root.childCount());

    const heading = root.children.items[0];
    try std.testing.expectEqual(NodeType.heading, heading.node_type);
    try std.testing.expectEqual(@as(u8, 1), heading.level);
    try std.testing.expectEqual(@as(usize, 1), heading.childCount());
    try std.testing.expectEqualStrings("Hello World", heading.children.items[0].content);
}

test "parse paragraph with bold" {
    var doc = try Document.init("This is **bold** text\n", std.testing.allocator);
    defer doc.deinit();

    const root = doc.root;
    const para = root.children.items[0];
    try std.testing.expectEqual(NodeType.paragraph, para.node_type);

    // Should have: text("This is "), bold(text("bold")), text(" text")
    try std.testing.expectEqual(@as(usize, 3), para.childCount());
    try std.testing.expectEqual(NodeType.text, para.children.items[0].node_type);
    try std.testing.expectEqual(NodeType.bold, para.children.items[1].node_type);
    try std.testing.expectEqual(NodeType.text, para.children.items[2].node_type);

    // Check bold contains the right text
    const bold = para.children.items[1];
    try std.testing.expectEqual(@as(usize, 1), bold.childCount());
    try std.testing.expectEqualStrings("bold", bold.children.items[0].content);
}

test "parse link" {
    var doc = try Document.init("Click [here](https://example.com) now\n", std.testing.allocator);
    defer doc.deinit();

    const para = doc.root.children.items[0];
    // text "Click " -> link -> text " now"
    try std.testing.expectEqual(@as(usize, 3), para.childCount());

    const link = para.children.items[1];
    try std.testing.expectEqual(NodeType.link, link.node_type);
    try std.testing.expectEqualStrings("https://example.com", link.url);
    try std.testing.expectEqual(@as(usize, 1), link.childCount());
    try std.testing.expectEqualStrings("here", link.children.items[0].content);
}

test "parse code block with language" {
    const input = "```zig\nconst x = 42;\n```\n";
    var doc = try Document.init(input, std.testing.allocator);
    defer doc.deinit();

    const cb = doc.root.children.items[0];
    try std.testing.expectEqual(NodeType.code_block, cb.node_type);
    try std.testing.expectEqualStrings("zig", cb.language);
    try std.testing.expect(std.mem.indexOf(u8, cb.content, "const x = 42;") != null);
}

test "parse unordered list" {
    const input = "- alpha\n- beta\n- gamma\n";
    var doc = try Document.init(input, std.testing.allocator);
    defer doc.deinit();

    const list = doc.root.children.items[0];
    try std.testing.expectEqual(NodeType.unordered_list, list.node_type);
    try std.testing.expectEqual(@as(usize, 3), list.childCount());

    for (list.children.items) |item| {
        try std.testing.expectEqual(NodeType.list_item, item.node_type);
    }
}

test "parse multiple paragraphs" {
    const input = "First paragraph.\n\nSecond paragraph.\n";
    var doc = try Document.init(input, std.testing.allocator);
    defer doc.deinit();

    // Should have two paragraph children
    var para_count: usize = 0;
    for (doc.root.children.items) |child| {
        if (child.node_type == .paragraph) para_count += 1;
    }
    try std.testing.expectEqual(@as(usize, 2), para_count);
}

test "arena cleanup frees all nodes" {
    // The testing allocator will detect leaks
    var doc = try Document.init("# Title\n\nSome **bold** and *italic* text.\n\n- item one\n- item two\n", std.testing.allocator);
    doc.deinit();
    // If we get here without a leak report, arena cleanup works correctly
}

That last test is my favorite -- it doesn't check any specific AST structure, it just verifies that the arena cleanup actually frees everything. If we forgot to route some allocation through the arena (say, a stray std.heap.page_allocator.create() somewhere), std.testing.allocator would catch the leak. This is the Zig testing philosophy: let the tools do the checking for you.

Updating the build.zig

We need to add the parser as a test target alongside the tokenizer tests from last episode:

const std = @import("std");

pub fn build(b: *std.Build) void {
    const target = b.standardTargetOptions(.{});
    const optimize = b.standardOptimizeOption(.{});

    const exe = b.addExecutable(.{
        .name = "md2html",
        .root_source_file = b.path("src/main.zig"),
        .target = target,
        .optimize = optimize,
    });
    b.installArtifact(exe);

    const run_cmd = b.addRunArtifact(exe);
    run_cmd.step.dependOn(b.getInstallStep());
    if (b.args) |args| run_cmd.addArgs(args);
    const run_step = b.step("run", "Run md2html");
    run_step.dependOn(&run_cmd.step);

    // Tokenizer tests
    const tokenizer_tests = b.addTest(.{
        .root_source_file = b.path("src/tokenizer.zig"),
        .target = target,
        .optimize = optimize,
    });
    const run_tok_tests = b.addRunArtifact(tokenizer_tests);

    // Parser tests
    const parser_tests = b.addTest(.{
        .root_source_file = b.path("src/parser.zig"),
        .target = target,
        .optimize = optimize,
    });
    const run_parser_tests = b.addRunArtifact(parser_tests);

    const test_step = b.step("test", "Run unit tests");
    test_step.dependOn(&run_tok_tests.step);
    test_step.dependOn(&run_parser_tests.step);
}

Now zig build test runs both tokenizer AND parser tests. The test step depends on both, so both suites run (and if either fails, you see exactly which one).

Wat we geleerd hebben

How to define AST node types for a Markdown document using structs with child lists -- block nodes (document, heading, paragraph, code block, lists) and inline nodes (text, bold, italic, link, code span)
Recursive descent parsing: each grammar rule becomes a function that consumes tokens and may call other parsing functions. The recursion handles nesting naturally -- bold inside links, italic inside bold, inline elements inside list items
Paragraph detection by treating single line breaks as soft breaks and double line breaks (or block-start tokens) as paragraph boundaries
Arena allocation for AST memory management -- all nodes allocated through a single arena, freed with one deinit call instead of walking the entire tree
The visitor pattern with comptime generics for separating tree traversal from processing logic, enabling debug printing, rendering, and analysis to share the same walk code
Testing AST structure by feeding known Markdown through tokenizer + parser and verifying node types, child counts, content strings, and metadata. Plus leak detection via std.testing.allocator to ensure the arena cleanup is complete

We now have a complete pipeline from raw Markdown text to a structured tree. The tokenizer breaks text into tokens, the parser organizes tokens into a tree, and the arena makes cleanup trivial. What we don't have yet is the actual HTML output -- that's where the renderer and CLI come in. The visitor pattern we built today will be the foundation for that rendering step.

Thanks for reading!

Hive account@scipio

Learn Zig Series (#38) - Markdown to HTML: Parser and AST

Learn Zig Series (#38) - Markdown to HTML: Parser and AST

What will I learn

Requirements

Difficulty

Curriculum (of the Learn Zig Series):

Learn Zig Series (#38) - Markdown to HTML: Parser and AST

Defining the AST node types

The Parser struct and recursive descent

Parsing block-level elements

Parsing lists -- the nesting challenge

Paragraph detection and inline content

Parsing links and images

Building the tree with arena allocation

Visitor pattern for walking the AST

Testing the parser

Updating the build.zig

Wat we geleerd hebben

Curriculum (of the `Learn Zig Series`):