Learn Zig Series (#47) - Build a Shell: Parsing Commands

Project D: Build Your Own Shell (1/4)

What will I learn

You will learn tokenizing a command line: splitting on spaces, handling quoted strings and escape sequences;
You will learn the Command struct: how to represent a parsed command with program name, arguments, and I/O redirections;
You will learn parsing pipes: turning ls | grep foo | wc -l into a chain of connected commands;
You will learn input/output redirection: parsing >, >>, and < operators;
You will learn handling escape characters and special characters inside and outside quotes;
You will learn building a simple REPL loop: print prompt, read line, parse, display result;
You will learn testing the parser with various command strings and edge cases;
You will learn handling tricky inputs: empty lines, multiple spaces, unclosed quotes, trailing pipes.

Requirements

A working modern computer running macOS, Windows or Ubuntu;
An installed Zig 0.14+ distribution (download from ziglang.org);
The ambition to learn Zig programming.

Difficulty

Intermediate

Curriculum (of the `Learn Zig Series`):

Learn Zig Series (#47) - Build a Shell: Parsing Commands

We just wrapped up Project C -- three episodes building an image manipulation tool from raw pixel buffers all the way to a working CLI pipeline. Now for something completely different. Project D is a shell. Not a simplified toy shell. A proper command-line shell that can parse complex commands, handle pipes and redirections, spawn processes, and manage jobs. The kind of program that every Unix user interacts with hundreds of times a day but (I suspect) very few have actually built themselves.

This first episode focuses entirely on the parsing layer. We need to turn a raw string like cat file.txt | grep "hello world" | wc -l > output.txt into a structured representation that our shell can later execute. Sounds simple until you start thinking about quoted strings, escape characters, chained pipes, and all the edge cases that make shell parsing one of those problems that's deceptively harder than it looks.

Why a shell? Because it touches almost everything we've covered in this series. String parsing uses slices and memory management from episodes 5 and 7. The command representation uses structs and tagged unions from episode 6. Error handling is everywhere (episode 4). And in the upcoming episodes we'll use process spawning, file descriptors, and signals -- all things that a systems language like Zig is made for.

Here we go!

The data model: what does a parsed command look like?

Before we write a single line of parsing code, let's think about the output. What data structure do we need to represent a parsed command line?

Take this command: grep -i "hello" input.txt > matches.txt 2>&1 | wc -l

That's actually two commands piped together. The first command (grep) has:

A program name: grep
Arguments: -i, hello, input.txt
Output redirection: stdout goes to matches.txt
Error redirection: stderr goes to stdout (the 2>&1 part)

The second command (wc) has:

A program name: wc
Arguments: -l
Its stdin comes from the pipe (the first command's stdout)

Let's define the types. We'll start with redirections, since a single command can have multiple redirections. Note that the enum only covers >, >>, and <; it has no variant for the 2>&1 form from the example:

const std = @import("std");

const RedirectKind = enum {
    stdout_overwrite, // >
    stdout_append,    // >>
    stdin_file,       // <
};

const Redirect = struct {
    kind: RedirectKind,
    target: []const u8,  // filename
};

Now the command itself. Each command in a pipeline has a program name, a list of arguments, and zero or more redirections:

const Command = struct {
    program: []const u8,
    args: []const []const u8,
    redirects: []const Redirect,

    pub fn deinit(self: *const Command, allocator: std.mem.Allocator) void {
        for (self.args) |arg| {
            allocator.free(arg);
        }
        allocator.free(self.args);
        for (self.redirects) |redir| {
            allocator.free(redir.target);
        }
        allocator.free(self.redirects);
        allocator.free(self.program);
    }
};

And a pipeline is a sequence of commands connected by pipes:

const Pipeline = struct {
    commands: []Command,

    pub fn deinit(self: *const Pipeline, allocator: std.mem.Allocator) void {
        for (self.commands) |*cmd| {
            cmd.deinit(allocator);
        }
        allocator.free(self.commands);
    }
};

Notice how Command.deinit frees every individual piece of owned memory. The program field is an owned slice (we'll allocate and copy it during parsing). Same for each arg and each redirect.target. This ownership model is the same pattern we used in the kv-store project (episode 40) -- the struct owns its data and is responsible for freeing it.

Having said that, there's a design tradeoff here. We could avoid individual allocations by using a single arena allocator for the entire parse result. Then deinit would be a single arena.deinit() call instead of walking through every field. We'll keep the per-field approach for now because it's more explicit about ownership, but for a production shell the arena approach would be cleaner. We covered arenas via the std.heap.ArenaAllocator back in episode 7.
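To make the tradeoff concrete, here is a minimal sketch of what the arena variant could look like. ArenaCommand and buildCommand are hypothetical names used only for this illustration, not part of the shell code. The point is that nothing needs a deinit method, because the arena owns every allocation and releases them together.

```zig
const std = @import("std");

// Hypothetical arena-backed variant of Command: no deinit method,
// because every slice lives in an arena owned by the caller.
const ArenaCommand = struct {
    program: []const u8,
    args: []const []const u8,
};

// Copies the program name and every argument into the arena.
fn buildCommand(
    arena: std.mem.Allocator,
    program: []const u8,
    args: []const []const u8,
) !ArenaCommand {
    const owned_args = try arena.alloc([]const u8, args.len);
    for (args, 0..) |arg, idx| {
        owned_args[idx] = try arena.dupe(u8, arg);
    }
    const owned_program = try arena.dupe(u8, program);
    return ArenaCommand{ .program = owned_program, .args = owned_args };
}

test "one arena.deinit releases the whole parse result" {
    var arena = std.heap.ArenaAllocator.init(std.testing.allocator);
    defer arena.deinit(); // single cleanup point, no per-field frees

    const cmd = try buildCommand(arena.allocator(), "grep", &.{ "-i", "hello" });
    try std.testing.expectEqualStrings("grep", cmd.program);
    try std.testing.expectEqual(@as(usize, 2), cmd.args.len);
}
```

Because the backing allocator here is std.testing.allocator, the test would fail if arena.deinit() left anything behind. No errdefer is needed inside buildCommand either: a failed allocation simply leaves the partial copies in the arena until it is torn down.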

The tokenizer: breaking input into tokens

Parsing happens in two stages. First we tokenize: split the raw input string into meaningful chunks (tokens). Then we parse: turn the token stream into our Pipeline structure. Splitting the work like this keeps each stage simple. We used the exact same two-stage approach in the markdown parser project (episode 37).

A shell tokenizer needs to handle:

Unquoted words: ls, -la, file.txt
Double-quoted strings: "hello world" (spaces inside don't split)
Single-quoted strings: 'hello world' (everything is literal)
Special characters: |, >, >>, <
Escape sequences: \" inside double quotes, \ to escape a space
Whitespace between tokens (any amount, gets consumed)

Let's define the token types:

const TokenKind = enum {
    word,         // regular word or quoted string (quotes stripped)
    pipe,         // |
    redirect_out, // >
    redirect_append, // >>
    redirect_in,  // <
};

const Token = struct {
    kind: TokenKind,
    value: []const u8, // owned string for words, empty for operators
};

Now the tokenizer function. This is the meaty part. We iterate through the input character by character, building up tokens as we go:

const TokenizeError = error{
    UnterminatedQuote,
    UnexpectedCharacter,
    TrailingEscape,
    OutOfMemory,
};

fn tokenize(allocator: std.mem.Allocator, input: []const u8) TokenizeError![]Token {
    var tokens = std.ArrayList(Token).init(allocator);
    errdefer {
        for (tokens.items) |tok| {
            if (tok.kind == .word) allocator.free(tok.value);
        }
        tokens.deinit();
    }

    var i: usize = 0;
    while (i < input.len) {
        const ch = input[i];

        // Skip whitespace
        if (ch == ' ' or ch == '\t') {
            i += 1;
            continue;
        }

        // Comments: # until end of line
        if (ch == '#') break;

        // Pipe
        if (ch == '|') {
            try tokens.append(.{ .kind = .pipe, .value = "" });
            i += 1;
            continue;
        }

        // Redirections
        if (ch == '>') {
            if (i + 1 < input.len and input[i + 1] == '>') {
                try tokens.append(.{ .kind = .redirect_append, .value = "" });
                i += 2;
            } else {
                try tokens.append(.{ .kind = .redirect_out, .value = "" });
                i += 1;
            }
            continue;
        }
        if (ch == '<') {
            try tokens.append(.{ .kind = .redirect_in, .value = "" });
            i += 1;
            continue;
        }

        // Quoted strings or regular words
        var buf = std.ArrayList(u8).init(allocator);
        errdefer buf.deinit();

        if (ch == '"') {
            // Double-quoted string: backslash escapes work
            i += 1; // skip opening quote
            while (i < input.len) {
                if (input[i] == '\\' and i + 1 < input.len) {
                    // Escape sequence
                    const next = input[i + 1];
                    switch (next) {
                        '"', '\\', '$', '`' => {
                            try buf.append(next);
                            i += 2;
                        },
                        'n' => {
                            try buf.append('\n');
                            i += 2;
                        },
                        't' => {
                            try buf.append('\t');
                            i += 2;
                        },
                        else => {
                            // Backslash is literal if next char isn't special
                            try buf.append('\\');
                            try buf.append(next);
                            i += 2;
                        },
                    }
                } else if (input[i] == '"') {
                    i += 1; // skip closing quote
                    break;
                } else {
                    try buf.append(input[i]);
                    i += 1;
                }
            } else {
                // Reached end of input without closing quote.
                // The errdefer on buf frees it; an explicit deinit here would double-free.
                return error.UnterminatedQuote;
            }
        } else if (ch == '\'') {
            // Single-quoted string: everything is literal, no escape processing
            i += 1; // skip opening quote
            while (i < input.len) {
                if (input[i] == '\'') {
                    i += 1;
                    break;
                }
                try buf.append(input[i]);
                i += 1;
            } else {
                // No closing quote; the errdefer on buf handles the cleanup
                return error.UnterminatedQuote;
            }
        } else {
            // Unquoted word: ends at whitespace or special character
            while (i < input.len) {
                const c = input[i];
                if (c == ' ' or c == '\t' or c == '|' or c == '>' or c == '<' or c == '#') {
                    break;
                }
                if (c == '\\' and i + 1 < input.len) {
                    // Escaped character in unquoted context
                    try buf.append(input[i + 1]);
                    i += 2;
                    continue;
                }
                if (c == '\\' and i + 1 >= input.len) {
                    // The errdefer on buf handles the cleanup
                    return error.TrailingEscape;
                }
                try buf.append(c);
                i += 1;
            }
        }

        const word = try buf.toOwnedSlice();
        errdefer allocator.free(word); // not owned by tokens until append succeeds
        try tokens.append(.{ .kind = .word, .value = word });
    }

    return try tokens.toOwnedSlice();
}

That's a solid chunk of code, so let me walk through the key decisions.

The errdefer at the top is critical. If tokenization fails halfway through -- say we encounter an unterminated quote after already collecting 5 tokens -- we need to free the memory we've already allocated. The errdefer iterates through all tokens collected so far and frees any word values. Without this, a parse error would leak every token we'd already created. This is the same errdefer cleanup pattern we hammered on in the WAL implementation in episode 41.

Double quotes vs single quotes: in real shells (bash, zsh), double quotes allow variable expansion ($HOME) and command substitution (backticks). Single quotes treat everything as literal -- no expansion, no escapes, nothing. We follow the same convention. Our double-quote handler processes escape sequences (\", \\, \n, \t), while the single-quote handler just copies bytes verbatim until it finds the closing quote. We're not implementing variable expansion yet (that's a whole separate concern), but the distinction matters for escaping.

The while/else pattern: Zig's while loop supports an else clause that runs when the loop condition becomes false without a break. We use this to detect unterminated quotes -- if we exhaust the input without finding a closing quote, the else branch fires and returns the error. This is much cleaner than a separate "did we find it?" boolean check after the loop.
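In isolation the pattern looks like this. The sketch below uses a hypothetical helper, findClosingQuote, that is not part of the tokenizer; it only shows that the else branch runs when the condition turns false and is skipped when the loop exits through break.

```zig
const std = @import("std");

// Hypothetical helper: returns the index of the first '"' at or after `start`,
// or UnterminatedQuote if the input ends before one is found.
fn findClosingQuote(input: []const u8, start: usize) error{UnterminatedQuote}!usize {
    var i = start;
    while (i < input.len) : (i += 1) {
        if (input[i] == '"') break; // leaves the loop and skips the else branch
    } else {
        // Only reached when `i < input.len` became false, i.e. no break happened.
        return error.UnterminatedQuote;
    }
    return i;
}

test "while/else distinguishes found from exhausted" {
    try std.testing.expectEqual(@as(usize, 5), try findClosingQuote("hello\" rest", 0));
    try std.testing.expectError(error.UnterminatedQuote, findClosingQuote("hello", 0));
}
```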

Comment handling: the # character starts a comment (everything until end of line is ignored). We simply break out of the main loop when we see it. Real shells only treat # as a comment when it's at the start of a word -- echo hello#world prints hello#world in bash. Our version is simpler. Good enough for now.
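If we wanted the bash behavior later, the check could be narrowed to "a # that begins a word". A minimal sketch, assuming a hypothetical helper named startsComment and the simplification that a word begins at index 0 or right after a space or tab (a real shell also starts a new word after an operator such as |):

```zig
const std = @import("std");

// Hypothetical helper: reports whether the '#' at index i begins a comment.
// Assumption: a word starts at index 0 or right after a space or tab.
fn startsComment(input: []const u8, i: usize) bool {
    if (i >= input.len or input[i] != '#') return false;
    if (i == 0) return true;
    const prev = input[i - 1];
    return prev == ' ' or prev == '\t';
}

test "hash only starts a comment at the start of a word" {
    try std.testing.expect(startsComment("ls # list files", 3));
    try std.testing.expect(!startsComment("echo hello#world", 10));
}
```

In the tokenizer shown above, the top-of-loop check already runs only at the start of a token, so the deviation comes from the unquoted-word loop, which also stops at #.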

The parser: tokens to Pipeline

With tokens in hand, we can now parse them into a Pipeline. The grammar is straightforward:

pipeline = command ("|" command)*
command  = word+ (redirection)*
redirection = (">" | ">>" | "<") word

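A pipeline is one or more commands separated by pipes. Each command is one or more words (the first being the program name, the rest being arguments), optionally followed by redirections. The implementation below is a bit more permissive than this grammar: it accepts redirections anywhere among a command's words, not only at the end.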

const ParseError = error{
    EmptyCommand,
    EmptyPipeline,
    MissingRedirectTarget,
    TrailingPipe,
    OutOfMemory,
};

fn parsePipeline(allocator: std.mem.Allocator, tokens: []const Token) ParseError!Pipeline {
    var commands = std.ArrayList(Command).init(allocator);
    errdefer {
        for (commands.items) |*cmd| cmd.deinit(allocator);
        commands.deinit();
    }

    var args = std.ArrayList([]const u8).init(allocator);
    errdefer {
        for (args.items) |a| allocator.free(a);
        args.deinit();
    }

    var redirects = std.ArrayList(Redirect).init(allocator);
    errdefer {
        for (redirects.items) |r| allocator.free(r.target);
        redirects.deinit();
    }

    var i: usize = 0;
    while (i < tokens.len) {
        const tok = tokens[i];

        switch (tok.kind) {
            .word => {
                const duped = try allocator.dupe(u8, tok.value);
                errdefer allocator.free(duped); // not owned by args until append succeeds
                try args.append(duped);
                i += 1;
            },
            .redirect_out, .redirect_append, .redirect_in => {
                // Next token must be a word (the filename)
                if (i + 1 >= tokens.len or tokens[i + 1].kind != .word) {
                    return error.MissingRedirectTarget;
                }
                const target = try allocator.dupe(u8, tokens[i + 1].value);
                errdefer allocator.free(target); // not owned by redirects until append succeeds
                const kind: RedirectKind = switch (tok.kind) {
                    .redirect_out => .stdout_overwrite,
                    .redirect_append => .stdout_append,
                    .redirect_in => .stdin_file,
                    else => unreachable,
                };
                try redirects.append(.{ .kind = kind, .target = target });
                i += 2;
            },
            .pipe => {
                // Flush current command
                if (args.items.len == 0) return error.EmptyCommand;

                const owned_args = try args.toOwnedSlice();
                const program = owned_args[0];
                const cmd_args = owned_args[1..];

                // We need to re-allocate args and redirects slices
                // because the Command takes ownership
                const final_args = try allocator.dupe([]const u8, cmd_args);
                allocator.free(owned_args);

                try commands.append(.{
                    .program = program,
                    .args = final_args,
                    .redirects = try redirects.toOwnedSlice(),
                });

                args = std.ArrayList([]const u8).init(allocator);
                redirects = std.ArrayList(Redirect).init(allocator);
                i += 1;
            },
        }
    }

    // Flush the last command (there's always one after the last pipe, or the only command)
    if (args.items.len == 0) {
        if (commands.items.len > 0) return error.TrailingPipe;
        return error.EmptyPipeline;
    }

    const owned_args = try args.toOwnedSlice();
    const program = owned_args[0];
    const cmd_args = owned_args[1..];
    const final_args = try allocator.dupe([]const u8, cmd_args);
    allocator.free(owned_args);

    try commands.append(.{
        .program = program,
        .args = final_args,
        .redirects = try redirects.toOwnedSlice(),
    });

    return Pipeline{ .commands = try commands.toOwnedSlice() };
}

The trickiest part here is the ownership dance with the arguments slice. When we call args.toOwnedSlice(), we get back a [][]const u8 where owned_args[0] is the program name and owned_args[1..] are the command arguments. But we want Command.program and Command.args to be separate fields. So we dupe the args sub-slice (which creates a new allocation for just the pointers -- the actual string data is already owned by each individual []const u8), then free the original combined slice.
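The reason for the extra dupe is that an allocator can only free a slice with the same pointer and length it originally handed out, so owned_args[1..] cannot be kept alive while owned_args[0] is dropped. The sketch below isolates that step in a hypothetical helper, splitProgram (the name and the Split struct are assumptions for illustration, not part of the parser).

```zig
const std = @import("std");

const Split = struct {
    program: []const u8,
    args: []const []const u8,
};

// Hypothetical helper. On success it takes ownership of the outer slice `words`
// and frees it. The strings themselves are not copied; only the array of slice
// headers for the tail is re-allocated.
fn splitProgram(allocator: std.mem.Allocator, words: [][]const u8) !Split {
    std.debug.assert(words.len > 0);
    // Copy the tail first: if this fails, `words` is untouched and still owned by the caller.
    const tail = try allocator.dupe([]const u8, words[1..]);
    const program = words[0];
    allocator.free(words); // a sub-slice cannot be freed on its own, so free the whole block
    return Split{ .program = program, .args = tail };
}

test "program and args end up in separate allocations" {
    const a = std.testing.allocator;
    const words = try a.alloc([]const u8, 3);
    words[0] = "grep";
    words[1] = "-i";
    words[2] = "hello";

    const split = try splitProgram(a, words);
    defer a.free(split.args);

    try std.testing.expectEqualStrings("grep", split.program);
    try std.testing.expectEqual(@as(usize, 2), split.args.len);
}
```

The test uses string literals for the words, so only the outer arrays are allocated; in the parser each word is a heap copy that Command.deinit frees later.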

This is a bit fiddly, I know. An alternative would be to store everything in a single args field on the Command and just treat args[0] as the program name. That's simpler but less expressive. I went with separate fields because when we execute the command later, we'll pass program to std.process.Child and args to the argument list, and having them pre-split makes that cleaner.

The errdefer blocks are nested and layered here. If we error out after building some commands but before finishing, we need to free the already-built commands AND the in-progress args and redirects. Getting this cleanup right is non-trivial -- this is one of those situations where Zig's explicit memory management forces you to think carefully about every ownership boundary. And that's a good thing, even if it's more work up front ;-)

Handling escape characters and special characters

The tokenizer above handles basic escapes, but let me go a bit deeper on the design. In a POSIX-compliant shell, escape behavior depends on context:

Outside quotes: a backslash escapes any character. hello\ world is one token ("hello world"). echo \> treats the > as a literal character instead of a redirection operator.

Inside double quotes: only specific characters are escapable: ", \, $, and backtick. A backslash followed by anything else is literal -- "\z" becomes \z, not z.

Inside single quotes: nothing is escapable. There is literally no way to include a single quote inside a single-quoted string. If you need one, you have to end the single-quoted string, add an escaped single quote, and start a new single-quoted string: 'can'\''t' which produces can't. This is such a weird edge case that most people don't know about it.
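The three contexts boil down to a small predicate. This is a minimal sketch with hypothetical names (QuoteContext, backslashEscapes) that are not part of the shell code. It leaves out one POSIX detail: inside double quotes a backslash followed by a newline acts as a line continuation.

```zig
const std = @import("std");

// Hypothetical enum naming the three quoting contexts described above.
const QuoteContext = enum { unquoted, double_quoted, single_quoted };

// Reports whether a backslash followed by `next` is consumed as an escape
// (true) or kept as a literal backslash (false) in the given context.
fn backslashEscapes(ctx: QuoteContext, next: u8) bool {
    return switch (ctx) {
        .unquoted => true, // any character can be escaped
        .double_quoted => next == '"' or next == '\\' or next == '$' or next == '`',
        .single_quoted => false, // everything is literal
    };
}

test "escape rules depend on the quoting context" {
    try std.testing.expect(backslashEscapes(.unquoted, ' '));
    try std.testing.expect(backslashEscapes(.double_quoted, '$'));
    try std.testing.expect(!backslashEscapes(.double_quoted, 'z'));
    try std.testing.expect(!backslashEscapes(.single_quoted, '\''));
}
```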

Our tokenizer handles the first two, with one deviation: inside double quotes it also translates \n and \t into a newline and a tab, whereas a POSIX shell would keep the backslash and the letter. The single-quote-escape trick is something we'll skip for now because it requires the tokenizer to concatenate adjacent string segments, which adds quite a bit of complexity. In real shells this is called "word joining" -- adjacent quoted and unquoted segments without whitespace between them form a single word. So hello" world" is one word: hello world. Getting this right requires restructuring the tokenizer to accumulate bytes across multiple segments before emitting a single word token.

For the curious, here's what the word-joining extension would look like:

// Word joining: adjacent segments without whitespace form one word
// Example: hello"world"'!'  ->  "helloworld!"
fn tokenizeWord(allocator: std.mem.Allocator, input: []const u8, pos: *usize) ![]const u8 {
    var buf = std.ArrayList(u8).init(allocator);
    errdefer buf.deinit();

    while (pos.* < input.len) {
        const ch = input[pos.*];

        if (ch == '"') {
            // Consume double-quoted segment
            pos.* += 1;
            while (pos.* < input.len and input[pos.*] != '"') {
                if (input[pos.*] == '\\' and pos.* + 1 < input.len) {
                    const next = input[pos.* + 1];
                    if (next == '"' or next == '\\') {
                        try buf.append(next);
                        pos.* += 2;
                    } else {
                        try buf.append('\\');
                        try buf.append(next);
                        pos.* += 2;
                    }
                } else {
                    try buf.append(input[pos.*]);
                    pos.* += 1;
                }
            }
            if (pos.* < input.len) pos.* += 1; // skip closing quote
        } else if (ch == '\'') {
            // Consume single-quoted segment (all literal)
            pos.* += 1;
            while (pos.* < input.len and input[pos.*] != '\'') {
                try buf.append(input[pos.*]);
                pos.* += 1;
            }
            if (pos.* < input.len) pos.* += 1;
        } else if (ch == ' ' or ch == '\t' or ch == '|' or ch == '>' or ch == '<') {
            // End of word
            break;
        } else if (ch == '\\' and pos.* + 1 < input.len) {
            try buf.append(input[pos.* + 1]);
            pos.* += 2;
        } else {
            try buf.append(ch);
            pos.* += 1;
        }
    }

    return try buf.toOwnedSlice();
}

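The key idea is that instead of treating ", ', and unquoted as separate token-producing paths, we loop over segments and keep appending to the same buffer. The word only ends when we hit unquoted whitespace or a special character. This unifies the three modes into a single word-building loop. Note that this sketch is lenient: an unterminated quote simply runs to the end of the input instead of returning UnterminatedQuote, so a real integration would need to add that check back.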

A simple REPL: read, parse, display

Now let's wire everything together into a basic REPL (Read-Eval-Print Loop). For this episode, we won't actually execute commands -- that's for next time. Instead we'll parse the input and display the structured result so we can verify the parser works correctly.

fn printPipeline(pipeline: *const Pipeline) void {
    const stdout = std.io.getStdOut().writer();

    for (pipeline.commands, 0..) |cmd, cmd_idx| {
        if (cmd_idx > 0) {
            stdout.print("  |\n", .{}) catch {};
        }
        stdout.print("Command {d}:\n", .{cmd_idx}) catch {};
        stdout.print("  program: \"{s}\"\n", .{cmd.program}) catch {};

        if (cmd.args.len > 0) {
            stdout.print("  args:", .{}) catch {};
            for (cmd.args) |arg| {
                stdout.print(" \"{s}\"", .{arg}) catch {};
            }
            stdout.print("\n", .{}) catch {};
        }

        for (cmd.redirects) |redir| {
            const sym: []const u8 = switch (redir.kind) {
                .stdout_overwrite => ">",
                .stdout_append => ">>",
                .stdin_file => "<",
            };
            stdout.print("  redirect: {s} \"{s}\"\n", .{ sym, redir.target }) catch {};
        }
    }
}

pub fn main() !void {
    var gpa = std.heap.GeneralPurposeAllocator(.{}){};
    defer {
        const check = gpa.deinit();
        if (check == .leak) std.debug.print("WARNING: memory leak detected\n", .{});
    }
    const allocator = gpa.allocator();

    const stdin = std.io.getStdIn().reader();
    const stdout = std.io.getStdOut().writer();

    try stdout.print("zsh-lite> ", .{});

    var line_buf: [4096]u8 = undefined;
    while (stdin.readUntilDelimiter(&line_buf, '\n')) |line| {
        if (line.len == 0) {
            try stdout.print("zsh-lite> ", .{});
            continue;
        }

        // Check for exit
        if (std.mem.eql(u8, line, "exit") or std.mem.eql(u8, line, "quit")) {
            try stdout.print("Bye!\n", .{});
            break;
        }

        const tokens = tokenize(allocator, line) catch |err| {
            switch (err) {
                error.UnterminatedQuote => try stdout.print("Error: unterminated quote\n", .{}),
                error.TrailingEscape => try stdout.print("Error: trailing backslash\n", .{}),
                else => try stdout.print("Tokenize error: {}\n", .{err}),
            }
            try stdout.print("zsh-lite> ", .{});
            continue;
        };
        defer {
            for (tokens) |tok| {
                if (tok.kind == .word) allocator.free(tok.value);
            }
            allocator.free(tokens);
        }

        if (tokens.len == 0) {
            try stdout.print("zsh-lite> ", .{});
            continue;
        }

        const pipeline = parsePipeline(allocator, tokens) catch |err| {
            switch (err) {
                error.EmptyCommand => try stdout.print("Error: empty command (double pipe?)\n", .{}),
                error.EmptyPipeline => try stdout.print("Error: empty pipeline\n", .{}),
                error.MissingRedirectTarget => try stdout.print("Error: redirect without target\n", .{}),
                error.TrailingPipe => try stdout.print("Error: trailing pipe\n", .{}),
                else => try stdout.print("Parse error: {}\n", .{err}),
            }
            try stdout.print("zsh-lite> ", .{});
            continue;
        };
        defer pipeline.deinit(allocator);

        try stdout.print("\n", .{});
        printPipeline(&pipeline);
        try stdout.print("\nzsh-lite> ", .{});
    } else |err| {
        if (err != error.EndOfStream) {
            std.debug.print("Read error: {}\n", .{err});
        }
    }
}

The REPL uses readUntilDelimiter with a fixed 4096-byte buffer for line input. That's plenty for interactive use. If somebody pastes in a 5000-character command... well, the read fails with error.StreamTooLong, which lands in the loop's else branch and ends the REPL. A production shell would use dynamic allocation for the line buffer (or a line-editing library like readline), but for our purposes the fixed buffer keeps things simple.
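One practical wrinkle with splitting on '\n': on Windows the console typically delivers lines ending in \r\n, so the slice handed to the tokenizer can end in a stray \r that would become part of the last word (and would make the exit comparison fail). A minimal sketch of a fix, using a hypothetical helper named stripCarriageReturn that the REPL could apply to each line before tokenizing:

```zig
const std = @import("std");

// Hypothetical helper: drops a single trailing '\r', if present.
// Returns a sub-slice of the input; nothing is allocated or copied.
fn stripCarriageReturn(line: []const u8) []const u8 {
    if (line.len > 0 and line[line.len - 1] == '\r') {
        return line[0 .. line.len - 1];
    }
    return line;
}

test "trailing carriage return is removed" {
    try std.testing.expectEqualStrings("exit", stripCarriageReturn("exit\r"));
    try std.testing.expectEqualStrings("ls -la", stripCarriageReturn("ls -la"));
}
```

Only the carriage return is removed here. Trimming trailing spaces as well would change the meaning of a line that ends in an escaped space.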

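The while/else on readUntilDelimiter is a close cousin of the pattern in the tokenizer. Here the loop condition is an error union: the body runs with the unwrapped line for as long as reads succeed, and the else branch receives the error that ended the loop, which is EndOfStream when the user presses Ctrl+D. The inner error handling catches parse errors and prints friendly messages instead of crashing.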
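Stripped of the I/O, the error-union form of while/else can be sketched as follows. Counter is a hypothetical stand-in for the reader, invented for this illustration: it yields three values and then fails with EndOfStream.

```zig
const std = @import("std");

// Hypothetical stand-in for a reader: yields 1, 2, 3 and then fails with EndOfStream.
const Counter = struct {
    n: usize = 0,

    fn next(self: *Counter) error{EndOfStream}!usize {
        if (self.n >= 3) return error.EndOfStream;
        self.n += 1;
        return self.n;
    }
};

// Sums values for as long as next() succeeds; the else branch receives the
// error that ended the loop.
fn sumUntilEnd(counter: *Counter) usize {
    var total: usize = 0;
    while (counter.next()) |value| {
        total += value;
    } else |err| {
        // EndOfStream is the only error next() can return.
        std.debug.assert(err == error.EndOfStream);
    }
    return total;
}

test "loop body sees payloads, else branch sees the error" {
    var counter = Counter{};
    try std.testing.expectEqual(@as(usize, 6), sumUntilEnd(&counter));
}
```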

Let's see what a session looks like:

zsh-lite> ls -la /tmp
Command 0:
  program: "ls"
  args: "-la" "/tmp"

zsh-lite> cat file.txt | grep "hello world" | wc -l
Command 0:
  program: "cat"
  args: "file.txt"
  |
Command 1:
  program: "grep"
  args: "hello world"
  |
Command 2:
  program: "wc"
  args: "-l"

zsh-lite> echo "test" > output.txt
Command 0:
  program: "echo"
  args: "test"
  redirect: > "output.txt"

zsh-lite> sort < input.txt >> results.txt
Command 0:
  program: "sort"
  redirect: < "input.txt"
  redirect: >> "results.txt"

zsh-lite> exit
Bye!

Looks right. The parser correctly splits pipes, extracts redirections, handles quoted strings (stripping the quotes and keeping the content), and produces a clean structured output.

Testing the parser

One of the things I always stress in this series -- and we dedicated an entire episode to it (episode 12) -- is that parsers are perfect candidates for unit testing. The input is a string, the output is a structure. Pure transformation, no side effects.

test "simple command" {
    const allocator = std.testing.allocator;
    const tokens = try tokenize(allocator, "ls -la /tmp");
    defer {
        for (tokens) |t| if (t.kind == .word) allocator.free(t.value);
        allocator.free(tokens);
    }

    try std.testing.expectEqual(@as(usize, 3), tokens.len);
    try std.testing.expectEqualStrings("ls", tokens[0].value);
    try std.testing.expectEqualStrings("-la", tokens[1].value);
    try std.testing.expectEqualStrings("/tmp", tokens[2].value);
}

test "quoted string preserves spaces" {
    const allocator = std.testing.allocator;
    const tokens = try tokenize(allocator, "echo \"hello world\"");
    defer {
        for (tokens) |t| if (t.kind == .word) allocator.free(t.value);
        allocator.free(tokens);
    }

    try std.testing.expectEqual(@as(usize, 2), tokens.len);
    try std.testing.expectEqualStrings("echo", tokens[0].value);
    try std.testing.expectEqualStrings("hello world", tokens[1].value);
}

test "pipe produces pipe tokens" {
    const allocator = std.testing.allocator;
    const tokens = try tokenize(allocator, "ls | grep foo");
    defer {
        for (tokens) |t| if (t.kind == .word) allocator.free(t.value);
        allocator.free(tokens);
    }

    try std.testing.expectEqual(@as(usize, 4), tokens.len);
    try std.testing.expect(tokens[0].kind == .word);
    try std.testing.expect(tokens[1].kind == .pipe);
    try std.testing.expect(tokens[2].kind == .word);
    try std.testing.expect(tokens[3].kind == .word);
}

test "redirections" {
    const allocator = std.testing.allocator;
    const tokens = try tokenize(allocator, "sort < input.txt >> output.txt");
    defer {
        for (tokens) |t| if (t.kind == .word) allocator.free(t.value);
        allocator.free(tokens);
    }

    try std.testing.expectEqual(@as(usize, 5), tokens.len);
    try std.testing.expect(tokens[0].kind == .word);
    try std.testing.expect(tokens[1].kind == .redirect_in);
    try std.testing.expect(tokens[2].kind == .word);
    try std.testing.expect(tokens[3].kind == .redirect_append);
    try std.testing.expect(tokens[4].kind == .word);
}

test "escape sequence in double quotes" {
    const allocator = std.testing.allocator;
    const tokens = try tokenize(allocator, "echo \"hello\\tworld\\n\"");
    defer {
        for (tokens) |t| if (t.kind == .word) allocator.free(t.value);
        allocator.free(tokens);
    }

    try std.testing.expectEqual(@as(usize, 2), tokens.len);
    try std.testing.expectEqualStrings("hello\tworld\n", tokens[1].value);
}

test "single quotes are fully literal" {
    const allocator = std.testing.allocator;
    const tokens = try tokenize(allocator, "echo '\\n is not a newline'");
    defer {
        for (tokens) |t| if (t.kind == .word) allocator.free(t.value);
        allocator.free(tokens);
    }

    try std.testing.expectEqual(@as(usize, 2), tokens.len);
    try std.testing.expectEqualStrings("\\n is not a newline", tokens[1].value);
}

test "unterminated quote returns error" {
    const allocator = std.testing.allocator;
    const result = tokenize(allocator, "echo \"hello");
    try std.testing.expectError(error.UnterminatedQuote, result);
}

test "empty input produces no tokens" {
    const allocator = std.testing.allocator;
    const tokens = try tokenize(allocator, "   ");
    defer allocator.free(tokens);
    try std.testing.expectEqual(@as(usize, 0), tokens.len);
}

test "pipeline parse: two commands" {
    const allocator = std.testing.allocator;
    const tokens = try tokenize(allocator, "cat file.txt | wc -l");
    defer {
        for (tokens) |t| if (t.kind == .word) allocator.free(t.value);
        allocator.free(tokens);
    }

    const pipeline = try parsePipeline(allocator, tokens);
    defer pipeline.deinit(allocator);

    try std.testing.expectEqual(@as(usize, 2), pipeline.commands.len);
    try std.testing.expectEqualStrings("cat", pipeline.commands[0].program);
    try std.testing.expectEqual(@as(usize, 1), pipeline.commands[0].args.len);
    try std.testing.expectEqualStrings("file.txt", pipeline.commands[0].args[0]);
    try std.testing.expectEqualStrings("wc", pipeline.commands[1].program);
    try std.testing.expectEqual(@as(usize, 1), pipeline.commands[1].args.len);
    try std.testing.expectEqualStrings("-l", pipeline.commands[1].args[0]);
}

test "trailing pipe returns error" {
    const allocator = std.testing.allocator;
    const tokens = try tokenize(allocator, "ls |");
    defer {
        for (tokens) |t| if (t.kind == .word) allocator.free(t.value);
        allocator.free(tokens);
    }

    const result = parsePipeline(allocator, tokens);
    try std.testing.expectError(error.TrailingPipe, result);
}

That's ten tests covering the core functionality. The std.testing.allocator is especially useful here -- it's a wrapper around the GPA that fails the test if any memory leaks are detected. So these tests don't just verify correctness, they also verify that our cleanup code works: every deinit runs, and the two error tests run the errdefer blocks on the paths they take. What they do not cover is the out-of-memory paths, where an allocation fails halfway through.
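For the allocation-failure paths, the standard library offers std.testing.checkAllAllocationFailures, which reruns a function while forcing each allocation in turn to fail and reports any leak. The sketch below applies it to a small stand-in rather than to the real tokenizer: dupeWords and roundTrip are hypothetical functions written for this illustration, following the same "free what was built so far" errdefer pattern.

```zig
const std = @import("std");

// Hypothetical stand-in for the tokenizer's allocation pattern:
// copies each word, cleaning up the partial result if any copy fails.
fn dupeWords(allocator: std.mem.Allocator, words: []const []const u8) ![][]u8 {
    const out = try allocator.alloc([]u8, words.len);
    var filled: usize = 0;
    errdefer {
        for (out[0..filled]) |w| allocator.free(w);
        allocator.free(out);
    }
    for (words) |w| {
        out[filled] = try allocator.dupe(u8, w);
        filled += 1;
    }
    return out;
}

// The function under test must take the allocator first and return !void.
fn roundTrip(allocator: std.mem.Allocator, words: []const []const u8) !void {
    const copy = try dupeWords(allocator, words);
    defer {
        for (copy) |w| allocator.free(w);
        allocator.free(copy);
    }
    try std.testing.expectEqual(words.len, copy.len);
}

test "cleanup holds up under every allocation failure" {
    const words: []const []const u8 = &.{ "ls", "-la", "/tmp" };
    try std.testing.checkAllAllocationFailures(std.testing.allocator, roundTrip, .{words});
}
```

Pointing the same check at tokenize and parsePipeline would be the natural next step, with a wrapper like roundTrip that frees the result.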

You'd run these with zig build test (assuming your build.zig is set up -- same pattern as the image tool from episode 46).

Edge cases: the stuff that breaks parsers

Let me walk through some of the nasty inputs that shell parsers need to handle:

Multiple spaces: ls      -la (with a run of spaces between the words) should produce the same result as ls -la. Our tokenizer handles this because the whitespace-skipping branch at the top of the main loop consumes every space and tab before looking for the next token.

Empty input or whitespace-only input: "" or " " should produce zero tokens and no error. Our tokenizer does this -- it just never enters any token-building branch and returns an empty slice.
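The whitespace behaviour in the two cases above can be sketched with a stripped-down word counter. This is not the article's tokenizer: countWords is a hypothetical helper that ignores quotes and operators, and only shows why runs of spaces and empty input fall out naturally from a skip loop.

```zig
const std = @import("std");

// Hypothetical helper: counts unquoted words, nothing more.
fn countWords(input: []const u8) usize {
    var count: usize = 0;
    var i: usize = 0;
    while (i < input.len) {
        // Skip every consecutive space or tab before the next word.
        while (i < input.len and (input[i] == ' ' or input[i] == '\t')) i += 1;
        if (i >= input.len) break; // only whitespace was left
        // Consume one word up to the next space or tab.
        while (i < input.len and input[i] != ' ' and input[i] != '\t') i += 1;
        count += 1;
    }
    return count;
}

test "runs of spaces collapse" {
    try std.testing.expectEqual(@as(usize, 2), countWords("ls      -la"));
    try std.testing.expectEqual(@as(usize, 2), countWords("ls -la"));
}
```

Because the break happens before any word is counted, empty and whitespace-only input never produce a token.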

Trailing pipe: ls | is an error. The user started a pipeline but didn't provide the second command. Our parser catches this because after processing the pipe token, the args list is empty when we try to flush the last command.

Double pipe: ls || grep foo -- this is tricky. In bash, || is the logical OR operator ("run the second command if the first fails"). In our simple parser, it's two pipe tokens with an empty command between them, which our parser rejects as EmptyCommand. If we wanted to support ||, we'd need to add it as a separate token type in the tokenizer.
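If you did want || as its own token, one way is to peek a single byte ahead when the tokenizer sees a pipe character. The sketch below assumes a hypothetical PipeOp enum and scanPipeOperator helper; neither exists in the code from this episode.

```zig
const std = @import("std");

// Hypothetical operator kinds; the episode's tokenizer only knows a single pipe.
const PipeOp = enum { pipe, or_op };

const ScanResult = struct {
    op: PipeOp,
    len: usize, // number of input bytes the operator occupies
};

// Assumes input[pos] is '|'. Looks one byte ahead to tell `||` from `|`.
fn scanPipeOperator(input: []const u8, pos: usize) ScanResult {
    std.debug.assert(pos < input.len and input[pos] == '|');
    if (pos + 1 < input.len and input[pos + 1] == '|') {
        return .{ .op = .or_op, .len = 2 };
    }
    return .{ .op = .pipe, .len = 1 };
}

test "double pipe is scanned as one operator" {
    const r = scanPipeOperator("ls || grep foo", 3);
    try std.testing.expectEqual(PipeOp.or_op, r.op);
    try std.testing.expectEqual(@as(usize, 2), r.len);
}
```

The tokenizer would then advance by len instead of always stepping one byte, so the second | is never seen as a separate pipe.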

Redirect without target: ls > is an error. The redirection operator needs a filename after it. Our parser checks i + 1 >= tokens.len and returns MissingRedirectTarget.
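As a rough sketch of that bounds check in isolation: the Token shape below mirrors the kind/value fields used in the tests, but the redirect_out kind name and the redirectTarget helper are assumptions for illustration. The extra check that the next token is a word is also my addition, not something the parser above is confirmed to do.

```zig
const std = @import("std");

// Simplified token model; redirect_out is a hypothetical kind name.
const TokenKind = enum { word, redirect_out };
const Token = struct { kind: TokenKind, value: []const u8 };

// `i` is the index of the redirection operator. Returns the filename after it.
fn redirectTarget(tokens: []const Token, i: usize) error{MissingRedirectTarget}![]const u8 {
    // Nothing after the operator, or something that is not a filename.
    if (i + 1 >= tokens.len or tokens[i + 1].kind != .word) {
        return error.MissingRedirectTarget;
    }
    return tokens[i + 1].value;
}

test "redirect without target is rejected" {
    const tokens = [_]Token{
        .{ .kind = .word, .value = "ls" },
        .{ .kind = .redirect_out, .value = ">" },
    };
    try std.testing.expectError(error.MissingRedirectTarget, redirectTarget(&tokens, 1));
}
```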

Comments: ls -la # this is a comment should produce tokens for just ls -la. The # and everything after it is ignored.
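One possible way to get that behaviour is a pre-pass that cuts the line at the first # that starts a word outside of quotes. stripComment below is a hypothetical helper written for this illustration, not code from the tokenizer; it treats only spaces and tabs as word separators.

```zig
const std = @import("std");

// Hypothetical helper: returns the part of `line` before an unquoted `#`
// that begins a word. Quote and backslash handling is deliberately minimal.
fn stripComment(line: []const u8) []const u8 {
    var in_single = false;
    var in_double = false;
    var at_word_start = true;
    var i: usize = 0;
    while (i < line.len) : (i += 1) {
        const c = line[i];
        if (in_single) {
            if (c == '\'') in_single = false;
        } else if (in_double) {
            if (c == '\\') {
                i += 1; // skip the escaped byte
            } else if (c == '"') {
                in_double = false;
            }
        } else {
            switch (c) {
                '#' => if (at_word_start) return line[0..i],
                '\'' => in_single = true,
                '"' => in_double = true,
                '\\' => i += 1, // skip the escaped byte
                else => {},
            }
            at_word_start = (c == ' ' or c == '\t');
            continue;
        }
        at_word_start = false;
    }
    return line;
}

test "comment is cut off" {
    try std.testing.expectEqualStrings("ls -la ", stripComment("ls -la # this is a comment"));
}
```

The trailing space that remains is harmless, since the tokenizer skips whitespace anyway.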

Mixed quoting styles in one word: As discussed in the escape section, hello"world"'!' is three segments that form one word. Our basic tokenizer doesn't handle this -- each quoted section becomes its own token. The word-joining extension shown earlier would handle it correctly.

Backslash at end of input: echo hello\ is an error (trailing escape with nothing to escape). Our tokenizer returns TrailingEscape for this case.

One edge case that's particularly annoying in real shells is nested quotes. What does echo "it's a \"test\"" produce? In our parser: it's a "test". The backslash-escaped double quotes inside the double-quoted string are treated as literal quote characters. The single quote inside doesn't do anything special because we're already inside double quotes. This matches POSIX behavior.
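The double-quote rules described here can be sketched as a small standalone decoder. readDoubleQuoted is a hypothetical helper that writes into a caller-provided buffer to keep the example allocator-free; the tokenizer in this episode builds its strings differently.

```zig
const std = @import("std");

const QuoteError = error{ UnterminatedQuote, BufferTooSmall };

// `input` starts right after the opening double quote. Decoded bytes go
// into `out`; the filled part is returned once the closing quote is found.
fn readDoubleQuoted(input: []const u8, out: []u8) QuoteError![]u8 {
    var n: usize = 0;
    var i: usize = 0;
    while (i < input.len) : (i += 1) {
        var c = input[i];
        if (c == '"') return out[0..n]; // closing quote
        if (c == '\\' and i + 1 < input.len) {
            const next = input[i + 1];
            // Only ", \ and $ are escapable inside double quotes.
            if (next == '"' or next == '\\' or next == '$') {
                c = next;
                i += 1;
            }
        }
        if (n == out.len) return error.BufferTooSmall;
        out[n] = c;
        n += 1;
    }
    return error.UnterminatedQuote;
}

test "escaped quotes inside double quotes" {
    var buf: [64]u8 = undefined;
    // Everything after the opening quote of "it's a \"test\"".
    const got = try readDoubleQuoted("it's a \\\"test\\\"\"", &buf);
    try std.testing.expectEqualStrings("it's a \"test\"", got);
}
```

The single quote passes through untouched because the function never switches quoting mode; inside double quotes it is just another byte.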

Project setup

Here's the build.zig for the shell project:

const std = @import("std");

pub fn build(b: *std.Build) void {
    const target = b.standardTargetOptions(.{});
    const optimize = b.standardOptimizeOption(.{});

    const exe = b.addExecutable(.{
        .name = "zsh-lite",
        .root_source_file = b.path("src/main.zig"),
        .target = target,
        .optimize = optimize,
    });

    b.installArtifact(exe);

    const run_cmd = b.addRunArtifact(exe);
    run_cmd.step.dependOn(b.getInstallStep());

    const run_step = b.step("run", "Run zsh-lite");
    run_step.dependOn(&run_cmd.step);

    const tests = b.addTest(.{
        .root_source_file = b.path("src/main.zig"),
        .target = target,
        .optimize = optimize,
    });

    const run_tests = b.addRunArtifact(tests);
    const test_step = b.step("test", "Run parser tests");
    test_step.dependOn(&run_tests.step);
}

And the project folder layout:

zsh-lite/
  src/
    main.zig       -- REPL, arg structures, tokenizer, parser, tests
  build.zig        -- Build configuration

For now everything lives in main.zig. As the project grows across the next episodes we'll split into modules -- tokenizer.zig, parser.zig, executor.zig, builtins.zig -- following the same modular pattern we used in Project B (the markdown tool, episodes 37-39) and Project C (the image tool, episodes 44-46).

Build it with:

zig build test     # run the parser tests
zig build run      # launch the REPL

What we learned

Defining a clean data model (Command, Pipeline, Redirect) before writing any parsing code -- the structure drives the implementation, not the other way around
Building a two-stage parser: tokenizer splits raw input into typed tokens, parser converts tokens into structured commands -- the same architecture we used for the markdown tool
Handling three quoting modes: unquoted (backslash escapes anything), double-quoted (selective escapes for ", \, $), single-quoted (everything literal)
Memory ownership in parse results: each Command owns its strings, deinit walks every field, errdefer chains ensure cleanup on parse failure
Detecting and reporting common errors: unterminated quotes, trailing pipes, missing redirect targets, empty commands
Using while/else loops to detect exhausted input (unterminated quote detection)
Word joining as a concept: how real shells concatenate adjacent quoted and unquoted segments into a single token
Writing focused unit tests for parsers: string in, structure out, testing allocator catches leaks

We've got the foundation. The parser can handle pipes, redirections, quoting, and escaping. What it can't do yet is actually run anything -- typing ls just shows you the parsed structure instead of listing files. Making that happen requires process spawning, file descriptor manipulation, and pipe plumbing. That's where the real systems programming starts ;-)

Thanks for reading!

Hive account: @scipio

