Learn Zig Series (#47) - Build a Shell: Parsing Commands
Project D: Build Your Own Shell (1/4)
What will I learn
- You will learn tokenizing a command line: splitting on spaces, handling quoted strings and escape sequences;
- You will learn the Command struct: how to represent a parsed command with program name, arguments, and I/O redirections;
- You will learn parsing pipes: turning ls | grep foo | wc -l into a chain of connected commands;
- You will learn input/output redirection: parsing >, >>, and < operators;
- You will learn handling escape characters and special characters inside and outside quotes;
- You will learn building a simple REPL loop: print prompt, read line, parse, display result;
- You will learn testing the parser with various command strings and edge cases;
- You will learn handling tricky inputs: empty lines, multiple spaces, unclosed quotes, trailing pipes.
Requirements
- A working modern computer running macOS, Windows or Ubuntu;
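- An installed Zig 0.14 distribution (download from ziglang.org). The listings use the 0.14 standard library; newer releases changed some of the APIs used here (std.ArrayList and std.io.getStdOut among them), so the code may need adjusting there;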
- The ambition to learn Zig programming.
Difficulty
- Intermediate
Curriculum (of the Learn Zig Series):
- Zig Programming Tutorial - ep001 - Intro
- Learn Zig Series (#2) - Hello Zig, Variables and Types
- Learn Zig Series (#3) - Functions and Control Flow
- Learn Zig Series (#4) - Error Handling (Zig's Best Feature)
- Learn Zig Series (#5) - Arrays, Slices, and Strings
- Learn Zig Series (#6) - Structs, Enums, and Tagged Unions
- Learn Zig Series (#7) - Memory Management and Allocators
- Learn Zig Series (#8) - Pointers and Memory Layout
- Learn Zig Series (#9) - Comptime (Zig's Superpower)
- Learn Zig Series (#10) - Project Structure, Modules, and File I/O
- Learn Zig Series (#11) - Mini Project: Building a Step Sequencer
- Learn Zig Series (#12) - Testing and Test-Driven Development
- Learn Zig Series (#13) - Interfaces via Type Erasure
- Learn Zig Series (#14) - Generics with Comptime Parameters
- Learn Zig Series (#15) - The Build System (build.zig)
- Learn Zig Series (#16) - Sentinel-Terminated Types and C Strings
- Learn Zig Series (#17) - Packed Structs and Bit Manipulation
- Learn Zig Series (#18) - Async Concepts and Event Loops
- Learn Zig Series (#18b) - Addendum: Async Returns in Zig 0.16
- Learn Zig Series (#19) - SIMD with @Vector
- Learn Zig Series (#20) - Working with JSON
- Learn Zig Series (#21) - Networking and TCP Sockets
- Learn Zig Series (#22) - Hash Maps and Data Structures
- Learn Zig Series (#23) - Iterators and Lazy Evaluation
- Learn Zig Series (#24) - Logging, Formatting, and Debug Output
- Learn Zig Series (#25) - Mini Project: HTTP Status Checker
- Learn Zig Series (#26) - Writing a Custom Allocator
- Learn Zig Series (#27) - C Interop: Calling C from Zig
- Learn Zig Series (#28) - C Interop: Exposing Zig to C
- Learn Zig Series (#29) - Inline Assembly and Low-Level Control
- Learn Zig Series (#30) - Thread Safety and Atomics
- Learn Zig Series (#31) - Memory-Mapped I/O and Files
- Learn Zig Series (#32) - Compile-Time Reflection with @typeInfo
- Learn Zig Series (#33) - Building a State Machine with Tagged Unions
- Learn Zig Series (#34) - Performance Profiling and Optimization
- Learn Zig Series (#35) - Cross-Compilation and Target Triples
- Learn Zig Series (#36) - Mini Project: CLI Task Runner
- Learn Zig Series (#37) - Markdown to HTML: Tokenizer and Lexer
- Learn Zig Series (#38) - Markdown to HTML: Parser and AST
- Learn Zig Series (#39) - Markdown to HTML: Renderer and CLI
- Learn Zig Series (#40) - Key-Value Store: In-Memory Store
- Learn Zig Series (#41) - Key-Value Store: Write-Ahead Log
- Learn Zig Series (#42) - Key-Value Store: TCP Server
- Learn Zig Series (#43) - Key-Value Store: Client Library and Benchmarks
- Learn Zig Series (#44) - Image Tool: Reading and Writing PPM/BMP
- Learn Zig Series (#45) - Image Tool: Pixel Operations
- Learn Zig Series (#46) - Image Tool: CLI Pipeline
- Learn Zig Series (#47) - Build a Shell: Parsing Commands (this post)
We just wrapped up Project C -- three episodes building an image manipulation tool from raw pixel buffers all the way to a working CLI pipeline. Now for something completely different. Project D is a shell. Not a simplified toy shell. A proper command-line shell that can parse complex commands, handle pipes and redirections, spawn processes, and manage jobs. The kind of program that every Unix user interacts with hundreds of times a day but (I suspect) very few have actually built themselves.
This first episode focuses entirely on the parsing layer. We need to turn a raw string like cat file.txt | grep "hello world" | wc -l > output.txt into a structured representation that our shell can later execute. Sounds simple until you start thinking about quoted strings, escape characters, chained pipes, and all the edge cases that make shell parsing one of those problems that's deceptively harder than it looks.
Why a shell? Because it touches almost everything we've covered in this series. String parsing uses slices and memory management from episodes 5 and 7. The command representation uses structs and tagged unions from episode 6. Error handling is everywhere (episode 4). And in the upcoming episodes we'll use process spawning, file descriptors, and signals -- all things that a systems language like Zig is made for.
Here we go!
The data model: what does a parsed command look like?
Before we write a single line of parsing code, let's think about the output. What data structure do we need to represent a parsed command line?
Take this command: grep -i "hello" input.txt > matches.txt 2>&1 | wc -l
That's actually two commands piped together. The first command (grep) has:
- A program name: grep
- Arguments: -i, hello, input.txt
- Output redirection: stdout goes to matches.txt
- Error redirection: stderr goes to stdout (the 2>&1 part)
The second command (wc) has:
- A program name: wc
- Arguments: -l
- Its stdin comes from the pipe (the first command's stdout)
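Let's define the types. We'll start with redirections, since a single command can have multiple redirections. To keep this first version small, the types cover only >, >>, and <; the 2>&1 form from the example above is not modeled yet: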
const std = @import("std");
const RedirectKind = enum {
stdout_overwrite, // >
stdout_append, // >>
stdin_file, // <
};
const Redirect = struct {
kind: RedirectKind,
target: []const u8, // filename
};
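If you want to see how the 2>&1 part of the example could be represented later, here is a minimal sketch using a tagged union (episode 6). Everything in it is hypothetical: RedirectTarget, FdRedirect, and describe are names I made up for illustration, and the rest of this episode keeps using the simpler Redirect struct above.

```zig
const std = @import("std");

// Hypothetical extension: none of these names exist in the episode's code.
const RedirectTarget = union(enum) {
    file: []const u8, // "> matches.txt" writes to a path
    fd: u8, // "2>&1" points at an already-open descriptor
};

const FdRedirect = struct {
    source_fd: u8, // 0 = stdin, 1 = stdout, 2 = stderr
    target: RedirectTarget,
};

// Formats a redirect into `buf` for display; fails if `buf` is too small.
fn describe(r: FdRedirect, buf: []u8) ![]const u8 {
    switch (r.target) {
        .file => |path| return try std.fmt.bufPrint(buf, "fd {d} -> file {s}", .{ r.source_fd, path }),
        .fd => |fd| return try std.fmt.bufPrint(buf, "fd {d} -> fd {d}", .{ r.source_fd, fd }),
    }
}

pub fn main() !void {
    var buf: [64]u8 = undefined;
    const to_file = FdRedirect{ .source_fd = 1, .target = .{ .file = "matches.txt" } };
    const to_stdout = FdRedirect{ .source_fd = 2, .target = .{ .fd = 1 } };
    std.debug.print("{s}\n", .{try describe(to_file, &buf)});
    std.debug.print("{s}\n", .{try describe(to_stdout, &buf)});
}
```

The switch on r.target is exhaustive, so adding a third kind of target later would produce a compile error in every place that needs updating.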
Now the command itself. Each command in a pipeline has a program name, a list of arguments, and zero or more redirections:
const Command = struct {
program: []const u8,
args: []const []const u8,
redirects: []const Redirect,
pub fn deinit(self: *const Command, allocator: std.mem.Allocator) void {
for (self.args) |arg| {
allocator.free(arg);
}
allocator.free(self.args);
for (self.redirects) |redir| {
allocator.free(redir.target);
}
allocator.free(self.redirects);
allocator.free(self.program);
}
};
And a pipeline is a sequence of commands connected by pipes:
const Pipeline = struct {
commands: []Command,
pub fn deinit(self: *const Pipeline, allocator: std.mem.Allocator) void {
for (self.commands) |*cmd| {
cmd.deinit(allocator);
}
allocator.free(self.commands);
}
};
Notice how Command.deinit frees every individual piece of owned memory. The program field is an owned slice (we'll allocate and copy it during parsing). Same for each arg and each redirect.target. This ownership model is the same pattern we used in the kv-store project (episode 40) -- the struct owns its data and is responsible for freeing it.
Having said that, there's a design tradeoff here. We could avoid individual allocations by using a single arena allocator for the entire parse result. Then deinit would be a single arena.deinit() call instead of walking through every field. We'll keep the per-field approach for now because it's more explicit about ownership, but for a production shell the arena approach would be cleaner. We covered arenas via the std.heap.ArenaAllocator back in episode 7.
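To make the arena tradeoff concrete, here is a small sketch of what the alternative could look like. It is not the code we use in this episode: SimpleCommand and buildCommand are hypothetical names, and the struct is cut down to program plus args to keep the example short.

```zig
const std = @import("std");

// Cut-down, hypothetical version of Command (no redirects).
const SimpleCommand = struct {
    program: []const u8,
    args: []const []const u8,
};

// Copies every word into `arena_alloc`. Nothing is freed individually:
// the caller drops the whole arena when it is done with the result.
fn buildCommand(arena_alloc: std.mem.Allocator, words: []const []const u8) !SimpleCommand {
    std.debug.assert(words.len > 0);
    const args = try arena_alloc.alloc([]const u8, words.len - 1);
    for (words[1..], 0..) |w, i| {
        args[i] = try arena_alloc.dupe(u8, w);
    }
    const program = try arena_alloc.dupe(u8, words[0]);
    return .{ .program = program, .args = args };
}

pub fn main() !void {
    var arena = std.heap.ArenaAllocator.init(std.heap.page_allocator);
    defer arena.deinit(); // frees program, args, and every arg string at once

    const cmd = try buildCommand(arena.allocator(), &.{ "grep", "-i", "hello" });
    std.debug.print("{s} with {d} args\n", .{ cmd.program, cmd.args.len });
}
```

Note that buildCommand needs no errdefer at all: if an allocation fails halfway, the partial copies still belong to the arena and disappear with it.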
The tokenizer: breaking input into tokens
Parsing happens in two stages. First we tokenize: split the raw input string into meaningful chunks (tokens). Then we parse: turn the token stream into our Pipeline structure. Splitting the work like this keeps each stage simple. We used the exact same two-stage approach in the markdown parser project (episode 37).
A shell tokenizer needs to handle:
- Unquoted words: ls, -la, file.txt
- Double-quoted strings: "hello world" (spaces inside don't split)
- Single-quoted strings: 'hello world' (everything is literal)
- Special characters: |, >, >>, <
- Escape sequences: \" inside double quotes, \ followed by a space to escape a space
- Whitespace between tokens (any amount, gets consumed)
Let's define the token types:
const TokenKind = enum {
word, // regular word or quoted string (quotes stripped)
pipe, // |
redirect_out, // >
redirect_append, // >>
redirect_in, // <
};
const Token = struct {
kind: TokenKind,
value: []const u8, // owned string for words, empty for operators
};
Now the tokenizer function. This is the meaty part. We iterate through the input character by character, building up tokens as we go:
const TokenizeError = error{
UnterminatedQuote,
UnexpectedCharacter,
TrailingEscape,
OutOfMemory,
};
fn tokenize(allocator: std.mem.Allocator, input: []const u8) TokenizeError![]Token {
var tokens = std.ArrayList(Token).init(allocator);
errdefer {
for (tokens.items) |tok| {
if (tok.kind == .word) allocator.free(tok.value);
}
tokens.deinit();
}
var i: usize = 0;
while (i < input.len) {
const ch = input[i];
// Skip whitespace
if (ch == ' ' or ch == '\t') {
i += 1;
continue;
}
// Comments: # until end of line
if (ch == '#') break;
// Pipe
if (ch == '|') {
try tokens.append(.{ .kind = .pipe, .value = "" });
i += 1;
continue;
}
// Redirections
if (ch == '>') {
if (i + 1 < input.len and input[i + 1] == '>') {
try tokens.append(.{ .kind = .redirect_append, .value = "" });
i += 2;
} else {
try tokens.append(.{ .kind = .redirect_out, .value = "" });
i += 1;
}
continue;
}
if (ch == '<') {
try tokens.append(.{ .kind = .redirect_in, .value = "" });
i += 1;
continue;
}
// Quoted strings or regular words
var buf = std.ArrayList(u8).init(allocator);
errdefer buf.deinit();
if (ch == '"') {
// Double-quoted string: backslash escapes work
i += 1; // skip opening quote
while (i < input.len) {
if (input[i] == '\\' and i + 1 < input.len) {
// Escape sequence
const next = input[i + 1];
switch (next) {
'"', '\\', '$', '`' => {
try buf.append(next);
i += 2;
},
'n' => {
try buf.append('\n');
i += 2;
},
't' => {
try buf.append('\t');
i += 2;
},
else => {
// Backslash is literal if next char isn't special
try buf.append('\\');
try buf.append(next);
i += 2;
},
}
} else if (input[i] == '"') {
i += 1; // skip closing quote
break;
} else {
try buf.append(input[i]);
i += 1;
}
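} else {
// Reached end of input without closing quote.
// The errdefer above already frees buf on this error path;
// calling buf.deinit() here as well would free it twice.
return error.UnterminatedQuote;
}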
} else if (ch == '\'') {
// Single-quoted string: everything is literal, no escape processing
i += 1; // skip opening quote
while (i < input.len) {
if (input[i] == '\'') {
i += 1;
break;
}
try buf.append(input[i]);
i += 1;
} else {
// errdefer frees buf on this path; a manual deinit would be a double free
return error.UnterminatedQuote;
}
} else {
// Unquoted word: ends at whitespace or special character
while (i < input.len) {
const c = input[i];
if (c == ' ' or c == '\t' or c == '|' or c == '>' or c == '<' or c == '#') {
break;
}
if (c == '\\' and i + 1 < input.len) {
// Escaped character in unquoted context
try buf.append(input[i + 1]);
i += 2;
continue;
}
if (c == '\\' and i + 1 >= input.len) {
// errdefer frees buf on this path; a manual deinit would be a double free
return error.TrailingEscape;
}
try buf.append(c);
i += 1;
}
}
const word = try buf.toOwnedSlice();
errdefer allocator.free(word); // the token list does not own the word until append succeeds
try tokens.append(.{ .kind = .word, .value = word });
}
return try tokens.toOwnedSlice();
}
That's a solid chunk of code, so let me walk through the key decisions.
The errdefer at the top is critical. If tokenization fails halfway through -- say we encounter an unterminated quote after already collecting 5 tokens -- we need to free the memory we've already allocated. The errdefer iterates through all tokens collected so far and frees any word values. Without this, a parse error would leak every token we'd already created. This is the same errdefer cleanup pattern we hammered on in the WAL implementation in episode 41.
Double quotes vs single quotes: in real shells (bash, zsh), double quotes allow variable expansion ($HOME) and command substitution (backticks). Single quotes treat everything as literal -- no expansion, no escapes, nothing. We follow the same convention. Our double-quote handler processes escape sequences (\", \\, \n, \t), while the single-quote handler just copies bytes verbatim until it finds the closing quote. We're not implementing variable expansion yet (that's a whole separate concern), but the distinction matters for escaping.
The while/else pattern: Zig's while loop supports an else clause that runs when the loop condition becomes false without a break. We use this to detect unterminated quotes -- if we exhaust the input without finding a closing quote, the else branch fires and returns the error. This is much cleaner than a separate "did we find it?" boolean check after the loop.
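Here is the while/else idea in isolation, as a minimal sketch. The findClosing helper is hypothetical (the tokenizer above does this inline), but the control flow is the same: break skips the else branch, running out of input triggers it.

```zig
const std = @import("std");

// Hypothetical helper: returns the index of the next `quote` byte at or
// after `start`, or an error if the input ends first.
fn findClosing(input: []const u8, start: usize, quote: u8) error{UnterminatedQuote}!usize {
    var i = start;
    while (i < input.len) : (i += 1) {
        if (input[i] == quote) break; // found it: the else branch is skipped
    } else {
        // The loop condition became false without a break: no closing quote.
        return error.UnterminatedQuote;
    }
    return i;
}

pub fn main() void {
    const ok = findClosing("\"hello\" rest", 1, '"') catch unreachable;
    std.debug.print("closing quote at index {d}\n", .{ok});

    if (findClosing("\"hello", 1, '"')) |_| {
        std.debug.print("unexpected success\n", .{});
    } else |err| {
        std.debug.print("error: {s}\n", .{@errorName(err)});
    }
}
```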
Comment handling: the # character starts a comment (everything until end of line is ignored). We simply break out of the main loop when we see it. Real shells only treat # as a comment when it's at the start of a word -- echo hello#world prints hello#world in bash. Our version is simpler. Good enough for now.
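If you want to get closer to the bash behavior, the check could look something like this. It's a rough approximation under a simplifying assumption: a # counts as the start of a word only at the beginning of the input or right after a space or tab. The startsComment name is made up, and a real shell has more cases (a # directly after an operator, for example).

```zig
const std = @import("std");

// Hypothetical helper: true if the '#' at index `i` should start a comment.
// Approximation: only the start of input or a preceding blank counts as
// "start of a word".
fn startsComment(input: []const u8, i: usize) bool {
    if (input[i] != '#') return false;
    if (i == 0) return true;
    const prev = input[i - 1];
    return prev == ' ' or prev == '\t';
}

pub fn main() void {
    std.debug.print("{}\n", .{startsComment("ls -la # list files", 7)});
    std.debug.print("{}\n", .{startsComment("echo hello#world", 10)});
}
```

To use it, the tokenizer would call this check in the main loop in place of the plain ch == '#' test, and the unquoted-word loop would stop treating # as a word boundary.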
The parser: tokens to Pipeline
With tokens in hand, we can now parse them into a Pipeline. The grammar is straightforward:
pipeline = command ("|" command)*
command = word+ (redirection)*
redirection = (">" | ">>" | "<") word
A pipeline is one or more commands separated by pipes. Each command is one or more words (the first being the program name, the rest being arguments), optionally followed by redirections.
const ParseError = error{
EmptyCommand,
EmptyPipeline,
MissingRedirectTarget,
TrailingPipe,
OutOfMemory,
};
fn parsePipeline(allocator: std.mem.Allocator, tokens: []const Token) ParseError!Pipeline {
var commands = std.ArrayList(Command).init(allocator);
errdefer {
for (commands.items) |*cmd| cmd.deinit(allocator);
commands.deinit();
}
var args = std.ArrayList([]const u8).init(allocator);
errdefer {
for (args.items) |a| allocator.free(a);
args.deinit();
}
var redirects = std.ArrayList(Redirect).init(allocator);
errdefer {
for (redirects.items) |r| allocator.free(r.target);
redirects.deinit();
}
var i: usize = 0;
while (i < tokens.len) {
const tok = tokens[i];
switch (tok.kind) {
.word => {
const duped = try allocator.dupe(u8, tok.value);
try args.append(duped);
i += 1;
},
.redirect_out, .redirect_append, .redirect_in => {
// Next token must be a word (the filename)
if (i + 1 >= tokens.len or tokens[i + 1].kind != .word) {
return error.MissingRedirectTarget;
}
const target = try allocator.dupe(u8, tokens[i + 1].value);
const kind: RedirectKind = switch (tok.kind) {
.redirect_out => .stdout_overwrite,
.redirect_append => .stdout_append,
.redirect_in => .stdin_file,
else => unreachable,
};
try redirects.append(.{ .kind = kind, .target = target });
i += 2;
},
.pipe => {
// Flush current command
if (args.items.len == 0) return error.EmptyCommand;
const owned_args = try args.toOwnedSlice();
const program = owned_args[0];
const cmd_args = owned_args[1..];
// We need to re-allocate args and redirects slices
// because the Command takes ownership
const final_args = try allocator.dupe([]const u8, cmd_args);
allocator.free(owned_args);
try commands.append(.{
.program = program,
.args = final_args,
.redirects = try redirects.toOwnedSlice(),
});
args = std.ArrayList([]const u8).init(allocator);
redirects = std.ArrayList(Redirect).init(allocator);
i += 1;
},
}
}
// Flush the last command (there's always one after the last pipe, or the only command)
if (args.items.len == 0) {
if (commands.items.len > 0) return error.TrailingPipe;
return error.EmptyPipeline;
}
const owned_args = try args.toOwnedSlice();
const program = owned_args[0];
const cmd_args = owned_args[1..];
const final_args = try allocator.dupe([]const u8, cmd_args);
allocator.free(owned_args);
try commands.append(.{
.program = program,
.args = final_args,
.redirects = try redirects.toOwnedSlice(),
});
return Pipeline{ .commands = try commands.toOwnedSlice() };
}
The trickiest part here is the ownership dance with the arguments slice. When we call args.toOwnedSlice(), we get back a [][]const u8 where owned_args[0] is the program name and owned_args[1..] are the command arguments. But we want Command.program and Command.args to be separate fields. So we dupe the args sub-slice (which creates a new allocation for just the pointers -- the actual string data is already owned by each individual []const u8), then free the original combined slice.
This is a bit fiddly, I know. An alternative would be to store everything in a single args field on the Command and just treat args[0] as the program name. That's simpler but less expressive. I went with separate fields because when we execute the command later, we'll pass program to std.process.Child and args to the argument list, and having them pre-split makes that cleaner.
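For comparison, this is roughly what the single-field alternative could look like. ArgvCommand is a hypothetical name and not what we use in this series; the point is that program and args become two views into one slice, so there is only one allocation to hand over and no dupe-then-free step.

```zig
const std = @import("std");

// Hypothetical alternative layout: one slice, two views into it.
const ArgvCommand = struct {
    argv: []const []const u8, // argv[0] is the program name; must not be empty

    fn program(self: ArgvCommand) []const u8 {
        return self.argv[0];
    }

    fn args(self: ArgvCommand) []const []const u8 {
        return self.argv[1..];
    }
};

pub fn main() void {
    const cmd = ArgvCommand{ .argv = &.{ "grep", "-i", "hello" } };
    std.debug.print("program={s}, {d} args\n", .{ cmd.program(), cmd.args().len });
}
```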
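The errdefer blocks are layered here. If we error out after building some commands but before finishing, we need to free the already-built commands AND the in-progress args and redirects. Getting this cleanup right is non-trivial -- this is one of those situations where Zig's explicit memory management forces you to think carefully about every ownership boundary. And that's a good thing, even if it's more work up front ;-) To be honest about the limits: a few out-of-memory paths are still uncovered. If args.append fails right after allocator.dupe succeeds, or commands.append fails after toOwnedSlice has emptied the args list, the strings in flight belong to nobody and leak. An errdefer allocator.free(...) directly after each dupe would close most of those gaps.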
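One way to check this kind of cleanup code is std.testing.checkAllAllocationFailures, which reruns a function while making each allocation fail in turn and reports any leak. The sketch below applies it to a deliberately tiny stand-in, not to parsePipeline itself: Pair, dupePair, and roundTrip are hypothetical names invented for the example.

```zig
const std = @import("std");

const Pair = struct {
    program: []const u8,
    target: []const u8,
};

// Two allocations in a row: the errdefer covers the window in which the
// first copy exists but the Pair that will own it has not been built yet.
fn dupePair(allocator: std.mem.Allocator, program: []const u8, target: []const u8) !Pair {
    const p = try allocator.dupe(u8, program);
    errdefer allocator.free(p); // runs only if a later step in this function fails
    const t = try allocator.dupe(u8, target);
    return .{ .program = p, .target = t };
}

// The function under test: the allocator comes first, the result is !void.
fn roundTrip(allocator: std.mem.Allocator) !void {
    const pair = try dupePair(allocator, "sort", "out.txt");
    defer {
        allocator.free(pair.program);
        allocator.free(pair.target);
    }
    try std.testing.expectEqualStrings("sort", pair.program);
}

test "dupePair frees everything on every allocation failure" {
    try std.testing.checkAllAllocationFailures(std.testing.allocator, roundTrip, .{});
}
```

Delete the errdefer line and the test should fail with a leak report for the case where the second dupe runs out of memory. Run it with zig test on the file.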
Handling escape characters and special characters
The tokenizer above handles basic escapes, but let me go a bit deeper on the design. In a POSIX-compliant shell, escape behavior depends on context:
Outside quotes: a backslash escapes any character. hello\ world is one token ("hello world"). echo \> treats the > as a literal character instead of a redirection operator.
Inside double quotes: only specific characters are escapable: ", \, $, and backtick. A backslash followed by anything else is literal -- "\z" becomes \z, not z.
Inside single quotes: nothing is escapable. There is literally no way to include a single quote inside a single-quoted string. If you need one, you have to end the single-quoted string, add an escaped single quote, and start a new single-quoted string: 'can'\''t' which produces can't. This is such a weird edge case that most people don't know about it.
Our tokenizer handles the first two, with one deviation: inside double quotes it also turns \n and \t into a newline and a tab, whereas a POSIX shell leaves those as a literal backslash plus letter. The single-quote-escape trick is something we'll skip for now because it requires the tokenizer to concatenate adjacent string segments, which adds quite some complexity. In real shells this is called "word joining" -- adjacent quoted and unquoted segments without whitespace between them form a single word. So hello" world" is one word: hello world. Getting this right requires restructuring the tokenizer to accumulate bytes across multiple segments before emitting a single word token.
For the curious, here's what the word-joining extension would look like:
// Word joining: adjacent segments without whitespace form one word
// Example: hello"world"'!' -> "helloworld!"
fn tokenizeWord(allocator: std.mem.Allocator, input: []const u8, pos: *usize) ![]const u8 {
var buf = std.ArrayList(u8).init(allocator);
errdefer buf.deinit();
while (pos.* < input.len) {
const ch = input[pos.*];
if (ch == '"') {
// Consume double-quoted segment
pos.* += 1;
while (pos.* < input.len and input[pos.*] != '"') {
if (input[pos.*] == '\\' and pos.* + 1 < input.len) {
const next = input[pos.* + 1];
if (next == '"' or next == '\\') {
try buf.append(next);
pos.* += 2;
} else {
try buf.append('\\');
try buf.append(next);
pos.* += 2;
}
} else {
try buf.append(input[pos.*]);
pos.* += 1;
}
}
if (pos.* < input.len) pos.* += 1; // skip closing quote
} else if (ch == '\'') {
// Consume single-quoted segment (all literal)
pos.* += 1;
while (pos.* < input.len and input[pos.*] != '\'') {
try buf.append(input[pos.*]);
pos.* += 1;
}
if (pos.* < input.len) pos.* += 1;
} else if (ch == ' ' or ch == '\t' or ch == '|' or ch == '>' or ch == '<') {
// End of word
break;
} else if (ch == '\\' and pos.* + 1 < input.len) {
try buf.append(input[pos.* + 1]);
pos.* += 2;
} else {
try buf.append(ch);
pos.* += 1;
}
}
return try buf.toOwnedSlice();
}
The key idea is that instead of treating ", ', and unquoted as separate token-producing paths, we loop over segments and keep appending to the same buffer. The word only ends when we hit unquoted whitespace or a special character. This unifies the three modes into a single word-building loop.
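Coming back to the double-quote rule from the start of this section (only ", \, $ and backtick are escapable, everything else keeps its backslash): here is that rule on its own, as a minimal sketch. The unescapeDoubleQuoted helper is hypothetical and works on the text between the quotes; it writes into a caller-supplied buffer so it needs no allocator, and it leaves out the \n and \t extras our tokenizer adds.

```zig
const std = @import("std");

// Hypothetical helper: applies the POSIX-style double-quote escape rule to
// `body` (the text between the quotes) and returns the filled part of `out`.
// `out` must be at least as long as `body`.
fn unescapeDoubleQuoted(body: []const u8, out: []u8) []const u8 {
    std.debug.assert(out.len >= body.len);
    var n: usize = 0;
    var i: usize = 0;
    while (i < body.len) : (i += 1) {
        if (body[i] == '\\' and i + 1 < body.len) {
            switch (body[i + 1]) {
                // Escapable: skip the backslash, the next byte is copied below.
                '"', '\\', '$', '`' => i += 1,
                // Anything else: the backslash itself is copied as a literal.
                else => {},
            }
        }
        out[n] = body[i];
        n += 1;
    }
    return out[0..n];
}

pub fn main() void {
    var buf: [64]u8 = undefined;
    std.debug.print("{s}\n", .{unescapeDoubleQuoted("it's a \\\"test\\\"", &buf)});
    std.debug.print("{s}\n", .{unescapeDoubleQuoted("\\z stays", &buf)});
}
```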
A simple REPL: read, parse, display
Now let's wire everything together into a basic REPL (Read-Eval-Print Loop). For this episode, we won't actually execute commands -- that's for next time. Instead we'll parse the input and display the structured result so we can verify the parser works correctly.
fn printPipeline(pipeline: *const Pipeline) void {
const stdout = std.io.getStdOut().writer();
for (pipeline.commands, 0..) |cmd, cmd_idx| {
if (cmd_idx > 0) {
stdout.print(" |\n", .{}) catch {};
}
stdout.print("Command {d}:\n", .{cmd_idx}) catch {};
stdout.print(" program: \"{s}\"\n", .{cmd.program}) catch {};
if (cmd.args.len > 0) {
stdout.print(" args:", .{}) catch {};
for (cmd.args) |arg| {
stdout.print(" \"{s}\"", .{arg}) catch {};
}
stdout.print("\n", .{}) catch {};
}
for (cmd.redirects) |redir| {
const sym: []const u8 = switch (redir.kind) {
.stdout_overwrite => ">",
.stdout_append => ">>",
.stdin_file => "<",
};
stdout.print(" redirect: {s} \"{s}\"\n", .{ sym, redir.target }) catch {};
}
}
}
pub fn main() !void {
var gpa = std.heap.GeneralPurposeAllocator(.{}){};
defer {
const check = gpa.deinit();
if (check == .leak) std.debug.print("WARNING: memory leak detected\n", .{});
}
const allocator = gpa.allocator();
const stdin = std.io.getStdIn().reader();
const stdout = std.io.getStdOut().writer();
try stdout.print("zsh-lite> ", .{});
var line_buf: [4096]u8 = undefined;
while (stdin.readUntilDelimiter(&line_buf, '\n')) |line| {
if (line.len == 0) {
try stdout.print("zsh-lite> ", .{});
continue;
}
// Check for exit
if (std.mem.eql(u8, line, "exit") or std.mem.eql(u8, line, "quit")) {
try stdout.print("Bye!\n", .{});
break;
}
const tokens = tokenize(allocator, line) catch |err| {
switch (err) {
error.UnterminatedQuote => try stdout.print("Error: unterminated quote\n", .{}),
error.TrailingEscape => try stdout.print("Error: trailing backslash\n", .{}),
else => try stdout.print("Tokenize error: {}\n", .{err}),
}
try stdout.print("zsh-lite> ", .{});
continue;
};
defer {
for (tokens) |tok| {
if (tok.kind == .word) allocator.free(tok.value);
}
allocator.free(tokens);
}
if (tokens.len == 0) {
try stdout.print("zsh-lite> ", .{});
continue;
}
const pipeline = parsePipeline(allocator, tokens) catch |err| {
switch (err) {
error.EmptyCommand => try stdout.print("Error: empty command (double pipe?)\n", .{}),
error.EmptyPipeline => try stdout.print("Error: empty pipeline\n", .{}),
error.MissingRedirectTarget => try stdout.print("Error: redirect without target\n", .{}),
error.TrailingPipe => try stdout.print("Error: trailing pipe\n", .{}),
else => try stdout.print("Parse error: {}\n", .{err}),
}
try stdout.print("zsh-lite> ", .{});
continue;
};
defer pipeline.deinit(allocator);
try stdout.print("\n", .{});
printPipeline(&pipeline);
try stdout.print("\nzsh-lite> ", .{});
} else |err| {
if (err != error.EndOfStream) {
std.debug.print("Read error: {}\n", .{err});
}
}
}
The REPL uses readUntilDelimiter with a fixed 4096-byte buffer for line input. That's plenty for interactive use. If somebody pastes in a 5000-character command, readUntilDelimiter returns error.StreamTooLong, which lands in the outer else branch: the REPL prints "Read error" and exits. A production shell would use dynamic allocation for the line buffer (or a line-editing library like readline), but for our purposes the fixed buffer keeps things simple.
The while/else on readUntilDelimiter is a close cousin of the pattern in the tokenizer. Here the loop condition is an error union, so the loop keeps running while reads succeed and the else |err| clause receives the first error. EndOfStream (Ctrl+D) ends the loop silently; anything else is reported. The inner error handling catches parse errors and prints friendly messages instead of crashing.
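One practical wrinkle, since Windows is on the requirements list: when input lines end in \r\n, splitting on '\n' leaves a trailing '\r' on every line, so "exit\r" would not match "exit". A minimal sketch of a fix, using two hypothetical helpers (cleanLine and isExit are not part of the listing above):

```zig
const std = @import("std");

// Hypothetical helper: strips carriage returns left over from \r\n input.
fn cleanLine(line: []const u8) []const u8 {
    return std.mem.trim(u8, line, "\r");
}

// Hypothetical helper: the REPL's exit check, applied to the cleaned line.
fn isExit(line: []const u8) bool {
    const cleaned = cleanLine(line);
    return std.mem.eql(u8, cleaned, "exit") or std.mem.eql(u8, cleaned, "quit");
}

pub fn main() void {
    std.debug.print("{}\n", .{isExit("exit\r")});
    std.debug.print("{}\n", .{isExit("exit now")});
}
```

In the REPL you would call cleanLine once, right after reading, and pass the result to both the exit check and tokenize.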
Let's see what a session looks like:
zsh-lite> ls -la /tmp
Command 0:
program: "ls"
args: "-la" "/tmp"
zsh-lite> cat file.txt | grep "hello world" | wc -l
Command 0:
program: "cat"
args: "file.txt"
|
Command 1:
program: "grep"
args: "hello world"
|
Command 2:
program: "wc"
args: "-l"
zsh-lite> echo "test" > output.txt
Command 0:
program: "echo"
args: "test"
redirect: > "output.txt"
zsh-lite> sort < input.txt >> results.txt
Command 0:
program: "sort"
redirect: < "input.txt"
redirect: >> "results.txt"
zsh-lite> exit
Bye!
Looks right. The parser correctly splits pipes, extracts redirections, handles quoted strings (stripping the quotes and keeping the content), and produces a clean structured output.
Testing the parser
One of the things I always stress in this series -- and we dedicated an entire episode to it (episode 12) -- is that parsers are perfect candidates for unit testing. The input is a string, the output is a structure. Pure transformation, no side effects.
test "simple command" {
const allocator = std.testing.allocator;
const tokens = try tokenize(allocator, "ls -la /tmp");
defer {
for (tokens) |t| if (t.kind == .word) allocator.free(t.value);
allocator.free(tokens);
}
try std.testing.expectEqual(@as(usize, 3), tokens.len);
try std.testing.expectEqualStrings("ls", tokens[0].value);
try std.testing.expectEqualStrings("-la", tokens[1].value);
try std.testing.expectEqualStrings("/tmp", tokens[2].value);
}
test "quoted string preserves spaces" {
const allocator = std.testing.allocator;
const tokens = try tokenize(allocator, "echo \"hello world\"");
defer {
for (tokens) |t| if (t.kind == .word) allocator.free(t.value);
allocator.free(tokens);
}
try std.testing.expectEqual(@as(usize, 2), tokens.len);
try std.testing.expectEqualStrings("echo", tokens[0].value);
try std.testing.expectEqualStrings("hello world", tokens[1].value);
}
test "pipe produces pipe tokens" {
const allocator = std.testing.allocator;
const tokens = try tokenize(allocator, "ls | grep foo");
defer {
for (tokens) |t| if (t.kind == .word) allocator.free(t.value);
allocator.free(tokens);
}
try std.testing.expectEqual(@as(usize, 4), tokens.len);
try std.testing.expect(tokens[0].kind == .word);
try std.testing.expect(tokens[1].kind == .pipe);
try std.testing.expect(tokens[2].kind == .word);
try std.testing.expect(tokens[3].kind == .word);
}
test "redirections" {
const allocator = std.testing.allocator;
const tokens = try tokenize(allocator, "sort < input.txt >> output.txt");
defer {
for (tokens) |t| if (t.kind == .word) allocator.free(t.value);
allocator.free(tokens);
}
try std.testing.expectEqual(@as(usize, 5), tokens.len);
try std.testing.expect(tokens[0].kind == .word);
try std.testing.expect(tokens[1].kind == .redirect_in);
try std.testing.expect(tokens[2].kind == .word);
try std.testing.expect(tokens[3].kind == .redirect_append);
try std.testing.expect(tokens[4].kind == .word);
}
test "escape sequence in double quotes" {
const allocator = std.testing.allocator;
const tokens = try tokenize(allocator, "echo \"hello\\tworld\\n\"");
defer {
for (tokens) |t| if (t.kind == .word) allocator.free(t.value);
allocator.free(tokens);
}
try std.testing.expectEqual(@as(usize, 2), tokens.len);
try std.testing.expectEqualStrings("hello\tworld\n", tokens[1].value);
}
test "single quotes are fully literal" {
const allocator = std.testing.allocator;
const tokens = try tokenize(allocator, "echo '\\n is not a newline'");
defer {
for (tokens) |t| if (t.kind == .word) allocator.free(t.value);
allocator.free(tokens);
}
try std.testing.expectEqual(@as(usize, 2), tokens.len);
try std.testing.expectEqualStrings("\\n is not a newline", tokens[1].value);
}
test "unterminated quote returns error" {
const allocator = std.testing.allocator;
const result = tokenize(allocator, "echo \"hello");
try std.testing.expectError(error.UnterminatedQuote, result);
}
test "empty input produces no tokens" {
const allocator = std.testing.allocator;
const tokens = try tokenize(allocator, " ");
defer allocator.free(tokens);
try std.testing.expectEqual(@as(usize, 0), tokens.len);
}
test "pipeline parse: two commands" {
const allocator = std.testing.allocator;
const tokens = try tokenize(allocator, "cat file.txt | wc -l");
defer {
for (tokens) |t| if (t.kind == .word) allocator.free(t.value);
allocator.free(tokens);
}
const pipeline = try parsePipeline(allocator, tokens);
defer pipeline.deinit(allocator);
try std.testing.expectEqual(@as(usize, 2), pipeline.commands.len);
try std.testing.expectEqualStrings("cat", pipeline.commands[0].program);
try std.testing.expectEqual(@as(usize, 1), pipeline.commands[0].args.len);
try std.testing.expectEqualStrings("file.txt", pipeline.commands[0].args[0]);
try std.testing.expectEqualStrings("wc", pipeline.commands[1].program);
try std.testing.expectEqual(@as(usize, 1), pipeline.commands[1].args.len);
try std.testing.expectEqualStrings("-l", pipeline.commands[1].args[0]);
}
test "trailing pipe returns error" {
const allocator = std.testing.allocator;
const tokens = try tokenize(allocator, "ls |");
defer {
for (tokens) |t| if (t.kind == .word) allocator.free(t.value);
allocator.free(tokens);
}
const result = parsePipeline(allocator, tokens);
try std.testing.expectError(error.TrailingPipe, result);
}
That's ten tests covering the core functionality. The std.testing.allocator is especially useful here -- it's a wrapper around the GPA that fails the test if any memory leaks are detected. So these tests don't just verify correctness, they also verify that our cleanup code works. Every errdefer and deinit gets exercised.
You'd run these with zig build test (assuming your build.zig is set up -- same pattern as the image tool from episode 46).
Edge cases: the stuff that breaks parsers
Let me walk through some of the nasty inputs that shell parsers need to handle:
Multiple spaces: ls      -la should produce the same result as ls -la. Our tokenizer handles this because the whitespace check at the top of the main loop skips every space and tab before looking for the next token.
Empty input or whitespace-only input: "" or "    " should produce zero tokens and no error. Our tokenizer does this -- it just never enters any token-building branch and returns an empty slice.
Trailing pipe: ls | is an error. The user started a pipeline but didn't provide the second command. Our parser catches this because after processing the pipe token, the args list is empty when we try to flush the last command.
Double pipe: ls || grep foo -- this is tricky. In bash, || is the logical OR operator ("run the second command if the first fails"). In our simple parser, it's two pipe tokens with an empty command between them, which our parser rejects as EmptyCommand. If we wanted to support ||, we'd need to add it as a separate token type in the tokenizer.
Redirect without target: ls > is an error. The redirection operator needs a filename after it. Our parser checks i + 1 >= tokens.len and returns MissingRedirectTarget.
Comments: ls -la # this is a comment should produce tokens for just ls -la. The # and everything after it is ignored.
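Mixed quoting styles in one word: As discussed in the escape section, hello"world"'!' is three segments that form one word. Our basic tokenizer doesn't handle this. A word that starts unquoted keeps any later quote characters as literal bytes (the unquoted branch does not stop at " or '), so this input comes out as a single token with the quotes still in it. A word that starts with a quote ends at the closing quote, so "hello"world becomes two tokens. The word-joining extension shown earlier would handle both cases correctly.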
Backslash at end of input: echo hello\ is an error (trailing escape with nothing to escape). Our tokenizer returns TrailingEscape for this case.
One edge case that's particularly annoying in real shells is nested quotes. What does echo "it's a \"test\"" produce? In our parser: it's a "test". The backslash-escaped double quotes inside the double-quoted string are treated as literal quote characters. The single quote inside doesn't do anything special because we're already inside double quotes. This matches POSIX behavior.
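Going back to the double pipe case: recognizing || as its own token would use the same one-character lookahead we already use for >>. A minimal sketch, with hypothetical names (PipeOp, Scanned, and scanPipeOp are not in the tokenizer above):

```zig
const std = @import("std");

const PipeOp = enum { pipe, or_if }; // "|" and "||"
const Scanned = struct { op: PipeOp, len: usize };

// Hypothetical helper: classifies the operator starting at index `i`.
// Assumes input[i] is '|'. `len` says how many bytes to consume.
fn scanPipeOp(input: []const u8, i: usize) Scanned {
    std.debug.assert(input[i] == '|');
    if (i + 1 < input.len and input[i + 1] == '|') {
        return .{ .op = .or_if, .len = 2 };
    }
    return .{ .op = .pipe, .len = 1 };
}

pub fn main() void {
    const s = scanPipeOp("ls || grep foo", 3);
    std.debug.print("{s}, consume {d}\n", .{ @tagName(s.op), s.len });
}
```

The tokenizer would then advance i by the returned len, and the parser would need a rule for what an or_if token means, which is a topic for a later episode.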
Project setup
Here's the build.zig for the shell project:
const std = @import("std");
pub fn build(b: *std.Build) void {
const target = b.standardTargetOptions(.{});
const optimize = b.standardOptimizeOption(.{});
const exe = b.addExecutable(.{
.name = "zsh-lite",
.root_source_file = b.path("src/main.zig"),
.target = target,
.optimize = optimize,
});
b.installArtifact(exe);
const run_cmd = b.addRunArtifact(exe);
run_cmd.step.dependOn(b.getInstallStep());
const run_step = b.step("run", "Run zsh-lite");
run_step.dependOn(&run_cmd.step);
const tests = b.addTest(.{
.root_source_file = b.path("src/main.zig"),
.target = target,
.optimize = optimize,
});
const run_tests = b.addRunArtifact(tests);
const test_step = b.step("test", "Run parser tests");
test_step.dependOn(&run_tests.step);
}
And the project folder layout:
zsh-lite/
  src/
    main.zig    -- REPL, arg structures, tokenizer, parser, tests
  build.zig     -- Build configuration
For now everything lives in main.zig. As the project grows across the next episodes we'll split into modules -- tokenizer.zig, parser.zig, executor.zig, builtins.zig -- following the same modular pattern we used in Project B (the markdown tool, episodes 37-39) and Project C (the image tool, episodes 44-46).
Build it with:
zig build test # run the parser tests
zig build run # launch the REPL
What we learned
- Defining a clean data model (Command, Pipeline, Redirect) before writing any parsing code -- the structure drives the implementation, not the other way around
- Building a two-stage parser: tokenizer splits raw input into typed tokens, parser converts tokens into structured commands -- the same architecture we used for the markdown tool
- Handling three quoting modes: unquoted (backslash escapes anything), double-quoted (selective escapes for ", \, $), single-quoted (everything literal)
- Memory ownership in parse results: each Command owns its strings, deinit walks every field, errdefer chains ensure cleanup on parse failure
- Detecting and reporting common errors: unterminated quotes, trailing pipes, missing redirect targets, empty commands
- Using while/else loops to detect exhausted input (unterminated quote detection)
- Word joining as a concept: how real shells concatenate adjacent quoted and unquoted segments into a single token
- Writing focused unit tests for parsers: string in, structure out, testing allocator catches leaks
We've got the foundation. The parser can handle pipes, redirections, quoting, and escaping. What it can't do yet is actually run anything -- typing ls just shows you the parsed structure instead of listing files. Making that happen requires process spawning, file descriptor manipulation, and pipe plumbing. That's where the real systems programming starts ;-)
Thanks for reading!