Learn Zig Series (#41) - Key-Value Store: Write-Ahead Log
Project B: Key-Value Store (2/4)
What will I learn
- You will learn what a write-ahead log (WAL) is and why databases use it;
- You will learn designing a binary log format with operation type, key length, value length, and data;
- You will learn appending operations to the log file before applying them to the in-memory store;
- You will learn replaying the log on startup to rebuild the in-memory state;
- You will learn log compaction: rewriting the log without deleted keys;
- You will learn crash recovery: handling partial writes and corruption;
- You will learn fsync and durability guarantees;
- You will learn testing crash recovery by truncating the log at random points.
Requirements
- A working modern computer running macOS, Windows or Ubuntu;
- An installed Zig 0.14+ distribution (download from ziglang.org);
- The ambition to learn Zig programming.
Difficulty
- Intermediate
Curriculum (of the Learn Zig Series):
- Zig Programming Tutorial - ep001 - Intro
- Learn Zig Series (#2) - Hello Zig, Variables and Types
- Learn Zig Series (#3) - Functions and Control Flow
- Learn Zig Series (#4) - Error Handling (Zig's Best Feature)
- Learn Zig Series (#5) - Arrays, Slices, and Strings
- Learn Zig Series (#6) - Structs, Enums, and Tagged Unions
- Learn Zig Series (#7) - Memory Management and Allocators
- Learn Zig Series (#8) - Pointers and Memory Layout
- Learn Zig Series (#9) - Comptime (Zig's Superpower)
- Learn Zig Series (#10) - Project Structure, Modules, and File I/O
- Learn Zig Series (#11) - Mini Project: Building a Step Sequencer
- Learn Zig Series (#12) - Testing and Test-Driven Development
- Learn Zig Series (#13) - Interfaces via Type Erasure
- Learn Zig Series (#14) - Generics with Comptime Parameters
- Learn Zig Series (#15) - The Build System (build.zig)
- Learn Zig Series (#16) - Sentinel-Terminated Types and C Strings
- Learn Zig Series (#17) - Packed Structs and Bit Manipulation
- Learn Zig Series (#18) - Async Concepts and Event Loops
- Learn Zig Series (#18b) - Addendum: Async Returns in Zig 0.16
- Learn Zig Series (#19) - SIMD with @Vector
- Learn Zig Series (#20) - Working with JSON
- Learn Zig Series (#21) - Networking and TCP Sockets
- Learn Zig Series (#22) - Hash Maps and Data Structures
- Learn Zig Series (#23) - Iterators and Lazy Evaluation
- Learn Zig Series (#24) - Logging, Formatting, and Debug Output
- Learn Zig Series (#25) - Mini Project: HTTP Status Checker
- Learn Zig Series (#26) - Writing a Custom Allocator
- Learn Zig Series (#27) - C Interop: Calling C from Zig
- Learn Zig Series (#28) - C Interop: Exposing Zig to C
- Learn Zig Series (#29) - Inline Assembly and Low-Level Control
- Learn Zig Series (#30) - Thread Safety and Atomics
- Learn Zig Series (#31) - Memory-Mapped I/O and Files
- Learn Zig Series (#32) - Compile-Time Reflection with @typeInfo
- Learn Zig Series (#33) - Building a State Machine with Tagged Unions
- Learn Zig Series (#34) - Performance Profiling and Optimization
- Learn Zig Series (#35) - Cross-Compilation and Target Triples
- Learn Zig Series (#36) - Mini Project: CLI Task Runner
- Learn Zig Series (#37) - Markdown to HTML: Tokenizer and Lexer
- Learn Zig Series (#38) - Markdown to HTML: Parser and AST
- Learn Zig Series (#39) - Markdown to HTML: Renderer and CLI
- Learn Zig Series (#40) - Key-Value Store: In-Memory Store
- Learn Zig Series (#41) - Key-Value Store: Write-Ahead Log (this post)
Learn Zig Series (#41) - Key-Value Store: Write-Ahead Log
Last time we built a perfectly functional in-memory key-value store -- put, get, delete, TTL, snapshots, the works. One problem though: kill the process and all your data vanishes. Poof. Gone. That's... not great if you're actually storing things you care about ;-)
This is the exact problem that every database in history has had to solve: how do you make RAM-speed operations survive a crash? The answer, going back decades, is the write-ahead log (WAL). The idea is almost comically simple: before you do anything to the in-memory data, write what you're about to do to a file on disk first. If the process crashes, you replay the log and you're back to where you were. PostgreSQL does it. SQLite does it. MySQL's InnoDB does it. Redis has its AOF (append-only file) which is basically the same concept with a different name.
Today we're adding a WAL to our KV store. By the end of this episode, our store will survive crashes, handle corrupt data gracefully, and compact its log to keep disk usage from growing without bound.
The binary log format
Before we can write anything to disk we need to decide what the log entries look like. There are two common approaches: text-based (like Redis AOF in its default mode -- literally the commands as strings) or binary (fixed-size headers with raw bytes). We're going with binary because it's more compact, faster to parse, and gives us practice with the kind of low-level byte wrangling that Zig is built for.
Each log entry represents one operation: either a Put (store a key-value pair) or a Delete (remove a key). Here's the layout:
+--------+--------+--------+--------+------+---------+----------+
| magic | op | key_len| val_len| key | value | checksum |
| 2 bytes| 1 byte | 4 bytes| 4 bytes| var | var | 4 bytes |
+--------+--------+--------+--------+------+---------+----------+
- Magic bytes (
0xAA 0x55): A recognizable pattern at the start of every entry. If we're scanning through a corrupted file, this helps us find where entries begin. The pattern is chosen because it's unlikely to occur naturally in key/value data and it alternates bits nicely (10101010 01010101). - Op byte:
0x01for Put,0x02for Delete. - Key length: 4 bytes, little-endian u32.
- Value length: 4 bytes, little-endian u32. For Delete ops this is 0.
- Key data: Variable length raw bytes.
- Value data: Variable length raw bytes (empty for Delete).
- Checksum: CRC32 over everything from op through value data. This lets us detect partial writes and corruption.
Let's define this in code:
const std = @import("std");
pub const WalOp = enum(u8) {
put = 0x01,
delete = 0x02,
};
pub const WAL_MAGIC = [2]u8{ 0xAA, 0x55 };
pub const HEADER_SIZE = 2 + 1 + 4 + 4; // magic + op + key_len + val_len = 11 bytes
pub const CHECKSUM_SIZE = 4;
pub const WalEntry = struct {
op: WalOp,
key: []const u8,
value: []const u8, // empty slice for delete ops
pub fn totalSize(self: WalEntry) usize {
return HEADER_SIZE + self.key.len + self.value.len + CHECKSUM_SIZE;
}
};
Why CRC32 and not something stronger like SHA-256? Because we're detecting accidental corruption (power loss, partial writes, disk errors), not malicious tampering. CRC32 is fast -- a few nanoseconds per entry -- and catches over 99.99% of random bit errors. SHA-256 would be massive overkill here, and the performance cost would add up when replaying a log with millions of entries.
The WAL struct: managing the log file
Now we need a struct that wraps a file handle and provides appendPut, appendDelete, and replay operations:
pub const Wal = struct {
file: std.fs.File,
allocator: std.mem.Allocator,
bytes_written: u64,
pub fn open(allocator: std.mem.Allocator, path: []const u8) !Wal {
const file = try std.fs.cwd().createFile(path, .{
.read = true,
.truncate = false,
});
// Seek to the end so new appends go after existing data
const end_pos = try file.getEndPos();
try file.seekTo(end_pos);
return .{
.file = file,
.allocator = allocator,
.bytes_written = end_pos,
};
}
pub fn close(self: *Wal) void {
self.file.close();
}
};
A subtle but important point: createFile with .truncate = false opens the file for writing without deleting existing content. If the file already exists (from a previous run), we keep its data and seek to the end. If it doesn't exist, we create a fresh one. This is the correct behavior for a WAL -- you want to preserve the log across restarts so you can replay it.
The bytes_written field tracks total log size, which we'll use later for compaction decisions. If we had a std.fs.File.getPos() we could query it, but tracking it ourselves is simpler and avoids a syscall.
Appending operations to the log
Here's where the "write-ahead" part of write-ahead log happens. Before our KV store modifies the in-memory hash map, it calls one of these methods to write the operation to disk:
pub fn appendPut(self: *Wal, key: []const u8, value: []const u8) !void {
try self.writeEntry(.{
.op = .put,
.key = key,
.value = value,
});
}
pub fn appendDelete(self: *Wal, key: []const u8) !void {
try self.writeEntry(.{
.op = .delete,
.key = key,
.value = &.{},
});
}
fn writeEntry(self: *Wal, entry: WalEntry) !void {
var buf: [HEADER_SIZE]u8 = undefined;
// Magic
buf[0] = WAL_MAGIC[0];
buf[1] = WAL_MAGIC[1];
// Op
buf[2] = @intFromEnum(entry.op);
// Key length (little-endian u32)
std.mem.writeInt(u32, buf[3..7], @intCast(entry.key.len), .little);
// Value length (little-endian u32)
std.mem.writeInt(u32, buf[7..11], @intCast(entry.value.len), .little);
// Compute CRC32 over op + key_len + val_len + key + value
var hasher = std.hash.Crc32.init();
hasher.update(buf[2..HEADER_SIZE]); // op through val_len
hasher.update(entry.key);
hasher.update(entry.value);
const checksum = hasher.final();
var crc_bytes: [CHECKSUM_SIZE]u8 = undefined;
std.mem.writeInt(u32, &crc_bytes, checksum, .little);
// Write everything: header + key + value + checksum
// Use writev for a single syscall instead of multiple writes
const vecs = [_]std.posix.iovec_const{
.{ .base = &buf, .len = HEADER_SIZE },
.{ .base = entry.key.ptr, .len = entry.key.len },
.{ .base = entry.value.ptr, .len = entry.value.len },
.{ .base = &crc_bytes, .len = CHECKSUM_SIZE },
};
_ = try self.file.writev(&vecs);
self.bytes_written += entry.totalSize();
}
Let me walk through what's happening. We build the 11-byte header in a stack buffer: magic, op byte, key length, value length. Then we compute a CRC32 checksum over everything except the magic bytes (the magic is for finding entry boundaries during recovery, not for integrity checking). Finally we write the header, key bytes, value bytes, and checksum to disk.
The writev call is worth explaining. Instead of four separate write calls (header, key, value, checksum), we use scatter-gather I/O to send all four buffers in a single syscall. This is both faster (one kernel transition instead of four) and more atomic -- the OS is more likely to write all the data contiguously. We touched on this kind of low-level I/O in episode 31 when we worked with memory-mapped files.
Having said that, "more likely" is NOT "guaranteed". A power failure mid-write can still result in a partial entry on disk. That's exactly why we have the checksum -- if the CRC doesn't match on replay, we know the entry is incomplete and we can safely skip it. More on that in the crash recovery section.
Integrating WAL with the KV store
The WAL itself is just a file writer. The interesting part is wiring it into our KvStore from last episode. The rule is simple: write to the log first, then update the hash map. If the log write fails, we don't touch the map, and the caller gets an error. If the log write succeeds but the map update fails (OOM during hash map resize), we have an entry in the log that didn't make it into memory -- but that's fine, because on the next restart the replay will apply it.
pub const PersistentKvStore = struct {
store: KvStore,
wal: Wal,
pub fn open(allocator: std.mem.Allocator, wal_path: []const u8) !PersistentKvStore {
var store = KvStore.init(allocator);
errdefer store.deinit();
var wal = try Wal.open(allocator, wal_path);
errdefer wal.close();
// Replay existing log to rebuild state
try wal.replay(&store);
return .{
.store = store,
.wal = wal,
};
}
pub fn close(self: *PersistentKvStore) void {
self.wal.close();
self.store.deinit();
}
pub fn put(self: *PersistentKvStore, key: []const u8, value: []const u8) !void {
// WAL first, then memory
try self.wal.appendPut(key, value);
try self.store.put(key, value);
}
pub fn get(self: *PersistentKvStore, key: []const u8) ?[]const u8 {
return self.store.get(key);
}
pub fn delete(self: *PersistentKvStore, key: []const u8) !bool {
if (!self.store.contains(key)) return false;
try self.wal.appendDelete(key);
_ = self.store.delete(key);
return true;
}
};
The PersistentKvStore wraps both the in-memory store and the WAL. On open, it creates (or opens) the WAL file and immediately replays it to reconstruct the in-memory state. After that, every put and delete goes through the WAL first. The get operation doesn't touch the WAL at all -- reads are purely in-memory, which is why they're fast.
Notice the delete method now returns an error union (!bool) instead of just bool, because the WAL write can fail. This is a small API change from episode 40, but it's necessary -- any operation that touches disk can fail, and Zig won't let you ignore that.
Replaying the log on startup
This is the heart of the WAL concept. When the store opens, it reads the log file from beginning to end and applies every operation to the in-memory hash map. The result is an exact reconstruction of the state at the time of the last successfull write:
pub fn replay(self: *Wal, store: *KvStore) !void {
// Seek to beginning for replay
try self.file.seekTo(0);
const reader = self.file.reader();
var entries_applied: u64 = 0;
var entries_skipped: u64 = 0;
while (true) {
const entry = self.readOneEntry(reader) catch |err| {
switch (err) {
error.EndOfStream => break,
error.CorruptEntry => {
entries_skipped += 1;
// Try to find next magic bytes and continue
self.skipToNextMagic(reader) catch break;
continue;
},
else => return err,
}
};
defer self.allocator.free(entry.key);
defer if (entry.value.len > 0) self.allocator.free(entry.value);
switch (entry.op) {
.put => try store.put(entry.key, entry.value),
.delete => _ = store.delete(entry.key),
}
entries_applied += 1;
}
// Seek back to end for new appends
const end_pos = try self.file.getEndPos();
try self.file.seekTo(end_pos);
if (entries_skipped > 0) {
std.log.warn("WAL replay: {d} entries applied, {d} corrupt entries skipped", .{
entries_applied, entries_skipped,
});
}
}
fn readOneEntry(self: *Wal, reader: anytype) !WalEntry {
// Read and verify magic
var header: [HEADER_SIZE]u8 = undefined;
const bytes_read = reader.readAll(&header) catch return error.EndOfStream;
if (bytes_read < HEADER_SIZE) return error.EndOfStream;
if (header[0] != WAL_MAGIC[0] or header[1] != WAL_MAGIC[1]) {
return error.CorruptEntry;
}
// Parse header fields
const op_byte = header[2];
const op: WalOp = std.meta.intToEnum(WalOp, op_byte) catch return error.CorruptEntry;
const key_len = std.mem.readInt(u32, header[3..7], .little);
const val_len = std.mem.readInt(u32, header[7..11], .little);
// Sanity check lengths (reject absurdly large entries)
if (key_len > 16 * 1024 * 1024 or val_len > 256 * 1024 * 1024) {
return error.CorruptEntry;
}
// Read key
const key = try self.allocator.alloc(u8, key_len);
errdefer self.allocator.free(key);
const key_read = reader.readAll(key) catch {
self.allocator.free(key);
return error.CorruptEntry;
};
if (key_read < key_len) {
self.allocator.free(key);
return error.CorruptEntry;
}
// Read value
const value = try self.allocator.alloc(u8, val_len);
errdefer self.allocator.free(value);
if (val_len > 0) {
const val_read = reader.readAll(value) catch {
self.allocator.free(value);
return error.CorruptEntry;
};
if (val_read < val_len) {
self.allocator.free(value);
return error.CorruptEntry;
}
}
// Read and verify checksum
var crc_bytes: [CHECKSUM_SIZE]u8 = undefined;
const crc_read = reader.readAll(&crc_bytes) catch {
self.allocator.free(key);
self.allocator.free(value);
return error.CorruptEntry;
};
if (crc_read < CHECKSUM_SIZE) {
self.allocator.free(key);
self.allocator.free(value);
return error.CorruptEntry;
}
const stored_crc = std.mem.readInt(u32, &crc_bytes, .little);
// Recompute checksum
var hasher = std.hash.Crc32.init();
hasher.update(header[2..HEADER_SIZE]);
hasher.update(key);
hasher.update(value);
const computed_crc = hasher.final();
if (stored_crc != computed_crc) {
self.allocator.free(key);
self.allocator.free(value);
return error.CorruptEntry;
}
return .{
.op = op,
.key = key,
.value = value,
};
}
The replay loop reads entries one at a time. For each entry, it verifies the magic bytes, parses the header, reads the key and value data, then checks the CRC32 checksum. If everything checks out, it applies the operation to the store. If any part of the entry is corrupt (bad magic, bad checksum, truncated data), it flags it as CorruptEntry and tries to skip ahead to the next valid entry.
The sanity checks on key and value lengths (16 MB and 256 MB respectively) are important. Without them, a corrupted length field could cause us to try allocating gigabytes of memory -- which would either OOM-kill the process or take forever. These limits are arbitrary but reasonable -- if someone is storing 256 MB values in a simple KV store, they probably need a different architecture anyway.
There's something worth noticeing about the error handling here. When we encounter corruption, we don't abort -- we log a warning and continue. This is a deliberate design choice. The last entry in a WAL is very likely to be partial (the process crashed while writing it), and that's totally expected. Aborting on the last corrupt entry would mean losing ALL data just because one write was interrupted. The correct behavior is to apply everything that's valid and skip the rest.
Skip to next magic: recovering from corruption
When we encounter a corrupt entry mid-log, we need to scan forward until we find the next valid entry header. This is where those magic bytes (0xAA 0x55) earn their keep:
fn skipToNextMagic(self: *Wal, reader: anytype) !void {
_ = self;
// Scan byte-by-byte looking for magic sequence
var prev: u8 = 0;
while (true) {
const byte = reader.readByte() catch return error.EndOfStream;
if (prev == WAL_MAGIC[0] and byte == WAL_MAGIC[1]) {
// Found magic -- seek back 2 bytes so readOneEntry can read it
const pos = try reader.context.getPos();
try reader.context.seekTo(pos - 2);
return;
}
prev = byte;
}
}
This is a byte-by-byte scan, which is slow but correct. In practice it rarely runs -- corruption is usually limited to the tail of the log (partial last write). If the log has corruption in the middle, something went seriously wrong (disk hardware failure, filesystem corruption), and at that point you're in disaster recovery territory regardless.
The trick of seeking back 2 bytes after finding the magic means readOneEntry can start fresh and read the full header including the magic. This keeps the entry parsing logic clean -- it always expects to read from the start of an entry.
Log compaction
Here's a problem: our WAL only grows. Put "key1" a thousand times? That's a thousand log entries, even though only the last one matters. Delete a key? The put entry for that key is still in the log forever. Over time, the log file gets enormous, full of outdated entries that slow down replay on startup.
Compaction solves this by rewriting the log with only the entries that matter -- one entry per key that currently exists in the store:
pub fn compact(self: *PersistentKvStore, wal_path: []const u8) !void {
// Build the compacted log path
var tmp_path_buf: [std.fs.max_path_bytes]u8 = undefined;
const tmp_path = try std.fmt.bufPrint(&tmp_path_buf, "{s}.compact", .{wal_path});
// Write current state to a new WAL file
var new_wal = try Wal.open(self.store.allocator, tmp_path);
errdefer {
new_wal.close();
std.fs.cwd().deleteFile(tmp_path) catch {};
}
// Iterate all live entries and write them as Put ops
var iter = self.store.map.iterator();
while (iter.next()) |entry| {
// Skip expired entries -- no point persisting dead data
if (entry.value_ptr.*.expires_at) |expires| {
if (std.time.timestamp() >= expires) continue;
}
try new_wal.appendPut(entry.key_ptr.*, entry.value_ptr.*.value);
}
// fsync the new file
try new_wal.file.sync();
new_wal.close();
// Close old WAL
self.wal.close();
// Atomic rename: tmp -> original path
try std.fs.cwd().rename(tmp_path, wal_path);
// Reopen the compacted WAL
self.wal = try Wal.open(self.store.allocator, wal_path);
}
The compaction strategy is simple: write the current in-memory state to a new file, then atomically rename it over the old one. The atomic rename (rename on POSIX) is critical -- if we crash during compaction, we either have the old complete log or the new complete log, never a half-written replacement. This is a standard technique, exactly what Redis does when rewriting its AOF.
Expired entries get filtered out during compaction, same as they do during snapshots (remember episode 40). No point persisting keys that are already dead.
When should you compact? Common strategies:
- Size-based: When the log exceeds 2x the expected compacted size (based on entry count).
- Ratio-based: When the number of delete operations exceeds 50% of total operations.
- Time-based: Every N minutes or hours.
- Manual: Let the client trigger it via a command.
For our store, we'll leave compaction as a manual operation that the caller can invoke. Automatic compaction adds complexity (background threads, coordination with writes) that would distract from the core concepts.
Fsync and durability guarantees
Here's a question that trips up almost every junior developer working on persistence: when you call write, is the data actually on disk?
The answer is almost certainly no. Modern operating systems buffer writes in kernel memory (the page cache) and flush them to disk lazily. This is great for performance -- disk I/O is measured in milliseconds, memory copies in nanoseconds. But it means that if the power goes out between your write call and the OS's background flush, your data is gone even though write returned success.
The fix is fsync (or fdatasync). These syscalls force the OS to flush all pending writes for a specific file to the underlying storage device. In Zig, that's file.sync():
pub fn appendPutDurable(self: *Wal, key: []const u8, value: []const u8) !void {
try self.appendPut(key, value);
try self.file.sync();
}
pub fn appendDeleteDurable(self: *Wal, key: []const u8) !void {
try self.appendDelete(key);
try self.file.sync();
}
Calling sync() after every write gives you the strongest durability guarantee: if the function returns without error, the data is on stable storage. But there's a massive performance cost. An fsync on a typical SSD takes 0.5-5ms. On a spinning disk, 5-15ms. Compare that to a buffered write which takes microseconds. If you're doing 10,000 writes per second, fsyncing each one means you can only do 200-2000 per second -- a 5-50x slowdown.
This is why most databases offer configurable durability levels:
- fsync every write: Strongest. You never lose committed data. Slow.
- fsync every N writes: Good tradeoff. You might lose the last N-1 writes on crash.
- fsync every N milliseconds: Similar tradeoff, time-based.
- No fsync: Fastest. You might lose several seconds of writes on crash. The OS will eventually flush, but you don't control when.
For our WAL, we provide both appendPut (no sync) and appendPutDurable (with sync), and let the caller choose. The compact method always syncs before renaming, because losing the compacted file would mean data loss -- there's no old log to fall back on.
NB: even fsync doesn't guarantee durability on all hardware. Some cheap SSDs lie about flushing to their NAND cells. Enterprise SSDs with power-loss capacitors are better. If you're building a database for real money, you need hardware you can trust -- but that's a hardware problem, not a software one.
Testing crash recovery
How do you test that your WAL handles crashes correctly? You can't exactly unplug the power during a unit test. But you CAN simulate a crash by truncating the log file at various points:
const std = @import("std");
const Wal = @import("wal.zig").Wal;
const KvStore = @import("kv_store.zig").KvStore;
const PersistentKvStore = @import("persistent_kv.zig").PersistentKvStore;
test "basic WAL replay" {
const path = "test_wal_basic.log";
defer std.fs.cwd().deleteFile(path) catch {};
// Write some entries
{
var store = try PersistentKvStore.open(std.testing.allocator, path);
defer store.close();
try store.put("name", "scipio");
try store.put("lang", "zig");
try store.put("version", "0.14");
}
// Reopen -- should replay
{
var store = try PersistentKvStore.open(std.testing.allocator, path);
defer store.close();
try std.testing.expectEqualStrings("scipio", store.get("name").?);
try std.testing.expectEqualStrings("zig", store.get("lang").?);
try std.testing.expectEqualStrings("0.14", store.get("version").?);
}
}
test "WAL handles delete during replay" {
const path = "test_wal_delete.log";
defer std.fs.cwd().deleteFile(path) catch {};
{
var store = try PersistentKvStore.open(std.testing.allocator, path);
defer store.close();
try store.put("keep", "yes");
try store.put("remove", "soon");
_ = try store.delete("remove");
}
{
var store = try PersistentKvStore.open(std.testing.allocator, path);
defer store.close();
try std.testing.expectEqualStrings("yes", store.get("keep").?);
try std.testing.expect(store.get("remove") == null);
try std.testing.expectEqual(@as(usize, 1), store.store.count());
}
}
test "WAL survives truncated last entry" {
const path = "test_wal_truncated.log";
defer std.fs.cwd().deleteFile(path) catch {};
// Write entries normally
{
var store = try PersistentKvStore.open(std.testing.allocator, path);
defer store.close();
try store.put("safe1", "data1");
try store.put("safe2", "data2");
try store.put("partial", "this_will_be_cut");
}
// Truncate the file to simulate crash during last write
{
const file = try std.fs.cwd().openFile(path, .{ .mode = .read_write });
defer file.close();
const size = try file.getEndPos();
// Cut off the last 10 bytes -- corrupts the final entry
try file.setEndPos(size - 10);
}
// Reopen -- should recover first two entries, skip the truncated one
{
var store = try PersistentKvStore.open(std.testing.allocator, path);
defer store.close();
try std.testing.expectEqualStrings("data1", store.get("safe1").?);
try std.testing.expectEqualStrings("data2", store.get("safe2").?);
try std.testing.expect(store.get("partial") == null);
}
}
The truncation test is the important one. We write three entries, then chop off the last 10 bytes of the file (simulating a crash mid-write), then reopen. The first two entries should replay perfectly. The third entry will fail the checksum check (because its data was truncated) and get skipped. This is exactly how a real crash would behave -- the process was killed while writing the last entry, and on restart we recover everything up to the last complete write.
Let's also test with randomized truncation points to make sure we're robust no matter where the crash happens:
test "WAL handles random truncation points" {
const path = "test_wal_random.log";
var prng = std.Random.DefaultPrng.init(42);
const random = prng.random();
// Run multiple rounds with different truncation points
for (0..20) |_| {
defer std.fs.cwd().deleteFile(path) catch {};
// Write 10 entries
{
var store = try PersistentKvStore.open(std.testing.allocator, path);
defer store.close();
for (0..10) |i| {
var key_buf: [16]u8 = undefined;
const key = std.fmt.bufPrint(&key_buf, "key{d}", .{i}) catch unreachable;
try store.put(key, "value");
}
}
// Truncate at a random point
{
const file = try std.fs.cwd().openFile(path, .{ .mode = .read_write });
defer file.close();
const size = try file.getEndPos();
const cut_point = random.intRangeAtMost(u64, 0, size);
try file.setEndPos(cut_point);
}
// Reopen -- should not crash, regardless of truncation point
{
var store = try PersistentKvStore.open(std.testing.allocator, path);
defer store.close();
// We can't know exactly how many entries survived,
// but the store should be in a consistent state
const count = store.store.count();
try std.testing.expect(count <= 10);
}
}
}
This test runs 20 rounds. Each round writes 10 entries, truncates the file at a random point (could be in the middle of the first entry, between entries, in the checksum bytes, anywhere), and then verifies that the store opens without crashing and contains some subset of the original entries. The std.Random.DefaultPrng.init(42) gives us a deterministic seed so the test is reproducible -- same seed, same random truncation points every time.
And the compaction test:
test "compaction reduces log size" {
const path = "test_wal_compact.log";
defer std.fs.cwd().deleteFile(path) catch {};
defer std.fs.cwd().deleteFile("test_wal_compact.log.compact") catch {};
{
var store = try PersistentKvStore.open(std.testing.allocator, path);
defer store.close();
// Write and overwrite the same key many times
for (0..100) |i| {
var val_buf: [16]u8 = undefined;
const val = std.fmt.bufPrint(&val_buf, "v{d}", .{i}) catch unreachable;
try store.put("counter", val);
}
// Also add and delete some keys
try store.put("temp1", "data");
try store.put("temp2", "data");
_ = try store.delete("temp1");
_ = try store.delete("temp2");
const size_before = store.wal.bytes_written;
// Compact
try store.compact(path);
const size_after = store.wal.bytes_written;
// Compacted log should be much smaller (1 entry vs 104)
try std.testing.expect(size_after < size_before / 10);
}
// Verify data survived compaction
{
var store = try PersistentKvStore.open(std.testing.allocator, path);
defer store.close();
try std.testing.expectEqualStrings("v99", store.get("counter").?);
try std.testing.expect(store.get("temp1") == null);
try std.testing.expect(store.get("temp2") == null);
try std.testing.expectEqual(@as(usize, 1), store.store.count());
}
}
104 operations (100 overwrites + 2 puts + 2 deletes) get compacted down to 1 entry (the final value of "counter"). The deleted keys disappear entirely. The log file shrinks by more than 90%. And after compaction, reopening the store still shows the correct data.
Updated project structure
Our KV store project now looks like this:
kv-store/
src/
kv_store.zig -- in-memory store (from episode 40)
wal.zig -- write-ahead log: format, append, replay, compact
persistent_kv.zig -- PersistentKvStore wrapping store + WAL
kv_store_test.zig -- in-memory store tests
wal_test.zig -- WAL + persistence + crash recovery tests
main.zig -- REPL (now with persistence)
build.zig
The REPL from last time now uses PersistentKvStore instead of KvStore, which means your data survives between runs. You can put some keys, quit, restart, and get them back. Try it -- it's weirdly satiesfying to see your data outlive the process for the first time ;-)
Wat we geleerd hebben
- What a write-ahead log is and why every serious database uses one: write operations to a log file on disk before applying them in memory, so they can be replayed after a crash
- Designing a binary log format with magic bytes for entry boundary detection, an operation type byte, length-prefixed key and value data, and a CRC32 checksum for integrity verification
- Appending operations with
writevscatter-gather I/O for efficiency -- one syscall instead of four, and the data is more likely to be written contiguously - Replaying the log on startup by reading entries sequentially, verifying each checksum, and applying valid operations to the in-memory store while gracefully skipping corrupt entries
- Log compaction by writing the current live state to a new file and atomically renaming it over the old one -- the standard rename trick that Redis, PostgreSQL, and most other systems use
- Crash recovery through checksum verification and the
skipToNextMagicscan -- a truncated last entry is expected and handled, mid-log corruption triggers a forward scan to the next valid entry - The difference between
writeandfsync: buffered writes sit in the OS page cache and can be lost on power failure, whilefsyncforces data to stable storage at a significant performance cost (0.5-15ms per call) - Testing crash recovery by truncating the WAL at random offsets with a seeded PRNG, verifying that the store always opens successfully and contains a consistent subset of the original data
The store now survives crashes. It's durable (if you use the fsync variants), it's recoverable, and the log won't grow without bound thanks to compaction. What it can't do yet is talk to the outside world -- right now it's a library that only your own process can use. The natural next step is putting a TCP server in front of it so other programs (and other machines) can connect and run operations over the network. That's where the networking skills from episode 21 come back into play.
Thanks for reading!