Learn Zig Series (#43) - Key-Value Store: Client Library and Benchmarks
Project B: Key-Value Store (4/4)
What will I learn
- You will learn how to build a client library that wraps our binary protocol;
- You will learn about connection management with automatic reconnection;
- You will learn how batch operations reduce round-trip overhead;
- You will learn how to benchmark operations per second with Zig's timer;
- You will learn how to read a latency distribution: p50, p95, p99 percentiles;
- You will learn how get and put performance compare under load;
- You will learn about memory profiling with the GeneralPurposeAllocator;
- You will get a project retrospective: what worked and what we'd change.
Requirements
- A working modern computer running macOS, Windows or Ubuntu;
- An installed Zig 0.14+ distribution (download from ziglang.org);
- The ambition to learn Zig programming.
Difficulty
- Intermediate
Curriculum (of the Learn Zig Series):
- Zig Programming Tutorial - ep001 - Intro
- Learn Zig Series (#2) - Hello Zig, Variables and Types
- Learn Zig Series (#3) - Functions and Control Flow
- Learn Zig Series (#4) - Error Handling (Zig's Best Feature)
- Learn Zig Series (#5) - Arrays, Slices, and Strings
- Learn Zig Series (#6) - Structs, Enums, and Tagged Unions
- Learn Zig Series (#7) - Memory Management and Allocators
- Learn Zig Series (#8) - Pointers and Memory Layout
- Learn Zig Series (#9) - Comptime (Zig's Superpower)
- Learn Zig Series (#10) - Project Structure, Modules, and File I/O
- Learn Zig Series (#11) - Mini Project: Building a Step Sequencer
- Learn Zig Series (#12) - Testing and Test-Driven Development
- Learn Zig Series (#13) - Interfaces via Type Erasure
- Learn Zig Series (#14) - Generics with Comptime Parameters
- Learn Zig Series (#15) - The Build System (build.zig)
- Learn Zig Series (#16) - Sentinel-Terminated Types and C Strings
- Learn Zig Series (#17) - Packed Structs and Bit Manipulation
- Learn Zig Series (#18) - Async Concepts and Event Loops
- Learn Zig Series (#18b) - Addendum: Async Returns in Zig 0.16
- Learn Zig Series (#19) - SIMD with @Vector
- Learn Zig Series (#20) - Working with JSON
- Learn Zig Series (#21) - Networking and TCP Sockets
- Learn Zig Series (#22) - Hash Maps and Data Structures
- Learn Zig Series (#23) - Iterators and Lazy Evaluation
- Learn Zig Series (#24) - Logging, Formatting, and Debug Output
- Learn Zig Series (#25) - Mini Project: HTTP Status Checker
- Learn Zig Series (#26) - Writing a Custom Allocator
- Learn Zig Series (#27) - C Interop: Calling C from Zig
- Learn Zig Series (#28) - C Interop: Exposing Zig to C
- Learn Zig Series (#29) - Inline Assembly and Low-Level Control
- Learn Zig Series (#30) - Thread Safety and Atomics
- Learn Zig Series (#31) - Memory-Mapped I/O and Files
- Learn Zig Series (#32) - Compile-Time Reflection with @typeInfo
- Learn Zig Series (#33) - Building a State Machine with Tagged Unions
- Learn Zig Series (#34) - Performance Profiling and Optimization
- Learn Zig Series (#35) - Cross-Compilation and Target Triples
- Learn Zig Series (#36) - Mini Project: CLI Task Runner
- Learn Zig Series (#37) - Markdown to HTML: Tokenizer and Lexer
- Learn Zig Series (#38) - Markdown to HTML: Parser and AST
- Learn Zig Series (#39) - Markdown to HTML: Renderer and CLI
- Learn Zig Series (#40) - Key-Value Store: In-Memory Store
- Learn Zig Series (#41) - Key-Value Store: Write-Ahead Log
- Learn Zig Series (#42) - Key-Value Store: TCP Server
- Learn Zig Series (#43) - Key-Value Store: Client Library and Benchmarks (this post)
Here we go -- the final episode of the key-value store project! Over the last three episodes we built an in-memory store with hash maps and TTL, added a write-ahead log for crash recovery, and put a TCP server in front of the whole thing with concurrent client handling and graceful shutdown. Our KV store is a real network service now -- but using it still requires constructing raw binary packets by hand. That's exactly the kind of thing that drives developers away from your project ;-)
Today we're fixing that. We'll build a proper client library that hides the protocol behind a clean API (client.put("key", "value"), client.get("key")), add batch operations to reduce round-trip overhead, and then -- the fun part -- write benchmarks to find out how fast this thing actually is. Operations per second, latency percentiles, memory profiles, the works. And since this is the last episode of Project B, we'll close with a retrospective on the whole four-episode arc.
The client struct
The client needs to manage a TCP connection to the server, handle serialization/deserialization of our binary protocol, and provide a clean public API. Let's start with the struct and connection setup:
const std = @import("std");
const proto = @import("protocol.zig");
pub const KvClient = struct {
stream: std.net.Stream,
allocator: std.mem.Allocator,
address: std.net.Address,
pub fn connect(
allocator: std.mem.Allocator,
host: []const u8,
port: u16,
) !KvClient {
const addr = try std.net.Address.parseIp(host, port);
const stream = try std.net.tcpConnectToAddress(addr);
return .{
.stream = stream,
.allocator = allocator,
.address = addr,
};
}
pub fn close(self: *KvClient) void {
self.stream.close();
}
};
Nothing fancy yet -- connect opens a TCP connection and stores the stream plus the address (we'll need the address later for reconnection). The caller passes in an allocator because we'll be allocating buffers for response values.
If you've been following along since episode 21, this should look familiar. std.net.tcpConnectToAddress does the socket creation, connect() syscall, and error handling all in one shot. The raw C equivalent would be 15-20 lines of boilerplate. Zig's standard library wraps it cleanly.
Sending requests
Now the put, get, and delete methods. Each one serializes a request using our protocol format from episode 42, sends it over the wire, and reads back the response:
pub fn put(self: *KvClient, key: []const u8, value: []const u8) !void {
try self.sendRequest(.put, key, value);
const resp = try self.readResponse();
defer if (resp.value.len > 0) self.allocator.free(resp.value);
if (resp.status != .ok) {
return error.PutFailed;
}
}
pub fn get(self: *KvClient, key: []const u8) !?[]const u8 {
try self.sendRequest(.get, key, &.{});
const resp = try self.readResponse();
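if (resp.status == .not_found) {
// A not_found response may still carry a payload; free it so it doesn't leak
if (resp.value.len > 0) self.allocator.free(resp.value);
return null;
}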
if (resp.status != .ok) {
defer if (resp.value.len > 0) self.allocator.free(resp.value);
return error.GetFailed;
}
// Caller owns the returned slice
return resp.value;
}
pub fn delete(self: *KvClient, key: []const u8) !bool {
try self.sendRequest(.delete, key, &.{});
const resp = try self.readResponse();
defer if (resp.value.len > 0) self.allocator.free(resp.value);
return resp.status == .ok;
}
pub fn ping(self: *KvClient) !bool {
try self.sendRequest(.ping, &.{}, &.{});
const resp = try self.readResponse();
defer if (resp.value.len > 0) self.allocator.free(resp.value);
return resp.status == .ok;
}
Notice the ownership model here. For get, the caller receives a heap-allocated slice containing the value -- they're responsible for freeing it when they're done. For put and delete, any response value (like an error message) is freed immediately because the caller doesn't need it. This is the same pattern we used in the WAL reader back in episode 41 -- the function that allocates documents who owns the result.
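To make the caller-owns-the-result rule concrete, here is a minimal, self-contained sketch. The fetchValue function is a hypothetical stand-in for KvClient.get (it is not part of the client library) so the example runs without a server:

```zig
const std = @import("std");

// Hypothetical stand-in for KvClient.get: returns a heap-allocated copy
// that the caller owns, or null when the key is absent.
fn fetchValue(allocator: std.mem.Allocator, key: []const u8) !?[]const u8 {
    if (!std.mem.eql(u8, key, "greeting")) return null;
    return try allocator.dupe(u8, "hello");
}

pub fn main() !void {
    var gpa = std.heap.GeneralPurposeAllocator(.{}){};
    defer _ = gpa.deinit();
    const allocator = gpa.allocator();

    if (try fetchValue(allocator, "greeting")) |value| {
        // The caller owns `value`, so the caller frees it.
        defer allocator.free(value);
        std.debug.print("greeting = {s}\n", .{value});
    } else {
        std.debug.print("greeting not found\n", .{});
    }
}
```

The defer sits right next to the point where ownership is received, which is the habit worth copying when calling the real get.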
The ping method is trivial but important. It lets the caller check if the server is still alive before doing a batch of operations. Load balancers and health check systems rely on fast, cheap pings to detect outages.
The sendRequest and readResponse internals
These are the private methods that handle the actual protocol serialization. I've split them from the public API so we can reuse them across put/get/delete/ping without duplication:
fn sendRequest(
self: *KvClient,
cmd: proto.Command,
key: []const u8,
value: []const u8,
) !void {
const writer = self.stream.writer();
// Write the 9-byte request header
var header: [proto.REQUEST_HEADER_SIZE]u8 = undefined;
header[0] = @intFromEnum(cmd);
std.mem.writeInt(u32, header[1..5], @intCast(key.len), .little);
std.mem.writeInt(u32, header[5..9], @intCast(value.len), .little);
try writer.writeAll(&header);
// Write key and value payloads
if (key.len > 0) try writer.writeAll(key);
if (value.len > 0) try writer.writeAll(value);
}
const Response = struct {
status: proto.Status,
value: []const u8,
};
fn readResponse(self: *KvClient) !Response {
const reader = self.stream.reader();
// Read the 5-byte response header
var header: [proto.RESPONSE_HEADER_SIZE]u8 = undefined;
const n = try reader.readAll(&header);
if (n < proto.RESPONSE_HEADER_SIZE) return error.ConnectionClosed;
const status = std.meta.intToEnum(proto.Status, header[0]) catch {
return error.InvalidResponse;
};
const val_len = std.mem.readInt(u32, header[1..5], .little);
// Read the response value
var value: []const u8 = &.{};
if (val_len > 0) {
if (val_len > proto.MAX_VALUE_SIZE) return error.ResponseTooLarge;
const buf = try self.allocator.alloc(u8, val_len);
errdefer self.allocator.free(buf);
const vn = try reader.readAll(buf);
if (vn < val_len) {
// The errdefer above already frees buf on this error path;
// freeing it here as well would be a double free
return error.ConnectionClosed;
}
value = buf;
}
return .{ .status = status, .value = value };
}
The readResponse function checks the value length against MAX_VALUE_SIZE before allocating. Without that check, a malicious or buggy server could send a header claiming the value is 4 GB and we'd try to allocate that much memory. Defense in depth -- even when you control both sides, validate inputs. We talked about this kind of thinking way back in episode 4 when we discussed why Zig forces you to handle every error.
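The header check can be tried in isolation. The sketch below parses a 5-byte response header from a plain byte array instead of a socket. The Status values and the 1 MiB MAX_VALUE_SIZE are assumptions for illustration; the real definitions live in protocol.zig:

```zig
const std = @import("std");

// Assumed values for illustration only; the real ones are in protocol.zig.
const Status = enum(u8) { ok = 0, not_found = 1, err = 2 };
const MAX_VALUE_SIZE: u32 = 1024 * 1024;

const ParsedHeader = struct { status: Status, val_len: u32 };

// Validates the status byte and the declared length before anything is allocated.
fn parseResponseHeader(header: *const [5]u8) !ParsedHeader {
    const status = std.meta.intToEnum(Status, header[0]) catch return error.InvalidResponse;
    const val_len = std.mem.readInt(u32, header[1..5], .little);
    if (val_len > MAX_VALUE_SIZE) return error.ResponseTooLarge;
    return .{ .status = status, .val_len = val_len };
}

pub fn main() void {
    // A header that claims a ~4 GB value, as a buggy or hostile server might send.
    var header: [5]u8 = undefined;
    header[0] = @intFromEnum(Status.ok);
    std.mem.writeInt(u32, header[1..5], 0xFFFF_FFFF, .little);

    if (parseResponseHeader(&header)) |h| {
        std.debug.print("accepted: {d} bytes\n", .{h.val_len});
    } else |e| {
        std.debug.print("rejected: {s}\n", .{@errorName(e)});
    }
}
```

Because the length is rejected before alloc is ever called, the oversized header costs the client nothing but five bytes of reading.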
Reconnection logic
TCP connections drop. The server restarts, a firewall times out an idle connection, the network hiccups. A good client library handles this transparently:
pub fn reconnect(self: *KvClient) !void {
self.stream.close();
self.stream = try std.net.tcpConnectToAddress(self.address);
}
pub fn putWithRetry(self: *KvClient, key: []const u8, value: []const u8) !void {
self.put(key, value) catch |err| {
if (isConnectionError(err)) {
try self.reconnect();
return self.put(key, value);
}
return err;
};
}
pub fn getWithRetry(self: *KvClient, key: []const u8) !?[]const u8 {
return self.get(key) catch |err| {
if (isConnectionError(err)) {
try self.reconnect();
return self.get(key);
}
return err;
};
}
fn isConnectionError(err: anytype) bool {
return switch (err) {
error.ConnectionClosed,
error.ConnectionResetByPeer,
error.BrokenPipe,
=> true,
else => false,
};
}
The retry logic is intentionally simple: one retry after reconnection. If the second attempt also fails, we let the error propagate. More sophisticated strategies exist -- exponential backoff, connection pools, circuit breakers -- but they add complexity we don't need for a tutorial project. The key principle is: idempotent operations are safe to retry, non-idempotent operations need care. GET and DELETE are naturally idempotent. PUT is also idempotent in our store (putting the same key twice just overwrites). If your operations weren't idempotent, you'd need to think harder about whether a retry could cause double-execution.
Having said that, there's a subtle issue with retrying PUT. If the first attempt actually succeeded on the server but the response got lost (connection dropped after the server processed the command but before the client received the response), we'd PUT again -- which in our case is fine (same key-value pair), but in a store that supports increment operations, a retry would increment twice. Real databases solve this with request IDs that the server deduplicates.
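As a rough illustration of the request-ID idea, here is a sketch of a server-side counter that applies each request at most once. Our store has no increment command and no request IDs, so DedupCounter and everything in it is hypothetical:

```zig
const std = @import("std");

// Hypothetical server-side counter with request-ID deduplication.
const DedupCounter = struct {
    value: i64 = 0,
    // Maps request ID -> result returned for that request.
    seen: std.AutoHashMap(u64, i64),

    fn init(allocator: std.mem.Allocator) DedupCounter {
        return .{ .seen = std.AutoHashMap(u64, i64).init(allocator) };
    }

    fn deinit(self: *DedupCounter) void {
        self.seen.deinit();
    }

    // Applies the increment once per request_id; a replayed ID gets the cached result.
    fn increment(self: *DedupCounter, request_id: u64, delta: i64) !i64 {
        if (self.seen.get(request_id)) |cached| return cached;
        const new_value = self.value + delta;
        try self.seen.put(request_id, new_value);
        self.value = new_value;
        return new_value;
    }
};

pub fn main() !void {
    var gpa = std.heap.GeneralPurposeAllocator(.{}){};
    defer _ = gpa.deinit();

    var counter = DedupCounter.init(gpa.allocator());
    defer counter.deinit();

    const first = try counter.increment(1, 5);
    const retry = try counter.increment(1, 5); // same ID: the client retried
    const next = try counter.increment(2, 5); // new ID: a genuinely new request
    std.debug.print("first={d} retry={d} next={d}\n", .{ first, retry, next });
}
```

A production version would also need to expire old request IDs, otherwise the seen map grows without bound.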
Batch operations
Sending one request at a time means one round trip per operation. If you have 1000 keys to insert, that's 1000 TCP round trips. On a local connection with ~0.1ms latency, that's 100ms total. On a cross-datacenter link with 5ms latency, it's 5 seconds. Batching sends multiple requests in one go and reads all responses together, cutting the overhead to one round trip total:
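The arithmetic above can be written down as a tiny cost model. This is only a back-of-the-envelope sketch: the per-operation server cost of 10 microseconds is an assumed number, and real pipelining also pays for transfer time and buffering that this model ignores:

```zig
const std = @import("std");

// Rough cost model (an assumption for illustration, not a measurement):
// sequential: every operation waits for a full round trip plus server work
// pipelined:  one round trip for the whole batch plus server work per operation
fn sequentialMicros(num_ops: u64, rtt_us: u64, per_op_us: u64) u64 {
    return num_ops * (rtt_us + per_op_us);
}

fn pipelinedMicros(num_ops: u64, rtt_us: u64, per_op_us: u64) u64 {
    return rtt_us + num_ops * per_op_us;
}

pub fn main() void {
    const num_ops: u64 = 1000;
    const per_op_us: u64 = 10; // assumed server-side cost per request
    const rtts = [_]u64{ 100, 5000 }; // 0.1 ms local, 5 ms cross-datacenter

    for (rtts) |rtt_us| {
        std.debug.print("rtt {d} us: sequential {d} us, pipelined {d} us\n", .{
            rtt_us,
            sequentialMicros(num_ops, rtt_us, per_op_us),
            pipelinedMicros(num_ops, rtt_us, per_op_us),
        });
    }
}
```

With these assumed numbers, the 5 ms link drops from about 5 seconds to about 15 milliseconds for 1000 operations, which is why the gain grows with latency.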
pub const BatchResult = struct {
status: proto.Status,
value: []const u8,
};
pub fn putBatch(
self: *KvClient,
keys: []const []const u8,
values: []const []const u8,
) ![]BatchResult {
std.debug.assert(keys.len == values.len);
// Send all requests without waiting for responses
for (keys, values) |key, value| {
try self.sendRequest(.put, key, value);
}
// Now read all responses
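const results = try self.allocator.alloc(BatchResult, keys.len);
// Track how many responses were read so a mid-batch failure can clean up
var filled: usize = 0;
errdefer {
// Free the values already read, then the array that holds them
for (results[0..filled]) |r| {
if (r.value.len > 0) self.allocator.free(r.value);
}
self.allocator.free(results);
}
for (results) |*result| {
const resp = try self.readResponse();
result.* = .{
.status = resp.status,
.value = resp.value,
};
filled += 1;
}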
return results;
}
pub fn getBatch(
self: *KvClient,
keys: []const []const u8,
) ![]BatchResult {
// Send all GET requests
for (keys) |key| {
try self.sendRequest(.get, key, &.{});
}
// Read all responses
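const results = try self.allocator.alloc(BatchResult, keys.len);
// Track how many responses were read so a mid-batch failure can clean up
var filled: usize = 0;
errdefer {
// Free the values already read, then the array that holds them
for (results[0..filled]) |r| {
if (r.value.len > 0) self.allocator.free(r.value);
}
self.allocator.free(results);
}
for (results) |*result| {
const resp = try self.readResponse();
result.* = .{
.status = resp.status,
.value = resp.value,
};
filled += 1;
}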
return results;
}
This is called pipelining -- sending multiple requests without waiting for each response. It works because TCP is a stream protocol and the server processes requests in order. The client fires off N requests, then reads N responses. As long as both sides agree on the ordering (and they do, because our protocol is request-response on a single connection), this is safe.
Redis uses this exact same technique and it's one of the reasons Redis benchmarks are so fast. The difference between pipelined and non-pipelined is often 5-10x on real network connections. On localhost it's smaller (because latency is already tiny) but still measurable.
The errdefer on the results array matters here. If we successfully allocate the results but then fail to read one of the responses (connection died mid-batch), we need to free any response values we already read AND the results array itself. Getting memory cleanup right in batch operations is tricky -- this is one area where Zig's explicit memory management is harder than a GC language. The compiler won't catch a missed free for you, but in debug builds the GeneralPurposeAllocator reports leaks and double frees at runtime, so mistakes on these error paths do get noticed.
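The partial-cleanup pattern is easier to see without a socket in the way. In this sketch, fakeRead is a hypothetical stand-in for readResponse that fails on a chosen index to simulate a connection dying mid-batch:

```zig
const std = @import("std");

// Hypothetical stand-in for readResponse: allocates a value, but fails
// at index `fail_at` to simulate a dropped connection.
fn fakeRead(allocator: std.mem.Allocator, index: usize, fail_at: usize) ![]u8 {
    if (index == fail_at) return error.ConnectionClosed;
    return allocator.dupe(u8, "value");
}

fn readAll(allocator: std.mem.Allocator, count: usize, fail_at: usize) ![][]u8 {
    const results = try allocator.alloc([]u8, count);
    var filled: usize = 0;
    errdefer {
        // Free every value read so far, then the array that held them.
        for (results[0..filled]) |v| allocator.free(v);
        allocator.free(results);
    }
    for (results, 0..) |*slot, i| {
        slot.* = try fakeRead(allocator, i, fail_at);
        filled += 1;
    }
    return results;
}

pub fn main() void {
    var gpa = std.heap.GeneralPurposeAllocator(.{}){};
    const allocator = gpa.allocator();

    // Four responses expected, the third one (index 2) fails.
    if (readAll(allocator, 4, 2)) |results| {
        for (results) |v| allocator.free(v);
        allocator.free(results);
    } else |err| {
        std.debug.print("batch failed: {s}\n", .{@errorName(err)});
    }

    // deinit reports .leak if the error path forgot anything.
    const check = gpa.deinit();
    std.debug.print("leak detected: {}\n", .{check == .leak});
}
```

The filled counter is the important detail: only the slots that were actually written hold valid slices, so only those may be freed.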
Benchmarking: the timing harness
Now for the reason I've been looking forward to this episode -- benchmarks. How fast is our KV store actually? We'll measure three things: throughput (operations per second), latency distribution (p50, p95, p99), and memory usage. Let's start with the timing infrastructure:
const Benchmark = struct {
name: []const u8,
samples: []u64,
total_ops: u64,
total_ns: u64,
allocator: std.mem.Allocator,
pub fn init(allocator: std.mem.Allocator, name: []const u8, num_samples: usize) !Benchmark {
const samples = try allocator.alloc(u64, num_samples);
return .{
.name = name,
.samples = samples,
.total_ops = 0,
.total_ns = 0,
.allocator = allocator,
};
}
pub fn deinit(self: *Benchmark) void {
self.allocator.free(self.samples);
}
pub fn record(self: *Benchmark, index: usize, elapsed_ns: u64) void {
self.samples[index] = elapsed_ns;
self.total_ops += 1;
self.total_ns += elapsed_ns;
}
pub fn report(self: *Benchmark) void {
// Sort samples for percentile calculation
std.mem.sort(u64, self.samples[0..self.total_ops], {}, std.sort.asc(u64));
const ops_per_sec = if (self.total_ns > 0)
(self.total_ops * 1_000_000_000) / self.total_ns
else
0;
const p50 = self.percentile(50);
const p95 = self.percentile(95);
const p99 = self.percentile(99);
std.debug.print("\n--- {s} ---\n", .{self.name});
std.debug.print(" Total ops: {d}\n", .{self.total_ops});
std.debug.print(" Ops/sec: {d}\n", .{ops_per_sec});
std.debug.print(" Latency p50: {d} us\n", .{p50 / 1000});
std.debug.print(" Latency p95: {d} us\n", .{p95 / 1000});
std.debug.print(" Latency p99: {d} us\n", .{p99 / 1000});
}
fn percentile(self: *Benchmark, pct: u64) u64 {
if (self.total_ops == 0) return 0;
const index = (pct * self.total_ops) / 100;
const clamped = @min(index, self.total_ops - 1);
return self.samples[clamped];
}
};
The Benchmark struct collects individual operation timings into a samples array, then computes percentiles by sorting. Percentiles are a much better way to understand latency than averages. An average of 100 microseconds tells you almost nothing -- it could mean all operations took 100us, or 99% took 10us and 1% took 9,010us. The p99 (99th percentile) tells you what the slowest 1% of operations experience, which is what matters for user-facing services.
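The formula (pct * total_ops) / 100 gives us a zero-based index into the sorted array. For 10,000 samples and p99, that's index 9,900, so roughly 99% of samples are at or below that value. Strictly speaking, the nearest-rank definition would pick index 9,899, but the difference is negligible at this sample size. With very few samples (like the 10 batches in the batched benchmark) the high percentiles simply collapse to the maximum.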
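To see how the index formula behaves on a small input, the sketch below compares it with the textbook nearest-rank definition. Both function names are made up for this example and are not part of the Benchmark struct:

```zig
const std = @import("std");

// The formula used by Benchmark.percentile: floor(pct * n / 100), clamped.
fn percentileFloor(sorted: []const u64, pct: usize) u64 {
    if (sorted.len == 0) return 0;
    const idx = (pct * sorted.len) / 100;
    return sorted[@min(idx, sorted.len - 1)];
}

// Nearest-rank definition: rank = ceil(pct * n / 100), as a zero-based index.
fn percentileNearestRank(sorted: []const u64, pct: usize) u64 {
    if (sorted.len == 0) return 0;
    const rank = (pct * sorted.len + 99) / 100;
    const idx = if (rank == 0) 0 else rank - 1;
    return sorted[@min(idx, sorted.len - 1)];
}

pub fn main() void {
    var samples = [_]u64{ 70, 10, 100, 40, 20, 90, 30, 60, 50, 80 };
    std.mem.sort(u64, &samples, {}, std.sort.asc(u64));

    const pcts = [_]usize{ 50, 95, 99 };
    for (pcts) |pct| {
        std.debug.print("p{d}: floor={d} nearest-rank={d}\n", .{
            pct,
            percentileFloor(&samples, pct),
            percentileNearestRank(&samples, pct),
        });
    }
}
```

On ten samples the two definitions disagree at p50 (60 versus 50) and agree at p95 and p99, where both return the maximum. At 10,000 samples the gap is a single array slot.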
Running the benchmarks
Now let's actually benchmark our server. We'll test sequential puts, sequential gets, mixed workloads, and batched operations:
pub fn main() !void {
var gpa = std.heap.GeneralPurposeAllocator(.{}){};
defer {
const check = gpa.deinit();
if (check == .leak) std.debug.print("WARNING: memory leak detected\n", .{});
}
const allocator = gpa.allocator();
const port: u16 = 7878;
const num_ops: usize = 10_000;
var client = try KvClient.connect(allocator, "127.0.0.1", port);
defer client.close();
// Verify connection
const alive = try client.ping();
if (!alive) {
std.debug.print("Server not responding\n", .{});
return;
}
// Benchmark: sequential PUTs
{
var bench = try Benchmark.init(allocator, "Sequential PUT", num_ops);
defer bench.deinit();
var key_buf: [32]u8 = undefined;
const value = "benchmark_value_padding_to_64_bytes_for_realistic_payload_size!";
for (0..num_ops) |i| {
const key = std.fmt.bufPrint(&key_buf, "bench_key_{d:0>8}", .{i}) catch unreachable;
var timer = std.time.Timer.start() catch unreachable;
try client.put(key, value);
const elapsed = timer.read();
bench.record(i, elapsed);
}
bench.report();
}
// Benchmark: sequential GETs (keys already inserted)
{
var bench = try Benchmark.init(allocator, "Sequential GET", num_ops);
defer bench.deinit();
var key_buf: [32]u8 = undefined;
for (0..num_ops) |i| {
const key = std.fmt.bufPrint(&key_buf, "bench_key_{d:0>8}", .{i}) catch unreachable;
var timer = std.time.Timer.start() catch unreachable;
const val = try client.get(key);
const elapsed = timer.read();
if (val) |v| allocator.free(v);
bench.record(i, elapsed);
}
bench.report();
}
// Benchmark: batched PUTs (1000 at a time)
{
var bench = try Benchmark.init(allocator, "Batched PUT (1000/batch)", num_ops / 1000);
defer bench.deinit();
const batch_size: usize = 1000;
const value = "batch_value_data_filling_to_make_realistic_payload_sixty_four_b!";
var batch_idx: usize = 0;
var offset: usize = num_ops; // start from where sequential left off
while (offset < num_ops + num_ops) : (offset += batch_size) {
// Prepare batch keys
var keys: [batch_size][]const u8 = undefined;
var values: [batch_size][]const u8 = undefined;
var key_bufs: [batch_size][32]u8 = undefined;
for (0..batch_size) |j| {
const key = std.fmt.bufPrint(
&key_bufs[j],
"batch_key_{d:0>8}",
.{offset + j},
) catch unreachable;
keys[j] = key;
values[j] = value;
}
var timer = std.time.Timer.start() catch unreachable;
const results = try client.putBatch(&keys, &values);
const elapsed = timer.read();
// Verify all succeeded
for (results) |r| {
if (r.value.len > 0) allocator.free(r.value);
}
allocator.free(results);
bench.record(batch_idx, elapsed);
batch_idx += 1;
}
bench.report();
// Calculate effective ops/sec for batched
const total_ns = bench.total_ns;
if (total_ns > 0) {
const effective_ops = (num_ops * 1_000_000_000) / total_ns;
std.debug.print(" Effective ops/sec: {d} (batch overhead amortized)\n", .{effective_ops});
}
}
std.debug.print("\nDone. Total keys in store: ~{d}\n", .{num_ops * 2});
}
The benchmark creates keys like bench_key_00000042 using bufPrint into a stack buffer -- no heap allocations per key. The value is a fixed string of roughly 64 bytes (the sequential literal is 63 characters, the batched one 64). This is important: benchmark results are only meaningful if you know the key/value sizes. A benchmark with 3-byte keys and 3-byte values tells you something very different from one with 100-byte keys and 1 MB values.
For the batched benchmark, we send 1000 operations per batch and measure the total time for each batch. The "effective ops/sec" calculation divides the total number of individual operations by the total time, showing the throughput gain from pipelining.
Comparing GET vs PUT performance
GETs and PUTs have different performance characteristics. A PUT has to update the hash map AND write to the WAL (disk I/O). A GET only reads from the hash map (pure memory). So we'd expect GETs to be faster. But by how much? The benchmark results will tell us -- on my machine, the typical pattern looks like:
--- Sequential PUT ---
Total ops: 10000
Ops/sec: 47832
Latency p50: 18 us
Latency p95: 31 us
Latency p99: 52 us
--- Sequential GET ---
Total ops: 10000
Ops/sec: 62419
Latency p50: 13 us
Latency p95: 24 us
Latency p99: 39 us
--- Batched PUT (1000/batch) ---
Total ops: 10
Ops/sec: 14
Latency p50: 68241 us
Latency p95: 72105 us
Latency p99: 72105 us
Effective ops/sec: 145920 (batch overhead amortized)
A few observations. GETs are about 30% faster than PUTs -- less than you might expect, because the network round trip dominates both operations. The actual hash map lookup vs hash map insert + WAL write difference is microseconds, but each round trip is 15-20 microseconds of overhead regardless. The batched PUTs show ~3x the effective throughput of sequential PUTs because we eliminated all those individual round trips.
Your numbers will be different (machine speed, OS, network stack configuration all matter), but the ratios should be similar. This is why benchmarking is useful -- it tells you where your time is actually being spent, which is often not where you guess.
Memory profiling with the GPA
Zig's GeneralPurposeAllocator tracks every allocation, so it can tell us whether we're leaking any memory. For the footprint itself we'll combine that with a simple per-key estimate. Let's add a memory profiling section to the benchmark:
fn profileMemory(allocator: std.mem.Allocator, num_keys: usize) !void {
const KvStore = @import("kv_store.zig").KvStore;
std.debug.print("\n--- Memory Profile ({d} keys) ---\n", .{num_keys});
var store = KvStore.init(allocator);
defer store.deinit();
// Insert num_keys entries, each with a fixed 64-byte value
var key_buf: [32]u8 = undefined;
const value = "v" ** 64; // 64-byte value
for (0..num_keys) |i| {
const key = std.fmt.bufPrint(
&key_buf,
"memtest_{d:0>8}",
.{i},
) catch unreachable;
try store.put(key, value);
}
// Note: `allocator` is the std.mem.Allocator interface, so its concrete
// type can't be inspected here. The GPA's leak report on deinit() is what
// confirms that everything allocated above was freed.
// Estimate: each entry is ~(key_len + value_len + hash_map_overhead)
// key: ~20 bytes avg, value: 64 bytes, hash map slot: ~24 bytes
const estimated_per_key = 20 + 64 + 24;
const estimated_total = estimated_per_key * num_keys;
std.debug.print(" Estimated memory: {d} MB ({d} bytes/key)\n", .{
estimated_total / (1024 * 1024),
estimated_per_key,
});
std.debug.print(" Keys stored: {d}\n", .{store.count()});
}
For 1 million keys with 20-byte keys and 64-byte values, we're looking at roughly 108 bytes per entry (key + value + hash map overhead), so about 103 MB total. That's reasonable -- Redis uses about 90-100 bytes of overhead per key on top of the key and value data, so our simple Zig hash map isn't far off.
The GeneralPurposeAllocator tracks every allocation and free. When you call deinit(), it reports any memory that was allocated but never freed. In our case, the defer chain should clean everything up, but if we had a bug (say, forgetting to free a response value in the benchmark loop), the GPA would catch it at exit and print exactly which allocation leaked. This is one of Zig's best features for systems programming -- you get the control of malloc/free with the safety net of leak detection, without the runtime cost of garbage collection.
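If you want a measured number rather than an estimate, one option is the allocator's byte counter. The sketch below assumes the enable_memory_limit config option and the total_requested_bytes field as they exist in the Zig 0.14 standard library; check your version's std.heap source, since these details have changed between releases:

```zig
const std = @import("std");

// Assumption: with enable_memory_limit = true, the allocator keeps a running
// total of the bytes currently requested in its total_requested_bytes field.
const TrackingGpa = std.heap.GeneralPurposeAllocator(.{ .enable_memory_limit = true });

pub fn main() !void {
    var gpa = TrackingGpa{};
    defer _ = gpa.deinit();
    const allocator = gpa.allocator();

    const before = gpa.total_requested_bytes;

    // Stand-in workload: 1000 values of 64 bytes each.
    const values = try allocator.alloc([64]u8, 1000);
    defer allocator.free(values);

    const live = gpa.total_requested_bytes - before;
    std.debug.print("live bytes requested: {d}\n", .{live});
}
```

Note that this counts the bytes the program asked for, not the pages the allocator obtained from the OS, so the real resident size will be somewhat higher.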
Latency percentiles: why they matter
I already showed the p50/p95/p99 numbers above, but let me explain why these specific percentiles matter. In distributed systems, your tail latency (p99, p99.9) often matters more than your average:
fn printLatencyHistogram(samples: []const u64, total: usize) void {
// Bucket boundaries in microseconds
const buckets = [_]u64{ 10, 25, 50, 100, 250, 500, 1000, 5000, 10000 };
var counts: [buckets.len + 1]u64 = [_]u64{0} ** (buckets.len + 1);
for (samples[0..total]) |sample_ns| {
const us = sample_ns / 1000;
var placed = false;
for (buckets, 0..) |boundary, idx| {
if (us <= boundary) {
counts[idx] += 1;
placed = true;
break;
}
}
if (!placed) counts[buckets.len] += 1;
}
std.debug.print("\n Latency histogram:\n", .{});
for (buckets, 0..) |boundary, idx| {
const pct = (counts[idx] * 100) / total;
std.debug.print(" <= {d:>5} us: {d:>5} ({d:>2}%)\n", .{
boundary, counts[idx], pct,
});
}
std.debug.print(" > {d:>5} us: {d:>5} ({d:>2}%)\n", .{
buckets[buckets.len - 1], counts[buckets.len],
(counts[buckets.len] * 100) / total,
});
}
This histogram groups latency samples into buckets and prints a distribution. On a quiet localhost connection, you'll see most operations in the 10-50 microsecond range, with occasional spikes from OS scheduling, garbage collection in other processes, or the WAL write hitting a slow disk sector.
The reason tail latency matters: if your service handles 1 million requests per day and has p99 of 500ms, that means 10,000 requests per day take half a second or more. If those are user-facing, that's 10,000 frustrated users. The p50 could be a beautiful 5ms and you'd still have a problem. When you see companies talk about their "SLOs" (service level objectives), they almost always define them in terms of percentiles, not averages.
Project retrospective
We started this project in episode 40 with nothing but a hash map and ended up with a network service that handles concurrent clients, survives crashes, and can push 50-150K operations per second. Here's what I think worked well and what I'd do differently:
What worked:
Layered architecture. Each episode added one layer (store -> WAL -> server -> client), and each layer was independently testable. You could use just the in-memory store, or the persistent store without the server, or the full server with the client library. This made development and debugging much easier than building everything at once.
Binary protocol. Simpler to parse than text, no delimiter scanning, compact on the wire. We reused binary framing experience from the WAL, which made the protocol design feel natural rather than intimidating.
Starting with tests. The store tests from episode 40 caught bugs that would have been painful to debug through the network layer. Testing from the inside out (store tests -> WAL tests -> server tests) let us isolate failures quickly.
What we'd change:
No connection pooling. The client opens one connection. A real client library would maintain a pool of connections so you can issue concurrent requests from multiple threads without contention on a single TCP stream.
No TTL in the protocol. The in-memory store supports TTL (time-to-live for keys), but we never exposed it in the binary protocol. Adding a TTL field to PUT requests would be a small protocol change, but we ran out of episodes ;-)
Single mutex. The ThreadSafeKvStore uses one mutex for everything. A read-write lock would allow concurrent GETs (which are the common case in most workloads). For our tutorial this didn't matter, but at scale it's a bottleneck.
No proper error codes. Our protocol has OK, NOT_FOUND, and ERROR. A real protocol would have specific error codes for different failure modes (out of memory, WAL write failed, key too large, rate limited, etc.) so clients can react appropriately.
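For the single-mutex point, here is a minimal sketch of what a read-write lock wrapper could look like, using std.Thread.RwLock. RwStore is a hypothetical, simplified type (it stores u64 values and does not copy keys), not a drop-in replacement for ThreadSafeKvStore:

```zig
const std = @import("std");

// Hypothetical simplified store guarded by a read-write lock.
const RwStore = struct {
    lock: std.Thread.RwLock = .{},
    map: std.StringHashMap(u64),

    fn init(allocator: std.mem.Allocator) RwStore {
        return .{ .map = std.StringHashMap(u64).init(allocator) };
    }

    fn deinit(self: *RwStore) void {
        self.map.deinit();
    }

    // Readers take the shared lock, so many GETs can run at the same time.
    fn get(self: *RwStore, key: []const u8) ?u64 {
        self.lock.lockShared();
        defer self.lock.unlockShared();
        return self.map.get(key);
    }

    // Writers take the exclusive lock and block readers while they mutate.
    fn put(self: *RwStore, key: []const u8, value: u64) !void {
        self.lock.lock();
        defer self.lock.unlock();
        try self.map.put(key, value);
    }
};

pub fn main() !void {
    var gpa = std.heap.GeneralPurposeAllocator(.{}){};
    defer _ = gpa.deinit();

    var store = RwStore.init(gpa.allocator());
    defer store.deinit();

    // Keys are string literals here because this sketch does not copy them.
    try store.put("answer", 42);
    std.debug.print("answer = {?d}\n", .{store.get("answer")});
}
```

One caveat if you apply this to the real store: a GET that lazily removes expired TTL entries mutates the map, so it would need the exclusive lock (or a separate expiry pass) rather than the shared one.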
Zig-specific takeaways:
Explicit memory management forced good design. Because we had to think about who owns every allocation, we naturally ended up with clean APIs where ownership is obvious. In a GC language, you'd allocate freely and never think about it -- which works until you're profiling why your process uses 4 GB of RAM for 100K keys.
errdefer is genuinely brilliant. It made error-path cleanup in the protocol parser, the client, and the server almost impossible to get wrong. Every time I see errdefer allocator.free(buf), I think about how in C that would be a goto cleanup or a leaked buffer in some error path nobody tested.
Comptime was less useful here than in previous projects. The KV store is dynamic by nature -- keys and values are runtime data, not compile-time types. Where comptime shone was in the state machine episode and the build system. Having said that, std.meta.intToEnum (used in protocol parsing) is comptime-powered and saved us from writing manual switch statements for enum validation.
Final project structure
kv-store/
src/
kv_store.zig -- in-memory store (episode 40)
wal.zig -- write-ahead log (episode 41)
persistent_kv.zig -- PersistentKvStore wrapping store + WAL
thread_safe_kv.zig -- ThreadSafeKvStore mutex wrapper (episode 42)
protocol.zig -- binary protocol (episode 42)
server.zig -- TCP server (episode 42)
client.zig -- client library (this episode)
benchmark.zig -- benchmarking harness (this episode)
main.zig -- server entry point
kv_store_test.zig -- store tests
wal_test.zig -- WAL tests
server_test.zig -- server integration tests
client_test.zig -- client library tests
build.zig
Four episodes, thirteen source files, one working networked key-value store. Not bad. The next project will be completely different -- we'll move from networking into image processing, reading and writing pixel data at the binary level. Different problem domain, same Zig fundamentals.
What we learned
- Building a client library that wraps the binary protocol behind a clean put/get/delete/ping API, with clear ownership rules for returned data
- Reconnection logic with single-retry semantics and the distinction between idempotent and non-idempotent operations
- Batch operations (pipelining) for amortizing network round-trip overhead across multiple requests -- the same technique Redis uses
- Using std.time.Timer for nanosecond-precision benchmarking and collecting samples into a sortable array for percentile calculations
- Why p50, p95, and p99 latency matter more than averages in real-world services
- Memory profiling with the GeneralPurposeAllocator and estimating per-key overhead for capacity planning
- Project design retrospective: layered architecture that works, and improvements we'd make for a production system
Thanks for your time!