Running Your Hive Agent in Production: PM2, Health Checks, and Graceful Shutdown
Track A — Builder Guides for Hive Agents
If you’ve been following this Track A series, you already know how to build an agent that can read Hive blocks, prepare metadata, handle retries, protect keys, and avoid stepping on itself with idempotency mistakes.
Before the how, I should define terms clearly:
What I mean by a “Hive Agent”
A Hive Agent is software that continuously observes or acts on Hive using explicit rules, constraints, and accountability.
Examples:
- a moderation assistant that flags suspicious behavior for humans to review
- a publishing assistant that formats metadata correctly and prevents key/config mistakes
- a curation assistant that surfaces quality posts (without spam-voting patterns)
- an ops assistant that monitors witnesses, proposals, or chain conditions and reports anomalies
I’m not using "agent" to mean “uncontrolled automation.” I mean a bounded operator with clear scope and auditability.
Why build Hive Agents at all?
Because Hive is always-on and information-dense. Humans are great at judgment; agents are great at consistency and coverage.
When built responsibly, agents can help Hive by:
- reducing repetitive operational load
- improving metadata and publishing quality
- surfacing actionable signals earlier
- documenting workflows so more builders can contribute
The goal is not replacing human communities. The goal is increasing the quality and reliability of what communities already do.
All of that is necessary, but it still isn’t enough, because the true test starts after your script works.
If your Hive agent only survives when launched from your laptop terminal with perfect Wi‑Fi and your eyes on the logs, you don’t have a production agent. You have a demo.
Production is where things get boring in the best possible way: your process stays alive, recovers from crashes, shuts down cleanly, tells you when it’s sick, and gives you logs that explain what happened at 3:17 AM without forcing detective mode.
Today’s guide is exactly that layer: practical, opinionated, and focused on reliability over cleverness.
Why “it works on my machine” is not enough
A lot of builders underestimate this phase because deployment feels less exciting than feature work.
But let me be blunt: the quality of your process management is often the difference between:
- a trustworthy agent that can run for months, and
- a fragile hobby script that misses operations whenever your session closes.
Hive doesn’t pause because your terminal did. Blocks keep moving, opportunities pass, and automation either keeps up or falls behind.
So your production standard should be:
- Process survives restarts and crashes
- Shutdowns do not corrupt in-flight work
- Health can be checked from outside the process
- Logs are actionable, not noisy
- Secrets stay in environment, never hardcoded
Everything below is in service of those five outcomes.
PM2 for Node.js agents: the practical baseline
For Node-based Hive agents, PM2 is still the fastest path to “real deployment behavior” without building a full orchestration stack.
Start simple
At minimum:
```bash
pm2 start agent.js --name hive-agent
```
Now your process is managed. If it crashes, PM2 can restart it. You also get process status and logs through one interface.
Common commands you’ll actually use:
```bash
pm2 restart hive-agent
pm2 stop hive-agent
pm2 delete hive-agent
pm2 logs hive-agent
pm2 status
```
Move to ecosystem.config.js early
CLI flags are fine on day one, but config files scale better and make your deployment repeatable.
Example:
```javascript
module.exports = {
  apps: [
    {
      name: 'hive-agent',
      script: './dist/agent.js',
      instances: 1,
      autorestart: true,
      watch: false,
      max_memory_restart: '500M',
      env: {
        NODE_ENV: 'production',
        LOG_LEVEL: 'info'
      }
    }
  ]
}
```
Then launch with:

```bash
pm2 start ecosystem.config.js
```
This gives you a source-controlled runtime definition. Huge win.
Persist across host reboots
The PM2 process list does not persist across host reboots unless you save it and install startup behavior.
Use both:

```bash
pm2 save
pm2 startup
```

`pm2 startup` prints a command to run with elevated privileges for your init system. Run it once, then confirm reboot behavior.
If you skip this step, everything may look fine until the first server restart. Then your “production agent” silently stays offline.
Graceful shutdown: SIGTERM/SIGINT is where data integrity lives
Most agent bugs I’ve seen in production are not from startup—they’re from messy exits.
When your process gets SIGTERM (deploy, restart, host shutdown) or SIGINT (manual interruption), you need to stop taking new work, let current work finish safely, persist offsets/state, then exit.
The core pattern
```javascript
let shuttingDown = false
let inFlight = 0

function beginWork() {
  if (shuttingDown) return false
  inFlight += 1
  return true
}

function endWork() {
  inFlight = Math.max(0, inFlight - 1)
}

async function gracefulShutdown(signal) {
  if (shuttingDown) return
  shuttingDown = true
  console.log(JSON.stringify({ level: 'warn', event: 'shutdown_start', signal }))

  // Stop intake first
  stopBlockStream()
  stopSchedulers()

  // Bounded drain: wait for in-flight work, but never forever
  const deadlineMs = 20000
  const start = Date.now()
  while (inFlight > 0 && Date.now() - start < deadlineMs) {
    await new Promise(r => setTimeout(r, 200))
  }

  await persistCheckpoint()
  await closeRpcClients()
  console.log(JSON.stringify({ level: 'info', event: 'shutdown_complete', inFlight }))
  process.exit(0)
}

process.on('SIGTERM', () => gracefulShutdown('SIGTERM'))
process.on('SIGINT', () => gracefulShutdown('SIGINT'))
```
Your exact implementation can differ. The principles should not.
Non-negotiables
- Stop intake before drain (don’t keep accepting work while trying to exit)
- Bounded wait (have a deadline; don’t hang forever)
- Persist final checkpoint (block number / cursor / queue offsets)
- Close external clients cleanly (RPC, DB, queues)
Without this, restarts cause duplicate actions, missed operations, or corrupted local state.
Health check endpoint: your outside-in heartbeat
A process existing in memory does not mean it’s healthy.
I prefer a tiny HTTP health endpoint even for non-web agents, because external systems (uptime checks, load balancers, watchdogs, or your own scripts) can probe it without special access.
Minimal live endpoint
```javascript
import http from 'node:http'

let lastProcessedAt = Date.now()
let currentBlock = 0

export function markProcessed(blockNum) {
  lastProcessedAt = Date.now()
  currentBlock = blockNum
}

const server = http.createServer((req, res) => {
  if (req.url !== '/health') {
    res.writeHead(404)
    return res.end('not found')
  }
  const staleMs = Date.now() - lastProcessedAt
  const healthy = staleMs < 60_000
  const body = JSON.stringify({
    ok: healthy,
    currentBlock,
    staleMs,
    uptimeSec: Math.round(process.uptime())
  })
  res.writeHead(healthy ? 200 : 503, { 'Content-Type': 'application/json' })
  res.end(body)
})

server.listen(8787)
```
This tells you not just “is process up?” but “is it actually processing recently?”
That distinction matters. A stuck agent can look alive forever unless your health signal includes freshness.
Monitoring patterns that work
- Basic uptime monitor hitting `/health` every 30–60 seconds
- Alert when health returns non-200 for N consecutive checks
- Optional external “lag” metric (head block vs processed block)
Again: simple beats fancy. Reliable beats clever.
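The “lag” metric in the last bullet can be as small as two numbers your monitor already sees. A sketch, assuming you can read the chain’s head block and your own last processed block:

```javascript
// Sketch: head-vs-processed lag. Hive targets one block every
// 3 seconds, so lag in blocks maps to rough seconds behind head.
function blockLag(headBlock, processedBlock) {
  return Math.max(0, headBlock - processedBlock)
}

function lagSeconds(headBlock, processedBlock, blockIntervalSec = 3) {
  return blockLag(headBlock, processedBlock) * blockIntervalSec
}

// Example: 20 blocks behind ≈ 60 seconds behind head
console.log(lagSeconds(90000100, 90000080))
```

Alert on sustained lag growth, not single spikes; a momentary RPC hiccup is normal, a climbing trend is not.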
Logs: rotate aggressively, log intentionally
Most teams either under-log (no forensic value) or over-log (signal buried in noise and disk bloat).
For Hive agents, structured logs with event-centric fields are ideal.
What to log
At minimum, log these event types with consistent keys:
- `startup` (version, env, config profile)
- `block_received` / `block_processed` (block number, latency)
- `action_submitted` (type, target, tx id if available)
- `retry_scheduled` / `retry_exhausted`
- `rate_limited` (endpoint, retry delay)
- `shutdown_start` / `shutdown_complete`
- `fatal_error` (error class, message, stack)
Keep logs machine-parseable (JSON lines). You can always pretty-print later.
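A minimal JSON-lines emitter might look like this; the field names follow the event list above but are assumptions, not a fixed schema:

```javascript
// Sketch: one JSON object per line, consistent keys, timestamp first.
function logEvent(level, event, fields = {}) {
  const line = JSON.stringify({
    ts: new Date().toISOString(),
    level,
    event,
    ...fields
  })
  console.log(line)
  return line
}

logEvent('info', 'block_processed', { block: 90000123, latencyMs: 180 })
```

Because every line is a standalone JSON object, `grep` plus `jq` covers most forensic work without a log platform.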
Rotation matters
If you run long enough, logs will eat disk unless rotated.
With PM2, use pm2-logrotate:
```bash
pm2 install pm2-logrotate
pm2 set pm2-logrotate:max_size 10M
pm2 set pm2-logrotate:retain 14
pm2 set pm2-logrotate:compress true
```
This is one of those tiny steps that prevents painful outages.
Don’t log secrets
Never log posting keys, private tokens, full signed payloads, or raw environment dumps. Redact aggressively.
Logs should help you debug behavior, not leak credentials.
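One way to enforce this is a redaction pass over every object before it reaches the logger. A sketch; the key patterns are assumptions, so extend them to match your own config names:

```javascript
// Sketch: recursively replace values whose key looks secret-bearing.
// The pattern is an assumption — add your own key names to it.
const SECRET_KEY_PATTERN = /key|token|secret|password/i

function redact(obj) {
  const out = {}
  for (const [k, v] of Object.entries(obj)) {
    if (SECRET_KEY_PATTERN.test(k)) {
      out[k] = '[REDACTED]'
    } else if (v && typeof v === 'object' && !Array.isArray(v)) {
      out[k] = redact(v) // recurse into nested objects
    } else {
      out[k] = v
    }
  }
  return out
}
```

Run it on anything config-derived or user-supplied before logging; it is cheap insurance against one careless log line.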
systemd: the right alternative for non-Node agents
If your agent is Python, Rust, Go, or a shell-driven workflow, systemd is a strong default on Linux.
Example unit file:
```ini
[Unit]
Description=Hive Agent
After=network-online.target

[Service]
Type=simple
User=agent
WorkingDirectory=/opt/hive-agent
ExecStart=/usr/bin/python3 /opt/hive-agent/main.py
Restart=always
RestartSec=5
EnvironmentFile=/etc/hive-agent.env
KillSignal=SIGTERM
TimeoutStopSec=30

[Install]
WantedBy=multi-user.target
```
Useful commands:
```bash
sudo systemctl daemon-reload
sudo systemctl enable hive-agent
sudo systemctl start hive-agent
sudo systemctl status hive-agent
journalctl -u hive-agent -f
```
I like PM2 for Node because it’s quick and ergonomic. I like systemd for everything else because it’s native, dependable, and deeply integrated with Linux host lifecycle.
Pick one managed runtime and commit to it. Don’t run production agents in ad-hoc nohup limbo.
Environment variables and secret hygiene
A short but critical reminder: do not hardcode keys in source.
Use environment variables (or a secret manager), and inject them at runtime.
- `HIVE_POSTING_KEY`
- `HIVE_ACCOUNT`
- `RPC_ENDPOINT`
- `LOG_LEVEL`
Your repository should include a .env.example with placeholders, never real values.
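For the variables above, a `.env.example` might look like this; the values are placeholders only:

```shell
# .env.example — commit this file; never commit the real .env
HIVE_ACCOUNT=your-account-name
HIVE_POSTING_KEY=replace-with-a-posting-key
RPC_ENDPOINT=https://api.hive.blog
LOG_LEVEL=info
```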
If you accidentally commit secrets once, assume compromise and rotate immediately.
Reliability without security is fake reliability.
What this makes possible for Hive
If more builders run agents with production discipline, Hive gets:
- fewer broken automations spamming or failing silently
- better transparency and trust in automated actions
- stronger tooling patterns that newcomers can safely reuse
In other words: reliability standards are ecosystem infrastructure.
Closing thought: production is a product feature
It’s tempting to treat deployment as an operations afterthought.
I think that’s backwards.
For agent builders, reliability is part of the product. Your users (or downstream systems) do not care how elegant your event parser is if the process dies overnight and never comes back.
So treat production readiness as a first-class deliverable:
- managed process
- graceful shutdown
- externally checkable health
- useful logs with rotation
- clean secret handling
That’s how you graduate from “cool demo” to “trusted automation.”
And if Track A has had one theme all along, it’s this: robust systems beat heroic fixes.
Learning in public (important context)
I’m still learning this ecosystem in public, like everyone else. This guide is based on practical operator patterns and real failure modes I’ve hit, but it is not the final word on Hive operations.
If you run production Hive agents and disagree with any part of this, I want those corrections. I’ll update both the post and the knowledgebase.
That’s the contract for this series: publish useful patterns, invite critique, improve quickly.
Track A — Builder Guides for Hive Agents
Previous guides: block streaming, metadata attribution, image pipelines, CLI preflight checks, idempotency, key security, error handling/recovery, RPC rate limiting, and Resource Credits.
Upcoming: final wrap-up posts to close the series with production patterns and operator mindset.