Hey everyone,
Lately I have been digging into a streamer issue on the Hive-Engine node that I think a lot of operators have probably felt, even if they did not immediately know where the problem lived.
The short version is this:
if your first configured Hive RPC goes bad, the node does not always fail over the way you would expect.
Instead of cleanly moving on to the next healthy RPC, it can get stuck waiting on the bad one, start falling behind, and just sort of sit there looking alive while making less progress than it should.
That is exactly the kind of issue that is annoying for operators because it does not always look like a hard crash. Sometimes it looks more like the node just got dumb and slow.
After tracing through the streamer logic, I do not think this is just bad luck or one flaky endpoint. I think the failover behavior in the current implementation is weaker than people assume it is.
So this post is not code yet. This is the architecture discussion I would want to have before I go change it.
There are really two reasonable paths here.
What Is Going Wrong
At a high level, the current streamer is not treating streamNodes like a true failover chain for block reads.
What it is doing instead is closer to this:
- create separate clients for separate nodes
- hand block fetches to those clients in a fixed order
- rotate the configured list only after a harder stream-level failure
That means a bad first node can still end up pinning the current block fetch even when other nodes are healthy.
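To make that failure mode concrete, here is a deliberately simplified sketch of what "pinning" means in practice. The stub clients and names here are mine for illustration, not the actual Hive-Engine code:

```javascript
// Hypothetical sketch of the failure mode, NOT the real streamer code.
// A stub stands in for a real Hive RPC client.
function makeClient(url, healthy) {
  return {
    url,
    getBlock: (n) =>
      healthy
        ? Promise.resolve({ block_num: n, from: url })
        : new Promise(() => {}), // a dead node: this promise never settles
  };
}

const clients = [
  makeClient('https://bad-rpc.example', false), // first configured node is down
  makeClient('https://good-rpc.example', true), // healthy backup, never asked
];

// Roughly the shape of the problem: the fetch is pinned to the first
// client, so a hung primary hangs the block stream even though a
// healthy backup is sitting right next to it in the list.
function getBlock(blockNum) {
  return clients[0].getBlock(blockNum);
}

// getBlock(123) never resolves here, which is exactly the
// "alive but stuck" symptom operators see.
```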
So from an operator point of view, what you see is:
- block processing stops moving normally
- the node starts lagging
- the backup RPCs do not seem to take over decisively
That is the problem I think we actually need to solve.
Option A: The Minimal Fix
This is the smallest practical path.
The idea here is not to redesign the whole RPC layer. It is to make the existing streamer stop behaving badly under RPC failure.
That would mean things like:
- giving block fetches a hard timeout
- marking a bad node unhealthy after repeated failures
- retrying the current block on another node instead of waiting forever
- making node cooldown and recovery explicit
- making sure rotation actually changes which node gets used next
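To show roughly what I mean, here is a sketch of the minimal fix. Every name, timeout, and threshold below is a placeholder I picked for illustration, not the actual patch:

```javascript
// Option A sketch: per-block timeout, per-node failure counting,
// cooldown, and retry on the next node. Numbers are illustrative.
const FETCH_TIMEOUT_MS = 5000; // hard cap on a single block fetch
const MAX_FAILURES = 3;        // failures before a node is benched
const COOLDOWN_MS = 60000;     // how long a benched node sits out

function withTimeout(promise, ms) {
  return Promise.race([
    promise,
    new Promise((_, reject) => {
      const t = setTimeout(() => reject(new Error('rpc timeout')), ms);
      if (t.unref) t.unref(); // do not keep the process alive for this timer
    }),
  ]);
}

function isUsable(node, now) {
  // A benched node becomes usable again once its cooldown expires.
  return node.failures < MAX_FAILURES || now - node.benchedAt > COOLDOWN_MS;
}

async function fetchBlock(nodes, blockNum) {
  const now = Date.now();
  // Retry THIS block across the usable nodes instead of waiting
  // forever on whichever one happens to be first.
  for (const node of nodes.filter((n) => isUsable(n, now))) {
    try {
      const block = await withTimeout(
        node.client.getBlock(blockNum),
        FETCH_TIMEOUT_MS,
      );
      node.failures = 0; // success: the node recovers immediately
      return block;
    } catch (err) {
      node.failures += 1;
      if (node.failures >= MAX_FAILURES) node.benchedAt = Date.now();
      // fall through and try the same block on the next node
    }
  }
  throw new Error(`block ${blockNum}: all configured RPC nodes failed`);
}
```

The point of the sketch is the shape, not the constants: one hard timeout, one explicit health state per node, and a retry loop that moves on instead of waiting.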
This is the "fix the bug without starting a philosophy debate" option.
Why I Like It
- it solves the operator pain faster
- it is easier to review
- it is less invasive
- it is much easier to sell as a targeted reliability patch
Why I Do Not Love It
It still leaves failover behavior living inside streamer-specific logic instead of one shared RPC layer.
In other words, it is a good fix, but not necessarily the cleanest design.
Option B: The Redesign
This is the "do it properly" path.
Instead of having the streamer partially own node scheduling and failover behavior, I would move Hive RPC access into a dedicated abstraction and make the streamer consume that.
Something like:
- one RPC manager
- one place that owns node health
- one place that owns retry and failover policy
- one place that can answer "which node are we using and why?"
Under that model, the streamer stops caring about node order directly and just asks for blocks or global props through a shared interface.
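As a rough sketch of that shape, this is the kind of interface I have in mind. Everything here is a design assumption to make the idea concrete; the real interface would come out of review:

```javascript
// Option B sketch: one RPC manager that owns node health and
// failover policy. Names and thresholds are illustrative.
class RpcManager {
  constructor(urls, makeClient) {
    this.nodes = urls.map((url) => ({
      url,
      client: makeClient(url),
      healthy: true,
      failures: 0,
    }));
  }

  // The one place that can answer "which node are we using and why?"
  currentNode() {
    return this.nodes.find((n) => n.healthy) || this.nodes[0];
  }

  // Shared retry/failover policy for every kind of call, so callers
  // never have to think about node order.
  async call(method, ...args) {
    let lastErr;
    for (const node of this.nodes) {
      if (!node.healthy) continue;
      try {
        const result = await node.client[method](...args);
        node.failures = 0;
        return result;
      } catch (err) {
        lastErr = err;
        node.failures += 1;
        if (node.failures >= 3) node.healthy = false; // illustrative threshold
      }
    }
    throw lastErr || new Error('no healthy RPC nodes');
  }

  getBlock(n) { return this.call('getBlock', n); }
  getGlobalProps() { return this.call('getGlobalProps'); }
}
```

Under that model the streamer just calls `rpc.getBlock(n)` and never touches node order, health, or retries itself.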
That is the version that makes more sense to me architecturally.
Why I Like It
- cleaner ownership
- easier to reason about
- better long-term maintainability
- better base for logging, metrics, and future improvements
Why I Would Not Rush It Blindly
This is a larger change.
And larger changes in infrastructure code always cost more than the first diff makes them look.
This path would need:
- broader testing
- more careful review
- more buy-in from people who care about node stability and backward compatibility
So while I think it is the better long-term design, I also think it is the heavier community conversation.
If I Had To Recommend One
If the goal is:
"stop nodes from hanging behind a bad RPC as quickly and safely as possible"
then I would recommend Option A first.
If the goal is:
"clean up the architecture so this whole class of problem is handled properly going forward"
then I would recommend Option B.
My honest answer is that I think there is a decent chance both are valid in sequence:
- land the minimal reliability fix
- later discuss whether the RPC layer deserves a more formal redesign
That tends to be the way mature infrastructure evolves anyway. First stop the bleeding, then decide whether the design itself needs surgery.
Why I Am Writing This Before Coding It
Because this is exactly the kind of change where the implementation is not the whole decision.
A targeted failover patch is one kind of change.
A streamer or RPC-layer redesign is a different kind of change.
They both solve the same visible problem, but they are not the same commitment.
So before I go from "I found the issue" to "here is the PR," I would rather be explicit about what the two paths look like and let the team and community weigh in on what level of change they actually want.
The Short Version
If I were pitching this in one paragraph, it would be this:
the current Hive-Engine streamer does not fail over cleanly when a primary Hive RPC degrades or dies, and I see two reasonable fixes: a minimal targeted repair to make failover actually work in the current design, or a more complete redesign that moves RPC health and failover into a dedicated layer.
I think Option A is the easier immediate sell.
I think Option B is the better long-term architecture.
And I think it is worth deciding that deliberately instead of pretending they are the same size change.
As always,
Michael Garcia a.k.a. TheCrazyGM