Hey everyone,
Lately I have been digging into a streamer issue on the Hive-Engine node that I think a lot of operators have probably felt, even if they did not immediately know where the problem lived.
The short version is this:
if your first configured Hive RPC goes bad, the node does not always fail over the way you would expect.
Instead of cleanly moving on to the next healthy RPC, it can get stuck waiting on the bad one, start falling behind, and just sort of sit there looking alive while making less progress than it should.
That is exactly the kind of issue that is annoying for operators because it does not always look like a hard crash. Sometimes it looks more like the node just got dumb and slow.
After tracing through the streamer logic, I do not think this is just bad luck or one flaky endpoint. I think the failover behavior in the current implementation is weaker than people assume it is.
So this post is not code yet. This is the architecture discussion I would want to have before I go change it.
There are really two reasonable paths here.
What Is Going Wrong
At a high level, the current streamer is not treating streamNodes like a true failover chain for block reads.
What it is doing instead is closer to this:
- create separate clients for separate nodes
- hand block fetches to those clients in a fixed order
- rotate the configured list only after a harder stream-level failure
That means a bad first node can still end up pinning the current block fetch even when other nodes are healthy.
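To make that failure mode concrete, here is a deliberately simplified sketch of what "pinning" means in practice. The stub clients and names here are mine for illustration, not the actual Hive-Engine code:

```javascript
// Hypothetical sketch of the failure mode, NOT the real streamer code.
// A stub stands in for a real Hive RPC client.
function makeClient(url, healthy) {
  return {
    url,
    getBlock: (n) =>
      healthy
        ? Promise.resolve({ block_num: n, from: url })
        : new Promise(() => {}), // a dead node: this promise never settles
  };
}

const clients = [
  makeClient('https://bad-rpc.example', false), // first configured node is down
  makeClient('https://good-rpc.example', true), // healthy backup, never asked
];

// Roughly the shape of the problem: the fetch is pinned to the first
// client, so a hung primary hangs the block stream even though a
// healthy backup is sitting right next to it in the list.
function getBlock(blockNum) {
  return clients[0].getBlock(blockNum);
}

// getBlock(123) never resolves here, which is exactly the
// "alive but stuck" symptom operators see.
```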
So from an operator point of view, what you see is:
- block processing stops moving normally
- the node starts lagging
- the backup RPCs do not seem to take over decisively
That is the problem I think we actually need to solve.
Option A: The Minimal Fix
This is the smallest practical path.
The idea here is not to redesign the whole RPC layer. It is to make the existing streamer stop behaving badly under RPC failure.
That would mean things like:
- giving block fetches a hard timeout
- marking a bad node unhealthy after repeated failures
- retrying the current block on another node instead of waiting forever
- making node cooldown and recovery explicit
- making sure rotation actually changes which node gets used next
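To show roughly what I mean, here is a sketch of the minimal fix. Every name, timeout, and threshold below is a placeholder I picked for illustration, not the actual patch:

```javascript
// Option A sketch: per-block timeout, per-node failure counting,
// cooldown, and retry on the next node. Numbers are illustrative.
const FETCH_TIMEOUT_MS = 5000; // hard cap on a single block fetch
const MAX_FAILURES = 3;        // failures before a node is benched
const COOLDOWN_MS = 60000;     // how long a benched node sits out

function withTimeout(promise, ms) {
  return Promise.race([
    promise,
    new Promise((_, reject) => {
      const t = setTimeout(() => reject(new Error('rpc timeout')), ms);
      if (t.unref) t.unref(); // do not keep the process alive for this timer
    }),
  ]);
}

function isUsable(node, now) {
  // A benched node becomes usable again once its cooldown expires.
  return node.failures < MAX_FAILURES || now - node.benchedAt > COOLDOWN_MS;
}

async function fetchBlock(nodes, blockNum) {
  const now = Date.now();
  // Retry THIS block across the usable nodes instead of waiting
  // forever on whichever one happens to be first.
  for (const node of nodes.filter((n) => isUsable(n, now))) {
    try {
      const block = await withTimeout(
        node.client.getBlock(blockNum),
        FETCH_TIMEOUT_MS,
      );
      node.failures = 0; // success: the node recovers immediately
      return block;
    } catch (err) {
      node.failures += 1;
      if (node.failures >= MAX_FAILURES) node.benchedAt = Date.now();
      // fall through and try the same block on the next node
    }
  }
  throw new Error(`block ${blockNum}: all configured RPC nodes failed`);
}
```

The point of the sketch is the shape, not the constants: one hard timeout, one explicit health state per node, and a retry loop that moves on instead of waiting.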
This is the "fix the bug without starting a philosophy debate" option.
Why I Like It
- it solves the operator pain faster
- it is easier to review
- it is less invasive
- it is much easier to sell as a targeted reliability patch
Why I Do Not Love It
It still leaves failover behavior living inside streamer-specific logic instead of one shared RPC layer.
In other words, it is a good fix, but not necessarily the cleanest design.
Option B: The Redesign
This is the "do it properly" path.
Instead of having the streamer partially own node scheduling and failover behavior, I would move Hive RPC access into a dedicated abstraction and make the streamer consume that.
Something like:
- one RPC manager
- one place that owns node health
- one place that owns retry and failover policy
- one place that can answer "which node are we using and why?"
Under that model, the streamer stops caring about node order directly and just asks for blocks or global props through a shared interface.
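As a rough sketch of that shape, this is the kind of interface I have in mind. Everything here is a design assumption to make the idea concrete; the real interface would come out of review:

```javascript
// Option B sketch: one RPC manager that owns node health and
// failover policy. Names and thresholds are illustrative.
class RpcManager {
  constructor(urls, makeClient) {
    this.nodes = urls.map((url) => ({
      url,
      client: makeClient(url),
      healthy: true,
      failures: 0,
    }));
  }

  // The one place that can answer "which node are we using and why?"
  currentNode() {
    return this.nodes.find((n) => n.healthy) || this.nodes[0];
  }

  // Shared retry/failover policy for every kind of call, so callers
  // never have to think about node order.
  async call(method, ...args) {
    let lastErr;
    for (const node of this.nodes) {
      if (!node.healthy) continue;
      try {
        const result = await node.client[method](...args);
        node.failures = 0;
        return result;
      } catch (err) {
        lastErr = err;
        node.failures += 1;
        if (node.failures >= 3) node.healthy = false; // illustrative threshold
      }
    }
    throw lastErr || new Error('no healthy RPC nodes');
  }

  getBlock(n) { return this.call('getBlock', n); }
  getGlobalProps() { return this.call('getGlobalProps'); }
}
```

Under that model the streamer just calls `rpc.getBlock(n)` and never touches node order, health, or retries itself.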
That is the version that makes more sense to me architecturally.
Why I Like It
- cleaner ownership
- easier to reason about
- better long-term maintainability
- better base for logging, metrics, and future improvements
Why I Would Not Rush It Blindly
This is a larger change.
And larger changes in infrastructure code always cost more than the first diff makes them look.
This path would need:
- broader testing
- more careful review
- more buy-in from people who care about node stability and backward compatibility
So while I think it is the better long-term design, I also think it is the heavier community conversation.
If I Had To Recommend One
If the goal is:
"stop nodes from hanging behind a bad RPC as quickly and safely as possible"
then I would recommend Option A first.
If the goal is:
"clean up the architecture so this whole class of problem is handled properly going forward"
then I would recommend Option B.
My honest answer is that I think there is a decent chance both are valid in sequence:
- land the minimal reliability fix
- later discuss whether the RPC layer deserves a more formal redesign
That tends to be the way mature infrastructure evolves anyway. First stop the bleeding, then decide whether the design itself needs surgery.
Why I Am Writing This Before Coding It
Because this is exactly the kind of change where the implementation is not the whole decision.
A targeted failover patch is one kind of change.
A streamer or RPC-layer redesign is a different kind of change.
They both solve the same visible problem, but they are not the same commitment.
So before I go from "I found the issue" to "here is the PR," I would rather be explicit about what the two paths look like and let the team and community weigh in on what level of change they actually want.
The Short Version
If I were pitching this in one paragraph, it would be this:
the current Hive-Engine streamer does not fail over cleanly when a primary Hive RPC degrades or dies, and I see two reasonable fixes: a minimal targeted repair to make failover actually work in the current design, or a more complete redesign that moves RPC health and failover into a dedicated layer.
I think Option A is the easier immediate sell.
I think Option B is the better long-term architecture.
And I think it is worth deciding that deliberately instead of pretending they are the same size change.
As always,
Michael Garcia a.k.a. TheCrazyGM