Hey everyone,
I usually write posts sharing cool technical implementations or updates on my projects. Today's post is a bit different. It's a detective story, and I'm currently standing in the middle of the crime scene.
After I set up what I considered to be a fairly quiet, niche Flask application, all was well, for a moment. A small part of its job is resolving blockchain posts from the Hive (and previously Steem) networks, serving up content based on the standard @username/permlink URL structure. It's useful, but it's not exactly Netflix.
Or so I thought.
Recently, I noticed my server resources creeping up. Memory usage on my Gunicorn workers was hitting 20%, and things just felt sluggish. I decided to pop the hood and look at the Caddy logs, expecting to see maybe a burst of traffic or a buggy script.
What I found instead was staggering.
The Scale of the Attack
In a single 24-hour period, my unassuming little server processed requests from over 128,000 unique IP addresses.
Let that sink in. This wasn't one over-enthusiastic person hitting refresh. It wasn't even a standard "loud" scraper hitting me thousands of times from a single server. This was a highly coordinated, distributed botnet attack.
The pattern is insidious. It's a "low and slow" attack. Each individual IP address might only hit the server once or twice a day. Standard rate-limiting (which looks for too many requests from one IP in a short timeframe) is completely useless against this strategy.
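To make that concrete, here is a minimal Python sketch that tallies requests per IP from an access log and checks how many addresses a per-IP limit would actually catch. It assumes Caddy is writing JSON access logs with the client address under request.remote_ip; the sample lines, the function names, and the limit of 10 are made up for illustration and are not real log data.

```python
import json
from collections import Counter

# Hypothetical sample lines in the shape of a JSON access log
# (assumption: the client address lives under request.remote_ip).
SAMPLE_LOG = [
    '{"request": {"remote_ip": "203.0.113.5", "uri": "/@alice/post-one"}, "status": 200}',
    '{"request": {"remote_ip": "203.0.113.6", "uri": "/@bob/post-two"}, "status": 200}',
    '{"request": {"remote_ip": "203.0.113.5", "uri": "/@carol/post-three"}, "status": 200}',
    '{"request": {"remote_ip": "203.0.113.7", "uri": "/@dave/post-four"}, "status": 200}',
]


def hits_per_ip(log_lines):
    """Count requests per client IP, skipping lines that are not valid JSON objects."""
    counts = Counter()
    for line in log_lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue
        if not isinstance(entry, dict):
            continue
        request = entry.get("request")
        ip = request.get("remote_ip") if isinstance(request, dict) else None
        if ip:
            counts[ip] += 1
    return counts


def ips_over_limit(counts, limit):
    """Return the IPs a simple per-IP request limit would flag."""
    return [ip for ip, n in counts.items() if n > limit]


counts = hits_per_ip(SAMPLE_LOG)
print(f"{len(counts)} unique IPs, {len(ips_over_limit(counts, 10))} over the limit")
```

When every address shows up only once or twice, the list of flagged IPs stays empty no matter how many unique addresses there are, which is exactly why per-IP rate limiting never fires.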
The Anatomy of a Scraper
As I dug deeper, it became clear this wasn't legitimate traffic.
1. Ignoring the Rules: The very first thing a legitimate crawler (like Google or Bing) does is check robots.txt to see what it is allowed to index. This botnet completely ignores mine, which disallows everything:
User-agent: *
Disallow: /
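For reference, this is roughly what a well-behaved crawler would conclude from those two lines. The sketch uses Python's standard urllib.robotparser; the bot name and the post path are placeholders, not anything seen in the logs.

```python
from urllib.robotparser import RobotFileParser

# The same two rules shown above.
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""


def is_allowed(user_agent, path):
    """Ask the stdlib robots.txt parser whether user_agent may fetch path."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, path)


# "ExampleBot" and the path are hypothetical placeholders.
print(is_allowed("ExampleBot", "/@someuser/some-post"))
```

A compliant crawler gets a "no" for every path on the site, so anything still fetching posts is either not reading robots.txt or not honoring it.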
2. The Infrastructure: When I analyzed the top offending IP ranges, they weren't residential internet users. They belonged to cheap VPS hosting providers and known data centers, places like ColoCrossing and RackNerd. Weirdly, a massive amount of traffic was also routed through Azure and DigitalOcean data centers. These are classic launchpads for cheap, disposable compute power. Two sample lookups:
[
  {
    "ip": "192.210.150.198",
    "network": "192.210.150.0/23",
    "asn": "AS36352",
    "org": "AS-COLOCROSSING"
  },
  {
    "ip": "195.178.110.199",
    "network": "195.178.110.0/24",
    "asn": "AS48090",
    "org": "Techoff Srv Limited"
  }
]
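Once each IP is annotated like this, tallying by hosting provider is only a few lines. The sketch below is one possible way to do it, assuming records shaped like the two above; how the ASN lookup itself is done is not covered here, and the function name is made up for illustration.

```python
import ipaddress
from collections import Counter

# The two records shown above.
LOOKUPS = [
    {"ip": "192.210.150.198", "network": "192.210.150.0/23",
     "asn": "AS36352", "org": "AS-COLOCROSSING"},
    {"ip": "195.178.110.199", "network": "195.178.110.0/24",
     "asn": "AS48090", "org": "Techoff Srv Limited"},
]


def tally_by_org(records):
    """Count offending IPs per (asn, org), keeping only self-consistent records."""
    tally = Counter()
    for rec in records:
        addr = ipaddress.ip_address(rec["ip"])
        net = ipaddress.ip_network(rec["network"], strict=False)
        # Sanity check: the IP should fall inside the network it was mapped to.
        if addr in net:
            tally[(rec["asn"], rec["org"])] += 1
    return tally


for (asn, org), count in tally_by_org(LOOKUPS).most_common():
    print(asn, org, count)
```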
3. The "Fake Human" Behavior: The most frustrating part is what they are targeting. They aren't just hitting 404s, probing for vulnerabilities, or getting 403 access-denied responses. They are hitting valid, real pages. They are scraping actual Hive/Steem posts. Because these pages return a "200 OK" status, filtering the bots out without blocking real users is incredibly difficult.
4. The Spam Signups: Alongside the scraping, I noticed an uptick in bogus account creations. They use data-broker style email addresses, a fake name, and randomly generated usernames that are usually 6-8 letters of gibberish. It's clear they are trying to gain write access to the platform, likely to comment-spam links.
The Billion-Dollar Question: WHY?
This is what keeps me up at night. Why go through the immense trouble of renting or compromising 120,000+ IP addresses just to scrape old blockchain data? Most of the content they are pulling is from the Steem era, so it's many, many years old.
I have a few theories.
Theory 1: The AI Hunger Games (Most Likely)
We are in the golden age of Large Language Models (LLMs). These models require absolutely unfathomable amounts of text data to train. The Hive/Steem blockchain is a goldmine of public, immutable, varied human text. I strongly suspect my site is being used as a straw to suck up training data for some entity's new AI model. They need the text, and they don't care about my server bills.
Theory 2: The Content Farms
SEO spam is still a massive industry. Scrapers pull existing content, spin it using basic AI tools to make it look "unique," and repost it on ad-filled spam sites to game Google rankings. Blockchain content is easy pickings for this.
The fake signups support both theories: they either want to post spam links back to their content farms, or they are testing credential lists to see if they work elsewhere.
Fighting Back
I couldn't just let the server melt. I had to get creative with mitigation.
Since standard rate limiting failed, I moved to behavioral analysis. I shifted the Gunicorn setup to Unix sockets for better performance under load and started pre-filtering aggressive user agents right at the Caddy edge.
I also realized that trying to block 150,000+ IPs individually is whack-a-mole. Instead, I identified the worst-offending data center subnets and blocked entire /24 CIDR blocks in the firewall.
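Finding those subnets can be sketched with Python's ipaddress module: bucket every offending address into its /24 and keep the blocks that contain several distinct offenders. The function name, the threshold of 3, and the sample addresses (documentation ranges) are assumptions for illustration, not the values actually used.

```python
import ipaddress
from collections import defaultdict


def worst_subnets(ips, min_distinct=3, prefix=24):
    """Group IPv4 addresses by subnet and return the busiest blocks.

    Returns (cidr, distinct_ip_count) pairs, most offenders first,
    keeping only blocks with at least min_distinct addresses.
    """
    buckets = defaultdict(set)
    for raw in ips:
        try:
            addr = ipaddress.ip_address(raw)
        except ValueError:
            continue  # skip anything that is not a valid address
        if addr.version != 4:
            continue  # this sketch only handles IPv4
        # strict=False masks off the host bits to get the enclosing block.
        net = ipaddress.ip_network(f"{addr}/{prefix}", strict=False)
        buckets[net].add(addr)
    ranked = sorted(buckets.items(), key=lambda kv: len(kv[1]), reverse=True)
    return [(str(net), len(addrs)) for net, addrs in ranked
            if len(addrs) >= min_distinct]


# Hypothetical offenders drawn from documentation address ranges.
offenders = ["198.51.100.4", "198.51.100.77", "198.51.100.200", "203.0.113.9"]
for cidr, count in worst_subnets(offenders):
    print(cidr, count)
```

The resulting CIDR strings can be fed to whatever firewall is in use. A threshold matters here, since blocking a whole /24 on the strength of a single hit risks catching legitimate neighbors.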
But my favorite defense is the honeypot. I implemented hidden links in the HTML that humans can't see, but bots blindly follow. As soon as an IP hits that trap URL, Fail2Ban instantly slaps a one-week ban on them. It's incredibly satisfying to watch them ban themselves, even if it's very slow going.
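Here is one way the application side of such a trap could look, as a small WSGI middleware that works with any WSGI app, Flask included. Treat it as a sketch under assumptions: the trap path, the log message format, and the reliance on an X-Forwarded-For header from the reverse proxy are all hypothetical, and the real setup may match the trap URL directly in the web server logs instead.

```python
import logging

# Hypothetical trap URL; it should be a path no real page links to visibly.
TRAP_PATH = "/__trap__/do-not-follow"

log = logging.getLogger("honeypot")


def hidden_link_html(trap_path=TRAP_PATH):
    """Return an anchor that is hidden from people but present in the markup."""
    return (f'<a href="{trap_path}" style="display:none" rel="nofollow" '
            f'tabindex="-1" aria-hidden="true">archive</a>')


def client_ip(environ):
    """Best-effort client address behind a single reverse proxy."""
    forwarded = environ.get("HTTP_X_FORWARDED_FOR", "")
    if forwarded:
        # The right-most entry is the one appended by the nearest proxy.
        return forwarded.split(",")[-1].strip()
    return environ.get("REMOTE_ADDR") or "unknown"


class HoneypotMiddleware:
    """Log every hit on the trap path so a log watcher can ban the address."""

    def __init__(self, app, trap_path=TRAP_PATH):
        self.app = app
        self.trap_path = trap_path

    def __call__(self, environ, start_response):
        if environ.get("PATH_INFO") == self.trap_path:
            # Assumed log format; a Fail2Ban filter would need a matching regex.
            log.warning("honeypot hit from %s", client_ip(environ))
            body = b"Not Found"
            start_response("404 Not Found", [
                ("Content-Type", "text/plain"),
                ("Content-Length", str(len(body))),
            ])
            return [body]
        return self.app(environ, start_response)
```

With Flask, the usual way to install a middleware like this is app.wsgi_app = HoneypotMiddleware(app.wsgi_app). Since robots.txt already disallows everything, any crawler that honors it never reaches the trap, so only rule-breakers end up banned.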
The Endgame
I'm getting a handle on the traffic now, but the "why" still nags at me.
Does anyone else out there host a site that resolves @username/permlink style posts for Hive or Steem? Are you seeing similar patterns? Is this a targeted attack against my specific DApp, or is every blockchain explorer getting hammered right now by this same hungry botnet?
Let me know in the comments if you're in the same boat. It's a wild time to be hosting public data on the open web.
As always,
Michael Garcia a.k.a. TheCrazyGM