Photo by Manuel Geissinger on Pexels
So last week Cloudflare pushed a bad BGP config in their Ashburn data center and took down half the internet for four hours. AWS went with it because, it turns out, a massive chunk of AWS customer traffic routes through Cloudflare for DDoS protection. One config update, one facility in Virginia, and suddenly X is down, thousands of sites are unreachable, and businesses in dozens of countries can't operate.
I keep thinking about this because I've been on the other end of this exact type of failure, just at a much smaller scale. Working at a hosting company in Lagos a few years ago, I fat-fingered a firewall rule and locked the entire office out of the management network. Took me 45 minutes to drive to the datacenter and fix it physically because there was no out-of-band access. My boss didn't speak to me for two days ;-)
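The guard I wish I'd had that day is commit-confirm: apply the risky rule, and have it revert itself unless you explicitly say "keep it" within a few minutes. Here's a rough sketch of the idea in Python, assuming a Linux box with iptables; the rule, the timeout, and the whole script are placeholders, not anything I actually ran back then:

```python
#!/usr/bin/env python3
"""Commit-confirm for risky firewall changes: apply the rule, then revert
automatically unless someone confirms in time. The rule below is a
placeholder, not a real policy."""

import select
import subprocess
import sys

CONFIRM_WINDOW = 300  # seconds to type "yes" before the auto-revert fires

# Snapshot the current ruleset so there is always a known-good state to restore.
snapshot = subprocess.run(["iptables-save"], check=True,
                          capture_output=True, text=True).stdout

# Apply the risky change (placeholder rule, for illustration only).
subprocess.run(["iptables", "-A", "INPUT", "-p", "tcp", "--dport", "8443",
                "-j", "DROP"], check=True)

print(f"Rule applied. Type 'yes' within {CONFIRM_WINDOW}s to keep it.")
ready, _, _ = select.select([sys.stdin], [], [], CONFIRM_WINDOW)
confirmed = bool(ready) and sys.stdin.readline().strip().lower() == "yes"

if confirmed:
    print("Change confirmed, keeping it.")
else:
    # No confirmation -- maybe we just locked ourselves out. Restore the snapshot.
    subprocess.run(["iptables-restore"], input=snapshot, text=True, check=True)
    print("No confirmation, reverted to the saved ruleset.")
```

Run something like that in tmux or under nohup so the revert still fires even if the rule kills your SSH session. Not fancy, but it turns a 45-minute drive into a 5-minute wait.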
But that's a company of 30 people. Cloudflare is routing traffic for what, 20% of the web? And the failsafe against a bad BGP announcement was apparently "hope the engineer doing the config change doesn't make a typo"?? The post-incident report talks about cascading failures and traffic surges, but honestly, the root cause is that we built the internet on the assumption that a handful of companies would never screw up simultaneously. And that's not engineering, that's faith.
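And "engineering" here doesn't have to mean anything exotic. I have no idea what Cloudflare's deploy pipeline actually looks like, but even a dumb pre-push check that rejects a prefix list with a typo in it is the kind of thing I mean. A toy sketch, with made-up prefixes and made-up policy:

```python
#!/usr/bin/env python3
"""Toy pre-push check for a list of prefixes we intend to announce.
Not anyone's real tooling -- just the shape of 'validate before you push'.
The allowed supernets are made-up example policy."""

import ipaddress
import sys

# Address space this site is actually allowed to originate (example values).
ALLOWED_SUPERNETS = [
    ipaddress.ip_network("198.51.100.0/24"),
    ipaddress.ip_network("203.0.113.0/24"),
]

def validate(path: str) -> list[str]:
    """Return human-readable problems with the candidate announcement list."""
    problems = []
    with open(path) as fh:
        for lineno, raw in enumerate(fh, start=1):
            line = raw.strip()
            if not line or line.startswith("#"):
                continue
            try:
                prefix = ipaddress.ip_network(line)
            except ValueError:
                problems.append(f"line {lineno}: {line!r} is not a valid prefix")
                continue
            inside = any(prefix.version == net.version and prefix.subnet_of(net)
                         for net in ALLOWED_SUPERNETS)
            if not inside:
                problems.append(f"line {lineno}: {prefix} is outside our allocations")
    return problems

if __name__ == "__main__":
    issues = validate(sys.argv[1])
    for issue in issues:
        print("REJECTED:", issue)
    sys.exit(1 if issues else 0)
```

Thirty lines of paranoia, wired into CI so the change can't ship with a failing check. Faith is optional.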
What bothers me most is the concentration. Before the cloud era, if one hosting provider went down, their customers were affected. Full stop. Now when Cloudflare hiccups, AWS goes down too because they're interdependent. When AWS goes down, half of SaaS goes with it. The blast radius of a single mistake has grown exponentially, but the safety mechanisms haven't kept up.
In Nigeria we deal with this differently btw. Not because we're smarter, but because we don't have the luxury of trusting infrastructure to stay up. You run local backups, you have generator power (NEPA wahala, anyone?), you plan for the internet to be unreliable because it IS unreliable. Every sysadmin I know here has a plan B and a plan C and usually a plan D that involves driving somewhere with a USB stick. American companies just... assume the cloud will be there? And when it isn't, they act surprised??
idk, maybe I'm being unfair. But after watching the Salt Typhoon thing (backdoors mandated by government, compromised by foreign intelligence) and now this (centralized infrastructure, a single point of failure taking down everything)... it feels like the same pattern. Build something brittle, act shocked when it breaks.
anyway, if you're running anything in production, maybe think about what happens when your CDN provider has a bad day. Because apparently the answer for half the internet was "nothing works anymore".
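Even the most basic fallback beats that. A minimal sketch of what I mean, with placeholder hostnames you'd swap for your own CDN and origin: try the edge first, and if it times out or errors, hit the origin directly.

```python
#!/usr/bin/env python3
"""Try the CDN first, fall back to the origin if the CDN is having a bad day.
Hostnames are placeholders -- substitute your own CDN and origin."""

import urllib.request

# Ordered list of places to try: the edge first, then the origin directly.
ENDPOINTS = [
    "https://cdn.example.com",     # normal path, cached at the edge
    "https://origin.example.com",  # slower and pricier, but it's yours
]

def fetch(path: str, timeout: float = 3.0) -> bytes:
    last_error = None
    for base in ENDPOINTS:
        try:
            with urllib.request.urlopen(base + path, timeout=timeout) as resp:
                return resp.read()
        except OSError as exc:  # URLError, HTTPError and timeouts all land here
            last_error = exc    # note it and try the next endpoint
    raise RuntimeError(f"all endpoints failed, last error: {last_error}")

if __name__ == "__main__":
    print(len(fetch("/health")), "bytes fetched")
```

It won't absorb a DDoS for you, and your origin bill will hurt for a few hours. But "slow and expensive" beats "down", and at least your plan B isn't "refresh the Cloudflare status page".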