How a Router's Missed Range Check Nearly Crashed the Internet
Barlaam writes "A bug by router vendor A (omitting a range check from a critical field in the configuration interface) tickled a bug from router vendor B (dropping BGP sessions when processing some ASPATH attributes with length very close to 256), causing a ripple effect that caused widespread global routing instability last week. The flaw lay dormant until one of vendor A's systems was deployed in an autonomous system whose ASN, modulo 256, was greater than 250. At that point, the Internet was one typo away from disaster. Other router vendors, who were not affected by the bug, happily propagated the trigger message to every vulnerable system on the planet in about 30 seconds. Few people appreciate how fragile and unsecured the Internet's trust-based critical infrastructure really is — this is just the latest example." Vendor A, in this case, is a Latvian router vendor called MikroTik.
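The summary doesn't name the exact field. However, the "ASN modulo 256 greater than 250" condition is what you would expect if an AS number were typed into a prepend-count field that only keeps one octet. The Python sketch below assumes exactly that. The function names, the upper bound of 16 and the ASN 65531 are hypothetical values chosen for illustration. None of them come from any vendor's code.

```python
# Hypothetical sketch: a configuration field stored in a single octet that
# accepts arbitrary integers without a range check.
MAX_SANE_PREPEND = 16  # assumption: an illustrative bound, not a vendor limit


def store_unchecked(value: int) -> int:
    """Silently keep only the low 8 bits, as an unchecked one-octet field would."""
    return value & 0xFF


def store_checked(value: int) -> int:
    """Reject anything outside a sane range instead of truncating it."""
    if not 1 <= value <= MAX_SANE_PREPEND:
        raise ValueError(f"prepend count {value} out of range 1..{MAX_SANE_PREPEND}")
    return value


asn_typed_by_mistake = 65531  # hypothetical ASN pasted into the count field
print(store_unchecked(asn_typed_by_mistake))  # 251
```

With the range check in place, the bad value is rejected when the router is configured. Without it, the value silently becomes a prepend count of 251, which produces the kind of AS path length close to 256 that the summary says triggered vendor B's bug.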
If people had upgraded their routers, this wouldn't have happened. Newsflash: software has bugs. Not upgrading your software will bite you in the ass eventually, especially if that software runs critical systems like your routers.
It seems like we live in a world now where media go ridiculously out of their way to soften the blow and protect the parties who screwed up and shipped software that had mistakes in it, by playing PR on their behalf and hiding their name.
They had a bug, and they deserve to be called out on that. Authors should be honest and direct and always mention them by name. ESPECIALLY in this case, so that people who bought their product KNOW they need to update, even if they didn't notice they were affected by the bug. Not everyone affected necessarily knows what caused their problems, and a lot of people may still be wide open to the bug without knowing it.
Seriously, if you develop an implementation of an exterior routing protocol that untrusted devices participate in BY DESIGN...
How do you justify NOT taking basic steps to validate what happens in your implementation if another party decides to play dirty and hits you with a ridiculously long or corrupt entry in a field (like the AS path)?
How does your QA team miss the consequences such a case can have on your re-advertisement of that long path? And how do they miss testing that the result you send is still valid, or that you at least block it properly?
It doesn't mean they're totally inept. I'm sure their QA team does a lot of good work. But something fundamental seems to be missing if this sort of elementary bug slips through the cracks.
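What the comment asks for can be sketched using the AS_PATH rules from RFC 4271. Each path segment stores the number of ASes in a single octet, so one segment can hold at most 255. A router that prepends its own AS to a full AS_SEQUENCE is supposed to start a new segment rather than overflow the count. The Python below is a minimal model based on those rules, not any vendor's implementation. Paths are represented as lists of (type, ASN list) tuples. Treating an empty segment as invalid is a policy choice made for this sketch.

```python
from typing import List, Tuple

AS_SET, AS_SEQUENCE = 1, 2  # segment type codes from RFC 4271
MAX_SEGMENT_LEN = 255       # the segment length field is a single octet

Segment = Tuple[int, List[int]]


def validate(path: List[Segment]) -> None:
    """Reject malformed segments before accepting or re-advertising a path."""
    for seg_type, asns in path:
        if seg_type not in (AS_SET, AS_SEQUENCE):
            raise ValueError(f"unknown segment type {seg_type}")
        # Assumption: empty segments are treated as malformed in this sketch.
        if not 1 <= len(asns) <= MAX_SEGMENT_LEN:
            raise ValueError(f"segment length {len(asns)} out of range")


def prepend(path: List[Segment], asn: int) -> List[Segment]:
    """Prepend our ASN, opening a new AS_SEQUENCE instead of overflowing."""
    if path and path[0][0] == AS_SEQUENCE and len(path[0][1]) < MAX_SEGMENT_LEN:
        first_type, first_asns = path[0]
        return [(first_type, [asn] + first_asns)] + path[1:]
    # The first segment is full, is an AS_SET, or the path is empty.
    return [(AS_SEQUENCE, [asn])] + path
```

The validate() function covers the second half of the comment: it checks what you are about to send as well as what you receive. An oversized segment is therefore split or blocked locally instead of being passed on to the neighbor.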
It may be hard on them PR wise, but the public deserves to know the facts, without the names being changed to protect the guilty.
Except in the kdawson style, it was a single link to a message board posting about a router "taking out half the internet." Dupe? Correction? I don't care, as long as kdawson is kept away from the site for a while.
It just amazes me how differently presented this story is compared with the previous.
Previous story: kdawson. Current story: Timothy. Do you need any more explanation than that?
It seems like we live in a world now where media go ridiculously out of their way to soften the blow and protect the parties who screwed up and shipped software that had mistakes in it, by playing PR on their behalf and hiding their name.
Well, that may be the case, but here the criticism doesn't really seem deserved. For better or worse, /. generally posts exactly what was written by the person who submitted the article. Blame that person for trying to "soften" the blow.
I want peace on earth and goodwill toward man.
We are the United States Government! We don't do that sort of thing.
... the crash will take out the entire interwebs for a full week. Wouldn't it be amazing if mankind as a whole had to "survive" an entire week without the internet, that killer of face-to-face interaction? What's even more pathetic, I suppose, is how much we depend on it now: countries would go into widespread panic if the internet were lost for a single week. Isn't it sad that something that didn't even exist 30 years ago is now considered a bare necessity? Oh, the priorities of man.
Maybe if they had updated their IOS back in 2003, when Cisco came out with the fix, they wouldn't have these problems. You wouldn't give an XP user a pass for not updating for 6 years and then having a problem, so don't give these upstreams one either.
-zifr
Just another reason for Cisco to open-source IOS and sell their hardware and service instead. :-)
IOS has been famously pirated along with its hardware by Chinese knock-offs for years now.
Might as well finish the transition. Then again I'd like to see Mac OSX opensourced, too, so it may be something in the water.
Actually, no. The problem is that you need to pay big bucks to have access to IOS updates, and too many people just buy the router, whatever IOS comes with it, and NEVER want to hear from Cisco's overpriced services ever again.
Really, critical internet infrastructure needs to be *easy* to keep up-to-date (as in low cost and not many technical pitfalls). We also need to start doing Very Bad Things to those who don't implement BCP-38 (you're a danger to all your customers and downstream if you don't), egress filtering (good-neighbor requirements), automated up-to-date bogon filtering (or you will cause trouble for everyone who gets a new block of IP space freshly handed to an RIR), and strict BGP filtering...
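The following Python sketch uses the standard ipaddress module to illustrate two of those filters. The first is a BCP-38 style check that only lets a customer's own source addresses through. The second is a bogon check on announced prefixes. The customer prefix and the short bogon list are hypothetical placeholders. The comment's point is that a real bogon list has to be refreshed automatically rather than hard-coded like this.

```python
import ipaddress

# Hypothetical customer assignment; a real deployment would take this from
# provisioning data.
CUSTOMER_PREFIXES = [ipaddress.ip_network("192.0.2.0/24")]

# Illustrative bogon snapshot only. A static list like this goes stale as new
# space is allocated, which is exactly the problem the comment describes.
BOGONS = [
    ipaddress.ip_network(p)
    for p in ("0.0.0.0/8", "10.0.0.0/8", "127.0.0.0/8", "192.168.0.0/16")
]


def permit_source(src: str) -> bool:
    """BCP-38 style ingress check: forward only packets sourced from the customer's space."""
    addr = ipaddress.ip_address(src)
    return any(addr in net for net in CUSTOMER_PREFIXES)


def accept_announcement(prefix: str) -> bool:
    """Reject route announcements that overlap any bogon range."""
    net = ipaddress.ip_network(prefix)
    return not any(net.overlaps(bogon) for bogon in BOGONS)
```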
Cisco's IOS update policies REALLY have a part of the blame on this.
Make your backup device different from the main one. If you use two different vendors, the chance of a bug affecting both is significantly reduced. It also means the devices actually have to use standard, interoperable protocols to handle the failover.
Then again I'd like to see Mac OSX opensourced, too,
umm... http://www.opensource.apple.com/darwinsource/