How a Router's Missed Range Check Nearly Crashed the Internet
Barlaam writes "A bug by router vendor A (omitting a range check from a critical field in the configuration interface) tickled a bug from router vendor B (dropping BGP sessions when processing some ASPATH attributes with length very close to 256), causing a ripple effect that caused widespread global routing instability last week. The flaw lay dormant until one of vendor A's systems was deployed in an autonomous system whose ASN, modulo 256, was greater than 250. At that point, the Internet was one typo away from disaster. Other router vendors, who were not affected by the bug, happily propagated the trigger message to every vulnerable system on the planet in about 30 seconds. Few people appreciate how fragile and unsecured the Internet's trust-based critical infrastructure really is — this is just the latest example." Vendor A, in this case, is a Latvian router vendor called MikroTik.
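The "ASN, modulo 256, was greater than 250" condition in the summary can be sketched in a few lines. This is only a model, assuming the prepend-count field silently kept the low 8 bits of whatever was typed into it, which is an inference from the summary rather than a documented implementation detail. The function names, the limit of 16, and the ASN values 47868 and 64500 are illustrative assumptions.

```python
# Sketch only: models a prepend-count field with and without a range check.
# The 8-bit wraparound is an assumption inferred from "ASN, modulo 256".

MAX_SANE_PREPENDS = 16  # hypothetical policy limit, not a vendor default


def unchecked_prepend_count(typed_value: int) -> int:
    """Model a count field that keeps only the low 8 bits of its input."""
    return typed_value % 256


def checked_prepend_count(typed_value: int, limit: int = MAX_SANE_PREPENDS) -> int:
    """Same field with the range check that was reportedly missing."""
    if not 0 <= typed_value <= limit:
        raise ValueError(f"prepend count {typed_value} is outside 0..{limit}")
    return typed_value


def prepend(as_path, own_asn, count):
    """Return a new AS path with own_asn repeated `count` times in front."""
    return [own_asn] * count + list(as_path)


# An operator types the ASN itself where a count was expected.
typed = 47868                      # illustrative ASN; 47868 % 256 == 252
count = unchecked_prepend_count(typed)
path = prepend([64500], typed, count)
print(count, len(path))            # 252 253
```

With the checked variant the same typo is rejected at configuration time instead of producing an announcement whose path length sits right next to 256.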
Is this related to the story posted that stated:
"One Broken Router Takes Out Half the Internet?"
http://tech.slashdot.org/article.pl?sid=09/02/16/2233207
It just amazes me how differently presented this story is compared with the previous.
In fairness, there is much more information about this 'outage' now.
This news is alarming. Thanks for not making it alarmist this time.
Vendor B is Cisco btw. Dunno why they were being vague.
I'm sure nobody here would argue with me if I suggested that the internet would be a much safer place without routers.
If people had upgraded their routers this wouldn't have happened. Newsflash: software has bugs. Not upgrading your software will bite you in the ass eventually, especially if this software runs critical systems like your routers.
...so ISPs should filter AS paths!
No, no sig. Really.
ThePromenader
except in the kdawson style it was a single link to a message board posting about a router "taking out half the internet." Dupe? Correction? I don't care as long as kdawson is kept away from the site for a while.
... the crash will take out the entire interwebs for a full week. Wouldn't it be amazing if mankind as a whole had to "survive" an entire week without the face-to-face interaction killer that is the internet? I suppose that what's even more pathetic is that we depend on it so much now; countries would go into widespread panic if internet was lost for a single week. Isn't it sad how people seem to think that something that didn't even exist 30 years ago is now considered a bare necessity? Oh, the priorities of man.
I don't know about it nearly crashing the Internet. How many people actually noticed a difference that day, for that matter?
A lot of admins, especially after the alert went out over the NANOG list, set their routers to reject long ASPATHs (or so I assume from what I saw on that list; I am not a BGP admin myself). Many routers simply rejected these ASPATHs as well; correct me if I'm wrong, but weren't old versions of IOS the only ones affected? It was a serious issue, but I'm not sure if it came anywhere near a disaster scenario.
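The "reject long ASPATHs" workaround amounts to a length check on inbound updates. Here is a minimal sketch of that idea, assuming a local policy limit of 50 hops; the limit, the function name, and the prefixes and ASNs (documentation ranges) are all made up for illustration and do not correspond to any vendor's configuration syntax.

```python
# Sketch only: an inbound filter that drops updates with overly long AS paths.

MAX_AS_PATH_LEN = 50  # hypothetical local policy, not a standard value


def accept_route(as_path, max_len=MAX_AS_PATH_LEN):
    """Accept an update only if its AS path is at most max_len hops long."""
    return len(as_path) <= max_len


updates = {
    "198.51.100.0/24": [64500, 64501, 64502],       # ordinary path
    "203.0.113.0/24": [64510] * 252 + [64500],      # heavily prepended path
}

accepted = {prefix: path for prefix, path in updates.items() if accept_route(path)}
print(sorted(accepted))  # ['198.51.100.0/24']
```

A filter like this also stops the long path from being passed on to neighbors that might be vulnerable.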
"The Internet was back to normal in short order."
Well, not completely normal, not yet.
Reportedly all data was lost. And it was more than just the routers -- someone was clogging the tubes by running too many apps on their desktop.
We should be very thankful that the partial backup was found with some info from the Google Tube, however.
Few people appreciate how fragile and unsecured the Internet's trust-based critical infrastructure really is - this is just the latest example.
Yeah. Like how everyone is trusted not to google "google".
When I worked for *unnamed nw regional backbone here* we had peering agreements with everyone we connected to except uunet, and it was pretty well known that if we spat out a bad BGP route we could bring down the whole net by hitting enter ('cept uunet, although I'm pretty sure uunet woulda went down from everyone else routing around them to us)
How is this new? That was the 90s, when we spent 100k+ on a Cisco 7513 with 64 megs of RAM so it could hold the BGP tables...
We even wrote our own manual ('cause none existed) on how to deal with BGP tables so junior admins working for us wouldn't fuq it up. (and on top of that, we wouldn't let them touch the routers either)
-meetme room in the westin in Seattle-
The critical bug is with the Cisco routers; a MikroTik router merely triggered it.
It would be possible to trigger this bug with any routing software that does not do range checking on the number of times the ASN is prepended.
The summary is spreading FUD by making Mikrotik, the only named vendor in the summary, look like the vendor at fault.
The next time someone needs you to fix a computer problem and asks what went wrong, simply give them this article's summary as the reason why, replacing "router" and "Internet" with the defective part in question. You're guaranteed to look a bit sharper, too.
"A bug by power supply vendor A (omitting a range check from a critical field in the configuration interface) tickled a bug from power supply vendor B (dropping BGP sessions when processing some ASPATH attributes with length very close to 256), causing a ripple effect that caused widespread global routing instability last week. The flaw lay dormant until one of vendor A's systems was deployed in an autonomous system whose ASN, modulo 256, was greater than 250. At that point, the power supply was one typo away from disaster. Other power supply vendors, who were not affected by the bug, happily propagated the trigger message to every vulnerable system on the planet in about 30 seconds. Few people appreciate how fragile and unsecured the power supply's trust-based critical infrastructure really is -- this is just the latest example."
Mikrotik are known GPL violators who use a modified Linux (re-branded as "RouterOS") and a terribly bad implementation of the BGP protocol.
In some custom community network where MikroTik has been deployed internally, that stolen Linux is being hacked to use Quagga instead of MikroTik's BGP.
In short: that "RouterOS" is highly unsuitable for the Internet. I can't believe somebody was stupid enough to trust it.
Reminds me of a story that Keith Marzullo told our class in a graduate level reliability class. This was back in the days of using UUCP to send email, and the vendor that he worked for had just released a "failsafe" product they were very proud of -- essentially, it was a mail router that could detect if a path went down, and would try an alternate router instead. The company touted it as a bulletproof solution.
So they go to a conference, and set up some routers, unplug some of them, etc., and everything is going fine until they ask an audience member for his UUCP address. UUCP addresses are in the form of host1!host2!host3!username, with the routing for the username explicitly specified... the addresses could thus get quite long. In this case, the guy's email address was over the buffer limit the company's routers used.
Guess what happened?
The mail server tried sending an email to the next router in the chain. The router buffer overflowed and crashed. The reliable server then tried another router... and crashed it. It then went through the entire network, and crashed every single one of the nodes, turning a bug that would have been a single point of failure into a total network collapse.
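The failure mode in that story, where failover faithfully retries a poison message against every remaining node, can be shown with a toy simulation. Everything here is hypothetical: the 64-character buffer limit, the `Router` class, and the host names are invented for illustration and say nothing about the actual product.

```python
# Sketch only: failover that retries a "poison" message on every router.

BUFFER_LIMIT = 64  # hypothetical fixed buffer size in characters


class Router:
    def __init__(self, name):
        self.name = name
        self.up = True

    def relay(self, address):
        """Relay a message; an over-long address models the overflow crash."""
        if len(address) > BUFFER_LIMIT:
            self.up = False
            return False
        return True


def send_with_failover(routers, address):
    """Try each live router in turn until one relays the message."""
    for router in routers:
        if router.up and router.relay(address):
            return router.name
    return None


routers = [Router(f"r{i}") for i in range(5)]
long_address = "!".join(f"host{i}" for i in range(20)) + "!user"

delivered_by = send_with_failover(routers, long_address)
survivors = [r.name for r in routers if r.up]
print(delivered_by, survivors)  # None []
```

Without the retry loop only the first router would have gone down; with it, the one bad input reaches every node.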
=)
Yeah, one of my favorite stories from UCSD.
Maybe if they updated their IOS back in 2003 when Cisco came out with the fix they wouldn't have these problems. You wouldn't give an XP user a pass on not updating for 6 years and having a problem, don't give these upstreams any.
-zifr
Summary reads like the script for a bad disaster movie.
Carbon based humanoid in training.
I see the poor programmers thousands of miles away from their routers jammed with idiot traffic configs trying to fix a bug knowing the WORLD is waiting for their patch... would be bad.
...A Slashdot "Editor" notices these posts and mods them into oblivion.
But is that better or worse than having them modded down by sycophantic Slashdot readers?
My Slashdot login - a four-digit userid - is worthless now.
It's been stuck on Karma:-1, Terrible for a couple of years.
What did I do to deserve that terrible fate?
My sin was to post a message critical of dear Michael Sims and his editing methods and practices here on Slashdot.
I forget - are they nasty Russian stooges or decent US stooges these days?
It's not a bug...it's a feature
At that point, the Internet was one typo away from disaster.
I wonder how long that took?
Pulp Audio Weekly - Geek News and Reviews
A bug by device vendor A (twiddling a framis panel instead of sparting the glinbo interface) patted a bug from device vendor B (elevating ALP packets when deferring some GALAS modifiers with size beneath 176), yielding a domino effect that caused widespread universal switching instability last week. The flaw lay dormant until one of vendor A's systems was deployed in an autonomous system whose LKM, divisor 965, was less than 1250. At that point, the Internet was one typo away from disaster. Other router vendors, who were not affected by the bug, happily propagated the trigger message to every vulnerable system on the planet in about 30 seconds. Few people appreciate how fragile and unsecured the Internet's trust-based critical infrastructure really is -- this is just the latest example.
Reads just about the same to me. I can't make any sense of either description of the bug.
It's called humor.
I filtered out Jon Katz when he was still with slashdot and it was a huge improvement in my user experience.
Run and catch, run and catch, the lamb is caught in the blackberry patch.
So then you just have to enact secure connections, where everyone personally knows everyone else before you connect.
---- Booth was a patriot ----
In the late 90s I worked for a large American test equipment manufacturer. We had developed an embedded system for performing parametric testing of telephone lines when not in use (and the test would be rescheduled if the line became required).
It was great for detecting cables about to fail, that had failed, and could pinpoint where (by TDR) they likely had failed.
It worked like a charm, except for one little nuisance: downloading new firmware to the thousands of remote units usually failed. It took a while to track down, since we could not repro it in house. The control network was TCP/IP over PPP over PVCs set up over 9600 bps serial links multiplexed over X.25.
Turned out that small command and control requests did not send the large packets that software download did, and the combination of large packet size, consequently long ACK time (over the 9600 bps link), poor RTT convergence in the host TCP/IP stack, and incorrect handling of duplicated packets in the embedded TCP/IP stack was our undoing.
I engineered a small piece of code that would modify the embedded TCP/IP stack to get around the defect well enough that a download could complete, which in turn allowed for the download of a properly corrected full version.
In Liberty, Rene
At that point, the Internet was one typo away from disaster. ... Few people appreciate how fragile and unsecured the Internet's trust-based critical infrastructure really is -- this is just the latest example."
At that point, the internet as a whole remained largely unaffected for the majority of users. Few people appreciate how robust the Internet's trust-based critical infrastructure and its ability to dynamically reroute traffic through the remaining nodes even with the loss of a significant portion of the net really is -- this is just the latest example.
Insightful and funny are really the same thing, except one has a punch line.
Unless there really is a legitimate reason for it, this seems stupid. The only reason I can think of to put your own ASN more than once would be to artificially increase the AS_PATH size and lower other ASes' preference to route through you. But BGP has lots of other ways to accomplish that same goal.
Why would MikroTik have this as a required parameter? And what legitimate reasons are there to include your own ASN multiple times on an advertisement?
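For anyone wondering how prepending changes preference at all: other things being equal, BGP prefers the route with the shorter AS path, so repeating your own ASN makes one entry point look farther away. The sketch below is deliberately simplified, since real routers compare other attributes such as local preference before path length; the function name, neighbor labels, and ASNs (documentation range) are illustrative assumptions.

```python
# Sketch only: AS-path length as the sole tie-breaker between two routes.


def best_path(candidates):
    """Pick the candidate with the shortest AS path (simplified selection)."""
    return min(candidates, key=lambda c: len(c["as_path"]))


# Origin AS 64500 is reachable through two upstreams.
via_a = {"neighbor": "A", "as_path": [64501, 64500]}
# Toward upstream B the origin prepends itself twice to look less attractive.
via_b = {"neighbor": "B", "as_path": [64502, 64500, 64500, 64500]}

print(best_path([via_b, via_a])["neighbor"])  # A
```

A handful of prepends is enough for this kind of traffic steering, which is why a count anywhere near 250 has no practical use.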
Cisco update policy? Isn't that called Juniper or Huawei?
Cisco used to be the best option (they weren't that great in product terms, but everyone else was worse, and Cisco had good service and support).
They're getting squeezed from both the top and bottom.
Well, we replaced a Cisco edge router with a MikroTik-based one after tests with the MikroTik showed it kicked the Cisco's butt. It's far easier to deal with than the Cisco was. Some care in how you configure BGP can stop this, but clearly the Cisco side shouldn't have even accepted that update. So now we will hear the naysayers.
This is Slashdot. News for Nerds etc. Most readers should be able to use the filtering.
In the past, I believe many of us filtered out JonKatz.
Just because a vocal minority complain about kdawson doesn't mean the rest care that much.
And as I understand it the bug was pre-IOS 12.0-something.
Looks like the Net needed a good round of forklift upgrades anyway.
Give a man a fish and you have fed him for today. Teach a man to fish, and he'll say "WHERE'S MY FISH, YOU IDIOT?"
I believe this has been shown incorrect; from the article:
(emphasis mine). More info:
http://blog.ioshints.info/2009/02/oversized-as-paths-cisco-ios-bug.html
And the Cisco description (the bug ID, CSCsx73770, is linked in there, but you need a login to access it):
http://tools.cisco.com/security/center/viewAlert.x?alertId=17670
Does that westernise as NecroTic?
A bug by router vendor
So that is what they are calling it these days... lol, I know that was a bit tongue in cheek. It's just that when I read this I remembered all too well how Sprint made a business decision to remove a span of IP addresses from being reached by any users of Sprint's DNS service, effectively censoring any and all users of Sprint DNS.
I do not have the article handy, perhaps someone could post a couple of links to the news stories where Sprint was blocking IP address ranges. While I do NOT remember the year, it was pre 2003, possibly before 2000?
Is your Internet Throttled? Install DD-Wrt, OpenWRT or Tomato to learn the truth! Google: 1Gbps/1Gbps: 5 Communities
Vendor A had a bug also that didn't play well with vendor B's bug, so who was vendor A?
News articles should answer: What, Where, When, How and (sometimes) Why?
So... when?
Why not state when this network error was propagated? Did it happen this year some time?
MikroTik. And their 'bug' was the fact that their command to choose the number of AS prepends was not restricted to a reasonable number.
It's being called a bug just because it was easy to misconfigure, and the misconfiguration could have nasty side effects on other people's networks where vendor B's equipment was used.
And an operator more familiar with vendor B's equipment would be likely to make a mistake when working with vendor A's equipment. (As in entering the explicit sequence of numbers to prepend where they should _instead_ have typed the number of prepends)
In other words, vendor A didn't include a device on their gun to prevent owners of their equipment from shooting other strangers in the foot, but just about everyone had bulletproof feet in this case, except vendor B.
Of course, if the operator of vendor A equipment was malicious, they could have used a malicious implementation of the protocol that would prepend that many, regardless of conventions that most routers restrict prepend to numbers somewhere between 10 and 16 hops.
Vendor A: "I accidentally the whole internet."