Cisco Routers to Blame for Japan Net Outage

Posted by Zonk on Friday May 18, 2007 @10:28PM from the no-web-for-you-mr.-roboto dept.

An anonymous reader passed us a link to a Network World article filling in the details behind the massive internet outage Japanese web users experienced earlier this week. According to the site, faulty Cisco routers were to blame for the lapse, which left millions of customers without service from late evening Tuesday until early in the morning on Wednesday. "NTT East and NTT West, both group companies of Japanese telecom giant Nippon Telegraph and Telephone (NTT), are in the process of finalizing their decisions on a core router upgrade, according to the report. The routing table rewrite overflowed the routing tables and caused the routers' forwarding process to fail, the CIBC report states."

CEF and the routers. by wickedsun · 2007-05-18 22:57 · Score: 5, Informative

I think it's funny. Usually, when you open Cisco TAC about a "faulty" router not forwarding traffic anymore, Cisco will tell you it's your config's fault if it's not working properly.

Usually what happens is that the router doesn't have enough memory to store all the CEF (Cisco Express Forwarding) info, causing the router to not forward packets for certain subnets. I've seen it happen often enough to know. While Cisco is right that the problem is caused by a lack of memory for the config, I think the router shouldn't stop forwarding packets altogether (it could stop using CEF if the table gets out of hand).

While I think Cisco is not completely to blame (badly scaled networks, not upgrading routers in time), it sucks that this will hit them. There are better solutions out there, but I have to say that Cisco's support is quite good and they're pretty fast. I work in an all-Cisco environment (for the routers) and they've been fast whenever we needed a router analyzed.
Properly Filtering Prefixes by Anonymous Coward · 2007-05-18 23:04 · Score: 5, Informative

"The routing table rewrite overflowed the routing tables and caused the routers' forwarding process to fail, the CIBC report states"

Ok.. That says to me that their routing tables got really big and the routers ran out of memory... Or.. they had a prefix limit set, and it kept dropping the BGP session(s)...

If either of the above is true, properly designed filtering of the prefixes they send/receive to their BGP neighbors would have resolved this outage... It sounds like someone may have been incompetent, and they are trying to pawn off the "ownership" of this outage on Cisco.
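
To spell out the prefix-limit case, here is a toy Python sketch. It is not IOS configuration and not how any real router is implemented; `ToyBgpSession`, `max_prefixes`, and `allowed` are names made up for illustration. It only shows why an inbound filter can keep a session under its limit while an unfiltered session trips the limit and drops.

```python
import ipaddress


class ToyBgpSession:
    """Toy model of one BGP neighbor (hypothetical, for illustration only)."""

    def __init__(self, max_prefixes, allowed=None):
        self.max_prefixes = max_prefixes
        # Optional inbound filter: accept only prefixes inside these networks.
        self.allowed = [ipaddress.ip_network(n) for n in (allowed or [])]
        self.table = set()
        self.up = True

    def _permitted(self, prefix):
        # With no filter configured, everything is accepted.
        if not self.allowed:
            return True
        return any(prefix.subnet_of(net) for net in self.allowed)

    def receive(self, prefix_str):
        """Process one announcement from the neighbor."""
        if not self.up:
            return
        prefix = ipaddress.ip_network(prefix_str)
        if not self._permitted(prefix):
            return  # dropped by the inbound filter, never counted
        self.table.add(prefix)
        if len(self.table) > self.max_prefixes:
            # Assumed behavior: exceeding the limit tears the session down
            # and withdraws everything learned from this neighbor.
            self.table.clear()
            self.up = False


if __name__ == "__main__":
    announcements = [f"10.{i}.0.0/16" for i in range(8)] + ["192.0.2.0/24"]
    unfiltered = ToyBgpSession(max_prefixes=5)
    filtered = ToyBgpSession(max_prefixes=5,
                             allowed=["10.0.0.0/14", "192.0.2.0/24"])
    for p in announcements:
        unfiltered.receive(p)
        filtered.receive(p)
    print("unfiltered up:", unfiltered.up, "routes:", len(unfiltered.table))
    print("filtered up:  ", filtered.up, "routes:", len(filtered.table))
```

The filter here is just a containment check against a short allow-list; real prefix filtering has many more knobs, so treat this as a picture of the idea rather than a recipe.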

Either that, or it's a major IOS bug, and the article's author just sucks and didn't mention that..
The NW story is too vague to rule out human error by James+Youngman · 2007-05-18 23:27 · Score: 4, Informative
On the basis of the information in the NW article, I can't make out what the general nature of the alleged fault is on the "faulty" routers. I get that some routing table size limit was exceeded. But what was the nature of the problem?
- Did a manual change exceed a design limit? If so, why wasn't the manual change rejected? (If it was rejected, that's not a fault, it's user error.)
- Did an automatic change (like fail-over) applied to a valid configuration produce an invalid one? If so, did the routers report this or generate some kind of trap or alarm? If so, I guess the problem is a bit nebulous; maybe a monitoring failure, but maybe the system could have issued warnings that certain kinds of possible failover could exceed implementation limits. Hard to know without more detailed information.
- Did an automatic change silently produce the wrong result (like forwarding some traffic and not other traffic) *without* generating a trap or alert? If so, I would certainly call this a fault (bug). But the article doesn't contain enough information to point conclusively in this direction.
The event is big news, so I guess NW felt they had to say *something*. But while I'm no big fan of Cisco gear, it looks to me as though the explanation is as likely to be human error as equipment faults or bugs. One potential cause of problems in big routers is that the high-level software's view of the state of the routing engine gets out of sync with the actual state of the ASICs. I wonder if that happened here. My guess is that once more details of the incident emerge it will turn into a not-news story.
TCAM exhaustion by anticypher · 2007-05-18 23:46 · Score: 5, Informative

This was certainly a problem with slightly older Cisco kit, such as 6500s with Sup720a cards. Their TCAM memory (which holds prefix+destination tuples in a form of cache) overflows as the internet approaches 245,000 routes. Once there is no more space in TCAM, many Cisco architectures fall back to processor routing. That means that when traffic that was switched in hardware starts hitting the CPU, the box falls over whimpering for mercy.
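
A rough way to picture that fallback, as a Python toy. The class name, the slot count, and the packets-per-second budget are all invented for illustration, and real platforms differ in what they actually do on overflow; this models only the behavior described above.

```python
class ToyForwardingEngine:
    """Toy model: routes that do not fit in TCAM are switched by the CPU."""

    def __init__(self, tcam_slots, cpu_pps_budget):
        self.tcam_slots = tcam_slots          # hypothetical hardware capacity
        self.cpu_pps_budget = cpu_pps_budget  # hypothetical CPU capacity (pps)
        self.hw_routes = set()
        self.sw_routes = set()

    def install(self, prefix):
        """Install a route in hardware if there is room, else in software."""
        if prefix in self.hw_routes or prefix in self.sw_routes:
            return
        if len(self.hw_routes) < self.tcam_slots:
            self.hw_routes.add(prefix)
        else:
            self.sw_routes.add(prefix)

    def cpu_load(self, pps_by_prefix):
        """Fraction of the CPU budget used by software-switched traffic."""
        punted = sum(pps for prefix, pps in pps_by_prefix.items()
                     if prefix in self.sw_routes)
        return punted / self.cpu_pps_budget


if __name__ == "__main__":
    engine = ToyForwardingEngine(tcam_slots=3, cpu_pps_budget=1_000)
    prefixes = [f"p{i}" for i in range(5)]
    for p in prefixes:
        engine.install(p)
    traffic = {p: 10_000 for p in prefixes}
    # Anything above 1.0 means the CPU is asked to do more than it can.
    print("CPU load factor:", engine.cpu_load(traffic))
```

The point of the toy is the cliff: the first routes cost the CPU nothing, and the first route past the TCAM limit brings its full traffic share onto a processor sized for control-plane work.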

If NTT had been following Cisco mailing lists, or keeping up to date on what their salesmen had been telling them for several years, they would have seen this problem looming and changed their routing structure or at least upgraded the processors to something with slightly more TCAM. The size of the internet is not going to stop growing because many companies chose to go with underpowered Cisco kit. The internet will continue to grow by 12,000 to 17,000 routes per month, accelerating over the next few years as IPv4 space becomes exhausted and de-aggregation becomes the norm.

This is one of my long-standing grudges about Cisco design. They are always designing their core routers to be just slightly ahead of the size of the internet, forcing people to upgrade within a few years. Designed obsolescence is the term. Even their new CRS1 platform will fail over to CPU near 512,000 routes (0x80000), or sometime around the end of 2008 to mid 2009. By then, they'll probably have an expensive upgrade path for customers that will hold for just another year or two.

It's not just Cisco kit that is going to have problems over the next few months. By the end of June the internet will be at 256,000 routes (really 262,144 or 0x40000), which will be a problem for some other manufacturers. Some are starting to fail at 0x3C000 (245,760) routes, some already failed at 0x30000 last year.

On the plus side, the OpenBGPd crowd doesn't suffer from this, since their code is all CPU switched (but using very clever and efficiently coded routing tables) so their routing table is limited only by memory. But an OpenBGPd machine will never have the raw efficiency of a VLSI based hardware solution.

A quick look at my local looking glass shows 233,979 routes on the internet this morning.
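
For anyone who wants to check the arithmetic, a small Python sketch using only the figures quoted in this comment (233,979 routes today, 12,000 to 17,000 new routes per month). The constant growth rate is a simplifying assumption, since the growth is expected to accelerate, and `months_until` is just a helper name made up here.

```python
def months_until(current, limit, growth_per_month):
    """Months until `current` reaches `limit` at a constant monthly growth."""
    return max(0, limit - current) / growth_per_month


CURRENT_ROUTES = 233_979          # looking-glass figure quoted above
GROWTH_RANGE = (12_000, 17_000)   # routes per month, from the comment

# Table-size limits mentioned in the comment, written as hex constants.
LIMITS = {
    "0x3C000": 0x3C000,
    "0x40000": 0x40000,
    "0x80000": 0x80000,
}

if __name__ == "__main__":
    for name, limit in LIMITS.items():
        slow = months_until(CURRENT_ROUTES, limit, GROWTH_RANGE[0])
        fast = months_until(CURRENT_ROUTES, limit, GROWTH_RANGE[1])
        print(f"{name} = {limit:,} routes: "
              f"{fast:.1f} to {slow:.1f} months away")
```

Under those assumptions the 0x40000 mark is roughly two months out and the 0x80000 mark roughly a year and a half to two years out, which lines up with the dates given above.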

the AC

--
Hemos is like...sci-fi fans;he thinks technology is cool, but he hasn't bothered to understand the science it's based on
1. Re:TCAM exhaustion by anticypher · 2007-05-19 00:27 · Score: 4, Informative
  
  The CRS-1 is tested with at least 2M IPv4 routes.
  
  It appears to be four separate instances of 512K routes; the total is for MPLS customers shoving full BGP tables into their mesh. With more than 8 MPLS customers doing screwy things today, the box starts hitting its CPUs. I haven't received a denial from the CRS-1 guys, just some hand waving and a promise to look into it. Implications that a better config would help haven't actually produced an example of what to do, and the XR code is just different enough to hide underlying architecture deficiencies. The other problem is that every CRS-1 seems to be put into production before engineering has time to play with them and learn their tricks. Given time, all kinds of clever designs for XR code will spread around, just as there are tricks of the trade the most experienced IOS-based engineers grok.
  
  It should be enough to 2015 or more.
  
  And 640k should be enough for everyone. Seriously, I keep running across 2500s still doing their thing, but not as core BGP routers. So the CRS-1 platforms may very well be running tucked into edges in 2015. Bean counters love kit that has amortised many times over.
  
  the AC
  
  --
  Hemos is like...sci-fi fans;he thinks technology is cool, but he hasn't bothered to understand the science it's based on
2. Re:TCAM exhaustion by swmike · 2007-05-19 01:04 · Score: 2, Informative
  
  If you're running four full-BGP VPN customers in your core routers that can handle 2M routes, you've made a major design mistake and that's not the fault of the hardware manufacturer.
  
  Regarding routing table growth, hopefully IPv6 might stifle that a bit as we're going to be running out of IPv4 space in the next 3-5 years and IPv6 space is allocated in much larger blocks requiring fewer routes.
Re:The NW story is too vague to rule out human err by Anonymous Coward · 2007-05-19 01:16 · Score: 1, Informative

I have a friend who works for Cisco, setting up and configuring routers remotely via telepresence, all over the world. I know that lately they have been working on some routers in Japan.

*IF* (and I don't know for sure that this is the case, so it's a big IF) the project this person has been working on relates to the problem in this story, then I would say that your guess of 'human error' is likely a very large part of what happened.

I say this because my friend has filled me in on some of the stories relating to language barriers during the work they are doing. While the stories are very humorous (and that was their sole reason for telling me said stories), I can see where they would easily contribute to a situation where mistakes could be made.

From the stories, it is extremely evident that English is *very* much a second language on the Japan-side of things. To the point that it would be smart to have a person there whose first language and culture would be that of the company employees who were doing the install.

I'd guess that the lesson here is that globalization has a lot of bugs to be worked out still... :)
250k is lame, I just tested 1.1m on a Juniper by Anonymous Coward · 2007-05-19 01:27 · Score: 2, Informative

250K is quite lame. I just tested a bit over 1 million installed routes on a Juniper in my lab. ... a 5 year old m-series at that.

This is no shock: Juniper's first innovation was the use of high-speed RAM rather than TCAM for route lookup, so route table scaling has never been a problem for them...

Marketing on the other hand... geesh those cartoons still give me nightmares.
1. Re:250k is lame, I just tested 1.1m on a Juniper by frost22 · 2007-05-19 05:18 · Score: 2, Informative
  
  IIRC we tested amounts like that even in well equipped 75ers.
  
  As others already noted, the 6500/7600 is a switch with limited routing capabilities. You use it as a core router at your own risk (and peril).
  
  --
  ...and here I stand, with all my lore, poor fool, no wiser than before.
Re:Eggs in one basket by Anonymous Coward · 2007-05-19 04:11 · Score: 0, Informative

I agree that there isn't enough information. Most likely, the routers ran out of memory. About seven years ago, I worked at an ISP. My boss was just setting up a multihomed connection, and he needed to get BGP working properly on our two Cisco routers. He decided to upgrade them to IOS 12 (if I remember right). Well, he only had 64MB of RAM in our main router and every twelve hours or so it would crash. We called Cisco, and it turned out the combination of the changes in our routing table setup and the larger OS ate up our memory.

Ever since, I've always thought of a Cisco router as being like a Mac in terms of RAM. What do Macs do when they don't have enough RAM/swap? They crash! It doesn't matter if it's OS 9 or 10.4.9.