Cisco Routers to Blame for Japan Net Outage


Posted by Zonk on Friday May 18, 2007 @10:28PM from the no-web-for-you-mr.-roboto dept.

An anonymous reader passed us a link to a Network World article filling in the details behind the massive internet outage Japanese web users experienced earlier this week. According to the site, faulty Cisco routers were to blame for the lapse, which left millions of customers without service from late evening Tuesday until early in the morning on Wednesday. "NTT East and NTT West, both group companies of Japanese telecom giant Nippon Telegraph and Telephone (NTT), are in the process of finalizing their decisions on a core router upgrade, according to the report. The routing table rewrite overflowed the routing tables and caused the routers' forwarding process to fail, the CIBC report states."

Eggs in one basket by slashthedot · 2007-05-18 22:46 · Score: 4, Insightful

"Clearly, this failure doesn't reflect well on (Cisco) and at the very least highlights the need for two vendors," states CIBC analyst Ittai Kidron in the report.

Yeah, don't keep all your routers in Cisco basket.
1. Re:Eggs in one basket by sarathmenon · 2007-05-19 00:10 · Score: 4, Insightful
  
  Yeah, don't keep all your routers in Cisco basket.
  
  I don't agree that the blame lies with Cisco, not until I see more evidence. Cisco has some of the most stable operating systems. The command-line interface can sometimes suck, but their stability is remarkable. The fault, I am guessing, is with the ISP for not planning network redundancy and not scaling their networks in time. Cisco might look bad in this article, but their track record in creating an OS with less number of bugs is much better than Microsoft, Sun and the others.
  
  --
  Microsoft: "You've got questions. We've got dancing paperclips."
2. Re:Eggs in one basket by Anonymous Coward · 2007-05-19 01:29 · Score: 2, Insightful
  
  Excuse me for speaking bluntly but:
  PEBCAK
3. Re:Eggs in one basket by sargon · 2007-05-19 08:11 · Score: 2, Insightful
  
  Cisco has some of the most stable operating systems.
  
  You must be using some Cisco OS I don't know about. I am in the process of upgrading 120 Cisco boxes thanks to that "stable operating system."
  
  Junipers are a different matter. MUCH more stable.
  
  Cisco might look bad in this article, but their track record in creating an OS with less number of bugs is much better than Microsoft, Sun and the others.
  
  Riiiiight. Apparently you have never had to deal with Cisco's inability to produce an IOS which doesn't have a BGP bug in it. Or MPLS bug. Or... Well, the list is long.
Should have used Junipers by Anonymous Coward · 2007-05-18 23:15 · Score: 3, Insightful

Being a current CCIE, and having extensive experience with both vendors' boxes, I wouldn't use anything other than a Juniper for core infrastructure, and I'm never going back to Cisco kit.

To be fair, Cisco is untouchable in the enterprise class with their CPEs.
1. Re:Should have used Junipers by Anonymous Coward · 2007-05-19 04:17 · Score: 1, Insightful
  
  I run a 10Gig network with 0 Cisco products in it. That said, Cisco needs a little defending based on your statements:
  
  one train of code to follow and so on.
  
  Juniper has different code loads for M/T series vs. J-series vs. E-series. The nice thing is that there's only 1 load per series.
  
  Unlike Cisco 6500/7600 Sup720-3B/3BXL with hardware limitation of 256K and 512K IPv4 routes respectively
  
  The Cisco 6500 is a layer 3 switch, not a router. The 7600 was designed by Cisco to be an edge aggregation router, not a core device, which I imagine had something to do with the low number of routes supported (which I agree is ridiculous). Juniper doesn't (yet) make switches (or rather, they aren't available for mass consumption yet). We should probably keep this apples to apples.
  
  I see Juniper as the Unix of routers
  
  Could that be because JunOS is based on FreeBSD, and it even indicates such from the console during system boot? Likening IOS to Windows though, is just unfair.
  
  If you desire stability, security, performance and flexibility go Juniper. Cisco still has a place such as in enterprises that still run legacy IPX.
  
  So, you don't do ANYthing currently with Juniper, but you're endorsing them based on 3rd-party info? Hrm. And some might argue that routers such as the CRS-1 'do' have their place in the carrier market.
  
  Junipers are great, though tremendously overpriced in my opinion (we have several T640s) for what they do. Companies like Foundry are creating similarly redundant software/hardware architectures for orders-of-magnitude less, and support 40 and 100Gbps already (just as an example...I'm no Foundry fan-boy).
nice work blaming cisco by ctime · 2007-05-18 23:28 · Score: 3, Insightful

I think that it's great when a company is blamed for having poor products when it's really the fault of the company using them (in this case, NTT). The way the article is presented seems unfairly biased. The problem here isn't with Cisco products but with the lack of knowledge of how to scale them properly. The headline is similar to saying something like "Ford Motor Company cars involved in most car accidents historically". A properly designed network with just about any vendor, especially Cisco, would have avoided this issue.
Also.. by niceone · 2007-05-18 23:36 · Score: 3, Insightful

Cisco routers to blame for most of the rest of the internet's non-outage.

--
ccalam - acoustic versions of new songs.
Re:TCAM exhaustion by jacksonj04 · 2007-05-19 00:09 · Score: 2, Insightful

So basically what you're saying is there was insufficient routing capacity on the network causing it to fail? Well, I'm shocked!

Seriously though - would you try to run a datacentre on a home router from NetGear? If I did and the network fell over in a fiery mass of routing tables, I wouldn't say NetGear was to blame for building a bad router. I'd blame the network architect who thought they could shove hundreds of servers through a 5-port-with-wifi device.

--
How many people can read hex if only you and dead people can read hex?
Re:TCAM exhaustion by thogard · 2007-05-19 05:07 · Score: 2, Insightful

Yep, limit it to 16,777,216 /24 routes. That will fix it. If your router has 16 interfaces, you can do this with 8 MB of cache RAM to make the quick decisions and whatever else you need to process the routes.
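
For what it's worth, the arithmetic above works out if each /24 slot stores nothing but a 4-bit interface number: 2^24 slots at half a byte each is 8 MiB. Here is a minimal Python sketch of that idea. The class name `Slash24Table` and its methods are made up for illustration, and a real router keeps far more state per route than this.

```python
import ipaddress

NUM_SLOTS = 1 << 24            # one slot per possible IPv4 /24 prefix
TABLE_BYTES = NUM_SLOTS // 2   # two 4-bit interface numbers packed per byte


class Slash24Table:
    """Hypothetical direct-indexed table: top 24 address bits select the slot."""

    def __init__(self):
        # 8 MiB of zeros, so every prefix initially points at interface 0.
        self.nibbles = bytearray(TABLE_BYTES)

    def set_route(self, prefix, interface):
        if not 0 <= interface < 16:
            raise ValueError("interface must fit in 4 bits (0-15)")
        net = ipaddress.ip_network(prefix)
        if net.version != 4 or net.prefixlen != 24:
            raise ValueError("this sketch only stores IPv4 /24 prefixes")
        slot = int(net.network_address) >> 8
        byte_index, odd = divmod(slot, 2)
        current = self.nibbles[byte_index]
        if odd:
            # Odd slots live in the high nibble of the byte.
            self.nibbles[byte_index] = (current & 0x0F) | (interface << 4)
        else:
            # Even slots live in the low nibble.
            self.nibbles[byte_index] = (current & 0xF0) | interface

    def lookup(self, address):
        slot = int(ipaddress.IPv4Address(address)) >> 8
        byte_index, odd = divmod(slot, 2)
        value = self.nibbles[byte_index]
        return (value >> 4) if odd else (value & 0x0F)
```

Usage would be something like `t = Slash24Table(); t.set_route("192.0.2.0/24", 5); t.lookup("192.0.2.77")`. A lookup is one shift and one array read, which is the "quick decision" part. Anything longer or shorter than a /24 does not fit this layout at all.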
Don't blame Cisco... blame the Monitoring by Allnighterking · 2007-05-19 06:39 · Score: 4, Insightful

I'm sorry, but I've got a total of 3 data centers and I can tell you when, where, and which router/PIX/switch has a problem. I do it by not only monitoring the item itself (this is what everyone should do) but also by monitoring "through" the item in question. I monitor a specific point on the other side (or in some cases a set of points), not to find out if that point is good, but to find out if the path is good. 3 things have to be monitored.

1. Local status (Am I alive)
2. Path (can I get from me to you, what is the quality of the path?)
3. End point (are you there?)
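
A rough Python sketch of how those three checks might be combined into one diagnosis per device. Everything here is hypothetical: the `Target` type, the `diagnose` function, and the three probe callables stand in for whatever your monitoring system actually runs (SNMP polls, pings, traceroutes, and so on).

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Target:
    """One monitored device plus the three probes described above (all hypothetical)."""
    name: str
    local_ok: Callable[[], bool]     # 1. local status: is the device itself alive?
    path_ok: Callable[[], bool]      # 2. path: can traffic get through it to the far side?
    endpoint_ok: Callable[[], bool]  # 3. end point: does the far-side reference point answer?


def diagnose(target: Target) -> str:
    # Check nearest-first, so the report names the closest thing that is broken.
    if not target.local_ok():
        return f"{target.name}: device down"
    if not target.path_ok():
        return f"{target.name}: device up, path through it broken"
    if not target.endpoint_ok():
        return f"{target.name}: path good, end point not answering"
    return f"{target.name}: ok"
```

With stub probes, `diagnose(Target("core-1", lambda: True, lambda: False, lambda: True))` reports that the device is up but the path through it is broken, which is exactly the case that monitoring only the device itself would miss.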

If at any time you let the number of paths and interconnects overwhelm you, get a new job. You've lost control. Draw pictures of the network. When you have an outage, start looking immediately at what you have connectivity with and what you don't. Large data centers can get complex in their interconnects. Divide it up into "blocks", verify a block, and move on.

The biggest problem in a situation like this is that I'm willing to bet the techs were wasting their time trying to figure out why the network went down. Who cares why. You need to quickly assess what is down. What you can do. What you can't do. You need to know what is normal and what is not. If you don't, a situation like this can happen.

The worst thing that can happen is if the network is divided into "territories." Usually in a case like this people spend more time trying to blame the other guy than they do finding the cause of the problem. Finally, design. Somewhere along the line some pencil pusher decided that a single point of failure was economically feasible. The techs were willing to go along like sheep, and the Sr Admin played politics and didn't rock the boat.

In the end, the techs blew it. The after action report and follow up will tell the final tale.

--
I'm sorry, I'm too tired to be witty at the moment so this message will have to do.