Cisco Routers to Blame for Japan Net Outage

Posted by Zonk on Friday May 18, 2007 @10:28PM from the no-web-for-you-mr.-roboto dept.

An anonymous reader passed us a link to a Network World article filling in the details behind the massive internet outage Japanese web users experienced earlier this week. According to the site, faulty Cisco routers were to blame for the lapse, which left millions of customers without service from late Tuesday evening until early Wednesday morning. "NTT East and NTT West, both group companies of Japanese telecom giant Nippon Telegraph and Telephone (NTT), are in the process of finalizing their decisions on a core router upgrade, according to the report. The routing table rewrite overflowed the routing tables and caused the routers' forwarding process to fail, the CIBC report states."

3 of 78 comments

Eggs in one basket by slashthedot · 2007-05-18 22:46 · Score: 4, Insightful

"Clearly, this failure doesn't reflect well on (Cisco) and at the very least highlights the need for two vendors," states CIBC analyst Ittai Kidron in the report. Yeah, don't keep all your routers in the Cisco basket.
1. Re:Eggs in one basket by sarathmenon · 2007-05-19 00:10 · Score: 4, Insightful
  
  Yeah, don't keep all your routers in the Cisco basket.
  
  I don't agree that the blame lies with Cisco, not until I see more evidence. Cisco has some of the most stable operating systems. The command-line interface can sometimes suck, but their stability is remarkable. The fault, I am guessing, is with the ISP for not planning network redundancy and not scaling their networks in time. Cisco might look bad in this article, but their track record in creating an OS with fewer bugs is much better than that of Microsoft, Sun and the others.
  
  --
  Microsoft: "You've got questions. We've got dancing paperclips."
Don't blame Cisco... blame the Monitoring by Allnighterking · 2007-05-19 06:39 · Score: 4, Insightful

I'm sorry, but I've got a total of 3 data centers and I can tell you when, where, and which router/PIX/switch has a problem. I do it by not only monitoring the item itself (this is what everyone should do) but also by monitoring "through" the item in question. I monitor a specific point on the other side (or in some cases a set of points), not to find out if that point is good, but to find out if the path is good. Three things have to be monitored:

1. Local status (am I alive?)
2. Path (can I get from me to you, and what is the quality of the path?)
3. End point (are you there?)
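
The three checks above can be sketched in a few lines of Python. This is only an illustration, not the poster's actual setup: the `DeviceChecks` and `diagnose` names are hypothetical, each probe is assumed to be a function returning True or False, and `tcp_probe` is one assumed way to build such a probe. Path quality (latency, loss) is left out to keep it short.

```python
import socket
from dataclasses import dataclass
from typing import Callable

Probe = Callable[[], bool]  # hypothetical: any zero-argument check returning True/False


def tcp_probe(host: str, port: int, timeout: float = 2.0) -> Probe:
    """Build a probe that succeeds if a TCP connection to host:port opens in time."""
    def probe() -> bool:
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            return False
    return probe


@dataclass
class DeviceChecks:
    name: str
    local: Probe     # 1. is the device itself alive?
    path: Probe      # 2. can traffic get through the device to the far side?
    endpoint: Probe  # 3. is the far-side target answering?


def diagnose(checks: DeviceChecks) -> str:
    """Run the three checks in order and report the first layer that fails."""
    if not checks.local():
        return f"{checks.name}: device down"
    if not checks.path():
        return f"{checks.name}: device up, path through it failing"
    if not checks.endpoint():
        return f"{checks.name}: path ok, end point not answering"
    return f"{checks.name}: ok"


# Stubbed probes, so the example runs without touching a network.
edge = DeviceChecks("edge-router-1",
                    local=lambda: True, path=lambda: False, endpoint=lambda: False)
print(diagnose(edge))  # edge-router-1: device up, path through it failing
```

Checking in this order means a dead end point is never mistaken for a dead router, which is the point of monitoring "through" a device rather than only the device itself.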

If at any time you let the number of paths and interconnects overwhelm you, get a new job. You've lost control. Draw pictures of the network. When you have an outage, start looking immediately at what you have connectivity with and what you don't. Large data centers can get complex in their interconnects. Divide the network up into "blocks," verify a block, and move on.

The biggest problem in a situation like this is that, I'm willing to bet, the techs were wasting their time trying to figure out why the network went down. Who cares why? You need to quickly assess what is down, what you can do, and what you can't do. You need to know what is normal and what is not. If you don't, a situation like this can happen.
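
Knowing "what is normal" and quickly assessing "what is down," block by block, could look something like the sketch below. It assumes you already keep a baseline of which targets each block normally reaches and have a fresh set of targets that answered just now; the `assess` function and the block and target names are made up for illustration.

```python
from typing import Dict, Set


def assess(baseline: Dict[str, Set[str]], reachable: Set[str]) -> Dict[str, Set[str]]:
    """Return, per block, the targets that are normally reachable but are not now."""
    down: Dict[str, Set[str]] = {}
    for block, targets in baseline.items():
        missing = targets - reachable
        if missing:
            down[block] = missing
    return down


# Hypothetical baseline: what each block can reach on a normal day.
baseline = {
    "dc-east": {"core-1", "db-1"},
    "dc-west": {"core-2"},
}
# Hypothetical result of the latest round of probes.
reachable_now = {"core-1", "core-2"}

print(assess(baseline, reachable_now))  # {'dc-east': {'db-1'}}
```

Blocks that come back empty are verified and can be set aside; the rest tell you where to look first, before anyone starts arguing about why.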

The worst thing that can happen is if the network is divided into "territories." Usually in a case like this people spend more time trying to blame the other guy than they do finding the cause of the problem. Finally, design. Somewhere along the line some pencil pusher decided that a single point of failure was economically feasible. The techs were willing to follow along like sheep, and the Sr. Admin played politics and didn't rock the boat.

In the end, the techs blew it. The after-action report and follow-up will tell the final tale.

--
I'm sorry, I'm too tired to be witty at the moment so this message will have to do.