Cisco Routers to Blame for Japan Net Outage
An anonymous reader passed us a link to a Network World article filling in the details behind the massive internet outage Japanese web users experienced earlier this week. According to the site, faulty Cisco routers were to blame for the lapse, which left millions of customers without service from late evening Tuesday until early in the morning on Wednesday. "NTT East and NTT West, both group companies of Japanese telecom giant Nippon Telegraph and Telephone (NTT), are in the process of finalizing their decisions on a core router upgrade, according to the report. The routing table rewrite overflowed the routing tables and caused the routers' forwarding process to fail, the CIBC report states."
Being a current CCIE, and having extensive experience with both vendors' boxes, I wouldn't use anything other than a Juniper for core infrastructure, and I'm never going back to Cisco kit.
To be fair, Cisco is untouchable in the enterprise class with their CPEs.
I think that it's great when a company is blamed for having poor products when the fault really lies with the company using them (in this case, NTT). The way the article is presented seems unfairly biased. The problem here isn't with Cisco products but with the lack of knowledge about scaling them properly. The headline is similar to saying something like "Ford Motor Company cars involved in most car accidents historically". A properly designed network with just about any vendor, especially Cisco, would have avoided this issue.
Cisco routers to blame for most of the rest of the internet's non-outage.
So basically what you're saying is there was insufficient routing capacity on the network causing it to fail? Well, I'm shocked!
Seriously though - would you try to run a datacentre on a home router from NetGear? If I did and the network fell over in a fiery mass of routing tables, I wouldn't say NetGear was to blame for building a bad router. I'd blame the network architect who thought they could shove hundreds of servers through a 5-port-with-wifi device.
How many people can read hex if only you and dead people can read hex?
Yep, limit it to 16,777,216 /24 routes. That will fix it. If your router has 16 interfaces, each route needs only 4 bits to name its outgoing interface, so you can do this with 8 MB of cache RAM to make the quick decisions, plus whatever else you need to process the routes.
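To check the arithmetic behind that 8 MB figure, here is a rough sketch. It assumes a flat table with one slot per possible /24 prefix and just enough bits per slot to name an outgoing interface; the function names are made up for illustration, and no real router is claimed to work this way.

```python
import ipaddress

def direct_table_bytes(prefix_len: int, num_interfaces: int) -> int:
    """Bytes needed for a flat table with one next-hop slot per possible prefix."""
    entries = 2 ** prefix_len  # 2**24 = 16,777,216 slots for /24
    # Bits needed to number the interfaces 0..num_interfaces-1 (at least 1 bit).
    bits_per_entry = max(1, (num_interfaces - 1).bit_length())
    return (entries * bits_per_entry + 7) // 8  # round up to whole bytes

def lookup_index(ip: str, prefix_len: int = 24) -> int:
    """Slot index for an IPv4 address: its top prefix_len bits."""
    return int(ipaddress.IPv4Address(ip)) >> (32 - prefix_len)

print(direct_table_bytes(24, 16))       # 8388608 bytes, i.e. 8 MiB
print(lookup_index("192.168.1.77") == lookup_index("192.168.1.200"))  # True, same /24
```

The catch, of course, is that a table like this only covers prefixes of exactly one length, so it says nothing about shorter or longer routes.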
1. Local status (am I alive?)
2. Path (can I get from me to you, and what is the quality of the path?)
3. End point (are you there?)
If at any time you let the number of paths and interconnects overwhelm you, get a new job. You've lost control. Draw pictures of the network. When you have an outage, start looking immediately at what you have connectivity with and what you don't. Large data centers can get complex in their interconnects. Divide the network up into "blocks", verify a block, and move on.
The biggest problem in a situation like this is that I'm willing to bet the techs were wasting their time trying to figure out why the network went down. Who cares why? You need to quickly assess what is down, what you can do, and what you can't do. You need to know what is normal and what is not. If you don't, a situation like this can happen.
The worst thing that can happen is if the network is divided into "territories." Usually in a case like this people spend more time trying to blame the other guy than they do finding the cause of the problem. Finally, design. Somewhere along the line some pencil pusher decided that a single point of failure was economically feasible. The techs were willing to follow along like sheep, and the Sr Admin played politics and didn't rock the boat.
In the end, the techs blew it. The after-action report and follow-up will tell the final tale.
I'm sorry, I'm too tired to be witty at the moment so this message will have to do.
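The block-by-block approach could be sketched roughly like this. It is only an illustration: the block names, node names, and the probe function are all hypothetical, and a real probe would be a ping, an SNMP poll, or whatever your shop actually uses.

```python
from typing import Callable, Dict, List

# The three checks, in the order they should be run.
CHECKS = ("local", "path", "endpoint")

def triage(blocks: Dict[str, List[str]],
           probe: Callable[[str, str], bool]) -> Dict[str, str]:
    """Verify one block at a time; report 'ok' or the first failed check."""
    report = {}
    for block, nodes in blocks.items():
        report[block] = "ok"
        for node in nodes:
            # First check that fails for this node, or None if all pass.
            failed = next((c for c in CHECKS if not probe(node, c)), None)
            if failed is not None:
                report[block] = f"{failed} check failed at {node}"
                break  # this block is known bad; move on to the next one
    return report

# Hypothetical example: pretend the path check fails at router r2.
down = {("r2", "path")}
blocks = {"core": ["r1", "r2"], "edge": ["e1"]}
print(triage(blocks, lambda node, check: (node, check) not in down))
```

The point is that the output tells you what is down and where, block by block, before anyone starts arguing about why.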