Cisco Routers to Blame for Japan Net Outtage

Djikstra by pwrtool+45 · 2007-05-18 22:41 · Score: 5, Funny

Japanese police have put out an APB for some guy named "Dijkstra."

Eggs in one basket by slashthedot · 2007-05-18 22:46 · Score: 4, Insightful

"Clearly, this failure doesn't reflect well on (Cisco) and at the very least highlights the need for two vendors," states CIBC analyst Ittai Kidron in the report. Yeah, don't keep all your routers in Cisco basket.

Re:Eggs in one basket by sarathmenon · 2007-05-19 00:10 · Score: 4, Insightful

Yeah, don't keep all your routers in Cisco basket.

I don't agree the blame is with Cisco, not until I see more evidence. Cisco has some of the most stable operating systems. The cmd line interface can sometimes suck, but their stability is very remarkable. The fault I am guessing is with the ISP for not planning network redundancy and not scaling their networks in time. Cisco might look bad in this article, but their track record in creating an OS with less number of bugs is much better than Microsoft, Sun and the others.

--
Microsoft: "You've got questions. We've got dancing paperclips."
Re:Eggs in one basket by Anonymous Coward · 2007-05-19 01:29 · Score: 2, Insightful

Excuse me for speaking bluntly but:
PEBCAK
Re:Eggs in one basket by CrimsonScythe · 2007-05-19 03:11 · Score: 3, Funny

Yeah, I guess now they'll be supplementing with some Belkin and D-Link routers as well.

--
The view was horrible and the smell was even worse; Julie severely regretted becoming a proctologist.
Re:Eggs in one basket by ScrewMaster · 2007-05-19 03:22 · Score: 1

... less number of bugs is much better than Microsoft, Sun and the others.

Microsoft sure, but Solaris is pretty reliable.

--
The higher the technology, the sharper that two-edged sword.
Re:Eggs in one basket by Anonymous Coward · 2007-05-19 04:11 · Score: 0, Informative

I agree that there isn't enough information. Most likely, the routers ran out of memory. About seven years ago, I worked at an ISP. My boss was just setting up a multihomed connection, and he needed to get BGP working properly on our two Cisco routers. He decided to upgrade them to IOS 12 (if I remember right). Well, he only had 64MB of RAM in our main router and every twelve hours or so it would crash. We called Cisco and it turned out that the combination of the changes in our routing table setup and the larger OS ate up our memory.

Ever since, I've always thought of a Cisco router as being like a Mac in terms of RAM. What do Macs do when they don't have enough RAM/swap? They crash! It doesn't matter if it's OS9 or 10.4.9.
Re:Eggs in one basket by oh_the_humanity · 2007-05-19 04:32 · Score: 1

We are talking about 2-4 thousand routers. i think there redudancy was there. I, however, am willing to blame a routing protocol before I blame the OS. The article unfortunately doesn't talk about which routing protocol they were using. My guess is they were probably using IS-IS.

--
"When they invent bitch slaps that can go through a monitor you better f'ing duck" --deft (253558)
Re:Eggs in one basket by oh_the_humanity · 2007-05-19 04:42 · Score: 1

NM, it's more likely BGP4.

--
"When they invent bitch slaps that can go through a monitor you better f'ing duck" --deft (253558)
Re:Eggs in one basket by frost22 · 2007-05-19 05:08 · Score: 1

I'm afraid you are a little off here.

As I interpret TFA, it was a BGP problem - possibly a failover
situation not handled correctly. Or they (NTT) did some
seriously weird thing with their BGP design.

--
...and here I stand, with all my lore, poor fool, no wiser than before.
Re:Eggs in one basket by rekoil · 2007-05-19 05:24 · Score: 1

Actually, you use both. BGP is an Exterior Gateway Protocol, which gives each router an "exit point" to a given prefix - that is, how to get the packet out of NTT's network to get it where it needs to go (i.e. "send it to Google's peering point at the Tokyo exchange point"). IS-IS, OSPF, EIGRP, etc are Interior protocols, which map out the NTT network so that a given router knows which neighboring router is closer to that exchange point.

Effectively, it's a two stage lookup - BGP will tell you that your grandmother lives in Chicago, but you need IS-IS to tell you which highway to get on.
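
To make the two-stage idea concrete, here is a minimal Python sketch, not how any real router implements it. The prefixes, the exit-point address, and the neighbor names are all made up for illustration, and the longest-prefix match is a plain linear scan rather than anything hardware-like.

```python
import ipaddress

# Hypothetical BGP view: external prefix -> exit point (an internal router address).
BGP_TABLE = {
    ipaddress.ip_network("203.0.113.0/24"): ipaddress.ip_address("10.0.0.9"),
    ipaddress.ip_network("198.51.100.0/24"): ipaddress.ip_address("10.0.1.7"),
}

# Hypothetical IGP view: internal prefix -> neighboring router to hand the packet to.
IGP_TABLE = {
    ipaddress.ip_network("10.0.0.0/24"): "core-west",
    ipaddress.ip_network("10.0.0.9/32"): "core-east",
    ipaddress.ip_network("10.0.1.0/24"): "core-north",
}


def longest_match(table, addr):
    """Return the most specific network in `table` containing `addr`, or None."""
    candidates = [net for net in table if addr in net]
    return max(candidates, key=lambda net: net.prefixlen) if candidates else None


def resolve(destination):
    """Stage 1: BGP picks the exit point. Stage 2: the IGP picks the neighbor."""
    addr = ipaddress.ip_address(destination)
    bgp_net = longest_match(BGP_TABLE, addr)
    if bgp_net is None:
        return None  # no external route known
    exit_point = BGP_TABLE[bgp_net]
    igp_net = longest_match(IGP_TABLE, exit_point)
    if igp_net is None:
        return None  # exit point unreachable inside the network
    return exit_point, IGP_TABLE[igp_net]


print(resolve("203.0.113.5"))
```

In this toy model the /32 for the exit point wins over the covering /24, so the packet goes to "core-east".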
Re:Eggs in one basket by rekoil · 2007-05-19 05:34 · Score: 1

Might have been an EBGP-to-IGP redistribution event - the BGP table carries close to 217,000 routes today, as it's designed to do, but IGPs are only designed to carry at the most tens of thousands of routes, as those routes need far more detailed information on them than BGP routes. Occasionally, due to either a config error or a software bug, the BGP routes will get injected into the IGP (OSPF or IS-IS), and each router's IGP process chokes on the routes, but not before passing them on to the next router, and so on, and so on. It ain't pretty.
Re:Eggs in one basket by sargon · 2007-05-19 08:11 · Score: 2, Insightful

Cisco has some of the most stable operating systems.

You must be using some Cisco OS I don't know about. I am in the process of upgrading 120 Cisco boxes thanks to that "stable operating system."

Junipers are a different matter. MUCH more stable.

Cisco might look bad in this article, but their track record in creating an OS with less number of bugs is much better than Microsoft, Sun and the others.

Riiiiight. Apparently you have never had to deal with Cisco's inability to produce an IOS which doesn't have a BGP bug in it. Or MPLS bug. Or... Well, the list is long.
Re:Eggs in one basket by sirket · 2007-05-19 10:57 · Score: 1

Effectively, it's a two stage lookup - BGP will tell you that your grandmother lives in Chicago, but you need IS-IS to tell you which highway to get on.

This is a terrible analogy. It isn't a two stage lookup- it's a single routing table lookup. BGP populates the routing table with routes it learns from external autonomous systems, and an interior routing protocol like OSPF populates the routing table with routes learned from within the autonomous system itself. Where both protocols know of the same route then the protocol weight determines which route gets added.
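
A rough Python sketch of that selection step, assuming "protocol weight" here means what Cisco calls administrative distance. The distance values are the commonly cited Cisco defaults as far as I know them, so treat them as an assumption; the prefixes and next hops are invented.

```python
# Commonly cited Cisco default administrative distances (assumed, lower wins).
ADMIN_DISTANCE = {
    "connected": 0,
    "static": 1,
    "ebgp": 20,
    "ospf": 110,
    "isis": 115,
    "ibgp": 200,
}


def build_routing_table(candidates):
    """Keep one route per prefix: the one whose protocol has the lowest distance.

    `candidates` is an iterable of (prefix, protocol, next_hop) tuples.
    """
    table = {}
    for prefix, protocol, next_hop in candidates:
        current = table.get(prefix)
        if current is None or ADMIN_DISTANCE[protocol] < ADMIN_DISTANCE[current[0]]:
            table[prefix] = (protocol, next_hop)
    return table


# Hypothetical input: the same prefix learned from OSPF and from an external BGP peer.
routes = [
    ("203.0.113.0/24", "ospf", "10.0.0.1"),
    ("203.0.113.0/24", "ebgp", "192.0.2.1"),
    ("10.1.0.0/16", "ospf", "10.0.0.2"),
]
print(build_routing_table(routes))
```

The forwarding decision afterwards is a single lookup in the resulting table, which is the point being made above.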

That said- if your whole network comes crashing down then you've done something amazingly stupid- like run the same IOS version on every router in your network- or not put enough memory in your router- or god only knows what else.

-sirket
Re:Eggs in one basket by Anonymous Coward · 2007-05-19 15:41 · Score: 0

"i think there redudancy was there"

I think that that pretty much says it all. But, thanks for trying to contribute! We sure appreciate it.
Re:Eggs in one basket by medea · 2007-05-19 23:32 · Score: 1

> Cisco has some of the most stable operating systems.

Ah, yeah. I am not sure, but it seems you never really worked with Cisco gear in a service provider world...

I just have one word for you: CEF-Bug
Re:Eggs in one basket by sloth+jr · 2007-05-20 02:03 · Score: 1

I don't agree the blame is with Cisco, not until I see more evidence. Cisco has some of the most stable operating systems. The cmd line interface can sometimes suck, but their stability is very remarkable. The fault I am guessing is with the ISP for not planning network redundancy and not scaling their networks in time. Cisco might look bad in this article, but their track record in creating an OS with less number of bugs is much better than Microsoft, Sun and the others.

I think you need more time with Cisco equipment to see some nasty failure scenarios and various out-of-memory bugs. While overall, networking hardware is a LOT more reliable than a general purpose OS - it's my belief that a specialized OS like IOS ought to be more stable than it is. The main problem does appear to be OOM bugs - this was experienced on Catalyst 6500s, PIX 535s, Firewall Services modules, and VPN modules.
Re:Eggs in one basket by Eunuchswear · 2007-05-20 02:34 · Score: 1

And the reason for that is that Cisco have the most expensive memory in the universe.

--
Watch this Heartland Institute video

Apparently... by DrYak · 2007-05-18 22:46 · Score: 3, Funny

...They will not be anymore the dot in .jp

--
"Sufficiently advanced satire is indistinguishable from reality." - [Tips: 1DrYakQDKCQ6y52z6QbnkxHXAocMZJE61o ]

JunOS by Anonymous Coward · 2007-05-18 22:49 · Score: 5, Funny

For those that have used JunOS before, I'm sure you are all saying:

"A Juniper router is like my girlfriend.. It will never go down on me."

Re:JunOS by The+AtomicPunk · 2007-05-19 00:55 · Score: 1

Yeesh - sorry man! :)
Re:JunOS by Anonymous Coward · 2007-05-19 01:59 · Score: 0

And for the Cisco people, they're saying:

"A Cisco router is like my girlfriend... expensive, high upkeep costs, requires a guru user to understand and configure... and when it overflows, it crashes with a memory dump."
Re:JunOS by Anonymous Coward · 2007-05-19 02:43 · Score: 0

In that context, Cisco is the true girlfriend since it goes down on you often.
Re:JunOS by Anonymous Coward · 2007-05-19 04:50 · Score: 0

and when it overflows, it crashes with a memory dump

They should use windows instead. Microsoft fixed it a long time ago; it now doesn't crash with the memory dump.
Re:JunOS by sirket · 2007-05-19 11:01 · Score: 1

I take exception to the expensive part-

Juniper M7i with 1 onboard gigabit port and a quad gigabit card (oversubscribed 4:1) and 16Mpps forwarding speed- $52k.

Cisco 7604 with Sup32 and 9 gigabit ports and 15Mpps forwarding speed- $18k.

Juniper definitely makes a better router in many cases- but does it justify paying three times as much?

I love both systems- and I would run Juniper everywhere if I could (for no other reason than the single JunOS software image) but they are just price-prohibitive sometimes.

And if you really want raw speed- you use Foundry.

-sirket

CEF and the routers. by wickedsun · 2007-05-18 22:57 · Score: 5, Informative

I think it's funny. Usually, when you open Cisco TAC about a "faulty" router not forwarding traffic anymore, Cisco will tell you it's your config's fault if it's not working properly.

Usually what happens is that the router doesn't have enough memory to store all the CEF (Cisco Express Forwarding) info, causing the router to not forward packets for certain subnets. I've seen it happen often enough to know. While Cisco is right that the problem is caused by a lack of memory for the config, I think it shouldn't stop forwarding the packets altogether (as in, stop using CEF if the table gets out of hand).

While I think Cisco is not completely to blame (badly scaled networks, not upgrading routers in time), it sucks that this will hit them. There are better solutions out there, but I have to say that Cisco's support is quite good and they're pretty fast. I work in an all-Cisco environment (for the routers) and they've been fast whenever we needed a router analyzed.

Re:CEF and the routers. by Anonymous Coward · 2007-05-19 00:37 · Score: 1, Interesting

If that were the only breakage in CEF...
I have a Cisco with a complex config with tunnels and there really is no way it will work reliably with CEF enabled.
Re:CEF and the routers. by oh_the_humanity · 2007-05-19 04:34 · Score: 1

Sounds to me like, unless you're dealing with a massive network, you are not summarizing your subnets well.

--
"When they invent bitch slaps that can go through a monitor you better f'ing duck" --deft (253558)
Re:CEF and the routers. by wickedsun · 2007-05-19 14:17 · Score: 1

I would think that the Japanese backbone qualifies as a "massive network" ;)

Underspec routers by ReidMaynard · 2007-05-18 22:59 · Score: 5, Interesting

Phrases like

The routing table rewrite overflowed the routing tables

and

router capacity was partly responsible for the failure

leads me to think this was a problem which was probably reported numerous times to middle management and perpetually postponed.

--
-- www.globaltics.net

Political discussion for a new world

Re:Underspec routers by thogard · 2007-05-19 05:01 · Score: 1

leads me to think this was a problem which was probably reported numerous times to middle management and perpetually postponed

Whose middle management? Cisco's?

Their routers have been perpetually running out of memory for reasonable routing tables since at least 1992.

Properly Filtering Prefixes by Anonymous Coward · 2007-05-18 23:04 · Score: 5, Informative

"The routing table rewrite overflowed the routing tables and caused the routers' forwarding process to fail, the CIBC report states"

Ok.. That says to me that their routing tables got really big, the routers ran out of memory... Or.. they had a prefix limit set, and it kept dropping the BGP session(s)...

If either of the above is true, properly designed filtering of the prefixes they send/receive to their BGP neighbors would have resolved this outage... It sounds like someone may have been incompetent, and they are trying to pawn off the "ownership" of this outage on Cisco.
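
As a toy illustration of both mechanisms mentioned here (a prefix filter and a prefix limit), a Python sketch follows. The policy values and the function name are hypothetical, and real routers express this in their own configuration language, not like this.

```python
import ipaddress

MAX_PREFIX_LEN = 24  # hypothetical policy: ignore anything more specific than /24
MAX_PREFIXES = 5     # hypothetical per-neighbor limit, tiny so it is easy to trip


def accept_prefixes(announcements, max_len=MAX_PREFIX_LEN, max_prefixes=MAX_PREFIXES):
    """Filter a neighbor's announcements, then enforce a prefix limit.

    Too-specific prefixes are dropped silently (the filter). If the number of
    accepted prefixes still exceeds the limit, raise, standing in for the
    session being torn down.
    """
    accepted = []
    for text in announcements:
        network = ipaddress.ip_network(text)
        if network.prefixlen > max_len:
            continue  # filtered out, never reaches the routing table
        accepted.append(network)
        if len(accepted) > max_prefixes:
            raise RuntimeError("prefix limit exceeded; session would be dropped")
    return accepted


print(accept_prefixes(["10.0.0.0/8", "192.0.2.0/25", "198.51.100.0/24"]))
```

The filter keeps the table small without any drama; the limit is the blunt backstop that drops the session once the filter is not enough.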

Either that, or it's a major IOS bug, and the article's author just sucks and didn't mention that..

Should have used Junipers by Anonymous Coward · 2007-05-18 23:15 · Score: 3, Insightful

Being a current CCIE, and having extensive experience with both vendors' boxes, I wouldn't use anything other than a Juniper for core infrastructure, and I'm never going back to Cisco kit..

To be fair Cisco is untouchable in the enterprise class with their CPE's..

Re:Should have used Junipers by Anonymous Coward · 2007-05-19 03:18 · Score: 5, Interesting

We're a Cisco shop but are seriously looking into Juniper due to some negative service-impacting experiences. Juniper, especially M series, looks like it was designed very intelligently from the ground up with superior hardware architecture with separation of routing engine/packet forwarding/control plane, much more powerful CLI/config error checking/timed roll-back, wire rate granular filtering, one train of code to follow and so on. Unlike Cisco 6500/7600 Sup720-3B/3BXL with hardware limitation of 256K and 512K IPv4 routes respectively, even Juniper's older M20 platform has been tested with upwards of 1 million routes. As for stability, Juniper is found in the core of most service providers, government, academia and research (Internet2 high speed network http://www.abilene.iu.edu./). I see Juniper as the Unix of routers and Cisco the Windows of routers. If you desire stability, security, performance and flexibility go Juniper. Cisco still has a place such as in enterprises that still run legacy IPX.
Re:Should have used Junipers by Anonymous Coward · 2007-05-19 04:17 · Score: 1, Insightful

I run a 10Gig network with 0 Cisco products in it. That said, Cisco needs a little defending based on your statements:

one train of code to follow and so on.

Juniper has different code loads for M/T series vs. J-series vs. E-series. The nice thing is that there's only 1 load per series.

Unlike Cisco 6500/7600 Sup720-3B/3BXL with hardware limitation of 256K and 512K IPv4 routes respectively

The Cisco 6500 is a layer3 switch, not a router. The 7600 was designed by Cisco to be an edge aggregation router, not a core device, which I imagine had something to do with the low number of routes supported (which I agree is ridiculous). Juniper doesn't (yet) make switches (or rather, they aren't avail for mass consumption yet). We should probably keep this apples to apples.

I see Juniper as the Unix of routers

Could that be because JunOS is based on FreeBSD, and it even indicates such from the console during system boot? Likening IOS to Windows though, is just unfair.

If you desire stability, security, performance and flexibility go Juniper. Cisco still has a place such as in enterprises that still run legacy IPX.

So, you don't do ANYthing currently with Juniper, but you're endorsing them based on 3rd-party info? Hrm. And some might argue that routers such as the CRS-1 'do' have their place in the carrier market.

Junipers are great, though tremendously overpriced in my opinion (we have several T640s) for what they do. Companies like Foundry are creating similarly redundant software/hardware architectures for orders-of-magnitude less, and support 40 and 100Gbps already (just as an example...I'm no Foundry fan-boy).
Re:Should have used Junipers by jack_csk · 2007-05-19 05:17 · Score: 1

IIRC, JunOS is actually a Unix (FreeBSD variant).
Re:Should have used Junipers by Anonymous Coward · 2007-05-19 08:18 · Score: 0

As for stability, Juniper is found in the core of most service providers, government, academia and research

Yikes! You're forgetting about AT&T, Sprint, Comcast, BT, NTT, KT, CT when you say most service providers are Juniper.

The NW story is too vague to rule out human error by James+Youngman · 2007-05-18 23:27 · Score: 4, Informative

On the basis of the information in the NW article, I can't make out what the general nature of the alleged fault is on the "faulty" routers. I get that some routing table size limit was exceeded. But what was the nature of the problem?

Did a manual change exceed a design limit? If so, why wasn't the manual change rejected? (If it was rejected, that's not a fault, it's user error)
Did an automatic change (like fail-over) applied to a valid configuration produce an invalid one? If so, did the routers report this, generate some kind of trap or alarm? If so, I guess the problem is a bit nebulous; maybe a monitoring failure, but maybe the system could have issued warnings that certain kinds of possible failover could exceed implementation limits. Hard to know without more detailed information.
Did an automatic change silently produce the wrong result (like forwarding some traffic and not other traffic) *without* generating a trap or alert? If so, I would certainly call this a fault (bug). But the article doesn't contain enough information to point conclusively in this direction.

The event is big news, so I guess NW felt they had to say *something*. But while I'm no big fan of Cisco gear, it looks to me that the explanation is as likely to be human error as equipment faults or bugs. One potential cause of problems in big routers is that the high-level software's view of the state of the routing engine gets out of sync with the actual state of the ASICs. I wonder if that happened here. My guess is that once more details of the incident emerge it will turn into a not-news story.

nice work blaming cisco by ctime · 2007-05-18 23:28 · Score: 3, Insightful

I think that it's great when a company is blamed for having poor products when the problem is really the company using them (in this case, NTT). The way the article is presented seems unfairly biased. The problem isn't with Cisco products here but the lack of knowledge on scaling them properly. The headline is similar to saying something like "Ford Motor company cars involved in most car accidents historically". A properly designed network with just about any vendor, especially Cisco, would have avoided this issue.

We all know who the real culprit is by Anonymous Coward · 2007-05-18 23:32 · Score: 0

Hacked by Chinese kekeke ^_^

Also.. by niceone · 2007-05-18 23:36 · Score: 3, Insightful

Cisco routers to blame for most of the rest of the internet's non-outage.

--
ccalam - acoustic versions of new songs.

TCAM exhaustion by anticypher · 2007-05-18 23:46 · Score: 5, Informative

This was certainly a problem with slightly older Cisco kit, such as 6500s with Sup720a cards. Their TCAM memory (that holds prefix+destination tuples in a form of cache) overflows as the internet approaches 245,000 routes. Once there is no more space in TCAM, many Cisco architectures fall back to processor routing. That means that when traffic that was switched in hardware starts hitting the CPU, the box falls over whimpering for mercy.
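
A deliberately crude Python model of the failure mode described here, purely as an illustration: a fixed-size hardware table, with everything that does not fit punted to a CPU path. The class name and capacity are invented, prefixes are matched as exact strings rather than by longest-prefix match, and real platforms differ in how (and whether) they fall back.

```python
class ToyForwarder:
    """Fixed-capacity 'hardware' table with a software fallback (toy model only)."""

    def __init__(self, tcam_capacity):
        self.tcam_capacity = tcam_capacity
        self.hardware = {}  # stands in for TCAM entries
        self.software = {}  # routes that overflowed to the CPU path

    def install(self, prefix, next_hop):
        """Install into hardware if there is room, otherwise into software."""
        if prefix in self.hardware or len(self.hardware) < self.tcam_capacity:
            self.hardware[prefix] = next_hop
        else:
            self.software[prefix] = next_hop

    def path_for(self, prefix):
        """Report which path would forward traffic for an installed prefix."""
        if prefix in self.hardware:
            return "hardware"
        if prefix in self.software:
            return "cpu"
        return None


box = ToyForwarder(tcam_capacity=2)
for i, prefix in enumerate(["10.0.0.0/8", "192.0.2.0/24", "198.51.100.0/24"]):
    box.install(prefix, f"neighbor-{i}")
print({p: box.path_for(p) for p in ["10.0.0.0/8", "198.51.100.0/24"]})
```

The point of the model is only that the box keeps "working" after the table fills, while an ever larger share of traffic lands on the slow path.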

If NTT had been following Cisco mailing lists, or keeping up to date on what their salesmen had been telling them for several years, they would have seen this problem looming and changed their routing structure or at least upgraded the processors for something with slightly more TCAM. The size of the internet is not going to stop growing because many companies chose to go with underpowered Cisco kit. The internet will continue to grow by 12,000 to 17,000 routes per month, accelerating over the next few years as IPv4 space becomes exhausted and de-aggregation becomes the norm.

This is one of my long standing grudges about Cisco design. They always are designing their core routers to be just slightly ahead of the size of the internet, forcing people to upgrade within a few years. Designed obsolescence is the term. Even their new CRS1 platform will fail over to CPU near 512,000 routes (0x80000), or sometime around the end of 2008 to mid 2009. By then, they'll probably have an expensive upgrade path for customers that will hold for just another year or two.

It's not just Cisco kit that is going to have problems over the next few months. By the end of June the internet will be at 256,000 routes (really 262,144 or 0x40000), which will be a problem for some other manufacturers. Some are starting to fail at 0x3C000 (245,000) routes, some already failed at 0x30000 last year.

On the plus side, the OpenBGPd crowd doesn't suffer from this, since their code is all CPU switched (but using very clever and efficiently coded routing tables) so their routing table is limited only by memory. But an OpenBGPd machine will never have the raw efficiency of a VLSI based hardware solution.

A quick look at my local looking glass shows 233,979 routes on the internet this morning.
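
For anyone who wants to check the hex limits and the timeline, a back-of-the-envelope Python sketch using only the numbers quoted in this comment. It assumes straight-line growth, which is a simplification (the comment itself expects growth to accelerate), and the helper name is made up.

```python
CURRENT_ROUTES = 233_979             # looking-glass figure quoted above
GROWTH_PER_MONTH = (12_000, 17_000)  # growth range claimed above

# Table-size limits mentioned in the comment, as powers-of-two-ish hex values.
THRESHOLDS = {
    "0x30000": 0x30000,
    "0x3C000": 0x3C000,
    "0x40000": 0x40000,
    "0x80000": 0x80000,
}


def months_until(limit, current, growth_per_month):
    """Months of linear growth until `limit` is reached (0.0 if already past)."""
    return max(0.0, (limit - current) / growth_per_month)


for name, limit in THRESHOLDS.items():
    slow = months_until(limit, CURRENT_ROUTES, GROWTH_PER_MONTH[0])
    fast = months_until(limit, CURRENT_ROUTES, GROWTH_PER_MONTH[1])
    print(f"{name} = {limit:,} routes: {fast:.1f} to {slow:.1f} months away")
```

With those inputs, 0x40000 (262,144) is roughly 1.7 to 2.3 months out and 0x80000 (524,288) roughly 17 to 24 months out, which lines up with the "end of 2008 to mid 2009" estimate.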

the AC

--
Hemos is like...sci-fi fans;he thinks technology is cool, but he hasn't bothered to understand the science it's based on

Re:TCAM exhaustion by swmike · 2007-05-18 23:57 · Score: 1

The CRS-1 is tested with at least 2M IPv4 routes. It should be enough to 2015 or more.
Re:TCAM exhaustion by jacksonj04 · 2007-05-19 00:09 · Score: 2, Insightful

So basically what you're saying is there was insufficient routing capacity on the network causing it to fail? Well, I'm shocked!

Seriously though - would you try to run a datacentre on a home router from NetGear? If I did and the network fell over in a fiery mass of routing tables I wouldn't say NetGear was to blame for building a bad router. I'd blame the network architect who thought they could shove hundreds of servers through a 5-port-with-wifi device.

--
How many people can read hex if only you and dead people can read hex?
Re:TCAM exhaustion by anticypher · 2007-05-19 00:27 · Score: 4, Informative

The CRS-1 is tested with at least 2M IPv4 routes.

It appears to be four separate instances of 512K routes, the total is for MPLS customers shoving full BGP tables into their mesh. With more than 8 MPLS customers doing screwy things today, the box starts hitting its CPUs. I haven't received a denial from the CRS-1 guys, just some hand waving and a promise to look into it. Implications that a better config would help haven't actually produced an example of what to do, and the XR code is just different enough to hide underlying architecture deficiencies. The other problem is that every CRS-1 seems to be put into production before engineering has time to play with them and learn their tricks. Given time, all kinds of clever designs for XR code will spread around, just as there are tricks of the trade the most experienced IOS-based engineers grok.

It should be enough to 2015 or more.

And 640k should be enough for everyone. Seriously, I keep running across 2500s still doing their thing, but not as core BGP routers. So the CRS-1 platforms may quite well be running tucked into edges in 2015. Bean counters love kit that has amortised many times over.

the AC

--
Hemos is like...sci-fi fans;he thinks technology is cool, but he hasn't bothered to understand the science it's based on
Re:TCAM exhaustion by Professor_UNIX · 2007-05-19 00:39 · Score: 2, Funny

The size of the internet is not going to stop growing because many companies chose to go with underpowered Cisco kit. The internet will continue to grow by 12,000 to 17,000 routes per month, accelerating over the next few years as IPv4 space becomes exhausted and de-aggregation becomes the norm.

It sounds like what we need is legislation to enforce some hard limits on the growth of Internet routing tables in order to avoid these kinds of DoS attacks in the future. If we lobby Congress now we can hopefully avoid these disastrous consequences from reaching the United States.
Re:TCAM exhaustion by swmike · 2007-05-19 01:04 · Score: 2, Informative

If you're running four full-bgp VPN-customers in your core routers that can handle 2M routes, you've done a major design mistake and that's not the fault of the hardware manufacturer.

Regarding routing table growth, hopefully IPv6 might stifle that a bit as we're going to be running out of IPv4 space in the next 3-5 years and IPv6 space is allocated in much larger blocks requiring fewer routes.
Re:TCAM exhaustion by Anonymous Coward · 2007-05-19 03:36 · Score: 0

It sounds like what we need is legislation to enforce some hard limits on the growth of Internet routing tables

Yeah, you could lump it in with the legislation to increase the speed of light, repeal the laws of thermodynamics, and set Pi equal to 3.
Re:TCAM exhaustion by Anonymous Coward · 2007-05-19 03:44 · Score: 1, Interesting

TCAM (tertiary memory) exhaustion sounds plausible. We were looking into 6500/7600 to upgrade our 7200 platforms and were quoted Sup720-3B supervisor/routing engine. Being doubtful of Cisco these days I did some double checking on my own since they seem to have put their focus on being a marketing gorilla instead of a technological leader. It turns out Sup720-3B has limited TCAM memory that only supports 256K IPv4 routes and even fewer IPv6 routes. The current BGP routing table is just shy of that mark. One important thing to keep in mind is that other functions and features within the router will also use up TCAM so do your research so that it doesn't bite you hard in the butt where your only fix is a hardware upgrade. I don't know if they were just uninformed or if they were trying to pawn off near obsolete hardware on us and forcing us to upgrade in the near future. We're now looking into Juniper and as I found out even their old M20 platform was tested with upwards of 1 million routes. Juniper seems to be a much superior platform all-around.
Re:TCAM exhaustion by Anonymous Coward · 2007-05-19 05:03 · Score: 0

Correction: TCAM is Ternary Content Addressable Memory
Re:TCAM exhaustion by thogard · 2007-05-19 05:07 · Score: 2, Insightful

Yep, limit it to 16,777,216 /24 routes. That will fix it. If your router has 16 interfaces, you can do this with 8MB of cache RAM to make the quick decisions and whatever else you need to process the routes.
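
The 8 MB figure checks out if each /24 gets a 4-bit interface number: 2^24 entries at half a byte each. A small Python sketch of such a direct-indexed table follows; the function names and the nibble packing are my own choices for illustration, not anything a vendor ships.

```python
import ipaddress

ENTRIES = 2 ** 24                                # one slot per possible /24
INTERFACES = 16
BITS_PER_ENTRY = (INTERFACES - 1).bit_length()   # 4 bits address 16 interfaces
TABLE_BYTES = ENTRIES * BITS_PER_ENTRY // 8      # 8,388,608 bytes = 8 MiB

table = bytearray(TABLE_BYTES)  # every /24 starts out pointing at interface 0


def slash24_index(address):
    """Top 24 bits of an IPv4 address, used directly as the table index."""
    return int(ipaddress.IPv4Address(address)) >> 8


def set_interface(table, address, interface):
    """Store a 4-bit interface number for the /24 containing `address`."""
    if not 0 <= interface < INTERFACES:
        raise ValueError("interface out of range")
    byte_pos, odd = divmod(slash24_index(address), 2)
    if odd:
        table[byte_pos] = (table[byte_pos] & 0x0F) | (interface << 4)
    else:
        table[byte_pos] = (table[byte_pos] & 0xF0) | interface


def get_interface(table, address):
    """Single array read, no searching: the 'quick decision' part."""
    byte_pos, odd = divmod(slash24_index(address), 2)
    return (table[byte_pos] >> 4) if odd else (table[byte_pos] & 0x0F)


set_interface(table, "192.0.2.0", 7)
set_interface(table, "192.0.3.0", 12)
print(TABLE_BYTES, get_interface(table, "192.0.2.99"), get_interface(table, "192.0.3.1"))
```

The trade-off is that the table is the same size whether it holds ten routes or sixteen million, and anything more specific than a /24 cannot be represented at all.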
Re:TCAM exhaustion by anticypher · 2007-05-19 05:14 · Score: 1

Everyone using older (as in two or three years old) Cisco kit has been dumping their Sup720-A or 3B cards on the used market. The price of those cards has completely collapsed. There is the -3BX card, which can handle 393,000 routes in TCAM, but they'll be obsolete by the end of 2008.

If you have a design where a 6500 or 7600 isn't doing core routing, somewhere out on the edge, just buy the chassis and line cards from Cisco, and pick up one of the TCAM-poor routing engines for less than 5% of GPL.

Juniper has no used kit market, because every one of their M or T series routers can handle more and more routes depending on how much high speed RAM you throw in it. M5's put into service years ago can still handle today's internet without the slightest problem.

the AC

Cisco naming of their products may be wrong because I'm far away from where I could look at their product lines and nomenclature.

--
Hemos is like...sci-fi fans;he thinks technology is cool, but he hasn't bothered to understand the science it's based on
Re:TCAM exhaustion by Anonymous Coward · 2007-05-19 06:45 · Score: 0

Everyone using older (as in two or three years old) Cisco kit has been dumping their Sup720-A or 3B cards on the used market.

This is very very very timely and interesting. When we bought our Cisco 7609 with two sup720s in it several years ago we figured we'd be good for at least 10+ years considering how expensive the kit was, but now I've got to seriously consider upgrading the damn supervisors already!? Looks like I'd need to move to the WS-SUP720-3BXL to be safe up to a million IPv4 routes.
Re:TCAM exhaustion by anticypher · 2007-05-19 07:12 · Score: 3, Interesting

you've done a major design mistake

Not one of MY designs, but you are right about the mistake part. I know of a carrier with CRS-1s struggling with a poor design coupled with an out of control sales force that will not ever say "NO!" to a customer doing bad things to their MPLS service. That's the origin of the idea of a maximum of four instances of 512K routes in 4 separate TCAMs per chassis (or per line card, or per virtual machine, or something). Not really my job any more, so I learn this over beers next to the data centre and extend my sympathies to those stuck in the Cisco world.

hopefully IPv6 might stifle that a bit

Well, the IPv6 table is ~850 routes right now, growing by 10 to 20 new routes per month. Just like the early days of the internet as BGP rolled out. Now I can toss out the obligatory "You kids get off my LAN".

Problems are already starting to be seen by the RIRs, where speculative companies have started grabbing IPv4 allocations with no intention of using them, betting on a market for buying and selling prefixes and forcing the RIRs out of business. Exactly what happened to the DNS market when it became apparent that second level domains could be rented for yearly fees for a large profit.

If companies start buying and selling prefixes in an unregulated free market frenzy, aggregation will become a fond memory and expect every router to need several Gigabytes to hold the 2 million+ routes on the old IPv4 internet. At RIPE meetings, there is a hope that this is a worst case scenario, but it seems to be a business plan for some less altruistic people at ICANN.

the AC

--
Hemos is like...sci-fi fans;he thinks technology is cool, but he hasn't bothered to understand the science it's based on
Re:TCAM exhaustion by sjames · 2007-05-19 11:19 · Score: 1

The thing is, 4GB would be enough to handle routing even if each single IP address was assigned randomly (that is, if everyone was allocated nothing but /32s and it was done such that no aggregation at all was possible). 4GB is not THAT huge these days. There's not really a great reason (other than planned obsolescence) not to be fully future proof.

Exactly by Megane · 2007-05-19 00:37 · Score: 1

The problem was that the internet had grown beyond the capacity of their core routers, hence the core router upgrade that was "in progress". The headline should actually read:

OLD Cisco Routers to Blame for Japan Net Outage (with only one 't' in "Outage", just as in TFA!)

Hey folks, don't stop CIDR'ing routes just because there seems to be enough routing table space "right now"!

--
#naabhaprzrag, #sverubfr-000, #agi-fcbafberq, negvpyr[pynff*=' negvpyr-ary-'] { qvfcynl: abar !vzcbegnag; }

Re:Exactly by FromellaSlob · 2007-05-19 01:12 · Score: 1

Since the routers would have continued to do fine if they had been used within their design specifications, surely it should actually be:

Failure to Invest to Blame for Japan Net Outage

Or, more to the point:

Management Idiocy To Blame for Japan Net Outage

Evidently, Japan has its share of PHBs too.

mmm... yeah by The+AtomicPunk · 2007-05-19 01:00 · Score: 1

I read this as human error and lack of planning that some admins have successfully blamed on hardware.

In other news... by Anonymous Coward · 2007-05-19 01:01 · Score: 0

Ford to blame for all car crashes involving Fords...

Smith & Wesson to blame for misuse of firearms...

Yawn... Nothing to see here... move along...

Re:The NW story is too vague to rule out human err by Anonymous Coward · 2007-05-19 01:16 · Score: 1, Informative

I have a friend who works for Cisco, setting up and configuring routers remotely via telepresence, all over the world. I know that lately they have been working on some routers in Japan.

*IF* (and I don't know for sure that this is the case, so it's a big IF) the project this person has been working on relates to the problem in this story, then I would say that your guess of 'human error' is likely a very large part of what happened.

I say this because my friend has filled me in on some of the stories relating to language barriers during the work they are doing. While the stories are very humorous (and that was their sole reason for telling me said stories), I can see where they would easily contribute to a situation where mistakes could be made.

From the stories, it is extremely evident that English is *very* much a second language on the Japan-side of things. To the point that it would be smart to have a person there whose first language and culture would be that of the company employees who were doing the install.

I'd guess that the lesson here is that globalization has a lot of bugs to be worked out still... :)

250k is lame, I just tested 1.1m on a Juniper by Anonymous Coward · 2007-05-19 01:27 · Score: 2, Informative

250K is quite lame. I just tested a bit over 1 million installed routes on a Juniper in my lab. ... a 5 year old m-series at that.

This is no shock: Juniper's first innovation was the use of high speed RAM rather than TCAM for route lookup, so route table scaling has never been a problem for them...

Marketing on the other hand... geesh those cartoons still give me nightmares.

Re:250k is lame, I just tested 1.1m on a Juniper by frost22 · 2007-05-19 05:18 · Score: 2, Informative

IIRC we tested amounts like that even in well equipped 75ers.

As others already noted, the 6500/7600 is a switch with limited routing capabilities. You use it as a core router at your own risk (and peril).

--
...and here I stand, with all my lore, poor fool, no wiser than before.

Zawnk, eye neede yore hellpe... by dour+power · 2007-05-19 01:37 · Score: 1

...to ficks mye badd speling. Cant seam two finde "outtage" (orr "expirey") inn mye dikshunairy...

operators to blame for japan net outtage .. by rs232 · 2007-05-19 02:26 · Score: 1

'routers went down .. after a switchover to backup routes triggered the routers to rewrite routing tables'

"At this time, Cisco and NTT have not determined the specific cause of the problem"

--
davecb5620@gmail.com

Re:Cisco is a greedy corporation by Anonymous Coward · 2007-05-19 03:30 · Score: 0

Did you mean: "FUCK DSLAM"

girlfriend? by alexandreracine · 2007-05-19 04:32 · Score: 1

For those that have used JunOS before, im sure are all saying. "A Juniper router is like my girlfriend.. It will never go down on me."

Are you saying that /. readers have girlfriends here?

--
No sig for now.

Re:girlfriend? by Anonymous Coward · 2007-05-19 04:59 · Score: 0

Have you stopped beating your wi...? Never mind.

why all full routes ? by oh_the_humanity · 2007-05-19 05:07 · Score: 1

Why would it be necessary to run full BGP routes in all of your routers anyway? Couldn't you run partial routes in the edge routers and save your full routes for your core and peering points? 2-4 thousand routers is a lot to go down at once.
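
To make the partial-routes idea concrete, here is a rough sketch of how an edge box could get by with a default route plus a handful of specifics, using longest-prefix match. Everything in it (the `lookup` helper, the next-hop names, the prefixes) is hypothetical and only meant to illustrate the concept, not how any real router implements forwarding.

```python
import ipaddress


def lookup(table, destination):
    """Return the next hop of the longest (most specific) matching prefix."""
    addr = ipaddress.ip_address(destination)
    matches = [(net, hop) for net, hop in table.items() if addr in net]
    if not matches:
        return None  # no route at all, not even a default
    _, hop = max(matches, key=lambda m: m[0].prefixlen)
    return hop


# Hypothetical edge router: two entries instead of a full table.
edge_table = {
    ipaddress.ip_network("0.0.0.0/0"): "core-1",            # default toward the core
    ipaddress.ip_network("203.0.113.0/24"): "customer-a",   # locally attached customer
}

if __name__ == "__main__":
    print(lookup(edge_table, "203.0.113.7"))   # matches the specific prefix
    print(lookup(edge_table, "198.51.100.9"))  # falls through to the default
```

The trade-off is that the edge router can no longer pick the best exit on its own; it just hands everything unknown to the core, which still has to carry the full table.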

--
"When they invent bitch slaps that can go through a monitor you better f'ing duck" --deft (253558)

Having worked at Cisco, I strongly disagree by Anonymous Coward · 2007-05-19 05:49 · Score: 3, Interesting

You're doing an apples and oranges comparison. Cisco's IOS is far more dedicated to a specific set of tasks than the other notable OSes. So yes, one would expect far fewer bugs to be visible. That doesn't mean they aren't there; just that they haven't been discovered.

Having worked at many of the companies which supply OSes, Cisco is, IMHO, the worst. They go for lots of cheap talent. The common theme is to hire lots of low paid talent rather than focusing on getting the best and the brightest. And it shows. Things which shouldn't happen, do. And the general level of code quality is below average.

The general development infrastructure sucks badly as well. So much so, that they've actually developed bandaids to make it semi-palatable.

This isn't to say that they don't have some good talent there. They do. But they are a minority, and are hindered by the general red-tape which keeps those folks from having a greater impact.

Sun, on the other hand, had the best development environment, talent and infrastructure that I've ever seen, back in the 90's. I've heard that things have fallen off a bit since then, but I really can't say.

Anyway, the bottom line here is that I wouldn't at all be surprised if Cisco screwed up on the basics. The cheap talent is biting them daily in ways the top management can't see, and it all adds up eventually. Things like this are to be expected, and I also expect it to get worse over time, not better.

Don't blame Cisco... blame the Monitoring by Allnighterking · 2007-05-19 06:39 · Score: 4, Insightful

I'm sorry, but I've got a total of 3 data centers and I can tell you when, where, and which router/pix/switch has a problem. I do it by not only monitoring the item itself (this is what everyone should do) but also by monitoring "through" the item in question. I monitor a specific point on the other side (or in some cases a set of points), not to find out if that point is good, but to find out if the path is good. 3 things have to be monitored.

1. Local status (Am I alive)

2. Path (can I get from me to you, what is the quality of the path?)

3. End point (are you there?)
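
The three checks above can be sketched as a simple triage function. This is a toy illustration only: the `Probe` record and `diagnose` function are hypothetical names, and a real setup would feed them from whatever polling or ping tooling you already run.

```python
from dataclasses import dataclass


@dataclass
class Probe:
    """Result of one monitoring pass (hypothetical structure)."""
    local_ok: bool      # 1. Local status: is the device itself alive?
    path_ok: bool       # 2. Path: can traffic get through the device?
    endpoint_ok: bool   # 3. End point: does the far side answer?


def diagnose(p: Probe) -> str:
    """Report the first failing layer, checked from nearest to farthest."""
    if not p.local_ok:
        return "local device down"
    if not p.path_ok:
        return "path broken"
    if not p.endpoint_ok:
        return "endpoint down"
    return "healthy"


if __name__ == "__main__":
    print(diagnose(Probe(local_ok=True, path_ok=False, endpoint_ok=False)))
```

Checking nearest-first means a dead endpoint behind a broken path is reported as a path problem, which is the point of monitoring "through" a device rather than just at it.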

If at any time you let the number of paths and interconnects overwhelm you, get a new job. You've lost control. Draw pictures of the network. When you have an outage, start looking immediately at what you have connectivity with and what you don't. Large data centers can get complex in their interconnects. Divide it up into "blocks", verify a block and move on.

The biggest problem in a situation like this is that I'm willing to bet the techs were wasting their time trying to figure out why the network went down. Who cares why. You need to quickly assess what is down, what you can do, and what you can't do. You need to know what is normal and what is not. If you don't, a situation like this can happen.

The worst thing that can happen is if the network is divided into "territories." Usually in a case like this people spend more time trying to blame the other guy than they do finding the cause of the problem. Finally, design. Somewhere along the line some pencil pusher decided that a single point of failure was economically feasible. The techs were willing to follow right along like sheep, and the Sr Admin played politics and didn't rock the boat.

In the end, the techs blew it. The after action report and follow up will tell the final tale.

--

I'm sorry, I'm too tired to be witty at the moment so this message will have to do.

Re:Is this a bad thing? by Anonymous Coward · 2007-05-19 08:15 · Score: 0

WTF did parent get modded as a troll? Remember the 30% decrease in spams back when the earthquake earlier this year knocked out most of southeast Asia? I agree; keep them offline.

Um... Force10 Networks, people... by Anonymous Coward · 2007-05-19 08:39 · Score: 0

Why do people still buy Cisco's junk anyway? Ugh. Grow up, they're dinosaurs in the marketplace now! force10networks.com Be happy there's a new networking player in town that trumps its predecessors.

Re:Um... Force10 Networks, people... by medea · 2007-05-19 23:28 · Score: 1

CMIIW, but Force10 produces some bad-ass 10G switches which are being adopted more and more at IXPs, but I have not heard of them producing core internet routers yet...

Call L by Anonymous Coward · 2007-05-21 11:51 · Score: 0

Someone wrote "Cisco" in their Death Note!

Slashdot Mirror

78 comments