Worldcom's Frame Relay Down

Posted by justin++ on Saturday August 14, 1999 @06:57PM from the most-of-it? dept.

Jim Trocki writes "MCI/Worldcom's frame relay network has been hosed for at least 8 days now. Read the story. This is the recorded message that is heard on Worldcom's tech support line: 'In accordance with our plan to repair the instability in one of our frame relay network platforms, we have taken our domestic frame relay platform out of service for a 24 hour period, from noon Saturday to noon Sunday Eastern standard time. As a result, your frame relay service will not be available for traffic.' " Here is the MCI Worldcom web page on the situation. The news.com article says that this outage might cut into their profits. It seems this is quite a severe outage...

It makes me happy. by Anonymous Coward · 1999-08-14 16:15 · Score: 3

I love news stories like this... a few weeks ago a server in our company got 'sploited by a script kiddie over the weekend. I recovered from backups before Monday, but the hack was fairly high profile anyway since I made everyone change their passwords. OH HEAVEN FORBID ANYONE SHOULD HAVE TO LEARN A NEW PASSWORD - EVER! I got no end of grief, until McAfee's site got hacked. Now this.

Stories like these give me someone to point at, saying "See? This computer stuff is goddamn hard. If a multibillion dollar corporation specializing in networking can't bring their network up, how the hell can I do anything on $58K per year? Gimme a raise!"

Re:How nice. by dattaway · 1999-08-14 16:33 · Score: 3

Yes, it will get worse. When all the little companies merge into One Big Corporation (it can't sue itself) there will be no competition and it can honestly market itself as the number one service. I prefer lots of little companies, like little servers, rather than One Big Company. It's One Big FU waiting to happen...

Here's a thought by unitron · 1999-08-14 17:29 · Score: 3

Here's a thought. They're running Lucent hardware. The trouble apparently is related to a recent software upgrade from Lucent. Same upgrade on other Lucent customers not causing problems. Lucent used to be Bell Labs (sort of). Bell Labs used to be part of AT&T (sort of). AT&T and MCI are competitors big time. Coincidence or conspiracy? (it's the newest Ludlum novel, "The Lucent Gambit")

--
I see even classic Slashdot is now pretty much unusable on dial up anymore.

How it happened. / Why is it not fixed? by RISCy+Business · 1999-08-15 02:56 · Score: 5

Okay, I'm intimately familiar with the situation at this point; I've been assisting a former employer in working towards a temporary emergency resolution since they're on MCI/WorldCom. Anyways, here's what appears to have happened. Please note the following disclaimer:

I do not speak for MCI/WorldCom or Lucent, I do not work for MCI/WorldCom or Lucent, I have no affiliation with MCI/WorldCom or Lucent, and I do not have any sort of business relationship with MCI/WorldCom or Lucent.

That said to cover my ass, here's what appears to have happened.

MCI/WorldCom has had capacity issues since mid-97. When it was just MCI, they stopped selling DS3's for a period of time a few years back because they simply didn't have the capacity. MCI has long had capacity issues, and as a direct result, has typically run their equipment at or near capacity. What appears to have happened is a cascade failure. I'm going to try and put this into words, but it's easier with pictures. Trust me.

What happens in a cascade failure? A network, at or near capacity, has a failure in a single core router for some reason; in MCI/WorldCom's case, a failure due to software. The load from that core is quickly distributed to the remaining core routers. These remaining core routers, being at or near capacity, almost immediately give way under the load, failing for various other reasons triggered by the excess load. As each router fails down the line, the load on each remaining router climbs faster and faster, cascading into a full network outage.
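
For anyone who wants the cascade in something closer to pictures, here is a toy Python sketch. It is not how MCI/WorldCom's or Lucent's gear actually behaves; the function name simulate_cascade, the capacity and load numbers, and the assumption that a dead router's load is split evenly across the survivors are all made up for illustration.

```python
def simulate_cascade(capacities, loads, first_failure):
    """Return the order in which routers fail once `first_failure` goes down.

    Assumption (hypothetical model): a failed router's load is split evenly
    across all routers that are still alive, and a router fails as soon as
    its load exceeds its capacity.
    """
    alive = set(range(len(capacities)))
    loads = list(loads)          # copy so the caller's list is not modified
    failed_order = []
    pending = [first_failure]    # routers that are about to fail

    while pending:
        router = pending.pop(0)
        if router not in alive:
            continue
        alive.remove(router)
        failed_order.append(router)
        if not alive:
            break                # nothing left to take the load

        # Push the dead router's load onto the survivors.
        share = loads[router] / len(alive)
        loads[router] = 0.0
        for survivor in sorted(alive):
            loads[survivor] += share

        # Any survivor now over capacity is queued to fail next.
        for survivor in sorted(alive):
            if loads[survivor] > capacities[survivor] and survivor not in pending:
                pending.append(survivor)

    return failed_order


if __name__ == "__main__":
    # Five routers running at 95% of capacity: one failure takes out all five.
    print(simulate_cascade([100.0] * 5, [95.0] * 5, first_failure=0))
    # The same routers at 50% of capacity: the failure stays contained.
    print(simulate_cascade([100.0] * 5, [50.0] * 5, first_failure=0))
```

With the network at 95% the first call returns [0, 1, 2, 3, 4], and at 50% the second returns [0]. In this toy model the load per surviving router is the total load divided by the number of survivors, so each failure hurts more than the one before it.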

Now, why isn't it fixed? Recovering from a cascade failure is extremely difficult. This is speaking from experience. I had a server cascade failure once; it's not an easy recovery from that. A network even more so.

To recover from a cascade failure, load has to be taken out of the picture for a period of time so as to be able to bring things back online without any load. That's the reason for the 24 hour planned outage, I believe. Not working for MCI/WorldCom or Lucent, I can't be sure or confirm this. When the load is eliminated, what has to be done is each router all the way on down the line has to be fully reset, restored to the original configuration, reconfigured, then brought back online one by one, with *NO* load on the network. If there is load on the network equivalent to what there was when it went down, then that router will immediately fail again. After each router is brought back online, and tested, each interface must be brought back up, one at a time, so as to make sure the load does not cascade out of control again. Once this is done, stability can be assumed to be restored, assuming no more interfaces or connections are added.
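
The "one interface at a time" part of that recovery can be sketched in a few lines of Python. Again, this is a guess at the general idea and not MCI/WorldCom's actual procedure; the function name staged_recovery, the interface names and loads, and the 80% headroom limit are assumptions picked for the example.

```python
def staged_recovery(router_capacity, interface_loads, headroom=0.8):
    """Re-enable interfaces one at a time on a freshly reset router.

    `interface_loads` is a list of (name, expected_load) pairs.
    Assumption (hypothetical policy): an interface is only brought up if the
    router would stay at or below `headroom * router_capacity`; anything that
    would push it past that limit is deferred instead of risking a new failure.
    """
    limit = router_capacity * headroom
    current = 0.0                # the router starts with no load at all
    enabled, deferred = [], []

    for name, load in interface_loads:
        if current + load <= limit:
            current += load      # safe: bring this interface back up
            enabled.append(name)
        else:
            deferred.append(name)  # would overload: leave it down for now

    return enabled, deferred, current


if __name__ == "__main__":
    interfaces = [("a", 30.0), ("b", 40.0), ("c", 20.0), ("d", 5.0)]
    print(staged_recovery(100.0, interfaces))
```

With those numbers, interfaces a, b and d come back up for a total load of 75.0, and c is left down because it would push the router past the 80.0 limit. The deferred list is the set of customers who would have to be moved somewhere else or wait for more capacity.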

Why MCI has taken so long to take this action, I don't know. Were I running the network, that would have been the first action upon noticing the cascade. Shut down all interfaces, cut off the load so that the cascade can be halted before the entire network is affected. Immediately notify all customers that, flat out, "a router failed due to a software upgrade, we don't know why, and we had to shut down all interfaces for a period of time to prevent the failure of the entire network. We don't know when we'll be able to get everyone back up." Furthermore, I'd do everything I could to get affected customers back online, and to find a way to get the customers attached to the failed router set up somewhere else, so as to get the network back up and be able to troubleshoot the failed router as quickly and cleanly as possible.

But like I said; I don't work for MCI/WorldCom or Lucent. I can't guarantee any of this information to be true. To be quite honest, I'm glad I don't work for either company. They have totally mishandled this whole situation, they're going to lose a lot of customers, and I believe they deserve it. You don't get and keep customers by keeping them in the dark and being very vague. Hell, if I call the Cleveland Verio NOC, they'll tell me exactly what happened when the T1 at work goes down. Either the 7513 had a failed RSP, or both power supplies failed, etc. (And still my coworkers wonder why I hate Verio.. maybe because they're telling me these kinds of things weekly?) MCI/WorldCom and Lucent have turned this into a disaster of proportions that never should have happened. Oh well. Their loss, others' gain.

Welcome to the Internet in this day and age, where information comes at a premium, and customer service is something of the past. Sad but true.

-RISCy Business | Rabid System Administrator and BOFH

--
your company here.
shelby != ford