Worldcom's Frame Relay Down

How nice. by Anonymous Coward · 1999-08-14 14:05 · Score: 1

It scares me how easily a big company can lose its network. As the net gets bigger, will it just get worse? First post!

Re:How nice. by Anonymous Coward · 1999-08-14 22:38 · Score: 1

Damn straight. If your network is important enough to you and your customers, you'd better be multihomed, because backbone providers, regardless of size or marketing literature, can and will go down. The Chicago Mercantile Exchange, which is making all these big noises about the outage and is threatening to leave MCI as a customer (if they haven't already done so), is an organization that _should_ have been multihomed, and I think it's just silly for them to blame and point fingers at MCI, when in fact they should be looking at themselves. They've got millions of dollars to throw at the problem, so I say, there's no fault but their own.
Re:How nice. by dattaway · 1999-08-14 16:33 · Score: 3

Yes, it will get worse. When all the little companies merge into One Big Corporation (it can't sue itself) there will be no competition and it can honestly market itself as the number one service. I prefer lots of little companies, like little servers, rather than One Big Company. It's One Big FU waiting to happen...
Re:How nice. by KyleCordes · 1999-08-14 18:38 · Score: 2

"It's One Big FU waiting to happen..."

I think this applies more to some companies than others. We had great difficulty with MCI WorldCom specifically, with something as simple as turning on a couple of T1's that they had 45 notice of the install date for.

It's not all big companies, though. We have some services from Frontier (itself the result of many mergers), and they generally display a much higher level of competence.

Re:More on the story...MCI to lose CBOT as Custome by Anonymous Coward · 1999-08-14 16:36 · Score: 1

umm.. that's not quite what they said. CBOT said they would ALREADY have used Plan B, but MCI kept saying the problem was almost fixed, so they didn't implement alternate routing when it could have saved them another 4 or 5 days of downtime.

Single attached networks by Anonymous Coward · 1999-08-14 20:30 · Score: 1

There was a huge power hit in California a couple of years back, and everyone in the ISP business was shocked at the negligence of huge companies who only had WAN links to a single ISP, and lost all connectivity. Chicago Mercantile is looking at the same deal here w/ MCI's nuclear meltdown. The network engineer who decided that running a major exchange using only one ISP's resources was a good thing should be drummed out of the business. The moral of the story, for everyone affected, is that if you are too lazy or cheap to set up an Autonomous System, run BGP with your ISPs and dual-home yourself to the internet, then it is really only your fault when your business goes down the tubes. Until backbone providers hit a level of reliability that approaches metro telephone service (which is a long way off ... trust me) anyone who refuses to build redundancy into their systems will get stung. There has been much discussion about this around my network engineering circles, and the consensus is that it is the Network Service Provider's fault that there is an outage, but it is the customer's fault that there is downtime ... when will the big corporations learn?

Re:Single attached networks by sjames · 1999-08-14 22:02 · Score: 2

For large business, I agree, not dual homing is a bad idea.

The real issue is small business, which is less likely to survive the outage in the first place, and cannot afford to dual home. Unfortunately, connectivity is still WAY more expensive than it has any right to be. For small business, doubling that cost is out of the question.
Re:Single attached networks by sjames · 1999-08-15 00:01 · Score: 2

I don't advise any business to NOT dual home if at all possible. In fact, I would advise business to put their servers in a colo facility which is dual homed.

What I'm saying is that some businesses CANNOT afford to do that and provide decent connectivity to their office as well. I can understand their temptation to go with a single provider and hope for the best.
Re:Single attached networks by sjames · 1999-08-15 02:52 · Score: 2

What a corporation CANNOT pay for and what they WILL NOT pay for are two different things

CORPORATION? Sure, odds are they can and should afford it. I am talking about SMALL operations. Think sole proprietors (that is Mom 'n Pop). $1200 could represent a significant portion of their income while they are getting established. It could make the difference between surviving long enough to become profitable and closing the doors.

Those are the people I feel for in this situation, especially since they are probably at the bottom of MCI's priority queue.

I agree that this incident will re-define what corporate America will pay for wrt redundancy. In a way, that is a shame since it essentially rewards the network providers for being unreliable.
Re:Single attached networks by Sonic-B-PHuCT · 1999-08-15 20:07 · Score: 1

Call me an IDEALIST, but...
1. If you pay $1,200 US for service, you get service. If your service is free, then you take your chances. If I go to a restaurant and pay money and don't get service, it's not my fault. It's not the little ISP's fault if they don't do redundancy. I think the little ISP's should beat themselves up, they don't need us to do it. The bottom line is that MCI went down, not the little guy and MCI should pay for everything not the little guy.

2. "any business" is not a corporation.

3. Your logic should also apply to MCI, perhaps they should have redundancy built into their network so that this doesn't happen.
Re:Single attached networks by rubbah · 1999-08-14 21:34 · Score: 1

..and small ISPs are supposed to have dual homing even if they can't afford it because of their size??? You may be comfy in your suite, but those little ISPs are dying on this one.. We have a second route out, but once upon a time we did not.... Give those little guys a break!

What nobody's said yet by Anonymous Coward · 1999-08-14 23:45 · Score: 1

is that this lack of reliability is very acceptable in a different context: M$ Win and apps on a personal computer, if you consider the average user's point of view. Am I wrong? // Nicholas Bodley // nbodley@tiac.net

Re:What nobody's said yet by BigDaddyJ · 1999-08-15 00:50 · Score: 1

Yes, but you see - you can reboot a Win9x machine yourself and "continue working". Here, the outage is beyond your own control. Also, your Win9x machine doesn't become unusable for a week STRAIGHT - you'd be forced to bring it into the computer shop. What's the equivalent of that for MCI Worldcom?
--bdj
Re:What nobody's said yet by amber_1 · 1999-08-15 01:18 · Score: 1

Currently Lucent is providing support. This went through Level 1, 2 and should now be in product support (designer hands) at LUCENT. End users of Optical carrier equipment/switches are not capable of handling problems of this level. This started out as an upgrade that failed.

More on the story...MCI to lose CBOT as Customer by Anonymous Coward · 1999-08-14 14:31 · Score: 2

Try this on for size...seems the Chicago Board of Trade got so fed up with MCI, they have begun a "PLAN B" to try to salvage PLAN A, their try at doing away with the dedicated trading links and VANnets. Read about it at

http://www.chicagotribune.com/tools/search/results/1,1780,,00.html?qy=MCI+trader&rw=1

...hmmm, wonder what MCI is using for op systems behind the fiber (especially after the holes noted in a certain OS from the Northwestern US by the folks at NETWiz (see next article on the Internet Security Audit))?

It makes me happy. by Anonymous Coward · 1999-08-14 16:15 · Score: 3

I love news stories like this... a few weeks ago a server in our company got 'sploited by a script kiddie over the weekend. I recovered from backups before Monday, but the hack was fairly high profile anyway since I made everyone change their passwords. OH HEAVEN FORBID ANYONE SHOULD HAVE TO LEARN A NEW PASSWORD - EVER! I got no end of grief, until McAfee's site got hacked. Now this.

Stories like these give me someone to point at, saying "See? This computer stuff is goddamn hard. If a multibillion dollar corporation specializing in networking can't bring their network up, how the hell can I do anything on $58K per year? Gimme a raise!"

CBOT is trying to deflect blame to MCI... by Ami+Ganguli · 1999-08-14 22:24 · Score: 2

Granted MCI has screwed up badly - whoever does their change control should be fired over this - but the Chicago Board of Trade deserves the brunt of their customers' anger.

You don't run a critical network without backup. Their slow links should have ISDN backups, their fast links should have dedicated redundant connections. That's just common sense.

--
It is tempting, if the only tool you have is a hammer, to treat everything as if it were a nail. - Abraham Maslow

Re:What could take this down... by sjames · 1999-08-14 21:13 · Score: 2

As I understand it, it's their frame relay that has gone down. That would most certainly be proprietary software/hardware. According to the article, they're using Lucent.

Re:Lucent the Ostrich by sjames · 1999-08-14 21:56 · Score: 2

Not having a backout plan is unforgivable when that many people and businesses depend on the service. Being truly prepared for a full backout is difficult, but is doable.

To further the problem, they apparently don't have much in the way of failover preparation in place, or an established mitigation procedure (or if they do, they stupidly switched everything to the new software at once).

The final problem at MCI has nothing to do with technical issues. Many are indicating that their problems would be much less severe if MCI would be more forthcoming with information, and would make a public statement. The only thing stopping them from doing that is arrogance, execs too busy packing their golden parachutes, and a determination to spin control themselves into the ground.

Re:What could take this down... by sjames · 1999-08-15 02:08 · Score: 2

I can see that. According to another poster, MCI initially told him they had messed up their routing tables.

Reports are that MCI has had some trouble for months. To me, that makes the cause of the outage an open question.

Re:ISPs endangered by sjames · 1999-08-15 02:31 · Score: 2

Many small ISPs are colos that lease a dial-up bank from their provider. Probably, the dial-up is linked to the colo servers through a Worldcom frame relay. Thus, the email is down. It's a pretty common setup these days.

Dual-homing can be harder than you think by Max+Hyre · 1999-08-15 23:58 · Score: 1

In a former life, I was a co-op student at Western Union (Anyone remember them? They used to actually transfer data as well as money.), and I was involved in leasing data lines from AT&T and the local telcos wherever our customers needed connections.

Some of the larger outfits were indeed interested in redundancy and paid a premium for, say, two links from NYC to SF, one by way of Chicago, and another through Houston. We frequently had trouble verifying that we were really getting independent routes.

The telcos bundled connectivity (back when a single voice line was lotsa data links [after they'd been digitized]) in a hierarchy so deep that it took days or weeks to verify that every link in the route used facilities physically separate from every other.

Why? 'Cause to Ma Bell, bandwidth is fungible. Got noise on link 37A from Manhattan to Albany? Take it out of service for maintenance, and swap in some spare bps from Manhattan to Jersey City to Albany. It's all the same....

Until a manhole floods in Jersey City, and *oops* it turns out your route to Chicago (formerly via Albany) is now cheek-by-jowl with the wire to Houston, and you're dead!

Among other things, we supposedly rented circuits with huge Do Not Reroute tags hanging all over them, but on occasion someone overlooked it, or worse, the facility two levels up the hierarchy got rerouted and our link went along for the ride unknowingly.
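
The independence check described above can be sketched in a few lines of Python. This is only an illustration, not anything Western Union or the telcos actually ran: the `carried_by` map, the circuit names, and the conduit names are all hypothetical, and a real facility hierarchy would be much deeper than the two levels shown.

```python
def underlying(circuit, carried_by):
    """Return the circuit plus every facility it rides on, at any depth."""
    seen = set()
    stack = [circuit]
    while stack:
        item = stack.pop()
        if item in seen:
            continue
        seen.add(item)
        # Walk down to whatever physically carries this item (hypothetical map).
        stack.extend(carried_by.get(item, []))
    return seen


def shared_risk(circuit_a, circuit_b, carried_by):
    """Facilities both circuits depend on; an empty set means independent routes."""
    return underlying(circuit_a, carried_by) & underlying(circuit_b, carried_by)


# Hypothetical hierarchy: circuit -> trunk -> conduit.
carried_by = {
    "nyc-sf-via-chicago": ["trunk-albany"],
    "nyc-sf-via-houston": ["trunk-jersey-city"],
    "trunk-albany": ["conduit-albany"],
    "trunk-jersey-city": ["conduit-jersey-city"],
}
before = shared_risk("nyc-sf-via-chicago", "nyc-sf-via-houston", carried_by)

# Maintenance two levels up quietly moves the Albany trunk into the
# Jersey City conduit; the circuit itself was never "rerouted".
rerouted = dict(carried_by)
rerouted["trunk-albany"] = ["conduit-jersey-city"]
after = shared_risk("nyc-sf-via-chicago", "nyc-sf-via-houston", rerouted)

print("shared before reroute:", before)
print("shared after reroute:", after)
```

The point of walking the whole hierarchy is that the shared conduit only shows up below the level the customer ordered, which is why the check took days or weeks by hand.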

I wish them well---it can only be much worse with fiber optics these days.

--
I refuse to believe corporations are people until Texas executes one. -- desert rain on http://www.dailykos.com/user/

Diversity is Power by Max+Hyre · 1999-08-16 00:14 · Score: 1

I bet a lot of the companies it applies to more are those valuing economy of scale over reliability. ``We can save 0.3% by buying 10k Grace L. Furgeson routers? Great! ... They don't interoperate with any others? That's OK, we'll buy 20k and use them exclusively.''

So now when the GLF equipment shows a bug under certain wildly-unlikely circumstances, you're sincerely screwed. Much better to insist on at least two (and if you're serious, three or four) vendors' equipment, interoperating to an open standard, throughout your network. That way, as with genetic diversity in crops, livestock, and humans, you're much better able to withstand climate variation and new diseases. Half your network may be down with the bug, but you still have significant bandwidth running.

(Of course, this doesn't inoculate you against errors in the protocol, but that's better debugged than the equipment implementing it.)
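
The "half your network" figure follows from simple arithmetic, sketched here in Python. The vendor names and the equal capacity shares are assumptions made for illustration; the calculation also assumes a bug takes out all of one vendor's gear and none of anyone else's.

```python
def surviving_fraction(vendor_shares, failed_vendor):
    """Fraction of total capacity left when one vendor's equipment all fails."""
    total = sum(vendor_shares.values())
    return (total - vendor_shares[failed_vendor]) / total


# Hypothetical deployments, measured in units of capacity per vendor.
single_vendor = {"GLF": 2}
two_vendors = {"GLF": 1, "other-a": 1}
four_vendors = {"GLF": 1, "other-a": 1, "other-b": 1, "other-c": 1}

for name, shares in [("one vendor", single_vendor),
                     ("two vendors", two_vendors),
                     ("four vendors", four_vendors)]:
    print(name, surviving_fraction(shares, "GLF"))
```

With one vendor nothing survives, with two you keep half, and with four you keep three quarters, which is the trade-off against the 0.3% volume discount.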

--
I refuse to believe corporations are people until Texas executes one. -- desert rain on http://www.dailykos.com/user/

Re:Lucent the Ostrich by synaptic · 1999-08-14 18:30 · Score: 1

I work for a mid-sized ISP and we lost connectivity on Friday the 6th, periodically through the week, and Friday the 13th. As I write this, our frame-relay connectivity to UUNet (via MCI Worldcom) is down!

We'll survive it. We have a connection to another local provider (who uses UUNet as well but doesn't seem to be experiencing the same problems) and a PPP T1 link to AT&T. A simple route-map to prepend our AS to the UUNet BGP announcements and voilà, AT&T handles most of the traffic.
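
Why prepending shifts traffic can be shown with a toy Python model of AS-path comparison. This is a sketch, not router configuration: all AS numbers are from the private and documentation ranges and are hypothetical stand-ins, and real BGP checks other attributes (such as local preference) before it ever compares path length.

```python
def announce(own_as, prepend_count=0):
    """AS path we send to an upstream: our AS, repeated if we prepend."""
    return [own_as] * (1 + prepend_count)


def path_via(transit_ases, announcement):
    """AS path a remote network sees once the transit ASes add themselves."""
    return list(transit_ases) + list(announcement)


def best_path(candidates):
    """Shortest AS path wins (one step of best-path selection; ties go to the first)."""
    return min(candidates, key=lambda name: len(candidates[name]))


OWN_AS = 64512           # hypothetical: the ISP's own AS
FLAKY_UPSTREAM = 64496   # hypothetical stand-in for the troubled upstream
STABLE_UPSTREAM = 64497  # hypothetical stand-in for the second upstream
TRANSIT = 64498          # hypothetical extra hop on the way to the second upstream

normal = {
    "flaky": path_via([FLAKY_UPSTREAM], announce(OWN_AS)),
    "stable": path_via([TRANSIT, STABLE_UPSTREAM], announce(OWN_AS)),
}
prepended = {
    "flaky": path_via([FLAKY_UPSTREAM], announce(OWN_AS, prepend_count=2)),
    "stable": path_via([TRANSIT, STABLE_UPSTREAM], announce(OWN_AS)),
}

print("normal:", best_path(normal))
print("with prepend:", best_path(prepended))
```

Prepending only makes the path through the troubled upstream look longer to everyone else, so inbound traffic drifts to the other link without tearing the first session down.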

As far as a secondary feed, I don't know what you're talking about. If you have a frame-relay circuit with MCI Worldcom, a secondary frame-relay circuit won't solve the problems. We'll expect compensation for violation of our service level agreement and CIR and be done with it.

This is a big deal. Someone will lose their job or face severe disciplinary action over it but they'll figure it out and things will be back to normal. Besides this, we've had *EXCELLENT* service for several years.

Re:Not quite 8 days by danbeck · 1999-08-15 06:03 · Score: 1

Yep, our SC and PA offices have been down most of the week... Not good, not good.

Re:ISPs endangered by danbeck · 1999-08-15 06:04 · Score: 1

Um, that's what they deserve for using a sorry ass frame cloud to do business. Frame relay isn't worth a shit and any ISP worth its salt knows this. Sure it's cheaper, but hell, so is not being in business, which is what will happen to these cheap businesses... you have to spend money to make money and if you buy poor, you stay poor.

Re:Here's a thought by unitron · 1999-08-14 17:32 · Score: 1

Here's another thought. I left the "f" out of software.
(sotware--overimbibing instructions)

--

I see even classic Slashdot is now pretty much unusable on dial up anymore.

Re:BellCore by unitron · 1999-08-16 16:18 · Score: 1

I forget, where did BellCore fit into all of that?

--

I see even classic Slashdot is now pretty much unusable on dial up anymore.

ISPs endangered by unitron · 1999-08-14 17:16 · Score: 2

MCI will still be around when this is over, but will a lot of "mom and pop" ISPs?
There's a story on C/NET "ISPs say MCI outage could kill businesses" that's more than a little bit scary. Does MCI have their own ISP business? One that would just as soon see the little guys dry up and blow away? Do they have any corporate buddies that do?

--

I see even classic Slashdot is now pretty much unusable on dial up anymore.

Re:ISPs endangered by igjeff · 1999-08-14 18:52 · Score: 2

Yeah...it's called UU.Net. :)

Seriously though...my question is...does MCI WorldCom use the same service level agreements on their frame cloud that uu.net does on their internet access? If so...this could be a *SIGNIFICANT* hit to MCI WorldCom's pocketbook.

For those of you who don't know...UU.Net's SLAs basically say that for every hour of downtime, you get a *day* of credit on your circuit. So...for 8 days of downtime...that's over 6 months of service for free!

Ouch.

Jeff
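
The SLA arithmetic in the post above can be checked with a short Python sketch. The one-day-per-hour rate is taken from the post as stated; the function name and the 30-day month are assumptions for illustration, and the actual contract terms may well cap or qualify the credit.

```python
def sla_credit_days(outage_hours, credit_days_per_hour=1.0):
    """Days of service credit owed for an outage, at a flat per-hour rate."""
    return outage_hours * credit_days_per_hour


outage_days = 8
credit_days = sla_credit_days(outage_days * 24)
credit_months = credit_days / 30  # assuming a 30-day billing month

print(credit_days, "days of credit, about", credit_months, "months")
```

Eight days is 192 hours, so 192 days of credit, which is indeed a bit over six months.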
Re:ISPs endangered by BigDaddyJ · 1999-08-15 00:49 · Score: 1

Well, a lot of people can't afford the cost of multiple frame relay lines, especially in remote areas where the cost of local loops, etc. is quite high.
I agree with you, though, on the mail bit, unless they're not hosting the mail server, just sending smtp upstream. That doesn't sound right, though.
--bdj

Here's a thought by unitron · 1999-08-14 17:29 · Score: 3

Here's a thought. They're running Lucent hardware. The trouble apparently is related to a recent sotware upgrade install from Lucent. Same upgrade on other Lucent customers not causing problems. Lucent used to be Bell Labs (sort of). Bell Labs used to be part of AT&T (sort of). AT&T and MCI are competitors big time. Co-incidence or conspiracy? (it's the newest Ludlum novel, "The Lucent Gambit")

--

I see even classic Slashdot is now pretty much unusable on dial up anymore.

Re:Here's a thought by bwz · 1999-08-15 06:15 · Score: 1

If you 'read between the lines' you see that he said he worked with hardware design (actually where they design the hardware, he didn't say that he was a hardware designer) and that the "cards they sent us" worked OK. Implying that it was the software (from Lucent I presume) that failed (I don't think he meant that "the software on the punched cards that they sent us was OK..."). He's correct that running a critical system at or close to 100% utilization is a far, far less than brilliant thing to do... All systems that I've worked with tend to fail more often when they are stressed in that way, and quite a few of today's computer systems are not designed to behave well if you put 'too much' load on them...

I've got no hard facts to back those statements up, just experience :-)

Erik

--

Has it ever occurred to you that God might be a committee?
--- Jubal Harshaw
Re:Here's a thought by NavisCore · 1999-08-15 00:53 · Score: 1

Lucent's hardware is not the problem here. I work in Lucent's Core Switching division (formerly Ascend Communications), where the hardware is designed, and we've had some interesting shouting matches with the people from MCI over this whole thing. The cards they sent us which they claimed had failed worked perfectly. Their network went down because they oversubscribed their lines and caused the software to crash.

-NavisCore

--
-NavisCore
Re:Here's a thought by fwr · 1999-08-14 21:00 · Score: 1

I sincerely doubt it. I would like to know exactly what the technical reasons for the outage were though. I'm in the position where I might need to support Lucent hardware in the near future, and I'd certainly like a heads-up on this one!

Lucent? by Dr.Hair · 1999-08-14 21:51 · Score: 1

Well when our frame connectivity went out at work, the first thing that we got told (Friday the 6th) was that WorldCom was installing a new router somewhere in the NorthEast US, that the routing tables were screwed up, that the bad routes propagated out from the new router, and that WorldCom couldn't fix it. WorldCom changed their story after Day 5 of this nonsense to blame Lucent's software. I'm not sure if an upgrade to software would in fact create a propagating problem in routing tables for non upgraded routers, but the traceroute times and packet loss I saw during the last week make me think that the problem was with the routing tables and not just a matter of "network congestion". Personally I'll always tend to believe the first story that I hear from a tech before the story that I hear 5 days later from the Public Relations Office.

Re:Lucent? by parc · 1999-08-15 05:26 · Score: 1

We've got a non-frame connection out to Houston, and have seen no routing problems. If it was a problem with routes going out, we'd see some effect (besides not being able to hit WC/UUnet frame customers). UUnet's NOC page was blaming Lucent early this week.

Oddly enough, the engineer we usually deal with at UU has been unavailable the past 2-3 weeks because they had him working on a "big project." He didn't even show up as at work when he was on that project. I wonder if this was it?

Software upgrade = better surveillance ... by Lazy+Jones · 1999-08-14 20:19 · Score: 1

... possibilities for the gov't. Of course, I'm just guessing.

--
"I love my job, but I hate talking to people like you" (Freddie Mercury)

Re:What could take this down... by hagan · 1999-08-14 16:11 · Score: 2

lots of things could keep this down. i'm wondering if their "upgrade" was an emergency last ditch effort to fix things. i'm betting that this will all come down to a bad piece of hardware/software which has a failure mode that is protocol correct (remember the zost cost route stuff from way back?). goes back to the old "your secondary systems should be different location, different hardware, and different vendors (e.g. if you use sun, fail to dec/hp/etc)". of course many companies will whine that the QA cost is too high :) the sad thing is that they don't seem to be willing to take public responsibility for the failure, much like an auction company that seems to point blame at everyone except themselves.

ISP's affected? by Diggety_Dank · 1999-08-14 22:26 · Score: 2

I use bellsouth as an internet service provider, and i got the following letter in email:

Valued Customers:

At 11:15 pm 8/13/99 WorldCom, our Global Service Provider, notified our Network Operations Center of the need to perform emergency maintenance on their Frame Relay network beginning at 12 Noon (EDT) Saturday 8-14-99 and finishing at approximately 12 Noon (EDT) Sunday 8-15-99.

During the course of this emergency maintenance, you may or may not experience the following: congestion over the network, latency and potentially, loss of connectivity. The work being performed by WorldCom necessitates the complete shutdown of all frame relay switches within the WorldCom network, and a controlled, one by one, reinstatement of each frame relay switch back onto the network.

We have been assured by WorldCom that every effort will be made to reduce the impact to our network and to resolve the issue necessitating the emergency maintenance as expediently as possible.

We will notify you once we have received confirmation from WorldCom that all work has been completed.

Thank you for your patience and continued business,

BellSouth.net

--
--- Stampede linux for me! I play with fire to break the ice..

Re:Text of CBT letter to MCI CEO by shri · 1999-08-14 22:01 · Score: 1

This is so flipping *STUPID*. Does any reasonable IT person with a brain believe a telco vendor when they say that they do not need a contingency plan???

While MCI and apparently Lucent share a major percentage of the blame in this particular situation someone at CBOT screwed up big time. What next? Someone told us we did not need a UPS for our systems?

Does anyone else remember the flak Ebay took for apparently running their business without redundancies in their infrastructure?

Re:Hmm, Are these ATM lines? by lucidvein · 1999-08-15 04:43 · Score: 1

I have a friend up here in the NW who's in charge of watching the ATM lines and making sure they get rerouted when necessary. His interpretation of the Worldcom outage was, 'They were testing a Y2K patch and it failed.' I think the ATMs will do OK. Far better than us normal people with businesses to run.

Leads me to wonder why it wasn't handled a bit better, but them's the breaks. We had flaky access last weekend and off and on through the week. Completely dead yesterday at noon but up early this morning.

I just pushed our machines thru a proxy server on a dialup during the outages to avoid complete loss of connectivity. One good reason to have lots of small companies to choose from for your service.

--

"I have a cunning plan..."

So... by ywwg · 1999-08-15 07:00 · Score: 1

Is this why a lot of sites, slashdot included, are really really slow right now? I'm on New England's Mediaone. I'm not clear as to what this outage affects.

Lucent the Ostrich by mudpuppy · 1999-08-14 18:09 · Score: 1

What kind of planning went into the "upgrade" of the Lucent software? Was there no back-out plan? No testing? No contingency planning? Someone at Lucent ought to be taking quite a beating right about now.... Lucent owes an explanation of this one, and they've been mighty quiet. Head-in-sand syndrome...

The real losers here are all the small ISP's who rely (possibly unknowingly) on the MCI backbone since it's wholesaled to them by other companies. 8 days of downtime will put some of these guys out of business. They don't have the cash flow to refund all their customers for lost service.

The wholesalers owe it to their ISP customers to have a secondary feed (hopefully using non-MCI, non-Lucent equipment) to prevent such a disaster as this.

--
faster faster... 'til the thrill of speed overcomes the fear of death...

Re:Lucent the Ostrich by NavisCore · 1999-08-15 00:59 · Score: 1

>Lucent owes an explanation of this one, and they've been mighty quiet.

They've been quiet because it's not their problem, this one's MCI's. Read my previous post.

--
-NavisCore
Re:Lucent the Ostrich by JordanH · 1999-08-14 20:34 · Score: 2

I agree about a back-out plan and a contingency plan, but testing for systems like this is often just not going to happen.

It's usually not practical to really test complex systems under anything that approaches real-world conditions. MCI WorldCom can't maintain a test environment that anywhere near mirrors the significant portion of the Internet that they service. Anything less than a test under real-world loads will not be representative of what will happen when you put it into service.

And remember, testing only demonstrates the presence of defects, not their absence.

Who's to say that Lucent is at fault here? I would guess that the same equipment is running outside of MCI and we're not hearing about problems there. I don't actually know the situation with regard to the Lucent hardware/software. It may be that this is something that only MCI has, or something that only MCI has put under such loads.

People need to get some perspective. With the growth of Internet and bandwidth demands in general, combined with the cut-throat cost competition environment for these carriers, it's really surprising to me that we don't have a lot more failures like this one. Get over it.

Re:What could take this down... by j_d · 1999-08-15 00:52 · Score: 1

According to the article, they're using Lucent.

I've got a friend who works at Lucent, and when I told him about this, his comment was "All big telephone companies blame us when their networks go down. We waste more time proving that our products weren't the cause of outages than I care to think about."

Which makes some sense. If you were in charge of damage control at MCI, would you want to say, "Yes, this was entirely our fault, and we're incompetent monkeys for letting it go on so long"?

How it happened. / Why is it not fixed? by RISCy+Business · 1999-08-15 02:56 · Score: 5

Okay, I'm intimately familiar with the situation at this point; I've been assisting a former employer in working towards a temporary emergency resolution since they're on MCI/WorldCom. Anyways, here's what appears to have happened. Please note the following disclaimer:

I do not speak for MCI/WorldCom or Lucent, I do not work for MCI/WorldCom or Lucent, I have no affiliation with MCI/WorldCom or Lucent, and I do not have any sort of business relationship with MCI/WorldCom or Lucent.

That said to cover my ass, here's what appears to have happened.

MCI/WorldCom has had capacity issues since mid-97. When it was just MCI, they stopped selling DS3's for a period of time a few years back because they simply didn't have the capacity. MCI has long had capacity issues, and as a direct result, has typically run their equipment at or near capacity. What appears to have happened is a cascade failure. I'm going to try and put this into words, but it's easier with pictures. Trust me.

What happens in a cascade failure? A network, at or near capacity, has a failure in a single core router for some reason, in MCI/WorldCom's case, a failure due to software. The load from that core is quickly distributed to the remaining core routers. These remaining core routers, being at or near capacity, almost immediately gave way under the load, failing due to other various reasons triggered by the excess load. As each router failed down the line, the load on the remaining routers increased exponentially, cascading into a full network outage.
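
Since the poster says this is easier with pictures, here is a toy Python simulation of the same idea. It is a deliberately crude sketch under stated assumptions: five identical routers, load from failed routers split evenly among the survivors, and a router failing the moment its load exceeds capacity. None of the numbers come from MCI/WorldCom or Lucent.

```python
def simulate_cascade(loads, capacities, first_failure):
    """Return router indices in the order they fail after one initial failure."""
    load = list(loads)
    alive = set(range(len(load)))
    failed_order = []
    to_fail = [first_failure]
    while to_fail:
        shed = 0.0
        for router in to_fail:
            alive.discard(router)
            shed += load[router]
            load[router] = 0.0
            failed_order.append(router)
        if not alive:
            break  # nothing left to carry the traffic: full outage
        # Assumption: shed load is spread evenly over the surviving routers.
        share = shed / len(alive)
        for router in alive:
            load[router] += share
        to_fail = sorted(r for r in alive if load[r] > capacities[r])
    return failed_order


# Running at 90% of capacity: one failure takes everything down.
near_capacity = simulate_cascade([90] * 5, [100] * 5, first_failure=0)
# Running at 60% of capacity: the survivors absorb the extra load.
with_headroom = simulate_cascade([60] * 5, [100] * 5, first_failure=0)

print("near capacity:", near_capacity)
print("with headroom:", with_headroom)
```

The only difference between the two runs is how much spare capacity each router had before the first failure, which is the poster's point about running equipment at or near capacity.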

Now, why isn't it fixed? Recovering from a cascade failure is extremely difficult. This is speaking from experience. I had a server cascade failure once; it's not an easy recovery from that. A network even more so.

To recover from a cascade failure, load has to be taken out of the picture for a period of time so as to be able to bring things back online without any load. That's the reason for the 24 hour planned outage, I believe. Not working for MCI/WorldCom or Lucent, I can't be sure or confirm this. When the load is eliminated, what has to be done is each router all the way on down the line has to be fully reset, restored to the original configuration, reconfigured, then brought back online one by one, with *NO* load on the network. If there is load on the network equivalent to what there was when it went down, then that router will immediately fail again. After each router is brought back online, and tested, each interface must be brought back up, one at a time, so as to make sure the load does not cascade out of control again. Once this is done, stability can be assumed to be restored, assuming no more interfaces or connections are added.
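
The staged bring-up described above can be sketched as a plan generator in Python. This is only an outline of the ordering, not anyone's actual procedure: the router and interface names are hypothetical, and the `headroom` threshold is an assumption added here to make "don't let the load cascade again" concrete.

```python
def staged_restore(routers, headroom=0.8):
    """Build an ordered plan: reset each router unloaded, then add interfaces one by one."""
    plan = []
    for router in routers:
        # Router comes back with no interfaces enabled, so it carries no load.
        plan.append(("reset", router["name"]))
        load = 0.0
        limit = router["capacity"] * headroom
        for iface, demand in router["interfaces"]:
            if load + demand > limit:
                # Enabling this interface would eat the safety margin; leave it down.
                plan.append(("hold", iface))
                continue
            load += demand
            plan.append(("enable", iface))
    return plan


# Hypothetical core router with three customer-facing interfaces.
routers = [
    {"name": "core1", "capacity": 100,
     "interfaces": [("if-a", 50), ("if-b", 40), ("if-c", 10)]},
]
plan = staged_restore(routers)
for step in plan:
    print(step)
```

Interfaces left on "hold" are the ones that would have to be moved elsewhere or wait for more capacity, which matches the poster's caveat that stability only holds if no more load is added.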

Why MCI has taken so long to take this action, I don't know. Were I running the network, that would have been the first action upon noticing the cascade. Shut down all interfaces, cut off the load so that the cascade can be halted before the entire network is affected. Immediately notify all customers that, flat out, "a router failed due to a software upgrade, we don't know why, and we had to shut down all interfaces for a period of time to prevent the failure of the entire network. We don't know when we'll be able to get everyone back up." Furthermore, I'd do everything I could to get affected customers back online, and to find a way to get the customers attached to the failed router setup somewhere else, so as to get the network back up and be able to troubleshoot the failed router as quickly and cleanly as possible.

But like I said; I don't work for MCI/WorldCom or Lucent. I can't guarantee any of this information to be true. To be quite honest, I'm glad I don't work for either company. They have totally mishandled this whole situation, they're going to lose a lot of customers, and I believe they deserve it. You don't get and keep customers by keeping them in the dark and being very vague. Hell, if I call the Cleveland Verio NOC, they'll tell me exactly what happened when the T1 at work goes down. Either the 7513 had a failed RSP, or both power supplies failed, etc. (And still my coworkers wonder why I hate Verio.. maybe because they're telling me these kind of things weekly?) MCI/WorldCom and Lucent have turned this into a disaster of proportions that never should have happened. Oh well. Their loss, others' gain.

Welcome to the Internet in this day and age, where information comes at a premium, and customer service is something of the past. Sad but true.

-RISCy Business | Rabid System Administrator and BOFH

--
your company here.
shelby != ford

Re:How it happened. / Why is it not fixed? by biga · 1999-08-15 14:57 · Score: 1

It was actually WorldCom's Frame Relay network that went down, not MCI's. The two networks haven't been integrated with each other yet.

--
It is a mistake to think you can solve any major problems just with potatoes -- Douglas Adams -- Life, The Universe and E
Re:How it happened. / Why is it not fixed? by bmo · 1999-08-15 10:07 · Score: 1

The example you gave is similar to what happened back in ....wavy lines....the East Coast Blackout of 1965. Essentially it was exactly as you describe. A substation got overloaded and switched off to dump the load to the rest of the grid, and the rest of the grid, being at or near capacity, cascaded shutdowns until there was *nothing* electrical running in the entire Northeast (except for bits of Maine) within 5 minutes.

http://www.cmpco.com/aboutCMP/powersystem/blackout.html
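A rough way to see why "at or near capacity" is the killer: the toy Python model below trips one node, splits its load evenly over whatever is still up, and repeats for anything pushed past its limit. The node names, loads, capacities, and the even-split rule are all assumptions for illustration; a real power grid or frame relay network redistributes load along its actual topology.

```python
def cascade(loads, capacities, first_failure):
    """Return the nodes that trip, in order, once first_failure goes offline.

    Each tripped node dumps its load evenly onto the nodes still running;
    any node pushed above its capacity trips in turn.
    """
    loads = dict(loads)                 # work on a copy
    tripped = [first_failure]
    pending = [first_failure]           # tripped nodes that still have to dump load
    while pending:
        node = pending.pop(0)
        survivors = [n for n in loads if n not in tripped]
        if not survivors:
            break                       # nothing left to take the load: total outage
        share = loads[node] / len(survivors)
        loads[node] = 0
        for n in survivors:
            loads[n] += share
        for n in survivors:
            if loads[n] > capacities[n]:
                tripped.append(n)       # overloaded, so it shuts off as well
                pending.append(n)
    return tripped


caps = {"A": 100, "B": 100, "C": 100, "D": 100}
busy = cascade({"A": 90, "B": 85, "C": 80, "D": 95}, caps, "A")
quiet = cascade({"A": 30, "B": 40, "C": 20, "D": 10}, caps, "A")
```

With everything running close to its limit, losing "A" takes the other three nodes down with it; with the same nodes lightly loaded, only "A" is lost. Same failure, very different outcome, and the only thing that changed is how much spare capacity was left.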

Can't fix software, eh? by SEWilco · 1999-08-14 18:12 · Score: 2

So MCI can't fix the problem because it is Lucent software? So because MCI does not have the source code their entire network business is at risk. Were their licenses of proprietary software listed as a liability?

Re:What is it? by Jburkholder · 1999-08-14 23:16 · Score: 2

whatis.com/framerelay

Looking for a job? by rjreb · 1999-08-14 22:55 · Score: 1

Bet the resumes are just overwhelming the fax machine.

Field Service Engineers

You'll install, test, and repair circuits and equipment at the customer premises. You'll satisfy customers with face-to-face interaction. You'll participate in performance improvement efforts, and be responsible for an MCI WorldCom vehicle. You'll perform dispatch duties after hours when necessary, and participate in a call-out rotation. To qualify, you'll need an AS degree in a technical field, 1 year of field service or 2 years central office experience, knowledge of personal computer software, and hardware operations will be beneficial. We have this position available in the Northern and Western suburbs as well as downtown Chicago.

--
Pork is not a verb

Y2K related? by rjreb · 1999-08-15 11:24 · Score: 1

Off of Drudge's site

--
Pork is not a verb

Ask the web... by QuantumG · 1999-08-14 14:29 · Score: 2

http://altavista.yellowpages.com.au/cgi-bin/query?mss=simple&pg=q&what=web&enc=iso88591&kl=en&locale=xx&q=%2B%22frame+relay%22+%2B%22introduction+to+networking%22&search=Search

--
How we know is more important than what we know.

ISP's and Netcom suffer too by JediLuke · 1999-08-15 09:00 · Score: 1

Well, I work for an ISP that uses a Netcom T1, and I guess they use MCI, so we were down the better part of Wednesday. I mean, if you are going to hold a quarter of the infrastructure of the net... I think you ought to have redundant systems and no downtime. They owe the world something now.

JediLuke

--

JediLuke
-Do or Do Not, There is no Try

Text of CBT letter to MCI CEO by Lawrence_Bird · 1999-08-14 21:39 · Score: 2

[B] CBT president blasts MCI WorldCom in wake of Project A outage

By Bridge News
Chicago--Aug 13--On the heels of Thursday's power outage in downtown
Chicago, which forced an early shutdown at the Chicago Board of Trade, the
exchange was forced to suspend trading again today on its Project A system. CBT
President Thomas Donovan sent a letter to MCI WorldCom CEO Bernard Ebbers,
blasting the company for its part in a string of other disruptions that have
plagued the system. MCI WorldCom is the exchange's network provider and has been
unable to cope with the crises to the exchange's satisfaction.
* * *
Donovan said today's shutdown and others in the past few weeks were a direct
result of MCI WorldCom's "catastrophic service disruptions," which have deprived
large segments of the CBT's constituents access to Project A through their
trading terminals on the system's wide-area network.
"All told, our Project A markets have been down over 60% of the time since
Project A's scheduled Thursday evening trading session last week, exposing our
members and their customers to market risk and depriving them of significant
trading and revenue opportunities," he said. "The CBOT has also experienced a
sizable loss of transaction fee revenues."
MCI WorldCom has "tarnished the CBOT's 151-year reputation as a provider of
dependable and reliable market facilities," said Donovan, adding that the
problems put the exchange in the hot seat with its federal regulatory body, the
Commodity Futures Trading Commission.
He said MCI WorldCom led the CBT to believe it would not need a contingency
plan, but the exchange would now be forced to implement one beginning with the
Project A session that begins at 1800 CT Sunday.
Under the plan, many exchange members will have to move or duplicate their
Project A operations and staffing to back up locations within the building,
entailing added costs and hardships.
Last week, Project A suffered a shutdown after MCI began to upgrade its
communications network and an outage occurred at a switching center. The company
provided assurances it would try harder to restore customer confidence.
"As a result of MCI WorldCom's failure to deliver on their promises to me
early last week, the CBOT is pursuing all available remedies," Donovan said.
He said the exchange had lost all confidence in MCI WorldCom's ability to
provide reliable service and was awaiting the company's immediate response as to
how it would remedy the situation. End
Bridge News, Tel: (312) 454-3468
Send comments to Internet address:futures@bridge.com
[symbols:US;WCOM]

Aug-13-1999 17:26 GMT
Source [B] BridgeNews Global Markets
Categories:
COM/GRAIN COM/SOY COM/LIVE CAP/FOREX CAP/CREDIT CAP/INDEX COM/AGRI
COM/LUMBER COM/ENERGY CAP/STOCKS

Re:What could take this down... by amber_1 · 1999-08-15 02:23 · Score: 1

We will not know who is to blame until a finger-pointing exercise happens. Lucent doesn't want to be blamed, MCI doesn't want it to be them, so some poor contractor will probably get fired for not following procedure during the upgrade, whether he did or not.

Man Oh Man by Ginsberg+Terrorist · 1999-08-15 20:21 · Score: 1

Here at work we're wired by WorldCom
I don't even think about it when I logon
But then, when the logo keeps churning
And something smells like it's burning
I ask the question,
Where's my connection?
