Worldcom's Frame Relay Down
Jim Trocki writes "MCI/Worldcom's frame relay network has been hosed for at least 8 days now. Read the story.
This is the recorded message that is heard on Worldcom's tech support line:
'In accordance with our plan to repair the instability in one of our frame
relay network platforms, we have taken our domestic frame relay platform
out of service for a 24 hour period, from noon Saturday to noon Sunday
Eastern standard time. As a result, your frame relay service will not
be available for traffic.' " Here is the MCI Worldcom web page on the situation. The news.com article says that this outage might cut into their profits. It seems this is quite a severe outage...
It scares me how easily a big company can lose its network. As the net gets bigger, will it just get worse? First post!
umm.. that's not quite what they said. CBOT said they would ALREADY have used Plan B, but MCI kept saying the problem was almost fixed, so they didn't implement alternate routing when it could have saved them another 4 or 5 days of downtime.
There was a huge power hit in California a couple of years back, and everyone in the ISP business was shocked at the negligence of huge companies who only had WAN links to a single ISP, and lost all connectivity. Chicago Mercantile is looking at the same deal here w/ MCI's nuclear meltdown. The network engineer who decided that running a major exchange using only one ISP's resources was a good thing should be drummed out of the business. The moral of the story, for everyone affected, is that if you are too lazy or cheap to set up an Autonomous System, run BGP with your ISPs and dual-home yourself to the internet, then it is really only your fault when your business goes down the tubes. Until backbone providers hit a level of reliability that approaches metro telephone service (which is a long way off ... trust me) anyone who refuses to build redundancy into their systems will get stung. There has been much discussion about this around my network engineering circles, and the consensus is that it is the Network Service Provider's fault that there is an outage, but it is the customer's fault that there is downtime ... when will the big corporations learn?
is that this lack of reliability is very acceptable in a different context: M$ Win and apps on a personal computer, if you consider the average user's point of view. Am I wrong? // Nicholas Bodley // nbodley@tiac.net
Try this on for size...seems the Chicago Board of Trade got so fed up with MCI, they have begun a "PLAN B" to try to salvage PLAN A, their try at doing away with the dedicated trading links and VANnets. Read about it at
http://www.chicagotribune.com/tools/search/results/1,1780,,00.html?qy=MCI+trader&rw=1
...hmmm, wonder what MCI is using for op systems behind the fiber (especially after the holes noted in a certain OS from the Northwestern US by the folks at NETWiz (see next article on the Internet Security Audit))?
I love news stories like this... a few weeks ago a server in our company got 'sploited by a script kiddie over the weekend. I recovered from backups before Monday, but the hack was fairly high profile anyway since I made everyone change their passwords. OH HEAVEN FORBID ANYONE SHOULD HAVE TO LEARN A NEW PASSWORD - EVER! I got no end of grief, until McAfee's site got hacked. Now this.
Stories like these give me someone to point at, saying "See? This computer stuff is goddamn hard. If a multibillion dollar corporation specializing in networking can't bring their network up, how the hell can I do anything on $58K per year? Gimme a raise!"
Granted, MCI has screwed up badly - whoever does their change control should be fired over this - but the Chicago Board of Trade deserves the brunt of their customers' anger.
You don't run a critical network without backup. Their slow links should have ISDN backups, their fast links should have dedicated redundant connections. That's just common sense.
It is tempting, if the only tool you have is a hammer, to treat everything as if it were a nail. - Abraham Maslow
As I understand it, it's their frame relay that has gone down. That would most certainly be proprietary software/hardware. According to the article, they're using Lucent.
Not having a backout plan is unforgivable when that many people and businesses depend on the service. Being truly prepared for a full backout is difficult, but is doable.
To further the problem, they apparently don't have much in the way of failover preparation in place, or an established mitigation procedure (or if they do, they stupidly switched everything to the new software at once).
The final problem at MCI has nothing to do with technical issues. Many are indicating that their problems would be much less severe if MCI would be more forthcoming with information, and would make a public statement. The only thing stopping them from doing that is arrogance, execs too busy packing their golden parachutes, and a determination to spin control themselves into the ground.
I can see that. According to another poster, MCI initially told him they had messed up their routing tables.
Reports are that MCI has had some trouble for months. To me, that makes the cause of the outage an open question.
Many small ISPs are colos that lease a dial-up bank from their provider. Probably, the dial-up is linked to the colo servers through a Worldcom frame relay. Thus, the email is down. It's a pretty common setup these days.
In a former life, I was a co-op student at Western Union (Anyone remember them? They used to actually transfer data as well as money.), and I was involved in leasing data lines from AT&T and the local telcos wherever our customers needed connections.
Some of the larger outfits were indeed interested in redundancy and paid a premium for, say, two links from NYC to SF, one by way of Chicago, and another through Houston. We frequently had trouble verifying that we were really getting independent routes.
The telcos bundled connectivity (back when a single voice line was lotsa data links [after they'd been digitized]) in a hierarchy so deep that it took days or weeks to verify that every link in the route used facilities physically separate from every other.
Why? 'Cause to Ma Bell, bandwidth is fungible. Got noise on link 37A from Manhattan to Albany? Take it out of service for maintenance, and swap in some spare bps from Manhattan to Jersey City to Albany. It's all the same....
Until a manhole floods in Jersey City, and *oops* it turns out your route to Chicago (formerly via Albany) is now cheek-by-jowl with the wire to Houston, and you're dead!
Among other things, we supposedly rented circuits with huge Do Not Reroute tags hanging all over them, but on occasion someone overlooked it, or worse, the facility two levels up the hierarchy got rerouted and our link went along for the ride unknowingly.
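The verification problem described above boils down to checking, after every reroute, that the two "independent" circuits share no physical facility. Here is a minimal sketch of that check in Python. The facility names and the route lists are made up for illustration; a real audit would need the carrier's actual facility records at every level of the hierarchy, which is exactly the part that took days or weeks.

```python
# Hypothetical facility names, invented for illustration only.
# Each circuit is the list of physical facilities it currently rides through.
ENDPOINTS = {"nyc-office", "sf-office"}  # both circuits legitimately share these

def shared_facilities(route_a, route_b, ignore=()):
    """Return the facilities both routes pass through, excluding `ignore`."""
    common = (set(route_a) & set(route_b)) - set(ignore)
    return sorted(common)

chicago_planned = ["nyc-office", "albany-conduit", "chicago-office", "sf-office"]
# Same circuit after a maintenance swap moved it through Jersey City.
chicago_rerouted = ["nyc-office", "jersey-city-manhole", "albany-conduit",
                    "chicago-office", "sf-office"]
houston = ["nyc-office", "jersey-city-manhole", "houston-office", "sf-office"]

print(shared_facilities(chicago_planned, houston, ignore=ENDPOINTS))   # []
print(shared_facilities(chicago_rerouted, houston, ignore=ENDPOINTS))  # ['jersey-city-manhole']
```

The check itself is trivial; the hard part is keeping the route lists accurate when a facility two levels up gets rerouted without anyone telling you.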
I wish them well---it can only be much worse with fiber optics these days.
I refuse to believe corporations are people until Texas executes one. -- desert rain on http://www.dailykos.com/user/
I bet a lot of the companies it applies to more are those valuing economy of scale over reliability. ``We can save 0.3% by buying 10k Grace L. Furgeson routers? Great! ... They don't interoperate with any others? That's OK, we'll buy 20k and use them exclusively.''
So now when the GLF equipment shows a bug under certain wildly-unlikely circumstances, you're sincerely screwed. Much better to insist on at least two (and if you're serious, three or four) vendors' equipment, interoperating to an open standard, throughout your network. That way, as with genetic diversity in crops, livestock, and humans, you're much better able to withstand climate variation and new diseases. Half your network may be down with the bug, but you still have significant bandwidth running.
(Of course, this doesn't inoculate you against errors in the protocol, but that's better debugged than the equipment implementing it.)
I work for a mid-sized ISP and we lost connectivity on Friday the 6th, periodically through the week, and Friday the 13th. As I write this, our frame-relay connectivity to UUNet (via MCI Worldcom) is down!
We'll survive it. We have a connection to another local provider (who uses UUNet as well but doesn't seem to be experiencing the same problems) and a PPP T1 link to AT&T. A simple route-map to prepend our AS to the UUNet BGP announcements and voilà, AT&T handles most of the traffic.
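For anyone wondering why prepending works: when other attributes are equal, BGP prefers the announcement with the shorter AS path, so padding your own AS onto the path you advertise through one upstream makes the other upstream look closer. A rough Python sketch of just that comparison follows. The AS numbers are private-use values picked for illustration, not anyone's real ASNs, and the function names are hypothetical; real best-path selection looks at local preference and other attributes before path length.

```python
# Illustrative private-use AS numbers; none of these are real assignments.
OUR_AS = 64512
UPSTREAM_A = 64601  # stands in for the upstream reached over the broken frame cloud
UPSTREAM_B = 64602  # stands in for the upstream reached over the PPP T1

def announce(our_as, upstream_as, prepend=0):
    """AS path a remote network would see for our prefix via one upstream."""
    return [upstream_as] + [our_as] * (1 + prepend)

def preferred(paths):
    """Pick the shortest AS path (on a tie, the first one listed wins)."""
    return min(paths, key=len)

via_a = announce(OUR_AS, UPSTREAM_A, prepend=3)  # padded, so it looks far away
via_b = announce(OUR_AS, UPSTREAM_B)             # left alone

print(via_a)                      # [64601, 64512, 64512, 64512, 64512]
print(preferred([via_a, via_b]))  # [64602, 64512]
```

Prepending only steers inbound traffic, and a remote network is free to override it with its own policy, which is why "most of the traffic" is the honest way to put it.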
As far as a secondary feed, I don't know what you're talking about. If you have a frame-relay circuit with MCI Worldcom, a secondary frame-relay circuit won't solve the problems. We'll expect compensation for violation of our service level agreement and CIR and be done with it.
This is a big deal. Someone will lose their job or face severe disciplinary action over it but they'll figure it out and things will be back to normal. Besides this, we've had *EXCELLENT* service for several years.
Yep, our SC and PA offices have been down most of the week... Not good, not good.
Um, that's what they deserve for using a sorry ass frame cloud to do business. Frame relay isn't worth a shit and any ISP worth its salt knows this. Sure it's cheaper, but hell, so is not being in business, which is what will happen to these cheap businesses... you have to spend money to make money and if you buy poor, you stay poor.
Here's another thought. I left the "f" out of software.
(sotware--overimbibing instructions)
I see even classic Slashdot is now pretty much unusable on dial up anymore.
I forget, where did BellCore fit into all of that?
MCI will still be around when this is over, but will a lot of "mom and pop" ISPs?
There's a story on C/NET "ISPs say MCI outage could kill businesses" that's more than a little bit scary. Does MCI have their own ISP business? One that would just as soon see the little guys dry up and blow away? Do they have any corporate buddies that do?
Here's a thought. They're running Lucent hardware. The trouble apparently is related to a recent sotware upgrade install from Lucent. Same upgrade on other Lucent customers not causing problems. Lucent used to be Bell Labs (sort of). Bell Labs used to be part of AT&T (sort of). AT&T and MCI are competitors big time. Co-incidence or conspiracy? (it's the newest Ludlum novel, "The Lucent Gambit")
Well when our frame connectivity went out at work, the first thing that we got told (Friday the 6th) was that WorldCom was installing a new router somewhere in the NorthEast US. The routing tables were screwed up and that these propagated out from the new router and that WorldCom couldn't fix it. WorldCom changed their story after Day 5 of this nonsense to blame Lucent's software. I'm not sure if an upgrade to software would in fact create a propagating problem in routing tables for non upgraded routers, but the traceroute times I saw during the last week and packet loss makes me think that the problem was with the routing tables and not just a matter of "network congestion". Personally I'll always tend to believe the first story that I hear from a tech before the story that I hear 5 days later from the Public Relations Office.
... possibilities for the gov't. Of course, I'm just guessing.
"I love my job, but I hate talking to people like you" (Freddie Mercury)
lots of things could keep this down. i'm wondering if their "upgrade" was an emergency last ditch effort to fix things. i'm betting that this will all come down to a bad piece of hardware/software which has a failure mode that is protocol correct (remember the zost cost route stuff from way back?). goes back to the old "your secondary systems should be different location, different hardware, and different vendors (e.g. if you use sun, fail to dec/hp/etc)". of course many companies will whine that the QA cost is too high :) the sad thing is that they don't seem to be willing to take public responsibility for the failure, much like an auction company that seems to point blame at everyone except themselves.
I use bellsouth as an internet service provider, and i got the following letter in email:
Valued Customers:
At 11:15 pm 8/13/99 WorldCom, our Global Service Provider, notified our Network Operations Center of the need to perform emergency maintenance on their Frame Relay network beginning at 12 Noon (EDT) Saturday 8-14-99 and finishing at approximately 12 Noon (EDT) Sunday 8-15-99.
During the course of this emergency maintenance, you may or may not experience the following: congestion over the network, latency and potentially, loss of connectivity. The work being performed by WorldCom necessitates the complete shutdown of all frame relay switches within the WorldCom network, and a controlled, one by one, reinstatement of each frame relay switch back onto the network.
We have been assured by WorldCom that every effort will be made to reduce the impact to our network and to resolve the issue necessitating the emergency maintenance as expediently as possible.
We will notify you once we have received confirmation from WorldCom that all work has been completed.
Thank you for your patience and continued business,
BellSouth.net
--- Stampede linux for me! I play with fire to break the ice..
This is so flipping *STUPID*. Does any reasonable IT person with a brain believe a telco vendor when they say that they do not need a contingency plan???
While MCI and apparently Lucent share a major percentage of the blame in this particular situation, someone at CBOT screwed up big time. What next? Someone told us we did not need a UPS for our systems?
Does anyone else remember the flak Ebay took for apparently running their business without redundancies in their infrastructure?
I have a friend up here in the NW who's in charge of watching the ATM lines and making sure they get rerouted when necessary. His interpretation of the Worldcom outage was, 'They were testing a Y2K patch and it failed.' I think the ATMs will do OK. Far better than us normal people with businesses to run.
Leads me to wonder why it wasn't handled a bit better, but them's the breaks. We had flaky access last weekend and off and on through the week. Completely dead yesterday at noon but up early this morning.
I just pushed our machines thru a proxy server on a dialup during the outages to avoid complete loss of connectivity. One good reason to have lots of small companies to choose from for your service.
"I have a cunning plan..."
Is this why a lot of sites, slashdot included, are really really slow right now? I'm on New England's Mediaone. I'm not clear as to what this outage affects.
What kind of planning went into the "upgrade" of the Lucent software? Was there no back-out plan? No testing? No contingency planning? Someone at Lucent ought to be taking quite a beating right about now.... Lucent owes an explanation of this one, and they've been mighty quiet. Head-in-sand syndrome...
The real losers here are all the small ISPs who rely (possibly unknowingly) on the MCI backbone since it's wholesaled to them by other companies. 8 days of downtime will put some of these guys out of business. They don't have the cash flow to refund all their customers for lost service.
The wholesalers owe it to their ISP customers to have a secondary feed (hopefully using non-MCI, non-Lucent equipment) to prevent such a disaster as this.
faster faster... 'til the thrill of speed overcomes the fear of death...
I've got a friend who works at Lucent, and when I told him about this, his comment was "All big telephone companies blame us when their networks go down. We waste more time proving that our products weren't the cause of outages than I care to think about."
Which makes some sense. If you were in charge of damage control at MCI, would you want to say, "Yes, this was entirely our fault, and we're incompetent monkeys for letting it go on so long"?
Okay, I'm intimately familiar with the situation at this point; I've been assisting a former employer in working towards a temporary emergency resolution since they're on MCI/WorldCom. Anyways, here's what appears to have happened. Please note the following disclaimer:
I do not speak for MCI/WorldCom or Lucent, I do not work for MCI/WorldCom or Lucent, I have no affiliation with MCI/WorldCom or Lucent, and I do not have any sort of business relationship with MCI/WorldCom or Lucent.
That said to cover my ass, here's what appears to have happened.
MCI/WorldCom has had capacity issues since mid-97. When it was just MCI, they stopped selling DS3's for a period of time a few years back because they simply didn't have the capacity. MCI has long had capacity issues, and as a direct result, has typically run their equipment at or near capacity. What appears to have happened is a cascade failure. I'm going to try and put this into words, but it's easier with pictures. Trust me.
What happens in a cascade failure? A network, at or near capacity, has a failure in a single core router for some reason, in MCI/WorldCom's case, a failure due to software. The load from that core is quickly distributed to the remaining core routers. These remaining core routers, being at or near capacity, almost immediately gave way under the load, failing due to other various reasons triggered by the excess load. As each router failed down the line, the load on the remaining routers increased exponentially, cascading into a full network outage.
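Since I said pictures would help, here is the next best thing: a toy Python simulation of the cascade described above. It is a sketch under loud assumptions, not a model of MCI/WorldCom's network. Every router has the same capacity, a failed router's load is split evenly across the survivors, and the router names and numbers are invented.

```python
def simulate_cascade(loads, capacity, first_failure):
    """Fail one router, spread its load evenly over the survivors, repeat.

    loads: dict mapping router name -> current load (same units as capacity)
    capacity: the most load any single router can carry (assumed uniform)
    Returns the routers in the order they failed.
    """
    loads = dict(loads)            # work on a copy
    failed = []
    pending = [first_failure]      # routers that are about to go down
    while pending:
        router = pending.pop(0)
        shed = loads.pop(router)   # this router's load has to go somewhere
        failed.append(router)
        if not loads:
            break                  # nothing left to carry the traffic
        share = shed / len(loads)
        for name in loads:
            loads[name] += share
        for name, load in loads.items():
            if load > capacity and name not in pending:
                pending.append(name)
    return failed

# Four core routers already running at 90% of capacity: one failure takes out all four.
busy = {"A": 90, "B": 90, "C": 90, "D": 90}
print(simulate_cascade(busy, capacity=100, first_failure="A"))  # ['A', 'B', 'C', 'D']

# The same network at 50% load absorbs the failure.
idle = {"A": 50, "B": 50, "C": 50, "D": 50}
print(simulate_cascade(idle, capacity=100, first_failure="A"))  # ['A']
```

The only difference between the two runs is headroom, which is the whole point about running a network at or near capacity.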
Now, why isn't it fixed? Recovering from a cascade failure is extremely difficult. This is speaking from experience. I had a server cascade failure once; it's not an easy recovery from that. A network even more so.
To recover from a cascade failure, load has to be taken out of the picture for a period of time so as to be able to bring things back online without any load. That's the reason for the 24 hour planned outage, I believe. Not working for MCI/WorldCom or Lucent, I can't be sure or confirm this. When the load is eliminated, what has to be done is each router all the way on down the line has to be fully reset, restored to the original configuration, reconfigured, then brought back online one by one, with *NO* load on the network. If there is load on the network equivalent to what there was when it went down, then that router will immediately fail again. After each router is brought back online, and tested, each interface must be brought back up, one at a time, so as to make sure the load does not cascade out of control again. Once this is done, stability can be assumed to be restored, assuming no more interfaces or connections are added.
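The "one interface at a time" step above can be sketched the same way. This is a hypothetical illustration in Python, assuming you know roughly how much load each interface will bring back and that you want to stop short of full capacity. The 80% headroom figure and the interface names are my own assumptions, not anything MCI/WorldCom or Lucent has said.

```python
def staged_restore(interface_loads, capacity, headroom=0.8):
    """Re-enable interfaces one at a time without exceeding headroom * capacity.

    interface_loads: list of (interface name, expected load) in bring-up order
    Returns (enabled names, deferred names, total load carried).
    """
    limit = capacity * headroom
    enabled, deferred = [], []
    total = 0.0
    for name, load in interface_loads:
        if total + load <= limit:
            enabled.append(name)   # safe to bring this one back up
            total += load
        else:
            deferred.append(name)  # would push the router past the limit
    return enabled, deferred, total

interfaces = [("if0", 30), ("if1", 30), ("if2", 30), ("if3", 10)]
print(staged_restore(interfaces, capacity=100))
# (['if0', 'if1', 'if3'], ['if2'], 70.0)
```

Anything left in the deferred list is a customer who has to be moved somewhere else before the router can be called stable.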
Why MCI has taken so long to take this action, I don't know. Were I running the network, that would have been the first action upon noticing the cascade. Shut down all interfaces, cut off the load so that the cascade can be halted before the entire network is affected. Immediately notify all customers that, flat out, "a router failed due to a software upgrade, we don't know why, and we had to shut down all interfaces for a period of time to prevent the failure of the entire network. We don't know when we'll be able to get everyone back up." Furthermore, I'd do everything I could to get affected customers back online, and to find a way to get the customers attached to the failed router set up somewhere else, so as to get the network back up and be able to troubleshoot the failed router as quickly and cleanly as possible.
But like I said; I don't work for MCI/WorldCom or Lucent. I can't guarantee any of this information to be true. To be quite honest, I'm glad I don't work for either company. They have totally mishandled this whole situation, they're going to lose a lot of customers, and I believe they deserve it. You don't get and keep customers by keeping them in the dark and being very vague. Hell, if I call the Cleveland Verio NOC, they'll tell me exactly what happened when the T1 at work goes down. Either the 7513 had a failed RSP, or both power supplies failed, etc. (And still my coworkers wonder why I hate Verio.. maybe because they're telling me these kind of things weekly?) MCI/WorldCom and Lucent have turned this into a disaster of proportions that never should have happened. Oh well. Their loss, others' gain.
Welcome to the Internet in this day and age, where information comes at a premium, and customer service is something of the past. Sad but true.
-RISCy Business | Rabid System Administrator and BOFH
your company here.
shelby != ford
So MCI can't fix the problem because it is Lucent software? So because MCI does not have the source code their entire network business is at risk. Were their licenses of proprietary software listed as a liability?
whatis.com/framerelay
Bet the resumes are just overwhelming the fax machine.
Field Service Engineers
You'll install, test, and repair circuits and equipment at the customer premises. You'll satisfy customers with face-to-face interaction. You'll participate in performance improvement efforts, and be responsible for an MCI WorldCom vehicle. You'll perform dispatch duties after hours when necessary, and participate in a call-out rotation. To qualify, you'll need an AS degree in a technical field, 1 year of field service or 2 years central office experience, knowledge of personal computer software, and hardware operations will be beneficial. We have this position available in the Northern and Western suburbs as well as downtown Chicago.
Pork is not a verb
Off of Drudge's site
http://altavista.yellowpages.com.au/cgi-bin/query?mss=simple&pg=q&what=web&enc=iso88591&kl=en&locale=xx&q=%2B%22frame+relay%22+%2B%22introduction+to+networking%22&search=Search
How we know is more important than what we know.
well i work for an isp that uses netcom t1, and i guess they use mci, so we were down the better part of wednesday. i mean if you are going to hold a quarter of the infrastructure to the net...i think you ought to be able to have redundant systems and no down time. they owe the world something now
JediLuke
-Do or Do Not, There is no Try
[B] CBT president blasts MCI WorldCom in wake of Project A outage
By Bridge News
Chicago--Aug 13--On the heels of Thursday's power outage in downtown
Chicago, which forced an early shutdown at the Chicago Board of Trade, the
exchange was forced to suspend trading again today on its Project A system. CBT
President Thomas Donovan sent a letter to MCI WorldCom CEO Bernard Ebbers,
blasting the company for its part in a string of other disruptions that have
plagued the system. MCI WorldCom is the exchange's network provider and has been
unable to cope with the crises to the exchange's satisfaction.
* * *
Donovan said today's shutdown and others in the past few weeks were a direct
result of MCI WorldCom's "catastrophic service disruptions," which have deprived
large segments of the CBT's constituents access to Project A through their
trading terminals on the system's wide-area network.
"All told, our Project A markets have been down over 60% of the time since
Project A's scheduled Thursday evening trading session last week, exposing our
members and their customers to market risk and depriving them of significant
trading and revenue opportunities," he said. "The CBOT has also experienced a
sizable loss of transaction fee revenues."
MCI WorldCom has "tarnished the CBOT's 151-year reputation as a provider of
dependable and reliable market facilities," said Donovan, adding that the
problems put the exchange in the hot seat with its federal regulatory body, the
Commodity Futures Trading Commission.
He said MCI WorldCom led the CBT to believe it would not need a contingency
plan, but the exchange would now be forced to implement one beginning with the
Project A session that begins at 1800 CT Sunday.
Under the plan, many exchange members will have to move or duplicate their
Project A operations and staffing to back up locations within the building,
entailing added costs and hardships.
Last week, Project A suffered a shutdown after MCI began to upgrade its
communications network and an outage occurred at a switching center. The company
provided assurances it would try harder to restore customer confidence.
"As a result of MCI WorldCom's failure to deliver on their promises to me
early last week, the CBOT is pursuing all available remedies," Donovan said.
He said the exchange had lost all confidence in MCI WorldCom's ability to
provide reliable service and was awaiting the company's immediate response as to
how it would remedy the situation. End
Bridge News, Tel: (312) 454-3468
Send comments to Internet address:futures@bridge.com
[symbols:US;WCOM]
Aug-13-1999 17:26 GMT
Source [B] BridgeNews Global Markets
Categories:
COM/GRAIN COM/SOY COM/LIVE CAP/FOREX CAP/CREDIT CAP/INDEX COM/AGRI
COM/LUMBER COM/ENERGY CAP/STOCKS
We will not know who is to blame till a finger pointing exercise happens. LUCENT doesn't want to be blamed, MCI doesn't want it to be them, some poor contractor will probably get fired for not following procedure during the upgrade, whether he did or not.
Here at work we're wired by WorldCom
I don't even think about it when I logon
But then, when the logo keeps churning
And something smells like it's burning
I ask the question,
Where's my connection?