Worldcom's Frame Relay Down
Jim Trocki writes "MCI/Worldcom's frame relay network has been hosed for at least 8 days now. Read the story.
This is the recorded message that is heard on Worldcom's tech support line:
'In accordance with our plan to repair the instability in one of our frame
relay network platforms, we have taken our domestic frame relay platform
out of service for a 24 hour period, from noon Saturday to noon Sunday
Eastern standard time. As a result, your frame relay service will not
be available for traffic.' " Here is the MCI Worldcom web page on the situation. The news.com article says that this outage might cut into their profits. It seems this is quite a severe outage...
Try this on for size...seems the Chicago Board of Trade got so fed up with MCI, they have begun a "PLAN B" to try to salvage PLAN A, their try at doing away with the dedicated trading links and VANnets. Read about it at
http://www.chicagotribune.com/tools/search/results/1,1780,,00.html?qy=MCI+trader&rw=1
...hmmm, wonder what MCI is using for op systems behind the fiber (especially after the holes noted in a certain OS from the Northwestern US by the folks at NETWiz (see next article on the Internet Security Audit))?
I love news stories like this... a few weeks ago a server in our company got 'sploited by a script kiddie over the weekend. I recovered from backups before Monday, but the hack was fairly high profile anyway since I made everyone change their passwords. OH HEAVEN FORBID ANYONE SHOULD HAVE TO LEARN A NEW PASSWORD - EVER! I got no end of grief, until McAfee's site got hacked. Now this.
Stories like these give me someone to point at, saying "See? This computer stuff is goddamn hard. If a multibillion dollar corporation specializing in networking can't bring their network up, how the hell can I do anything on $58K per year? Gimme a raise!"
Granted, MCI has screwed up badly - whoever does their change control should be fired over this - but the Chicago Board of Trade deserves the brunt of their customers' anger.
You don't run a critical network without backup. Their slow links should have ISDN backups, their fast links should have dedicated redundant connections. That's just common sense.
It is tempting, if the only tool you have is a hammer, to treat everything as if it were a nail. - Abraham Maslow
As I understand it, it's their frame relay that has gone down. That would most certainly be proprietary software/hardware. According to the article, they're using Lucent.
Not having a backout plan is unforgivable when that many people and businesses depend on the service. Being truly prepared for a full backout is difficult, but is doable.
To further the problem, they apparently don't have much in the way of failover preparation in place, or an established mitigation procedure (or if they do, they stupidly switched everything to the new software at once).
The final problem at MCI has nothing to do with technical issues. Many are indicating that their problems would be much less severe if MCI would be more forthcoming with information, and would make a public statement. The only thing stopping them from doing that is arrogance, execs too busy packing their golden parachutes, and a determination to spin control themselves into the ground.
For large business, I agree, not dual homing is a bad idea.
The real issue is small business, which is less likely to survive the outage in the first place, and cannot afford to dual home. Unfortunately, connectivity is still WAY more expensive than it has any right to be. For small business, doubling that cost is out of the question.
I don't advise any business to NOT dual home if at all possible. In fact, I would advise business to put their servers in a colo facility which is dual homed.
What I'm saying is that some businesses CANNOT afford to do that and provide decent connectivity to their office as well. I can understand their temptation to go with a single provider and hope for the best.
I can see that. According to another poster, MCI initially told him they had messed up their routing tables.
Reports are that MCI has had some trouble for months. To me, that makes the cause of the outage an open question.
Many small ISPs are colos that lease a dial-up bank from their provider. Probably, the dial-up is linked to the colo servers through a Worldcom frame relay. Thus, the email is down. It's a pretty common setup these days.
What a corporation CANNOT pay for and what they WILL NOT pay for are two different things
CORPORATION? Sure, odds are they can and should afford it. I am talking about SMALL operations. Think sole proprietors (that is Mom 'n Pop). $1200 could represent a significant portion of their income while they are getting established. It could make the difference between surviving long enough to become profitable and closing the doors.
Those are the people I feel for in this situation, especially since they are probably at the bottom of MCI's priority queue.
I agree that this incident will re-define what corporate America will pay for wrt redundancy. In a way, that is a shame since it essentially rewards the network providers for being unreliable.
Yes, it will get worse. When all the little companies merge into One Big Corporation (it can't sue itself) there will be no competition and it can honestly market itself as the number one service. I prefer lots of little companies, like little servers, rather than One Big Company. It's One Big FU waiting to happen...
MCI will still be around when this is over, but will a lot of "mom and pop" ISPs?
There's a story on C/NET "ISPs say MCI outage could kill businesses" that's more than a little bit scary. Does MCI have their own ISP business? One that would just as soon see the little guys dry up and blow away? Do they have any corporate buddies that do?
I see even classic Slashdot is now pretty much unusable on dial up anymore.
Here's a thought. They're running Lucent hardware. The trouble apparently is related to a recent software upgrade install from Lucent. Same upgrade on other Lucent customers not causing problems. Lucent used to be Bell Labs (sort of). Bell Labs used to be part of AT&T (sort of). AT&T and MCI are competitors big time. Co-incidence or conspiracy? (it's the newest Ludlum novel, "The Lucent Gambit")
lots of things could keep this down. i'm wondering if their "upgrade" was an emergency last ditch effort to fix things. i'm betting that this will all come down to a bad piece of hardware/software which has a failure mode that is protocol correct (remember the zero cost route stuff from way back?). goes back to the old "your secondary systems should be different location, different hardware, and different vendors" (e.g. if you use sun, fail to dec/hp/etc). of course many companies will whine that the QA cost is too high :) the sad thing is that they don't seem to be willing to take public responsibility for the failure, much like an auction company that seems to point blame at everyone except themselves.
"Its One Big FU waiting to happen..."
I think this applies more to some companies than others. We had great difficulty with MCI WorldCom specifically, with something as simple as turning on a couple of T1's that they had 45 days' notice of the install date for.
It's not all big companies, though. We have some services from Frontier (itself the result of many mergers), and they generally display a much higher level of competence.
I use bellsouth as an internet service provider, and i got the following letter in email:
Valued Customers:
At 11:15 pm 8/13/99 WorldCom, our Global Service Provider, notified our Network Operations Center of the need to perform emergency maintenance on their Frame Relay network beginning at 12 Noon (EDT) Saturday 8-14-99 and finishing at approximately 12 Noon (EDT) Sunday 8-15-99.
During the course of this emergency maintenance, you may or may not experience the following: congestion over the network, latency and potentially, loss of connectivity. The work being performed by WorldCom necessitates the complete shutdown of all frame relay switches within the WorldCom network, and a controlled, one by one, reinstatement of each frame relay switch back onto the network.
We have been assured by WorldCom that every effort will be made to reduce the impact to our network and to resolve the issue necessitating the emergency maintenance as expediently as possible.
We will notify you once we have received confirmation from WorldCom that all work has been completed.
Thank you for your patience and continued business,
BellSouth.net
--- Stampede linux for me! I play with fire to break the ice..
Okay, I'm intimately familiar with the situation at this point; I've been assisting a former employer in working towards a temporary emergency resolution since they're on MCI/WorldCom. Anyways, here's what appears to have happened. Please note the following disclaimer:
I do not speak for MCI/WorldCom or Lucent, I do not work for MCI/WorldCom or Lucent, I have no affiliation with MCI/WorldCom or Lucent, and I do not have any sort of business relationship with MCI/WorldCom or Lucent.
That said to cover my ass, here's what appears to have happened.
MCI/WorldCom has had capacity issues since mid-97. When it was just MCI, they stopped selling DS3's for a period of time a few years back because they simply didn't have the capacity. MCI has long had capacity issues, and as a direct result, has typically run their equipment at or near capacity. What appears to have happened is a cascade failure. I'm going to try and put this into words, but it's easier with pictures. Trust me.
What happens in a cascade failure? A network, at or near capacity, has a failure in a single core router for some reason; in MCI/WorldCom's case, a failure due to software. The load from that core is quickly distributed to the remaining core routers. These remaining core routers, being at or near capacity, almost immediately give way under the load, failing for various other reasons triggered by the excess load. As each router fails down the line, the load on the remaining routers increases exponentially, cascading into a full network outage.
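Since I said it's easier with pictures, here's a rough toy model in Python instead. It is only a sketch of the idea described above, not anything taken from MCI/WorldCom or Lucent: the function name, the numbers, and the assumption that a dead router's load is split evenly among the survivors are all made up for illustration.

```python
def simulate_cascade(capacities, loads, first_failure):
    """Return the order in which routers fail after `first_failure` dies.

    Toy assumption: a failed router's load is split evenly among
    the routers that are still alive.
    """
    alive = set(range(len(capacities)))
    loads = list(loads)            # work on a copy
    failed_order = []
    pending = [first_failure]      # routers that are about to fail

    while pending:
        r = pending.pop(0)
        if r not in alive:
            continue
        alive.remove(r)
        failed_order.append(r)
        if not alive:
            break                  # nothing left to carry the load

        # Hand the dead router's load to the survivors.
        share = loads[r] / len(alive)
        loads[r] = 0.0
        for s in sorted(alive):
            loads[s] += share

        # Any survivor pushed past its capacity fails next.
        for s in sorted(alive):
            if loads[s] > capacities[s] and s not in pending:
                pending.append(s)

    return failed_order


# Five routers running at 90% of capacity: one failure takes them all out.
print(simulate_cascade([100] * 5, [90] * 5, first_failure=0))
# The same routers at 50% absorb the failure without cascading.
print(simulate_cascade([100] * 5, [50] * 5, first_failure=0))
```

In the 90% case the per-router load on the survivors goes 112.5, 150, 225, 450 as each router drops out, which is the runaway effect described above. At 50% the survivors only rise to 62.5 and nothing else fails.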
Now, why isn't it fixed? Recovering from a cascade failure is extremely difficult. This is speaking from experience. I had a server cascade failure once; it's not an easy recovery from that. A network even moreso.
To recover from a cascade failure, load has to be taken out of the picture for a period of time so as to be able to bring things back online without any load. That's the reason for the 24 hour planned outage, I believe. Not working for MCI/WorldCom or Lucent, I can't be sure or confirm this. When the load is eliminated, what has to be done is each router all the way on down the line has to be fully reset, restored to the original configuration, reconfigured, then brought back online one by one, with *NO* load on the network. If there is load on the network equivalent to what there was when it went down, then that router will immediately fail again. After each router is brought back online, and tested, each interface must be brought back up, one at a time, so as to make sure the load does not cascade out of control again. Once this is done, stability can be assumed to be restored, assuming no more interfaces or connections are added.
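The interface-by-interface step can be sketched the same way. Again, this is a hypothetical illustration and not MCI/WorldCom's actual procedure: the `headroom` threshold, the interface names, and the demand figures are my own assumptions. The router starts idle, and an interface is only re-enabled if doing so keeps the load under a safety margin.

```python
def staged_restore(router_capacity, interface_loads, headroom=0.8):
    """Re-enable interfaces one at a time on a router brought up idle.

    `interface_loads` is a list of (name, expected_demand) pairs.
    An interface is held back if enabling it would push the router
    above headroom * router_capacity (an assumed safety margin).
    Returns (enabled, held_back).
    """
    limit = headroom * router_capacity
    load = 0.0                     # router comes back with no load
    enabled, held_back = [], []

    for name, demand in interface_loads:
        if load + demand <= limit:
            load += demand
            enabled.append(name)
        else:
            held_back.append(name) # would risk restarting the cascade

    return enabled, held_back


enabled, held_back = staged_restore(
    100, [("if0", 30), ("if1", 40), ("if2", 20), ("if3", 5)]
)
print(enabled, held_back)
```

With these made-up numbers, if0, if1 and if3 come back (75 units of load against a limit of 80) and if2 is held back until capacity is found elsewhere.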
Why MCI has taken so long to take this action, I don't know. Were I running the network, that would have been the first action upon noticing the cascade. Shut down all interfaces, cut off the load so that the cascade can be halted before the entire network is affected. Immediately notify all customers that, flat out, "a router failed due to a software upgrade, we don't know why, and we had to shut down all interfaces for a period of time to prevent the failure of the entire network. We don't know when we'll be able to get everyone back up." Furthermore, I'd do everything I could to get affected customers back online, and to find a way to get the customers attached to the failed router set up somewhere else, so as to get the network back up and be able to troubleshoot the failed router as quickly and cleanly as possible.
But like I said; I don't work for MCI/WorldCom or Lucent. I can't guarantee any of this information to be true. To be quite honest, I'm glad I don't work for either company. They have totally mishandled this whole situation, they're going to lose a lot of customers, and I believe they deserve it. You don't get and keep customers by keeping them in the dark and being very vague. Hell, if I call the Cleveland Verio NOC, they'll tell me exactly what happened when the T1 at work goes down. Either the 7513 had a failed RSP, or both power supplies failed, etc. (And still my coworkers wonder why I hate Verio.. maybe because they're telling me these kinds of things weekly?) MCI/WorldCom and Lucent have turned this into a disaster of proportions that never should have happened. Oh well. Their loss, others' gain.
Welcome to the Internet in this day and age, where information comes at a premium, and customer service is something of the past. Sad but true.
-RISCy Business | Rabid System Administrator and BOFH
your company here.
shelby != ford
So MCI can't fix the problem because it is Lucent software? So because MCI does not have the source code their entire network business is at risk. Were their licenses of proprietary software listed as a liability?
whatis.com/framerelay
http://altavista.yellowpages.com.au/cgi-bin/query?mss=simple&pg=q&what=web&enc=iso88591&kl=en&locale=xx&q=%2B%22frame+relay%22+%2B%22introduction+to+networking%22&search=Search
How we know is more important than what we know.
[B] CBT president blasts MCI WorldCom in wake of Project A outage
By Bridge News
Chicago--Aug 13--On the heels of Thursday's power outage in downtown
Chicago, which forced an early shutdown at the Chicago Board of Trade, the
exchange was forced to suspend trading again today on its Project A system. CBT
President Thomas Donovan sent a letter to MCI WorldCom CEO Bernard Ebbers,
blasting the company for its part in a string of other disruptions that have
plagued the system. MCI WorldCom is the exchange's network provider and has been
unable to cope with the crises to the exchange's satisfaction.
* * *
Donovan said today's shutdown and others in the past few weeks were a direct
result of MCI WorldCom's "catastrophic service disruptions," which have deprived
large segments of the CBT's constituents access to Project A through their
trading terminals on the system's wide-area network.
"All told, our Project A markets have been down over 60% of the time since
Project A's scheduled Thursday evening trading session last week, exposing our
members and their customers to market risk and depriving them of significant
trading and revenue opportunities," he said. "The CBOT has also experienced a
sizable loss of transaction fee revenues."
MCI WorldCom has "tarnished the CBOT's 151-year reputation as a provider of
dependable and reliable market facilities," said Donovan, adding that the
problems put the exchange in the hot seat with its federal regulatory body, the
Commodity Futures Trading Commission.
He said MCI WorldCom led the CBT to believe it would not need a contingency
plan, but the exchange would now be forced to implement one beginning with the
Project A session that begins at 1800 CT Sunday.
Under the plan, many exchange members will have to move or duplicate their
Project A operations and staffing to back up locations within the building,
entailing added costs and hardships.
Last week, Project A suffered a shutdown after MCI began to upgrade its
communications network and an outage occurred at a switching center. The company
provided assurances it would try harder to restore customer confidence.
"As a result of MCI WorldCom's failure to deliver on their promises to me
early last week, the CBOT is pursuing all available remedies," Donovan said.
He said the exchange had lost all confidence in MCI WorldCom's ability to
provide reliable service and was awaiting the company's immediate response as to
how it would remedy the situation. End
Bridge News, Tel: (312) 454-3468
Send comments to Internet address:futures@bridge.com
[symbols:US;WCOM]
Aug-13-1999 17:26 GMT
Source [B] BridgeNews Global Markets
I agree about a back-out plan and a contingency plan, but testing for systems like this is often just not going to happen.
It's usually not practical to really test complex systems under anything that approaches real-world conditions. MCI WorldCom can't maintain a test environment that anywhere near mirrors the significant portion of the Internet that they service. Anything less than a test under real-world loads will not be representative of what will happen when you put it into service.
And remember, testing only demonstrates the presence of defects, not their absence.
Who's to say that Lucent is at fault here? I would guess that the same equipment is running outside of MCI and we're not hearing about problems there. I don't actually know the situation with regard to the Lucent hardware/software. It may be that this is something that only MCI has, or something that only MCI has put under such loads.
People need to get some perspective. With the growth of Internet and bandwidth demands in general, combined with the cut-throat cost competition environment for these carriers, it's really surprising to me that we don't have a lot more failures like this one. Get over it.