MS, CNET On 7-Day Messenger Outage
imipak writes: "Microsoft have finally commented on the recent seven day outage at their Messenger IM service -- some users have permanently lost data, and there's still no explanation of the cause. Interesting earlier story from CNet News. Key quote: "... an outage that lasts seven days with no valid explanation really starts to make you think about .Net, and about Microsoft's plans for the Internet. What if this were the new Office software verification service that was down?"" Here's a story on MSNBC as well.
we still don't know the real story
Sometimes, news is news.
> From what I've read, they had a disk controller failure, and the secondary (i.e. backup) controller also had some kind of fault which led them "further down the disaster recovery plan." Which means now they go to tape backups, probably.
And how are they going to sell .net with this kind of high-profile outage on everyone's minds? If you're the CIO of a big company and you move your company to .net, do you just send all your employees home for a week when something like this happens?
Sounds like they're just making excuses. No matter how they spin it, there's no excuse for a disk controller to put them out of service for a week. Lose data since the last backup, sure, but not a week long shutdown.
Recall also that Hotmail suffered a 10+ day outage for a subset of their users last summer, and some of those users permanently lost their data and had to just start new accounts.
I agree with Pinball Wizard: there's no excuse for this kind of thing. (Frankly, I think it's because Microsoft still doesn't 'get' anything beyond a single-user system.)
FWIW, they've been saying the same thing over at c.o.l.a. for a few days, and even one of the tenured trolls is agreeing that it's inexcusable.
--
Sheesh, evil *and* a jerk. -- Jade
The CNet article seems to have someone from Microsoft hinting it was a hardware failure..
What happened? Did all three CTRL-ALT and DEL keys fall off at the same time?
Enquiring minds want to know, cos if you talk to Cisco, HP, et al.. they'll sell you something called a 'Maintenance Contract'.
EZ
"Oops, I always forget the purpose of competition is to divide people into winners and losers." - Hobbes
Not 7 days. And, they published an entire blow-by-blow timeline of what happened.
So, who's more responsible?
-- You can't idiot-proof anything, because they're always coming out with better idiots.
"This outage is not indicative of Microsoft's ability to move forward with its .Net strategy," Visse said. "This was one isolated issue brought on by a series of extremely rare hardware failures."
I find this highly improbable. Any ISP worth its service has either service contracts on its hardware, or a closet of "critical spares" (hardware that the ISP couldn't function without, and therefore keeps a second piece or the parts to repair equipment), usually both.
For a company the size of MS, it should be a foregone conclusion that both of these things are already covered. I know that the Messenger Service isn't quite as mission-critical as, say, a primary fileserver, nor is the messenger service as important as many other ISP services (web, mail, authentication, etc), but come on! Hardware failure isn't an excuse for a multiple *workday* outage. Not for a company the size of MS.
EveryDNS. Use it. It works.
AC's need not reply
I think the notes to be examined here are the lack of PR and customer support on the issue and the extended length of time of the outage. You can bet that the issue is being examined very closely by MS and will not happen the same in the future. I guess what I'm saying is that yelling and pointing doesn't fix anything and that the same could happen to you so learn from the down side of the whole thing.
Availability is a very hard problem to solve for any service. I think MSN did well to keep as many people connected as they did (I for instance did not lose service).
Unfortunately this, like the /. outage, was a hardware issue and things that should have worked (and probably worked hundreds of times when tested) did not work.
"You can now flame me, I am full of love,"
And, of course, it too lasted 7 days, without so much as a report on it, and wiped out user data....
Seriously, there's no comparison.
And I agree with the header - this really does make you question how much to trust their plans to make everything need to authenticate remotely.
-= rei =-
"This may be presumptuous..." "That's my favorite kind of 'This'."
"This incident went completely unsupported."
Apparently it bears pointing out once again that this is a key issue for companies doing business with the internet community. Someone at MS hasn't read the Cluetrain Manifesto yet! Some particular points from it:
11. People in networked markets have figured out that they get far better information and support from one another than from vendors. So much for corporate rhetoric about adding value to commoditized products.
12. There are no secrets. The networked market knows more than companies do about their own products. And whether the news is good or bad, they tell everyone.
25. Companies need to come down from their Ivory Towers and talk to the people with whom they hope to create relationships.
28. Most marketing programs are based on the fear that the market might see what's really going on inside the company.
30. Brand loyalty is the corporate version of going steady, but the breakup is inevitable and coming fast. Because they are networked, smart markets are able to renegotiate relationships with blinding speed.
There are many more they would do well to take into account as well, particularly down around #82...
LEXX
"Gold still represents the ultimate form of payment in the world." - Alan Greenspan, 1999
Hardly - it's not that I don't TRUST the authentication of Passport. It's the fear of a company like Microsoft storing all this personal data of mine to access other sites, pay for stuff, etc.
Top Most Bizarre/Disturbing Error Messages
So you could access it from another computer which you were validated on. Of course - it seems to me that using a local CACHE of the data would be a brilliant idea - if you change your list offline, the change is made locally and the server gets updated later, etc. If the server was down, you could still send stuff to folks peer to peer if you had a local copy stored - 'course with ICQ you'd be stuck offline if the servers died...
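The local-cache idea above can be sketched roughly as follows. This is only an illustration under stated assumptions: `FakeServer`, `ServerDown`, and `CachedContactList` are made-up names, not anything from Messenger or ICQ, and conflict handling between several machines is ignored entirely. Changes are applied to the local copy right away and queued; the queue is replayed in order whenever the server can be reached.

```python
from dataclasses import dataclass, field


class ServerDown(Exception):
    """Raised by the (hypothetical) server when it cannot be reached."""


@dataclass
class FakeServer:
    """Stand-in for the central contact-list service; not a real API."""
    contacts: set = field(default_factory=set)
    up: bool = True

    def apply(self, op, name):
        if not self.up:
            raise ServerDown()
        if op == "add":
            self.contacts.add(name)
        else:
            self.contacts.discard(name)


@dataclass
class CachedContactList:
    """Local copy of the list plus a queue of changes not yet uploaded."""
    contacts: set = field(default_factory=set)
    pending: list = field(default_factory=list)

    def add(self, name):
        # Change the local copy first, then remember to tell the server.
        self.contacts.add(name)
        self.pending.append(("add", name))

    def remove(self, name):
        self.contacts.discard(name)
        self.pending.append(("remove", name))

    def sync(self, server):
        """Replay queued changes in order; stop at the first failure."""
        while self.pending:
            op, name = self.pending[0]
            try:
                server.apply(op, name)
            except ServerDown:
                return False  # keep the queue, try again later
            self.pending.pop(0)
        return True
```

With the server down, `add` and `remove` still work against the local copy and `sync` simply returns `False`; once the server is back, a later `sync` drains the queue.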
Top Most Bizarre/Disturbing Error Messages
Through HailStorm, Microsoft hopes to deliver e-commerce services, address books online, and password management to disparate devices such as PCs, handhelds or cell phones.
"No, your PC won't be useable 'till next week. We're sorry about that. No, you can't use your cell phone either. We have a minor I/O problem."
It's such a nice, comforting feeling knowing everything is taken care of, and in good hands. The future's bright. Where do you want to go today?
OK, it could happen to any negligent sysadmin (uhm, count me in). I don't have any problem with the way Microsoft runs its business (OK, maybe I do, on moral grounds). What I do have a problem with is any kind of centralized information center. Data cannot be stored safely in one location, on one system, prone to failure. I'm sure even a complete idiot would NOT have overlooked this. And please, let's not even think about the consequences of one company keeping records of 98% of the desktop users. Fortunately, we do have a choice. Would be a shame to waste, considering the alternative.
I admit, my opinion is biased. So is yours.
No need to fear, by the time .NET is up all of Microsoft's servers will be running FreeBSD..
So, there ARE legitimate, work-related uses for instant messenger software. =)
---
I realize I'm just a lowly mathematician and all, but doesn't this seem reasonable, even for people that design real-life applications?
Curmudgeon Gamer: Not happy
lasts seven days with no valid explanation really starts to make you think about .Net
At least it is not my family's 7 years of financial data, or the copies of my child's baby-pictures - or my presentation that I needed for a job-interview. We don't have to tell MS that distributed resources increase fault tolerance. When you devise a massive system with a single point of failure (M$.Net) you are going to burn - and burn big-time. If .NET services were distributed to many 'equal' computers (think the internet as it is structured today) then we could withstand the loss of one machine; in the M$ vision of the future many-many-many services and machines rely on their .NET systems. Imagine if TCP/IP had to 'ping' authorize.big-toll-gate.com's 'license' server in order to start - now imagine they go down....
This may not be a surprise to anyone on /.; but what happens when Passport crashes for a week and no one is able to pay bills, or maybe the Office.NET file storage site burns down and takes millions of people's family photos (yes I know about off-site backups).
The point is simple - you cannot build a reliable system with such a glaring single-point-of-failure. Downtime happens - and as this MSMessenger event shows us - .NET has serious potential for peril. I hope all the PHBs and DoJ are paying attention...
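To put rough numbers on the single-point-of-failure argument, here is a back-of-the-envelope sketch. The function names are made up for illustration, and the `parallel` formula assumes replicas fail independently, which is a strong assumption: the reported story of a controller and its backup both faulting suggests real failures can be correlated.

```python
def availability_from_downtime(days_down, period_days=365):
    """Fraction of the period the service was up."""
    return 1 - days_down / period_days


def series(*parts):
    """Availability when every part must be up (a chain of dependencies)."""
    result = 1.0
    for a in parts:
        result *= a
    return result


def parallel(*replicas):
    """Availability when any one replica is enough.

    Assumes failures are independent, which real hardware often violates.
    """
    all_down = 1.0
    for a in replicas:
        all_down *= (1 - a)
    return 1 - all_down


# One seven-day outage in a year is roughly 98.1% availability.
single = availability_from_downtime(7)

# Two such services, either of which could serve you (idealised case).
replicated = parallel(single, single)

# An application that needs two such services both up at once.
chained = series(single, single)
```

Under these assumptions, chaining dependencies can only lower availability, while independent replicas raise it; that is the arithmetic behind "distributed resources increase fault tolerance".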
I think the real cause was something like...
Error:
MsgrSvr.exe caused an invalid page fault in module KERNEL32.DLL at 015f:bff9dba7.
--
All opinions presented here aren't mine.
This crowd? Nah - we all wrote scripts that sent us email alerts to our cellphones when slashdot came back up and we could finally find 'CowboyNeal' somewhere in the HTML source :)
Top Most Bizarre/Disturbing Error Messages
No, it is virtually always used for leisure: pretending to do work whilst actually swapping sweet-little-nothings with Jane in accounts, or arranging a Q3 duel with DukeQuakem. (If someone actually has an important, legit reason for using a messenger service, please correct me...)
Basically, if you can't use MSN messenger, you can use email, or pick up the phone. I'm sure, when MSN messenger breaks down, it's not at the top of MS's list of priorities.
Perhaps, er, they had better things to do? Or perhaps it got lost at the bottom of someone's in-tray?
However, it probably wasn't a good idea for MS to leave it so long. So many bloody people use it that it does send out a helluvalot of bad publicity (I'm not going to get that date with Jane this weekend and it is ALL Micro$oft's fault!! Bah!). However, I think if an important component of .NET were to fail, and adversely affect many critical services, MS might react a little quicker, with greater resources & assurance.
See this report from The Register for the grisly details.
I suppose you could say this is because VeriSign and Network Solutions are insane, deranged companies, and there is most likely truth to this. But I'm not convinced; I HAVE TO deal with these idiots for my domain names, and now I have to rely on .NET to do it. Ick.
D
----
Ya know, there are two statements in the minute information they've released:
1) It was caused by a freak failure when a hard disk controller crashed.
2) They've said they have to restore from backups.
If both are true, then it sounds like they were using a distributed database (or filesystem?) and one machine going down very badly managed to infect lots of others... doesn't bode well, especially when MS's solution to competing in the Server environment is traditionally to Cluster lots of machines together. The more you have the more chance one may have problems.
If the first statement is false, then the only thing I can think of is that the system was infected by either an outside source, or some other malicious virus. Standard Operating Procedure in this case would be to disconnect the machines, diagnose the problem (so new machines wouldn't be infected), and then restore from backup. It's also possible someone over-reacted and they went into this mode when in actuality Item 1 was true.
Anybody else think we're hearing the whole story?
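The "more machines, more chance one has problems" remark above is easy to check with a quick calculation. This is a sketch only: the 1% per-machine failure chance is a made-up figure, and failures are assumed independent.

```python
def p_any_failure(p_single, n_machines):
    """Chance that at least one of n machines fails.

    Assumes each machine fails independently with probability p_single.
    """
    return 1 - (1 - p_single) ** n_machines


# Hypothetical numbers: a 1% chance per machine, across clusters of growing size.
for n in (1, 10, 100):
    print(n, round(p_any_failure(0.01, n), 3))
```

With that made-up 1% figure, a 100-machine cluster has roughly a 63% chance of at least one failure. So a cluster only helps if one bad machine cannot take the others down with it, which is exactly the worry raised above.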
This space for rent. All reasonable inquiries will be entertained at proprietors discretion.
I'm reminded, when these kinds of outages happen, of Peter Deutsch's 8 Fallacies of Distributed Computing:
1. The network is reliable.
2. Latency is zero.
3. Bandwidth is infinite.
4. The network is secure.
5. Topology doesn't change.
6. There is one administrator.
7. Transport cost is zero.
8. The network is homogeneous.
This is, of course, why the idea of remote authentication being necessary to use your word processor is a bad thing. Heck, even losing something as innocuous as an instant messaging program brought thousands of people to a screeching halt for a week. It seems to me that Microsoft (although they're certainly not the only ones) seems to believe these 8 fallacies blindly, especially 1, 4, and (they're hoping) 6.
Right...
I am utterly amazed at times the things I hear about how system administration is performed at MS. Ever check their jobs page? They're really picky about who they hire, you know.
Yet we repeatedly hear about security problems with their own servers, how all their DNS servers were on the same network segment, hotmail goes down and now this? Lost data??!!!
I'm sorry, but as a former full-time sysadmin, there is absolutely no excuse for losing data. Preserving your company's data is the #1 priority of any sysadmin, regardless of the company. And preserving data with 100% certainty is achievable by anyone who takes the time to set things up right.
Oh well, I was never a fan of their passport/hailstorm idea anyway. Things like this can only cause more people to run away from using those services.
No, Thursday's out. How about never - is never good for you?
Bet they know how I feel at work every day now...
-- Geof F. Morris
Think about Hobbes' social contract.
'People give up certain rights and freedoms for a feeling of safety etc.'
This is the same sort of situation, kinda. People give up having their own servers for communications and data storage in technologies like .NET. It is the company's responsibility to give us fair service, and tell us what's going on.
If we do not like what's going on, it is our right and responsibility to seek alternatives.
You're always going to risk loss of data and loss of service if you let someone else handle your data, communications, authorization, etc. It's a risk that you take. You hope that the company is able to do a good job and maintain good service. Remember, if you start using .NET and using all of the authorization features to access Microsoft's sites that require Passport/Messenger, just like in Hobbes' social contract you are giving up some rights and some control. You're taking a risk. But remember, there are other choices.
[Something witty and intelligent should have appeared here.]
{Traicovn}
As a (curious) sysadmin I wouldn't mind reading a post mortem like what the /. crew did a few weeks ago. I think MS is missing out on a lot of brownie points by not publishing a blow by blow summary of how an enterprise goes about troubleshooting/fixing a system like that. It would be possible to do something like that w/o disclosing sensitive information. Like I said, wishful thinking.
BOSTON SUCKS!
Recent surveys show that employees that use Microsoft's popular Instant Messenger software are having one of the most productive weeks in recent years.
Now if only Slashdot would have a week-long outage, I could get some work done.
We sent out an instant message to all the users letting them know about the outage.
Remember when users couldn't get through because there were busy signals all the time?
Remember how people said that there was going to be a mass exodus from AOL?
Remember how that didn't happen?
No matter how badly MS screws this incident up, no matter how many judgements get made against them, the average business drone and Joe User will still end up using .NET.
"Enough of this wretched, whining monkey life." -- Marcus Aurelius, _Meditations_, Book 9, 37