Slashdot Mirror


MS, CNET On 7-Day Messenger Outage

imipak writes: "Microsoft have finally commented on the recent seven-day outage at their Messenger IM service -- some users have permanently lost data, and there's still no explanation of the cause. Interesting earlier story from CNet News. Key quote: "... an outage that lasts seven days with no valid explanation really starts to make you think about .Net, and about Microsoft's plans for the Internet. What if this were the new Office software verification service that was down?"" Here's a story on MSNBC as well.

29 of 249 comments

  1. Re:Is there really any reason to comment on server by Anonymous Coward · · Score: 3
    Anne Tomlinson

    we still don't know the real story

  2. Re:Shit happens... by sachmet · · Score: 3
    The bottom line here is this wouldn't be news if it wasn't Microsoft. Hardware fails (think back to the /. failure a short while ago) and sometimes it takes a while to get it going again.
    Not quite; if AIM or ICQ failed for a week, it would be news, too. The tie-in to .NET is auxiliary to the real point: that a company which claims to have a large, scalable architecture had a problem which affected a large number of Internet users (over 10m) and couldn't fix it in a week's time.

    Sometimes, news is news.
  3. Re:Absolutely no excuse for this. by Black+Parrot · · Score: 3

    > From what I've read, they had a disk controller failure, and the secondary (i.e. backup) controller also had some kind of fault which led them "further down the disaster recovery plan." Which means now they go to tape backups, probably.


    Sounds like they're just making excuses. No matter how they spin it, there's no excuse for a disk controller to put them out of service for a week. Lose data since the last backup, sure, but not a week long shutdown.

    Recall also that Hotmail suffered 10+ days outage for a subset of their users last summer, and some of those users permanently lost their data and had to just start new accounts.

    I agree with Pinball Wizard: there's no excuse for this kind of thing. (Frankly, I think it's because Microsoft still doesn't 'get' anything beyond a single-user system.)

    FWIW, they've been saying the same thing over at c.o.l.a. for a few days, and even one of the tenured trolls is agreeing that it's inexcusable.

    And how are they going to sell .net with this kind of high-profile outage on everyone's minds? If you're the CIO of a big company and you move your company to .net, do you just send all your employees home for a week when something like this happens?

    --
    Sheesh, evil *and* a jerk. -- Jade
  4. Hardware Failure by EasyTarget · · Score: 3

    The CNet article seems to have someone from Microsoft hinting it was a hardware failure..

    What happened? Did all three CTRL, ALT and DEL keys fall off at the same time?

    Enquiring minds want to know, cos if you talk to Cisco, HP, et al.. they'll sell you something called a 'Maintenance Contract'.

    EZ

    --
    "Oops, I always forget the purpose of competition is to divide people into winners and losers." - Hobbes
  5. Re:Is there really any reason to comment on server by iceT · · Score: 3

    Not 7 days. And, they published an entire blow-by-blow timeline of what happened.

    So, who's more responsible?

    --
    -- You can't idiot-proof anything, because they're always coming out with better idiots.
  6. Hardware Failure? by AsbestosRush · · Score: 3

    "This outage is not indicative of Microsoft's ability to move forward with its .Net strategy," Visse said. "This was one isolated issue brought on by a series of extremely rare hardware failures."

    I find this highly improbable. Any ISP worth its service has either service contracts on its hardware, or a closet of "critical spares" (hardware that the ISP couldn't function without, and therefore keeps a second piece or the parts to repair equipment), usually both.
    For a company the size of MS, it should be a foregone conclusion that both of these things are already covered. I know that the Messenger Service isn't quite as mission-critical as, say, a primary fileserver, nor is the messenger service as important as many other ISP services (web, mail, authentication, etc.), but come on! Hardware failure isn't an excuse for a multiple *workday* outage. Not for a company the size of MS.

    --
    EveryDNS. Use it. It works.
    AC's need not reply
  7. Shit happens... by malfunct · · Score: 3
    The bottom line here is this wouldn't be news if it wasn't Microsoft. Hardware fails (think back to the /. failure a short while ago) and sometimes it takes a while to get it going again.

    I think the points worth examining here are the lack of PR and customer support on the issue and the extended length of the outage. You can bet that MS is examining the issue very closely and that it will not happen the same way in the future. I guess what I'm saying is that yelling and pointing doesn't fix anything, and the same could happen to you, so learn from the downside of the whole thing.

    Availability is a very hard problem to solve for any service. I think MSN did well to keep as many people connected as they did (I for instance did not lose service).

    Unfortunately this, like the /. outage, was a hardware issue and things that should have worked (and probably worked hundreds of times when tested) did not work.

    --

    "You can now flame me, I am full of love,"

  8. Re:Is there really any reason to comment on server by Rei · · Score: 3

    And, of course, it too lasted 7 days, without so much as a report on it, and wiped out user data....

    Seriously, there's no comparison.

    And I agree with the header - this really does make you question how much to trust their plans to make everything need to authenticate remotely.

    -= rei =-

    --
    "This may be presumptuous..." "That's my favorite kind of 'This'."
  9. Re:Consulting by thelexx · · Score: 3

    "This incident went completely unsupported."

    Apparently it bears pointing out once again that this is a key issue for companies doing business with the internet community. Someone at MS hasn't read the Cluetrain Manifesto yet! Some particular points from it:

    11. People in networked markets have figured out that they get far better information and support from one another than from vendors. So much for corporate rhetoric about adding value to commoditized products.

    12. There are no secrets. The networked market knows more than companies do about their own products. And whether the news is good or bad, they tell everyone.

    25. Companies need to come down from their Ivory Towers and talk to the people with whom they hope to create relationships.

    28. Most marketing programs are based on the fear that the market might see what's really going on inside the company.

    30. Brand loyalty is the corporate version of going steady, but the breakup is inevitable and coming fast. Because they are networked, smart markets are able to renegotiate relationships with blinding speed.

    There are many more they would do well to take into account as well, particularly down around #82...

    LEXX

    --
    "Gold still represents the ultimate form of payment in the world." - Alan Greenspan, 1999
  10. Re:MS to use Verisign for Hailstorm and Passport by baptiste · · Score: 3
    I guess we would all be more trusting if it were Verisign and not MS

    Hardly - it's not that I don't TRUST the authentication of Passport. It's the fear of a company like Microsoft storing all this personal data of mine to access other sites, pay for stuff, etc.

  11. Re:Why are the Buddy lists not local??? by baptiste · · Score: 3
    What brilliant software designer thought that it was a good idea for MSN Messenger NOT to store the buddy lists locally?

    So you could access it from another computer which you were validated on. Of course - it seems to me that using a local CACHE of the data would be a brilliant idea - if you change your list offline, the local copy changes and the server gets updated later, etc. If the server was down, you could still send stuff to folks peer to peer if you had a local copy stored - 'course with ICQ you'd be stuck offline if the servers died...
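
    The local-cache idea above can be sketched roughly as follows. This is only an illustration, not how MSN Messenger or ICQ actually work: the class name BuddyListCache, the fetch_from_server callable and the JSON file format are all assumptions made for the example.

```python
import json
import os
import tempfile


class BuddyListCache:
    """Keeps a local copy of a server-side buddy list (hypothetical helper)."""

    def __init__(self, cache_path, fetch_from_server):
        self.cache_path = cache_path
        # fetch_from_server is any callable returning a list of names.
        # It is assumed to raise OSError when the service is unreachable.
        self.fetch_from_server = fetch_from_server

    def _save(self, buddies):
        # Write to a temp file, then rename, so a crash mid-write
        # never leaves a truncated cache behind.
        directory = os.path.dirname(os.path.abspath(self.cache_path))
        fd, tmp_path = tempfile.mkstemp(dir=directory)
        with os.fdopen(fd, "w") as handle:
            json.dump(buddies, handle)
        os.replace(tmp_path, self.cache_path)

    def load(self):
        """Return (buddies, source), where source is 'server' or 'cache'."""
        try:
            buddies = list(self.fetch_from_server())
        except OSError:
            # Server is down: fall back to the last copy we saved, if any.
            if not os.path.exists(self.cache_path):
                return [], "cache"
            with open(self.cache_path) as handle:
                return json.load(handle), "cache"
        # Server answered: refresh the local copy for next time.
        self._save(buddies)
        return buddies, "server"
```

    With something like this on the client, a server outage would cost you presence and delivery, but not the list itself.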

  12. So it's not such a big deal huh? by pjgunst · · Score: 3

    Through HailStorm, Microsoft hopes to deliver e-commerce services, address books online, and password management to disparate devices such as PCs, handhelds or cell phones.
    "No, your PC won't be useable 'till next week. We're sorry about that. No, you can't use your cell phone either. We have a minor I/O problem."
    It's such a nice, comforting feeling knowing everything is taken care of, and in good hands. The future's bright. Where do you want to go today?
    OK, it could happen to any negligent sysadmin (uhm, count me in). I don't have any problem with the way Microsoft runs its business (OK, maybe I do, on moral grounds). What I do have a problem with is any kind of centralized information center. Data cannot be stored safely in one location, on one system, prone to failure. I'm sure even a complete idiot would NOT have overlooked this. And please, let's not even think about the consequences of one company keeping records of 98% of the desktop users. Fortunately, we do have a choice. Would be a shame to waste, considering the alternative.
    I admit, my opinion is biased. So is yours.

  13. don't worry by Anonymous Coward · · Score: 4

    No need to fear, by the time .NET is up all of Microsoft's servers will be running FreeBSD..

  14. I Use ICQ 80% For Work. by citizenc · · Score: 4
    (if someone actually has an important, legit reason for using a messenger service, please correct me...).
    Grr. Igornace ;-) I work for GameSpy Industries. (Specifically, I run 3DActionPlanet.) Now, the offices are in Irvine, California. (By contrast, I live in Winnipeg, MB, Canada.) Now, one of the benefits of running a website for a living is that it can be done remotely. I have anybody who I would need to talk to about work on my ICQ list -- if I need to ask a question about policy, I ICQ the big boss and get a response fast. Looking for somebody to write a preview of a game? I just look on my '3DAP Writers' group and see who's online.

    So, there ARE legitimate, work-related uses for instant messenger software. =)

    ---
  15. No local storage? by jvmatthe · · Score: 4
    I'm a bit shocked that the MSN IM service doesn't somehow have a way to store and retrieve user data on the client end for exactly this kind of situation. Heck, just do the usual thing and pile it all into the registry...that seems to be where everything else important ends up. If such a feature existed (caveat: and worked) in the MSN IM client then all those users that lost data would simply have the service send a message to the client on next connection to use the last backup and POOF! no lost buddy lists.

    I realize I'm just a lowly mathematician and all, but doesn't this seem reasonable, even for people that design real-life applications?

  16. Internet Crashes for all WinXP users at once... by SubtleNuance · · Score: 4

    lasts seven days with no valid explanation really starts to make you think about .Net

    At least it is not my family's 7 years of financial data, or the copies of my child's baby pictures - or the presentation that I needed for a job interview. We don't have to tell MS that distributed resources increase fault tolerance. When you devise a massive system with a single point of failure (M$.Net) you are going to burn - and burn big-time. If .NET services were distributed across many 'equal' computers (think of the internet as it is structured today), then we could withstand the loss of one machine. In the M$ vision of the future, many, many services and machines rely on their .NET systems. Imagine if TCP/IP had to 'ping' authorize.big-toll-gate.com's 'license' server in order to start - now imagine it goes down....

    This may not be a surprise to anyone on /., but what happens when Passport crashes for a week and no one is able to pay bills, or maybe the Office.NET file storage site burns down and takes millions of people's family photos with it (yes, I know about off-site backups)?

    The point is simple - you cannot build a reliable system with such a glaring single-point-of-failure. Downtime happens - and as this MSMessenger event shows us - .NET has serious potential for peril. I hope all the PHBs and DoJ are paying attention...
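
    To put rough numbers on the single-point-of-failure argument: a seven-day outage in a year works out to about 98% availability (168 of 8760 hours down). The sketch below is back-of-envelope arithmetic only. The function names are made up for illustration, and it assumes components fail independently, which real outages often do not.

```python
import math


def serial_availability(parts):
    """Availability when EVERY part must be up (a chain of dependencies)."""
    return math.prod(parts)


def parallel_availability(single, replicas):
    """Availability when ANY one of several equal replicas is enough."""
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    return 1.0 - (1.0 - single) ** replicas


def downtime_hours_per_year(availability, hours_per_year=8760):
    """Expected hours of downtime per year at a given availability."""
    return (1.0 - availability) * hours_per_year


# An application that is up 99.9% of the time on its own, but which also
# needs a central login service that is up 98% of the time:
chained = serial_availability([0.999, 0.98])

# The same 98% login service, run as three independent replicas:
replicated = parallel_availability(0.98, 3)
```

    Under these assumptions the chained case can never be better than its weakest dependency, while each extra independent replica multiplies the chance of a total outage by the failure rate of one machine.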

  17. Ooops... by AlphaOne · · Score: 4

    I think the real cause was something like...

    Error:
    MsgrSvr.exe caused an invalid page fault in module KERNEL32.DLL at 015f:bff9dba7.
    --
    All opinions presented here aren't mine.
  18. Re:In related news... by baptiste · · Score: 4
    Most people would probably spend most of their day hitting "Reload" to see if it's finally up yet.

    This crowd? Nah - we all wrote scripts that sent us email alerts to our cellphones when slashdot came back up and we could finally find 'CowboyNeal' somewhere in the HTML source :)
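
    A script of the kind described might look something like the sketch below. The function names are invented for the example, and the alert itself is left as a caller-supplied notify function, since how you get a message to your cellphone depends on your own mail setup.

```python
import time
import urllib.request


def site_is_back(html, marker="CowboyNeal"):
    """True if the marker text appears in the fetched page."""
    return marker in html


def fetch(url, timeout=10):
    """Return the page body as text, or an empty string on connection errors."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read().decode("utf-8", errors="replace")
    except OSError:
        # URLError, HTTPError and socket timeouts are all OSError subclasses.
        return ""


def wait_until_back(url, notify, fetch_page=fetch, interval=60, sleep=time.sleep):
    """Poll url until the marker shows up, then call notify(url) once.

    Returns the number of fetches it took.
    """
    attempts = 1
    while not site_is_back(fetch_page(url)):
        sleep(interval)
        attempts += 1
    notify(url)
    return attempts
```

    Passing fetch_page and sleep in as parameters keeps the polling loop testable without touching the network or actually waiting.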

  19. Um, its a messenger service folks!? by Marcus+Brody · · Score: 4
    Let's face it, is MSN messenger really an essential service? Do you pay your bills with it? Inform friends & family of a bereavement? Tell your boss you will be late for work?

    No, it is virtually always used for leisure: Pretending to do work whilst actually swapping sweet-little-nothings with Jane in accounts, or arranging a Q3 duel with DukeQuakem. (if someone actually has an important, legit reason for using a messenger service, please correct me...).

    Basically, if you can't use MSN messenger, you can use email, or pick up the phone. I'm sure, when MSN messenger breaks down, it's not at the top of MS's list of priorities.

    Perhaps, er, they had better things to do? Or perhaps it got lost at the bottom of someone's in-draw?

    However, it probably wasn't a good idea for MS to leave it so long. So many bloody people use it, that it does send out a helluvalot of bad publicity (I'm not going to get that date with Jane this weekend and it is ALL Micro$oft's fault!! Bah!). However, I think if an important component of .NET were to fail, and adversely affect many critical services, MS might react a little quicker, with greater resources & assurance.

  20. I'd like to believe you, but ... by daviddennis · · Score: 5
    why, then, did VeriSign basically turn over its entire authentication process to Microsoft and start deploying Windows 2000 servers in its core business?

    See this report from The Register for the grisly details.

    I suppose you could say this is because VeriSign and Network Solutions are insane, deranged companies, and there is most likely truth to this. But I'm not convinced; I HAVE TO deal with these idiots for my domain names, and now I have to rely on .NET to do it. Ick.

    D
    ----

  21. Possible causes? by powerlord · · Score: 5

    Ya know there are two possible causes from the minute information they've released (It was caused by a freak failure when a hard disk controller crashed).

    1) Caused by a freak failure when a hard disk controller crashed.
    2) They've said they have to restore from backups.

    If both are true, then it sounds like they were using a distributed database (or filesystem?) and one machine going down very badly managed to infect lots of others... doesn't bode well, especially when MS's solution to competing in the Server environment is traditionally to Cluster lots of machines together. The more you have the more chance one may have problems.

    If the first statement is false, then the only thing I can think of is that the system was infected by either an outside source, or some other malicious virus. Standard Operating Procedure in this case would be to disconnect the machines, diagnose the problem (so new machines wouldn't be infected), and then restore from backup. It's also possible someone over-reacted and they went into this mode when in actuality Item 1 was true.

    Anybody else think we're hearing the whole story?

    --
    This space for rent. All reasonable inquiries will be entertained at proprietor's discretion.
  22. I'm often reminded... by Dr.Evil · · Score: 5

    when these kind of outages happen, of Peter Deutsch's 8 Fallacies of Distributed Computing:

    1. The network is reliable
    2. Latency is zero
    3. Bandwidth is infinite
    4. The network is secure
    5. Topology doesn't change
    6. There is one administrator
    7. Transport cost is zero
    8. The network is homogeneous

    This is, of course, why the idea of remote authentication being necessary to use your word processor is a bad thing. Heck, even losing something as innocuous as an instant messaging program brought thousands of people to a screeching halt for a week. It seems to me that Microsoft (although they're certainly not the only ones) seems to believe these 8 fallacies blindly, especially 1, 4, and (they're hoping) 6.

    --
    Right...
  23. Absolutely no excuse for this. by Pinball+Wizard · · Score: 5
    ...permanently lost data

    I am utterly amazed at times the things I hear about how system administration is performed at MS. Ever check their jobs page? They're really picky about who they hire, you know.

    Yet we repeatedly hear about security problems with their own servers, how all their DNS servers were on the same network segment, Hotmail goes down and now this? Lost data??!!!

    I'm sorry, but as a former full-time sysadmin, there is absolutely no excuse for losing data. Preserving your company's data is the #1 priority of any sysadmin, regardless of the company. And preserving data with 100% certainty is achievable by anyone who takes the time to set things up right.

    Oh well, I was never a fan of their passport/hailstorm idea anyway. Things like this can only cause more people to run away from using those services.

    --

    No, Thursday's out. How about never - is never good for you?

  24. Feel My Pain! by TOTKChief · · Score: 5
    Not only has Microsoft been struggling to restore full service, but on Thursday the company also shut down MSN Messenger as it restarted the network of servers that handle messaging traffic. That "reboot" failed to immediately fix the problem. [Emphasis added.]

    Bet they know how I feel at work every day now...

  25. Letting others handle your data. by Traicovn · · Score: 5

    Think about Hobbes's social contract.
    'People give up certain rights and freedoms for a feeling of safety etc.'

    This is the same sort of situation, kinda. People give up having their own servers for communications and data storage in technologies like .NET. It is the company's responsibility to give us fair service, and tell us what's going on.

    If we do not like what's going on, it is our right and responsibility to seek alternatives.

    You're always going to risk loss of data and loss of service if you let someone else handle your data, communications, authorization, etc. It's a risk that you take. You hope that the company is able to do a good job and maintain good service. Remember, if you start using .NET and using all of the authorization features to access Microsoft's sites that require Passport/Messenger, just like in Hobbes's social contract you are giving up some rights and some control. You're taking a risk. But remember, there are other choices.

    --

    [Something witty and intelligent should have appeared here.]
    {Traicovn}
  26. Post Mortem Summary (aka Wishful Thinking) by Fatal0E · · Score: 5

    As a (curious) sysadmin I wouldn't mind reading a post mortem like what the /. crew did a few weeks ago. I think MS is missing out on a lot of brownie points by not publishing a blow-by-blow summary of how an enterprise goes about troubleshooting/fixing a system like that. It would be possible to do something like that w/o disclosing sensitive information. Like I said, wishful thinking.

  27. In related news... by ryanvm · · Score: 5
    In related news:
    Recent surveys show that employees that use Microsoft's popular Instant Messenger software are having one of the most productive weeks in recent years.

    Now if only Slashdot would have a week-long outage, I could get some work done.

  28. You didn't get it? by fiber_halo · · Score: 5

    We sent out an instant message to all the users letting them know about the outage.

  29. People will still use .NET in droves by UberOogie · · Score: 5
    Remember when AOL had huge outages several years ago?

    Remember when users couldn't get through because there were busy signals all the time?

    Remember how people said that there was going to be a mass exodus from AOL?

    Remember how that didn't happen?

    No matter how badly MS screws this incident up, no matter how many judgements get made against them, the average business drone and Joe User will still end up using .NET.

    --
    "Enough of this wretched, whining monkey life." -- Marcus Aurelius, _Meditations_, Book 9, 37