Slashdot Mirror


Microsoft's Azure Cloud Suffers Major Downtime

New submitter dcraid writes with a quote from El Reg: "Microsoft's cloudy platform, Windows Azure, is experiencing a major outage: at the time of writing, its service management system had been down for about seven hours worldwide. A customer described the problem to The Register as an 'admin nightmare' and said they couldn't understand how such an important system could go down. 'This should never happen,' said our source. 'The system should be redundant and outages should be confined to some data centres only.'" The Azure service dashboard has regular updates on the situation. According to their update feed the situation should have been resolved a few hours ago but has instead gotten worse: "We continue to work through the issues that are blocking the restoration of service management for some customers in North Central US, South Central US and North Europe sub-regions. Further updates will be published to keep you apprised of the situation. We apologize for any inconvenience this causes our customers." To be fair, other cloud providers have had similar issues before.

34 of 210 comments

  1. But Remember - by Ralph Spoilsport · · Score: 5, Insightful
    Your data's safe in the Cloud.

    Until it isn't.

    --
    Shoes for Industry. Shoes for the Dead.
    1. Re:But Remember - by Anonymous Coward · · Score: 5, Funny

      It's very safe though - just so safe no one can get access to it! :)

    2. Re:But Remember - by tripleevenfall · · Score: 4, Funny

      Nonsense, Microsoft is the name you can trust for security.

    3. Re:But Remember - by masternerdguy · · Score: 5, Insightful

      Also remember the cloud is just the 21st century spin on the dumb terminal-mainframe model.

      --
      To offset political mods, replace Flamebait with Insightful.
    4. Re:But Remember - by Barsteward · · Score: 4, Insightful

      Stop talking sense, it's no use here on /.

      --
      "The hands that help are better far than lips that pray." - Robert Ingersoll (1833-1899)
    5. Re:But Remember - by poetmatt · · Score: 3, Interesting

      When you rely on a 3rd party for cloud storage and that 3rd party has a basically nonexistent SLA for outages under 30 days, it becomes your own fault for making a horrible business decision.

      When you take a 3rd party cloud storage solution and implement it yourself for your enterprise, guess what? It works. And if there are issues, you know who's to blame.
      https://spideroak.com/diy/ - this is but one example of many.

    6. Re:But Remember - by icebraining · · Score: 4, Insightful

      Except those dumb terminals were, well, dumb, while nowadays the "terminals" are essentially the same as the "mainframe" but slower. So you can have hybrid configurations where a dedicated machine handles the base load and spins up remote resources on demand to handle peaks. If those resources are unavailable, the dedicated machine can still do the job, just with some performance degradation.

      A good example would be a script on your laptop that started an EC2 instance running distcc to reduce your compilation time from hours to minutes. If the instance can't be loaded, you could still compile, it just takes more time.
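The fallback idea in the comment above can be sketched roughly as follows. This is only an illustration, not anyone's actual script: `start_remote_helper` is a hypothetical callable standing in for whatever code would launch the EC2 instance and return its address (or `None` on failure), and the job counts are arbitrary assumptions. The only real tool knowledge it relies on is that distcc reads its host list from the `DISTCC_HOSTS` environment variable.

```python
from typing import Callable, List, Optional


def choose_build_command(
    start_remote_helper: Callable[[], Optional[str]],  # hypothetical launcher
    local_jobs: int = 2,    # assumed core count of the laptop
    remote_jobs: int = 16,  # assumed parallelism once the helper is up
) -> List[str]:
    """Return the build command, degrading to a local build if the cloud is down."""
    try:
        host = start_remote_helper()
    except Exception:
        # Any failure to reach the provider is treated the same as "no helper".
        host = None

    if host is None:
        # Slow path: compile locally; the job still gets done.
        return ["make", f"-j{local_jobs}"]

    # Fast path: let distcc spread compilation over localhost and the helper.
    return [
        "env",
        f"DISTCC_HOSTS=localhost {host}",
        "make",
        f"-j{remote_jobs}",
        "CC=distcc gcc",
    ]
```

The returned list could be handed to `subprocess.run`. The point of the design is that the remote resource only ever speeds things up; nothing in the build depends on it being there.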

    7. Re:But Remember - by dave420 · · Score: 4, Insightful

      Except this time you can add as many mainframes as you want, dynamically. And access them over the internet. And serve content to millions of people over said internet. That wasn't possible with this clichéd "mainframes!!!!!1" nonsense. Yes, you are using a remote computer. That's the only similarity. The current terminals are far from dumb, and the server being connected to is vastly different to the mainframes of old.

    8. Re:But Remember - by Surt · · Score: 3, Insightful

      If only even a single cloud service were actually built this way, it'd be great!

      --
      "Who is the Journal of Quantum Physics going to believe?" --Stephen Hawking
    9. Re:But Remember - by hawguy · · Score: 4, Interesting

      Except this time you can add as many mainframes as you want, dynamically. And access them over the internet. And serve content to millions of people over said internet. That wasn't possible with this clichéd "mainframes!!!!!1" nonsense. Yes, you are using a remote computer. That's the only similarity. The current terminals are far from dumb, and the server being connected to is vastly different to the mainframes of old.

      I wonder how old you are? The current "Web 2.0" paradigm reminds me very much of the old 3270 style mainframe environment.

      The 3270 terminal (well, the controller) was not exactly "dumb" - it had some base level of intelligence, it knew how to display forms, it could do input validation, etc but it didn't really do much with the data beyond sending it up to the mainframe. The mainframe on the backend took the data and actually did something with it. This is pretty much exactly how "Web 2.0" works, except instead of a 3270 terminal communicating to the mainframe over SNA, you have web browsers calling back to the web server over HTTP using Javascript.

      Yes, both the endpoints and servers have become more capable, but there are still many similarities to the old style model.

    10. Re:But Remember - by Dog-Cow · · Score: 4, Funny

      So they can have virtually unlimited "read only" downtime as long as they turn write back on every

      Let me guess. Switched you to read-only right in the middle.

  2. Eggs? by OzPeter · · Score: 4, Insightful

    Basket?

    Or how about "Never outsource your core functionality"?

    --
    I am Slashdot. Are you Slashdot as well?
    1. Re:Eggs? by Sir_Sri · · Score: 4, Insightful

      Ah, so there's the question. How much would it cost for you to run a system with 'no' downtime? I'm at a university, some of our labs (not so much in comp sci but generally) have fairly specific requirements about say not losing power, because it would damage/destroy equipment or running experiments.

      But IT is more than just power. In almost 4 years here every year we've had several days of downtime for our main undergraduate server (the one undergrads are supposed to use for various things, and that handles their logins and file storage), and several on the separate but arguably more important staff server, which is supposed to do the same thing, but that includes all of our grant applications.

      Causes of our server outages (I'm not an IT guy, this is just what they've told us that I can remember):

      Power failures: Yes, we have battery backups, but they're only good for so long, and since none of our equipment suffers permanent damage without power this isn't high priority.

      Networking: We only have two redundant pipes. That, for home use for example, or for most businesses, is pretty good. Of our pipes, one goes to a host to the west, one to the east. I'm not specifically familiar with what failed that took our networking offline for 7 or 8 hours, but it affected both pipes.

      Storage: Bad RAID controller on the main fileserver. This has a few cascading effects. If you don't realize it's garbling data, it ends up distributing that garble off to the backups or clones. When it crashes (which doesn't take that long after the controller starts getting messy) you may have several backups that need to be repaired. We can't do much to the file system while it's being repaired or rebuilt (which, afaik, you should be able to do on most professional grade setups, but for whatever reason our Linux guys can't get it to behave).

      Added fun: When the system comes back up, if you tried to access your e-mail while the file system was garbled you probably still can't. And you get no error message about it. It just spits back nothing, as though you have no new mail. The system is 'up' but doesn't work and you have to go into your directory and delete some files that most people have never heard of. It's not hard to do, but because you have no idea that there's a problem, the less technically inclined (or just ESL) people in a building full of computer scientists don't always fix it immediately. The net effect is that if the storage controller gets messed up, we're down for 3 or 4 days if not longer.

      And that's just one university department. We have a relatively decent amount of money, and several full time staff for these things. But we probably can't match any cloud service's uptime, even with 7 or 8 hours of downtime regularly, not even close. It's not a trivial calculation; even a 50 or 60 employee outfit will probably have trouble matching Amazon or Azure uptime with a full time IT guy. There's probably a crossover point where you have enough employees to support big enterprise IT infrastructure and manpower, but only support it badly (there's not enough money for proper replication or whatever), and then eventually you get big enough that you just run everything in house anyway because there's definitely no cost advantage to hiring someone. For us, I think we have 5 or 6 IT staff; if we could toss 3 of them, plus all of their equipment, you're looking at somewhere around 350, 400k/year to spend on a support contract. I'm guessing, but don't know, that you can get a cloud service for ~20 TB of reasonably reliable file and e-mail storage for less than 350k/year from these guys.

      The big place I see people using cloud services right now (as a sort of flavour of the month) is as an augment for burst capacity needs. That's a whole other analysis.

  3. Re:Gloat gloat gloat. by gral · · Score: 4, Insightful

    The companies I deal with tend to say things like, we want to go with a company like this so we can get "Support". Which usually means, so we can blame them if something goes wrong.

    --
    Scott Carr
  4. 2/29/2012 by MacBrave · · Score: 5, Interesting

    Leap year strikes again?

    1. Re:2/29/2012 by the_other_chewey · · Score: 5, Informative

      From the service dashboard:

      "4:00 AM UTC We have identified the root cause of this incident. It has been traced back to a cert issue triggered on 2/29/2012 GMT."

      So yeah, a leap day bug sounds probable.
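The dashboard quote does not say what the actual bug was, so the following is only an illustration of one classic leap-day failure mode, not a reconstruction of Azure's code: computing "the same date next year" (for example, an expiry date one year out) by bumping the year field. The function names are made up for the example.

```python
from datetime import date


def naive_one_year_later(d: date) -> date:
    """Bump the year field. Raises ValueError for 29 February,
    because the following year has no such day."""
    return d.replace(year=d.year + 1)


def safe_one_year_later(d: date) -> date:
    """Same idea, but fall back to 28 February when the target
    year is not a leap year."""
    try:
        return d.replace(year=d.year + 1)
    except ValueError:
        # Only reachable when d is 29 February.
        return d.replace(year=d.year + 1, day=28)
```

On any other day of the year both functions agree, which is why a bug of this shape can sit unnoticed for four years at a time.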

  5. To quote the lady in the commercial... by Pollux · · Score: 4, Funny

    Yay, cloud!

  6. Re:Cloud services not ready by characterZer0 · · Score: 3, Insightful

    Cluster at the application level and have nodes at different providers. If your volume is too high for that, you are big enough to host your own stuff.

    --
    Go green: turn off your refrigerator.
  7. Now they're slashdotted, too... by Sqr(twg) · · Score: 4, Funny

    This is not helping, guys!

  8. Re:So merely days after announcing the G-Cloud... by Chris Mattern · · Score: 4, Funny

    The hilarious part of this link is that the article detailing how screwed people are for depending on Microsoft's cloud services is stuffed with rollover ads for...Microsoft's cloud services!

  9. last time by phantomfive · · Score: 5, Informative

    Last time a Microsoft cloud product went down, users sustained real data loss. Of course, Microsoft claimed it couldn't happen with Azure.

    --
    "First they came for the slanderers and i said nothing."
  10. Re:Cloud services not ready by timeOday · · Score: 3, Informative

    I agree, I have nothing against the idea of cloud services, but they do need to work and reputations are based on events like this. After an outage this long, it takes a LOOONG time to earn your way back to five nines (which works out to 5.5 minutes of downtime per year).
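For anyone who wants to check the five-nines arithmetic, here is a small worked sketch. It assumes a 365-day year and the roughly seven-hour outage mentioned in the story summary; the function names are invented for the example. With those assumptions the yearly budget comes out a little under the 5.5 minutes quoted above.

```python
def downtime_budget_minutes(nines: int, days_per_year: int = 365) -> float:
    """Minutes of downtime per year allowed by an availability of N nines
    (5 nines = 99.999%)."""
    minutes_per_year = days_per_year * 24 * 60
    return minutes_per_year / 10 ** nines


def years_to_absorb(outage_minutes: float, nines: int = 5) -> float:
    """How many otherwise perfect years it takes for one outage to fit
    inside the long-run downtime budget."""
    return outage_minutes / downtime_budget_minutes(nines)


budget = downtime_budget_minutes(5)   # about 5.26 minutes per year
years = years_to_absorb(7 * 60)       # about 80 years for a 7-hour outage
print(f"{budget:.2f} min/year, {years:.0f} years to absorb 7 hours")
```

Every extra hour of outage adds roughly another eleven years at this budget, which is why long incidents are so hard to average away.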

  11. Feature Suggestion! by fuzzyfuzzyfungus · · Score: 5, Funny

    Since the image that "Azure" and "Cloud" conjures up is more "sky" than "cloud" it would be my suggestion that Microsoft simply register chickenlit.tl and set up an Azure service status monitor/report page there.

    They could have an adorable cartoon chicken that, when the system is working normally, runs around scratching and pecking (speed dependent on load). When downtime occurs, it would begin squawking about how the sky is falling. What could make failure more endearing?

    Just to add that Microsoft touch, they could do the entire thing as a Microsoft Agent ActiveX control!

  12. To the cloud! by Howard Beale · · Score: 4, Funny

    Well...maybe not right now...

  13. Re:Cloud services not ready by vlm · · Score: 3, Insightful

    After an outage this long, it takes a LOOONG time to earn your way back to five nines (which works out to 5.5 minutes of downtime per year).

    Only 84 years per the article, and growing at a rate of a year every 5 minutes.

    That's probably about how long it would take me to trust MS in an enterprise environment.

    --
    "Science flies us to the moon. Religion flies us into buildings." - Victor Stenger
  14. Ah, the cloud... by ErichTheRed · · Score: 4, Insightful

    It's funny how those of us who bring up issues of data security and service resiliency are dismissed as just trying to protect our jobs.

    Like so many other things, the actual technical underpinnings of "the cloud" are great, and have been standard fare for years. Virtual machines + flexible networking are a godsend for systems guys tasked with getting capacity for a new project up and going yesterday. I love being able to build and rip down entire test environments just to try something out...that used to mean a rack of physical servers, switchgear, etc. tied up while it was being used. That's why everyone's slowly coming around to the "private/hybrid cloud" model, which is really just code for "VMs + network capacity + something to tie it all together + maybe some external hosting".

    The problem is that "the cloud" is very badly misunderstood. As soon as a CIO sees "virtual, on-demand capacity without those pesky physical on-site machines and IT staff, for a fixed cost per compute-hour" everything else takes a back seat. Then, it's "why do we need IT staff on-site, everything's being taken care of in the cloud." Public clouds like Amazon or Azure are great for startups who can't really afford their own data centers, or even bigger businesses to offload some of the nonessential stuff. When you start looking at hosting everything though, the marketing hype of the cloud sometimes distracts people from realities that they have to contend with.

    Also, I'm not saying that businesses who go the private cloud or traditional hosting/outsourcing route won't have downtime -- they will. However, having onsite staff and infrastructure means you can work those staff until they fix the problem, and you have control over them. Most sane outsourcing contracts have SLAs in them stating that the vendor will expend X amount of effort to fix your problems. Cloud provider agreements, unless specifically mentioned otherwise, are "as is, where is, best effort restoration with no warranty." OK, maybe some providers will give you an SLA, but all that does is buy you free service at a later date if they violate it...it doesn't bring your application back online. You still have no choice but to sit and wait around for the provider to fix whatever's wrong...just ask Amazon EC2 customers about what happened during their last outage...

    Companies need to draw sane boundaries around hosted systems, and decide what is critical and what can be offloaded. Do I care about a set of development/test machines that get used once a month? Probably a lot less than the critical database/application servers that run my core business. Comfort level, cost per minute of downtime vs. cost of dedicated resources and other factors need to be carefully considered before jumping into the cloud with both feet.

  15. Re:Cloud services not ready by hawguy · · Score: 4, Insightful

    One of the selling points of using cloud services was that it would be more reliable than managing your own hardware/software. But to date, every single big player has suffered major downtime. I would be hesitant to believe the sales pitch.

    But still, for most companies that are good candidates for cloud offerings, even 8 hours of downtime once a year is probably better than they can guarantee using their own infrastructure. Companies in this range tend to not have redundant servers, offsite backups, disaster recovery sites, etc. Larger companies that can build redundant infrastructure (and staff it properly) are probably better off staying away from the cloud since they can guarantee any level of uptime and redundancy they want to pay for.

    Of course, when a small company Admin spills a cup of coffee in the Exchange server and they are down for 5 days while building a replacement server, it doesn't make the news so you never hear about it...while when a large cloud provider has a 2 hour outage, it's all over the news.

  16. Advice by DickBreath · · Score: 4, Informative

    Use the MCSE mantra:
    1. Perform virus scan.
    2. If that doesn't work, find a different program that will display a reassuring green graphic.
    3. If that doesn't work, reboot.
    4. If that doesn't work, reformat, reinstall.
    5. If that doesn't work, GOTO 1.

    Microsoft wouldn't know anything about running a data center if it were chasing after them at full speed.

    Google this: "Microsoft Sidekick / Danger"

    http://techcrunch.com/2009/10/10/t-mobile-sidekick-disaster-microsofts-servers-crashed-and-they-dont-have-a-backup/

    https://www.pcworld.com/article/173470/microsoft_redfaced_after_massive_sidekick_data_loss.html

    http://www.appleinsider.com/articles/09/10/11/microsofts_danger_sidekick_data_loss_casts_dark_on_cloud_computing.html

    --

    I'll see your senator, and I'll raise you two judges.
  17. Great uptime! by gmuslera · · Score: 5, Funny

    Put your servers in the Azure cloud to have an uptime of 9.999999999%

  18. Cloud ain't so bad by Martz · · Score: 5, Insightful

    I wrote a comment on slashdot a while back which questioned the sensibleness of running services in the cloud. I used to be a sceptic.

    Since then I've used Rackspace Cloud and found that it's actually a very good idea, for certain things.

    The benefits of using a cloud system are scalability and no commitment- it's not about reliability or higher availability - but you do get a little win in those areas.

    To give some examples, I was recently able to play around with MySQL clustering. I followed a MySQL clustering howto and played around with it, and set up a MySQL cluster with load balancers. Once I was finished geeking about, I saved the VMs to the file storage and deleted the cloud instances. Total cost: £/$2-3 maximum. I hadn't previously been able to do this; I would have had to rent a dedicated server which would serve websites, email etc. I couldn't really use the dedicated server to play with new technology in case it had a negative impact on the live systems. I did have a development box for a while, but it essentially doubled my costs without making any more money, just offering some protection.

    Now I have staging/development instances in the cloud - and no commitments to them - I don't have to worry about a £250 monthly bill or sign a 12 month contract to get my own box. I can fire up some resources, use them, and throw it away when I'm done.

    The upshot is that I can play around with other people's cool open source software without risk of buggering something up on my live box, and the costs are insignificant since I'm only renting it per hour. I can try something new, if it works great - it might go/stay in production. If not, delete it and move onto the next cool thing.

    If I need high availability, I would use Rackspace, Amazon, Azure, and I'd ensure that I have a plan to deal with a major outage with any of the providers. Each has APIs, so in theory I could create new instances automagically and failover between different cloud providers with a quick DNS change, while keeping costs low.
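The multi-provider plan above boils down to "walk a priority list and send traffic to the first provider that answers". A minimal sketch of that selection step, under stated assumptions: `is_healthy` is a hypothetical callable (it might probe a status URL or call a provider API), the provider names are just labels, and the actual DNS update is left out because it depends entirely on the DNS host.

```python
from typing import Callable, Optional, Sequence


def pick_active_provider(
    providers: Sequence[str],            # in priority order
    is_healthy: Callable[[str], bool],   # hypothetical health probe
) -> Optional[str]:
    """Return the first provider that passes its health check, or None
    if every provider is down."""
    for name in providers:
        try:
            if is_healthy(name):
                return name
        except Exception:
            # A probe that blows up counts as an unhealthy provider.
            continue
    return None
```

A caller would then point DNS at whatever this returns, for example `pick_active_provider(["azure", "rackspace", "amazon"], probe)`. Note that this only covers the decision; keeping data in sync between providers is the hard part and is not addressed here.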

    To recap, the cloud isn't all about high availability - no matter what the marketing says. It's about scaling systems and running resources for small amounts of time, and is perfectly suited to services which have peak demand (ticket sales for example).

  19. Re:Cloud services not ready by leonardluen · · Score: 4, Funny

    it's a leap year, they can be down a full day and still claim they were up for 365 days this year!

  20. Re:Cloud services not ready by UnknowingFool · · Score: 4, Informative

    You mean besides Amazon, SalesForce, VMWare, Google Gmail, Yahoo Mail, Apple iCloud? Seriously, who hasn't had downtime?

    --
    Well, there's spam egg sausage and spam, that's not got much spam in it.
  21. Office 365 by TheNinjaroach · · Score: 3, Funny

    I've always had to laugh at the name "Office 365" -- the fact this happened on Leap Day amuses me to no end.

    --
    I went to eat some animal crackers and the box said, "Do not eat if seal is broken." I opened the box and sure enough..
  22. Maybe it's *because* it's Leap Day? by zooblethorpe · · Score: 3, Insightful

    I've always had to laugh at the name "Office 365" -- the fact this happened on Leap Day amuses me to no end.

    In light of Excel's horribly buggy code for handling Leap Day, I have to wonder if Microsoft's problems here might not be because it's Leap Day? Whaddaya bet Azure comes back up all fine and dandy once the date rolls over to 1 March instead of 29 February? I'm actually serious about this conjecture, this is not just an attempt at humor.

    On a different angle, does anyone else find it amusingly ironic that this service is named Azure, and now it's blue-screened? They've only gone up one letter -- now it's the ASOD.

    Cheers,

    --
    "What in the name of Fats Waller is that?"
    "A four-foot prune."