Microsoft's Azure Cloud Suffers Major Downtime
New submitter dcraid writes with a quote from El Reg: "Microsoft's cloudy platform, Windows Azure, is experiencing a major outage: at the time of writing, its service management system had been down for about seven hours worldwide. A customer described the problem to The Register as an 'admin nightmare' and said they couldn't understand how such an important system could go down. 'This should never happen,' said our source. 'The system should be redundant and outages should be confined to some data centres only.'"
The Azure service dashboard has regular updates on the situation. According to their update feed the situation should have been resolved a few hours ago but has instead gotten worse: "We continue to work through the issues that are blocking the restoration of service management for some customers in North Central US, South Central US and North Europe sub-regions. Further updates will be published to keep you apprised of the situation. We apologize for any inconvenience this causes our customers." To be fair, other cloud providers have had similar issues before.
Until it isn't.
Shoes for Industry. Shoes for the Dead.
Basket?
Or how about "Never outsource your core functionality"?
I am Slashdot. Are you Slashdot as well?
One of the worst things about the cloud is that it can go wrong when someone else screws up, so you get the blame for their mistakes.
One of the selling points of using cloud services was that it would be more reliable than managing your own hardware/software. But to date, every single big player has suffered major downtime. I would be hesitant to believe the sales pitch.
Well, there's spam egg sausage and spam, that's not got much spam in it.
...the British Government's cloud service suffers the inevitable Microsoft kiss of death.
Leap year strikes again?
Never trust Microsoft. For anything. They can't even manage water vapor, for crying out loud.
Yay, cloud!
This is not helping, guys!
Wait, so Azure isn't down, just the admin functionality is? Who gives a crap. Man, I can't spin up a new VM for 8 hours, boo hoo. This isn't an admin nightmare. The VMs being down for 8 hours would absolutely be a nightmare, but the only admins this is a nightmare for are the poor guys working for MS trying to fix whatever the code monkeys screwed up =)
There are 4 boxes to use in the defense of liberty: soap, ballot, jury, ammo. Use in that order. Starting now.
Last time a Microsoft cloud product went down, users sustained real data loss. Of course, Microsoft claimed it couldn't happen with Azure.
"First they came for the slanderers and i said nothing."
like google does when something goes wrong. just explain how you're going to change things and why it happened and it will all be OK
At this point, the best way to keep their credibility from further deteriorating is to provide good reports on what is going on. E.g., not like PSN, more like Amazon. Currently that Azure dashboard doesn't even load for me... has it been slashdotted or something?
As an aside: whenever a cloud system goes down, people come out to rag on the reliability of the cloud. While I'm also annoyed by the marketing guys throwing around "just put it in the cloud!!" as much as anyone else, and agree some applications make no sense living in the cloud, I'd also like to point out that for some people, doing the admin work in-house results in the same amount or more headaches.
It seems like even the biggest guys can't make it work reliably, and presumably given the high profile of these services, they're not afraid to throw money and smart people at these problems.
Since the image that "Azure" and "Cloud" conjures up is more "sky" than "cloud" it would be my suggestion that Microsoft simply register chickenlit.tl and set up an Azure service status monitor/report page there.
They could have an adorable cartoon chicken that, when the system is working normally, runs around scratching and pecking (speed dependent on load). When downtime occurs, it would begin squawking about how the sky is falling. What could make failure more endearing?
Just to add that Microsoft touch, they could do the entire thing as a Microsoft Agent ActiveX control!
They thought it was ready.
Well...maybe not right now...
It's the Blue Cloud of Death!
London stock exchange also is using Linux...
When it rains, it poors...
The downsides of centralization and remote admins.
Sometimes you are better off with local admins and systems.
What is better: all your sites (or a big chunk of them) down, or just one?
Local admins, or centralization with remote admins who do not know each site's local software setup?
Yet another example of why the Cloud is not ready for production. No way I'm putting my eggs in there. Maybe development/testing, but never production.
It's funny how those of us who bring up issues of data security and service resiliency are dismissed as just trying to protect our jobs.
Like so many other things, the actual technical underpinnings of "the cloud" are great, and have been standard fare for years. Virtual machines + flexible networking are a godsend for systems guys tasked with getting capacity for a new project up and going yesterday. I love being able to build and rip down entire test environments just to try something out...that used to mean a rack of physical servers, switchgear, etc. tied up while it was being used. That's why everyone's slowly coming around to the "private/hybrid cloud" model, which is really just code for "VMs + network capacity + something to tie it all together + maybe some external hosting".
The problem is that "the cloud" is very badly misunderstood. As soon as a CIO sees "virtual, on-demand capacity without those pesky physical on-site machines and IT staff, for a fixed cost per compute-hour" everything else takes a back seat. Then, it's "why do we need IT staff on-site, everything's being taken care of in the cloud." Public clouds like Amazon or Azure are great for startups who can't really afford their own data centers, or even bigger businesses to offload some of the nonessential stuff. When you start looking at hosting everything though, the marketing hype of the cloud sometimes distracts people from realities that they have to contend with.
Also, I'm not saying that businesses who go the private cloud or traditional hosting/outsourcing route won't have downtime -- they will. However, having onsite staff and infrastructure means you can work those staff until they fix the problem, and you have control over them. Most sane outsourcing contracts have SLAs in them stating that the vendor will expend X amount of effort to fix your problems. Cloud provider agreements, unless specifically mentioned otherwise, are "as is, where is, best effort restoration with no warranty." OK, maybe some providers will give you an SLA, but all that does is buy you free service at a later date if they violate it...it doesn't bring your application back online. You still have no choice but to sit and wait around for the provider to fix whatever's wrong...just ask Amazon EC2 customers about what happened during their last outage...
Companies need to draw sane boundaries around hosted systems, and decide what is critical and what can be offloaded. Do I care about a set of development/test machines that get used once a month? Probably a lot less than the critical database/application servers that run my core business. Comfort level, cost per minute of downtime vs. cost of dedicated resources and other factors need to be carefully considered before jumping into the cloud with both feet.
Use the MCSE mantra:
1. Perform virus scan.
2. If that doesn't work, find a different program that will display a reassuring green graphic.
3. If that doesn't work, reboot.
4. If that doesn't work, reformat, reinstall.
5. If that doesn't work, GOTO 1.
Microsoft wouldn't know anything about data center running if it were chase aftering them at full speedo.
Google this: "Microsoft Sidekick / Danger"
http://techcrunch.com/2009/10/10/t-mobile-sidekick-disaster-microsofts-servers-crashed-and-they-dont-have-a-backup/
https://www.pcworld.com/article/173470/microsoft_redfaced_after_massive_sidekick_data_loss.html
http://www.appleinsider.com/articles/09/10/11/microsofts_danger_sidekick_data_loss_casts_dark_on_cloud_computing.html
I'll see your senator, and I'll raise you two judges.
if so, that's the breaks. If not, then there should be contractual SLAs and penalties involved.
---- Booth was a patriot ----
One more nail in the Cloud coffin.
"If any question why we died, Tell them because our fathers lied."
I had an outage on Salesforce for 1 week and they did absolutely nothing regarding giving me any free account time or anything except "Sorry".
Their explanation was that a massive multi-terabyte log file had to be processed, since the corruption they had extended to their backups.
Shouldn't ever happen.
This was last Autumn.
All boy scouts should take away this: Cloud promises are made to be broken.
Put your servers in the Azure cloud to have an uptime of 9.999999999%
This is not just an IT observation. The same thing happens with biodiversity (fewer species means greater risk that a key part of a food chain will collapse and take the entire chain with it), the economy (ever notice how failures are getting bigger as government steps in more to prevent failures?), and any other complex system. Once a system is too big for a single human mind — and specifically the one in charge of the system — to contain its complexity and understand its failure modes, failure becomes inevitable. The fewer people allowed to understand and make decisions about the system, the more catastrophic the failures when they occur. The more complex the system, the more likely it is for the failures to occur. Which is to say, any complex system is at increased risk of catastrophic failure as it grows in complexity and as it becomes more centralized. Combine the two, and you're just waiting for the disaster to happen.
-- Two men say they're Jesus. One of them must be wrong. - Dire Straits
I wrote a comment on slashdot a while back which questioned the sensibleness of running services in the cloud. I used to be a sceptic.
Since then I've used Rackspace Cloud and found that it's actually a very good idea, for certain things.
The benefits of using a cloud system are scalability and no commitment- it's not about reliability or higher availability - but you do get a little win in those areas.
To give some examples, I was recently able to play around with MySQL clustering. I followed a MySQL clustering howto and set up a MySQL cluster with load balancers. Once I was finished geeking about, I saved the VMs to the file storage and deleted the cloud instances. Total cost: £/$2-3 maximum. I hadn't previously been able to do this; I would have had to rent a dedicated server which would serve websites, email etc. I couldn't really use the dedicated server to play with new technology in case it had a negative impact on the live systems. I did have a development box for a while, but it essentially doubled my costs without making any more money, just offering some protection.
Now I have staging/development instances in the cloud - and no commitments to them - I don't have to worry about a £250 monthly bill or sign a 12 month contract to get my own box. I can fire up some resources, use them, and throw it away when I'm done.
The upshot is that I can play around with other people's cool open source software without risk of buggering something up on my live box, and the costs are insignificant since I'm only renting it per hour. I can try something new, if it works great - it might go/stay in production. If not, delete it and move onto the next cool thing.
If I need high availability, I would use Rackspace, Amazon, Azure, and I'd ensure that I have a plan to deal with a major outage with any of the providers. Each has APIs, so in theory I could create new instances automagically and failover between different cloud providers with a quick DNS change, while keeping costs low.
To recap, the cloud isn't all about high availability - no matter what the marketing says. It's about scaling systems and running resources for small amounts of time, and is perfectly suited to services which have peak demand (ticket sales for example).
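For what it's worth, the multi-provider failover idea above could be sketched roughly like this. It is only an illustration: `pick_active_provider`, the provider names as plain strings, and the `status` snapshot are all made up for the example. Real health probes, instance creation and the DNS change would go through each provider's own API, none of which is shown here.

```python
from typing import Callable, Optional, Sequence


def pick_active_provider(
    providers: Sequence[str],
    is_healthy: Callable[[str], bool],
) -> Optional[str]:
    """Return the first provider, in priority order, that passes its health check."""
    for name in providers:
        try:
            if is_healthy(name):
                return name
        except Exception:
            # A probe that blows up is treated the same as "provider down".
            continue
    return None  # every provider failed; time to page a human


# Hypothetical status snapshot standing in for real health probes.
status = {"rackspace": False, "amazon": True, "azure": True}

active = pick_active_provider(
    ["rackspace", "amazon", "azure"],
    lambda name: status[name],
)
print(active)  # the provider DNS should be pointed at
```

The health check is passed in as a function so the selection logic can be tested without touching any network, and so each provider can have its own probe.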
I wonder if this is what is causing the Daily Show to post a maintenance sign on login?
It's called Fog.
Nearly fifty percent of all graduates come from the bottom half of the class!
so, now that the Azure cloud is down and the news has hit Slashdot - the "service dashboard" has now been "slashdotted"
Network Error (tcp_error)
A communication error occurred: "Operation timed out"
The Web Server may be down, too busy, or experiencing other problems preventing it from responding to requests. You may wish to try again at a later time.
For assistance, please raise a ticket through the CSC Help Desk (E-mail: CSS_Internal_Help_Desk@csc.com), and provide the information on this page for Proxy: CSC-CHD-CDC-1
ya'll killin' me :-)
....a loop of death as cause for the outage! ^^
29 February and unexpected downtime hummm
I have no empathy for any company who relies solely on a single provider. It seems as though nothing is ever 100% reliable, and for those companies who rely on outside service providers, they need to understand that no external company will ever value the service as much as you do.
For a while now, I have contemplated the necessity for a data layer which provides replication and failover, between two ENTIRELY separate clouds (think Azure and AWS/EC2). I just keep waiting for someone to do the legwork of developing this (I'm distracted on other projects).
People that believe the cloud is not at risk of downtime are just stupid and deserve exactly what they get. The cloud not only has the normal risks any comparable infrastructure has, but also suffers from additional risks because of complex network connectivity, complex usage patterns and untried system administration patterns.
People that still think this now are not only stupid but unwilling to learn, as the Amazon outage last year clearly showed the risks. In addition, Amazon is very likely more competent than Microsoft at this by any sane metric.
Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
Apple's iCloud service is apparently using Azure (it's been reported many times), and since it's down, iCloud should in theory be down.
So how come no one's going full-on for page views by saying that Apple's iCloud service is down and then just mentioning in passing that it's because Microsoft's Azure is down?
Right now we just have people commenting about cloud services. Why not add in the Android and Apple fanboys as well to the discussion?
Page views, people!
I could fix this with a $35 payment to someone?
Do you have ESP?
I'll take care of my own data storage, thank you....
I run cloud servers that are 24/7/365 with uptimes of well over 1000 days. Microsoft servers to the best of my knowledge can not do that.
They can but this year there are 366 days..
Anyone with an interest in meteorology knows that weather can be unpredictable and unreliable.
No one *really* expects anything more from microsoft any more, do they???
Obama and his crew will want mandatory, no warrant backdoors. Prices will go up.
Is Comedy Central http://www.thedailyshow.com/ hosted on there?
To see what the heck is going on:
http://www.windowsazure.com/en-us/support/service-dashboard/
My thinking on clouds and downtime is that it's pushing failures to be less frequent, but much higher-impact when they happen. That is, instead of 1 hour down a year (and N man-hours of work lost), you get 20 hours down every decade (and 1000*N lost man-hours or somesuch). Which is bad because we (psychologically and economically) are truly terrible at evaluating or dealing with once-in-a-generation huge catastrophes -- yet we seem to be arranging more of them all the time.
Central planning and power relations and all that, I guess.
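To put rough numbers on the comparison above, here is a small worked example over one decade. The value of `N` is a placeholder picked purely for illustration, and `decade_totals` is just a helper name for this sketch; the outage counts and the 1000*N multiplier are the ones from the comment.

```python
def decade_totals(outages, hours_each, man_hours_lost_each):
    """Return (hours down, man-hours lost) summed over one decade."""
    return outages * hours_each, outages * man_hours_lost_each


N = 50  # hypothetical man-hours lost in one small, local outage

# In-house pattern: about 1 hour down per year, N man-hours lost each time.
frequent_small = decade_totals(outages=10, hours_each=1, man_hours_lost_each=N)

# Cloud pattern: one 20-hour outage per decade, 1000*N man-hours lost.
rare_big = decade_totals(outages=1, hours_each=20, man_hours_lost_each=1000 * N)

print("frequent/small:", frequent_small)
print("rare/big:      ", rare_big)
print("lost-work ratio:", rare_big[1] / frequent_small[1])
```

With these figures the hours of downtime only double (10 versus 20 per decade), but the lost work is 100 times larger (10*N versus 1000*N), which is the asymmetry the comment is pointing at.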
We know where leadership by an anti-intellectual "strongman" who scapegoats minorities and likes boisterous rallies goes
I've always had to laugh at the name "Office 365" -- the fact this happened on Leap Day amuses me to no end.
I went to eat some animal crackers and the box said, "Do not eat if seal is broken." I opened the box and sure enough..
When Windows was on your machine, you had blue screen of death.
Now, online, the microsoft cloud is gone so I guess you get a blue sky of death.
I've always had to laugh at the name "Office 365" -- the fact this happened on Leap Day amuses me to no end.
In light of Excel's horribly buggy code of handling Leap Day, I have to wonder if Microsoft's problems here might not be because it's Leap Day? Whaddaya bet Azure comes back up all fine and dandy once the date rolls over to 1 March instead of 29 February? I'm actually serious about this conjecture, this is not just an attempt at humor.
On a different angle, does anyone else find it amusingly ironic that this service is named Azure, and now it's blue-screened? They've only gone up one letter -- now it's the ASOD.
Cheers,
"What in the name of Fats Waller is that?"
"A four-foot prune."
Microsoft wouldn't know anything about data center running if it were chase aftering them at full speedo.
Well, THANK YOU so very much for putting that image in my head! All that brought to mind was Ballmer chasing people while wearing nothing but a Speedo.
Where's the brain bleach?
"What in the name of Fats Waller is that?"
"A four-foot prune."
Had to be critical updates to the cloud.
This outage shouldn't affect Slashdot readers, since everyone here will be aware of these two fundamental principles of IT:
1. Never trust Microsoft
2. Never trust cloud services with anything important
"Nine times out of ten, starting a fire is not the best way to solve the problem." - my wife
Microsoft has always been about the desktop and securing their market. The cloud takes things away from the desktop and their corporate clients.
step 1: identify cloud as potential enemy
step 2: create cloud of own
step 3: advertise so everyone joins cloud
step 4: randomly "lose the cloud" for hundreds/thousands/millions of cloud users - everyone gets scared of the "cloud" idea across the board, incl. other cloud service customers
step 5: PROFIT! sell more fully independent PCs and servers!
Lots of people couldn't access data or services that were hosted in Azure. The UK government for one, who had just migrated to Office 365 a few weeks ago. The Daily Show for another. Many others are reported. We're going to have news about these failures for weeks. Many millions of dollars were lost by services hosted in the Azure cloud that were down all day today for Leap Day - perhaps a billion dollars or more of sales. Everyone affected has a huge loss of face they're going to have to recover gradually over time, so the losses compound.
Azure service management is still down over 24 hours after onset of the incident (link in TFS). They'll stand it up again eventually but they haven't yet. This is bad but it's not "lost data" bad yet. It's not "Danger" bad yet. Over the next few days we'll find out what actual data loss occurred, what transactions in flight were lost, which hosted databases were munged. Most simple hosting customers will be unaffected and they will skew the results so that Microsoft can say "only a few customers had severe issues." Though the prime enterprise customers with 10,000-100,000 users were totally hosed because they were most active, they're only one customer each so their customer count doesn't count in the PR scheme.
The biggest loss is the loss of confidence. Azure hadn't failed this badly in public before and now it has. This is the failbar other services will have to get over to differentiate their service and some hosted cloud providers are now breathing a sigh of relief because this is a really low bar. "We haven't failed this bad yet!" will be their advertising slogan. When pressured for a competitive argument against Azure they're going to ask: "On Leap Day then what?" That's the five word closing argument for a whole lot of cloud services tomorrow.
Azure becomes the "Azure screen of death." Ironically, Azure is the color of a cloudless sky. Perhaps the name is prophetic.
Help stamp out iliturcy.
but do you run 24/7/366, that's the question here =P
insensitive clod overlords obligatory xkcd car analogy russian reversals whoosh pedant fanbois ftfy in 3...2...1..PROFIT
When Amazon had that outage it was just thought to be an outage, but it turned out data was lost.
Disruption is bad enough, but data loss is way worse, since people and businesses likely won't have their own backups, and loss of data, even a low percentage, can easily KILL a business.
Lose an order or a customer's records or a customer's data and you likely lose a customer and get bad reputation.
Lose business records and it might be impossible to exist.
Just because it CAN be done, doesn't mean it should!
If it's five 9's then that's gone!
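For anyone who wants the arithmetic behind that, here is a quick sketch. It assumes a 365-day year and takes the "about seven hours" figure from the story as the outage length. Note that the story describes a service management outage, so applying it to an uptime figure at all is itself an assumption. The function names are made up for this example.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600; a leap year adds another 1,440


def downtime_budget_minutes(nines: int) -> float:
    """Minutes of downtime per year allowed at the given number of nines."""
    return MINUTES_PER_YEAR * 10.0 ** -nines


def achieved_availability(outage_minutes: float) -> float:
    """Availability over one year containing a single outage of this length."""
    return 1.0 - outage_minutes / MINUTES_PER_YEAR


budget = downtime_budget_minutes(5)
outage = 7 * 60  # "about seven hours", per the story

print(f"five nines budget: {budget:.2f} minutes per year")
print(f"availability after a {outage}-minute outage: {achieved_availability(outage):.4%}")
```

Five nines leaves roughly 5.26 minutes of downtime a year, so a seven-hour outage uses up the budget many times over and puts the year at about 99.92%, i.e. three nines.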
28 days in February . . . . . should be enough for ANYBODY.
at least 9 5s of uptime!
How can date manipulation bring down a mission critical asset for 7 hours? Maybe someone can explain how you could accidentally write code that breaks this badly on Leap Day. I've never written anything that stores the data internally as anything other than epoch seconds or epoch milliseconds, precisely because it seems like a can of worms. I understand this is the norm for many Microsoft projects, right?
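One way it can happen even when timestamps are stored as epoch values: the bug tends to live in calendar arithmetic, not in storage. A minimal sketch follows, assuming some code needs "the same date one year later" (an expiry date, say) and gets it by bumping the year field. Nothing in this thread establishes that this is what happened at Azure, and both function names are made up for illustration.

```python
from datetime import date


def naive_one_year_later(d: date) -> date:
    # Bug: assumes the same month and day exist in the following year.
    return d.replace(year=d.year + 1)


def safe_one_year_later(d: date) -> date:
    try:
        return d.replace(year=d.year + 1)
    except ValueError:
        # Feb 29 has no counterpart in a non-leap year; fall back to Feb 28.
        return d.replace(year=d.year + 1, day=28)


leap_day = date(2012, 2, 29)

try:
    naive_one_year_later(leap_day)
    naive_failed = False
except ValueError:
    naive_failed = True  # there is no 29 February 2013

print("naive version failed on leap day:", naive_failed)
print("safe version gives:", safe_one_year_later(leap_day))
```

The naive version works on 365 days out of every 366, which is how this kind of thing gets through testing. Whether to fall back to 28 February or roll forward to 1 March is a policy choice; the sketch picks the former.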
MS unconfirmed source said "Don't worry...we've scheduled a leap year remembrance event next year on Feb 29th"