Certificate Expiry Leads to Total Outage For Microsoft Azure Secured Storage

Posted by timothy on Saturday February 23, 2013 @02:17AM from the keeping-the-lights-on dept.

rtfa-troll writes "There has been a worldwide (all locations) total outage of storage in Microsoft's Azure cloud. Apparently, 'Microsoft unwittingly let an online security certificate expire Friday, triggering a worldwide outage in an online service that stores data for a wide range of business customers,' according to the San Francisco Chronicle (also Yahoo and the Register). Perhaps too much time has been spent sucking up to storage vendors and not enough looking after the customers? This comes directly after a week-long outage of one of Microsoft's SQL server components in Azure. This is not the first time that we have discussed major outages on Azure and probably won't be the last. It's certainly also not the first time we have discussed Microsoft cloud systems making users' data unavailable."

23 of 176 comments

Lolwut? by Anonymous Coward · 2013-02-23 02:21 · Score: 4, Funny

What's an expirty?
1. Re:Lolwut? by Nidi62 · 2013-02-23 02:27 · Score: 5, Funny
  
  I think you get them from storage vendros
  
  --
  The only thing necessary for evil to triumph is for it to be pitted against a slightly greater evil
2. Re:Lolwut? by drinkypoo · 2013-02-23 02:34 · Score: 4, Funny
  
  Vendro is Destro's cousin, who works on the supply side.
  
  --
  "You're right," Fisheye says. "I should have set it on 'whip' or 'chop.'"
Typical. by berchca · 2013-02-23 02:28 · Score: 5, Funny

Not the first time they've made such blunders:
http://slashdot.org/story/03/11/06/1540257/microsoft-forgets-to-renew-hotmailcouk
If only Redmond had some sort of calendar system to help them remember this stuff.
1. Re:Typical. by Stormthirst · 2013-02-23 02:35 · Score: 4, Funny
  
  Does MS not have a credit card its vendor can keep on file?
2. Re:Typical. by hsmith · 2013-02-23 02:39 · Score: 4, Interesting
  
  It is almost a year ago to the day Azure was down for a day because no one accounted for leap year for validating certificates, lol. AWS seems to have issues too, but they don't seem to revolve around blatant stupidity and result in an entire day of downtime.
3. Re:Typical. by rtb61 · 2013-02-23 03:20 · Score: 4, Insightful
  
  M$ has a history of lacking customer focus, hence it will fail at any industry that demands the highest levels of customer focus. For cloud services to be down for a day is inexcusable, and seriously, any IT management staff that fails to acknowledge these failures and uses or recommends Azure should be fired. Any downtime should be measured in minutes, not days; this should be considered a catastrophic failure. M$ is far too used to its EULAs, a warranty without a warranty, and has become woefully complacent about actually guaranteeing a supply of service. Meh, "it mostly works" is their motto, and "we'll fix it next time round, for sure this time."
  
  --
  Chaos - everything, everywhere, everywhen
4. Re:Typical. by Charliemopps · 2013-02-23 03:38 · Score: 4, Interesting
  
  You'd think that, but there's contract stuff. The thing is, you basically need a department in charge of renewing shit like this when you have enterprise level services. We've got a site with millions of hits daily and still manage to let it expire every couple of years. You try the credit card thing, but credit cards expire. You try recurring billing and then you get into a contractual nightmare with the registrar. The registrar isn't going to do you any favors, you might get millions of hits daily, but they still only get $5/year even from google.com so fuck you, figure out the billing yourself.
  The only real way to do it effectively is to build yourself a database of all the crap you need to renew regularly, then hire someone to renew that stuff. But who are you going to hire? It usually ends up being some assistant that doesn't know a damned thing about tech... and it's still going to cost you $60k a year in pay and benefits to retain them. That's an expensive way of keeping track of such things... ah, the website admins can remember, right?
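
The "database of all the crap you need to renew" idea above can be sketched in a few lines. This is only a minimal illustration, not anything described in the thread: the `Renewable` record, the `owner` field, the 60-day warning window, and every name and date in `INVENTORY` are assumptions made up for the example.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class Renewable:
    """One thing that must be renewed: a domain, a certificate, a licence."""
    name: str
    kind: str
    expires: date
    owner: str  # hypothetical field: whoever gets nagged about the renewal


def due_for_renewal(items, today, warn_days=60):
    """Return items that expire within warn_days of today (or have already
    expired), soonest first."""
    cutoff = today + timedelta(days=warn_days)
    due = [item for item in items if item.expires <= cutoff]
    return sorted(due, key=lambda item: item.expires)


# Hypothetical inventory; names and dates are invented for illustration.
INVENTORY = [
    Renewable("example.com", "domain", date(2013, 3, 15), "web-admins"),
    Renewable("storage.example.com", "tls-cert", date(2013, 2, 22), "storage-ops"),
    Renewable("api.example.com", "tls-cert", date(2014, 1, 10), "api-team"),
]
```

Calling `due_for_renewal(INVENTORY, today=date(2013, 2, 23))` from a daily scheduled job and mailing the result to each owner is the cheap version of "hire someone to renew that stuff". It does not solve the billing and contract problems described above, only the remembering.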
Re:Spellcheck... by mystikkman · 2013-02-23 02:31 · Score: 4, Funny

Maybe rtfa-troll and Timothy's spell checkers were hosted on Azure.
Tip of the iceberg by gmuslera · 2013-02-23 02:38 · Score: 5, Insightful

If you can't trust Microsoft with small but essential things like this, should you trust them with bigger ones?
1. Re:Tip of the iceberg by Junta · 2013-02-23 04:10 · Score: 4, Insightful
  
  The reality is, if you outsource your hosting to a single company, there will always be single points of failure.
  There will be architectural ones, like a root of trust expiring and the security framework taking everything down with it.
  There will be bugs that can bite all of their instances in the same way at the same time.
  There will be business realities like failing to pay electric bills, or collapsing, or simply closing down their hosting business for the sake of other business interests.
  Ideally:
  -You must keep ownership of all data required to set up anywhere, at all times. Even if you host nothing publicly yourself, you must ensure all your data exists on storage that you own.
  -You either do not outsource your hosting (in which case your single point of failure business-wise would take you out anyway) or else you outsource to several financially independent companies. "Everything to EC2" is a huge mistake, just as much as "everything to Azure" is a huge mistake.
  -Never trust a provider's security promises beyond what they explicitly accept liability for. If you consider the potential risk to be "priceless", then you cannot host it. If you do know what your exposure is (e.g. you could be sued for 20 million), then only host it if the provider will assume liability to the tune of 20 million.
  
  --
  XML is like violence. If it doesn't solve the problem, use more.
12 hours to update the certs? by crt · 2013-02-23 02:56 · Score: 5, Informative

The really amazing thing is that if you look at their service dashboard, it took them 12 hours to update the certificates on their site:
http://www.windowsazure.com/en-us/support/service-dashboard/
They spent several hours doing "test deployments" ... while it's great to make sure you aren't going to make something worse, updating an SSL cert isn't exactly rocket science. I'd hate to see how long it took to recover from a more serious service issue triggered by a software bug.
1. Re:12 hours to update the certs? by Glendale2x · 2013-02-23 03:21 · Score: 4, Funny
  
  Maybe they tried rolling back to an older version of the cert first.
  (Yes, that was sarcasm.)
  
  --
  this is my sig
Entwined failure loop... by dargaud · 2013-02-23 03:07 · Score: 4, Interesting

I wonder how long it will be before there's a major failure loop in the cloud, something like the certificate for cloud X being stored in service Y, which actually uses cloud X as its backend. So when the certificate for X expires, the whole thing grinds to a halt with no way to restart it (unless there are backdoors)...
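
A loop like this is at least detectable ahead of time if the "needs X to start" relationships are written down. The sketch below is a plain depth-first search for a cycle in such a dependency map. The function name `find_cycle` and the service names in the usage note are hypothetical, and it assumes someone has actually recorded the dependencies, which is the hard part.

```python
def find_cycle(deps):
    """Return one dependency cycle as a list of names, or None if there is none.

    deps maps each service to the services it needs in order to start.
    """
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    state = {}
    path = []

    def visit(node):
        state[node] = GREY
        path.append(node)
        for dep in deps.get(node, ()):
            if state.get(dep, WHITE) == GREY:
                # dep is already on the current path, so we have looped back.
                return path[path.index(dep):] + [dep]
            if state.get(dep, WHITE) == WHITE:
                found = visit(dep)
                if found:
                    return found
        path.pop()
        state[node] = BLACK
        return None

    for node in deps:
        if state.get(node, WHITE) == WHITE:
            found = visit(node)
            if found:
                return found
    return None
```

For the scenario above, `find_cycle({"cloud-x": ["cert-for-x"], "cert-for-x": ["service-y"], "service-y": ["cloud-x"]})` reports the loop from `cloud-x` back to itself, while a map with no loop returns `None`.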

--
Non-Linux Penguins ?
Re:Somebody by Glendale2x · 2013-02-23 03:23 · Score: 5, Insightful

Eh, don't put anything too important that you can't live without on systems outside of your control.

--
this is my sig
Re:Blew their support contracts.. by binarylarry · 2013-02-23 03:26 · Score: 5, Funny

Finally the Microsoft Blue Screen of Death has made it into the new mobile cloud age.
I mean the Azure Screen of Death, excuse me Mr. Ballmer.

--
Mod me down, my New Earth Global Warmingist friends!
Re:Blew their support contracts.. by click2005 · 2013-02-23 03:34 · Score: 4, Insightful

The Blue Sky of Cloud Death

--
I am a free slashdotter. I will not be modded, blogged, DRM'd, patented, podcasted or RFID'd. My life is my own.
Re:Somebody by Anonymous Coward · 2013-02-23 03:53 · Score: 4, Insightful

Somehow I feel those worker visas are the issue here.
Anything else you'd like to blame on foreigners?
Declining population of ducks in the local pond?
Chips no-longer served in old newspaper?
Lack of respect for elders?
Banning of blackboards in schools?
Rampant rape and violence all foreigners bring to your little Daily Mail reading village?
Re:Then what the hell was this Slashdot article? by multi+io · 2013-02-23 04:06 · Score: 4, Funny

Outperforms in reliability, huh? bullshit
Of course it doesn't work, but look how fast it is!
Re:Somebody by DarkOx · 2013-02-23 04:21 · Score: 4, Insightful

Right and I think this is an important aspect to the problem here.
There is simply no substitute for having all your I's dotted and T's crossed with large integrated systems like this. This is a culture problem, not an "individual screwed up" problem. If you just fire the guy, there will be lots of awareness, but the takeaway most of your remaining people will get is "don't forget to check the certificate expiry dates, that'll get you canned." Many of them, traumatized by the experience, will dutifully check certificate dates for the rest of their careers, but this will do nothing to prevent your next major outage, because that will almost certainly be the result of something else.
Everyone is pushing the idea that virtualization + "dev ops" + management/monitoring is going to let us have one admin do what was once the work of ten. The fact is it just does not work like that. Management/monitoring like Microsoft MOM, for example, requires you to have all the failure modes identified and the scripts written to check conditions like expiry dates and trigger the alerts. Unless everyone is really good about getting all the routine maintenance tasks in there, it won't help with something like this. That takes time your ONE admin has not got, and discipline that breaks down when someone is overworked.
The "dev ops" and virtualization stuff is all great in terms of how much can be automated. Someone has to develop that automation, though. Your ONE guy does not have time to build and test his generic deployment script when you promised your customers you'd have their infrastructure stood up last week.
It comes down to the business recognizing it's important to have good people, enough people, and a willingness to invest in making sure the job is done correctly and completely every time, and that documentation is maintained in a way everyone knows how to use. Checklists need to be kept and followed, etc. IT got away from plant-engineering-style discipline when hardware got cheap. You no longer had to worry about that one computer you had failing. As we move back to more consolidated and integrated solutions, management is going to have to get used to the idea again that there is some people-time investment that must be made. It's great you can save on power, cooling equipment, and headcount, but you can't cut headcount too far, because the more consolidated you get, the less you can afford for anything to go wrong, so it all must be checked, double-checked, and checked again just to be sure. This applies whether you do it yourself or pay your cloud provider to do it. Either way, cloud services so far have been mostly a race to the bottom, and that is going to cause some to have to learn some very painful lessons if the industry remains on its current trajectory.
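
For what it's worth, the kind of "script written to check conditions like expiry dates" mentioned above does not have to be large. The following is a rough sketch using only Python's standard library. The function names, the 30-day threshold, and the message format are assumptions for illustration, and it has nothing to do with how Microsoft MOM or Azure actually monitor anything.

```python
import socket
import ssl
from datetime import datetime, timezone


def days_remaining(not_after, now):
    """Whole days from now until a certificate's notAfter timestamp
    (negative once the certificate has expired)."""
    expires = datetime.fromtimestamp(ssl.cert_time_to_seconds(not_after), tz=timezone.utc)
    return (expires - now).days


def fetch_not_after(host, port=443, timeout=5.0):
    """Connect to host:port over TLS and return the server certificate's
    notAfter string, e.g. 'Feb 22 12:00:00 2013 GMT'."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def check(host, warn_days=30):
    """Return a one-line status; ALERT when fewer than warn_days remain."""
    left = days_remaining(fetch_not_after(host), datetime.now(timezone.utc))
    status = "ALERT" if left < warn_days else "ok"
    return f"{status}: {host} certificate expires in {left} days"
```

Something like `print(check("example.com"))` run from a scheduler covers the happy path. Note that with the default verifying context an already-expired certificate makes the handshake itself fail with an exception, so a real check would want to treat that failure as an alert too. And none of this helps unless somebody owns the list of hosts to check, which is the point being made above.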

--
Repeal the 17th Amendment TODAY! Also Please Read http://www.gnu.org/philosophy/right-to-read.html
Re:Somebody by Nerdfest · 2013-02-23 04:35 · Score: 5, Interesting

On the other hand, I've worked at places where the worst thing you could do is leave things that the company can't live without *in* the control of the company. Sometimes certain areas of expertise require specializations that the company just doesn't have and isn't interested in acquiring. Of course handing the responsibility of those things off to *Microsoft* is not necessarily any better.
Re:Monitoring Fail by rabbitfood · 2013-02-23 07:28 · Score: 5, Insightful

Simple operation? You've clearly never worked for a large company.
Even if a warning wasn't trickled down a month ago, and we've no reason to assume it wasn't, the person whose job it is to act on it, provided they weren't on vacation, won't have simply thrown five dollars at a registrar. They'll have had to put in a request to the finance department, probably via a cost-management chain of command, with a full description of what needed to be paid to whom and why, with payee reference, cost-center code, expense code and departmental authorization, and hoped it would arrive in time to be allocated to the next monthly rubber-stamp meeting. Assuming the application contained no errors, was suitably endorsed and was made against an allocated budget that hadn't been over-spent and wasn't under review, then, perhaps, in the fullness of time, it might have received approval and have been sent back down the chain for subsequent escalation to the bought-ledger department, who'd have looked at the due date, added ninety days and put it on the bottom of the pile. After those ninety days, when the finance folk began to take a view to assessing its urgency, unless they found a proper purchase order from the supplier, and a full set of signed terms and conditions of purchase, non-disclosure agreements, sustainability declarations and ethical supply-chain statements, as now required by any self-respecting outfit, it'll have been put aside and, eventually, sent back round to be done properly. Or, if it all checked out first time, it'll have been put on the system for calendaring into the next round of payment processing.
I'm sure it might be possible to streamline aspects of such mechanisms, but to suggest there's anything trivial about them is a touch hasty. But you never know. Perhaps they're already thinking of planning a meeting to discuss it, and are working on a framework for identifying the stakeholders as I write.
Re:Does Timothy Have Brain Damage? by 6ULDV8 · 2013-02-23 07:42 · Score: 4, Funny

Calling someone a cunt for any reason wouldn't make constructive criticism. When I say it, it definitely isn't an attempt at anything constructive. I still love the word though.

--
Pull my finger for my public key.