Dark Day In the AWS Cloud: Big Name Sites Go Down

Posted by timothy on Sunday August 25, 2013 @12:02PM from the central-authority-vs-resilience dept.

An outage of one company's servers might only affect that company's customers — but when a major data center for Amazon hits kinks, sites that rely on the AWS cloud services all suffer from the downtime. That's what happened today, when several major sites or online services (like Instagram and AirBnB) were knocked temporarily offline, evidently because of problems at an Amazon data center in Northern Virginia. From TechCrunch's coverage of the outage: "The deluge of tweets that accompanied the services’ initial hiccups first started at around 4 p.m. Eastern time, and only increased in intensity as users found they couldn’t share pictures of their food or their meticulously crafted video snippets. Some further poking around on Twitter and beyond revealed that some other services known to rely on AWS — Netflix, IFTTT, Heroku and Airbnb to name a few — have been experiencing similar issues today."

35 of 182 comments

Running List of Cloud Outages? by bill_mcgonigle · 2013-08-25 12:07 · Score: 3, Insightful

I thought this might already exist, but I'm not finding it with a quick Google search. Seems like it's a thing that could get ad views from some decent IT audiences.

--
My God, it's Full of Source!
OUTSIDE_IP=$(dig +short my.ip @outsideip.net)
1. Re:Running List of Cloud Outages? by sottitron · 2013-08-25 14:43 · Score: 3, Funny
  
  You should totally create this. I hear AWS is the way to go to get things online quickly and at scale.
Re:Say what you will by Anonymous Coward · 2013-08-25 12:12 · Score: 5, Funny

In Soviet Russia, company's customers go down on YOU!
Re:Say what you will by rudy_wayne · 2013-08-25 12:23 · Score: 4, Interesting

One of the features of AWS was supposed to be the ability to reroute everything to a different datacenter if one goes down. I know I read that somewhere back when AWS was first starting up. You don't think they lied, do you?
Re:Say what you will by teknopurge · 2013-08-25 12:25 · Score: 3, Funny

In Soviet Russia, company's customers go down on YOU!
so dirty...

--
Website Hosting
Add Adobe Creative Cloud to the List too by JenovaSynthesis · 2013-08-25 12:26 · Score: 3, Interesting

That went down and I think it ate some files with it. Just before the crash my client reported 103 files being removed. They weren't by me.

--
Anonymous Cowards generally receive no replies because you're a coward and I'm a bitch :)
Re:Say what you will by teknopurge · 2013-08-25 12:30 · Score: 5, Insightful

That's expensive. "Cloud" hosting services cost about 1.5x traditional hosting. When you want multiple locations ("regions" in AWS) you need to pay for resources in each additional region, then pay another cost to provide that failover. Cloud hosting is great, but nothing it does is new or cheaper than hosting 10 years ago.

--
Website Hosting
Has Rackspace had any outages in 10 years or so? by MillerHighLife21 · 2013-08-25 12:40 · Score: 5, Interesting

I've run servers on both Amazon and Rackspace for several years now and I can't recall a single instance of Rackspace having an outage. On the other hand, Amazon seems to have major issues at least 2 or 3 times a year. Is this stuff tracked anywhere?

--
"Don't teach a man to fish, feed yourself. He's a grown man. Fishing's not that hard." - Ron Swanson
Re:Say what you will by Anonymous Coward · 2013-08-25 12:47 · Score: 5, Insightful

No, they didn't lie. You can set things up that way: simply set up your servers in multiple data centers (AWS availability zones) and load balance between them. It's foolish to just throw things up in the cloud and think magically I won't ever have to worry about downtime ever again. It's foolish, but a lot of companies act this way.
Somehow cloud hosting is taken as the silver bullet to prevent outages. It isn't. You still have to architect things the way you normally would if you're looking for things like disaster recovery, high availability, etc.
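As a rough illustration of why spreading servers across availability zones helps, here is a minimal sketch of the arithmetic. It assumes zone failures are independent and that any single surviving zone can carry the full load; neither assumption comes from this thread, and the 99% input is a made-up number, not an AWS figure.

```python
def combined_availability(per_zone: float, zones: int) -> float:
    """Probability that at least one of `zones` zones is up.

    Assumes every zone has the same availability and that zones fail
    independently of each other (a simplification).
    """
    if not 0.0 <= per_zone <= 1.0:
        raise ValueError("per_zone must be between 0 and 1")
    if zones < 1:
        raise ValueError("zones must be at least 1")
    # The service is down only when every zone is down at the same time.
    return 1.0 - (1.0 - per_zone) ** zones


# Hypothetical input: each zone is up 99% of the time.
for n in (1, 2, 3):
    print(n, f"{combined_availability(0.99, n):.6f}")
```

Correlated failures (a bad deploy, a shared control plane) break the independence assumption, so real numbers would be worse than this sketch suggests.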
Re:Say what you will by chrisgeleven · 2013-08-25 12:54 · Score: 5, Informative

Assuming you mean traditional round-robin A records, the timeout(s) you still have to suffer through would kill your latency.
If you're talking about DNS providers (disclaimer: I work for Dyn) with advanced features that detect a failover event occurring and will only serve healthy A records, then that is a different story.
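The "only serve healthy A records" idea can be sketched in a few lines. This is an illustration, not how any particular DNS provider implements it: the function name, the health-check callback, and the fail-open fallback are all assumptions, and the addresses come from the reserved documentation ranges.

```python
from typing import Callable, Iterable, List


def healthy_a_records(records: Iterable[str],
                      is_healthy: Callable[[str], bool]) -> List[str]:
    """Return only the addresses that currently pass a health check.

    If every address fails, return the full list so clients still get
    some answer (a "fail open" choice, assumed here for illustration).
    """
    all_records = list(records)
    healthy = [ip for ip in all_records if is_healthy(ip)]
    return healthy if healthy else all_records


# Hypothetical usage: pretend the first server is failing its checks.
down = {"192.0.2.10"}
print(healthy_a_records(["192.0.2.10", "198.51.100.20"],
                        lambda ip: ip not in down))
```

Plain round-robin hands out both addresses regardless, which is where the client-side timeouts mentioned above come from.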
Re:Say what you will by rudy_wayne · 2013-08-25 12:59 · Score: 3, Insightful

No, they didn't lie. You can set things up that way: simply set up your servers in multiple data centers (AWS availability zones) and load balance between them. It's foolish to just throw things up in the cloud and think magically I won't ever have to worry about downtime ever again. It's foolish, but a lot of companies act this way.
But that's the problem. *THEY* (i.e., AWS or whoever) are supposed to take care of all that stuff. They're supposed to worry about "uptime" and fixing things when they break and having redundant systems that kick in when something breaks so that there's no loss of service. That's the whole point of putting stuff in the "cloud".
If *I* have to worry about that stuff then I might as well just do it myself and not give my money to Amazon.
Re:Say what you will by alen · 2013-08-25 13:34 · Score: 3

yeah, but cloud is sold as this super cheap way to compute and have five nines reliability
Re:Say what you will by Cyberax · 2013-08-25 13:38 · Score: 5, Interesting

Well, right now I have 500 machines running some heavy calculations in multiple AZs. Works perfectly fine, we have noticed the recent problems but simply stopped using the affected region (us-east-1) for the time being, shifting our calculations to other regions.

AWS is really great at scaling. It's better than anything else on the market, but it does require a lot of work.
Re:Say what you will by Glendale2x · 2013-08-25 13:40 · Score: 4, Interesting

No, you have to manage your own redundancy and failover on AWS. Look at all the effort Netflix has put into programming failover and stress testing and yet they still have frequent outages with AWS.

--
this is my sig
Re:Say what you will by Anonymous Coward · 2013-08-25 13:48 · Score: 5, Informative

either you don't speak English or you need to take your meds. no offense. so i'll try muddling a reply together for you.
There are many ways to set up remote failover systems. Most of them rely on some type of heartbeat system where there's a "heartbeat message" which they all send each other periodically, and if the current Active goes out of response for too long the others choose one to take over. So it doesn't matter if they're all in one room connected with a single switch, or spread all over the planet.
The real rub for any mechanism is DNS... if the primary server your FQDN points at drops then you might have redundancy but most people won't be able to take advantage of it. With more manual mechanisms (such as telling users "If our primary site goes down, try here instead!") that's not as much of a concern, just a PITA to keep track of.
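A minimal sketch of the heartbeat idea described above, for illustration only. The function name, the timeout value, and the "lowest name wins" tie-break are assumptions; real systems also have to deal with network partitions and nodes that disagree about who is alive.

```python
from typing import Dict, Optional


def choose_active(last_heartbeat: Dict[str, float],
                  current_active: str,
                  now: float,
                  timeout: float = 10.0) -> Optional[str]:
    """Return the node that should be active, or None if none is alive.

    last_heartbeat maps node name -> timestamp of its last heartbeat.
    A node counts as alive if it was heard from within `timeout` seconds.
    """
    def alive(node: str) -> bool:
        seen = last_heartbeat.get(node)
        return seen is not None and now - seen <= timeout

    # Keep the current Active as long as it is still responding.
    if alive(current_active):
        return current_active

    # Otherwise pick deterministically, so that every node holding the
    # same heartbeat table arrives at the same successor.
    candidates = sorted(node for node in last_heartbeat if alive(node))
    return candidates[0] if candidates else None


# Hypothetical usage: "a" went quiet at t=100, the others are still talking.
beats = {"a": 100.0, "b": 108.0, "c": 109.0}
print(choose_active(beats, "a", now=115.0))
```

Nothing here cares where the nodes are located, which matches the point above; getting clients to the new Active is the separate DNS problem.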
Re:Say what you will by Anonymous Coward · 2013-08-25 14:05 · Score: 4, Informative

AWS Status Dashboard?
I know this is /., and people here don't like to read, but did anyone actually read the status dashboard posts?
This issue was limited to a single AZ, affected only a small number of machines, and was specifically an issue with added latency in EBS volumes. And Amazon completely resolved the issue in 4 hours.
So, call me crazy, but didn't they do exactly what they are supposed to do? Also, AWS quite clearly states that any given AZ *might* fail. Hence, if you want any sort of high availability, you replicate across different AZs.
Plus, I have 10+ EC2 instances, and a number of other resources with AWS, and none of them were affected by this outage.
actually, no by Chirs · 2013-08-25 14:09 · Score: 5, Informative

"cloud" is sold as a *convenient* way to compute, where it's quick to add resources when needed so you can start small and scale up (and down) with demand.
It is *not* generally considered a cheap or particularly reliable solution. So far at least none of the cloud providers are offering five nines--if you want that, you should (for now at least) be looking at enterprise/telecom gear.
1. Re:actually, no by You're+All+Wrong · 2013-08-25 18:00 · Score: 4, Informative
  
  > It is *not* generally considered a cheap
  
  Quoth Forbes:
  Cost savings… [...] These are the advertised benefits of cloud computing
  
  Quoth Salesforce:
  4. Cap-Ex Free [...] no need for capital expenditure [...] minimal project start-up costs
  
  Quoth Verio:
  Achieve economies of scale [...] Reduce spending on technology infrastructure. [...] Globalize your workforce on the cheap [...] Reduce capital costs.
  
  And those were the first 3 hits for ``benefits of cloud computing'' (although the first one is meta, it refers to others referring to cost savings).
  
  I hate to shake you from your firmly entrenched world-view, but you have to know that people are touting cloud solutions as ones which have cost benefits. Whether they're valid claims or not is irrelevant, they are undeniably being made.
  
  --
  Your head of state is a corrupt weasel, I hope you're happy.
2. Re:actually, no by Narcocide · 2013-08-25 21:57 · Score: 3, Funny
  
  YOU sir look like a shrewd and discerning businessman. How would you like to buy a bridge?
3. Re:actually, no by AK+Marc · 2013-08-25 22:27 · Score: 3, Insightful
  
  Whether they're valid claims or not is irrelevant, they are undeniably being made.
  Like "core business." Every time I hear that, it's from a contractor or someone who just spoke to a contractor, and it's always about why it's good to outsource everything to contractors. It doesn't take long for that to be a pattern.
  
  Now, ask cloud computing companies how much they charge, compared to renting tin. It's always cheaper, except when it's not, and even then, it's cheaper to use the more expensive cloud because tin can go down, the cloud can't, or something like that.
  
  --
  Learn to love Alaska
Everybody that is surprised is stupid... by gweihir · 2013-08-25 14:12 · Score: 4, Insightful

That things like this will happen with a cloud infrastructure is obvious. That the reliability claims made by the cloud providers are fantasy is also obvious. As soon as they start to do "uptime or else" (meaning you get tons of money as downtime compensation), things may be different, but they will not do that. At this time, the only thing you can do is change to a different cloud provider, which will have the same issues. Uptime guarantees without penalties for failing to meet them are worthless.

--
Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
1. Re:Everybody that is surprised is stupid... by VortexCortex · 2013-08-25 14:36 · Score: 4, Insightful
  
  We built a decentralized network called The Internet, even capable of withstanding global thermonuclear war -- packets rerouted moments after a city disappears from the mesh... And folks use data silos? Protip: Don't centralize services, that's daft in terms of both uptime and congestion.
Re:Has Rackspace had any outages in 10 years or so by CritterNYC · 2013-08-25 14:29 · Score: 5, Informative

It depends which data center you're in. PortableApps.com has been hosted at Rackspace for years and we had multiple major outages due to ongoing power issues in the Dallas data center in 2009. The switch from grid to UPS was failing and would take the whole wing of the data center out with every server crashing hard. It would take quite a while to come back up. Then we'd have to wait hours for the Rackspace folks to rebuild our corrupted database (fully managed account on a dedicated server). It happened two weekends in a row in June and one other time if I recall correctly, basically costing us a full day of downtime each time.

--
Portable versions of Firefox, GIMP, LibreOffice, etc
This is why I laugh at tech pundits who preach... by bagboy · 2013-08-25 14:38 · Score: 3, Interesting

public cloud services as "the future". I will never risk my corporate data uptime and reliability to some "location in the cloud". I'll stick to private clouds (VMWare/VCenter) where I have control of both hardware and software and reliable failsafe systems. At least then if I have downtime I also have accountability and predictability. The same cannot be said for cloud providers, and no matter what anyone says, once the data leaves your hardware, you have lost that control.
Re:Say what you will by tnk1 · 2013-08-25 15:01 · Score: 3, Interesting

Supposedly the load balancer problem did not affect LBs that have backing hosts in two availability zones according to the article. The major question is... who runs everything in one availability zone? You're not supposed to do that for high availability sites.
Wrong terminology? by elfprince13 · 2013-08-25 15:25 · Score: 5, Funny

Shouldn't this, technically speaking, be a "bright day" or a "sunny day"? After all, that's what I call it when the cloud-coverage breaks around here.
Re:This is why I laugh at tech pundits who preach. by l0ungeb0y · 2013-08-25 15:38 · Score: 3, Interesting

Depends on which "future" you are talking about. The future where the bulk of personal data is stored on the cloud to be shared across devices and with friends, family and authorized services is one I think is bound to come to fruition.
The future where Corporations put their core infrastructure into the Cloud is not one I ever recall anyone talking about.
Re:Say what you will by mysidia · 2013-08-25 15:56 · Score: 3, Insightful

But that's the problem. *THEY* (i.e., AWS or whoever) are supposed to take care of all that stuff. They're supposed to worry about "uptime" and fixing things when they break and having redundant systems that kick in when something breaks so that there's no loss of service. That's the whole point of putting stuff in the "cloud".
Boy have you been fed a line. Read the SLA. If it's not in there, then you don't get it.
If you think the cloud provider is clustering your instance and giving you HA, then AWS is not for you.
Amazon provides availability zones you can provision separate instances, storage, and networks in. If your application cannot survive the failure of an instance and the failure of an entire availability zone, then you don't have HA, and Amazon won't give it to you -- your app may be inappropriate for AWS, if HA is required.
Re:Say what you will by Zemran · 2013-08-25 15:59 · Score: 5, Funny

"In Soviet Russia, company's customers go down on YOU!"
Now we know the truth about why Snowden went there...

--
I love stacking my barbecues in the shed at the end of summer - you can't beat a bit of grill on grill action.
Re:Say what you will by Zemran · 2013-08-25 16:02 · Score: 4, Insightful

"nothing it does is new or cheaper than hosting 10 years ago."
Welcome to the wonderful world of marketing. Sell people what they already have for 50% more.

--
I love stacking my barbecues in the shed at the end of summer - you can't beat a bit of grill on grill action.
Re:Say what you will by hawguy · 2013-08-25 16:17 · Score: 3, Interesting

No, they didn't lie. You can set things up that way: simply set up your servers in multiple data centers (AWS availability zones) and load balance between them. It's foolish to just throw things up in the cloud and think magically I won't ever have to worry about downtime ever again.
But that was one of the big promises of "the cloud": that you'd never have to worry about the nitty-gritty of network administration again, your provider would handle all that for you.
There are many different flavors of "cloud" computing - if you throw your app at a cloud provider and blindly expect them to make it highly available, then you'll get what you deserve. There is no end of cloud solution providers that will be happy to help you architect your app for whatever level of redundancy you want. But it's not going to be free.
Amazon does let you get rid of your network admin and concentrate on managing the servers. No need to worry about BGP, buying bandwidth from multiple redundant providers, buying and administering your own firewalls, network switches, routers, etc.
But you still have to manage your servers. Amazon will help you with multi-AZ redundancy for things like MySQL.

If that isn't the case, then you gain nothing and might as well host the data yourself.
That depends heavily on your use case. If you have a relatively small number of servers, or have large demand spikes, Amazon can be much more cost effective than hosting your own servers. If you have hundreds of servers and keep them busy all the time, you can probably save money by doing it yourself.
But if you have dozens of servers, then it's likely that you'll save money with Amazon over buying your own servers, network gear, a SAN, backup solution, hardware service contracts, etc.
But you have to architect your application properly. We have our core servers split across multiple AZ's with the database replicated across those AZ's. We don't trust our failover/failback scripts enough to make it automatic, so we have a simple web interface to let anyone on the tech team do the failover. The only impact we saw in this outage was higher latency and timeouts to some of our app servers, but our database was not in the affected zone, and Amazon's load balancer correctly routed traffic to the servers in the good AZ.
Additionally, we have a warm spare running in a different region - the servers are kept up to date with data, but they are running in smaller instance types than we need to run our app, so to do a regional failover, we'd have to reboot them into larger instance types (our app startup scripts already tune memory parameters to take advantage of the greater amounts of RAM in the larger instances), then repoint DNS.
Realistically by corran__horn · 2013-08-25 16:58 · Score: 3, Insightful

Chances are that there are no providers that offer a true 99.999% uptime. If you demand that, you need to be building your code to run in an HA cluster with nationwide dispersion. (For reference, you get about 5.26 minutes of downtime across a whole year.)
99.999% uptime is also completely unnecessary, but sounds really good to management until you talk cost.
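The downtime budget behind an uptime percentage is simple arithmetic, sketched here for reference. The function name is made up for this example, and it assumes a 365-day year.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Minutes of allowed downtime per year for a given uptime percentage."""
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability_percent must be between 0 and 100")
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


# Compare a few common targets, from two nines up to five nines.
for target in (99.0, 99.9, 99.99, 99.999):
    print(f"{target}% -> {downtime_minutes_per_year(target):.2f} minutes/year")
```

Each extra nine cuts the budget by a factor of ten, which is why the cost conversation with management changes so quickly.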

--

If people can connect to one another even the smallest of voices will grow loud.
--Serial Experiments Lain
Pretend this was a US government outage by Required+Snark · 2013-08-25 17:21 · Score: 3, Insightful

It's a thought experiment: pretend it was the FAA having a big chunk of airspace lose all ability to track aircraft, or NOAA losing data collection so that weather forecasts are disrupted. (This, or something like it, happens from time to time.)
The right wing talking heads on TV would be squealing like stuck pigs. They would be screaming about "gubment" waste and incompetence, and start floating bills to privatize the FAA (or whomever). You'd get the same response on Slashdot as well.
Meanwhile in real life AWS, Google, and NASDAQ have all had dramatic failures in recent weeks. Although NASDAQ got a fair amount of coverage, and Google got some mention, AWS has been pretty much below the radar for the mainstream media. No one is making dramatic statements on TV about how Google is run by a bunch of idiots, or NASDAQ, a quasi-governmental entity, should be nationalized, because when it fails the entire economy is at risk. As far as critical comments go, it's the sound of crickets.
Clearly, there is a double standard. When there are problems with technology in the public sector, it's all hostility and table thumping. Similar failures in the private sector are treated like natural disasters completely beyond human control. According to common rhetoric, the private sector is always better than the public sector. Yet when the private sector fails, no one ever compares it to the well functioning public sector.
There is clearly a lot of hypocrisy in bashing the government. A lot of political power is at stake, and along with that goes a lot of money. This situation makes some people very happy, because they are getting what they want, both in public policy and private profit.

--
Why is Snark Required?
Re:Lack of reliability by Joining+Yet+Again · 2013-08-25 20:31 · Score: 3, Interesting

But I thought the whole point of the cloud was that everything included redundancy, so a server, or a cable, or a whole datacentre could go down, and because of real time replication, nothing whatever would be missed.
Or am I just thinking of VAXclusters from, you know, the 1980s?
Re:Say what you will by jimicus · 2013-08-25 23:33 · Score: 3, Informative

But that's the problem. *THEY* (i.e., AWS or whoever) are supposed to take care of all that stuff. They're supposed to worry about "uptime" and fixing things when they break and having redundant systems that kick in when something breaks so that there's no loss of service. That's the whole point of putting stuff in the "cloud".
Then either you're incredibly naive or you've never looked at what you get with most cloud providers.
Those £15/month virtual servers? You don't get any redundancy on those. If you're lucky, the provider will move it to a new physical host if the one it's living on breaks down, but they won't make any guarantees regarding how quickly that will happen or how automated and transparent that process is.
IME, the pile-it-high, sell-it-cheap brigade are punting exactly this. It's a whole bunch of physical boxes running something like Xen with a web-based front end but none of the work necessary to make it truly highly available has been carried out.
You want true high availability in the cloud - where even an entire datacentre going dark won't affect you? Well, then you have two choices:
- Architect your own. This means you will need several cheap virtual servers and you'll have to write your own software that accounts for all the various failure modes. Yes, this is difficult. Yes, this means you can't just fire up an Ubuntu image with Apache preinstalled on AWS and forget about it. Yes, this means it's a hell of a lot more expensive because suddenly you need to pay for lots of virtual servers rather than just one or two and you need to put a hell of a lot more work into the development process. But that was a choice you made when you went for the cheap option. Oh, you thought that because they used the word "cloud" in their marketing, that meant they'd already done all that for you? Ah.... no. Sorry.
- Contract it out to a company that has already built all this at the virtualisation level so you don't need to worry about it at the OS level. They operate a highly-available infrastructure with redundant everything and guarantees that even if something does fail, the redundancy will kick in automatically and you'll see no downtime. There are companies that offer this, but you might want to sit down with a strong drink before you look at their pricing structure. Clue: It's a hell of a lot more than £15/month for a basic virtual server.