Dark Day In the AWS Cloud: Big Name Sites Go Down
An outage of one company's servers might only affect that company's customers — but when a major data center for Amazon hits kinks, sites that rely on the AWS cloud services all suffer from the downtime. That's what happened today, when several major sites or online services (like Instagram and AirBnB) were knocked temporarily offline, evidently because of problems at an Amazon data center in Northern Virginia. From TechCrunch's coverage of the outage: "The deluge of tweets that accompanied the services’ initial hiccups first started at around 4 p.m. Eastern time, and only increased in intensity as users found they couldn’t share pictures of their food or their meticulously crafted video snippets. Some further poking around on Twitter and beyond revealed that some other services known to rely on AWS — Netflix, IFTTT, Heroku and Airbnb to name a few — have been experiencing similar issues today."
In Soviet Russia, company's customers go down on YOU!
One of the features of AWS was supposed to be the ability to reroute everything to a different datacenter if one goes down. I know I read that somewhere back when AWS was first starting up. You don't think they lied, do you?
That's expensive. "Cloud" hosting services cost about 1.5x traditional hosting. When you want multiple locations ("regions" in AWS) you need to pay for resources in each additional region, then pay another cost to provide that failover. Cloud hosting is great, but nothing it does is new or cheaper than hosting 10 years ago.
Website Hosting
I've run servers on both Amazon and Rackspace for several years now and I can't recall a single instance of Rackspace having an outage. On the other hand, Amazon seems to have major issues at least 2 or 3 times a year. Is this stuff tracked anywhere?
"Don't teach a man to fish, feed yourself. He's a grown man. Fishing's not that hard." - Ron Swanson
No, they didn't lie. You can set things up that way: simply set up your servers in multiple data centers (AWS availability zones) and load balance between them. It's foolish to just throw things up in the cloud and think that, magically, you won't ever have to worry about downtime again. It's foolish, but a lot of companies act this way.
Somehow cloud hosting is taken as the silver bullet to prevent outages. It isn't. You still have to architect things the way you normally would if you're looking for things like disaster recovery, high availability, etc.
Assuming you mean traditional round-robin A records, the timeout(s) you still have to suffer through would kill your latency.
If you're talking about DNS providers (disclaimer: I work for Dyn) with advanced features that detect a failover event occurring and will only serve healthy A records, then that is a different story.
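To make the "only serve healthy A records" idea concrete, here is a minimal sketch in Python. It is not how Dyn or any other provider actually implements this; the function names, the TCP probe, and the fall-back-to-everything rule are all assumptions made for illustration.

```python
import socket
from typing import Callable, Iterable, List


def tcp_probe(ip: str, port: int = 80, timeout: float = 2.0) -> bool:
    """Hypothetical health check: can we open a TCP connection to ip:port?"""
    try:
        with socket.create_connection((ip, port), timeout=timeout):
            return True
    except OSError:
        return False


def healthy_a_records(records: Iterable[str],
                      is_healthy: Callable[[str], bool]) -> List[str]:
    """Return only the addresses that pass the health check.

    Assumption: if every address fails the check, hand back the full
    list so the name still resolves instead of returning nothing.
    """
    all_records = list(records)
    alive = [ip for ip in all_records if is_healthy(ip)]
    return alive or all_records


# Example with a fake health check (documentation-range addresses):
# pretend the first endpoint is down.
down = {"192.0.2.10"}
answer = healthy_a_records(["192.0.2.10", "198.51.100.20"],
                           lambda ip: ip not in down)
```

In a real deployment the probe would run on a schedule and the records would carry a short TTL, so that resolvers pick up the change quickly.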
Well, right now I have 500 machines running some heavy calculations in multiple AZs. Works perfectly fine, we have noticed the recent problems but simply stopped using the affected region (us-east-1) for the time being, shifting our calculations to other regions.
AWS is really great at scaling. It's better than anything else on the market, but it does require a lot of work.
No, you have to manage your own redundancy and failover on AWS. Look at all the effort Netflix has put into programming failover and stress testing and yet they still have frequent outages with AWS.
this is my sig
either you don't speak English or you need to take your meds. no offense. so i'll try muddling a reply together for you.
There are many ways to set up remote failover systems. Most of them rely on some type of heartbeat system where there's a "heartbeat message" which they all send each other periodically, and if the current Active goes out of response for too long the others choose one to take over. So it doesn't matter if they're all in one room connected with a single switch, or spread all over the planet.
The real rub for any mechanism is DNS... if the primary server your FQDN points at drops then you might have redundancy but most people won't be able to take advantage of it. With more manual mechanisms (such as telling users "If our primary site goes down, try here instead!") that's not as much of a concern, just a PITA to keep track of.
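The heartbeat scheme described above can be sketched roughly like this in Python. It is a toy, single-process model rather than a real clustering protocol: the class name, the timeout, and the "lowest node name wins" election rule are assumptions chosen for illustration, and it does nothing about the DNS problem.

```python
import time
from typing import Callable, Dict, List, Optional


class HeartbeatMonitor:
    """Tracks the last heartbeat per node and picks the Active node."""

    def __init__(self, timeout: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.timeout = timeout          # seconds of silence before a node counts as dead
        self.clock = clock              # injectable so the logic can be tested
        self.last_seen: Dict[str, float] = {}
        self.active: Optional[str] = None

    def heartbeat(self, node: str) -> None:
        """Record a heartbeat message from a node."""
        self.last_seen[node] = self.clock()
        if self.active is None:
            self.active = node          # first node heard from becomes Active

    def alive_nodes(self) -> List[str]:
        """Nodes whose last heartbeat is within the timeout, sorted by name."""
        now = self.clock()
        return sorted(n for n, t in self.last_seen.items()
                      if now - t <= self.timeout)

    def elect(self) -> Optional[str]:
        """Keep the current Active if it is alive, otherwise fail over.

        Assumed rule: the alive node with the lowest name takes over.
        """
        alive = self.alive_nodes()
        if self.active not in alive:
            self.active = alive[0] if alive else None
        return self.active
```

With a 3 second timeout, a node that last checked in 4 seconds ago is treated as gone and the next call to `elect()` promotes one of the survivors. Real systems also have to handle network partitions, where both sides believe the other is dead.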
AWS Status Dashboard?
I know this is /., and people here don't like to read, but did anyone actually read the status dashboard posts?
This issue was limited to a single AZ, affected only a small number of machines, and was specifically an issue with added latency in EBS volumes. And Amazon completely resolved the issue in 4 hours.
So, call me crazy, but didn't they do exactly what they are supposed to do? Also, AWS quite clearly states that any given AZ *might* fail. Hence, if you want any sort of high-availability, you replicate across different AZs.
Plus, I have 10+ EC2 instances, and a number of other resources with AWS, and none of them were affected by this outage.
"cloud" is sold as a *convenient* way to compute, where it's quick to add resources when needed so you can start small and scale up (and down) with demand.
It is *not* generally considered a cheap or particularly reliable solution. So far at least none of the cloud providers are offering five nines. If you want that, you should (for now at least) be looking at enterprise/telecom gear.
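For a sense of scale, the arithmetic behind "five nines" is easy to work out. The short Python sketch below assumes a 365-day year and uses hypothetical function names; the four-hour figure is only an example input, matching the resolution time reported elsewhere in this thread.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def downtime_budget_minutes(nines: int) -> float:
    """Allowed downtime per year for an availability of N nines.

    Five nines means 99.999% availability, i.e. an unavailability of 10^-5.
    """
    unavailability = 10 ** -nines
    return MINUTES_PER_YEAR * unavailability


def availability_after_outage(outage_hours: float) -> float:
    """Yearly availability if the only downtime is a single outage."""
    return 1 - (outage_hours * 60) / MINUTES_PER_YEAR


five_nines_budget = downtime_budget_minutes(5)        # about 5.26 minutes per year
four_hour_outage = availability_after_outage(4.0)     # about 0.99954, i.e. roughly three nines
```

So a single four-hour incident uses up roughly 45 years' worth of a five-nines downtime budget.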
That things like this will happen with a cloud infrastructure is obvious. That the reliability claims made by the cloud providers are fantasy is also obvious. As soon as they start to do "uptime or else" (meaning you get tons of money as downtime compensation), things may be different. But they will not do that. At this time, the only thing you can do is change to a different cloud provider, which will have the same issues. Uptime guarantees without penalties for failing to meet them are worthless.
Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
It depends which data center you're in. PortableApps.com has been hosted at Rackspace for years and we had multiple major outages due to ongoing power issues in the Dallas data center in 2009. The switch from grid to UPS was failing and would take the whole wing of the data center out with every server crashing hard. It would take quite a while to come back up. Then we'd have to wait hours for the Rackspace folks to rebuild our corrupted database (fully managed account on a dedicated server). It happened two weekends in a row in June and one other time if I recall correctly, basically costing us a full day of downtime each time.
Portable versions of Firefox, GIMP, LibreOffice, etc
Shouldn't this, technically speaking, be a "bright day" or a "sunny day"? After all, that's what I call it when the cloud-coverage breaks around here.
"In Soviet Russia, company's customers go down on YOU!"
Now we know the truth about why Snowden went there...
I love stacking my barbecues in the shed at the end of summer - you can't beat a bit of grill on grill action.
"nothing it does is new or cheaper than hosting 10 years ago."
Welcome to the wonderful world of marketing. Sell people what they already have for 50% more.