Dark Day In the AWS Cloud: Big Name Sites Go Down

Posted by timothy on Sunday August 25, 2013 @12:02PM from the central-authority-vs-resilience dept.

An outage of one company's servers might only affect that company's customers — but when a major data center for Amazon hits kinks, sites that rely on the AWS cloud services all suffer from the downtime. That's what happened today, when several major sites or online services (like Instagram and AirBnB) were knocked temporarily offline, evidently because of problems at an Amazon data center in Northern Virginia. From TechCrunch's coverage of the outage: "The deluge of tweets that accompanied the services’ initial hiccups first started at around 4 p.m. Eastern time, and only increased in intensity as users found they couldn’t share pictures of their food or their meticulously crafted video snippets. Some further poking around on Twitter and beyond revealed that some other services known to rely on AWS — Netflix, IFTTT, Heroku and Airbnb to name a few — have been experiencing similar issues today."

8 of 182 comments
Re:Say what you will by chrisgeleven · 2013-08-25 12:54 · Score: 5, Informative

Assuming you mean traditional round-robin A records, the timeout(s) you still have to suffer through would kill your latency.
If you're talking about DNS providers (disclaimer, I work for Dyn) with advanced features that detect a failover event occurring and will only serve healthy A records, then that is a different story.
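To make the contrast above concrete, here is a minimal sketch of the two answer strategies. It is not how Dyn or any particular provider implements this; the `Endpoint` type, the `healthy` flag (assumed to be set by some external probe), and the documentation-range addresses are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Endpoint:
    address: str   # A record value (hypothetical documentation-range IPs below)
    healthy: bool  # outcome of the latest health probe, assumed to exist elsewhere


def round_robin_answer(endpoints, query_count):
    """Plain round-robin: rotate the full record set, dead hosts included."""
    n = len(endpoints)
    start = query_count % n
    return [endpoints[(start + i) % n].address for i in range(n)]


def health_filtered_answer(endpoints, query_count):
    """Rotate only the endpoints whose last probe succeeded."""
    alive = [e for e in endpoints if e.healthy]
    if not alive:
        # Nothing looks healthy: serve everything rather than an empty answer.
        alive = list(endpoints)
    start = query_count % len(alive)
    return [alive[(start + i) % len(alive)].address for i in range(len(alive))]


pool = [
    Endpoint("192.0.2.10", True),
    Endpoint("192.0.2.11", False),  # simulated failed host
    Endpoint("192.0.2.12", True),
]
print(round_robin_answer(pool, 1))      # failed host can be handed out first
print(health_filtered_answer(pool, 1))  # failed host never appears
```

With plain round-robin, a client that is handed the failed address first has to wait out a connection timeout before trying the next one, which is the latency cost described above.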
Re:Say what you will by Anonymous Coward · 2013-08-25 13:48 · Score: 5, Informative

either you don't speak English or you need to take your meds. no offense. so i'll try muddling a reply together for you.
There are many ways to set up remote failover systems. Most of them rely on some type of heartbeat system, where the nodes periodically send each other a "heartbeat message", and if the current Active stops responding for too long the others choose one to take over. So it doesn't matter if they're all in one room connected with a single switch, or spread all over the planet.
The real rub for any mechanism is DNS... if the primary server your FQDN points at drops then you might have redundancy but most people won't be able to take advantage of it. With more manual mechanisms (such as telling users "If our primary site goes down, try here instead!") that's not as much of a concern, just a PITA to keep track of.
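The heartbeat scheme described above can be sketched in a few lines. This is an illustration under stated assumptions, not any specific product's protocol: the class name, the node names, and the "lowest name wins" takeover rule are hypothetical, and time is passed in explicitly so the logic is easy to follow.

```python
class HeartbeatCluster:
    """Tracks last-seen heartbeat times and picks a new active node on timeout."""

    def __init__(self, nodes, timeout, now=0.0):
        self.timeout = timeout
        self.last_seen = {name: now for name in nodes}
        self.active = min(nodes)  # arbitrary but deterministic starting choice

    def heartbeat(self, node, now):
        """Record that `node` was heard from at time `now`."""
        self.last_seen[node] = now

    def alive(self, now):
        """Nodes heard from within the timeout window, in sorted order."""
        return sorted(n for n, t in self.last_seen.items() if now - t <= self.timeout)

    def check(self, now):
        """Fail over if the active node has been silent longer than the timeout."""
        live = self.alive(now)
        if self.active not in live and live:
            self.active = live[0]  # hypothetical rule: lowest name takes over
        return self.active


cluster = HeartbeatCluster(["node-a", "node-b", "node-c"], timeout=3.0)
cluster.heartbeat("node-b", now=4.0)
cluster.heartbeat("node-c", now=4.0)
print(cluster.check(now=5.0))  # node-a has been silent since t=0
```

A real deployment also needs a quorum or fencing rule so that two partitions do not each elect their own active node; that is left out here.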
Re: Say what you will by Anonymous Coward · 2013-08-25 13:59 · Score: 2, Informative

Most modern browsers do, indeed, try the next address. It's a browser feature, though, not an official standard.
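The same "try the next address" behaviour can be done by any client, not just a browser. The following is a rough sketch using only Python's standard `socket` module; the function name and the two-second per-address timeout are assumptions chosen for illustration.

```python
import socket


def connect_first_reachable(host, port, per_address_timeout=2.0):
    """Try each resolved address in order; return the first socket that connects."""
    last_error = None
    for family, socktype, proto, _canon, sockaddr in socket.getaddrinfo(
        host, port, type=socket.SOCK_STREAM
    ):
        sock = socket.socket(family, socktype, proto)
        sock.settimeout(per_address_timeout)
        try:
            sock.connect(sockaddr)
            return sock
        except OSError as exc:
            # Dead or unreachable address: remember why, then move on.
            last_error = exc
            sock.close()
    raise OSError(f"no address for {host!r} accepted a connection") from last_error
```

The standard library's `socket.create_connection` performs essentially this loop already. Note that every dead address ahead of a live one still costs up to one full timeout, which is the latency penalty raised earlier in the thread.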
Re:Say what you will by Anonymous Coward · 2013-08-25 14:05 · Score: 4, Informative

AWS Status Dashboard?
I know this is /., and people here don't like to read, but did anyone actually read the status dashboard posts?
This issue was limited to a single AZ, affected only a small number of machines, and was specifically an issue with added latency in EBS volumes. And Amazon completely resolved the issue in 4 hours.
So, call me crazy, but didn't they do exactly what they are supposed to do? Also, AWS quite clearly states that any given AZ *might* fail. Hence, if you want any sort of high-availability, you replicate across different AZs.
Plus, I have 10+ EC2 instances, and a number of other resources with AWS, and none of them were affected by this outage.
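A back-of-the-envelope way to see why replicating across AZs helps: if zones fail independently and traffic fails over cleanly (both are assumptions, and real zones are not perfectly independent), the service is down only when every zone is down at once. The 99% per-zone figure below is purely hypothetical, not an AWS number.

```python
def combined_availability(per_zone_availability, zones):
    """Probability that at least one of `zones` independent zones is up."""
    return 1.0 - (1.0 - per_zone_availability) ** zones


# Hypothetical per-zone availability of 99%, replicated across 1 to 3 zones.
for zones in (1, 2, 3):
    print(zones, round(combined_availability(0.99, zones), 6))
```

Under those assumptions each extra zone multiplies the chance of a total outage by the single-zone failure probability, which is why a one-AZ incident need not take down a multi-AZ deployment.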
actually, no by Chirs · 2013-08-25 14:09 · Score: 5, Informative

"cloud" is sold as a *convenient* way to compute, where it's quick to add resources when needed so you can start small and scale up (and down) with demand.
It is *not* generally considered a cheap or particularly reliable solution. So far at least none of the cloud providers are offering five nines--if you want that, you should (for now at least) be looking at enterprise/telecom gear.
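For scale, "five nines" means 99.999% availability. A small sketch of the arithmetic, with a hypothetical helper name and a 365-day year assumed:

```python
def downtime_budget_minutes(nines, period_days=365):
    """Allowed downtime per period for an availability of `nines` nines."""
    unavailability = 10.0 ** (-nines)  # e.g. 5 nines -> 0.00001
    return period_days * 24 * 60 * unavailability


for nines in (3, 4, 5):
    print(nines, "nines:", round(downtime_budget_minutes(nines), 3), "minutes/year")
```

Five nines allows a little over five minutes of downtime a year, so a single four-hour incident (240 minutes) on an affected resource overshoots both the five-nines and four-nines budgets and only fits within three nines.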
1. Re:actually, no by You're+All+Wrong · 2013-08-25 18:00 · Score: 4, Informative
  
  > It is *not* generally considered a cheap
  
  Quoth Forbes:
  Cost savings… [...] These are the advertised benefits of cloud computing
  
  Quoth Salesforce:
  4. Cap-Ex Free [...] no need for capital expenditure [...] minimal project start-up costs
  
  Quoth Verio:
  Achieve economies of scale [...] Reduce spending on technology infrastructure. [...] Globalize your workforce on the cheap [...] Reduce capital costs.
  
  And those were the first 3 hits for ``benefits of cloud computing'' (although the first one is meta, it refers to others referring to cost savings).
  
  I hate to shake you from your firmly entrenched world-view, but you have to know that people are touting cloud solutions as ones which have cost benefits. Whether they're valid claims or not is irrelevant, they are undeniably being made.
  
  --
  Your head of state is a corrupt weasel, I hope you're happy.
Re:Has Rackspace had any outages in 10 years or so by CritterNYC · 2013-08-25 14:29 · Score: 5, Informative

It depends which data center you're in. PortableApps.com has been hosted at Rackspace for years and we had multiple major outages due to ongoing power issues in the Dallas data center in 2009. The switch from grid to UPS was failing and would take the whole wing of the data center out with every server crashing hard. It would take quite a while to come back up. Then we'd have to wait hours for the Rackspace folks to rebuild our corrupted database (fully managed account on a dedicated server). It happened two weekends in a row in June and one other time if I recall correctly, basically costing us a full day of downtime each time.

--
Portable versions of Firefox, GIMP, LibreOffice, etc
Re:Say what you will by jimicus · 2013-08-25 23:33 · Score: 3, Informative

> But that's the problem. *THEY* (i.e., AWS or whoever) are supposed to take care of all that stuff. They're supposed to worry about "uptime" and fixing things when they break and having redundant systems that kick in when something breaks so that there's no loss of service. That's the whole point of putting stuff in the "cloud".
Then either you're incredibly naive or you've never looked at what you get with most cloud providers.
Those £15/month virtual servers? You don't get any redundancy on those. If you're lucky, the provider will move it to a new physical host if the one it's living on breaks down, but they won't make any guarantees regarding how quickly that will happen or how automated and transparent that process is.
IME, the pile-it-high, sell-it-cheap brigade are punting exactly this. It's a whole bunch of physical boxes running something like Xen with a web-based front end but none of the work necessary to make it truly highly available has been carried out.
You want true high availability in the cloud - where even an entire datacentre going dark won't affect you? Well, then you have two choices:
- Architect your own. This means you will need several cheap virtual servers and you'll have to write your own software that accounts for all the various failure modes. Yes, this is difficult. Yes, this means you can't just fire up an Ubuntu image with Apache preinstalled on AWS and forget about it. Yes, this means it's a hell of a lot more expensive because suddenly you need to pay for lots of virtual servers rather than just one or two and you need to put a hell of a lot more work into the development process. But that was a choice you made when you went for the cheap option. Oh, you thought that because they used the word "cloud" in their marketing, that meant they'd already done all that for you? Ah.... no. Sorry.
- Contract it out to a company that has already built all this at the virtualisation level so you don't need to worry about it at the OS level. They operate a highly-available infrastructure with redundant everything and guarantees that even if something does fail, the redundancy will kick in automatically and you'll see no downtime. There are companies that offer this, but you might want to sit down with a strong drink before you look at their pricing structure. Clue: It's a hell of a lot more than £15/month for a basic virtual server.