Amazon Outage Shows Limits of Failover 'Zones'

Posted by timothy on Thursday April 21, 2011 @08:01AM from the my-cloud-smells-like-cat-food dept.

jbrodkin writes "For cloud customers willing to pony up a little extra cash, Amazon has an enticing proposition: Spread your application across multiple availability zones for a near-guarantee that it won't suffer from downtime. 'By launching instances in separate Availability Zones, you can protect your applications from failure of a single location,' Amazon says in pitching its Elastic Compute Cloud service. But the availability zones are close together and can fail at the same time, as we saw today. The outage and ongoing attempts to restore service call into question the effectiveness of the availability zones, and put a spotlight on Amazon's failure to provide load balancing between the east and west coasts."
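
A rough sketch of why zones that "can fail at the same time" weaken the pitch. It assumes a simple model: each zone fails independently with probability `p_zone`, and a shared regional event takes out every zone with probability `p_shared`. The function name and both numbers are illustrative assumptions, not Amazon figures.

```python
def outage_probability(p_zone, n_zones, p_shared=0.0):
    """Probability that all zones are down at once.

    Either the shared event happens, or it does not and every zone
    fails independently of the others.
    """
    return p_shared + (1.0 - p_shared) * p_zone ** n_zones


# Assumed figures, for illustration only.
independent = outage_probability(0.01, 2)
correlated = outage_probability(0.01, 2, p_shared=0.005)
print(independent, correlated)
```

Under these assumptions the shared-failure term dominates as soon as it exceeds `p_zone ** n_zones`, so adding more zones in the same region stops helping.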

23 of 125 comments

Re:For a little extra money... by techsoldaten · 2011-04-21 08:06 · Score: 2

For a little extra money, you can get a seat in my biplane, with the extra wings.
Let us learn from Xzibit by Anonymous Coward · 2011-04-21 08:08 · Score: 4, Funny

Amazon should put their cloud in a cloud, so the cloud will have the redundancy of the cloud.
Cloud computing by stopacop · 2011-04-21 08:08 · Score: 5, Funny

Not ready for the desktop ;-)

--
http://www.stopacop.so -- You have rights. How about standing up for them before they go away?
have your own servers by Dan667 · 2011-04-21 08:13 · Score: 4, Insightful

or use a completely different company for redundancy. I think that is the lesson here.
1. Re:have your own servers by rudy_wayne · 2011-04-21 08:35 · Score: 4, Insightful
  
  This incident illustrates once again why you need to put your stuff on your own servers and not someone else's. All computer systems will fail occasionally. There's no such thing as 100% uptime. However, when your own servers fail you can get your own people working on it right away and it's their number one priority. When your stuff is on someone else's servers, you're at their mercy. It will get fixed when they get around to it, and, they have more customers than just you, so you might not be first on the priority list. Or second. Or third. Or tenth.
2. Re:have your own servers by gad_zuki! · 2011-04-21 08:47 · Score: 4, Informative
  
  So wait. The cloud sales pitch is "no more servers-save money-cut IT staff" but now it's:
  1. Virtualized servers in zone 1
  2. Virtualized servers in zone 2
  3. Virtualized servers from a different company altogether.
  So I went from one solid server, good backups, maybe a hot backup, and talented staff running the show to outsourced to 3 different clouds with hour-long hold times with some Amazon support monkey? Genius.
3. Re:have your own servers by vrmlguy · 2011-04-21 09:40 · Score: 2
  
  This incident illustrates once again why you need to put your stuff on your own servers and not someone else's.
  Well. Or put your stuff on your own servers as well as someone else's. Cloning your services into various clouds isn't insane as a tool for handling some types of unplanned scaling requirements or some types of unplanned outages. Relying on those clouds introduces risks that were just demonstrated.
  It's probably worth noting that EMC makes a cloud storage product called Atmos with an API essentially identical to Amazon's S3 service. The main difference is that the HTTP headers start with x-emc instead of x-amz, so a properly written application running on non-Amazon servers could switch fairly easily between the two for load balancing or redundancy.
  
  --
  Nothing for 6-digit uids?
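
A minimal sketch of the header-prefix point in the comment above, assuming (as the comment claims, and not verified here) that the two APIs differ mainly in their vendor header prefix. The function name `retarget_headers` and the sample headers are hypothetical; endpoints and request signing are not covered.

```python
def retarget_headers(headers, old_prefix="x-amz-", new_prefix="x-emc-"):
    """Return a copy of headers with the vendor-specific prefix swapped.

    Header names are matched case-insensitively, since HTTP header
    names are case-insensitive. Other headers pass through unchanged.
    """
    out = {}
    for name, value in headers.items():
        if name.lower().startswith(old_prefix):
            name = new_prefix + name[len(old_prefix):]
        out[name] = value
    return out


# Hypothetical request headers, for illustration only.
s3_style = {"x-amz-meta-owner": "alice", "Content-Type": "text/plain"}
print(retarget_headers(s3_style))
```

Keeping the prefix in one place like this is what would let an application choose its storage backend at runtime, if the rest of the two APIs really do line up.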
4. Re:have your own servers by hey! · 2011-04-21 10:02 · Score: 2
  
  Nah. It shows that when you buy a product or service you need to understand what you are paying for, not extrapolate from a buzzword like "cloud".
  You can't make a blanket statement one way or another about using something like EC2 without considering the user's needs and capabilities. There may be users who'd find the recent outage intolerable; they probably shouldn't be using EC2. But if they have good reasons to consider EC2, chances are they are going to spend more money.
  
  --
  Post may contain irony: discontinue use if experiencing mood swings, nausea or elevated blood pressure.
5. Re:have your own servers by LordLimecat · 2011-04-21 13:40 · Score: 2
  
  Pop quiz:
  You're a small company that does software development. You need servers to do deployment testing, basically just apache and the customized package. Uptime is a must, and your budget is limited.
  Do you...
  A) Spend tens of thousands on servers, plus backup power, plus racks, plus redundant switches, plus dual WAN links, plus a backup solution (for 10 servers, so far you're looking at ~$35k, plus a thousand a month on WAN links)
  or
  B) Trust that Amazon will have FAR better uptime than you could EVER dream of architecting on a budget, with far greater convenience, and a lower price tag to boot (the "cloud" is generally billed on CPU usage and bandwidth, which will be low for testing)?
  Every time Google or Amazon or Rackspace suffers an outage, people start hollering that the cloud is a menace, a curse, a sham, or whatever. But if you look at the length of, for example, Google's outages over 8 years, their record is head and shoulders above anything that slashdot armchair engineers could throw together, especially given the load they carry.
  Unless I missed a news story, this will be Amazon cloud's first outage, and the beauty here is that none of the "rebuild" or "restore from backup" burden will be on their customers. They have to pay technicians to come out and replace hardware; they have to provide the hardware. The downside, of course, is that their services are unavailable; but of course you would be facing that if your own setup failed, and you would be footing the bill to boot.
  The real lesson here, I suppose, is that if you really really really need 100% uptime, you should be prepared to fire up a hot- or cold- standby system of your own, or that you should get a rather large budget approved to build a real redundant system-- but not that you can out-architect Amazon on anything less than a large budget.
  NB-- I say this as a technician typically dealing with networks up to 100 users and up to 30 servers. If you have multi-million dollar budgets, certainly go ahead and build out that server room.
6. Re:have your own servers by outsider007 · 2011-04-21 14:52 · Score: 2
  
  No. If a zone goes down in CA, I can have a new server up in Virginia within minutes. I would rather be on ec2 when I go down. I guarantee I will be back up faster than you.
  
  --
  If you mod me down the terrorists will have won
7. Re:have your own servers by The Bean · 2011-04-21 18:03 · Score: 2
  
  Reddit's downtime has been a bit of a running joke for a while now, with most (all?) of it being blamed on Amazon.
  The way they implemented things is one of the big issues. For example, things like setting up RAID volumes across multiple EBS volumes. They just magnified their exposure to any issues in the cloud. If any one machine goes down, the system gets hosed and needs recovery. They also are constrained to a single availability zone in order to get the performance they need from their setup. (This is not intended to be a factual statement. i.e., I didn't confirm the details, but I believe it captures the essence of the issue.)
  To get the most from the "cloud" you need to build your infrastructure accordingly. You can't take old systems and throw them in the cloud and expect them to scale. Neither can you carry over all the old ideas; new tools will require new techniques, which the industry will learn as things mature.
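
The exposure argument above can be made concrete with a little arithmetic. This is a sketch assuming a striped (RAID 0 style) array with no redundancy and independent volume failures; the comment does not say which RAID level was used, and the 1% per-volume figure is illustrative, not a measured EBS failure rate.

```python
def stripe_failure_probability(p_volume, n_volumes):
    """Probability that a striped array with no redundancy is lost.

    The array survives only if every volume survives, so with
    independent failures P(loss) = 1 - (1 - p)^n.
    """
    return 1.0 - (1.0 - p_volume) ** n_volumes


# Assumed 1% chance that any single volume is affected by an incident.
for n in (1, 2, 4, 8):
    print(n, round(stripe_failure_probability(0.01, n), 4))
```

Under this model every extra volume in the stripe raises the chance that a single storage incident takes the whole array with it.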
philosophical POV by Anonymous Coward · 2011-04-21 08:29 · Score: 2, Insightful

I'll take the philosophical point of view on this and say failures are the best way to find and diagnose systemic weaknesses. Now Amazon knows the weakness in the AZs and can fix it.
And thus the gullible managers who ignored IT... by gestalt_n_pepper · 2011-04-21 08:41 · Score: 2

and only heard, "Cloud, cloud, cloud! It's new and shiny and cheaper than those annoying internal IT guys so I get a bonus!" learn to pay the stupidity tax.
Next up, learning just how *much* of your cloud data has been stolen and resold by those trustworthy souls in China and India.
Cheers!

--
Please do not read this sig. Thank you.
Re:sounds like TWCs DNS servers by michaelhood · 2011-04-21 08:43 · Score: 2

or just use 4.2.2.1 and 4.2.2.2.. or 8.8.8.8 and 8.8.4.4.
Re:sounds like TWCs DNS servers by samkass · 2011-04-21 08:57 · Score: 2, Informative

...and get slow performance on anything delivered via Akamai or similar services which try to use regional data centers.
OpenDNS and Google DNS are hacks that work increasingly badly.

--
E pluribus unum
Gullible manager doesn't care by JaredOfEuropa · 2011-04-21 08:59 · Score: 2

Outsource IT and you outsource responsibility as well. If your own department fucks up, the top brass will come looking for you. However, if you outsource and the service provider messes up, you can shift the blame to them, especially in case of big disasters like these. As long as you can show that you've managed the SLAs well and that it's them who didn't keep to their promises, you're good. More likely you'll find that those SLAs were crap to begin with, which is also fine, because it's likely your boss and his boss signed off on the deal as well. Pass the buck...

--
If construction was anything like programming, an incorrectly fitted lock would bring down the entire building...
Re:Turning lemons into lemonade or...... by elohel · 2011-04-21 09:00 · Score: 3, Insightful

Okay, I had to log in simply to comment on the stupidity of this statement. Aside from now being in violation of their own ToS (probably, at least in transgression of up-time guarantees), they're undoubtedly fiscally liable for refunding payment for the period of time in which services were unavailable or degraded. Additionally, this dramatically hurts their brand name - I know if I ever have to host anything on 'the cloud' (I can't believe I said it), this incident will be on my mind when the time comes for me to choose a provider. And before I stop beating this dead horse - think about what kind of liability Amazon would have, fiscally, for intentionally dropping services for revenue producing sites. One would imagine that Amazon would be fiscally liable for revenue losses during that downtime if this outage was intentional. That's no small amount of coin.
Re:sounds like TWCs DNS servers by trapnest · 2011-04-21 09:05 · Score: 3, Interesting

Not that you're wrong, but that's not the fault of the DNS servers, Akamai should be using geolocation by IP, not by the location of DNS servers.
In fact, I'm not sure how they could be doing geolocation by the client's DNS servers... are you sure about that?
Re:sounds like TWCs DNS servers by guruevi · 2011-04-21 09:11 · Score: 2

Actually, when you're on TWC you might get BETTER performance with OpenDNS than with their own DNS. When using the TWC DNS I can't get a 1080p without 10m of loading time or even a non-stuttering 720p stream from YouTube or Netflix. With OpenDNS or Google DNS I get much better performance. Also, if you're an AT&T Business customer, OpenDNS works much better with DNS-based RBL's like Spamhaus which AT&T blocks.

--
Custom electronics and digital signage for your business: www.evcircuits.com
Re:Yay by dkf · 2011-04-21 09:41 · Score: 2

Cloud Computing generally implies redundancy and non-locality 24/7.
No. It generally implies that you can hire resources (cpu, disk) on short notice and for short amounts of time without costing the earth. You can build high-availability systems on top of that, but HA is not trivial to set up and typically requires significant investment at many levels (hardware, system, application) to attain. Pretend that you can get away with less if you want; I don't care.

--
"Little does he know, but there is no 'I' in 'Idiot'!"
No need to speculate by jc2brown · 2011-04-21 10:18 · Score: 2

Since apparently no one's actually looked into the issue beyond "ZOMG the cloud is down," here's some info from Amazon:

8:54 AM PDT We'd like to provide additional color on what we're working on right now (please note that we always know more and understand issues better after we fully recover and dive deep into the post mortem). A networking event early this morning triggered a large amount of re-mirroring of EBS volumes in US-EAST-1. This re-mirroring created a shortage of capacity in one of the US-EAST-1 Availability Zones, which impacted new EBS volume creation as well as the pace with which we could re-mirror and recover affected EBS volumes. Additionally, one of our internal control planes for EBS has become inundated such that it's difficult to create new EBS volumes and EBS backed instances. We are working as quickly as possible to add capacity to that one Availability Zone to speed up the re-mirroring, and working to restore the control plane issue. We're starting to see progress on these efforts, but are not there yet. We will continue to provide updates when we have them.
So the engineers failed to foresee a potential hazard. Hardly something to get worked up about, especially for a relatively young technology.
Amazon and Microsoft by kriston · 2011-04-21 15:07 · Score: 2

Amazon and Microsoft have two distinctly different views of "cloud computing."
When I first learned about "cloud computing" I automatically assumed it meant that there would be an arbitrary number of different services available to an arbitrary number of web servers which would then be served to the user. No one service would depend on the other.
Amazon's "cloud computing" is centralized upon the virtual machine as the hub of the "cloud." Microsoft Azure, on the other hand, originally offered the approach that I had thought about, where everything is just a service, no VM required.
Today Amazon still depends heavily on the VM concept. You can't have a web service on Amazon without one. This also makes it excessively difficult to "load balance" or provide "failover" because you are actually expected to stand up new VM instances to scale up and down and need separate VM instances on each "availability zone." In addition it's not easy or affordable to share data between availability zones. This isn't what I thought the cloud was going to be.
Microsoft eventually added VMs to its Azure service so they could compete with Amazon's VM-centralized concept. I still think the idea of separate, independent services talking to each other was what the "cloud" was supposed to be, and if these services didn't have to depend on these VMs (which they do not have access to because AWS is intermittently down) they would have still been working from the other data centers.

--
Kriston
Re:sounds like TWCs DNS servers by The Bean · 2011-04-21 17:54 · Score: 2

Typically your computer asks your firewall/router for a DNS lookup. It relays that to your ISP's DNS server. Your ISP looks up the DNS server responsible for the domain and contacts that server and sends your original request. That request doesn't include your IP however, so Akamai's DNS servers are returning regional specific servers based on your ISP's DNS server IP/geo-location. That's usually perfectly acceptable, since presumably your ISP's DNS server would be located on a good route with a low ping.
So if you replace your ISP's DNS server with those of OpenDNS, google or whatever else, it is that server which determines your location when Akamai's DNS servers decide which IPs to give you.
You should be able to replace your DNS server with your own locally hosted one as well, i.e., you contact the root servers, hunt down the responsible server, then contact it directly for the IP. I'm not sure what the implications of that are, though. The intent of the typical setup is that the ISP DNS servers can cache things and reduce the load on the central root servers.
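
The lookup flow described above can be sketched as a toy model. This is not how Akamai is implemented; the region table, the addresses (taken from documentation-only ranges), and the function names are all hypothetical. It only shows that an authoritative server which sees the resolver's address, not the client's, will pick the edge nearest the resolver.

```python
# Hypothetical mapping from resolver address to region (illustration only).
RESOLVER_REGION = {
    "198.51.100.53": "us-east",  # pretend ISP resolver near the client
    "203.0.113.53": "us-west",   # pretend third-party public resolver
}

# Hypothetical edge servers per region.
EDGE_BY_REGION = {
    "us-east": "192.0.2.10",
    "us-west": "192.0.2.20",
}

DEFAULT_EDGE = "192.0.2.99"  # fallback when the resolver is unknown


def authoritative_answer(resolver_ip):
    """Pick an edge using only the resolver's address.

    The client's own address never reaches this function, mirroring the
    point that the forwarded request does not include it.
    """
    region = RESOLVER_REGION.get(resolver_ip)
    return EDGE_BY_REGION.get(region, DEFAULT_EDGE)


def lookup(client_ip, resolver_ip):
    """Client asks its resolver; the resolver asks the authoritative server."""
    del client_ip  # deliberately unused: it is not forwarded upstream
    return authoritative_answer(resolver_ip)


# Same client, two different resolvers.
print(lookup("198.51.100.7", "198.51.100.53"))
print(lookup("198.51.100.7", "203.0.113.53"))
```

In this model, switching resolvers changes the answer even though the client has not moved, which is the slowdown the earlier comments were arguing about.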