Amazon Outage Shows Limits of Failover 'Zones'
jbrodkin writes "For cloud customers willing to pony up a little extra cash, Amazon has an enticing proposition: Spread your application across multiple availability zones for a near-guarantee that it won't suffer from downtime. 'By launching instances in separate Availability Zones, you can protect your applications from failure of a single location,' Amazon says in pitching its Elastic Compute Cloud service. But the availability zones are close together and can fail at the same time, as we saw today. The outage and ongoing attempts to restore service call into question the effectiveness of the availability zones, and put a spotlight on Amazon's failure to provide load balancing between the east and west coasts."
For a little extra money, you can get a seat in my biplane, with the extra wings.
Yay, cloud. http://www.youtube.com/watch?v=Lel3swo4RMc
This space for rent.
Amazon should put their cloud in a cloud, so the cloud will have the redundancy of the cloud.
Not ready for the desktop ;-)
http://www.stopacop.so -- You have rights. How about standing up for them before they go away?
I lose access to TWC's DNS servers regularly (yes, I will be setting up my own, when it becomes annoying enough). Although you can do a quick-and-dirty load-balancing by setting them up as follows, there's no redundancy for the customers when there's a link failure.
search socal.rr.com
nameserver 209.18.47.61
nameserver 209.18.47.62
or use a completely different company for redundancy. I think that is the lesson here.
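The rotation idea above can be sketched in a few lines. This is only an illustration, not how the system resolver is implemented: `parse_resolv_conf` and `try_order` are made-up helper names, and the rotation loosely mimics what the `rotate` option in resolv.conf does. Since both addresses belong to the same provider, rotating spreads load but adds no redundancy when the link to that provider fails.

```python
def parse_resolv_conf(text):
    """Return (search_domains, nameservers) from resolv.conf-style text."""
    search, nameservers = [], []
    for line in text.splitlines():
        parts = line.split()
        if not parts or parts[0].startswith("#"):
            continue  # skip blank lines and comments
        if parts[0] == "search":
            search = parts[1:]
        elif parts[0] == "nameserver" and len(parts) > 1:
            nameservers.append(parts[1])
    return search, nameservers


def try_order(nameservers, query_number):
    """Hypothetical helper: rotate the first server per query, rest are fallbacks."""
    if not nameservers:
        return []
    start = query_number % len(nameservers)
    return nameservers[start:] + nameservers[:start]


# The configuration quoted in the comment above.
CONF = """search socal.rr.com
nameserver 209.18.47.61
nameserver 209.18.47.62
"""

search, servers = parse_resolv_conf(CONF)
for n in range(2):
    print(n, try_order(servers, n))
```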
Amazon: where failover meets overfail.
It has to be embarrassing that a single incident brought down multiple "availability zones" (at least for EBS, maybe other parts of EC2), as that's just what they were supposed to be safe from. Hmm, "overfail", I like it.
Socialism: a lie told by totalitarians and believed by fools.
I'll take the philosophical point of view on this and say failures are the best way to find and diagnose systemic weaknesses. Now Amazon knows the weakness in the AZs and can fix it.
What, you mean a modern oil tanker?
I use irony whenever I can, but my shirts are still wrinkled...
and only heard, "Cloud, cloud, cloud! It's new and shiny and cheaper than those annoying internal IT guys so I get a bonus!" learn to pay the stupidity tax.
Next up, learning just how *much* of your cloud data has been stolen and resold by those trustworthy souls in China and India.
Cheers!
Please do not read this sig. Thank you.
maybe the failure was on purpose to promote another revenue stream. Hmmmmmmmmm......................
Do you think Amazon would allow its own sales and services to be impacted for 12 hours (and running) under any circumstances short of the recent disaster in Japan? EC2 customers, on the other hand, appear to be second-class citizens.
Outsource IT and you outsource responsibility as well. If your own department fucks up, the top brass will come looking for you. However, if you outsource and the service provider messes up, you can shift the blame to them, especially in case of big disasters like these. As long as you can show that you've managed the SLAs well and that it's them who didn't keep to their promises, you're good. More likely you'll find that those SLAs were crap to begin with, which is also fine, because it's likely your boss and his boss signed off on the deal as well. Pass the buck...
If construction was anything like programming, an incorrectly fitted lock would bring down the entire building...
"Availability Zones", "Failure Domains", etc. must be done with absolute perfection if you do them at all. If your gargantuan application has some single tiny side-feature that is not replicated across domains, your whole app is going down.
True Story: I was doing some consulting work for a large bank after they had a bunch of problems. Their main website had all the super-available trimmings: Oracle RAC, multi-site server clustering, storage mirroring, all the fancy, expensive, highly-available crap you could ask for. This is all well and good except... some dinky stylesheet (or something like that) for the bank's homepage resided on some dinky 1U non-clustered fileserver. When it went down, the pages simply would not display. Whoops! All that grand effort was for nought because there was one "leakage" that killed the whole app.
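A rough way to hunt for that kind of "leakage" is to list every component a page depends on and flag anything that lives in only one failure domain. The sketch below is an assumption-laden illustration: the inventory, the component names, and the site labels are all invented, and a real audit would have to discover the dependencies first.

```python
def single_points_of_failure(deployment, min_domains=2):
    """Return components hosted in fewer than min_domains distinct failure domains."""
    return sorted(
        name
        for name, domains in deployment.items()
        if len(set(domains)) < min_domains
    )


# Hypothetical inventory: component -> failure domains that hold a replica.
deployment = {
    "database": ["site-a", "site-b"],
    "app-servers": ["site-a", "site-b"],
    "storage": ["site-a", "site-b"],
    "stylesheet-fileserver": ["site-a"],  # the dinky 1U box
}

print(single_points_of_failure(deployment))
```

Anything the function returns will take the whole app down with it, no matter how well the rest is clustered.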
You're supposed to FAILOVER between them, not load balance between them.
You can't hold amazon accountable for your own stupidity.
Beyond that, you have to ask yourself the question: how many outages would you have had with your own facility in the past year compared to this outage? Did you apply the same approach to your use of EC2 as you would to your own facility?
It's been like 7 years, how's everyone doing? :)
Change.Org says that for the past several days the Chinese have been DDoSing it over a petition they are posting to gather support for Ai WeiWei.
http://blog.change.org/2011/04/chinese-hackers-attack-change-org-platform-in-reaction-to-ai-weiwei-campaign/
But if you go to the Change.Org site to sign the petition, you get a message saying that something is wrong with their servers, which are at Amazon.
http://www.change.org/petitions/call-for-the-release-of-ai-weiwei
http://status.aws.amazon.com/
http://www.computerworld.com/s/article/9216064/Amazon_gets_black_eye_from_cloud_outage
Could Amazon's outage be the result of Chinese hackers?
C'mon. All managers love cloud!
What rolls down stairs, fails over in pairs,
Leaks data when it's allowed?
A stupidity tax, it replaces your racks,
It's cloud, cloud, cloud!
It's cloud! It's cloud! It's new, it's shiny, it's cheap!
It's cloud! It's cloud! It's down, and now you'll weep.
Everything's in the cloud! You're gonna love it, cloud!
Outsource it to the cloud! Everyone needs a cloud!
Cloud! It goes blammo!
Fools, I have this: http://en.wikipedia.org/wiki/Caproni_Ca.4
Bow down before the three wings and two engines. Seats are going fast, order a spot today.
It is a sad joke. Even for sites like Reddit whose administrators are supposed to know better, the Amazon shit hit. And the terrible thing is that it is not the first time that Amazon's service has broken, this has happened quite a lot in the last months, and people still *pay* for the service. Crazy.
Ubuntu is an African word meaning 'I can't configure Debian'
8:54 AM PDT We'd like to provide additional color on what we're working on right now (please note that we always know more and understand issues better after we fully recover and dive deep into the post mortem). A networking event early this morning triggered a large amount of re-mirroring of EBS volumes in US-EAST-1. This re-mirroring created a shortage of capacity in one of the US-EAST-1 Availability Zones, which impacted new EBS volume creation as well as the pace with which we could re-mirror and recover affected EBS volumes. Additionally, one of our internal control planes for EBS has become inundated such that it's difficult to create new EBS volumes and EBS backed instances. We are working as quickly as possible to add capacity to that one Availability Zone to speed up the re-mirroring, and working to restore the control plane issue. We're starting to see progress on these efforts, but are not there yet. We will continue to provide updates when we have them.
So the engineers failed to foresee a potential hazard. Hardly something to get worked up about, especially for a relatively young technology.
Downtime comes from people. The more people involved, the more downtime you'll have.
I don't necessarily hate the marketing concept of 'The Cloud', but I am fascinated by the business decisions and risk acceptance that organisations are willing to take. ie- the typical: "Demanding high availability and hot failover, instantaneous incident resolution, and 'we are your primary customer'... but also a low cost." I think that Amazon and their competitors *may* get there with their offerings, but until there is a bit more maturity, I expect to see more incidents like this.
My wild guess is that a change triggered this, which of course leads to why has the backout plan failed (and who signed off on the risk)? I can't imagine that this is not change related - otherwise there is a serious architectural design flaw here somewhere.
Amazon and Microsoft have two distinctly different views of "cloud computing."
When I first learned about "cloud computing" I automatically assumed it meant that there would be an arbitrary number of different services available to an arbitrary number of web servers which would then be served to the user. No one service would depend on the other.
Amazon's "cloud computing" is centralized upon the virtual machine as the hub of the "cloud." Microsoft Azure, on the other hand, originally offered the approach that I had thought about, where everything is just a service, no VM required.
Today Amazon still depends heavily on the VM concept. You can't have a web service on Amazon without one. This also makes it excessively difficult to "load balance" or provide "failover" because you are actually expected to stand up new VM instances to scale up and down and need separate VM instances on each "availability zone." In addition it's not easy or affordable to share data between availability zones. This isn't what I thought the cloud was going to be.
Microsoft eventually added VMs to its Azure service so they could compete with Amazon's VM-centralized concept. I still think the idea of separate, independent services talking to each other was what the "cloud" was supposed to be, and if these services didn't have to depend on these VMs (which they do not have access to because AWS is intermittently down) they would have still been working from the other data centers.
Kriston
As what I would consider a medium-weight AWS user (our account is about 4 grand a month) I am still quite happy with AWS. We built our system across multiple availability zones, all in us-east, and had zero downtime today as a result. We had a couple of issues where we tried to scale up to meet load levels and couldn't spin up anything in us-east-1a (or if we could, we couldn't attach it successfully to a load balancer because of internal connectivity issues), but we spun up a new instance in us-east-1b, attached it completely fine, and were able to handle the load just fine. The load balancers worked as expected (and hoped for) and the segregation of issues between availability zones was fairly successful.
I think that fixing these issues is just as high a priority for Amazon as it would be for any internal IT infrastructure, so I don't give much credence to the arguments that having your own servers and your own internal IT team would truly solve the problem any more effectively: I think it just gives you more of an illusion of control because you can see that you're working on it, as opposed to trusting that Amazon is working on it.
If there is any AWS lesson to be taken away from this it is that:
1) EBS may not be ready for prime time - most of our servers are instance-store anyway, both for performance reasons and for other reliability problems we have had in the past.
2) You should keep your server templates set up as up-to-date AMIs so you can deploy across any availability zone you want at any time you want. Right now, we have our load balancer attachment configuration all scripted as well, so spinning up new instances to feed a cluster is a single CLI execution with us specifying the availability zones.
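Point 2 boils down to "try the next zone when one refuses". Here is a minimal sketch of that loop, with loud caveats: `launch_with_fallback`, `LaunchError`, `fake_launch`, and the image ID `ami-example` are all made up for illustration, and the launcher is a stand-in rather than any real AWS call.

```python
class LaunchError(Exception):
    """Raised by the hypothetical launcher when a zone cannot take the instance."""


def launch_with_fallback(launch, image_id, zones):
    """Try each zone in order; return (zone, instance) for the first success."""
    failures = {}
    for zone in zones:
        try:
            return zone, launch(image_id, zone)
        except LaunchError as exc:
            failures[zone] = str(exc)  # remember why this zone failed
    raise LaunchError(f"all zones failed: {failures}")


def fake_launch(image_id, zone):
    """Stand-in launcher that simulates trouble in us-east-1a only."""
    if zone == "us-east-1a":
        raise LaunchError("cannot attach to load balancer")
    return f"{image_id}@{zone}"


zone, instance = launch_with_fallback(
    fake_launch, "ami-example", ["us-east-1a", "us-east-1b"]
)
print(zone, instance)
```

This only helps if the image is actually available in every zone on the list, which is the reason for keeping the AMIs up to date.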
Check out http://perfcap.blogspot.com/2011/03/understanding-and-using-amazon-ebs.html for a nice explanation of some of the issues you may come across with EBS and the internals of why.
Overall, I still give Amazon a good rating. This was a major outage and we felt barely a hiccup.
If you can't say something nice, make sure you have something heavy to throw.
I have studied high availability systems and I have developed some. It's not too difficult to come up with a decent design that should guarantee extremely high availability. However, there always will remain assumptions and external factors which influence your system. The most tedious bits are systems and components that "never" fail (like simple NICs) for which you will not get any attention whatsoever.
100% availability is a myth. I know of a case where IBM's zero downtime operating system (z/OS) went down for a whole day, causing huge losses. The shutdown was allegedly caused by a cleaner accidentally disabling one power supply. Which essentially shouldn't have mattered... But it did. The failover switch was blocked by a relatively small number of disk writes which had not been fully committed to the failover site, and hence the failover could not commence.
You should know that IBM charges huge amounts of money and claims zero downtime, yet in the end doesn't deliver on that. Quite absurd.
But, IMHO, bashing IBM for not delivering zero downtime isn't fair. The IBM customer in question should have conducted its own HA analysis but instead probably settled for a few words on "IBM" and "zero downtime". The customer really should have known.
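The arithmetic behind "100% is a myth" is short enough to write down. The figures below are invented for illustration and the formulas assume failures are independent, which the story above shows is a generous assumption: redundant parts combine in parallel, and anything they both rely on (here, a hypothetical replication step) sits in series with them.

```python
from math import prod


def parallel(*availabilities):
    """Redundant components: the group is up if at least one member is up."""
    return 1 - prod(1 - a for a in availabilities)


def series(*availabilities):
    """Chained components: the system is up only if every member is up."""
    return prod(availabilities)


# Made-up numbers: two sites at 99% each, plus a shared step at 99.9%.
primary_and_failover = parallel(0.99, 0.99)
whole_system = series(primary_and_failover, 0.999)

print(f"sites alone: {primary_and_failover:.6f}")
print(f"with shared step: {whole_system:.6f}")
```

However many sites are added in parallel, the total never reaches 1, and it can never beat the availability of the shared step.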
I hadn't the slightest objection to his spending his time planning massacres for the bourgeoisie... (P.G. Wodehouse)
and only heard, "Cloud, cloud, cloud! It's new and shiny and cheaper than those annoying internal IT guys so I get a bonus!" learn to pay the stupidity tax.
Next up, learning just how *much* of your cloud data has been stolen and resold by those trustworthy souls in China and India.
Cheers!
As a developer from India, I can tell you India has absolutely nothing to do with the cloud.. that's a US revolution.. we are still years behind..
The last person to mod me down is a rotten egg..... there.. that should do it..
In that case the downtime wouldn't be the biggest of my problems.
Well, I might have a way, but it only works on a semi spherical planet in a vacuum.
"For cloud customers willing to pony up a little extra cash, Amazon has an enticing proposition: Spread your application across multiple availability zones for a near-guarantee that it won't suffer from downtime
I would have thought the entire raison d'etre of moving to the Cloud was to eliminate downtime, else why not rent two boxes in different locations and achieve this near-guarantee of uptime without the extra expense, not to mention your data totally disappearing when the Cloud goes down ...
`Amazon's "cloud computing" is centralized upon the virtual machine as the hub of the "cloud."'
I think you hit the nail on the head there: a centralized anything is always vulnerable to this kind of failure. For a business with multiple locations, a number of servers sited locally in a peer-to-peer configuration would provide a more reliable service. All they rely on is an end-to-end IP connection. If one site goes then the rest can carry on. I do believe this whole cloud computing concept has been oversold.
... the "Bleeding Edge" for nothing. New technology always comes with teething problems.
Canadian Host?
http://www.youtube.com/watch?v=LAYMJnO9LBQ
- Dan.
~ People that think they are better than anyone else for any reason are the cause of all the strife in the world.
this reminds me of pynchon lyrics... it's perfectly in the spirit.