Amazon Outage Shows Limits of Failover 'Zones'
jbrodkin writes "For cloud customers willing to pony up a little extra cash, Amazon has an enticing proposition: Spread your application across multiple availability zones for a near-guarantee that it won't suffer from downtime. 'By launching instances in separate Availability Zones, you can protect your applications from failure of a single location,' Amazon says in pitching its Elastic Compute Cloud service. But the availability zones are close together and can fail at the same time, as we saw today. The outage and ongoing attempts to restore service call into question the effectiveness of the availability zones, and put a spotlight on Amazon's failure to provide load balancing between the east and west coasts."
You can sail on my new ship. It's got redundant hulls, a few feet apart.
Yay, cloud. http://www.youtube.com/watch?v=Lel3swo4RMc
This space for rent.
Amazon should put their cloud in a cloud, so the cloud will have the redundancy of the cloud.
Oh, you wanted failover for your web site! I thought you meant failure!
Not ready for the desktop ;-)
http://www.stopacop.so -- You have rights. How about standing up for them before they go away?
In Soviet Russia failover zones you.
I lose access to TWC's DNS servers regularly (yes, I will be setting up my own, when it becomes annoying enough). Although you can do quick-and-dirty load balancing by setting them up as follows, there's no redundancy for the customers when there's a link failure.
search socal.rr.com
nameserver 209.18.47.61
nameserver 209.18.47.62
or use a completely different company for redundancy. I think that is the lesson here.
I'll take the philosophical point of view on this and say failures are the best way to find and diagnose systemic weaknesses. Now Amazon knows the weakness in the AZs and can fix it.
How about one zone in Europe and one in North America? Isn't that better?
and only heard, "Cloud, cloud, cloud! It's new and shiny and cheaper than those annoying internal IT guys so I get a bonus!" learn to pay the stupidity tax.
Next up, learning just how *much* of your cloud data has been stolen and resold by those trustworthy souls in China and India.
Cheers!
Please do not read this sig. Thank you.
I have a fully redundant cloud. It's got all the best and greatest technologies, and I have it in this box... whoops, I just dropped it. Having your backup (cloud apps or files) in the same datacenter is a lot like backing up your data to a separate partition on the same hard drive. I'm looking at you, Time Machine.
maybe the failure was on purpose to promote another revenue stream. Hmmmmmmmmm......................
Do you think Amazon would allow its own sales and services to be impacted for 12 hours (and running) under any circumstances short of the recent disaster in Japan? EC2 customers, on the other hand, appear to be second-class citizens.
Outsource IT and you outsource responsibility as well. If your own department fucks up, the top brass will come looking for you. However, if you outsource and the service provider messes up, you can shift the blame to them, especially in case of big disasters like these. As long as you can show that you've managed the SLAs well and that it's them who didn't keep to their promises, you're good. More likely you'll find that those SLAs were crap to begin with, which is also fine, because it's likely your boss and his boss signed off on the deal as well. Pass the buck...
If construction was anything like programming, an incorrectly fitted lock would bring down the entire building...
"Availability Zones", "Failure Domains", etc. must be done with absolute perfection if you do them at all. If your gargantuan application has some single tiny side-feature that is not replicated across domains, your whole app is going down.
True Story: I was doing some consulting work for a large bank after they had a bunch of problems. Their main website had all the super-available trimmings: Oracle RAC, multi-site server clustering, storage mirroring, all the fancy, expensive, highly-available crap you could ask for. This is all well and good except... some dinky stylesheet (or something like that) for the bank's homepage resided on some dinky 1U non-clustered fileserver. When it went down, the pages simply would not display. Whoops! All that grand effort was for nought because there was one "leakage" that killed the whole app.
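The bank story above boils down to one critical dependency living in only one place. A minimal sketch of that kind of audit, assuming a hand-maintained inventory; the `Dependency` type, its fields, and the inventory entries are all hypothetical and only loosely modelled on the anecdote:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dependency:
    name: str
    sites: int      # how many independent hosts/sites serve it (hypothetical field)
    critical: bool  # True if the page cannot render without it


def single_points_of_failure(deps):
    """Return the names of critical dependencies served from only one place."""
    return [d.name for d in deps if d.critical and d.sites < 2]


# Hypothetical inventory, for illustration only.
inventory = [
    Dependency("database cluster", sites=2, critical=True),
    Dependency("web servers", sites=2, critical=True),
    Dependency("homepage stylesheet host", sites=1, critical=True),
    Dependency("marketing banner feed", sites=1, critical=False),
]

print(single_points_of_failure(inventory))
```

The hard part in practice is building an honest inventory, not the check itself.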
This "i_want_you_to_throw_" has an overactive ego and wants to appear clever. "i_want_you_to_throw" should take DXM.
Yes, that's the idea, and if you don't like it, you won't go higher in your company.
Business is business. Accept it or work for a Charity.
You mean, a company hyped something as better than it really was, and too many people took it at face value? Oh, perish the thought! (sarcasm).
You're supposed to FAILOVER between them, not load balance between them.
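A rough sketch of the distinction being drawn here, assuming a simple health map per zone. The zone names and both helper functions are made up for illustration and are not any provider's API:

```python
import itertools


def pick_failover(zones, healthy):
    """Failover: return the first healthy zone in priority order."""
    for zone in zones:
        if healthy.get(zone, False):
            return zone
    raise RuntimeError("no healthy zone available")


def round_robin(zones, healthy):
    """Load balancing: return an endless iterator cycling over healthy zones."""
    live = [zone for zone in zones if healthy.get(zone, False)]
    if not live:
        raise RuntimeError("no healthy zone available")
    return itertools.cycle(live)


# Hypothetical zone names, for illustration only.
zones = ["zone-a", "zone-b"]

# Failover: the standby takes traffic only after the primary is marked down.
before = pick_failover(zones, {"zone-a": True, "zone-b": True})
after = pick_failover(zones, {"zone-a": False, "zone-b": True})
print(before, after)

# Load balancing: both zones take traffic all the time.
balancer = round_robin(zones, {"zone-a": True, "zone-b": True})
sequence = [next(balancer) for _ in range(4)]
print(sequence)
```

With failover the second zone sits idle until it is needed. With load balancing every zone is in the request path, so each one has to be able to absorb the others' share when something breaks.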
You can't hold amazon accountable for your own stupidity.
Beyond that, you have to ask yourself the question: how many outages would you have had with your own facility in the past year compared to this outage? Did you apply the same approach to your use of EC2 as you would to your own facility?
It's been like 7 years, how's everyone doing? :)
Change.Org says that for the past several days the Chinese have been DDoSing it over a petition they are posting to gather support for Ai WeiWei.
http://blog.change.org/2011/04/chinese-hackers-attack-change-org-platform-in-reaction-to-ai-weiwei-campaign/
But if you go to the Change.Org site to sign the petition, you get a message saying that something is wrong with their servers, which are at Amazon.
http://www.change.org/petitions/call-for-the-release-of-ai-weiwei
http://status.aws.amazon.com/
http://www.computerworld.com/s/article/9216064/Amazon_gets_black_eye_from_cloud_outage
Could Amazon's outage be the result of Chinese hackers?
C'mon. All managers love cloud!
What rolls down stairs, fails over in pairs,
Leaks data when it's allowed?
A stupidity tax, it replaces your racks,
It's cloud, cloud, cloud!
It's cloud! It's cloud! It's new, it's shiny, it's cheap!
It's cloud! It's cloud! It's down, and now you'll weep.
Everything's in the cloud! You're gonna love it, cloud!
Outsource it to the cloud! Everyone needs a cloud!
Cloud! It goes blammo!
It is a sad joke. Even sites like Reddit, whose administrators are supposed to know better, got hit by the Amazon shit. And the terrible thing is that it is not the first time that Amazon's service has broken; this has happened quite a lot in the last months, and people still *pay* for the service. Crazy.
Ubuntu is an African word meaning 'I can't configure Debian'
8:54 AM PDT We'd like to provide additional color on what we're working on right now (please note that we always know more and understand issues better after we fully recover and dive deep into the post mortem). A networking event early this morning triggered a large amount of re-mirroring of EBS volumes in US-EAST-1. This re-mirroring created a shortage of capacity in one of the US-EAST-1 Availability Zones, which impacted new EBS volume creation as well as the pace with which we could re-mirror and recover affected EBS volumes. Additionally, one of our internal control planes for EBS has become inundated such that it's difficult to create new EBS volumes and EBS backed instances. We are working as quickly as possible to add capacity to that one Availability Zone to speed up the re-mirroring, and working to restore the control plane issue. We're starting to see progress on these efforts, but are not there yet. We will continue to provide updates when we have them.
So the engineers failed to foresee a potential hazard. Hardly something to get worked up about, especially for a relatively young technology.
Downtime comes from people. The more people involved, the more downtime you'll have.
I don't necessarily hate the marketing concept of 'The Cloud', but I am fascinated by the business decisions and risk acceptance that organisations are willing to take. ie- the typical: "Demanding high availability and hot failover, instantaneous incident resolution, and 'we are your primary customer'... but also a low cost." I think that Amazon and their competitors *may* get there with their offerings, but until there is a bit more maturity, I expect to see more incidents like this.
My wild guess is that a change triggered this, which of course leads to why has the backout plan failed (and who signed off on the risk)? I can't imagine that this is not change related - otherwise there is a serious architectural design flaw here somewhere.
Switching from one data center to another should be no big deal. Amazon has proven that big things fall down and stumble trying to get back up. Why should I pay more? They fell on their own. I have a better idea: don't fuck up the first time when you promise the world.
What do you expect...when you buy a consumer grade product from a company that sells books for a living...you get what you pay for.
Any IT manager worth his salt would never ever put all their eggs in one basket (btw, Happy Easter).
Rarely do you ever see a medium to large enterprise completely outsource all operations to a single vendor, and if they do, they completely understand that provider’s infrastructure and related redundancy models both within and outside the data center.
I advise anyone that is scared to move into a cloud compute infrastructure because Barnes & Noble... oops, I mean Amazon... cannot properly design/operate an infrastructure: go look at real enterprise class cloud computing providers (IaaS/PaaS) and ask the top three for a 30 day demo (they will give it to you if you’re a legit business) and run some off the shelf benchmarks. You will find that Amazon, although cheap to turn up, will be beat hands down in both categories when you compare price and performance under load. This is not only my opinion/real world experience but also that of multiple Fortune 500 CIOs (more so their IT staff) that I have dealt with just in the last 12 months.
So in short, don’t listen to all these ignorant (not stupid, just ignorant) cloud bashing server huggers and do some research. There are some really good cloud infrastructure providers out there that may not take your Visa as payment, but will provide you with a much more robust, reliable and performance oriented infrastructure compared to your Barnes & Noble... oops, I did it again... I mean Amazon service.
BTW, I love Amazon as an Ecommerce site (books, electronics, music, etc.). I love my Prime membership!
Amazon and Microsoft have two distinctly different views of "cloud computing."
When I first learned about "cloud computing" I automatically assumed it meant that there would be an arbitrary number of different services available to an arbitrary number of web servers which would then be served to the user. No one service would depend on the other.
Amazon's "cloud computing" is centralized upon the virtual machine as the hub of the "cloud." Microsoft Azure, on the other hand, originally offered the approach that I had thought about, where everything is just a service, no VM required.
Today Amazon still depends heavily on the VM concept. You can't have a web service on Amazon without one. This also makes it excessively difficult to "load balance" or provide "failover" because you are actually expected to stand up new VM instances to scale up and down and need separate VM instances on each "availability zone." In addition it's not easy or affordable to share data between availability zones. This isn't what I thought the cloud was going to be.
Microsoft eventually added VMs to its Azure service so they could compete with Amazon's VM-centralized concept. I still think the idea of separate, independent services talking to each other was what the "cloud" was supposed to be, and if these services didn't have to depend on these VMs (which they do not have access to because AWS is intermittently down) they would have still been working from the other data centers.
Kriston
Does anyone else feel like this cloud computing stuff is more marketing-driven than technology-driven?
Our company (a very large financial services firm) started the cloud computing stuff a few years ago, and it's been such a huge hassle that they've actually rolled out local versions of applications and backup info. Cloud computing just isn't reliable enough yet (ever?) to store/run mission-critical software. Is it? Anyone have a perspective here?
As what I would consider a medium-weight AWS user (our account is about 4 grand a month) I am still quite happy with AWS. We built our system across multiple availability zones, all in us-east, and had zero downtime today as a result. We had a couple of issues where we tried to scale up to meet load levels and couldn't spin up anything in us-east-1a (or if we could, we couldn't attach it successfully to a load balancer because of internal connectivity issues), but we spun up a new instance in us-east-1b, attached it completely fine, and were able to handle the load just fine. The load balancers worked as expected (and hoped for) and the segregation of issues between availability zones was fairly successful.
I think that fixing these issues is just as high a priority for Amazon as it would be for any internal IT team, so I don't give much credence to the arguments that having your own servers and your own internal IT team would truly solve the problem any more effectively: I think it just gives you more of an illusion of control because you can see that you're working on it, as opposed to trusting that Amazon is working on it.
If there is any AWS lesson to be taken away from this it is that:
1) EBS may not be ready for prime time - most of our servers are instance-store anyway, both for performance reasons and for other reliability problems we have had in the past.
2) You should keep your server templates set up as up-to-date AMIs so you can deploy across any availability zone you want at any time you want. Right now, we have our load balancer attachment configuration all scripted as well, so spinning up new instances to feed a cluster is a single CLI execution with us specifying the availability zones.
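As a rough illustration of the "specify the availability zones at launch time" idea in point 2, here is a sketch of the planning step only. `plan_launches` is a hypothetical helper, `us-east-1c` is an assumed extra zone name, and nothing here calls AWS; the actual launch and load balancer attachment would be done by whatever tooling you already script.

```python
def plan_launches(count, zones, impaired=()):
    """Spread `count` new instances evenly over the zones not marked impaired."""
    usable = [zone for zone in zones if zone not in impaired]
    if not usable:
        raise RuntimeError("no usable availability zone")
    plan = {zone: 0 for zone in usable}
    for i in range(count):
        plan[usable[i % len(usable)]] += 1
    return plan


# Zone list is an assumption; us-east-1a is treated as impaired, as in the comment.
zones = ["us-east-1a", "us-east-1b", "us-east-1c"]
plan = plan_launches(5, zones, impaired={"us-east-1a"})
print(plan)
```

Keeping the impaired set as an argument means the same script works on a normal day and during an outage.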
Check out http://perfcap.blogspot.com/2011/03/understanding-and-using-amazon-ebs.html for a nice explanation of some of the issues you may come across with EBS and the internals of why.
Overall, I still give Amazon a good rating. This was a major outage and we felt barely a hiccup.
If you can't say something nice, make sure you have something heavy to throw.
I have studied high availability systems and I have developed some. It's not too difficult to come up with a decent design that should guarantee extremely high availability. However, there always will remain assumptions and external factors which influence your system. The most tedious bits are systems and components that "never" fail (like simple NICs) for which you will not get any attention whatsoever.
100% availability is a myth. I know of a case where IBM's zero downtime operating system (z/OS) went down for a whole day, causing huge losses. The shutdown was allegedly caused by a cleaner accidentally disabling one power supply. Which essentially shouldn't have mattered... But it did. The failover was prevented by a relatively small number of disk writes which were not fully committed to the failover site, and hence the failover could not commence.
You should know that IBM charges huge amounts of money, claims zero downtime, and eventually doesn't deliver on that. Quite absurd.
But, IMHO, bashing IBM for not delivering zero downtime isn't fair. IBM's customer at hand should have conducted its own HA analysis but instead probably opted for a few words on "IBM" and "zero downtime". The customer really should have known.
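The uncommitted-writes detail in the story above is the kind of precondition an HA analysis should surface. A minimal sketch, assuming writes are tracked by a monotonically increasing sequence number; the function names and the `allow_data_loss` switch are hypothetical and do not describe how z/OS or any IBM product actually works:

```python
def pending_writes(primary_committed, replica_applied):
    """Number of writes committed on the primary but not yet applied on the replica."""
    return max(0, primary_committed - replica_applied)


def can_fail_over(primary_committed, replica_applied, allow_data_loss=False):
    """Allow failover only when the replica is caught up, unless loss is accepted."""
    lag = pending_writes(primary_committed, replica_applied)
    return lag == 0 or allow_data_loss


# Hypothetical sequence numbers: the replica is 12 writes behind.
print(pending_writes(1000, 988))
print(can_fail_over(1000, 988))
print(can_fail_over(1000, 988, allow_data_loss=True))
```

Whether to block the failover or to accept the lost writes is a business decision that is better made before the outage than during it.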
I hadn't the slightest objection to his spending his time planning massacres for the bourgeoisie... (P.G. Wodehouse)
and only heard, "Cloud, cloud, cloud! It's new and shiny and cheaper than those annoying internal IT guys so I get a bonus!" learn to pay the stupidity tax.
Next up, learning just how *much* of your cloud data has been stolen and resold by those trustworthy souls in China and India.
Cheers!
As a developer from India, I can tell you India has absolutely nothing to do with the cloud.. that's a US revolution.. we are still years behind..
The last person to mod me down is a rotten egg..... there.. that should do it..
Your modified Nickelodeon song amuses me and I would like to subscribe to your newsletter.
"For cloud customers willing to pony up a little extra cash, Amazon has an enticing proposition: Spread your application across multiple availability zones for a near-guarantee that it won't suffer from downtime
I would have thought the entire raison d'etre of moving to the Cloud was to eliminate downtime; else why not rent two boxes in different locations and achieve this near-guarantee of uptime without the extra expense, not to mention your data totally disappearing when the Cloud goes down ...
`Amazon's "cloud computing" is centralized upon the virtual machine as the hub of the "cloud."'
I think you hit the nail on the head there: a centralized anything is always vulnerable to this kind of failure. For a business with multiple locations, a number of servers sited locally in a peer-to-peer configuration would provide a more reliable service. All they rely on is an end-to-end IP connection. If one site goes then the rest can carry on. I do believe this whole cloud computing concept has been oversold.
... the "Bleeding Edge" for nothing. New technology always comes with teething problems.
It's hard to wrap one's head around the utter gall of the people who can take an epic failure and spin it into a way to make even more money from the customers hurt most by the failure. It demonstrates what business schools are actually teaching: short term profit is the only goal; pursue it relentlessly, regardless of all other factors.
this reminds me of pynchon lyrics... it's perfectly in the spirit.