Amazon EC2 Failure Post-Mortem
CPE1704TKS tips news that Amazon has provided a post-mortem on why EC2 failed. Quoting:
"At 12:47 AM PDT on April 21st, a network change was performed as part of our normal AWS scaling activities in a single Availability Zone in the US East Region. The configuration change was to upgrade the capacity of the primary network. During the change, one of the standard steps is to shift traffic off of one of the redundant routers in the primary EBS network to allow the upgrade to happen. The traffic shift was executed incorrectly and rather than routing the traffic to the other router on the primary network, the traffic was routed onto the lower capacity redundant EBS network. For a portion of the EBS cluster in the affected Availability Zone, this meant that they did not have a functioning primary or secondary network because traffic was purposely shifted away from the primary network and the secondary network couldn't handle the traffic level it was receiving."
But can I get an understandable car analogy here?
That only explains the loss in availability of the AWS service. It in no way explains why the data was destroyed and is unrecoverable.
... to be able to handle loads if the primary fails?
No one. No one else remembers AOL.
Dear AWS Customer,
Starting at 12:47AM PDT on April 21st, there was a service disruption (for a period of a few hours up to a few days) for Amazon EC2 and Amazon RDS that primarily involved a subset of the Amazon Elastic Block Store ("EBS") volumes in a single Availability Zone within our US East Region. You can read our detailed summary of the event here:
http://aws.amazon.com/message/65648
We've identified that you had an attached EBS volume or a running RDS database instance in the affected Availability Zone at the time of the disruption. Regardless of whether your resources and application were impacted, we are going to provide a 10 day credit (for the period 4/18-4/27) equal to 100% of your usage of EBS Volumes, EC2 Instances and RDS database instances that were running in the affected Availability Zone. This credit will be automatically applied to your April bill, and you don't need to do anything to receive it.
You can see your service credit by logging into your AWS Account Activity page after you receive your upcoming billing statement.
Last, but certainly not least, we want to apologize. We know how critical the services we provide are to our customers' businesses and we will do everything we can to learn from this event and use it to drive improvement across our services.
Sincerely,
The Amazon Web Services Team
This message was produced and distributed by Amazon Web Services, LLC, 410 Terry Avenue North, Seattle, Washington 98109-5210
Kriston
"Last Thursday’s Amazon EC2 outage was the worst in cloud computing’s history .. I will try to summarize what happened, what worked and didn’t work, and what to learn from it. I’ll do my best to add signal to all the noise out there" link
So we now know that the promise of the cloud is a lie. How long before we get a new buzzword for turning over all of our data to the new Internet barons because they know what is best?
I commend Amazon for providing us with this information. Yes, bad things happened, and data is gone forever. Amazon knows what happened and why, and I'm sure they will implement controls to prevent this again. I doubt we'll hear as much from Sony, though.
What is an EBS? Is it really just a Xen or VMWare disk image? Which data center corresponds with each availability zone? What are they using for storage iSCSI targets on a SAN?
Kudos to Amazon for rapidly explaining, at length, what happened.
Unlike some other company... *cough* Sony *cough*
English is not this
But can I get an understandable car analogy here?
15 cars tried to transform into Voltron but instead turned into Snarf.
I8-D
And HOW THE HELL does such a procedure cause data loss?!
Are those geniuses using the service transfer procedures that do not perform clean transaction handling and instead just send stuff to be copied expecting that it will sync soon enough?
Contrary to the popular belief, there indeed is no God.
I'm trying to remember what the other outage was recently where the web service failed because they forgot to implement exponential backoff. Anyone remember?
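For anyone who hasn't run into the term, here is a minimal sketch of exponential backoff with jitter. It is a generic illustration, not the retry logic of any service mentioned in this thread; the names `backoff_delays` and `call_with_backoff` and the default values are made up for the example.

```python
import random
import time


def backoff_delays(base=1.0, cap=60.0, attempts=6, rng=random.random):
    """Yield one wait time (in seconds) per retry.

    The upper bound doubles on every attempt until it reaches `cap`.
    Each wait is a random fraction of that bound ("full jitter"), so
    many clients retrying at once do not all hit the server together.
    """
    for attempt in range(attempts):
        ceiling = min(cap, base * 2 ** attempt)
        yield rng() * ceiling


def call_with_backoff(operation, sleep=time.sleep, **kwargs):
    """Call `operation` until it succeeds or the retries run out.

    Only ConnectionError is treated as retryable in this sketch.
    """
    last_error = RuntimeError("no attempts were made")
    for delay in backoff_delays(**kwargs):
        try:
            return operation()
        except ConnectionError as err:
            last_error = err
            sleep(delay)
    raise last_error
```

Without the growing delay, every failed client retries immediately, and the retries themselves keep the struggling service down.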
Sony hasn't fixed their issue. Kind of hard to have a post mortem while the solution is still ongoing. There has been plenty of extrapolation and bullshit in the information vacuum surrounding the attack though. So when things return to normality it would be in their interest to provide a decent technical overview of what happened, the safeguards that were there before, why they failed and what steps have been made since to improve things.
It was good that they were forthcoming, as competitors are both breathing down their necks, and also looking at their own infrastructure for possible race conditions that would crater post-failure storage isolation(s).
They also admitted but don't seem to get the message that their focus has been on developing novel customer solutions-- NOT keeping the core infrastructure bulletproof. Loose-and-fast rather than unrelenting QA will cause Amazon a lot of pain; it'll be hard to trust them until they can prove their infrastructure and multi-zone storage architecture and clustered instances work together given a broad spectrum of failure modes.
In English: they took their eye off the ball because the sales department distracted core QA functionality-- and it blew up, and badly, and expensively.
---- Teach Peace. It's Cheaper Than War.
Is Sony dead yet?
Or Google...
During the whole issue they never posted a cause, and it took them forever to even say 'still investigating'. Even if they have a bare bones monitoring system up, it should have been readily apparent that traffic was flowing over the wrong network.
So they're basically saying if the primary network has issues there's not really a point in the backup, because the backup network will make things explode just as much as having no backup.
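The capacity point can be put as a pre-flight check: before shifting traffic, confirm the target network can carry it with some headroom. This is only a sketch of the idea. The function name, the 20% headroom, and the Gbps figures are hypothetical and do not come from Amazon's post-mortem.

```python
def safe_to_shift(shifted_load_gbps, target_capacity_gbps,
                  target_load_gbps=0.0, headroom=0.2):
    """Return True if the target network can absorb the shifted traffic.

    `headroom` is the fraction of the target's capacity kept in reserve,
    so the check fails before the link is completely saturated.
    """
    usable = target_capacity_gbps * (1 - headroom)
    return target_load_gbps + shifted_load_gbps <= usable


# Hypothetical numbers: 40 Gbps of traffic to move.
print(safe_to_shift(40, 100))  # redundant router on the primary network
print(safe_to_shift(40, 10))   # lower capacity secondary network
```

With these made-up figures the first check passes and the second fails, which is the situation the summary describes: the traffic landed on a network that was never sized to carry it.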
Your hair look like poop, Bob! - Wanker.
If this was the cause, why wasn't the change corrected immediately and the traffic routed to where it was originally intended? 3 days of downtime just doesn't happen when you fuck up a line in a config. If this was actually the case the downtime would have been minimal.
They don't care, they are making too much money off of spammers and script kiddies to worry about reliability. Blocking their ip ranges reduced attempts on my servers by a significant percentage, and their abuse involves asking the customer what happened.... it is pretty clear what happened is they were running a spam network for xyz erection pills; cut them off already. I have a list of about 7 hosting companies that if they could be disconnected from their peers internet spam and related sites would plummet within a week. Amazon is on that list. Oh, can't beat the captcha? Pay turkers 5 cents to fill them out for you...
Get a web developer
Not that would admit it in public. ;-)
It's nice to see that everyone has the same problem:
There is no approach to identify wrong assumptions.
But what's the conclusion?
Should we stay away from huge systems, because the damage due to a wrong assumption in a huge system is huge?
Huh. Sounds like a 21st century version of the routing failure that caused the 1965 Northeast blackout, just with data instead of electricity.
http://en.wikipedia.org/wiki/Northeast_Blackout_of_1965
Cloud computing is a marketing architecture, not a technical architecture.
Cloud computing is a form of shared hosting, just with more encapsulation; clouds fall over the same way a server can fall over. It's hard to blame "The Cloud" when the reality is the people that were suckered in by obtuse, non-specific marketing are the ones at fault. The argument can even be made that clouds are worse because instead of many discrete isolated servers you start sharing more single points of failure, which lead to IO bottlenecks, etc.
Have you ever worked in a real environment?
There is ALWAYS a difference between test and production. No matter how many test cases and iterations of changes that you go through, there is always a non-zero percent chance that the change in production will behave differently.
This is why most companies require fall-back procedures for any production change in addition to testing.
It sounds like it may have taken them longer than some might be comfortable with to reach the point where they did roll back changes...but I'm sure that this change tested as okay in all of their test cases.
In thinking about why this happened, don't lose sight of the fact that the time they chose to make the configuration change was 00:47 local. Human performance on 3rd shift isn't what it is on day shift, and I would think it very likely the people managing this change had been up and working for a significant number of hours at that time. Would they have noticed something or done something differently at 10:00 local? Certainly making an upgrade at a time of lowest use sounds right, but it's not always as simple as that, and you have to respect the realities of circadian rhythms or suffer the consequences. If this were an air crash, we would now be interviewing survivors, coworkers and family to identify when each of the participants in the event and the decisions made had slept during the days preceding the event.
The nuclear industry claims a chance of major accidents around 1 in 10^7 reactor years, based on this kind of probabilistic analysis. But then we've seen 2 major incidents at western-style nuclear plants (Three Mile Island and Fukushima Daiichi) over a period of about 15,000 reactor-years. The problem is, these studies only account for the risk of simultaneous failures of pre-identified critical components within the engineered system. They don't account for acts of nature or people doing something dumb.
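To put numbers on that gap, here is a back-of-the-envelope check using only the figures quoted above (1 in 10^7 per reactor-year, 2 incidents, about 15,000 reactor-years). Treating accidents as a Poisson process with a constant rate is an assumption made for the sketch, and the function name is made up.

```python
import math


def compare_claim(claimed_rate=1e-7, reactor_years=15_000, observed=2):
    """Compare a claimed accident rate with the observed record.

    Assumes accidents follow a Poisson process with a constant rate,
    which is a simplification made for this sketch.
    """
    expected = claimed_rate * reactor_years      # accidents the claim predicts
    empirical_rate = observed / reactor_years    # accidents per reactor-year seen
    ratio = empirical_rate / claimed_rate
    # P(at least 2 events) = 1 - P(0) - P(1) for a Poisson mean of `expected`
    p_two_or_more = 1 - math.exp(-expected) * (1 + expected)
    return expected, empirical_rate, ratio, p_two_or_more


expected, empirical_rate, ratio, p = compare_claim()
print(f"expected accidents under the claim: {expected:.4f}")
print(f"observed rate: {empirical_rate:.2e} per reactor-year")
print(f"observed / claimed: {ratio:.0f}x")
print(f"chance of 2+ accidents if the claim held: {p:.2e}")
```

Under the claimed rate you would expect about 0.0015 accidents over that period, and seeing two or more would be roughly a one-in-a-million outcome. The observed rate is over a thousand times the claimed one.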
Hollywood has got to turn this into a movie ...
I'd be first in line to buy a ticket
I was a quite happy Compuserve user until AOL took it over and destroyed it. (OK, CIS was in slow decline at the time too. But that decline became a nose dive when AOL took over.)
There - what was difficult or embarrassing about that?
Birds are not dinosaur descendants;birds are dinosaurs, for all useful meanings of "birds", "are" and "dinosaurs"