Amazon EC2 Failure Post-Mortem

Oops by Anonymous Coward · 2011-04-29 00:56 · Score: 0

So basically they unleashed all the traffic on a poor little network that couldn't handle it. Somebody dun goofed...

Re:Oops by Whalou · 2011-04-29 01:38 · Score: 2

Kudos to Amazon for rapidly explaining, at length, what happened.

Unlike some other company... *cough* Sony *cough*

--
English is not this .sig mother tongue...
Re:Oops by DrXym · 2011-04-29 02:14 · Score: 2

Sony hasn't fixed their issue. Kind of hard to have a post mortem while the solution is still ongoing. There has been plenty of extrapolation and bullshit in the information vacuum surrounding the attack though. So when things return to normality it would be in their interest to provide a decent technical overview of what happened, the safeguards that were there before, why they failed and what steps have been made since to improve things.
Re:Oops by postbigbang · 2011-04-29 02:25 · Score: 1

It was good that they were forthcoming, as competitors are both breathing down their necks, and also looking at their own infrastructure for possible race conditions that would crater post-failure storage isolation(s).
They also admitted but don't seem to get the message that their focus has been on developing novel customer solutions-- NOT keeping the core infrastructure bulletproof. Loose-and-fast rather than unrelenting QA will cause Amazon a lot of pain; it'll be hard to trust them until they can prove their infrastructure and multi-zone storage architecture and clustered instances work together given a broad spectrum of failure modes.
In English: they took their eye off the ball because the sales department distracted core QA functionality-- and it blew up, and badly, and expensively.

--
---- Teach Peace. It's Cheaper Than War.
Re:Oops by atisss · 2011-04-29 02:32 · Score: 1

Is Sony dead yet?
Re:Oops by darth+dickinson · 2011-04-29 02:35 · Score: 1

Or Google...
Re:Oops by datapharmer · 2011-04-29 02:46 · Score: 1

They don't care, they are making too much money off of spammers and script kiddies to worry about reliability. Blocking their ip ranges reduced attempts on my servers by a significant percentage, and their abuse involves asking the customer what happened.... it is pretty clear what happened is they were running a spam network for xyz erection pills; cut them off already. I have a list of about 7 hosting companies that if they could be disconnected from their peers internet spam and related sites would plummet within a week. Amazon is on that list. Oh, can't beat the captcha? Pay turkers 5 cents to fill them out for you...

--
Get a web developer

I realise this is "News for Nerds"... by Haedrian · 2011-04-29 00:56 · Score: 3, Funny

But can I get an understandable car analogy here?

Re:I realise this is "News for Nerds"... by MagicM · 2011-04-29 00:59 · Score: 4, Informative

Instead of closing off one lane of highway for construction, they closed off all lanes and forced highway traffic to go through town. The roads in town weren't able to handle all the cars. Massive back-ups ensued.
Re:I realise this is "News for Nerds"... by Anonymous Coward · 2011-04-29 00:59 · Score: 0

Yeah, take the highway, you aren't welcome in this town.
Re:I realise this is "News for Nerds"... by Anonymous Coward · 2011-04-29 01:00 · Score: 0

Cars on a 2 lane highway were directed onto the sidewalk.
Re:I realise this is "News for Nerds"... by RealGene · 2011-04-29 01:02 · Score: 5, Funny

..and according to http://it.slashdot.org/story/11/04/29/0254215/Amazon-EC2-Crash-Caused-Data-Loss, the DPW mistakenly pushed some of the cars into the old abandoned quarry.

--
Mission: To provide products that consume time and energy as entertainingly as permitted by the laws of thermodynamics.
Re:I realise this is "News for Nerds"... by Anonymous Coward · 2011-04-29 01:04 · Score: 0

The bridge broke and all the cars fell into the river. None of them can be recovered.
Re:I realise this is "News for Nerds"... by kingsqueak · 2011-04-29 01:21 · Score: 4, Funny

Instead of the usual commuter rail line, we've had to do some maintenance causing us to provide a single Yugo as transport for the NY morning rush.
After packing 25 angry commuters into the Yugo we left a few hundred thousand stranded on the platform, ping-ponging between the parking lot and home, completely confused how they would get to work.
In addition to that, unfortunately the Yugo couldn't handle the added weight of the passengers and the leaf springs shattered all over the ground. So the 25 passengers we initially planned for were left trapped, to die, inside of the disabled Yugo. They all starved in the days it took us to realize the Yugo never left the station parking lot.
We are sorry for any inconvenience this may have caused and have upgraded to AAA Gold status to prevent any further future disruptions. This will ensure that at least 25 people will actually reach their destinations should this occur again, though they may need to ride on a flat-bed to get there.
Re:I realise this is "News for Nerds"... by Anonymous Coward · 2011-04-29 01:27 · Score: 1

What I've learned from Pontifex and other bridge building games, it's not really that hard to build bridges. Common sense goes a long way in seeing how much it can handle. I don't know how it is with everyone, but I've always somehow "seen" in my head exactly how physics function. This made it great for me when playing baseball (not the US version), because I could directly see when you should hit the ball, how hard and where it's going to land. Same thing when I was catching the balls.
Re:I realise this is "News for Nerds"... by kriston · 2011-04-29 01:31 · Score: 1

That analogy is just like Irwin Allen's movie "The Towering Inferno."

--
Kriston
Re:I realise this is "News for Nerds"... by operator_error · 2011-04-29 01:31 · Score: 2

A classic Dilbert might be useful here:
http://dilbert.com/strips/comic/1995-02-26/
Re:I realise this is "News for Nerds"... by Xserv · 2011-04-29 01:49 · Score: 1

I think I just peed in my pants. My cubemates (we lovingly refer to each other as cellmates) poked their heads around walls wondering what I thought was so funny. Bravo!

--
"I love lamp."
Re:I realise this is "News for Nerds"... by Anonymous Coward · 2011-04-29 01:55 · Score: 0

So the intertubes *are* like a big truck...
Re:I realise this is "News for Nerds"... by yakovlev · 2011-04-29 02:01 · Score: 2

Traffic was diverted from a major highway onto a 2-lane road. This caused the buses to run late.

Because the buses were running late, everyone decided to take their own car to work. This further increased the amount of traffic on the tiny road.

The cops figured out that everyone was on the wrong road, and diverted traffic onto another freeway. However, by this point everyone was already taking their cars, so diverting to the other freeway didn't completely fix the problem.

All this traffic indirectly caused minor traffic problems in neighboring cities, because all the traffic cops in those cities were covering the traffic nightmare in this city.

Eventually, the cops got everyone to stop getting on the roads, and piecemeal managed to get people where they were going, which eventually cleaned things up.
Re:I realise this is "News for Nerds"... by BigSlowTarget · 2011-04-29 02:10 · Score: 1

Damn that's close. It's freaky how almost anything can be expressed as a car analogy.
Re:I realise this is "News for Nerds"... by A+Big+Gnu+Thrush · 2011-04-29 02:11 · Score: 1

There's so much wrong with this post, I don't even know where to begin, but what I really want to know is... ...there's a non-US version of baseball?!?
Re:I realise this is "News for Nerds"... by davidbrit2 · 2011-04-29 02:26 · Score: 1

Yeah. It's kind of like how you can bolt just about any part onto just about any car if you think it through enough.
Re:I realise this is "News for Nerds"... by atisss · 2011-04-29 02:37 · Score: 1

mod parent up
Re:I realise this is "News for Nerds"... by Anonymous Coward · 2011-04-29 02:45 · Score: 0

I've always somehow "seen" in my head exactly how physics function.
Cool. Maybe you could clear up a small matter for the rest of us? It's called the "Theory of Everything", and certainly someone that sees exactly how "physics function" [sic] shouldn't have any problem.
Re:I realise this is "News for Nerds"... by datapharmer · 2011-04-29 02:49 · Score: 1

Thank you, wondering that myself. I know it is played other places, but aren't the rules very similar??

--
Get a web developer
Re:I realise this is "News for Nerds"... by steelfood · 2011-04-29 03:23 · Score: 1

You mean they shut down the tubes and shit got clogged?

--
"If a nation expects to be ignorant and free in a state of civilization, it expects what never was and never will be."
Re:I realise this is "News for Nerds"... by x*yy*x · 2011-04-29 04:09 · Score: 1

h ttp://en.wikipedia.org/wiki/Pesäpallo (slashcode breaks the link so copypaste)
Re:I realise this is "News for Nerds"... by by+(1706743) · 2011-04-29 06:32 · Score: 1

Indeed -- such visionary powers would've come in handy on some of my old quantum psets!

I suspect the comment was referring to simple classical mechanical systems. I do find it fascinating that on a windy day at the beach, I can throw a baseball well over a hundred feet and have the other person catch it without needing to move (and I'm not particularly coordinated). Granted, with such lax accuracy, the relevant calculations aren't too tricky, but I still find it neat that humans (and other animals) have such a good intuitive sense of (classical) mechanics.
Re:I realise this is "News for Nerds"... by Caerdwyn · 2011-04-29 09:56 · Score: 1

Stickers make it go faster and gets girls.
Red paint makes it go faster and gets girls.
Neon lights and spinner hubcaps make it go faster and gets girls.
Blue lenses to make your headlights into fakey HID lights make it go faster and gets girls.
Slicked-back hair and sunglasses at night make it go faster and gets girls.
Chopping off two coils from the factory springs makes it go faster and gets girls.
Wings and spoilers and air dams and side skirts make it go faster and gets girls.
Replacing a hood latch with posts-and-cotter-pins makes it go faster and gets girls.
I'm ready for the aftermarket, the thinker's marketplace!

--
Everybody gets what the majority deserves.
Re:I realise this is "News for Nerds"... by man_of_mr_e · 2011-04-29 11:24 · Score: 1

Yes, and I learned so much about city planning from Sim City. I should put that on my resume and apply for a civil engineering job.
Come off it. You can't just look at a piece of metal and know what its tensile strength is, how much load it can withstand, or how load over time will affect it. You have to know the composition of the metal, and all the varying factors that affect it to calculate those things.

--
If you need web hosting, you could do worse than here
Re:I realise this is "News for Nerds"... by Anonymous Coward · 2011-04-29 11:50 · Score: 0

was any pr0n lost? oh the humanity!
Re:I realise this is "News for Nerds"... by BranMan · 2011-05-02 03:45 · Score: 1

Like a JATO rocket engine. Hey! That's the ticket!

That doesnt explain anything by Anonymous Coward · 2011-04-29 00:57 · Score: 1

That only explains the loss in availability of the AWS service. It in no way explains why the data is destroyed and unrecoverable

Re:That doesnt explain anything by ruiner13 · 2011-04-29 01:03 · Score: 1

It could be that in the process of isolating the problem, they rebooted servers that (due to network problems) may not have been able to fully replicate their local changes.

--
today is spelling optional day.
Re:That doesnt explain anything by mysidia · 2011-04-29 01:25 · Score: 0

It could be that in the process of isolating the problem, they rebooted servers that (due to network problems) may not have been able to fully replicate their local changes.
In other words.... someone executed an improper "problem isolation" procedure........
Re:That doesnt explain anything by Darth_brooks · 2011-04-29 01:33 · Score: 2

"At 12:30 PM PDT on April 24, we had finished the volumes that we could recover in this way and had recovered all but 1.04% of the affected volumes. At this point, the team began forensics on the remaining volumes which had suffered machine failure and for which we had not been able to take a snapshot. At 3:00 PM PDT, the team began restoring these. Ultimately, 0.07% of the volumes in the affected Availability Zone could not be restored for customers in a consistent state."

--
There are some people that if they don't know, you can't tell 'em.

AOL's 19-hour outage by kriston · 2011-04-29 01:03 · Score: 0

Whom else is reminded of AOL's 19-hour outage in 1996? Routers misconfigured to send data to the wrong place, cascading into failure?

--

Kriston

Re:AOL's 19-hour outage by Mister+Fright · 2011-04-29 01:15 · Score: 2

No one. No one else remembers AOL.
Re:AOL's 19-hour outage by Anonymous Coward · 2011-04-29 01:49 · Score: 0

Whom else is reminded of AOL's 19-hour outage in 1996? Routers misconfigured to send data to the wrong place, cascading into failure?
When you try to use a word like "whom" because you think it makes you sound smart, and then you use it incorrectly, the effect is quite the opposite.
Whomsoever doest thou thinkest thou art? Thy pedantry is showing.
Hey Kriston, is this really easier than admitting you fucked up?
Re:AOL's 19-hour outage by datapharmer · 2011-04-29 02:52 · Score: 1

Not that would admit it in public. ;-)

--
Get a web developer
Re:AOL's 19-hour outage by RockDoctor · 2011-04-30 05:35 · Score: 1

I remember AOL. I've still got some of the CDs propping up a broken cupboard somewhere. I remember them with a combination of mild loathing and some contempt.
I was a quite happy Compuserve user until AOL took it over and destroyed it. (OK, CIS was in slow decline at the time too. But that decline became a nose dive when AOL took over.)
There - what was difficult or embarrassing about that?

--
Birds are not dinosaur descendants;birds are dinosaurs, for all useful meanings of "birds", "are" and "dinosaurs"

Isn't the point of a secondary network... by gtvr · 2011-04-29 01:10 · Score: 1

... to be able to handle loads if the primary fails?

Re:Isn't the point of a secondary network... by Anonymous Coward · 2011-04-29 01:21 · Score: 1

I think the secondary network is used to deal with the little overage you get at peak times.
Like most of the time a T1 may be fine and can deal with bandwidth, but sometimes your backup ISDN comes up when the bandwidth is a little more than a T1 alone can hold.
Re:Isn't the point of a secondary network... by Anonymous Coward · 2011-04-29 01:21 · Score: 0

FTFA:

The secondary network, the replication network, is a lower capacity network used as a back-up network to allow EBS nodes to reliably communicate with other nodes in the EBS cluster and provide overflow capacity for data replication. This network is not designed to handle all traffic from the primary network but rather provide highly-reliable connectivity between EBS nodes inside of an EBS cluster.
Re:Isn't the point of a secondary network... by ae1294 · 2011-04-29 01:22 · Score: 1

... to be able to handle loads if the primary fails?
No it's so marketing can make redundancy claims for 1/100 the cost of true redundancy.
Re:Isn't the point of a secondary network... by mysidia · 2011-04-29 01:27 · Score: 2, Informative

... to be able to handle loads if the primary fails?
No. That's the point of the redundant elements and backup of the primary network.
The secondary network they routed traffic to was designed for a different purpose, and never meant to receive traffic from the primary network.
Re:Isn't the point of a secondary network... by natet · 2011-04-29 04:32 · Score: 1

... to be able to handle loads if the primary fails?
No. That's the point of the redundant elements and backup of the primary network.
The secondary network they routed traffic to was designed for a different purpose,
and never meant to receive traffic from the primary network.
For example, management, monitoring, and logging traffic.

--
IANAL... But I play one on /.

Amazon issues 10-day service credit by kriston · 2011-04-29 01:22 · Score: 4, Interesting

Dear AWS Customer,

Starting at 12:47AM PDT on April 21st, there was a service disruption (for a period of a few hours up to a few days) for Amazon EC2 and Amazon RDS that primarily involved a subset of the Amazon Elastic Block Store ("EBS") volumes in a single Availability Zone within our US East Region. You can read our detailed summary of the event here:
http://aws.amazon.com/message/65648

We've identified that you had an attached EBS volume or a running RDS database instance in the affected Availability Zone at the time of the disruption. Regardless of whether your resources and application were impacted, we are going to provide a 10 day credit (for the period 4/18-4/27) equal to 100% of your usage of EBS Volumes, EC2 Instances and RDS database instances that were running in the affected Availability Zone. This credit will be automatically applied to your April bill, and you don't need to do anything to receive it.
You can see your service credit by logging into your AWS Account Activity page after you receive your upcoming billing statement.

Last, but certainly not least, we want to apologize. We know how critical the services we provide are to our customers' businesses and we will do everything we can to learn from this event and use it to drive improvement across our services.

Sincerely,
The Amazon Web Services Team

This message was produced and distributed by Amazon Web Services, LLC, 410 Terry Avenue North, Seattle, Washington 98109-5210

--

Kriston

Re:Amazon issues 10-day service credit by Anonymous Coward · 2011-04-29 01:52 · Score: 0

Hmm. But if you were suffering an OUTAGE -- wouldn't the 100% usage be almost zero?
Re:Amazon issues 10-day service credit by StikyPad · 2011-04-29 03:43 · Score: 1

Nice. You know what Sony offered me for disclosing all of my information?
They sent me an e-mail which pointed out that, by law, I can get one free credit report per year, and they encouraged me to take advantage of that to look for any fraudulent activity.

--
https://www.eff.org/https-everywhere
Re:Amazon issues 10-day service credit by kriston · 2011-04-29 05:40 · Score: 1

I don't know what bothers me more: the outage itself or the alternative codes Amazon used for punctuation in that email that made my post look messed up only after I posted it.

--
Kriston
Re:Amazon issues 10-day service credit by matt_hs · 2011-04-29 08:22 · Score: 1

Then they're wrong. You can get one free credit report each year from each credit reporting bureau. You don't have to get them all at the same time. Do Experian one time, then a few months later do TransUnion, then a few months later do EquiFax. Rinse and repeat.
But yeah, Sony is lame. I had a DVD drive for just over a year when it went out. Wouldn't do anything about it. I won't buy Sony again if there's any other alternative out there.

Amazon EC2 outage analysis by doperative · 2011-04-29 01:24 · Score: 1

"Last Thursday’s Amazon EC2 outage was the worst in cloud computing’s history .. I will try to summarize what happened, what worked and didn’t work, and what to learn from it. I’ll do my best to add signal to all the noise out there" link

The Cloud by ae1294 · 2011-04-29 01:24 · Score: 1

So we now know that the promise of the cloud is a lie. How long before we get a new buzz word for turning over all of our data to the new Internet barons because they know what is best?

Re:The Cloud by cryfreedomlove · 2011-04-29 01:43 · Score: 1

So we now know that the promise of the cloud is a lie. How long before we get a new buzz word for turning over all of our data to the new Internet barons because they know what is best?
How does this event lead to the conclusion that 'the promise of the cloud is a lie'? Be specific.
Re:The Cloud by TrevorDoom · 2011-04-29 03:30 · Score: 1

"The Cloud" has always been nothing more than marketing buzz. All "The Cloud" is are physical servers running a hypervisor and running your machine instances as VMs.
There's still people, switches, routers, firewalls, servers, and storage that are used to build "The Cloud."
This belief that doing things in "The Cloud" makes them impervious to hardware failure, power outage, network connection drops, etc. has always been misinformed.
Re:The Cloud by ae1294 · 2011-04-29 04:12 · Score: 1

This belief that doing things in "The Cloud" makes them impervious to hardware failure, power outage, network connection drops, etc. has always been misinformed.
But profitable....
Re:The Cloud by tlhIngan · 2011-04-29 04:46 · Score: 1

"The Cloud" has always been nothing more than marketing buzz. All "The Cloud" is are physical servers running a hypervisor and running your machine instances as VMs.
There's still people, switches, routers, firewalls, servers, and storage that are used to build "The Cloud."
This belief that doing things in "The Cloud" makes them impervious to hardware failure, power outage, network connection drops, etc. has always been misinformed.
Actually, that's the whole point of doing things "in the cloud" versus just using a webhost or a colo facility. At the latter you provision your machine according to how you think it'll be used and manage it as you would any other piece of IT equipment locally. If it dies, you go down and replace it. If you get hit by a link from /., your server gets slow. If it's the holidays your server crashes, etc. You could overprovision your services by buying extra servers to handle the overflow, but then you're paying lots of extra money to handle the few instances when you get heavy traffic. Since you're leasing hardware and/or physical space, you're paying for that all the time. Depending on their size, provisioning extra services can take hours or days.
Whereas, had it been "in the cloud" at Amazon or something, if your website gets slow because someone discovered your product and posted it on /. or did the social networking thing, you can spin up a new instance immediately (for just a few more dollars) and rake in the cash. At the end of the week when traffic drops off, you destroy the new instance and pay just for the computation you used. Should the datacenter suffer a power or network outage, unless you have a spare in another datacenter, you're hosed.
And yes, the cloud should make things impervious to hardware failure, power outages, network outages, etc. After all, you've decoupled the datacenter from the servers themselves. If a physical EC2 box dies, everything should be moved automagically - since you're only dealing with a VM not attached to any particular hardware, it shouldn't matter that your server now isn't the same one as yesterday. Network and power outages the same - your server should be floating amongst the datacenters that your provider has since you're paying for the CPU cycle, not for the server or physical space.
Alas, what happened here was Amazon wasn't decoupled enough.
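The pay-for-peak versus pay-for-use argument above can be put into numbers. This is only a rough sketch: every rate and server count below is hypothetical, picked for illustration, and not taken from Amazon's or any colo provider's price list.

```python
HOURS_PER_MONTH = 730  # average number of hours in a month


def fixed_cost(servers_for_peak: int, monthly_rate: float) -> float:
    """Colo/leased model: you pay for peak capacity all month long."""
    return servers_for_peak * monthly_rate


def on_demand_cost(baseline: int, extra: int, burst_hours: float,
                   hourly_rate: float) -> float:
    """Cloud model: baseline instances run all month, extras only during the burst."""
    return hourly_rate * (baseline * HOURS_PER_MONTH + extra * burst_hours)


if __name__ == "__main__":
    # Hypothetical numbers: 2 servers cover normal load, 10 cover the peak.
    leased = fixed_cost(servers_for_peak=10, monthly_rate=100.0)
    one_week_burst = on_demand_cost(baseline=2, extra=8, burst_hours=168,
                                    hourly_rate=0.20)
    all_month_burst = on_demand_cost(baseline=2, extra=8,
                                     burst_hours=HOURS_PER_MONTH,
                                     hourly_rate=0.20)
    print(f"leased for peak:        ${leased:.2f}")
    print(f"cloud, one-week burst:  ${one_week_burst:.2f}")
    print(f"cloud, month-long peak: ${all_month_burst:.2f}")
```

With these made-up rates the cloud wins when the spike is short and loses once the "spike" becomes the normal load, so the answer depends entirely on how bursty the traffic is.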
Re:The Cloud by sjames · 2011-04-29 05:32 · Score: 1

The marketing promise, that is. We know it because according to the hype, the cloud means you are NEVER down and your data is ALWAYS safe.
I'm sure there will be a few "no true cloud" marketing fallacies running about though.
Re:The Cloud by Vancorps · 2011-04-29 09:18 · Score: 1

Out of curiosity, have you used Amazon EC2 services? You manage your hosts exactly like you would in a dedicated hosting environment, you even remote desktop into Windows hosts or SSH into Linux hosts. Unless you can use your existing server as a template and clone away you're going to need to do a lot of fine tuning and that's assuming your webapp can handle that as load balancing and others are all addons which can easily end up costing quite a bit.
I tried out EC2 for my last large web event knowing I needed some extra database horsepower and not only was clock speed low but my only choice was to scale out instead of up.
There are very specific cases I've found where an external cloud is useful. Now internally I like the technology a lot as I've practically eliminated server hardware failure causing any downtime. Rapid prototyping is a snap too. When it comes to external resources and expected traffic spikes I'll spawn a few instances a week ahead on EC2 and sure it's better than buying servers for that one event. When constant traffic levels require cloud resources then it's time to scale up in-house resources as EC2 quickly becomes expensive. Want to test new Extranet functionality? Okay, EC2. Generally anything long term I feel at this point has no business being in a cloud service provider's hands.

At least they admit it by jesseck · 2011-04-29 01:27 · Score: 4, Insightful

I commend Amazon for providing us with this information. Yes, bad things happened, and data is gone forever. Amazon knows what happened and why, and I'm sure they will implement controls to prevent this again. I doubt we'll hear as much from Sony, though.

Re:At least they admit it by david.emery · 2011-04-29 01:46 · Score: 4, Insightful

We all benefit from these kinds of disclosures, I remember Google posting post-mortem analyses of some of their failures. Even Microsoft provided information on their Sidekick meltdown. This does seem to be the 'typical' melange of a human error and cascading consequences.
Someone first said, "You learn much more from failure than you do from success." If nothing else, it's the thesis of the classic Petroski book, "To Engineer is Human: The Role of Failure in Successful Design" http://www.amazon.com/Engineer-Human-Failure-Successful-Design/dp/0679734163 (If you haven't read this, you should!!)
And I'm also reminded of a core principle from safety critical system design, that you cannot provide 100% safety. The best you can do is a combination of probabilistic analysis against known hazards. As a Boeing 777 safety engineer told me, "9 9's of safety, i.e. chance of failure 1/10 ^-9, applied over the expected flying hours of the 777 fleet, still means a 50-50 chance of an aircraft falling out of the sky." That kind of reasoning also applies to the current Japanese nuke plant failure...
Re:At least they admit it by tompaulco · 2011-04-29 02:35 · Score: 1

I doubt we'll hear as much from Sony, though.
I have an account on Playstation Network and they have already sent me a long and thorough e-mail explaining what happened and what the implications are. And since the problem is ongoing, that makes them MORE proactive than Amazon in getting the word out.
Not to mention that Playstation Network is free and any uptime over 0% ought to be considered a bonus. Whereas you are paying for a certain level of service with cloud computing.

--
If you are not allowed to question your government then the government has answered your question.
Re:At least they admit it by afex · 2011-04-29 03:01 · Score: 2

this has gone mildly offtopic, but as a PSN user i just wanted to chime in and say the following...

I can't STAND when people say 'its free, so its ok if it goes down.' When i purchased a PS3, the PSN was a FEATURE that i considered when i bought it. As such, it's not really "free", its more like it was wrapped into the MSRP. By your logic, they should be able to take away the entire network for GOOD and everyone should be completely happy. is this true? Heck, let's start pulling out other features that you got for 'free' as well. I mean geez, I heard that no one uses otherOS, lets just...pull..that...oh shit.
Re:At least they admit it by ccady · 2011-04-29 03:22 · Score: 1

No, that kind of reasoning does not apply to nuke plant failures. There are not millions of nuke plants running each day. There are only 442 nuke plants. If we cannot secure 442 plants from having disasters, then we need to do something else that does not cause disasters.

--
J'aime mieux les méchants que les imbéciles, parce qu'ils se reposent. -- Alexandre Dumas
Re:At least they admit it by Anonymous Coward · 2011-04-29 03:23 · Score: 0

I've worked in or alongside IT operations for more than a decade, and this is far more complete a post-mortem than I've ever seen produced internally or by a vendor.
This level of disclosure is precisely what ends up building trust.
I'm glad to see that they're committing to make it easier for customers to understand how to leverage the capabilities of AWS which contribute to increased availability. I can't help but think that some of the companies who use AWS do so "because it's cheap", and not "because it allows better use of capital". It's easier to justify salaries for architects when you're talking about a $2M capex project than when someone whipped out a credit card, put a prototype in the cloud, and it became the production system.
Re:At least they admit it by david.emery · 2011-04-29 03:32 · Score: 1

So how intense an earthquake, at what distance from the plant, and how high a tsunami should we plan for next time???
Re:At least they admit it by david.emery · 2011-04-29 03:33 · Score: 1

p.s. Wikipedia says there are 923 B777's out there, about 2x the number of nuke plants. http://en.wikipedia.org/wiki/Boeing_777
Re:At least they admit it by the+eric+conspiracy · 2011-04-29 04:05 · Score: 2

The issue is not uptime. It is the loss of sensitive data. If Sony is holding personal data they have an obligation to protect that data.
Re:At least they admit it by StuartHankins · 2011-04-29 04:11 · Score: 1

+1 Insightful. The Playstation would never have acquired the market share it has without PSN. People would have bought something else. It's a major part of the promotion of the product.
Re:At least they admit it by Draknor · 2011-04-29 04:30 · Score: 1

As a Boeing 777 safety engineer told me, "9 9's of safety, i.e. chance of failure 1/10 ^-9, applied over the expected flying hours of the 777 fleet, still means a 50-50 chance of an aircraft falling out of the sky."
This doesn't even make any sense -- what am I missing? A 50-50 chance of falling out of the sky? I'm assuming that's hyperbole, but I'm not grasping the concept here.
For what it's worth, the wiki article (linked in another post) indicates the 777 has been involved in 7 "incidents", although the only fatality was a ground crew worker who suffered fatal burns during a refueling fire.
Re:At least they admit it by Anonymous Coward · 2011-04-29 04:30 · Score: 0

I commend Amazon for providing us with this information. Yes, bad things happened, and data is gone forever. Amazon knows what happened and why, and I'm sure they will implement controls to prevent this again. I doubt we'll hear as much from Sony, though.
While I support Amazon's decision to disclose - I'm not sure it was the best thing for survivability as a provider. I think its competitors are going to take information from this failure and scare less-technically-endowed decision makers at clients into using a less-capable provider. In my opinion - transparency and disclosure work very differently in the technical world vs. the business world.
Re:At least they admit it by Anonymous Coward · 2011-04-29 04:43 · Score: 0

He means there's something like a 50-50 chance that the plane will fall out of the sky (i.e. suffer some sort of emergency) at least once in its expected total lifetime. Of course it will make tens of thousands of successful flights before that happens.
Re:At least they admit it by LWATCDR · 2011-04-29 04:56 · Score: 1, Informative

You have not taken a statistics course, have you? You can have one airplane and have it fall out of the sky, or you can have 1,000,000 and never have one crash, and both systems could have 9 9's safety. This is the risk of failure; it isn't destiny.
So to combat the FUD.
1. So far the death toll from the event is 18000. Death toll from radiation so far 0.
2. The nuclear plant didn't cause the disaster the earthquake and tsunami that followed it did.
3. People died in cars, buildings, on the street, and in a dam that also collapsed.
4. A lot of the lives were lost because of a failure of a sea wall.
So by your logic we really need to replace cars, buildings, streets, dams, and sea walls first since they all have caused so many deaths. Might I suggest a cave? Oh and no fire because that is also too risky. And keep away from those sharp stones as well.

--
See my blog http://ilovecookes.blogspot.com/ for light hearted technical information.
Re:At least they admit it by Anonymous Coward · 2011-04-29 05:19 · Score: 0

Shouldn't that be 1/10^9 or 1*10^-9? 1/10^-9 seems like a pretty big chance of failure...
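The arithmetic behind the "50-50" claim can be sketched as follows. This is a minimal illustration, assuming "9 9's" means a failure probability of 1*10^-9 per flight hour and that flight hours are independent; the quote does not say which unit the figure applies to, so both points are assumptions, and the function names are made up for this example.

```python
import math

P_FAIL_PER_HOUR = 1e-9  # assumed reading of "9 9's": 1 * 10^-9 per flight hour


def prob_at_least_one_failure(p: float, hours: float) -> float:
    """P(at least one failure) = 1 - (1 - p)^hours, computed stably for tiny p."""
    return -math.expm1(hours * math.log1p(-p))


def hours_for_probability(p: float, target: float) -> float:
    """Exposure (in hours) at which P(at least one failure) reaches target."""
    return math.log1p(-target) / math.log1p(-p)


if __name__ == "__main__":
    h = hours_for_probability(P_FAIL_PER_HOUR, 0.5)
    print(f"hours for a 50-50 chance: {h:.3e}")
    print(f"check: {prob_at_least_one_failure(P_FAIL_PER_HOUR, h):.3f}")
```

Under those assumptions the break-even exposure is roughly 6.9*10^8 flight hours (ln 2 divided by 10^-9). Whether the 777 fleet actually accumulates that many hours is not something this thread establishes.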
Re:At least they admit it by Anonymous Coward · 2011-04-29 05:28 · Score: 0

I commend Amazon for providing us with this information. Yes, bad things happened, and data is gone forever. Amazon knows what happened and why, and I'm sure they will implement controls to prevent this again. I doubt we'll hear as much from Sony, though.
FTFA... They detail the controls they are implementing to prevent this from happening again.
Re:At least they admit it by Anonymous Coward · 2011-04-29 07:47 · Score: 0

There are 442 nuke plants...OPERATING AROUND THE FUCKING CLOCK! That's the parameter you rate against, bub.
Re:At least they admit it by Anonymous Coward · 2011-04-29 11:11 · Score: 0

Not necessary. Just design every reactor with a kill switch and employ someone with big balls to use it.

Can we get this in non-Amazon speak by gad_zuki! · 2011-04-29 01:29 · Score: 3, Interesting

What is an EBS? Is it really just a Xen or VMWare disk image? Which data center corresponds with each availability zone? What are they using for storage iSCSI targets on a SAN?

Re:Can we get this in non-Amazon speak by pdbaby · 2011-04-29 02:10 · Score: 1

While that would be nice to know I don't think it's relevant to a postmortem: they described the architectural elements which encountered the failures.
FYI, though, based on what they've said today and in the past: it seems that they are using regular servers with disks rather than a SAN & I believe they use GNBD to connect the EBS server disk images and EC2 instances (rather than iSCSI)

--
Global symbol "$deity" requires explicit package name at line 2. - If only $scripture started "use strict;"
Re:Can we get this in non-Amazon speak by Synn · 2011-04-29 02:56 · Score: 1

Amazon EC2 is Xen. The back end storage for that is ephemeral in that it goes away when you shut it down. So they introduced EBS which is persistent storage. So you could have an EC2 server and mount EBS volumes on it and those EBS volumes will exist even when the EC2(Xen) server goes away. You can even mount them on different EC2(Xen) servers(though not at the same time). Also today you can have the EC2 server itself run on top of EBS if you want the data on it to stay around after a reboot.
Then there's also S3 which while a separate product, ties into all the above as a "slower" storage medium where you keep your Xen images and (if you're smart) permanent backups. You can also use it for storage for your applications, there's an API to put and get data from it. S3 has an insane level of reliability, supports versioning and can even be setup so that you can't delete from it without a hardware fob key device. So you could put customer data there and only the CEO of the corporation could delete it.
EBS is basically like iSCSI, but far more complex. There's a lot of proprietary stuff they're doing with it.
So you have:
EC2(Xen) servers which can be thought of as disposable.
S3 which is an absurdly reliable storage back end, but you can't really use it as a filesystem. It's not high IO.
EBS which is your high IO filesystem that's persistent and portable. It's something you'd use for a database filesystem and their database product(RDS) uses this, since it's basically just MySQL Xen on top of EBS.
In this case they had an issue with EBS in 1 of their north Virginia data centers. This affected 1 of 3 east coast availability zones but you can't really say which one since the zone names are randomized for each customer(to prevent everyone from using the same zone name).
Outside of the 3 east coast availability zones they also have zones in other regions that weren't affected.
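The per-customer zone naming mentioned above can be pictured with a small sketch. This is purely illustrative: Amazon has not published how the mapping is done, so the hashing scheme, the physical zone identifiers and the function name `zone_map` are all assumptions.

```python
import hashlib
import random
from typing import Dict, List

# Hypothetical identifiers; the real physical zones are not public.
PHYSICAL_ZONES: List[str] = ["physical-1", "physical-2", "physical-3"]
ZONE_LABELS: List[str] = ["us-east-1a", "us-east-1b", "us-east-1c"]


def zone_map(account_id: str) -> Dict[str, str]:
    """Return a stable, account-specific mapping from zone label to physical zone."""
    # Derive a per-account seed so the same account always sees the same mapping.
    seed = int.from_bytes(hashlib.sha256(account_id.encode()).digest()[:8], "big")
    shuffled = PHYSICAL_ZONES[:]
    random.Random(seed).shuffle(shuffled)
    return dict(zip(ZONE_LABELS, shuffled))


if __name__ == "__main__":
    for account in ("account-A", "account-B"):
        print(account, zone_map(account))
```

The point of the sketch is only that two customers can both say "us-east-1a" and mean different physical zones, which is why nobody outside Amazon can name the affected zone by label.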
Re:Can we get this in non-Amazon speak by StuartHankins · 2011-04-29 04:18 · Score: 1

Forgive my ignorance, but why not KVM instead of Xen? And assuming you have a complicated setup with bunches of scripts, mounts, etc, how do you image the entire thing? We have to schedule off-hour downtime to do a snapshot (everything except data) for our internal servers since a new install / config from scratch would take too long for recovery -- but that involves a lot of control that you may or may not have in a "cloud" situation.
Re:Can we get this in non-Amazon speak by kriston · 2011-04-29 05:49 · Score: 1

They use more than just Xen but they don't really publicize it. With paravirtualization they can use anything they want, but Xen seems to be the most prevalent. Some of my instances say "xen" and others say "paravirtual." Just because the kernel says "xen" or "paravirtual" does not necessarily mean that the hypervisor is Xen or something else.
Also, speaking towards migrating instances between "availability zones," I found out that I cannot use Windows Server EBS boot volumes on anything but the instance I created it with. You can do this with most of the Unix instances, though. I found this little oddity a bit maddening when I wanted to move a Windows Server instance from a t1.micro to t1.small.
No-can-do with boot volumes. Well, not officially, anyway.
The other thing I don't get is why I should be able to use my Amazon.com shopping account to log into AWS. It seems, well, silly, even with the two-factor authentication dongle, that the same account used to buy Kindle books and Fisher-Price toys is paying for my AWS usage.

--
Kriston
Re:Can we get this in non-Amazon speak by Synn · 2011-04-29 06:36 · Score: 1

They used Xen because that's what was mature at the time. In many ways Xen is still more mature than KVM, though it won't be that way for long.
Amazon supplies you a bunch of tools for dealing with the images. They call the "stored" images AMI's. There's a huge list of public AMI's you can choose from and anyone can create their own AMI off pretty much any Linux distribution.
You can snapshot a running instance using the ec2-ami-tools which are installed in your running instances. Using those you can easily create your own AMI's off a running EC2 instance. So you'd create a base EC2 instance off a public AMI, customize it and then "snapshot" that as a custom AMI you can re-use later.
Re:Can we get this in non-Amazon speak by StuartHankins · 2011-04-29 06:43 · Score: 1

Many thanks.
Re:Can we get this in non-Amazon speak by Slashdot+Parent · 2011-04-29 07:22 · Score: 1

EBS which is your high IO filesystem

Damnit, now you owe me a new monitor.

--
They don't grade fathers, but if your daughter's a stripper, you fucked up. --Chris Rock
Re:Can we get this in non-Amazon speak by Slashdot+Parent · 2011-04-29 07:42 · Score: 1

And assuming you have a complicated setup with bunches of scripts, mounts, etc, how do you image the entire thing? We have to schedule off-hour downtime to do a snapshot (everything except data) for our internal servers since a new install / config from scratch would take too long for recovery -- but that involves a lot of control that you may or may not have in a "cloud" situation.
There are a couple of interesting tools to help you with this. One that's been around for a while, but is not maintained by Amazon, is ec2-consistent-snapshot. It is a tool that automates the process of quickly quiescing your database and filesystem, initiating a snapshot of your volume (or volumes, if you have a RAID array), and then restoring read/write access. It all happens very quickly, so the disruption to your application is short (although the I/O performance of your volume will suffer while the snapshot is in progress).
I use ec2-consistent-snapshot for all of my EBS volumes and it made recovery from last week's outage pretty painless. It went something like this:
1. Application died at about 10am. Alarm.
2. I look at the instances. Can't log in.
3. I look at the AWS status monitor page. Find out EBS the service (as opposed to EBS volumes, 0.5% of which are expected to fail in any given year) has failed in all 4 us-east-1 availability zones. Find out my app is running in the AZ that has completely failed.
4. Curse.
5. Look at status of today's EBS snapshots, which begin at 6am. Some have completed, not all.
6. Wait for snapshots to complete. Happened at about 12pm.
7. Terminate all instances. None terminate. Curse.
8. AWS recovers 2 availability zones (I think this happened at about 1pm.)
9. Do some testing to figure out which of my zones are good.
10. Launch application from snapshots in a good AZ. App is up and running, but missing 4 hours worth of transactions, by 1:30pm.
11. Wait for AWS to restore access to bad AZ.
12. Initiate snapshots of bad volumes.
13. Replay missing transactions from 6-10am.
14. That's it.
In all, it was a few hours of outage, and a bit of care not to lose any transactions. Not bad for AWS's biggest failure, to date.
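The quiesce, snapshot, resume sequence described above can be sketched like this. It is not the code of ec2-consistent-snapshot; the `freeze`, `thaw` and `start_snapshot` callables are hypothetical stand-ins for whatever locks the database, freezes the filesystem and calls the snapshot API in a real setup.

```python
from contextlib import contextmanager
from typing import Callable, Iterator, List


@contextmanager
def quiesced(freeze: Callable[[], None], thaw: Callable[[], None]) -> Iterator[None]:
    """Pause writes for the duration of the block; always resume, even on error."""
    freeze()
    try:
        yield
    finally:
        thaw()


def consistent_snapshot(
    freeze: Callable[[], None],
    thaw: Callable[[], None],
    start_snapshot: Callable[[str], str],
    volumes: List[str],
) -> List[str]:
    """Initiate a snapshot of every volume while writes are paused."""
    with quiesced(freeze, thaw):
        # Only *initiate* the snapshots here, so the pause stays short.
        return [start_snapshot(v) for v in volumes]


if __name__ == "__main__":
    log: List[str] = []
    ids = consistent_snapshot(
        freeze=lambda: log.append("freeze"),
        thaw=lambda: log.append("thaw"),
        start_snapshot=lambda v: log.append(f"snap {v}") or f"snap-of-{v}",
        volumes=["vol-a", "vol-b"],
    )
    print(log, ids)
```

Starting all snapshots inside a single pause is what keeps the members of a RAID set consistent with each other, and the `finally` clause makes sure the application is never left frozen if a snapshot call fails.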
Another interesting tool, which I have not used, is CloudFormation. It automates the task of provisioning your application's infrastructure. I'm not sure how complete it is, as I haven't looked into it. I already have scripts to provision my app's resources from before CloudFormation was created.

--
They don't grade fathers, but if your daughter's a stripper, you fucked up. --Chris Rock
Re:Can we get this in non-Amazon speak by StuartHankins · 2011-04-29 08:32 · Score: 1

Thanks! I will have to check that out.
Re:Can we get this in non-Amazon speak by gbrayut · 2011-04-29 10:53 · Score: 1

EBS is basically like iSCSI, but far more complex. There's a lot of proprietary stuff they're doing with it.
Anyone know how it compares in speed to iSCSI or a SAN? From reading the report it sounds like there is A LOT more going on, and I have even heard of people using multiple EBS volumes in a "RAID" like array for faster IO speed. Sounds like way too complex of a system.
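For the "RAID"-like setups mentioned here, the usual idea is striping (RAID 0): consecutive chunks of the logical disk are spread round-robin across the volumes so that requests hit several volumes in parallel. A minimal sketch of the address mapping, with a made-up function name and no claim about how any particular EBS user configures it:

```python
from typing import Tuple


def locate_block(logical_block: int, num_volumes: int, stripe_blocks: int = 1) -> Tuple[int, int]:
    """Map a logical block number to (volume index, block offset on that volume)."""
    stripe = logical_block // stripe_blocks   # which stripe unit the block falls in
    within = logical_block % stripe_blocks    # position inside that stripe unit
    volume = stripe % num_volumes             # stripe units rotate across volumes
    offset = (stripe // num_volumes) * stripe_blocks + within
    return volume, offset


if __name__ == "__main__":
    for block in range(8):
        print(block, locate_block(block, num_volumes=4))
```

Striping adds throughput but no redundancy: losing any one volume loses the whole array, which is one reason such setups add complexity.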

Windows Azure Drives are like EBS but they are simply VHD files stored in Page Blobs (Azure's version of cloud storage, similar to Amazon S3) with local caching on each VM instance. I assume they have slower read/write speeds than EBS but seem like they would be much less complex to manage/maintain than a completely separate proprietary storage cluster. Does Amazon have anything similar using S3 or RRS for backing virtual hard drives instead of EBS?

Voltron by Kamiza+Ikioi · 2011-04-29 01:54 · Score: 1

But can I get an understandable car analogy here?

15 cars tried to transform into Voltron but instead turned into Snarf.

--
I8-D

Data loss? by Alex+Belits · 2011-04-29 02:04 · Score: 1

And HOW THE HELL does such a procedure cause data loss?!

Are those geniuses using the service transfer procedures that do not perform clean transaction handling and instead just send stuff to be copied expecting that it will sync soon enough?

--
Contrary to the popular belief, there indeed is no God.

Re:Data loss? by Anonymous Coward · 2011-04-29 02:44 · Score: 0

I think it has to do with the state of user's databases, which could be in-memory only and mid-processing. IIUC, storage is not local to individual machines and requires a consistent data link to remain connected to their filesystems. Virtualization is a complicating factor here, too.
Here is one possible example, off the top of my head (or OOMA). Data could be accepted from a user; its state changed in memory; but then is not able to be written back out to disk due to this failure mode.
Just a WAG, I could be wrong.
Re:Data loss? by Alex+Belits · 2011-04-29 04:44 · Score: 1

This should never be possible if transaction model is properly implemented -- data in memory would have to be stored and confirmed to be stored, or transaction should be cleanly reverted before anything is moved.
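The "stored and confirmed to be stored, or cleanly reverted" rule can be illustrated on a single local file. This is a sketch of the general pattern only, not of anything EBS does internally; `durable_write` is a made-up name, and for brevity it skips the extra step of syncing the containing directory.

```python
import os
import tempfile


def durable_write(path: str, data: bytes) -> None:
    """Leave path holding either the old content or the new content, never a mix."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())   # confirm the bytes reached stable storage
        os.replace(tmp, path)      # atomic rename: this is the commit point
    except BaseException:
        # Revert cleanly: discard the partial copy, leave the original untouched.
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        target = os.path.join(d, "record.bin")
        durable_write(target, b"version 1")
        durable_write(target, b"version 2")
        with open(target, "rb") as fh:
            print(fh.read())
```

The caller is only told the write succeeded after the sync and rename have completed; anything that fails before that point leaves the previous version in place.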

--
Contrary to the popular belief, there indeed is no God.

Other outage by nns6561 · 2011-04-29 02:06 · Score: 1

I'm trying to remember what the other outage was recently where the web service failed because they forgot to implement exponential backoff. Anyone remember?

Re:Other outage by Anonymous Coward · 2011-04-29 22:11 · Score: 0

Pretty much every web service failure, ever?
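For reference, exponential backoff means each failed attempt waits roughly twice as long as the previous one before retrying, usually with some randomness so that many clients don't retry in lockstep. A minimal sketch; the function names and default values are illustrative, and the "full jitter" variant shown is one common choice rather than the only one.

```python
import random
import time
from typing import Callable, Iterator, TypeVar

T = TypeVar("T")


def backoff_delays(base: float = 0.1, cap: float = 30.0, attempts: int = 6) -> Iterator[float]:
    """Yield the maximum wait after each attempt: base * 2^n, capped at cap."""
    for n in range(attempts):
        yield min(cap, base * (2 ** n))


def call_with_backoff(
    op: Callable[[], T],
    base: float = 0.1,
    cap: float = 30.0,
    attempts: int = 6,
    sleep: Callable[[float], None] = time.sleep,
    rng: Callable[[], float] = random.random,
) -> T:
    """Call op up to `attempts` times, sleeping with full jitter between failures."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    delays = list(backoff_delays(base, cap, attempts))
    for i, ceiling in enumerate(delays):
        try:
            return op()
        except Exception:
            if i == len(delays) - 1:
                raise                    # out of attempts: surface the last error
            sleep(rng() * ceiling)       # wait a random fraction of the ceiling
    raise AssertionError("unreachable")


if __name__ == "__main__":
    print(list(backoff_delays(base=1, cap=10, attempts=5)))
```

Without the growing delay, every client that sees a failure retries immediately, and the retries themselves become the load that keeps the service down.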

I dont buy it by OverlordQ · 2011-04-29 02:38 · Score: 1

During the whole issue they never posted a cause, and it took them forever to even say 'still investigating'. Even if they have a bare bones monitoring system up, it should have been readily apparent that traffic was flowing over the wrong network.

[..] because traffic was purposely shifted away from the primary network and the secondary network couldn't handle the traffic level it was receiving.

So they're basically saying if the primary network has issues there's not really a point in the backup because the backup network will make things explode just as much as having no backup.

--
Your hair look like poop, Bob! - Wanker.

pure and utter BS by dieth · 2011-04-29 02:45 · Score: 1

If this was the cause why wasn't the change corrected immediately and the traffic routed to where it was originally intended. 3 days of downtime just doesn't happen when you fuck up a line in a config. If this was actually the case the downtime would have been minimal.

Re:pure and utter BS by bruce_the_loon · 2011-04-29 03:49 · Score: 2

Go and read the entire notice, not just the pathetic snippet a bad submitter used. Makes more sense.
Also, this is a storage network, not an access network. Effectively it's like pulling the SAS cable out of the RAID card while the machines are running.

--
Trying to become famous by taking photos. Visit my homepage please.
Re:pure and utter BS by Zondar · 2011-04-29 03:53 · Score: 1

This was a cascade failure that affected multiple systems on multiple layers, with ramping race conditions that worsened over time. The engineer didn't hit the "Enter" key and suddenly the little green light turned red to tell him 1/3 of the grid was down.
Re:pure and utter BS by DigiShaman · 2011-04-29 06:07 · Score: 1

OUCH! I wonder how many VHD files got corrupted that way.

--
Life is not for the lazy.

Tsunami waves are not higher than 19 feet, by tlk+nnr · 2011-04-29 02:55 · Score: 1

and the primary and secondary network will not fail simultaneously for a large number of nodes.

It's nice to see that everyone has the same problem:
There is no approach to identify wrong assumptions.

But what's the conclusion?
Should we stay away from huge systems, because the damage due to a wrong assumption in a huge system is huge?

Where is their testing lab? by Anonymous Coward · 2011-04-29 02:58 · Score: 0

Someone (or multiples thereof) at the top of the Amazon infrastructure management heap should be fired and/or executed! Making such a network change to a live system that can impact so many users and applications without first testing it in a fully functional test environment that reasonably mirrors the real environment, is reckless, incompetent, and unconscionable. If they had done so, the error in configuration would have (should have) been caught before it was applied to the live system. And I don't think they can just blame it on "stumble fingers" either!

Re:Where is their testing lab? by TrevorDoom · 2011-04-29 03:35 · Score: 2

Have you ever worked in a real environment?
There is ALWAYS a difference between test and production. No matter how many test cases and iterations of changes that you go through, there is always a non-zero percent chance that the change in production will behave differently.
This is why most companies require fall-back procedures for any production change in addition to testing.
It sounds like it may have taken them longer than some might be comfortable to reach the point where they did roll back changes...but I'm sure that this change tested as okay in all of their test cases.

1965 by Triv · 2011-04-29 03:17 · Score: 1

Huh. Sounds like a 21st century version of the routing failure that caused the 1965 Northeast blackout, just with data instead of electricity.

http://en.wikipedia.org/wiki/Northeast_Blackout_of_1965

Won't touch the cloud now. by Anonymous Coward · 2011-04-29 03:18 · Score: 0

I was one of the businesses that suffered from this. I was in the process of migrating my mail server that hosts multiple virtual domains for clients to an EC2 instance. I had provided Amazon with the information needed to remove my elastic IP from specific RBLs and was moving forward gracefully with my configuration. I had a couple of clients moved to the new server when I first heard about the data loss. I was still able to access my instance last night so I thought, "OK, I must not have been affected." I woke up this morning to the email:

Dear AWS Customer,

Starting at 12:47AM PDT on April 21st, there was a service disruption (for a period of a few hours up to a few days) for Amazon EC2 and Amazon RDS ..... and on and on ...

I logged into my Management Console to see all my work and all my customer data gone. You can imagine how happy my clients were when I told them of the news. The cloud can kiss my ass.

Re:Won't touch the cloud now. by teknopurge · 2011-04-29 03:24 · Score: 1

Cloud computing is a marketing architecture, not a technical architecture.

Cloud computing is a form of shared hosting, just with more encapsulation; Clouds fall over the same way a server can fall over. It's hard to blame "The Cloud" when the reality is the people that were suckered in by obtuse, non-specific marketing are the ones at fault. The argument can even be made that Clouds are worse because instead of many discrete isolated servers you start sharing more single points of failure, which lead to IO bottlenecks, etc.

--
Website Hosting

Data Loss by Anonymous Coward · 2011-04-29 03:24 · Score: 0

My read of the explanation is that 0.07% of physical machines happened to die during the incident (1 out of 1400) due to hardware failure. Since they couldn't re-mirror, there was a single copy of the data on those machines and data was lost. I'm actually pretty impressed with their response and design, based on the explanation.

1% meltdown rate... by Anonymous Coward · 2011-04-29 04:21 · Score: 0

apparently we can't secure them from disasters.

5 reactors of the 442 have melted down. That is over 1% catastrophic failure.

Circadian rhythms by sleep-doc · 2011-04-29 04:22 · Score: 1

In thinking about why this happened, don't lose sight of the fact that the time they chose to make the configuration change was 00:47 local. Human performance on 3rd shift isn't what it is on day shift, and I would think it very likely the people managing this change had been up and working for a significant number of hours at that time. Would they have noticed something or done something differently at 10:00 local? Certainly making an upgrade at a time of lowest use sounds right, but it's not always as simple as that, and you have to respect the realities of circadian rhythms or suffer the consequences. If this were an air crash, we would now be interviewing survivors, coworkers and family to identify when each of the participants in the event and the decisions made had slept during the days preceding the event.

Re:Circadian rhythms by PAStheLoD · 2011-05-01 05:35 · Score: 1

If their network guys work like the ones I know, 00:47 is just right before lunch time.
There are human errors, sure, but the worst ones I've seen come from management trying to rush things, so the network guys "just stay until it works", instead of leaving it in a known good state and going to get some rest.

Safe, unless something unexpected happens by superposed · 2011-04-29 09:30 · Score: 1

The nuclear industry claims a chance of major accidents around 1 in 10^7 reactor years, based on this kind of probabilistic analysis. But then we've seen 2 major incidents at western-style nuclear plants (Three Mile Island and Fukushima Daiichi) over a period of about 15,000 reactor-years. The problem is, these studies only account for the risk of simultaneous failures of pre-identified critical components within the engineered system. They don't account for acts of nature or people doing something dumb.
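The gap between the claimed and observed rates can be put in numbers. A rough sketch, taking the figures in the comment at face value (1 in 10^7 per reactor-year, 2 incidents, about 15,000 reactor-years) and assuming accidents follow a Poisson process; the Poisson model and the function name are assumptions for illustration.

```python
import math

CLAIMED_RATE = 1e-7    # accidents per reactor-year, as quoted above
EXPOSURE = 15_000      # reactor-years, as quoted above
OBSERVED = 2           # major incidents, as quoted above


def poisson_at_least(k: int, lam: float) -> float:
    """P(X >= k) for X ~ Poisson(lam)."""
    below = sum(math.exp(-lam) * lam ** i / math.factorial(i) for i in range(k))
    return 1.0 - below


if __name__ == "__main__":
    expected = CLAIMED_RATE * EXPOSURE
    observed_rate = OBSERVED / EXPOSURE
    print(f"expected accidents at the claimed rate: {expected}")
    print(f"P(at least {OBSERVED}) at the claimed rate: {poisson_at_least(OBSERVED, expected):.2e}")
    print(f"observed rate / claimed rate: {observed_rate / CLAIMED_RATE:.0f}")
```

At the claimed rate, seeing two or more major incidents in that exposure would be roughly a one-in-a-million outcome, and the observed rate is over a thousand times the claimed one, which supports the point that the model leaves something out.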

I thought cloud was all automatic by Anonymous Coward · 2011-04-29 10:04 · Score: 0

Damn - we still need those human system and network admins .............grrrrrrrrrrrrrrrrrr

Movie ... by AdiBean · 2011-04-29 12:51 · Score: 1

Hollywood has got to turn this into a movie ... I'd be first in line to buy a ticket

Slashdot Mirror

Amazon EC2 Failure Post-Mortem

117 comments