Slashdot Mirror


Amazon EC2 Failure Post-Mortem

CPE1704TKS tips news that Amazon has provided a post-mortem on why EC2 failed. Quoting: "At 12:47 AM PDT on April 21st, a network change was performed as part of our normal AWS scaling activities in a single Availability Zone in the US East Region. The configuration change was to upgrade the capacity of the primary network. During the change, one of the standard steps is to shift traffic off of one of the redundant routers in the primary EBS network to allow the upgrade to happen. The traffic shift was executed incorrectly and rather than routing the traffic to the other router on the primary network, the traffic was routed onto the lower capacity redundant EBS network. For a portion of the EBS cluster in the affected Availability Zone, this meant that they did not have a functioning primary or secondary network because traffic was purposely shifted away from the primary network and the secondary network couldn't handle the traffic level it was receiving."

19 of 117 comments

  1. I realise this is "News for Nerds"... by Haedrian · · Score: 3, Funny

    But can I get an understandable car analogy here?

    1. Re:I realise this is "News for Nerds"... by MagicM · · Score: 4, Informative

      Instead of closing off one lane of highway for construction, they closed off all lanes and forced highway traffic to go through town. The roads in town weren't able to handle all the cars. Massive back-ups ensued.

    2. Re:I realise this is "News for Nerds"... by RealGene · · Score: 5, Funny

      ..and according to http://it.slashdot.org/story/11/04/29/0254215/Amazon-EC2-Crash-Caused-Data-Loss, the DPW mistakenly pushed some of the cars into the old abandoned quarry.

      --
      Mission: To provide products that consume time and energy as entertainingly as permitted by the laws of thermodynamics.
    3. Re:I realise this is "News for Nerds"... by kingsqueak · · Score: 4, Funny

      Instead of the usual commuter rail line, we've had to do some maintenance causing us to provide a single Yugo as transport for the NY morning rush.

      After packing 25 angry commuters into the Yugo we left a few hundred thousand stranded on the platform, ping-ponging between the parking lot and home, completely confused how they would get to work.

      In addition to that, unfortunately the Yugo couldn't handle the added weight of the passengers and the leaf springs shattered all over the ground. So the 25 passengers we initially planned for were left trapped, to die, inside of the disabled Yugo. They all starved in the days it took us to realize the Yugo never left the station parking lot.

      We are sorry for any inconvenience this may have caused and have upgraded to AAA Gold status to prevent any further future disruptions. This will ensure that at least 25 people will actually reach their destinations should this occur again, though they may need to ride on a flat-bed to get there.

    4. Re:I realise this is "News for Nerds"... by operator_error · · Score: 2

      A classic Dilbert might be useful here:

      http://dilbert.com/strips/comic/1995-02-26/

    5. Re:I realise this is "News for Nerds"... by yakovlev · · Score: 2

      Traffic was diverted from a major highway onto a 2-lane road. This caused the buses to run late.

      Because the buses were running late, everyone decided to take their own car to work. This further increased the amount of traffic on the tiny road.

      The cops figured out that everyone was on the wrong road, and diverted traffic onto another freeway. However, by this point everyone was already taking their cars, so diverting to the other freeway didn't completely fix the problem.

      All this traffic indirectly caused minor traffic problems in neighboring cities, because all the traffic cops in those cities were covering the traffic nightmare in this city.

      Eventually, the cops got everyone to stop getting on the roads, and piecemeal managed to get people where they were going, which eventually cleaned things up.

  2. Re:AOL's 19-hour outage by Mister Fright · · Score: 2

    No one. No one else remembers AOL.

  3. Amazon issues 10-day service credit by kriston · · Score: 4, Interesting

    Dear AWS Customer,

    Starting at 12:47AM PDT on April 21st, there was a service disruption (for a period of a few hours up to a few days) for Amazon EC2 and Amazon RDS that primarily involved a subset of the Amazon Elastic Block Store ("EBS") volumes in a single Availability Zone within our US East Region. You can read our detailed summary of the event here:
    http://aws.amazon.com/message/65648

    We've identified that you had an attached EBS volume or a running RDS database instance in the affected Availability Zone at the time of the disruption. Regardless of whether your resources and application were impacted, we are going to provide a 10 day credit (for the period 4/18-4/27) equal to 100% of your usage of EBS Volumes, EC2 Instances and RDS database instances that were running in the affected Availability Zone. This credit will be automatically applied to your April bill, and you don't need to do anything to receive it.
    You can see your service credit by logging into your AWS Account Activity page after you receive your upcoming billing statement.

    Last, but certainly not least, we want to apologize. We know how critical the services we provide are to our customers' businesses and we will do everything we can to learn from this event and use it to drive improvement across our services.

    Sincerely,
    The Amazon Web Services Team

    This message was produced and distributed by Amazon Web Services, LLC, 410 Terry Avenue North, Seattle, Washington 98109-5210

    --

    Kriston

  4. At least they admit it by jesseck · · Score: 4, Insightful

    I commend Amazon for providing us with this information. Yes, bad things happened, and data is gone forever. Amazon knows what happened and why, and I'm sure they will implement controls to prevent this again. I doubt we'll hear as much from Sony, though.

    1. Re:At least they admit it by david.emery · · Score: 4, Insightful

      We all benefit from these kinds of disclosures. I remember Google posting post-mortem analyses of some of their failures. Even Microsoft provided information on their Sidekick meltdown. This does seem to be the 'typical' melange of a human error and cascading consequences.

      Someone first said, "You learn much more from failure than you do from success." If nothing else, it's the thesis of the classic Petroski book, "To Engineer Is Human: The Role of Failure in Successful Design" http://www.amazon.com/Engineer-Human-Failure-Successful-Design/dp/0679734163 (If you haven't read this, you should!!)

      And I'm also reminded of a core principle from safety-critical system design, that you cannot provide 100% safety. The best you can do is a combination of probabilistic analysis against known hazards. As a Boeing 777 safety engineer told me, "9 9's of safety, i.e. a chance of failure of 10^-9, applied over the expected flying hours of the 777 fleet, still means a 50-50 chance of an aircraft falling out of the sky." That kind of reasoning also applies to the current Japanese nuke plant failure...
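
      For anyone who wants to check the arithmetic behind that quote, here is a rough sketch. It assumes the 10^-9 figure is a probability of failure per flight hour and that hours are independent; neither assumption is stated in the quote, and the function names are made up for illustration.

```python
import math

def prob_at_least_one_failure(p_per_hour: float, hours: float) -> float:
    """Chance of one or more failures over `hours`, i.e. 1 - (1 - p)^hours.

    Uses log1p/expm1 so the result stays accurate when p is tiny.
    """
    return -math.expm1(hours * math.log1p(-p_per_hour))

def hours_for_probability(p_per_hour: float, target: float) -> float:
    """Hours needed before the cumulative failure chance reaches `target`."""
    return math.log1p(-target) / math.log1p(-p_per_hour)

# Assumed interpretation: "9 9's" means a 1e-9 failure chance per flight hour.
P_PER_HOUR = 1e-9

# Fleet hours at which the odds of at least one failure become 50-50.
break_even_hours = hours_for_probability(P_PER_HOUR, 0.5)

if __name__ == "__main__":
    print(f"50-50 point: about {break_even_hours:.3g} fleet flight hours")
```

      Under those assumptions the 50-50 point works out to roughly ln(2) / 10^-9 hours, so on the order of several hundred million fleet flight hours. Whether the real 777 fleet total is in that range is not something the quote tells us.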

    2. Re:At least they admit it by afex · · Score: 2

      this has gone mildly offtopic, but as a PSN user i just wanted to chime and say the following...

      I can't STAND when people say 'its free, so its ok if it goes down.' When i purchased a PS3, the PSN was a FEATURE that i considered when i bought it. As such, it's not really "free", its more like it was wrapped into the MSRP. By your logic, they should be able to take away the entire network for GOOD and everyone should be completely happy. is this true? Heck, let's start pulling out other features that you got for 'free' as well. I mean geez, I heard that no one uses otherOS, lets just...pull..that...oh shit.

    3. Re:At least they admit it by the eric conspiracy · · Score: 2

      The issue is not uptime. It is the loss of sensitive data. If Sony is holding personal data they have an obligation to protect that data.

  5. Re:Isn't the point of a secondary network... by mysidia · · Score: 2, Informative

    ... to be able to handle loads if the primary fails?

    No. That's the point of the redundant elements and backup of the primary network.

    The secondary network they routed traffic to was designed for a different purpose, and never meant to receive traffic from the primary network.

  6. Can we get this in non-Amazon speak by gad_zuki! · · Score: 3, Interesting

    What is an EBS? Is it really just a Xen or VMware disk image? Which data center corresponds with each availability zone? What are they using for storage, iSCSI targets on a SAN?

  7. Re:That doesnt explain anything by Darth_brooks · · Score: 2

    "At 12:30 PM PDT on April 24, we had finished the volumes that we could recover in this way and had recovered all but 1.04% of the affected volumes. At this point, the team began forensics on the remaining volumes which had suffered machine failure and for which we had not been able to take a snapshot. At 3:00 PM PDT, the team began restoring these. Ultimately, 0.07% of the volumes in the affected Availability Zone could not be restored for customers in a consistent state."

    --
    There are some people that if they don't know, you can't tell 'em.
  8. Re:Oops by Whalou · · Score: 2

    Kudos to Amazon for rapidly explaining, at length, what happened.

    Unlike some other company... *cough* Sony *cough*

    --
    English is not this .sig mother tongue...
  9. Re:Oops by DrXym · · Score: 2

    Sony hasn't fixed their issue. Kind of hard to have a post mortem while the solution is still ongoing. There has been plenty of extrapolation and bullshit in the information vacuum surrounding the attack though. So when things return to normality it would be in their interest to provide a decent technical overview of what happened, the safeguards that were there before, why they failed and what steps have been made since to improve things.

  10. Re:Where is their testing lab? by TrevorDoom · · Score: 2

    Have you ever worked in a real environment?

    There is ALWAYS a difference between test and production. No matter how many test cases and iterations of changes that you go through, there is always a non-zero percent chance that the change in production will behave differently.
    This is why most companies require fall-back procedures for any production change in addition to testing.
    It sounds like it may have taken them longer than some might be comfortable with to reach the point where they did roll back changes...but I'm sure that this change tested as okay in all of their test cases.

  11. Re:pure and utter BS by bruce_the_loon · · Score: 2

    Go and read the entire notice, not just the pathetic snippet a bad submitter used. Makes more sense.

    Also, this is a storage network, not an access network. Effectively it's like pulling the SAS cable out of the RAID card while the machines are running.

    --
    Trying to become famous by taking photos. Visit my homepage please.