Amazon EC2 Failure Post-Mortem

Posted by Soulskill on Friday April 29, 2011 @12:50AM from the ted-tripped-over-a-power-cord dept.

CPE1704TKS tips news that Amazon has provided a post-mortem on why EC2 failed. Quoting: "At 12:47 AM PDT on April 21st, a network change was performed as part of our normal AWS scaling activities in a single Availability Zone in the US East Region. The configuration change was to upgrade the capacity of the primary network. During the change, one of the standard steps is to shift traffic off of one of the redundant routers in the primary EBS network to allow the upgrade to happen. The traffic shift was executed incorrectly and rather than routing the traffic to the other router on the primary network, the traffic was routed onto the lower capacity redundant EBS network. For a portion of the EBS cluster in the affected Availability Zone, this meant that they did not have a functioning primary or secondary network because traffic was purposely shifted away from the primary network and the secondary network couldn't handle the traffic level it was receiving."
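
To make the failure mode in the quote concrete, here is a minimal toy model, not anything Amazon has published. The link names, capacities, and load figures are all hypothetical; the post-mortem excerpt gives no numbers. It compares the intended shift (onto the other primary router) with the erroneous one (onto the lower-capacity secondary network).

```python
# Toy model of the traffic shift described in the quote.
# All names and numbers are hypothetical, chosen only for illustration.

def shift_traffic(loads, source, target):
    """Return a new load map with all of source's traffic moved to target."""
    shifted = dict(loads)
    shifted[target] = shifted.get(target, 0) + shifted.get(source, 0)
    shifted[source] = 0
    return shifted

def overloaded(loads, capacity):
    """Return the sorted names of links carrying more than their capacity."""
    return sorted(name for name, load in loads.items() if load > capacity[name])

# Hypothetical capacities and steady-state loads, in arbitrary units.
CAPACITY = {"primary_router_a": 100, "primary_router_b": 100, "secondary_network": 20}
LOADS = {"primary_router_a": 45, "primary_router_b": 45, "secondary_network": 5}

# Intended step: drain router A onto its redundant peer on the primary network.
intended = shift_traffic(LOADS, "primary_router_a", "primary_router_b")

# What the quote describes: the drained traffic lands on the secondary network.
actual = shift_traffic(LOADS, "primary_router_a", "secondary_network")

print("intended overloads:", overloaded(intended, CAPACITY))
print("actual overloads:", overloaded(actual, CAPACITY))
```

With these made-up figures, the intended shift leaves every link within capacity, while the erroneous shift pushes the secondary network well past its limit.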

Re:I realise this is "News for Nerds"... by MagicM · 2011-04-29 00:59 · Score: 4, Informative

Instead of closing off one lane of highway for construction, they closed off all lanes and forced highway traffic to go through town. The roads in town weren't able to handle all the cars. Massive back-ups ensued.

Re:Isn't the point of a secondary network... by mysidia · 2011-04-29 01:27 · Score: 2, Informative

... to be able to handle loads if the primary fails?
No. That's the point of the redundant elements and backup of the primary network.
The secondary network they routed traffic to was designed for a different purpose, and never meant to receive traffic from the primary network.