More Uptime Problems For Amazon Cloud



Posted by Soulskill on Saturday June 30, 2012 @05:40AM from the stormy-weather dept.

1sockchuck writes "An Amazon Web Services data center in northern Virginia lost power Friday night during an electrical storm, causing downtime for numerous customers — including Netflix, which uses an architecture designed to route around problems at a single availability zone. The same data center suffered a power outage two weeks ago and had connectivity problems earlier on Friday."

4 of 183 comments


Largest non-hurricane related power outage ever by Anonymous Coward · 2012-06-30 05:44 · Score: 5, Informative

I live in the affected area and that's what they're saying. May take 7 days for the last person to have their power restored.
1. Re:Largest non-hurricane related power outage ever by jrmcferren · 2012-06-30 06:19 · Score: 5, Informative
  
  The automatic transfer switch(es) would be the first component I would check even without knowing anything. In order to maintain the UL listing on the transfer switch, it must be tested monthly. The idea is, if it is tested monthly, everything is operated and is less likely to seize and fail than if the device is not tested. Modern systems can be designed so that the generators start BEFORE the transfer switch operates when in test mode, to reduce the impact of the test (milliseconds without power versus 30 seconds or so).
  
  --
  sudo mod me up
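A back-of-the-envelope sketch of the two test sequences described in the comment above. The 30-second generator start comes from the comment; the 50 ms switch-over time is an assumed placeholder, not a measured or vendor figure, and the function name is hypothetical.

```python
# GENERATOR_START_S is the "30 seconds or so" from the comment.
# TRANSFER_S is an ASSUMED placeholder for the switch-over itself.
GENERATOR_START_S = 30.0
TRANSFER_S = 0.05


def outage_seconds(prestart_generator: bool) -> float:
    """Time the load spends without power during a transfer-switch test."""
    if prestart_generator:
        # Generator is already running, so only the switch-over is felt.
        return TRANSFER_S
    # Load waits for the generator to come up, then for the switch.
    return GENERATOR_START_S + TRANSFER_S


if __name__ == "__main__":
    print(f"generator started after transfer:  {outage_seconds(False):.2f} s")
    print(f"generator started before transfer: {outage_seconds(True):.2f} s")
```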
Re:Millions of dollars spent for nothing. by hawguy · 2012-06-30 06:10 · Score: 5, Informative

So this is the second time this month Amazon's cloud has gone down. There should be serious questions asked about the sustainability of this service, given the extremely poor uptime record and extremely large customer base.
They would have spent millions of dollars installing diesel or gas generators and/or battery banks and who knows how much money maintaining and testing it, but when it comes time to actually use it in an emergency, the entire system fails.
You would think having redundant power would be a fundamental crucial thing to get right in owning and operating a data centre, yet Amazon seems unable to handle this relatively easy task.
Well, the entire system didn't fail, my servers in us-east-1a weren't affected at all.
Hardware fails, even well tested hardware... especially in extreme conditions - don't forget that this storm has left millions of people without power, killed at least 10, and caused 3 states to declare an emergency. Amazon may have priority maintenance contracts with their generator and UPS system vendors and fuel delivery contracts, but when a storm like this hits, those vendors are busy keeping government and medical customers online. Rather than spend millions more dollars building redundancy for their redundancy (which adds complexity that can cause a failure itself), Amazon isolates datacenters into availability zones, and has geographically dispersed datacenters.
Customers are free to take advantage of availability zones and regions if they want to (which costs more money), but if they choose not to, they shouldn't blame Amazon.
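To make the availability-zone tradeoff concrete, here is a toy capacity model, not anything from Amazon's tooling. The function names are hypothetical, and the zone names other than us-east-1a are illustrative assumptions. It spreads instances round-robin across zones and counts how many survive if one zone loses power.

```python
from collections import Counter
from itertools import cycle


def spread(instance_count, zones):
    """Assign instances to zones round-robin; returns Counter of zone -> count."""
    placement = Counter()
    for _, zone in zip(range(instance_count), cycle(zones)):
        placement[zone] += 1
    return placement


def surviving(placement, failed_zone):
    """Number of instances still running after one zone goes dark."""
    return sum(n for zone, n in placement.items() if zone != failed_zone)


if __name__ == "__main__":
    # Illustrative zone names (assumption); only us-east-1a appears in the thread.
    multi = spread(6, ["us-east-1a", "us-east-1b", "us-east-1c"])
    single = spread(6, ["us-east-1b"])
    print("multi-zone survivors: ", surviving(multi, "us-east-1b"))
    print("single-zone survivors:", surviving(single, "us-east-1b"))
```

With six instances in three zones, losing one zone leaves four running; with all six in the failed zone, nothing is left. The extra cost mentioned above is the price of running enough spare capacity in the surviving zones.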
Re:Millions of dollars spent for nothing. by dbrueck · 2012-06-30 07:09 · Score: 5, Informative

Sorry, but "Amazon's cloud has gone down" is wildly incorrect. From the sounds of it, *one* of their many data centers went down. We run tons of stuff on AWS and some of our servers were affected but most were not. Most important of all is that we had *zero* service interruption because we deployed our service according to their published best practices, so our traffic was automatically handled in different zones/regions.
Having managed our own infrastructure in the past, it's these sorts of outages at AWS that make us grateful we switched and that continue to convince us it was a good move. It might not be for everybody, but for us it's been a huge win. When we started getting alarms that some of our servers weren't responding, it was so cool to see that the overall service continued on its merry way. I didn't even bother staying up late to babysit things - checked it before bed and checked it again this morning.
Firing up a VM on EC2 (or any other provider) != architecting for the cloud.
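For readers wondering what "traffic was automatically handled in different zones" can look like, here is a minimal sketch of health-check-based routing. It is an illustration under assumptions, not the commenter's setup and not an AWS API: the `Backend` type, the addresses, the zone names, and the health-check callback are all hypothetical.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Backend:
    address: str  # hypothetical address, for illustration only
    zone: str     # availability zone the server runs in


def pick_backend(backends, is_healthy, rng=random):
    """Return a random backend that passes the health check.

    Raises RuntimeError if every backend in every zone is unhealthy.
    """
    candidates = [b for b in backends if is_healthy(b)]
    if not candidates:
        raise RuntimeError("no healthy backends in any zone")
    return rng.choice(candidates)


if __name__ == "__main__":
    backends = [
        Backend("10.0.1.10", "us-east-1a"),
        Backend("10.0.2.10", "us-east-1b"),
        Backend("10.0.3.10", "us-east-1c"),
    ]
    zones_down = {"us-east-1b"}  # pretend this zone lost power
    chosen = pick_backend(backends, lambda b: b.zone not in zones_down)
    print("routing request to", chosen.address, "in", chosen.zone)
```

A real deployment would typically delegate this to a load balancer or DNS failover rather than application code; the point is only that requests skip a dead zone without anyone staying up to babysit them.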