Slashdot Mirror


More Uptime Problems For Amazon Cloud

1sockchuck writes "An Amazon Web Services data center in northern Virginia lost power Friday night during an electrical storm, causing downtime for numerous customers — including Netflix, which uses an architecture designed to route around problems at a single availability zone. The same data center suffered a power outage two weeks ago and had connectivity problems earlier on Friday."

18 of 183 comments (clear)

  1. Cloud takes down cloud by AlienIntelligence · · Score: 5, Funny

    Nuf said

    --
    For me, it is far better to grasp the Universe as it really is than to persist in delusion
  2. Largest non-hurricane related power outage ever by Anonymous Coward · · Score: 5, Informative

    I live in the affected area and that's what they're saying. May take 7 days for the last person to have their power restored.

    1. Re:Largest non-hurricane related power outage ever by jrmcferren · · Score: 5, Interesting

      That really shouldn't matter though as long as the Data center's generators are running and they can get fuel. It seems that they are not performing the proper testing and maintenance on their switchgear and generators if they are having this much trouble. The last time the data center in the building where I work went down for a power outage was when we had an arc flash in one of the UPS battery cabinets and they had to shut the data center (and the rest of the building's power for that matter) down.

      --
      sudo mod me up
    2. Re:Largest non-hurricane related power outage ever by John+Bresnahan · · Score: 4, Insightful

      Of course, the network only works if every router in between the data center and the customer has power. In a power outage of this size, it's entirely possible that more than one link is down.

    3. Re:Largest non-hurricane related power outage ever by jrmcferren · · Score: 5, Informative

      The automatic transfer switch(es) would be the first component I would check even without knowing anything. In order to maintain the UL listing on the transfer switch, it must be tested monthly. The idea is, if it is tested monthly, everything is operated and is less likely to seize and fail than if the device is not tested. Modern systems can be designed that the generators can start BEFORE the transfer switch operates when in test mode to reduce the impact of the test (miliseconds without power versus 30 seconds or so).

      --
      sudo mod me up
  3. Infrastructure by TubeSteak · · Score: 5, Insightful

    We need to invest trillions in roads, water, and electrical infrastructure to keep this country going.
    If you let the basic building blocks of civilization rot, don't be surprised when everything else follows suit.

    --
    [Fuck Beta]
    o0t!
    1. Re:Infrastructure by rubycodez · · Score: 4, Insightful

      war is the basic building block of our particular civilization. if we waste money on your frivolities, how will we afford war & keep war machine shareholder value?

    2. Re:Infrastructure by tyler_larson · · Score: 4, Interesting

      In my past two jobs and over the past 20 years, we've worked with dozens of independent an unrelated vendors with locations around the country, including Virginia. Of all the locations where these companies have operations, the ones in Virginia have been dramatically, almost comically, more disaster-prone than the rest of the country and even the rest of the world. The running joke in the office is that whenever any vendor or service provider drops offline, we first check the weather in Virginia before checking to see if any of our own systems are offline. Every time, we see a post-mortem a few days later disclosing some failed system or backup or contingency, and every time, they say this problem that will never happen again.

      You'd think that all the failing locations would share a operations center or service provider or even a single city, but it turns out that the only thing these disaster-prone operations have in common is that they're in Virginia. I have no idea why this is the case. But our company has a policy singling out Virginia saying that no mission-critical components are allowed to be based there.

      --
      "With sufficient thrust, pigs fly just fine. However, this is not necessarily a good idea...."
      RFC 1925
  4. Seems like anything takes down the cloud... by Anonymous+Brave+Guy · · Score: 5, Interesting

    It seems that recently, anything can take down the cloud, or at least cause a serious disruption for any of the major cloud providers. I wonder how many more of these it takes before the cloud-skeptics start winning the debates with management a lot more often.

    You can only argue that the extra costs and admin involved with cloud hosting outweigh the extra costs of self-hosting and paying competent IT staff for so long. If you read the various forums after an event like this, the mantra from cloud evangelists already seems to have changed from a general "cloud=reliable, and Google's/Amazon's/whoever's people are smarter than your in house people" to a much more weasel-worded "cloud is realiable as long as you've figured out exactly how to set it all up with proper redundancy etc." If you're going to pay people smart enough to figure that out, and you're not one of the few businesses whose model really does benefit disproportionately from the scalability at a certain stage in its development, why not save a fortune and host everything in-house?

    --
    If you disagree, post your argument. (-1, Overrated) isn't your personal censorship tool for views you don't like.
    1. Re:Seems like anything takes down the cloud... by hawguy · · Score: 5, Insightful

      It seems that recently, anything can take down the cloud, or at least cause a serious disruption for any of the major cloud providers. I wonder how many more of these it takes before the cloud-skeptics start winning the debates with management a lot more often.

      I think it's more because a cloud outage affects thousands of customers, so it has more visibility. When Amazon has problems, the news is reported on Slashdot. When a smaller collocation center has an accidental fire suppression discharge taking hundreds of customers offline, it doesn't get any press coverage at all.

      But the biggest takeaway from this is - never put all of your assets in one region. No matter how much redundancy Amazon builds into a region, a local disaster can still take out the datacenter. That's why they have Availability zones *and* regions. I have some servers in us-east-1a and they weren't affected at all. If they were down, I could bring up my servers in us-west within about an hour. (I could even automate it, but a few hours or even a day of downtime for these servers is no big deal)

  5. What, you thought "cloud" meant "no outage"? by ebunga · · Score: 4, Insightful

    Cloud computing is nothing more than 1960s timesharing services with modern operating systems. Unless you design for resilience, you're not resilient to problems.

    1. Re:What, you thought "cloud" meant "no outage"? by dkf · · Score: 4, Funny

      And 8-track tapes while we're at it.

      We need those tape machines. Stick them in front of the real machines and get something hacked from a Raspberry Pi to spin them back and forth in an interesting pattern, with some extra blinkenlights for good measure, and we'll be able to once again prove to all the management types that we're doing serious computing so they can leave us alone and go back to their golf handicap.

      --
      "Little does he know, but there is no 'I' in 'Idiot'!"
  6. Millions of dollars spent for nothing. by Anonymous Coward · · Score: 5, Interesting

    So this is the second time this month Amazons cloud has gone down, there should be serious questions being asked of the sustainability of this service given the extremely poor uptime record and extremely large customer base.

    They would have spent millions of dollars installing diesel or gas generators and/or battery banks and who knows how much money maintaining and testing it, but when it comes time to actually use it in an emergency, the entire system fails.

    You would think having redundant power would be a fundamental crucial thing to get right in owning and operating a data centre, yet Amazon seems unable to handle this relatively easy task.

    Now before people say "well this was a major storm system that killed 10 people, what do you expect", my response is that cloud computing is expected to do work for customers hundreds and thousands of kilometres/miles from the actual data centre so this is a somewhat crucial thing that we're talking about - millions of people literally depend on these services; that's my first point.

    My second point is it's not like anything happened to the data centre, it simply lost mains energy. It's not like there was a fire, or flood, or the roof blew off the building, or anything like that; they simply lost power and failed to bring all their millions of dollars in equipment up to the task of picking up the load.

    If I were a corporate customer, or even a regular consumer I would be seriously questioning the sustainability of at least Amazons cloud computing, Google and Facebook seem to be able to handle it but not Amazon - granted they don't offer identical products the overall data centres seem to stay up 100 or 99.9999999% of the time unlike Amazons.

    1. Re:Millions of dollars spent for nothing. by hawguy · · Score: 5, Informative

      So this is the second time this month Amazons cloud has gone down, there should be serious questions being asked of the sustainability of this service given the extremely poor uptime record and extremely large customer base.

      They would have spent millions of dollars installing diesel or gas generators and/or battery banks and who knows how much money maintaining and testing it, but when it comes time to actually use it in an emergency, the entire system fails.

      You would think having redundant power would be a fundamental crucial thing to get right in owning and operating a data centre, yet Amazon seems unable to handle this relatively easy task.

      Well, the entire system didn't fail, my servers in us-east-1a weren't affected at all.

      Hardware fails, even well tested hardware... especially in extreme conditions - don't forget that this storm has left millions of people without power, killed at least 10, and caused 3 states to declare an emergency. Amazon may have priority maintenance contracts with their generator and UPS system vendors and fuel delivery contracts, but when a storm like this hits, they vendors are busy keeping government and medical customers online. Rather than spend millions more dollars building redundancy for their redundancy (which adds complexity that can cause a failure itself), Amazon isolates datacenters into availability zones, and has geographically disperse datacenters.

      Customers are free to take advantage of availability zones and regions if they want to (which costs more money), but if they chose not to, they shouldn't blame Amazon.

    2. Re:Millions of dollars spent for nothing. by dbrueck · · Score: 5, Informative

      Sorry, but "Amazon's cloud has gone down" is wildly incorrect. From the sounds of it, *one* of their many data centers went down. We run tons of stuff on AWS and some of our servers were affected but most were not. Most important of all is that we had *zero* service interruption because we deployed our service according to their published best practices, so our traffic was automatically handled in different zones/regions.

      Having managed our own infrastructure in the past, it's these sort of outages at AWS that make us grateful we switched and that continue to convince us it was a good move. It might not be for everybody, but for us it's been a huge win. When we started getting alarms that some of our servers weren't responding, it was so cool to see that the overall service continued on its merry way. I didn't even bother staying up late to babysit things - checked it before bed and checked it again this morning.

      Firing up a VM on EC2 (or any other provider) != architecting for the cloud.

  7. I live nowhere near Va by bugs2squash · · Score: 4, Interesting

    However "Netflix, which uses an architecture designed to route around problems at a single availability zone." seems to have efficiently spread the pain of a North Eastern outage to the rest of the country. Sometimes I think redundancy in solutions is better left turned off.

    --
    Nullius in verba
  8. Wasn't even a big storm by gman003 · · Score: 4, Informative

    I was in it - it was not a particularly bad storm. Heavy winds, lots of cloud-to-cloud lightning, but very little rain or cloud-to-ground lightning. I lost power repeatedly, but it was always back up within seconds. And I'm located way out in a rural area, where the power supply is much more vulnerable (every time a major hurricane hits, I'm usually without power for about a week - bad enough that I bought a small generator).

    According to TFA, they were only without power for half an hour, and that the ongoing problems were related to recovery, not actual power-lossage. So their problems are more "bad disaster planning" than "bad disaster".

    Still, you'd think a major data center would have the usual UPS and generator setup most major data centers have - half an hour without power is something they should have been able to handle. Or at least have enough UPS capacity to cleanly shut down all the machines or migrate the virtual instances to a different datacenter.

  9. Re:My instance was down for 9hrs... by PTBarnum · · Score: 4, Interesting

    There is a gap between technical and marketing requirements here.

    The Amazon infrastructure was initially built to support Amazon retail, and Amazon put a lot of pressure on its engineers to make sure their apps were properly redundant across three or more data centers. At one point, the Amazon infrastructure team used to do "game days" where they would randomly take a data center offline and see what broke. The EC2 infrastructure is mostly independent of retail infrastructure, but it was designed in a similar fashion.

    However, Amazon can't tell their customers how to build apps. The customers build what is familiar to them, and make assumptions about up time of individual servers or data centers. As the OP says, it's "the standard people are used to". Since the customer is always right, Amazon has a marketing need to respond by bringing availability up to those standards, even though it isn't technically necessary.