Slashdot Mirror


More Uptime Problems For Amazon Cloud

1sockchuck writes "An Amazon Web Services data center in northern Virginia lost power Friday night during an electrical storm, causing downtime for numerous customers — including Netflix, which uses an architecture designed to route around problems at a single availability zone. The same data center suffered a power outage two weeks ago and had connectivity problems earlier on Friday."

24 of 183 comments

  1. Cloud takes down cloud by AlienIntelligence · · Score: 5, Funny

    Nuf said

    --
    For me, it is far better to grasp the Universe as it really is than to persist in delusion
  2. Largest non-hurricane related power outage ever by Anonymous Coward · · Score: 5, Informative

    I live in the affected area and that's what they're saying. May take 7 days for the last person to have their power restored.

    1. Re:Largest non-hurricane related power outage ever by jrmcferren · · Score: 5, Interesting

      That really shouldn't matter though as long as the Data center's generators are running and they can get fuel. It seems that they are not performing the proper testing and maintenance on their switchgear and generators if they are having this much trouble. The last time the data center in the building where I work went down for a power outage was when we had an arc flash in one of the UPS battery cabinets and they had to shut the data center (and the rest of the building's power for that matter) down.

      --
      sudo mod me up
    2. Re:Largest non-hurricane related power outage ever by John Bresnahan · · Score: 4, Insightful

      Of course, the network only works if every router in between the data center and the customer has power. In a power outage of this size, it's entirely possible that more than one link is down.

    3. Re:Largest non-hurricane related power outage ever by jrmcferren · · Score: 5, Informative

      The automatic transfer switch(es) would be the first component I would check even without knowing anything. In order to maintain the UL listing on the transfer switch, it must be tested monthly. The idea is, if it is tested monthly, everything is operated and is less likely to seize and fail than if the device is not tested. Modern systems can be designed so that the generators start BEFORE the transfer switch operates when in test mode, to reduce the impact of the test (milliseconds without power versus 30 seconds or so).

      --
      sudo mod me up
    4. Re:Largest non-hurricane related power outage ever by fuzzyfuzzyfungus · · Score: 3, Interesting

      The problem is that a lot of people cheap out on their backup power. Generators and UPSes are expensive.

      I wonder, in comparing the price/performance numbers on the invoices from Dell and the invoices from APC (hint, one of these has Moore's law at its back, the other... Doesn't.) what it would take in terms of hardware pricing and software system reliability design to make these backup power systems economically obsolete for most of the 'bulk' data-shoveling and HTTP cruft that keep the tubes humming...

      Obviously, if your software doesn't allow any sort of elegant failover, or you paid a small fortune per core, redundant PSUs, UPSes, generators, and all the rest make perfect sense. If, however, your software can tolerate a hardware failure and the price of silicon and storage is plummeting and the price of electrical gear that is going to spend most of its life generating heat and maintenance bills isn't, it becomes interesting to consider the point at which the 'Eh, fuck it. Move the load to somewhere where the lights are still on until the utility guys figure it out.' theory of backup power becomes viable.

    5. Re:Largest non-hurricane related power outage ever by Salgak1 · · Score: 3, Informative

      Well, as of current reports. . . . 2.5 million are without power in Virginia, 800 Thousand in Maryland, 400+ thousand in DC. I've seen numbers in the 3.5 million region between Ohio and New Jersey. We got power back early this morning ~0400, but we STILL don't have phone, net, or cable at home. The real question, since some areas in DC Metro are not supposed to get power back for nearly a week is. . . . do the emergency fuel generators have sufficient fuel bunkers ???

  3. Infrastructure by TubeSteak · · Score: 5, Insightful

    We need to invest trillions in roads, water, and electrical infrastructure to keep this country going.
    If you let the basic building blocks of civilization rot, don't be surprised when everything else follows suit.

    --
    [Fuck Beta]
    o0t!
    1. Re:Infrastructure by rubycodez · · Score: 4, Insightful

      war is the basic building block of our particular civilization. if we waste money on your frivolities, how will we afford war & keep war machine shareholder value?

    2. Re:Infrastructure by Sir_Sri · · Score: 3, Informative

      In the case of Panama it's control of the Panama Canal Zone, which by itself isn't a natural economic resource, but it saves a crap load of them in reduced shipping costs.

      Though true, wars are generally fought for gold, glory, and god, as one of my past history teachers used to say. I think what she meant is that wars are *started* for gold, glory, or god. Afghanistan was very much god and glory (for Al Qaeda and the Taliban at least), and for them it was in part about natural resources and the control, benefit, and possession of the Islamic caliphate's (yes, that doesn't actually exist, but that's the kind of level they were thinking at) resources.

      The invasion of Grenada is more tricky. By itself Grenada isn't anything, but a major military airfield in Grenada could cover all of the oil export ports from Venezuela, and there was the matter of US prestige on the issue.

    3. Re:Infrastructure by tyler_larson · · Score: 4, Interesting

      In my past two jobs and over the past 20 years, we've worked with dozens of independent and unrelated vendors with locations around the country, including Virginia. Of all the locations where these companies have operations, the ones in Virginia have been dramatically, almost comically, more disaster-prone than the rest of the country and even the rest of the world. The running joke in the office is that whenever any vendor or service provider drops offline, we first check the weather in Virginia before checking to see if any of our own systems are offline. Every time, we see a post-mortem a few days later disclosing some failed system or backup or contingency, and every time, they say this problem will never happen again.

      You'd think that all the failing locations would share an operations center or service provider or even a single city, but it turns out that the only thing these disaster-prone operations have in common is that they're in Virginia. I have no idea why this is the case. But our company has a policy singling out Virginia saying that no mission-critical components are allowed to be based there.

      --
      "With sufficient thrust, pigs fly just fine. However, this is not necessarily a good idea...."
      RFC 1925
  4. Seems like anything takes down the cloud... by Anonymous Brave Guy · · Score: 5, Interesting

    It seems that recently, anything can take down the cloud, or at least cause a serious disruption for any of the major cloud providers. I wonder how many more of these it takes before the cloud-skeptics start winning the debates with management a lot more often.

    You can only argue that the extra costs and admin involved with cloud hosting outweigh the extra costs of self-hosting and paying competent IT staff for so long. If you read the various forums after an event like this, the mantra from cloud evangelists already seems to have changed from a general "cloud=reliable, and Google's/Amazon's/whoever's people are smarter than your in house people" to a much more weasel-worded "cloud is reliable as long as you've figured out exactly how to set it all up with proper redundancy etc." If you're going to pay people smart enough to figure that out, and you're not one of the few businesses whose model really does benefit disproportionately from the scalability at a certain stage in its development, why not save a fortune and host everything in-house?

    --
    If you disagree, post your argument. (-1, Overrated) isn't your personal censorship tool for views you don't like.
    1. Re:Seems like anything takes down the cloud... by tnk1 · · Score: 3, Interesting

      And this is ridiculous. How are they not in a datacenter with backup diesel generators and redundant internet egress points? Even the smallest service business I have worked for had this. All they need to do is buy space in a place like Qwest or even better, Equinix and it's all covered. A company like Amazon shouldn't be taken out by power issues of all things. They are either cheaping out or their systems/datacenter leads need to be replaced.

    2. Re:Seems like anything takes down the cloud... by hawguy · · Score: 5, Insightful

      It seems that recently, anything can take down the cloud, or at least cause a serious disruption for any of the major cloud providers. I wonder how many more of these it takes before the cloud-skeptics start winning the debates with management a lot more often.

      I think it's more because a cloud outage affects thousands of customers, so it has more visibility. When Amazon has problems, the news is reported on Slashdot. When a smaller colocation center has an accidental fire suppression discharge taking hundreds of customers offline, it doesn't get any press coverage at all.

      But the biggest takeaway from this is - never put all of your assets in one region. No matter how much redundancy Amazon builds into a region, a local disaster can still take out the datacenter. That's why they have Availability zones *and* regions. I have some servers in us-east-1a and they weren't affected at all. If they were down, I could bring up my servers in us-west within about an hour. (I could even automate it, but a few hours or even a day of downtime for these servers is no big deal)
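      The manual "bring it up in another region" step described above could be automated along these lines. This is only a minimal sketch: the region names, the `status` table, and the `pick_region` helper are illustrative assumptions, not AWS APIs, and a real setup would replace the lambda with an actual health probe.

```python
from typing import Callable, Optional, Sequence


def pick_region(preferred: Sequence[str],
                is_healthy: Callable[[str], bool]) -> Optional[str]:
    """Return the first region, in preference order, that passes the health check."""
    for region in preferred:
        if is_healthy(region):
            return region
    return None  # every region failed its check


# Hypothetical status table standing in for a real health probe.
status = {"us-east-1": False, "us-west-1": True}

chosen = pick_region(["us-east-1", "us-west-1"],
                     lambda region: status.get(region, False))
print(chosen)
```

      Keeping the health check as a plain callable means the same selection logic works whether the probe is an HTTP ping, a monitoring query, or a person flipping a flag.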

  5. What, you thought "cloud" meant "no outage"? by ebunga · · Score: 4, Insightful

    Cloud computing is nothing more than 1960s timesharing services with modern operating systems. Unless you design for resilience, you're not resilient to problems.

    1. Re:What, you thought "cloud" meant "no outage"? by dkf · · Score: 4, Funny

      And 8-track tapes while we're at it.

      We need those tape machines. Stick them in front of the real machines and get something hacked from a Raspberry Pi to spin them back and forth in an interesting pattern, with some extra blinkenlights for good measure, and we'll be able to once again prove to all the management types that we're doing serious computing so they can leave us alone and go back to their golf handicap.

      --
      "Little does he know, but there is no 'I' in 'Idiot'!"
  6. Millions of dollars spent for nothing. by Anonymous Coward · · Score: 5, Interesting

    So this is the second time this month Amazon's cloud has gone down. There should be serious questions asked about the sustainability of this service, given the extremely poor uptime record and extremely large customer base.

    They would have spent millions of dollars installing diesel or gas generators and/or battery banks and who knows how much money maintaining and testing it, but when it comes time to actually use it in an emergency, the entire system fails.

    You would think having redundant power would be a fundamental crucial thing to get right in owning and operating a data centre, yet Amazon seems unable to handle this relatively easy task.

    Now before people say "well this was a major storm system that killed 10 people, what do you expect", my response is that cloud computing is expected to do work for customers hundreds and thousands of kilometres/miles from the actual data centre so this is a somewhat crucial thing that we're talking about - millions of people literally depend on these services; that's my first point.

    My second point is it's not like anything happened to the data centre, it simply lost mains energy. It's not like there was a fire, or flood, or the roof blew off the building, or anything like that; they simply lost power and failed to bring all their millions of dollars in equipment up to the task of picking up the load.

    If I were a corporate customer, or even a regular consumer, I would be seriously questioning the sustainability of at least Amazon's cloud computing. Google and Facebook seem to be able to handle it but not Amazon; granted, they don't offer identical products, but their data centres overall seem to stay up 100 or 99.9999999% of the time, unlike Amazon's.
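    For a sense of scale, an availability percentage can be converted into allowed downtime per year. The sketch below is plain arithmetic, assuming a 365-day year; the function name is made up for illustration and the figures are not measurements of any provider.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumes a non-leap year


def downtime_seconds_per_year(availability_percent: float) -> float:
    """Seconds of downtime per year permitted by an availability percentage."""
    return SECONDS_PER_YEAR * (1 - availability_percent / 100)


for pct in (99.9, 99.99, 99.9999999):
    print(f"{pct}% -> {downtime_seconds_per_year(pct):.3f} s/year")
```

    By this arithmetic, 99.9% allows roughly 8.76 hours of downtime a year, while 99.9999999% allows only about 0.03 seconds, so a single half-hour outage already rules out the higher figure for that year.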

    1. Re:Millions of dollars spent for nothing. by hawguy · · Score: 5, Informative

      So this is the second time this month Amazon's cloud has gone down. There should be serious questions asked about the sustainability of this service, given the extremely poor uptime record and extremely large customer base.

      They would have spent millions of dollars installing diesel or gas generators and/or battery banks and who knows how much money maintaining and testing it, but when it comes time to actually use it in an emergency, the entire system fails.

      You would think having redundant power would be a fundamental crucial thing to get right in owning and operating a data centre, yet Amazon seems unable to handle this relatively easy task.

      Well, the entire system didn't fail, my servers in us-east-1a weren't affected at all.

      Hardware fails, even well tested hardware... especially in extreme conditions - don't forget that this storm has left millions of people without power, killed at least 10, and caused 3 states to declare an emergency. Amazon may have priority maintenance contracts with their generator and UPS system vendors and fuel delivery contracts, but when a storm like this hits, those vendors are busy keeping government and medical customers online. Rather than spend millions more dollars building redundancy for their redundancy (which adds complexity that can cause a failure itself), Amazon isolates datacenters into availability zones, and has geographically dispersed datacenters.

      Customers are free to take advantage of availability zones and regions if they want to (which costs more money), but if they choose not to, they shouldn't blame Amazon.

    2. Re:Millions of dollars spent for nothing. by dbrueck · · Score: 5, Informative

      Sorry, but "Amazon's cloud has gone down" is wildly incorrect. From the sounds of it, *one* of their many data centers went down. We run tons of stuff on AWS and some of our servers were affected but most were not. Most important of all is that we had *zero* service interruption because we deployed our service according to their published best practices, so our traffic was automatically handled in different zones/regions.

      Having managed our own infrastructure in the past, it's these sort of outages at AWS that make us grateful we switched and that continue to convince us it was a good move. It might not be for everybody, but for us it's been a huge win. When we started getting alarms that some of our servers weren't responding, it was so cool to see that the overall service continued on its merry way. I didn't even bother staying up late to babysit things - checked it before bed and checked it again this morning.

      Firing up a VM on EC2 (or any other provider) != architecting for the cloud.

  7. I live nowhere near Va by bugs2squash · · Score: 4, Interesting

    However "Netflix, which uses an architecture designed to route around problems at a single availability zone." seems to have efficiently spread the pain of a North Eastern outage to the rest of the country. Sometimes I think redundancy in solutions is better left turned off.

    --
    Nullius in verba
  8. it seems like the switching system failed by Joe_Dragon · · Score: 3, Informative

    It seems like the switching system failed and/or the backup power generators did not kick on.

    Maybe natural gas ones are better. The firehouses have them. I also see them at a big power sub station as well.

  9. Wasn't even a big storm by gman003 · · Score: 4, Informative

    I was in it - it was not a particularly bad storm. Heavy winds, lots of cloud-to-cloud lightning, but very little rain or cloud-to-ground lightning. I lost power repeatedly, but it was always back up within seconds. And I'm located way out in a rural area, where the power supply is much more vulnerable (every time a major hurricane hits, I'm usually without power for about a week - bad enough that I bought a small generator).

    According to TFA, they were only without power for half an hour, and the ongoing problems were related to recovery, not actual power-lossage. So their problems are more "bad disaster planning" than "bad disaster".

    Still, you'd think a major data center would have the usual UPS and generator setup most major data centers have - half an hour without power is something they should have been able to handle. Or at least have enough UPS capacity to cleanly shut down all the machines or migrate the virtual instances to a different datacenter.

  10. Re:My instance was down for 9hrs... by PTBarnum · · Score: 4, Interesting

    There is a gap between technical and marketing requirements here.

    The Amazon infrastructure was initially built to support Amazon retail, and Amazon put a lot of pressure on its engineers to make sure their apps were properly redundant across three or more data centers. At one point, the Amazon infrastructure team used to do "game days" where they would randomly take a data center offline and see what broke. The EC2 infrastructure is mostly independent of retail infrastructure, but it was designed in a similar fashion.

    However, Amazon can't tell their customers how to build apps. The customers build what is familiar to them, and make assumptions about up time of individual servers or data centers. As the OP says, it's "the standard people are used to". Since the customer is always right, Amazon has a marketing need to respond by bringing availability up to those standards, even though it isn't technically necessary.
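    The "game day" idea above can be tried on paper before anyone pulls a plug: take each data center offline in turn and list the apps left with no replica. This is a toy sketch, not Amazon's tooling; the app names, data center names, and `game_day` helper are all hypothetical, and it walks every data center rather than picking one at random.

```python
from typing import Dict, List, Set


def game_day(placements: Dict[str, Set[str]], offline: str) -> List[str]:
    """Return the apps that would have no running replica if `offline` went dark."""
    return sorted(app for app, dcs in placements.items()
                  if not (dcs - {offline}))


# Hypothetical placement of apps onto data centers.
placements = {
    "catalog": {"dc-a", "dc-b", "dc-c"},
    "checkout": {"dc-a", "dc-b"},
    "legacy-report": {"dc-a"},
}

for dc in ("dc-a", "dc-b", "dc-c"):
    print(dc, "->", game_day(placements, dc))
```

    Here only `legacy-report` breaks, and only when `dc-a` goes down, which is exactly the kind of single-data-center assumption the comment describes.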

  11. Stupid: Military is Insurance by Anonymous Coward · · Score: 3, Insightful

    What are you, 14? Democracies don't like War, because they don't like their sons, fathers, brothers, and husbands getting killed. It generally takes quite a lot to motivate Democracies into war, because of the hatred of casualties. Even when it is the best option. Example: going to war against Hitler in 1934, or 1936, or in 1938.

    Out here in the real world, the sum total of human experience suggests a strong military is like insurance or a seat belt. You hope you never have to use it, but it's a godsend if you need it. Indeed having a strong military deters attacks. Nobody goes down to Venice Beach to pick fights with body builders, or down to the Gracie's gym to start fights.

    Like insurance, working out, eating right, avoiding bad areas, a strong military is a pain in the ass. It costs a lot. It is a pain and non-productive to maintain. And sure, you could save a lot by going without auto or health insurance. You could eat more cheaply at McDonald's than cooking healthy meals at home. It's cheaper to live in the ghetto than a nice area.

    As far as market value of defense stocks, the market capitalization of Lockheed Martin is 28.27 Billion, of Apple Computer 546.08 Billion. The market value of L'Oreal at 54.83 billion is about twice that of Lockheed Martin, suggesting lipstick pays a lot more than military avionics. Defense firms since their inception have been very cyclical, made relatively little money, and are merging like crazy as war spending winds down. But unless you're going to change human nature with Harry Potter's magic wand, carrying otherwise unprofitable defense firms is worth it because making drones, airplanes, missiles, tanks, ships, and helicopters to kill well-armed enemies is a very narrow engineering niche with knowledge quickly lost.

    As soon as your computer runs on unicorn farts and rainbows, we can all forget about dominance in the Persian Gulf and other oil areas. Until then, I'd prefer to drive to work and run the AC not live like a dirty smelly hippie. That AC making life bearable in 118F Kansas? Runs on oil not tree-hugging and drum circles.