Slashdot Mirror


Car Hits Utility Pole, Takes Out EC2 Datacenter

1sockchuck writes "An Amazon cloud computing data center lost power Tuesday when a vehicle struck a nearby utility pole. When utility power was lost, a transfer switch in the data center failed to properly manage the shift to backup power. Amazon said a "small number" of EC2 customers lost service for about an hour, but the downtime followed three power outages last week at data centers supporting EC2 customers. Tuesday's incident is reminiscent of a 2007 outage at a Dallas data center when a truck crash took out a power transformer."

8 of 250 comments

  1. Farmville updates on Facebook stop by kriston · · Score: 5, Insightful

    And, as a result, Farmville/Mafiawars updates on Facebook temporarily stop.
    Nothing of value was lost.

    --

    Kriston

  2. It's failure on multiple levels by GilliamOS · · Score: 5, Insightful

    Amazon for not load-testing their emergency backup power on a regular basis and not having more than one connection to the power grid, and the power grid for not having redundancies. The age of our power grid is really beginning to show on so many levels that this is going to become a lot more common over the coming years.

    --
    "There might be intelligent beings created by God in outer space even if there are none here on Earth." -Anonymous
    1. Re:It's failure on multiple levels by OnlineAlias · · Score: 4, Insightful

      You said it. They failed to test. I design/run datacenters, and have had exactly this kind of thing happen recently. No outage, hardly anyone even noticed. My most critical stuff runs active/active out of multiple data centers...you could nuke one of them and everything would still be up.

      I'm actually a little blown away that the all-powerful Amazon could possibly let this kind of thing happen. They are supposed to be a pro team; a power failure is high school ball.

    2. Re:It's failure on multiple levels by TubeSteak · · Score: 4, Insightful

      Only one switch out of many failed, due to it being set up from the factory incorrectly. The rest of the system switched over properly. I would say that is pretty good considering the data center size and number of switches needed for redundancy.

      Sounds like Amazon's tech monkeys didn't do their job when they received the hardware from the factory.
      Or is it normal to just plug in mission-critical hardware and not check that it is set up properly?

      "We have already made configuration changes to the switch which will prevent it from misinterpreting any similar event in the future and have done a full audit to ensure no other switches in any of our data centers have this incorrect setting," Amazon reported.

      I guess TFA answered that question.
      If they're smart, they'll be creating policies for those types of audits to be done up front instead of after a failure.

      --
      [Fuck Beta]
      o0t!
  3. Re:An untested DR plan is a worthless DR plan by Albanach · · Score: 3, Insightful

    Seriously, Amazon screwed up in a fairly major way with this.

    What's more upsetting is this: If Amazon doesn't have working disaster recovery, what do other websites/companies have?

    What on earth leads you to suggest they don't have working disaster recovery? They experienced some disparate power outages and say they're implementing changes to improve their power distribution.

    I've hosted in data centers where the UPS was regularly tested, yet in a real live incident the switchover failed. Even though the UPS did come up, there was a brief outage that shut down all the racks. Each rack needs to be brought back online one at a time to prevent overloading. Immediately you're looking at significant downtime.

    I've hosted in another data center where someone hit the BIG RED BUTTON underneath the plastic case, cutting off power to the floor.

    I'm sure Amazon could have done things better and will learn lessons. That's life in a data center.

    Nonetheless, Amazon allow you to keep your data at geographically diverse locations. As a customer you can pay the money and get geographic diversity that would have mitigated this outage. If you don't take advantage of that, you can hardly blame Amazon for your decision.

  4. Re:Redundancies, Redundancies by mirix · · Score: 4, Insightful

    Redundancy costs money. If it costs more than downtime, you don't get it.

    --
    Sent from my PDP-11
  5. Re:Murphy's law by JWSmythe · · Score: 4, Insightful

        Funny thing, I thought "cloud" computing meant that you're placed into an automatically redundant network of machines, so if there's a site-wide outage it doesn't interfere with operations.

        Now I see that Amazon's definition of "cloud" simply means "hosting provider". I guess in this case it means a hosting provider with no DC power room, no N+1 generators, and no regular testing to ensure the fallback systems actually work.

        That kind of reminds me of a company (who will remain nameless) who did tape backups, but never verified their tapes. When the data was lost, a good percentage of the tapes didn't work.

        I worked near a good datacenter. Out on smoke breaks late at night, you could hear them test fire their generators once a week. I was in there helping someone one night during a thunderstorm that sounded like it would rip the roof off, when I heard the generators spin up. The inside of the datacenter didn't miss a beat. When I left an hour later, I saw that there was no power (street lights, traffic lights, and normally illuminated buildings) for about 1/2 mile around it. The power company had it fixed by morning though. When I came back in the morning, everything was fine. Well, except my workstation in the office that didn't have redundant power.

    --
    Serious? Seriousness is well above my pay grade.
  6. Again: The IT Uptime Lightweights by RobotRunAmok · · Score: 3, Insightful

    When was the last time anyone heard of a TV Network going dark for an hour? A Hospital Emergency Room? IT guys always run around like self-important Star Trek Blue Shirts, but they never seem to take the proper steps to ensure -- really ensure -- their uptime.

    I'm sure there are exceptions, but it just seems that they have a ways to go, compared to the real "critical systems" industries to which they are so fond of comparing themselves. Is it money, arrogance, or ignorance?