Slashdot Mirror


Car Hits Utility Pole, Takes Out EC2 Datacenter

1sockchuck writes "An Amazon cloud computing data center lost power Tuesday when a vehicle struck a nearby utility pole. When utility power was lost, a transfer switch in the data center failed to properly manage the shift to backup power. Amazon said a "small number" of EC2 customers lost service for about an hour, but the downtime followed three power outages last week at data centers supporting EC2 customers. Tuesday's incident is reminiscent of a 2007 outage at a Dallas data center when a truck crash took out a power transformer."

6 of 250 comments

  1. Farmville updates on Facebook stop by kriston · · Score: 5, Insightful

    And, as a result, Farmville/Mafiawars updates on Facebook temporarily stop.
    Nothing of value was lost.

    --

    Kriston

  2. It's failure on multiple levels by GilliamOS · · Score: 5, Insightful

    Amazon for not load-testing their emergency backup power on a regular basis and not having more than one connection to the power grid, and the power grid for not having redundancies. Our aging power grid is really beginning to show on so many levels that this is going to become a lot more common over the coming years.

    --
    "There might be intelligent beings created by God in outer space even if there are none here on Earth." -Anonymous
    1. Re:It's failure on multiple levels by OnlineAlias · · Score: 4, Insightful

      You said it. They failed to test. I design/run datacenters, and have had exactly this kind of thing happen recently. No outage, hardly anyone even noticed. My most critical stuff runs active/active out of multiple data centers...you could nuke one of them and everything would still be up.

      I'm actually a little blown away that the all-powerful Amazon could possibly let this kind of thing happen. They are supposed to be a pro team; a power failure is high school ball.

    2. Re:It's failure on multiple levels by TubeSteak · · Score: 4, Insightful

      Only one switch out of many failed, due to it being set up from the factory incorrectly. The rest of the system switched over properly. I would say that is pretty good considering the data center size and number of switches needed for redundancy.

      Sounds like Amazon's tech monkeys didn't do their job when they received the hardware from the factory.
      Or is it normal to just plug in mission-critical hardware and not check that it is set up properly?

      "We have already made configuration changes to the switch which will prevent it from misinterpreting any similar event in the future and have done a full audit to ensure no other switches in any of our data centers have this incorrect setting," Amazon reported.

      I guess TFA answered that question.
      If they're smart, they'll be creating policies for those types of audits to be done up front instead of after a failure.

      --
      [Fuck Beta]
      o0t!
  3. Re:Redundancies, Redundancies by mirix · · Score: 4, Insightful

    Redundancy costs money. If it costs more than downtime, you don't get it.

    --
    Sent from my PDP-11
  4. Re:Murphy's law by JWSmythe · · Score: 4, Insightful

        Funny thing, I thought "cloud" computing meant that you're placed into an automatically redundant network of machines, so if there's a site-wide outage it doesn't interfere with operations.

        Now I see that Amazon's definition of "cloud" simply means "hosting provider". I guess in this case it means hosting provider with no DC power room, N+1 generators and regular testing to ensure the fallback systems actually work.

        That kind of reminds me of a company (which will remain nameless) that did tape backups but never verified its tapes. When the data was lost, a good percentage of the tapes didn't work.

        I worked near a good datacenter. Out on smoke breaks late at night, you could hear them test fire their generators once a week. I was in there helping someone one night during a thunderstorm that sounded like it would rip the roof off, when I heard the generators spin up. The inside of the datacenter didn't miss a beat. When I left an hour later, I saw that there was no power (street lights, traffic lights, and normally illuminated buildings) for about 1/2 mile around it. The power company had it fixed by morning though. When I came back in the morning, everything was fine. Well, except my workstation in the office that didn't have redundant power.

    --
    Serious? Seriousness is well above my pay grade.