Slashdot Mirror


Car Hits Utility Pole, Takes Out EC2 Datacenter

1sockchuck writes "An Amazon cloud computing data center lost power Tuesday when a vehicle struck a nearby utility pole. When utility power was lost, a transfer switch in the data center failed to properly manage the shift to backup power. Amazon said a "small number" of EC2 customers lost service for about an hour, but the downtime followed three power outages last week at data centers supporting EC2 customers. Tuesday's incident is reminiscent of a 2007 outage at a Dallas data center when a truck crash took out a power transformer."

6 of 250 comments (clear)

  1. Farmville updates on Facebook stop by kriston · · Score: 5, Insightful

    And, as a result, Farmville/Mafiawars updates on Facebook temporarily stop.
    Nothing of value was lost.

    --

    Kriston

  2. It's failure on multiple levels by GilliamOS · · Score: 5, Insightful

    Amazon for not load-testing their emergency backup power on a regular basis, not having more than one connection the power grid, and the power grid for not having redundancies. Our aging power grid is really beginning to show on so many levels that this is going to become a lot more common over the coming years.

    --
    "There might be intelligent beings created by God in outer space even if there are none here on Earth." -Anonymous
  3. Re:Murphy's law by turing_m · · Score: 5, Funny

    Nice try, but you still fail to grammar.

    This is why I long ago resolved to never, ever, ever correct someone else's grammar on slashdot. The risk in inadvertently failing to grammar is unacceptable.

    --
    If I have seen further it is by stealing the Intellectual Property of giants.
  4. Re:Hurrr, durrr by plover · · Score: 5, Funny

    What about your power supply? Is that not allowed to go along a road? I am all for underground power BTW but I know that if you operate a digger and you want to find the owner of a cable the easiest way is to break it and wait for the complaints.

    That's also the fastest way to get rescued off a desert island or out in the woods, and why you should always carry a piece of fiber in your pocket. Should you get stranded, you simply bury the fiber, and some asshole with a backhoe will be along in about five minutes to cut it. Ask him to rescue you.

    --
    John
  5. Failure is often not a boolean by mcrbids · · Score: 5, Interesting

    For years, I co-located at the top-rated 365 Main data center in San Francisco, CA until they had a power failure a few years ago. Despite having 5x redundant power that was regularly tested, it apparently wasn't tested against a *brown out*. So when Pacific Gas and Electric had a brownout, it failed to trigger 2 of the 5 redundant generators. Unfortunately, the system was designed so that any *one* of the redundant generators could fail and there wouldn't be any problem.

    So power was in a brownout condition, the voltage dropped from the usual 120 volts or so down to 90. Many power supplies have brownout detectors and will shut off. Many did, until the total system load dropped to the point where normal power was restored. All of this happened within a few seconds, and the brownout was fixed in just a few minutes. But at the end of it all, there was perhaps 20% of all the systems in the building shut down. The "24x7 hot hands" were beyond swamped. Techies all around the San Francisco area were pulled from whatever they were doing to converge on downtown SF. And me, 4 hours drive away, managed to restore our public-facing services on the one server (of four) I had that survived the voltage spikes before driving in. (Alas, my servers had the "higher end" power supplies with brownout detection)

    And so it was a long chain of almost success of well-tested, high-quality equipment that failed all in sequence because real life didn't happen to behave like the frequently performed tests did.

    When I did finally arrive, the normally quiet, meticulously clean facility was a shambles. Littered with bits of network cable, boxes of freshly-purchased computer equipment, pizza boxes, and other refuge were to be found in every corner. The aisles were crowded with techies performing disk checks and chattering tersely on cell phones. It was other-worldly.

    All of my systems came up normally; simply pushing the power switch and letting the fsck run did the trick, we were fully back up and all tests performed (and the system configuration returned to normal) in about an hour.

    Upon reflection, I realized that even though I had some down time, I was really in a pretty good position:

    1) I had backup hosting elsewhere, with a backup from the previous night. I could have switched over, but decided not to because we had current data on one system and we figured it was better not to have anybody lose any data than to have everybody lose the morning's work.

    2) I had good quality equipment; the fact that none of my equipment was damaged from the event may have been partly due to the brownout detection in the power supplies of my servers.

    3) At no point did I have any less than two backups off site in two different location, so I had multiple, recent data snapshots off site. As long as the daisy chains of failure can be, it would be freakishly rare to have all of these points go down at once.

    4) Even with 75% of my hosting capacity taken offline, we were able to maintain uptime throughout all this because our configuration has full redundancy within our cluster - everything is stored in at least 2 places onsite.

    Moral of the story? Never, EVER have all your eggs in one basket.

    --
    I have no problem with your religion until you decide it's reason to deprive others of the truth.
  6. Re:An untested DR plan is a worthless DR plan by thegarbz · · Score: 5, Interesting

    It is exactly that level of understanding that can cause most outages (and even failures of safety critical systems). There is one part of the UPS that is uninterruptible and that is the voltage at the battery. Between the voltage at the battery and the computer you have cables, electronics, control systems, charging circuits, and inverters. Beyond that if it's an industrial sized UPS there'll be circuit breakers, distribution boards, and other such equipment, each adding failure modes to the "uninterruptible" supply.

    I'll give you an example of what went wrong at my work (a large petro chemical plant in Australia). Like a lot of plants most pumps are redundant, and fed from two different sub stations, that doesn't prevent loss of power but the control circuits in those sub stations run from 24V. Those 24V come from two different cross linked UPS units (cross linked meaning that both redundant boards are fed from both redundant UPS). So in theory not only is there a backup at the plant, backup substations, and backup UPSs but in theory any component can fail and still keep upstream and downstream systems redundant.

    Anyway we had to take down one of the UPS for maintenance reasons following a procedure we'd used plenty of times before. The procedure is simple: 1. Check the current in the circuit breakers so that the redundant breakers can withstand the load, 2. close the circuit breakers upstream of the UPS that is being shut down, 3. Close main isolator to the UPS. So that's exactly what we did, and when we isolated one of the UPS, the upstream circuit breaker tripped from the OTHER UPS and control power was lost to half the plant as it was now effectively not only isolated from battery backup, but from the main 24V supply.

    So after lots of head scratching we did some thermal imagery of the installation. The circuit breaker which tripped in sympathy when we took down it's counterpart was running significantly hotter than the main one. The cause was determined to be a lose wire. So even though the load through the circuit breaker was much less than 1/2 of the total load, when we took down the redundant supply and the circuit breaker got loaded, the temperature pushed it over the edge.

    A carefully designed dually redundant UPS system providing 4 sources of power failed when we took down 2 of them in a careful way due to a lose wire in a circuit breaker. A UPS is never truly uninterruptible, and even internal batteries in servers would be protected by a fuse of some kind to ensure the equipment goes down, but ultimately survives a fault