Slashdot Mirror


Lightning Strikes Amazon's Cloud (Really)

The Register has details on a recent EC2 outage that is being blamed on a lightning strike that zapped a power distribution unit of the data center. The interruption only lasted around 6 hours, but the irony should last much longer. "While Amazon was correcting the problem, it told customers they had the option of launching new server instances to replace those that went down. But customers were also able to wait for their original instances to come back up after power was restored to the hardware in question."

18 of 109 comments

  1. Irony? by Anonymous Coward · · Score: 5, Insightful

    Isn't cloud computing supposed to tackle such instances?

    1. Re:Irony? by evanbd · · Score: 5, Funny

      Perhaps that's the problem. I think lightning rods are supposed to be more coppery than irony.

  2. God here... by Deus.1.01 · · Score: 5, Funny

    Is the message clear?

    -RMS

    --
    My -1 Troll is actually a +1 funny. And my -1 flame is actually a +1 insightful.
  3. Inconceivable! by binaryspiral · · Score: 5, Insightful

    While everyone is talking up the cloud and how resilient it is... this is just yet another example to never put all your eggs in one basket. If your service is so damn important that it can't go down - have it hosted in two places.

    Notice, Amazon.com didn't go down... :)

    1. Re:Inconceivable! by nine-times · · Score: 4, Informative

      Well it does seem like it was pretty resilient:

      While Amazon was correcting the problem, it told customers they had the option of launching new server instances to replace those that went down.

      So basically a set of servers went down, and it took down the particular instances running on those servers. Customers were still able to take the same exact image and start new instances-- it sounds like immediately. Now sure, it'd be nice if they worked out some kind of automatic clustering and failover to take care of this sort of thing for you, but when my server goes down with my dedicated host, I don't have the option to start up a new host immediately with the same exact configuration.

  4. Do any of you know how they survived? by mr_stinky_britches · · Score: 4, Interesting

    Do any of you know how an instance could survive a power outage? Surely every operation is written out to disk before it's performed, so how did they design it?

    --
    Censorship is obscene. Patriotism is bigotry. Faith is a vice. Slashdot 2.0 sucks.
    1. Re:Do any of you know how they survived? by KahabutDieDrake · · Score: 4, Informative

      You've never actually worked with enterprise-class gear, have you? It's standard for most of the servers and all of the data storage to have capacitance/battery backups for just such an emergency.

      Typically, the RAID controller will have enough on-board capacity to clear its write cache before losing power entirely, while the drive array will be connected to a decent UPS that can hold for at least a few minutes. Meanwhile, the server itself will also likely be connected to the same UPS, or a different one.

      The real question at hand is: were the UPSes between the power distribution node and the servers, or were they on the other side of the distribution node, and therefore worthless in a case like this? I've seen both configurations, but the latter is rarer. Not because of this particular case, but because of efficiency concerns.

      If there was a failure of design, it was most likely in the building wiring itself. The building was clearly not properly grounded against lightning strikes; if it were, the surge would never have hit the internal wiring. It might have kicked the building off the grid for a time, but it should never have reached a power distribution node, although the outcome would likely have been similar, if not identical.

    2. Re:Do any of you know how they survived? by RsG · · Score: 5, Insightful

      I'm reading between the lines here (it doesn't actually say this in TFA), but it sounds like this was a direct hit. Not an outage, which is a different beast.

      A UPS is about as useful in this instance as antibiotics against a virus - it's a solution to a different problem. Surge protectors don't help much either, not unless the strike was a fairly mild and/or remote one. You could switch over to a disconnected UPS system every time there's a thunderstorm on the horizon, but that seems needlessly complicated and expensive.

      That being said, the GP referred to an outage, so you've quite correctly answered his question; it's just the wrong question to ask in this instance. And of course I could be misreading (or Amazon could be misrepresenting) the exact nature of the failure - if it were a regular outage, none of the above would apply.

      --
      Erotic is when you use a feather. Exotic is when you use the whole chicken.
    3. Re:Do any of you know how they survived? by sirsnork · · Score: 4, Informative

      RAID controllers have batteries so they can remember what's in the cache (for about 48 hours), not so they can write that data out to disk before they power off. When power is returned and the disks come back up, the cache is flushed before any other action, thereby keeping the array in one piece.
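
A minimal sketch of the behaviour described above, written in Python as a toy model. The class and method names are hypothetical, it does not represent any vendor's firmware, and the roughly 48-hour battery limit is not modelled. It only shows the ordering: on power loss nothing is flushed, the battery keeps the cache contents, and on power restore the cache is replayed to disk before any new I/O is accepted.

```python
class BatteryBackedController:
    """Toy model of a RAID controller with a battery-backed write cache."""

    def __init__(self):
        self.disk = {}       # block number -> data that has reached the disks
        self.cache = {}      # pending writes, kept alive by the battery
        self.powered = True

    def write(self, block, data):
        # Write-back caching: the write is acknowledged once it is in cache.
        if not self.powered:
            raise RuntimeError("controller is powered off")
        self.cache[block] = data

    def read(self, block):
        if not self.powered:
            raise RuntimeError("controller is powered off")
        # Check the cache first so reads see the newest data.
        return self.cache.get(block, self.disk.get(block))

    def flush(self):
        # Push every pending write to disk, then empty the cache.
        self.disk.update(self.cache)
        self.cache.clear()

    def power_loss(self):
        # No flush here: the disks are already down.
        # The battery only preserves the cache contents.
        self.powered = False

    def power_restore(self):
        # Replay the preserved cache before accepting any new I/O.
        self.powered = True
        self.flush()


ctrl = BatteryBackedController()
ctrl.write(7, b"journal entry")   # acknowledged, but only in cache
ctrl.power_loss()                 # cache survives on battery
ctrl.power_restore()              # cache is flushed first
print(ctrl.disk[7])
```

The point of the ordering in `power_restore` is that a write acknowledged before the outage is never lost, even though it had not reached the disks when power dropped.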

      --

      Normal people worry me!
  5. Re:What irony? by mail2345 · · Score: 5, Funny

    I think the poster means popular irony, not irony as it actually means. Popular irony is like getting a fly in your white wine. Regular irony is not wearing your tin foil hat on the one day someone actually does beam thoughts into your brain.

  6. Re:What irony? by ZorbaTHut · · Score: 4, Funny

    In Soviet Russia, clouds get hit by lightning?

    Yeah, it's sorta weak, but that's what they were going for.

    --
    Breaking Into the Industry - A development log about starting a game studio.
  7. Re:What irony? by Anonymous Coward · · Score: 5, Funny

    Regular irony is not wearing your tin foil hat on the one day someone actually does beam thoughts into your brain.

    Nope. You've still got it wrong... That's still Morissette irony.

  8. Re:Lightning once striked our office building. by xrayspx · · Score: 5, Insightful

    I'm thinking critically because Amazon, EMC, VMware, etc. bill The Cloud as a mystical place where you throw your shit and then it's universally available 100%. Nothing bad happens in The Cloud.

    So what's the deal with having all copies of these VMs in one datacenter? That's not very The Cloud of them. Maybe they should replicate all of EC2 to GFS. Would The Cloud win then?

    Customers being given the option of redeploying their VMs or waiting an unspecified period of time until The Cloud is back online isn't The Cloud we were promised.

    /cloud

  9. Re:What irony? by quanticle · · Score: 5, Insightful

    Perhaps they were referring to the irony of Amazon's EC2 being affected by one of the very natural disasters it advertises protection against.

    It's rather like an "unsinkable" vessel going down on her maiden voyage.

    --
    We all know what to do, but we don't know how to get re-elected once we have done it
  10. Re:What irony? by Anonymous Coward · · Score: 5, Funny

    The real irony here is that tinfoil hats are actually required in order to beam thoughts into your head...

  11. Re:Lightning once striked my friends house. by Kotoku · · Score: 5, Funny

    In the civilized world, we just call those "walk through holes" doors.

  12. Re:Lightning once striked our office building. by Achromatic1978 · · Score: 4, Informative

    No, they don't. You're either being disingenuous, or idiotic.

    "Amazon EC2 provides developers the tools to build failure resilient applications and isolate themselves from failure scenarios."

    "you can protect your applications from failure of a single location"

  13. Re:Lightning once striked our office building. by sumdumass · · Score: 4, Interesting

    I have to wonder if those who are critical of Amazon here have ever experienced a direct lightning strike? I doubt it.

    Just so people know, this can be a real bitch.

    I took a direct lightning strike at one site I work with that entered the corner of the building, traveled down the inside wall leaving a scorch mark on two levels and into the basement where all the servers and switches were located. The lightning then traveled through the electrical service main lines to an encased transformer located in the parking lot next door, causing it to explode with enough force that it shattered the windows of the bank building next door, and a door panel was found on a roof about a block away. It appears that one half of the electrical system was grounded properly through a specific ground rod and the other half was tied into the plumbing that ran inches away from the lightning rod grounds. When they purchased the building, they didn't redo all the electrical on the side of the building that wasn't remodeled, and that way of grounding was normal.

    We lost 3 of the 5 servers instantly and couldn't keep the other two stable. Both switches were down, and 20 of the 44 workstations along with the tape backup machine, copiers, and networked printers were completely dead when we got there. The entire building had a lightning/surge protector with battery backup and natural gas generator on the mains, so they weren't too concerned over in-house specific protections. Only the systems with a UPS directly attached survived, with the exception of the servers; I'm not sure whether they died from the lightning strike or from getting soaked by the fire sprinklers that were set off by the strike. (Surprisingly, there was no fire.)

    It took us two days at almost 20 hours a day among 5 people with a lot of borrowing from other sites, about 20 trips to five or six computer stores in the surrounding counties, and a generator to come back online and be operational again. We even had a makeshift phone system in place while waiting on a new Avaya to come in. We did this all before the electric company got the transformer replaced and service back on. Until we replaced the other machines that were thought not to be affected, we experienced all sorts of weird behavior on the network, and I'm still not confident with the cabling even though it passed the testing. Of course I didn't run the certification, so it might just be me not trusting others.

    If you get a direct strike, you might as well count on replacing everything in a production environment. When I say direct strike, I mean evidence it actually hit the building and not something down the road and traveled to the building. It will be easier and cheaper in the long run. Now, I have as part of the catastrophe plan, a means to replace every computer and component on the network at one time just to be safe. If it wasn't for two other sites having the same tape drives, we would have had to wait a week for a replacement to come in and start the data recovery process. Thank god for off-site tape storage.