Slashdot Mirror


Lightning Strikes Amazon's Cloud (Really)

The Register has details on a recent EC2 outage that is being blamed on a lightning strike that zapped a power distribution unit of the data center. The interruption only lasted around 6 hours, but the irony should last much longer. "While Amazon was correcting the problem, it told customers they had the option of launching new server instances to replace those that went down. But customers were also able to wait for their original instances to come back up after power was restored to the hardware in question."

109 comments

  1. Irony? by Anonymous Coward · · Score: 5, Insightful

    Isn't cloud computing supposed to tackle such instances?

    1. Re:Irony? by evanbd · · Score: 5, Funny

      Perhaps that's the problem. I think lightning rods are supposed to be more coppery than irony.

    2. Re:Irony? by jo42 · · Score: 1

      Amazon's definition of a "cloud" is a whole bunch of XEN-based VPSes running in less than a handful of data centers here and there.

      Resilience is an exercise left to the customer.

    3. Re:Irony? by log0n · · Score: 3, Insightful

      The irony is that a cloud was struck by lightning. Lightning usually comes from clouds.

      Sometimes we all need to tone back the nerd a bit :)

    4. Re:Irony? by drinkypoo · · Score: 2, Funny

      Sometimes we all need to tone back the nerd a bit :)

      What? Get out. *points*

      --
      "You're right," Fisheye says. "I should have set it on 'whip' or 'chop.'"
    5. Re:Irony? by MadnessASAP · · Score: 3, Funny

      Well as it just so happens most lightning is ground to cloud or cloud to cloud with very little cloud to ground.

      My nerdiness goes up to 11 by the way.

      --
      I may agree with what you say, but I will defend to the death your right to face the consequences of saying it.
    6. Re:Irony? by Anonymous Coward · · Score: 0

      Lightning actually comes from the ground.

    7. Re:Irony? by Anonymous Coward · · Score: 0

      I think you use them for the rain clouds.

    8. Re:Irony? by ckaminski · · Score: 1

      This whole cloud computing "fad" is really underwhelming. I'm dying for the day that I can take my compute resources from my desktop at work, move them to my phone, load them at my computer at home, and move them into some amorphous network computer when I need more mobility and power than my phone and laptop can provide.

      Cloud computing promises to do none of this. What it is is datacenter provisioning akin to the mainframe days of old, except you get to "mostly" choose the platform you want to run on, Windows or Linux.

      Cloud computing is nothing more than VMware on steroids. And THAT is disappointing, and underwhelming.

    9. Re:Irony? by Hurricane78 · · Score: 1

      There's a "in soviet amazon" joke in there somewhere. I know it!

      --
      Any sufficiently advanced intelligence is indistinguishable from stupidity.
    10. Re:Irony? by Hurricane78 · · Score: 1

      Well, get a bigger laptop then! Or just move your user profile.

      --
      Any sufficiently advanced intelligence is indistinguishable from stupidity.
  2. Struck the cloud, eh? by BorgCopyeditor · · Score: 1, Funny

    Did it leave a silver lining?

    --
    Shop as usual. And avoid panic buying.
  3. Who covers the cost? by Anonymous Coward · · Score: 2, Interesting

    Naive question: Are data centers usually insured for the cost of hardware replacement and/or loss of revenue in a situation like this?

    1. Re:Who covers the cost? by boredomist · · Score: 1

      I would assume they would have to be. Amazon would be completely stupid not to insure its data (since the actual data is basically the only way Amazon makes money).

    2. Re:Who covers the cost? by TheLink · · Score: 2, Insightful

      What if they insured with AIG?

      Who covers the cost then? :)

    3. Re:Who covers the cost? by Anonymous Coward · · Score: 0, Redundant

      Me. I pay taxes in the U.S.

    4. Re:Who covers the cost? by Anonymous Coward · · Score: 0

      What if they insured with AIG?

      Who covers the cost then? :)

      Tax payers.

    5. Re:Who covers the cost? by afidel · · Score: 1

      Probably too much money to buy insurance; if you're that large, having multiple physically dispersed systems IS your insurance. For the customers, the insurance is included in the cost of the service: Amazon adds a little bit to the bill and puts it in a fund in case they have to issue SLA-related refunds (or, more likely, they just make less that month due to issuing credits).

      --
      There are 4 boxes to use in the defense of liberty: soap, ballot, jury, ammo. Use in that order. Starting now.
  4. Well, now that that's over with.... by Anonymous Coward · · Score: 2, Funny

    There's nothing to worry about, because as we all know, Lightning never strikes twice.

    Yay for savings on the surge protectors!

    1. Re:Well, now that that's over with.... by Lord+Fury · · Score: 3, Funny

      It's only that lightning never strikes only twice. There's nothing stopping you from getting hit more than twice

      http://en.wikipedia.org/wiki/Roy_Sullivan

    2. Re:Well, now that that's over with.... by Anonymous Coward · · Score: 0

      He made the mistake of moving.

    3. Re:Well, now that that's over with.... by P1h3r1e3d13 · · Score: 1

      The guy who played Jesus in The Passion of the Christ got hit twice during filming.

      Also, lightning rods on tall buildings can get hit hundreds of times annually.

  5. Lightning once striked our office building. by QuietLagoon · · Score: 3, Interesting
    Our computer room was down for three days as a result. Amazon's six hour downtime looks like a big improvement.

    I have to wonder if those who are critical of Amazon here have ever experienced a direct lightning strike? I doubt it.

    1. Re:Lightning once striked our office building. by xrayspx · · Score: 5, Insightful

      I'm thinking critically because Amazon, EMC, VMWare, etc bill The Cloud as a mystical place where you throw your shit and then it's universally available 100%. Nothing bad happens in The Cloud.

      So what's the deal with having all copies of these VMs in one datacenter? That's not very The Cloud of them. Maybe they should replicate all of EC2 to GFS. Would The Cloud win then?

      Customers being given the option of redeploying their VMs or waiting an unspecified period of time until The Cloud is back online isn't The Cloud we were promised.

      /cloud

    2. Re:Lightning once striked our office building. by Logic+and+Reason · · Score: 3, Insightful

      I'm thinking critically because Amazon, EMC, VMWare, etc bill The Cloud as a mystical place where you throw your shit and then it's universally available 100%. Nothing bad happens in The Cloud.

      No, they don't. You're either being disingenuous, or idiotic.

      So what's the deal with having all copies of these VMs in one datacenter? That's not very The Cloud of them.

      So you expect Amazon to somehow be running the same VM simultaneously on multiple machines? The point of EC2 is that you have machine images prepared in advance, which you can launch at any time to instantiate a new, ready-to-go VM. The VMs themselves are obviously still running on actual machines, which are (surprise!) still vulnerable to things like lightning strikes and other random hardware failures.

      If a few minutes downtime when something like that happens is unacceptable, then you should be running multiple machines in different availability zones-- which is exactly what you'd be doing in a more traditional environment. EC2 just makes it easier to do this in a flexible way. Yes, you pay for that privilege, but it's clearly worth it to some people.

    3. Re:Lightning once striked our office building. by Achromatic1978 · · Score: 4, Informative

      No, they don't. You're either being disingenuous, or idiotic.

      "Amazon EC2 provides developers the tools to build failure resilient applications and isolate themselves from failure scenarios."

      "you can protect your applications from failure of a single location"

    4. Re:Lightning once striked our office building. by Anonymous Coward · · Score: 3, Interesting

      Three days is lucky. My very first job (many, many moons ago) was at a company which had a few 5, 10 and 15 meter SATCOM dishes outside. One fall night, a set of severe T-storms rolled through around 2 am, and lightning struck the SAT farm. Nearly knocked me out of my NOC chair where I was fighting to stay awake, and I swore something big had exploded outside.

      Turns out, one of the SAT dishes had not been properly grounded, and the current surged through the SATNET into our internal networks. Several mid-range systems, network gear, LAN pc's, modems, etc were fried. Console terminals were also, and if I'd been typing instead of fighting sleep, I would have been crispy.

      The next several days were spent replacing the instantly fried gear. But initially unaffected systems started having serious glitches show up over the next few weeks. My guess is that Amazon may have this same problem.

    5. Re:Lightning once striked our office building. by Logic+and+Reason · · Score: 1

      Amazon EC2 provides developers the tools to build failure resilient applications and isolate themselves from failure scenarios.

      you can protect your applications from failure of a single location

      If you look closely, you may be able to discern a difference between the previous two statements and the following:

      ...The Cloud [is] a mystical place where you throw your shit and then it's universally available 100%. Nothing bad happens in The Cloud.

      Can you guess which statement was not made by Amazon?

    6. Re:Lightning once striked our office building. by Anonymous Coward · · Score: 1, Insightful

      Endless arguing. Did or didn't Amazon say that using the cloud you "protect your application from failure of a single location"? And did or didn't this happen? Answering the two questions in the right order will explain what the OP meant even to you.

    7. Re:Lightning once striked our office building. by lena_10326 · · Score: 1

      "Amazon EC2 provides developers the tools to build failure resilient applications and isolate themselves from failure scenarios"

      "you can protect your applications from failure of a single location"

      You still have to design your apps to be distributed in some manner. You can't just throw your single process server code onto the cloud and expect it to be failure resistant without giving it any thought. You have to purchase extra capacity and decide which locations they should reside in. Any user purchasing 1 or 2 instances and expecting it to be fault tolerant is begging for a wake-up call.

      If you're a developer throwing your code onto a cloud and none of this ever occurs to you, then you should quit now and re-train yourself with a new career because this is not a field for you.

      --
      Camping on quad since 1996.
    8. Re:Lightning once striked our office building. by fluffy99 · · Score: 2, Interesting

      I'm thinking critically because Amazon, EMC, VMWare, etc bill The Cloud as a mystical place where you throw your shit and then it's universally available 100%. Nothing bad happens in The Cloud.

      No, they don't. You're either being disingenuous, or idiotic.

      Per http://aws.amazon.com/ec2/#highlights, Amazon is promising: "Reliable: Amazon EC2 offers a highly reliable environment where replacement instances can be rapidly and predictably commissioned. The service runs within Amazon's proven network infrastructure and datacenters. The Amazon EC2 Service Level Agreement commitment is 99.95% availability for each Amazon EC2 Region."

      The irony here is that 6 hours in a year is 99.93% so they've already blown it for the year.

      So what's the deal with having all copies of these VMs in one datacenter? That's not very The Cloud of them.

      If it's only one instance running, it's kinda hard to run it in multiple datacenters. They might be running clustering within a datacenter, but that can still be taken down by a power outage affecting multiple servers. As pointed out earlier, you can have instances in multiple datacenters (zones, as they call them) if you're willing to pay for it.
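
      The arithmetic behind the 99.95% and 99.93% figures above can be checked with a short sketch. It assumes a 365-day year and treats the six hours as a single outage counted against the whole year; the function names are made up for illustration and say nothing about how Amazon actually measures its SLA.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years (assumption)


def availability_pct(downtime_hours, period_hours=HOURS_PER_YEAR):
    """Percentage of the period during which the service was up."""
    return 100.0 * (1 - downtime_hours / period_hours)


def allowed_downtime_hours(sla_pct, period_hours=HOURS_PER_YEAR):
    """Downtime budget implied by an availability percentage."""
    return period_hours * (1 - sla_pct / 100.0)


if __name__ == "__main__":
    # Six hours of downtime over one year.
    print(f"{availability_pct(6):.2f}%")            # 99.93%
    # Yearly downtime budget at a 99.95% commitment.
    print(f"{allowed_downtime_hours(99.95):.2f} h")  # 4.38 h
```

      So a 99.95% yearly commitment leaves a bit under four and a half hours of downtime, which a six-hour outage would exceed if it counted against the whole region.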

    9. Re:Lightning once striked our office building. by Anonymous Coward · · Score: 0

      Endless arguing. Did or didn't Amazon say that using the cloud you "protect your application from failure of a single location"? And did or didn't this happen? Answering the two questions in the right order will explain what the OP meant even to you.

      Nope they said "you can protect your applications from failure of a single location". Just like "Wearing a seatbelt can protect your life in a car accident". Yes it is a car analogy.

    10. Re:Lightning once striked our office building. by sumdumass · · Score: 4, Interesting

      I have to wonder if those who are critical of Amazon here have ever experienced a direct lightning strike? I doubt it.

      Just so people know, this can be a real bitch.

      I took a direct lightning strike at one site I work with that entered the corner of the building, traveled down the inside wall leaving a scorch mark on two levels and into the basement where all the servers and switches were located. The lightning then traveled through the electrical service main lines to an encased transformer located in the parking lot next door, causing it to explode with enough force that it shattered the windows of the bank building next door, and a door panel was found on a roof about a block away. It appears that one half of the electrical system was grounded properly through a specific ground rod and the other half was tied into the plumbing that ran inches away from the lightning rod grounds. When they purchased the building, they didn't redo all the electrical on the side of the building that wasn't remodeled, and that way of grounding was normal.

      We lost 3 of the 5 servers instantly and couldn't keep the other two stable. Both switches were down, and 20 of the 44 workstations, along with the tape backup machine, copiers, and networked printers, were completely dead when we got there. The entire building had a lightning/surge protector with battery backup and natural gas generator on the mains, so they weren't too concerned over in-house specific protections. Only the systems with UPS on them directly survived, with the exception of the servers, which I'm not sure died from the lightning strike or from getting soaked by the fire sprinklers that were set off by the strike. (Surprisingly, there was no fire.)

      It took us two days at almost 20 hours a day among 5 people with a lot of borrowing from other sites, about 20 trips to five or six computer stores in the surrounding counties, and a generator to come back on line and be operational again. We even had a makeshift phone system in place while waiting on a new Avaya to come in. We did this all before the electric company got the transformer replaced and service back on. Until we replaced the other machines that were thought not to be affected, we experienced all sorts of weird behavior on the network, and I'm still not confident with the cabling even though it passed the testing. Of course I didn't run the certification, so it might just be me not trusting others.

      If you get a direct strike, you might as well count on replacing everything in a production environment. When I say direct strike, I mean evidence it actually hit the building and not something down the road and traveled to the building. It will be easier and cheaper in the long run. Now, I have as part of the catastrophe plan, a means to replace every computer and component on the network at one time just to be safe. If it wasn't for two other sites having the same tape drives, we would have had to wait a week for a replacement to come in and start the data recovery process. Thank god for off-site tape storage.

    11. Re:Lightning once striked our office building. by lena_10326 · · Score: 2, Insightful

      The irony here is that 6 hours in a year is 99.93% so they've already blown it for the year.

      A region consists of multiple datacenters. 99.93% would be for 1 datacenter, not the region.
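
      The datacenter-versus-region distinction can be put into numbers with a rough sketch. It assumes zones fail independently, which a shared cause (a regional storm, a common software bug) would break, and the function names are hypothetical.

```python
def all_zones_down_probability(zone_availability, zones):
    """Chance that every zone is down at once, assuming independent failures."""
    return (1 - zone_availability) ** zones


def at_least_one_zone_up(zone_availability, zones):
    """Chance that at least one zone is serving, under the same assumption."""
    return 1 - all_zones_down_probability(zone_availability, zones)


if __name__ == "__main__":
    # One zone at 99.93% versus the same service spread across two or three zones.
    for n in (1, 2, 3):
        print(n, at_least_one_zone_up(0.9993, n))
```

      Under that assumption, one datacenter at 99.93% tells you little about a region that only counts as down when every zone in it is down.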

      --
      Camping on quad since 1996.
    12. Re:Lightning once striked our office building. by d'baba · · Score: 1

      Well, to some of us the hype of the cloud was that my app would be running somewhere regardless of any piece of hardware being compromised. (but you gotta like the cloud/lightning synchronicity thing)
      So I guess you're saying that EC2 isn't a cloud.
      ---
      HTML isn't what it's marked up to be.

    13. Re:Lightning once striked our office building. by dhall · · Score: 3, Insightful

      "Amazon EC2 provides developers the tools to build failure resilient applications and isolate themselves from failure scenarios."

      Let's highlight the words that need emphasis.

      "provides", "developers", "tools"

      As to whether the developers use them or not isn't always automatic.

      "you can protect your applications from failure of a single location"

      "can"

      Highly available does not mean fault tolerant. The latter allows an application to continue functioning after a component failure. Regardless of the snake oil that has been thrown around, there is no silver bullet that can automagically enable an application to be multi-node aware with no chance of deadlock or data corruption. You need to program for this. Again, tools are provided, but that doesn't mean everyone will use them. So in the absence of a fault tolerant application, the cloud provides high availability.

    14. Re:Lightning once striked our office building. by Anonymous Coward · · Score: 0

      The irony here is that 6 hours in a year is 99.93% so they've already blown it for the year.

      The irony here is that that's not ironic at all.

    15. Re:Lightning once striked our office building. by Anonymous Coward · · Score: 0

      The irony here is that 6 hours in a year is 99.93% so they've already blown it for the year.

      The irony here is that that's not ironic at all.

      You mean it's like raaaaaiiiiiiiiiiiiiaaaaaaaain on your wedding day?

    16. Re:Lightning once striked our office building. by Anonymous Coward · · Score: 0

      I took a direct lightning strike at one site I work with

      It was just a flesh wound, I take it?

    17. Re:Lightning once striked our office building. by Anonymous Coward · · Score: 0

      I think you're missing the point. Amazon made a big fail here in sending notices to all their customers to ask if they wanted their VM restarted somewhere else. Which if they had backups of those images at other data centers, they should have done "automagically". This was their big chance to show the usefulness of this cloud (distributed) computing "thing", and they failed. To most people this gave the message that they are not much better than a traditional web hosting service, if you don't pay for the _real_ cloud hosting service. Big ripoff really.

    18. Re:Lightning once striked our office building. by sydbarrett74 · · Score: 1

      I'm thinking critically because Amazon, EMC, VMWare, etc bill The Cloud as a mystical place where you throw your shit and then it's universally available 100%. Nothing bad happens in The Cloud. No, they don't. You're either being disingenuous, or idiotic.

      His irony went entirely over your head. Look at the word I rendered in bold. Get the irony now? Duh.

      --
      'He who has to break a thing to find out what it is, has left the path of wisdom.' -- Gandalf to Saruman
    19. Re:Lightning once striked our office building. by Anonymous Coward · · Score: 0

      Lol... Not even that bad. I meant the site took the lightning strike, I took the problems it created and had to deal with it.

    20. Re:Lightning once striked our office building. by Jim+Hall · · Score: 1

      [Lightning once striked{sic} our office building.] Our computer room was down for three days as a result. Amazon's six hour downtime looks like a big improvement.

      Never had a lightning strike, but last year the building transformer that feeds our data center Fucking Exploded (I was on the other side of the building, and I tell you the earth moved.) No injuries, since it's shielded from the building by a retaining wall. Backup power (UPS, generator) went totally dark about 30 minutes later, which should never happen, but it was an odd day.

      We were down for about 12 hours. And we're a University, not a Fortune-100. Massive electrical repair can happen quickly if you have the right people involved, with the right agreements.

    21. Re:Lightning once striked our office building. by drinkypoo · · Score: 1

      Regardless of the snake oil that has been thrown around, there is no silver bullet that can automagically enable an application to be multi-node aware with no chance of deadlock or data corruption.

      It's not a silver bullet, but you can give the same input to two [virtual] machines and if one fails the traffic is picked up by the other one. It does however provide pretty linear redundancy... Potentially at the cost of some latency.

      --
      "You're right," Fisheye says. "I should have set it on 'whip' or 'chop.'"
    22. Re:Lightning once striked our office building. by vidarh · · Score: 1
      Anyone who uses EC2 without being set up to handle instance failures is an idiot. EC2 instances fail fairly regularly, and Amazon has gone out of their way to point out to people that if they do, your instance data is gone, so you better design your apps to be resilient against instance failure.

      The whole point being that you pay only for the resilience YOU want, not for a bunch of things that may or may not be appropriate depending on your app. Amazon can't know whether bringing an image up is safe or not unless the backup is up to the clock cycle consistent - bringing them up could be disastrous. Different apps require extremely different failover solutions.

      The usefulness is that you have the API to do this without pre-provisioning a bunch of servers.

    23. Re:Lightning once striked our office building. by drinkypoo · · Score: 1

      Getting one transformer isn't too hard if you are willing to pay for it, unless perhaps there's been a recent massive solar flare that's burned out equipment across half of your state or something.

      --
      "You're right," Fisheye says. "I should have set it on 'whip' or 'chop.'"
    24. Re:Lightning once striked our office building. by xrayspx · · Score: 1

      No, they don't. You're either being disingenuous, or idiotic

      I choooooose....2:) Disingenuous FTW Alex. I was also being 3:) drunk and snarky, and annoyed with EMC and VMWare spinning cloud computing as "fault tolerant" computing somehow.

      The sales pitch of The Cloud is that, and yes I've heard this, you can move VMs from one physical location to another with no downtime. I fail to see how that pitch works in terms of IP subnets, which must be different for the networks to work, but there you have it.

      I suspect by "move" they mean "copy and re-ip" and by no downtime, they mean "except for DNS change propagation time", but I'm no VM/Cloud Computing expert yet. I'm not saying it can't happen, but I really need that part explained to me, and no VMWare or EMC people have been able to do so adequately yet.

    25. Re:Lightning once striked our office building. by friedo · · Score: 1

      The failure described by the article affected one availability zone out of seven in the EC2 cloud. Anybody who built their application redundantly across multiple zones would not have been affected by the outage.

    26. Re:Lightning once striked our office building. by Logic+and+Reason · · Score: 1

      The sales pitch of The Cloud is that, and yes I've heard this, you can move VMs from one physical location to another with no downtime.

      I'd be interested to know where you heard this. I don't recall Amazon ever making such claims (yes, I know you also mentioned EMC and VMWare in your original post, but this story is about Amazon after all).

    27. Re:Lightning once striked our office building. by Synn · · Score: 1

      I suspect by "move" they mean "copy and re-ip" and by no downtime, they mean "except for DNS change propagation time", but I'm no VM/Cloud Computing expert yet. I'm not saying it can't happen, but I really need that part explained to me, and no VMWare or EMC people have been able to do so adequately yet.

      You do not have to deal with DNS change propagation. You have 2 choices here, you can use elastic IP addresses which are permanent and can be assigned to any instance you want:

      Instance A goes down.
      You bring up instance B and assign the IP that was on instance A to B.

      Or you can use Elastic Load Balancing which gives you a public CNAME that you can use to load balance across instances. The ELB is itself fault tolerant and can exist in multiple availability zones.

      The ELB can also be configured to automatically bring new instances online if one fails.
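
      The elastic IP flow above (instance A goes down, bring up B, move the address) can be sketched with a toy in-memory model. This is not the AWS API: ToyCloud, fail_over, and the 203.0.113.10 documentation address are all invented here purely to show why no DNS change is involved.

```python
class ToyCloud:
    """In-memory stand-in for a cloud account; hypothetical, not the AWS API."""

    def __init__(self):
        self.healthy = {}      # instance id -> True while running
        self.address_map = {}  # elastic IP -> instance id

    def launch(self, instance_id):
        self.healthy[instance_id] = True

    def fail(self, instance_id):
        self.healthy[instance_id] = False

    def associate(self, ip, instance_id):
        # The address can only be pointed at an instance that is running.
        if not self.healthy.get(instance_id):
            raise ValueError(f"{instance_id} is not running")
        self.address_map[ip] = instance_id

    def route(self, ip):
        """Return the instance currently answering on ip, or None."""
        target = self.address_map.get(ip)
        if target is not None and self.healthy.get(target):
            return target
        return None


def fail_over(cloud, ip, replacement_id):
    """Launch a replacement and move the existing address onto it."""
    cloud.launch(replacement_id)
    cloud.associate(ip, replacement_id)


if __name__ == "__main__":
    cloud = ToyCloud()
    ip = "203.0.113.10"  # documentation-range address, chosen for the example
    cloud.launch("A")
    cloud.associate(ip, "A")
    cloud.fail("A")
    print(cloud.route(ip))   # None: nobody is answering
    fail_over(cloud, ip, "B")
    print(cloud.route(ip))   # B: same address, new instance
```

      Clients keep using the same address throughout, so the only downtime in this model is the gap between the failure and the reassignment.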

    28. Re:Lightning once striked our office building. by xrayspx · · Score: 1

      From EMC at an EMC technical event. Don't get me wrong, it was a great event and the upcoming products look really nice, but I have to know how this works. It's also possible that the guy who was saying that had no idea what he was talking about.

    29. Re:Lightning once striked our office building. by Anonymous Coward · · Score: 0

      In VMware it's possible to have automatic failover of the VMs running on a node to another functioning node. You can even push running VMs between nodes without downtime.

      One of the points of "cloud computing" and virtualization is that you can have fault tolerance built into the cloud instead of the application. The same instance can fail over to another node... it doesn't care. If you replicate your storage, the VM can fail over to another node in another datacenter.

  6. God here... by Deus.1.01 · · Score: 5, Funny

    Is the message clear?

    -RMS

    --
    My -1 Troll is actually a +1 funny. And my -1 flame is actually a +1 insightfull.
    1. Re:God here... by wwwillem · · Score: 1

      I guess that coming Monday morning the discussion at Amazon's boardroom table will be more along the lines of "Devil here....". :-)

      --
      Browsers shouldn't have a back button!! It's all about going forward...
    2. Re:God here... by Lennie · · Score: 1

      I hear he's in the details, so look closely

      --
      New things are always on the horizon
  7. What irony? by MrMista_B · · Score: 3, Insightful

    What irony?

    Maybe I'm just tired, but I'm not sure what irony is being referred to by the poster.

    1. Re:What irony? by __aamisb9940 · · Score: 1

      Lightning killed the 'cloud'.

      It's not great irony, but it's kinda there.

    2. Re:What irony? by mail2345 · · Score: 5, Funny

      I think the poster means popular irony, not irony as it actually means. Popular irony is like getting a fly in your white wine. Regular irony is not wearing your tin foil hat on the one day someone actually does beam thoughts into your brain.

    3. Re:What irony? by ZorbaTHut · · Score: 4, Funny

      In Soviet Russia, clouds get hit by lightning?

      Yeah, it's sorta weak, but that's what they were going for.

      --
      Breaking Into the Industry - A development log about starting a game studio.
    4. Re:What irony? by Anonymous Coward · · Score: 2, Insightful

      That a computing technology that was supposed to be largely immune to damage of individual "nodes" in the cloud could be taken down by lightning hitting a single point?

    5. Re:What irony? by Anonymous Coward · · Score: 5, Funny

      Regular irony is not wearing your tin foil hat on the one day someone actually does beam thoughts into your brain.

      Nope. You've still got it wrong... That's still Morissette irony.

    6. Re:What irony? by Anonymous Coward · · Score: 0

      No, that's not irony, that's just stupid.
      Irony is beaming thoughts into someone else's head while wearing a tinfoil hat yourself.

    7. Re:What irony? by quanticle · · Score: 5, Insightful

      Perhaps they were referring to the irony of Amazon's EC2 being affected by one of the very natural disasters it advertises protection against.

      It's rather like an "unsinkable" vessel going down on her maiden voyage.

      --
      We all know what to do, but we don't know how to get re-elected once we have done it
    8. Re:What irony? by Anonymous Coward · · Score: 5, Funny

      The real irony here is that tinfoil hats are actually required in order to beam thoughts into your head...

    9. Re:What irony? by layer3switch · · Score: 1

      How about, "cloud computing on a sunny day only".

      irony FTW!

      --
      "Don't let fools fool you. They are the clever ones."
    10. Re:What irony? by fatalwall · · Score: 1

      shouldn't it be...
      In soviet russia clouds hit you
      seeing as here the clouds are what get hit...

    11. Re:What irony? by The+Grim+Reefer2 · · Score: 1

      Popular irony is like getting a fly in your white wine.

      Actually that would be, "It's a black fly in your Chardonnay...
      It's a death row pardon two minutes too late
      And isn't it ironic... don't you think"

    12. Re:What irony? by zippthorne · · Score: 1

      Regular irony is what happens to your drinking water when the junk yard dumps its old car bodies in the reservoir.

      Or when a robot reads the definition of "irony" from the OED during a one-off production of the greatest opera ever.

      Magnetite suspended in oil is pretty irony, too.

      Your tin foil hat example is just plain, old, ordinary unfortunate coincidence. a.k.a alanian irony. Calling it "tin foil" when it's actually aluminum, however...

      BTW, the white wine thing actually is irony (well, fairly loosely). See, in the song, it seems like she's saying that the fly ruined the wine. But the wine is chardonnay, so it's actually the other way around.

      --
      Can you be Even More Awesome?!
    13. Re:What irony? by Sulphur · · Score: 1

      Hey hey you you get off a my cloud

    14. Re:What irony? by Anonymous Coward · · Score: 0

      That's what they want you to think.

  8. Inconcievable! by binaryspiral · · Score: 5, Insightful

    While everyone is talking up the cloud and how resilient it is... this is just yet another example to never put all your eggs in one basket. If your service is so damn important that it can't go down - have it hosted in two places.

    Notice, Amazon.com didn't go down... :)

    1. Re:Inconcievable! by Anonymous Coward · · Score: 2, Interesting

      I don't see how cloud hosting is somehow incompatible with hosting in two places.

    2. Re:Inconcievable! by nine-times · · Score: 4, Informative

      Well it does seem like it was pretty resilient:

      While Amazon was correcting the problem, it told customers they had the option of launching new server instances to replace those that went down.

      So basically a set of servers went down, and it took down the particular instances running on those servers. Customers were still able to take the same exact image and start new instances-- it sounds like immediately. Now sure, it'd be nice if they worked out some kind of automatic clustering and failover to take care of this sort of thing for you, but when my server goes down with my dedicated host, I don't have the option to start up a new host immediately with the same exact configuration.

    3. Re:Inconcievable! by Jim+Hall · · Score: 1

      So basically a set of servers went down, and it took down the particular instances running on those servers. Customers were still able to take the same exact image and start new instances-- it sounds like immediately. Now sure, it'd be nice if they worked out some kind of automatic clustering and failover to take care of this sort of thing for you, but when my server goes down with my dedicated host, I don't have the option to start up a new host immediately with the same exact configuration.

      I don't think I read this the same way you did. From my reading, customers could fire up a new server instance, but I doubt it had their same data. Sure, the base OS configuration was the same - but same data, I don't think so.

      From the article:

      While Amazon was correcting the problem, it told customers they had the option of launching new server instances to replace those that went down. But customers were also able to wait for their original instances to come back up after power was restored to the hardware in question.

    4. Re:Inconcievable! by jcnnghm · · Score: 2, Informative

      EC2 instances don't contain instance data. The GP is correct. State data is generally stored on S3, on shared storage, or using their db interface.

      --
      You don't make the poor richer by making the rich poorer. - Winston Churchill
    5. Re:Inconcievable! by Servo · · Score: 1

      You can make the "cloud" resilient, redundant, and highly available. They obviously did not. If the cloud is extended to two places then you don't need servers (virtual or otherwise) in two places.

      --
      A slip of the foot you may soon recover, but a slip of the tongue you may never get over. -Benjamin Franklin
  9. Do any of you know how they survived? by mr_stinky_britches · · Score: 4, Interesting

    Do any of you know how an instance could survive a power outage? Surely every operation is written out to disk before it's performed..so how did they design it?

    --
    Censorship is obscene. Patriotism is bigotry. Faith is a vice. Slashdot 2.0 sucks.
    1. Re:Do any of you know how they survived? by ShadowRangerRIT · · Score: 2, Informative

      UPS, or backup generator, or some other equivalent system that gives just enough power for a clean shut down. I've seen blades with built-in UPS (possibly not even a battery, just a capacitor) that exists solely to sync to disk in the event of a power loss.

      --
      $_ = "wftedskaebjgdpjgidbsmnjgcdwatb"; tr/a-z/oh, turtleneck Phrase Jar!/; print
    2. Re:Do any of you know how they survived? by Darkness404 · · Score: 1

      Any decent server is at the very least hooked up to a UPS, I would imagine that any mission-critical desktops would be too.

      --
      Taxation is legalized theft, no more, no less.
    3. Re:Do any of you know how they survived? by KahabutDieDrake · · Score: 4, Informative

      You've never actually worked with enterprise class gear have you? It's standard for most of the servers and all of the data storage to have capacitance/battery backups for just such an emergency.

      Typically, the RAID controller will have enough on-board capacity to clear its write cache before losing power entirely, while the drive array will be connected to a decent UPS that can hold for at least a few minutes. Meanwhile, the server itself will also likely be connected to the same UPS, or a different one.

      The real question at hand is, were the UPSes between the power distribution node and the server, or were they on the other side of the distribution node, and therefore worthless in a case like this? I've seen both configurations, but the latter is rarer. Not because of this particular case, but because of efficiency concerns.

      If there was a failure of design, it was most likely in the building wiring itself. The building was clearly not properly grounded against lightning strikes, as if it was, the surge would never have hit the internal wiring. It might have kicked the building off the grid for a time, but it should never have reached a power distribution node. Although it's likely the outcome would be similar if not identical.

    4. Re:Do any of you know how they survived? by KPU · · Score: 1

      I thought the point was software fault tolerance so that the hardware is cheap and lacks the fancy features you describe.

    5. Re:Do any of you know how they survived? by RsG · · Score: 5, Insightful

      I'm reading between the lines here (it doesn't actually say this in TFA), but it sounds like this was a direct hit. Not an outage, which is a different beast.

      A UPS is about as useful in this instance as antibiotics against a virus - it's a solution to a different problem. Surge protectors don't help much either, not unless the strike was a fairly mild and/or remote one. You could switch over to a disconnected UPS system every time there's a thunderstorm on the horizon, but that seems needlessly complicated and expensive.

      That being said, the GP referred to an outage, so you've quite correctly answered his question; it's just the wrong question to ask in this instance. And of course I could be misreading (or Amazon could be misrepresenting) the exact nature of the failure - if it were a regular outage, none of the above would apply.

      --
      Erotic is when you use a feather. Exotic is when you use the whole chicken.
    6. Re:Do any of you know how they survived? by RsG · · Score: 1

      This is one instance where you can have a system that's cheap, redundant or sophisticated, pick 2. Cloud computing is the cheap, redundant option, in which case they may have cut corners on eventualities like lightning strikes.

      I'm more curious as to why the servers were centralized enough to be vulnerable to this. Kinda defeats the purpose of redundancy, no? OTOH, it does sound like they had enough backups in place to get everything up and running again in short order, so maybe it's unfair to second-guess them.

      --
      Erotic is when you use a feather. Exotic is when you use the whole chicken.
    7. Re:Do any of you know how they survived? by sirsnork · · Score: 4, Informative

      RAID controllers have batteries so they can remember what's in the cache (for about 48 hours), not so they can write that data out to disks before they power off. When power is returned and the disks come back up, the cache is flushed before any other action, thereby keeping the array in one piece.
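
      A toy sketch of that behaviour, in case it helps. The class and method names here are made up for illustration and do not correspond to any real controller firmware or API: acknowledged writes sit in a battery-backed cache, a power loss writes nothing to disk, and the pending writes are replayed once power returns.

```python
class BatteryBackedCache:
    """Toy model of a write-back cache that survives a power loss (hypothetical)."""

    def __init__(self):
        self.disk = {}       # block number -> data that has reached the disks
        self.pending = []    # acknowledged writes still held in the cache
        self.powered = True

    def write(self, block, data):
        """Acknowledge a write by caching it; it is not on disk yet."""
        if not self.powered:
            raise RuntimeError("controller is powered off")
        self.pending.append((block, data))

    def flush(self):
        """Write every cached block to disk, in order, then empty the cache."""
        for block, data in self.pending:
            self.disk[block] = data
        self.pending.clear()

    def power_loss(self):
        """Power drops: the battery preserves the cache, nothing is flushed."""
        self.powered = False

    def power_restore(self):
        """Power returns: replay the cache before accepting any new I/O."""
        self.powered = True
        self.flush()


if __name__ == "__main__":
    cache = BatteryBackedCache()
    cache.write(7, "payload")
    cache.power_loss()
    print(cache.disk)      # still empty, the write only lives in the cache
    cache.power_restore()
    print(cache.disk)      # the cached write has now been replayed
```

      The point of the model is only the ordering: the flush happens on the way back up, not on the way down.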

      --

      Normal people worry me!
    8. Re:Do any of you know how they survived? by rivaldufus · · Score: 1

      the problem was that the lightning rods were all grounded to the pdus....

    9. Re:Do any of you know how they survived? by dkf · · Score: 1

      I'm more curious as to why the servers were centralized enough to be vulnerable to this. Kinda defeats the purpose of redundancy, no? OTOH, it does sound like they had enough backups in place to get everything up and running again in short order, so maybe it's unfair to second-guess them.

      Because with Amazon, if you really care about being resilient you need to put your instances in more than one "availability zone" (i.e., datacenter). That's how they do it, they're open about this being the case, and there's really no magic, just competent hosting.

      --
      "Little does he know, but there is no 'I' in 'Idiot'!"
    10. Re:Do any of you know how they survived? by drinkypoo · · Score: 2, Insightful

      You could switch over to a disconnected UPS system every time there's a thunderstorm on the horizon, but that seems needlessly complicated and expensive.

      Actually, that's NOT a bad idea at all. If you used fiber to the rack and you had big ugly relays that would open the connections, it might be a useful strategy in lightning country. It shouldn't be too hard to detect when lightning is striking nearby, and open the contacts. You would definitely need to do it per-rack at minimum though, because having a battery in every system is an ecological nightmare.

      --
      "You're right," Fisheye says. "I should have set it on 'whip' or 'chop.'"
    11. Re:Do any of you know how they survived? by Anonymous Coward · · Score: 0

      And of course I could be misreading (or Amazon could be misrepresenting) the exact nature of the failure -

      Misrepresenting, indeed. Just like that Air France plane which was downed not by lightning, as initially claimed, but rather by a flight computer confused by iced airspeed sensors.

      What better way to cover up your own errors than unstoppable force of nature?

    12. Re:Do any of you know how they survived? by mr_stinky_britches · · Score: 1

      Not what I was asking about...at all. Thanks anyways..

      --
      Censorship is obscene. Patriotism is bigotry. Faith is a vice. Slashdot 2.0 sucks.
    13. Re:Do any of you know how they survived? by mr_stinky_britches · · Score: 1

      In the article I got the impression that they booted things back up and their apps started running again~ my bad if I mis-read.

      Thanks everyone for the good information! I should have realized that the solution was simpler - and probability of my error greater.

      cheers.

      --
      Censorship is obscene. Patriotism is bigotry. Faith is a vice. Slashdot 2.0 sucks.
  10. **typo** should be: is NOT written out by mr_stinky_britches · · Score: 2, Informative

    **typo** should be: is NOT written out

    Sorry about that.

    --
    Censorship is obscene. Patriotism is bigotry. Faith is a vice. Slashdot 2.0 sucks.
  11. Probability by PleaseFearMe · · Score: 1

    If a service is so resilient that it takes a highly unlikely lightning strike for it to go down, then the service is good.

    1. Re:Probability by bertoelcon · · Score: 1

      Just putting the odds in your favor does not guarantee total success.

      --
      Anything can be found funny, from a certain point of view.
  12. Lightning once striked my friends house. by z4ckpete · · Score: 1, Interesting

    It blew a hole in the kitchen so big you could climb through it.

    1. Re:Lightning once striked my friends house. by Kotoku · · Score: 5, Funny

      In the civilized world, we just call those "walk through holes" doors.

  13. Apropos, sure. Irony, nah by dmomo · · Score: 0, Redundant

    Unless by Irony, you mean "like rain on your wedding day"

  14. It evidently did by grahamsz · · Score: 1, Redundant

    If I'm not mistaken then the whole point of a cloud is that you spread your processing around different hardware (in different geographies) so that no single part failing constitutes a total failure. Only one of Amazon's two zones went down so a well designed cloud app shouldn't have failed.

    1. Re:It evidently did by kasperd · · Score: 2, Informative

      Only one of Amazon's two zones went down so a well designed cloud app shouldn't have failed.

      If you want to guarantee data integrity and consistent data between your instances, then you cannot tolerate one out of two going down. Byzantine agreement protocols can tolerate failures in fewer than one third of the nodes, so you would actually need four to tolerate one failure.

      --

      Do you care about the security of your wireless mouse?
    2. Re:It evidently did by Wesley+Felter · · Score: 1

      This failure was fail-stop, not Byzantine.

    3. Re:It evidently did by friedo · · Score: 2, Informative

      Only one of Amazon's two zones went down

      There are two regions (US and EU), each with several availability zones (US currently has four). The AZs are designed to be isolated from one another. This outage affected one AZ in the US region.

      If you are doing load balancing across instances in multiple AZ's (or even using Amazon's own Elastic Load Balancing and Auto-Scaling features) you would have been fine, since this is exactly the kind of problem they're designed to handle.
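
      A minimal sketch of why that works, assuming four zones with made-up names and a plain round-robin placement. This is not Amazon's API or how Elastic Load Balancing actually places anything, just the counting argument:

```python
from collections import Counter
from itertools import cycle


def spread(instance_count, zones):
    """Assign instances to zones round-robin; returns one zone name per instance."""
    if not zones:
        raise ValueError("need at least one zone")
    zone_cycle = cycle(zones)
    return [next(zone_cycle) for _ in range(instance_count)]


def survivors(placement, failed_zone):
    """Count the instances still running after one whole zone goes dark."""
    return sum(1 for zone in placement if zone != failed_zone)


if __name__ == "__main__":
    # Hypothetical zone names, chosen only for the example.
    zones = ["zone-a", "zone-b", "zone-c", "zone-d"]
    placement = spread(8, zones)
    print(Counter(placement))               # two instances in each zone
    print(survivors(placement, "zone-a"))   # 6 of 8 keep serving
    # Everything in a single zone: losing that zone loses it all.
    print(survivors(spread(8, ["zone-a"]), "zone-a"))   # 0
```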

    4. Re:It evidently did by kasperd · · Score: 1

      This failure was fail-stop, not Byzantine.

      • Even in the fail-stop model you cannot handle one out of two failing unless you also assume synchronous communication.
      • It sounds like this isn't functionality that Amazon provides, so it is up to the customers which implementation they want to use.
      • Given the possibility of undetected bitflips and temporary network partitioning, a fail-stop synchronous model seems like asking for trouble.

      Customers shouldn't really need to run their own Byzantine agreement system though. Amazon could provide such a service for their customers. Then customers could just run two instances of whatever they are running and have those two instances be clients of the Byzantine agreement system. Of course if Amazon were to provide it, they would still have to spread the Byzantine agreement system across four different locations and ideally use four different machines in each place for a total of 16. That way they could tolerate a total of five simultaneous failures, which would be equivalent to one location gone and one of the remaining 12 machines malfunctioning in some way. But if they only have two data centers, they would need to use third party hosting for some of those.
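
      The arithmetic above as a quick sketch. The function names are made up for the example. The Byzantine bound is the "less than one third" rule from earlier in the thread (n >= 3f + 1); the crash-fault bound of n >= 2f + 1 is the usual majority-quorum rule and is an addition here, not something stated in the thread:

```python
def min_replicas(faults, byzantine=True):
    """Smallest replica count that tolerates `faults` simultaneous failures."""
    if faults < 0:
        raise ValueError("faults must be non-negative")
    # Byzantine agreement needs n >= 3f + 1; majority quorums need n >= 2f + 1.
    return 3 * faults + 1 if byzantine else 2 * faults + 1


def max_faults(replicas, byzantine=True):
    """Largest number of simultaneous failures that `replicas` nodes can tolerate."""
    if replicas < 1:
        raise ValueError("need at least one replica")
    return (replicas - 1) // 3 if byzantine else (replicas - 1) // 2


if __name__ == "__main__":
    print(min_replicas(1))                  # 4 nodes to survive one Byzantine fault
    print(max_faults(16))                   # 5, i.e. one 4-machine site plus one more
    print(max_faults(2, byzantine=False))   # 0, two nodes cannot lose one
```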

      --

      Do you care about the security of your wireless mouse?
  15. Having taken weather and climate 101... by turing_m · · Score: 2, Funny

    This is clearly a case of cloud-to-cloud lightning.

    --
    If I have seen further it is by stealing the Intellectual Property of giants.
  16. speaking of lightning and electronics. by stine2469 · · Score: 3, Interesting

    Back in the late 80's, I worked as network admin at a university. Most of the buildings on campus were relatively old, but I only had recurring problems in one of them. The building that held the English and History departments had an Equinox LM-48 in a cabinet in the back of a typing lab. One Monday morning we got a call that no one in the building could get online. I checked the DS-15 port in the data center, and sure enough, no link, so I walked over to the lab and met the assistant dean who had the keys to let me in. When he unlocked the door, we both knew something was wrong because we could both smell the fried electronics... When I disconnected the LM-48 and picked it up, we could both hear what turned out to be pieces of serial chips rattling around inside the case. I replaced the unit with a spare and took the dead one back to my office. When I opened it up, I could see a couple (don't remember how many) of the chips had been blown up. Looking back, I probably had enough information to determine which PCs weren't grounded by which chips blew up, but that didn't occur to me then. About a month or so later, the same thing happened, but it happened on a weeknight and when I heard the thunder, I knew I had just lost the replacement unit. Unfortunately, this was at 1am or so and I did not have keys to the English department.... So at 7am the next morning, when the assistant dean showed up, I was sitting outside his office with another replacement. He said something like "...the storm last night..." and I just nodded.

    I don't remember the final resolution of the problem, but I do remember that from the 2nd strike until the problem was solved, every time I heard thunder I would run to the English building and with my newly assigned key, run upstairs and disconnect the rj-21 fanout cables. I would then leave a note on the English dept office informing them that they'd need to plug them in in the a.m. One evening, I didn't make it. I heard thunder and bolted for the English dept... I had my key in the building's outside door when lightning struck the building...and I knew I was too late. When I got upstairs, I could smell burnt electronics....

    Probably at the same time as this was going on for me, my dad, who was a large-scale CSE, had similar problems. I don't know how much 16-port line-cards for the system that he was supporting cost, but one day he had to replace eight or nine of them. The next day, UPS delivered two cases of copper-fiber-copper serial surge suppressors and he scheduled time to install them. I don't think that site had problems after that.

  17. Telecom by Anonymous Coward · · Score: 0

    Telecom Class A data centers have a few characteristics to prevent - YES PREVENT - this type of issue.

    a) lightning rods at every corner of the building and the highest points that are PROPERLY GROUNDED. Sometimes you need to drip water to get a good ground.
    b) Power supplied from two or more *different* power substations
    c) Local UPSes - different for each power feed. We're talking $150K each.
    d) On site generation (diesel or gas turbine usually)
    e) Heavy construction to survive tornadoes and hurricanes
    f) Strong physical security procedures (the computer, inside the cage, inside the room, inside the room, in the center of the building).
    g) data center floors may be located on huge springs to reduce earthquake impacts.
    h) Not located in an area prone to flooding, not even 100 year floods.
    i) EMC DMX systems have built in batteries and capacitors with enough juice that if power is pulled, all data in cache will still be written to disk. http://www.emc.com/collateral/hardware/specification-sheet/c1166-dmx4-ss.pdf

    And they usually get something else that the rest of us can't - extremely high prioritization for refueling. A trauma hospital may be higher priority, but other normal hospitals are lower priority than a telecom data center.

    Did you ever wonder why your phone bill was so high? REDUNDANCY is a way of life. Chances are your telecom has automatic fail over to a redundant system 500+ miles away too. Keeping those systems and their data synchronized isn't cheap either. Fortunately, the huge data pipes are considered internal costs.