Lightning Strikes Amazon's Cloud (Really)
The Register has details on a recent EC2 outage that is being blamed on a lightning strike that zapped a power distribution unit of the data center. The interruption only lasted around 6 hours, but the irony should last much longer. "While Amazon was correcting the problem, it told customers they had the option of launching new server instances to replace those that went down. But customers were also able to wait for their original instances to come back up after power was restored to the hardware in question."
Isn't cloud computing supposed to tackle such instances?
Did it leave a silver lining?
Shop as usual. And avoid panic buying.
Naive question: Are data centers usually insured for the cost of hardware replacement and/or loss of revenue in a situation like this?
There's nothing to worry about, because as we all know, Lightning never strikes twice.
Yay for savings on the surge protectors!
I have to wonder if those who are critical of Amazon here have ever experienced a direct lightning strike? I doubt it.
Is the message clear?
-RMS
My -1 Troll is actually a +1 funny. And my -1 flame is actually a +1 insightful.
What irony?
Maybe I'm just tired, but I'm not sure what irony is being referred to by the poster.
While everyone is talking up the cloud and how resilient it is... this is just yet another example to never put all your eggs in one basket. If your service is so damn important that it can't go down - have it hosted in two places.
Notice, Amazon.com didn't go down... :)
Do any of you know how an instance could survive a power outage? Surely every operation is written out to disk before it's performed..so how did they design it?
Censorship is obscene. Patriotism is bigotry. Faith is a vice. Slashdot 2.0 sucks.
**typo** should be: is NOT written out
Sorry about that.
Censorship is obscene. Patriotism is bigotry. Faith is a vice. Slashdot 2.0 sucks.
If a service is so resilient that it takes a highly unlikely lightning strike for it to go down, then the service is good.
It blew a hole in the kitchen so big you could climb through it.
Unless by Irony, you mean "like rain on your wedding day"
If I'm not mistaken, the whole point of a cloud is that you spread your processing across different hardware (in different geographies) so that no single part failing constitutes a total failure. Only one of Amazon's two zones went down, so a well-designed cloud app shouldn't have failed.
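For what it's worth, here is a minimal sketch of that idea from the client side: try each zone in turn and only give up when all of them fail. The zone names, `demo_request`, and `AllZonesFailed` are made up for illustration; this is not Amazon's API, just the general pattern.

```python
from typing import Callable, Dict, Sequence


class AllZonesFailed(Exception):
    """Raised when every zone in the list has failed."""


def call_with_failover(zones: Sequence[str], request: Callable[[str], str]) -> str:
    """Try the request against each zone in order; return the first success."""
    errors: Dict[str, Exception] = {}
    for zone in zones:
        try:
            return request(zone)
        except (ConnectionError, TimeoutError) as exc:
            # This zone is unreachable; remember why and move on to the next.
            errors[zone] = exc
    raise AllZonesFailed(errors)


def demo_request(zone: str) -> str:
    """Hypothetical stand-in for a real call: pretend zone-a lost power."""
    if zone == "zone-a":
        raise ConnectionError("zone-a is down")
    return f"handled by {zone}"


if __name__ == "__main__":
    print(call_with_failover(["zone-a", "zone-b"], demo_request))
```

This only covers stateless requests. Anything holding data also needs that data replicated to the second zone before the first one goes dark, which is the harder part.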
This is clearly a case of cloud-to-cloud lightning.
If I have seen further it is by stealing the Intellectual Property of giants.
Back in the late 80's, I worked as network admin at a university. Most of the buildings on campus were relatively old, but I only had recurring problems in one of them. The building that held the English and History departments had an Equinox LM-48 in a cabinet in the back of a typing lab. One Monday morning we got a call that no one in the building could get online. I checked the DS-15 port in the data center, and sure enough, no link, so I walked over to the lab and met the assistant dean who had the keys to let me in. When he unlocked the door, we both knew something was wrong because we could both smell the fried electronics... When I disconnected the LM-48 and picked it up, we could both hear what turned out to be pieces of serial chips rattling around inside the case. I replaced the unit with a spare and took the dead one back to my office. When I opened it up, I could see a couple (don't remember how many) of the chips had been blown up. Looking back, I probably had enough information to determine which PCs weren't grounded by which chips blew up, but that didn't occur to me then. About a month or so later, the same thing happened, but it happened on a weeknight and when I heard the thunder, I knew I had just lost the replacement unit. Unfortunately, this was at 1am or so and I did not have keys to the English department.... So at 7am the next morning, when the assistant dean showed up, I was sitting outside his office with another replacement. He said something like "...the storm last night..." and I just nodded.
I don't remember the final resolution of the problem, but I do remember that from the 2nd strike until the problem was solved, every time I heard thunder I would run to the English building and, with my newly assigned key, run upstairs and disconnect the RJ-21 fanout cables. I would then leave a note on the English dept office informing them that they'd need to plug them back in in the a.m. One evening, I didn't make it. I heard thunder and bolted for the English dept... I had my key in the building's outside door when lightning struck the building...and I knew I was too late. When I got upstairs, I could smell burnt electronics....
Probably at the same time as this was going on for me, my dad, who was a large-scale CSE, had similar problems. I don't know how much the 16-port line cards for the system he was supporting cost, but one day he had to replace eight or nine of them. The next day, UPS delivered two cases of copper-fiber-copper serial surge suppressors and he scheduled their installation. I don't think that site had problems after that.
Telecom Class A data centers have a few characteristics to prevent - YES PREVENT - this type of issue.
a) lightning rods at every corner of the building and the highest points that are PROPERLY GROUNDED. Sometimes you need to drip water to get a good ground.
b) Power supplied from two or more *different* power substations
c) Local UPSes - different for each power feed. We're talking $150K each.
d) On site generation (diesel or gas turbine usually)
e) Heavy construction to survive tornadoes and hurricanes
f) Strong physical security procedures (the computer, inside the cage, inside the room, inside the room, in the center of the building).
g) data center floors may be located on huge springs to reduce earthquake impacts.
h) Not located in an area prone to flooding, not even 100 year floods.
i) EMC DMX systems have built in batteries and capacitors with enough juice that if power is pulled, all data in cache will still be written to disk. http://www.emc.com/collateral/hardware/specification-sheet/c1166-dmx4-ss.pdf
And they usually get something else that the rest of us can't - extremely high prioritization for refueling. A trauma hospital may be higher priority, but other normal hospitals are lower priority than a telecom data center.
Did you ever wonder why your phone bill was so high? REDUNDANCY is a way of life. Chances are your telecom has automatic fail over to a redundant system 500+ miles away too. Keeping those systems and their data synchronized isn't cheap either. Fortunately, the huge data pipes are considered internal costs.
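To put rough numbers on why the redundancy is worth paying for, here is a back-of-the-envelope sketch. The 99% single-site figure is an assumption picked for illustration, not a measured number, and the formula assumes the sites fail independently (which is the reason for putting the second one 500+ miles away).

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def combined_availability(single: float, copies: int) -> float:
    """Availability of `copies` independent sites where any one can serve."""
    # The service is down only when every copy is down at the same time.
    return 1.0 - (1.0 - single) ** copies


def expected_downtime_hours(availability: float) -> float:
    """Expected hours of downtime per year at the given availability."""
    return (1.0 - availability) * HOURS_PER_YEAR


if __name__ == "__main__":
    single = 0.99  # assumed availability of one site
    for copies in (1, 2):
        a = combined_availability(single, copies)
        print(f"{copies} site(s): {a:.4%} available, "
              f"{expected_downtime_hours(a):.2f} h/year down")
```

Under those assumptions one site is down about 87.6 hours a year, and two independent sites bring that to under an hour. Correlated failures (shared power feeds, a shared software bug, the same storm) make the real figure worse than this formula suggests.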