Lightning Strikes Amazon's Cloud (Really)
The Register has details on a recent EC2 outage that is being blamed on a lightning strike that zapped a power distribution unit of the data center. The interruption only lasted around 6 hours, but the irony should last much longer. "While Amazon was correcting the problem, it told customers they had the option of launching new server instances to replace those that went down. But customers were also able to wait for their original instances to come back up after power was restored to the hardware in question."
Isn't cloud computing supposed to tackle such instances?
Naive question: Are data centers usually insured for the cost of hardware replacement and/or loss of revenue in a situation like this?
There's nothing to worry about, because as we all know, Lightning never strikes twice.
Yay for savings on the surge protectors!
I have to wonder if those who are critical of Amazon here have ever experienced a direct lightning strike? I doubt it.
Is the message clear?
-RMS
My -1 Troll is actually a +1 funny. And my -1 flame is actually a +1 insightful.
What irony?
Maybe I'm just tired, but I'm not sure what irony is being referred to by the poster.
While everyone is talking up the cloud and how resilient it is... this is just another reminder never to put all your eggs in one basket. If your service is so damn important that it can't go down, have it hosted in two places.
Notice, Amazon.com didn't go down... :)
Do any of you know how an instance could survive a power outage? Surely every operation is written out to disk before it's performed..so how did they design it?
Censorship is obscene. Patriotism is bigotry. Faith is a vice. Slashdot 2.0 sucks.
**typo** should be: is NOT written out
Sorry about that.
In the civilized world, we just call those "walk through holes" doors.
AnimePapers.org: Anime Wallpapers Handled With Care
If you want to guarantee data integrity and consistent data between your instances, then you cannot tolerate one out of two going down. Byzantine agreement protocols can tolerate less than one third failures, so you would actually need four to tolerate one failure.
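The arithmetic behind that claim can be sketched in a few lines. This is only an illustration of the "fewer than one third faulty" bound quoted above, not any particular protocol; the function names are made up for the example.

```python
def min_replicas_byzantine(faults: int) -> int:
    """Smallest replica count n such that `faults` is strictly less than n / 3.

    Assumes the classic bound for Byzantine agreement: n >= 3f + 1.
    """
    if faults < 0:
        raise ValueError("faults must be non-negative")
    return 3 * faults + 1


def max_byzantine_faults(replicas: int) -> int:
    """Largest number of faulty replicas f satisfying 3f < replicas."""
    if replicas < 1:
        raise ValueError("replicas must be positive")
    return (replicas - 1) // 3


if __name__ == "__main__":
    # Two or three replicas tolerate no Byzantine faults; four tolerate one.
    for n in (2, 3, 4, 7):
        print(n, "replicas tolerate", max_byzantine_faults(n), "fault(s)")
```

Under that bound, two hosting locations tolerate zero such failures, which is the point being made about "two places" not being enough for this kind of guarantee.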
Do you care about the security of your wireless mouse?
This is clearly a case of cloud-to-cloud lightning.
If I have seen further it is by stealing the Intellectual Property of giants.
Only one of Amazon's two zones went down
There are two regions (US and EU), each with several availability zones (the US currently has four). The AZs are designed to be isolated from one another. This outage affected one AZ in the US region.
If you are doing load balancing across instances in multiple AZs (or even using Amazon's own Elastic Load Balancing and Auto-Scaling features) you would have been fine, since this is exactly the kind of problem they're designed to handle.
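To make the idea concrete, here is a toy sketch of spreading requests across instances in more than one zone and skipping a zone that has gone dark. It does not use any Amazon API; the instance IDs, zone names, and function names are all hypothetical, and a real load balancer would rely on health checks rather than a hand-supplied set of failed zones.

```python
from itertools import cycle


def healthy_instances(instances, failed_zones):
    """Keep only the instances whose zone is not marked as failed."""
    return [inst for inst in instances if inst["zone"] not in failed_zones]


def route(requests, instances, failed_zones=frozenset()):
    """Assign requests round-robin to instances in zones that are still up."""
    pool = healthy_instances(instances, failed_zones)
    if not pool:
        raise RuntimeError("no healthy instances in any zone")
    targets = cycle(pool)
    return [(req, next(targets)["id"]) for req in requests]


# Hypothetical deployment: three instances spread over two zones.
INSTANCES = [
    {"id": "web-1", "zone": "zone-a"},
    {"id": "web-2", "zone": "zone-b"},
    {"id": "web-3", "zone": "zone-a"},
]

if __name__ == "__main__":
    print(route(["r1", "r2", "r3", "r4"], INSTANCES))
    # Simulate zone-a losing power: traffic falls back to zone-b only.
    print(route(["r1", "r2"], INSTANCES, failed_zones={"zone-a"}))
```

If every instance had lived in zone-a, the second call would raise instead of routing, which is the single-basket situation the outage exposed.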
Back in the late '80s, I worked as a network admin at a university. Most of the buildings on campus were relatively old, but I only had recurring problems in one of them. The building that held the English and History departments had an Equinox LM-48 in a cabinet in the back of a typing lab. One Monday morning we got a call that no one in the building could get online. I checked the DS-15 port in the data center, and sure enough, no link, so I walked over to the lab and met the assistant dean who had the keys to let me in. When he unlocked the door, we both knew something was wrong because we could both smell the fried electronics... When I disconnected the LM-48 and picked it up, we could both hear what turned out to be pieces of serial chips rattling around inside the case. I replaced the unit with a spare and took the dead one back to my office. When I opened it up, I could see a couple (don't remember how many) of the chips had been blown up. Looking back, I probably had enough information to determine which PCs weren't grounded by which chips blew up, but that didn't occur to me then.
About a month or so later, the same thing happened, but it happened on a weeknight, and when I heard the thunder, I knew I had just lost the replacement unit. Unfortunately, this was at 1am or so and I did not have keys to the English department.... So at 7am the next morning, when the assistant dean showed up, I was sitting outside his office with another replacement. He said something like "...the storm last night..." and I just nodded.
I don't remember the final resolution of the problem, but I do remember that from the 2nd strike until the problem was solved, every time I heard thunder I would run to the English building and, with my newly assigned key, run upstairs and disconnect the RJ-21 fanout cables. I would then leave a note on the English dept office informing them that they'd need to plug them back in in the a.m. One evening, I didn't make it. I heard thunder and bolted for the English dept... I had my key in the building's outside door when lightning struck the building...and I knew I was too late. When I got upstairs, I could smell burnt electronics....
Probably at the same time as this was going on for me, my dad, who was a large-scale CSE, had similar problems. I don't know how much the 16-port line cards for the system he was supporting cost, but one day he had to replace eight or nine of them. The next day, UPS delivered two cases of copper-fiber-copper serial surge suppressors and he scheduled their installation. I don't think that site had problems after that.