Car Hits Utility Pole, Takes Out EC2 Datacenter
1sockchuck writes "An Amazon cloud computing data center lost power Tuesday when a vehicle struck a nearby utility pole. When utility power was lost, a transfer switch in the data center failed to properly manage the shift to backup power. Amazon said a "small number" of EC2 customers lost service for about an hour, but the downtime followed three power outages last week at data centers supporting EC2 customers. Tuesday's incident is reminiscent of a 2007 outage at a Dallas data center when a truck crash took out a power transformer."
Seriously, Amazon screwed up in a fairly major way with this.
What more upsetting is this: If Amazon doesn't have working disaster recovery, what do other websites/companies have?
Answer: Nothing. You'd be surprised how may US small-to-medium sized business are one fire/tornado/earthquake/hurricane away from bankruptcy. I'd bet it's over 80% of them.
The classic in my last job was when we had a security contractor in on the weekend hooking something up and he looped off a hot breaker in the computer room, slipped, and shorted the white phase to ground. This blew the 100A fuses both before and after the UPS and somehow caused the generator set to fault so that while we had power from the batteries, that was all we had.
It also blew the power supply on an alphaserver and put a nice burn mark in the breaker panel. So the UPS guy comes out and he doesn't have two of the right sort of fuse. Fortunately 100A fuses are just strips of steel with two holes drilled in them and he had a file, and a drill, etc. So we got going in the end.
http://michaelsmith.id.au
I expect this is just a scaled up version of the problems I deal with every day. And I'm sure I'm not the only one. Users have grown so dependent on system services and management has grown so apart from the trenches that completely unreasonable expectations are the norm. Where I work for instance it's almost impossible to even *test* backup power and failover mechanisms and procedures because users consider even minor outages in the middle of the night unacceptable and managers either don't have the clout or don't understand the problem well enough to put limits to such expectations. As a result often times the only tests such systems get happen during real emergencies, when they are actually needed. I don't know how, but I feel we should start educating our users and managers better, not to mention being realistic about risks and expectations.
The DC that my company colos a few racks in had this same thing happen about a year ago (not a car crash, just a transformer blew out). But the transfer switch failed to switch to backup power, and the DC lost power for 3 hours.
What is up with these transfer switches? Do the DCs just not test them? Or is it the sudden loss of power that freaks them out vs a controlled "ok we're cutting to backup power now" that would occur during a test? Someone with more knowledge of DC power systems might enlighten me...
Most Americans these days are over-pampered self-absorbed malcontents. If the poles are not out in front where crews can service without going on property -- or even using predefined right of ways -- too many people complain or sue for negligible property damage.
Where I grew up, the power poles ran on the property lines behind and between the houses. Once, lightning took out the transformer on the power pole [great light show and high speed spark ejection] ; and people were willing to take down the fence, put the dogs in a kennel, and remove landscaping which had encroached on the power pole so the crew could replace the transformer and other service. Today, I expect everyone shows up with a digital camera to document "property damage" to file for compensation for landscaping which has illegally encroached on the equipment.
Many places various issues prevent burying the power cable: high water table, daytime temperatures which do not cool the ground -- and the power cables, or even fire ants.
Every mans' island needs an ocean; choose your ocean carefully.
Often, mods will give a funny post "insightful" instead of "funny" because it gives the user positive karma (whereas funny does not affect karma). Not a use intended by CmdrTaco, I'd imagine, but it's a common practice.
For years, I co-located at the top-rated 365 Main data center in San Francisco, CA until they had a power failure a few years ago. Despite having 5x redundant power that was regularly tested, it apparently wasn't tested against a *brown out*. So when Pacific Gas and Electric had a brownout, it failed to trigger 2 of the 5 redundant generators. Unfortunately, the system was designed so that any *one* of the redundant generators could fail and there wouldn't be any problem.
So power was in a brownout condition, the voltage dropped from the usual 120 volts or so down to 90. Many power supplies have brownout detectors and will shut off. Many did, until the total system load dropped to the point where normal power was restored. All of this happened within a few seconds, and the brownout was fixed in just a few minutes. But at the end of it all, there was perhaps 20% of all the systems in the building shut down. The "24x7 hot hands" were beyond swamped. Techies all around the San Francisco area were pulled from whatever they were doing to converge on downtown SF. And me, 4 hours drive away, managed to restore our public-facing services on the one server (of four) I had that survived the voltage spikes before driving in. (Alas, my servers had the "higher end" power supplies with brownout detection)
And so it was a long chain of almost success of well-tested, high-quality equipment that failed all in sequence because real life didn't happen to behave like the frequently performed tests did.
When I did finally arrive, the normally quiet, meticulously clean facility was a shambles. Littered with bits of network cable, boxes of freshly-purchased computer equipment, pizza boxes, and other refuge were to be found in every corner. The aisles were crowded with techies performing disk checks and chattering tersely on cell phones. It was other-worldly.
All of my systems came up normally; simply pushing the power switch and letting the fsck run did the trick, we were fully back up and all tests performed (and the system configuration returned to normal) in about an hour.
Upon reflection, I realized that even though I had some down time, I was really in a pretty good position:
1) I had backup hosting elsewhere, with a backup from the previous night. I could have switched over, but decided not to because we had current data on one system and we figured it was better not to have anybody lose any data than to have everybody lose the morning's work.
2) I had good quality equipment; the fact that none of my equipment was damaged from the event may have been partly due to the brownout detection in the power supplies of my servers.
3) At no point did I have any less than two backups off site in two different location, so I had multiple, recent data snapshots off site. As long as the daisy chains of failure can be, it would be freakishly rare to have all of these points go down at once.
4) Even with 75% of my hosting capacity taken offline, we were able to maintain uptime throughout all this because our configuration has full redundancy within our cluster - everything is stored in at least 2 places onsite.
Moral of the story? Never, EVER have all your eggs in one basket.
I have no problem with your religion until you decide it's reason to deprive others of the truth.
Doesn't EC2 let you request hosts in any of several particular datacentres (which they call an "availability zones") just so you can plan around such location-specific catastrophes? No matter how good the redundant systems, some day a meteor will hit one datacentre and you'll be S.O.L. no matter what if you put all your proverbial eggs in that basket.
Only a fool cares about a single-datacentre outage. This is why it's called "*distributed*-systems engineering", folks.
Cherish. Live. Dream.