Car Hits Utility Pole, Takes Out EC2 Datacenter
1sockchuck writes "An Amazon cloud computing data center lost power Tuesday when a vehicle struck a nearby utility pole. When utility power was lost, a transfer switch in the data center failed to properly manage the shift to backup power. Amazon said a "small number" of EC2 customers lost service for about an hour, but the downtime followed three power outages last week at data centers supporting EC2 customers. Tuesday's incident is reminiscent of a 2007 outage at a Dallas data center when a truck crash took out a power transformer."
"Whatever can go wrong, will" rings pretty true here. Makes for an exciting day of work for them though, I suppose; unlike yours truly. /.*
*Goes back to reading
And, as a result, Farmville/Mafiawars updates on Facebook temporarily stop.
Nothing of value was lost.
Kriston
"The cloud" doesn't solve everything. Film at 11.
The poll goes perfectly with the story.
The cloud is nice, but unreliable, it is.
Someday we'll hit the human carrying capacity. And the band will just play on.
Blame Amazon for not load-testing their emergency backup power on a regular basis and not having more than one connection to the power grid, and blame the power grid for not having redundancies. Our aging power grid is really beginning to show its age on so many levels that this is going to become a lot more common over the coming years.
"There might be intelligent beings created by God in outer space even if there are none here on Earth." -Anonymous
Utility poles clearly need countermeasures. Hellfire missiles and such. That'll teach 'em to mess with a poor defenseless pole.
Rhymes that keep their secrets will unfold behind the clouds. There upon the rainbow is the answer to a neverending story
In Soviet Russia, utility pole hits YOU!
Stop driving like a dork, Dave...I'm getting sleepy...
Seriously, Amazon screwed up in a fairly major way with this.
What's more upsetting is this: If Amazon doesn't have working disaster recovery, what do other websites/companies have?
Answer: Nothing. You'd be surprised how many US small-to-medium-sized businesses are one fire/tornado/earthquake/hurricane away from bankruptcy. I'd bet it's over 80% of them.
The classic in my last job was when we had a security contractor in on the weekend hooking something up and he looped off a hot breaker in the computer room, slipped, and shorted the white phase to ground. This blew the 100A fuses both before and after the UPS and somehow caused the generator set to fault so that while we had power from the batteries, that was all we had.
It also blew the power supply on an alphaserver and put a nice burn mark in the breaker panel. So the UPS guy comes out and he doesn't have two of the right sort of fuse. Fortunately 100A fuses are just strips of steel with two holes drilled in them and he had a file, and a drill, etc. So we got going in the end.
http://michaelsmith.id.au
I expect this is just a scaled up version of the problems I deal with every day. And I'm sure I'm not the only one. Users have grown so dependent on system services and management has grown so apart from the trenches that completely unreasonable expectations are the norm. Where I work for instance it's almost impossible to even *test* backup power and failover mechanisms and procedures because users consider even minor outages in the middle of the night unacceptable and managers either don't have the clout or don't understand the problem well enough to put limits to such expectations. As a result often times the only tests such systems get happen during real emergencies, when they are actually needed. I don't know how, but I feel we should start educating our users and managers better, not to mention being realistic about risks and expectations.
Stop building those things so fucking close to the roads, maybe?
Slashdot requires you to wait longer between hitting 'reply' and submitting a comment.
The DC that my company colos a few racks in had this same thing happen about a year ago (not a car crash, just a transformer blew out). But the transfer switch failed to switch to backup power, and the DC lost power for 3 hours.
What is up with these transfer switches? Do the DCs just not test them? Or is it the sudden loss of power that freaks them out vs a controlled "ok we're cutting to backup power now" that would occur during a test? Someone with more knowledge of DC power systems might enlighten me...
It's a good thing that oil rigs are better managed than data centers. Who knows what might happen if one of them ever had a problem like this?
Is it just me or is the placement of a lot of recent links aggravating? Shouldn't the link be from "Amazon cloud computing data center lost power" and not the bit about a utility pole getting struck?
I think it should be, but I'm no Angus Mickleburger.
Why couldn't they just get power from the cloud?
Your hair look like poop, Bob! - Wanker.
All a fuse is is a piece of metal that will melt fairly quickly when a given amount of current is passed through it. Idea being that it heats up and melts before the wires can. So, the bigger the current, the more robust the metal connecting it. A 100A fuse is usually a fairly large strip of steel.
Now I'll admit that just grabbing an approximate size of steel and placing it in as the GP did isn't going to yield a nice precise fuse. It may have been too high a current. However, it'd work for getting things running again and probably provide a modicum of protection in the event of a short.
Funny != insightful
That is why datacenters should have (and my company does) dual UPSes, dual transfer switches, and dual generators. They also should not load any circuit over 50%, to ensure a cascading failure won't happen if power is lost on one side.
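A quick sanity check of the 50% rule, as a minimal sketch. It assumes two feeds (call them A and B) with every device dual-corded, so when one side drops its whole load lands on the survivor. The function name and the amp figures are made up for illustration.

```python
def survives_single_feed_loss(load_a_amps, load_b_amps, circuit_rating_amps):
    """Return True if one feed alone can carry the combined load.

    Assumption: all equipment is dual-corded, so losing one feed shifts
    its entire load onto the other feed's circuit.
    """
    combined = load_a_amps + load_b_amps
    return combined <= circuit_rating_amps


# Hypothetical 30 A circuits on each side.
print(survives_single_feed_loss(14, 14, 30))  # each side under 50% of rating
print(survives_single_feed_loss(20, 20, 30))  # each side at about 67% of rating
```

The second case is the cascading failure: the surviving circuit is asked for 40 A on a 30 A rating and trips too.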
It's also completely expected by those not sold on pure science fiction.
All it can take is a backhoe in the wrong place at the wrong time, or a dragging anchor cable, to cut you off from the one or two very real bits of infrastructure that people fantasise are their bit of the "cloud".
For years, I co-located at the top-rated 365 Main data center in San Francisco, CA until they had a power failure a few years ago. Despite having 5x redundant power that was regularly tested, it apparently wasn't tested against a *brown out*. So when Pacific Gas and Electric had a brownout, it failed to trigger 2 of the 5 redundant generators. Unfortunately, the system was designed so that any *one* of the redundant generators could fail and there wouldn't be any problem.
So power was in a brownout condition, the voltage dropped from the usual 120 volts or so down to 90. Many power supplies have brownout detectors and will shut off. Many did, until the total system load dropped to the point where normal power was restored. All of this happened within a few seconds, and the brownout was fixed in just a few minutes. But at the end of it all, there was perhaps 20% of all the systems in the building shut down. The "24x7 hot hands" were beyond swamped. Techies all around the San Francisco area were pulled from whatever they were doing to converge on downtown SF. And me, 4 hours drive away, managed to restore our public-facing services on the one server (of four) I had that survived the voltage spikes before driving in. (Alas, my servers had the "higher end" power supplies with brownout detection)
And so it was a long chain of almost success of well-tested, high-quality equipment that failed all in sequence because real life didn't happen to behave like the frequently performed tests did.
When I did finally arrive, the normally quiet, meticulously clean facility was a shambles. Bits of network cable, boxes of freshly purchased computer equipment, pizza boxes, and other refuse were to be found in every corner. The aisles were crowded with techies performing disk checks and chattering tersely on cell phones. It was other-worldly.
All of my systems came up normally; simply pushing the power switch and letting the fsck run did the trick. We were fully back up, with all tests performed (and the system configuration returned to normal), in about an hour.
Upon reflection, I realized that even though I had some down time, I was really in a pretty good position:
1) I had backup hosting elsewhere, with a backup from the previous night. I could have switched over, but decided not to because we had current data on one system and we figured it was better not to have anybody lose any data than to have everybody lose the morning's work.
2) I had good quality equipment; the fact that none of my equipment was damaged from the event may have been partly due to the brownout detection in the power supplies of my servers.
3) At no point did I have any less than two backups off site in two different locations, so I had multiple, recent data snapshots off site. As long as the daisy chains of failure can be, it would be freakishly rare to have all of these points go down at once.
4) Even with 75% of my hosting capacity taken offline, we were able to maintain uptime throughout all this because our configuration has full redundancy within our cluster - everything is stored in at least 2 places onsite.
Moral of the story? Never, EVER have all your eggs in one basket.
I have no problem with your religion until you decide it's reason to deprive others of the truth.
Who, while driving through a cloud, would ever expect to hit a utility pole? Clouds do not have utility poles. Now, tule fog has utility poles. That is not why they call it 'tule' (not a nickname for utility, but for a grass), but many a utility pole has been unduly undone because someone drove through the tule fog and into the utility pole.
If Amazon is going to put utility poles in its 'cloud', then they are really in a fog. Call it fog computing.
Doesn't EC2 let you request hosts in any of several particular datacentres (which they call "availability zones") just so you can plan around such location-specific catastrophes? No matter how good the redundant systems, some day a meteor will hit one datacentre and you'll be S.O.L. no matter what if you put all your proverbial eggs in that basket.
Only a fool cares about a single-datacentre outage. This is why it's called "*distributed*-systems engineering", folks.
Cherish. Live. Dream.
And here was me thinking planes were the real threat to cloud computing. I'm beginning to think I don't quite get this newfangled technology lark...
Reminds me of a company I worked for. They had a data centre divided into 5 zones, each zone with its own UPS. Each zone was connected to the neighbouring zone with a transfer switch, and each UPS could handle 2 zones until the diesel generators kicked in. Each year for 5 years management decided that the cost of the downtime to do annual maintenance was too high, so it wasn't done. Outside power finally goes away and 4 of the five zones stay up. The investigation determined that the battery-powered (natch, the power is out) transfer switches on both neighbouring zones failed because their batteries had failed. Turns out putting in new batteries was part of the annual maintenance checklist, and they had a shelf life of 4+ years.
How about the company with the diesel generator that has 5 hours of fuel? They test it for 1 hour every year. On year 5 the power goes out and the generator runs for one hour before running out of fuel. Seems the test procedure didn't include refuelling the generator.
The point is that even with what you think is the best of planning and testing, sometimes stuff happens.
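The generator story is just arithmetic, sketched below. It assumes four annual one-hour tests had already run before the year-5 outage and that nobody topped up the tank; the function and parameter names are invented for the example.

```python
def runtime_hours_left(tank_hours, test_hours, tests_run, refuel_after_test=False):
    """Hours of generator runtime left when the real outage arrives.

    Assumption: fuel burn is counted in hours of runtime, and a refuel
    after each test (if it happens) brings the tank back to full.
    """
    if refuel_after_test:
        return tank_hours
    return max(tank_hours - test_hours * tests_run, 0)


# 5-hour tank, 1-hour test each year, four tests before the outage.
print(runtime_hours_left(5, 1, 4))                          # 1
print(runtime_hours_left(5, 1, 4, refuel_after_test=True))  # 5
```

One line in the test procedure ("refill the tank") is the difference between one hour and five.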
Yeah, it isn't worse than maintaining a server in your room. But the thing is, it should be orders of magnitude better. One of the main reasons to use the cloud is that there, with loads of servers centralized, redundancy and the like are a lot better taken care of than what you could achieve by yourself.
Now, I don't claim that Amazon fucked up here. It seems that they only had one faulty switch and most things worked exactly as they should have. Good enough IMO. But the argument "You have the same problems if you do it yourself" is kinda useless if one of the reasons to use the cloud is to get rid of those risks.
He hit the cloud! The Cloud! OMG..
And the 'undefined' location of services suddenly seems very much 'defined'..
When was the last time anyone heard of a TV Network going dark for an hour? A Hospital Emergency Room? IT guys always run around like self-important Star Trek Blue Shirts, but they never seem to take the proper steps to ensure -- really ensure -- their uptime.
I'm sure there are exceptions, but it just seems that they have a ways to go, compared to the real "critical systems" industries to which they are so fond of comparing themselves. Is it money, arrogance, or ignorance?
I think part of the problem might be money. TV Networks and hospital emergency rooms realize that their business and people's lives depend on their uptime. Many bosses in IT, not so much. So while I suspect most IT guys would love n+1 everything with regular tests, that takes resources, which are sometimes not allocated. Note: quantification is left deliberately vague because I have 0 numbers to back up my point with. Except for that previous 0.
When was the last time anyone heard of a TV Network going dark for an hour? A Hospital Emergency Room? IT guys always run around like self-important Star Trek Blue Shirts, but they never seem to take the proper steps to ensure -- really ensure -- their uptime.
I'm sure there are exceptions, but it just seems that they have a ways to go, compared to the real "critical systems" industries to which they are so fond of comparing themselves. Is it money, arrogance, or ignorance?
Us IT guys? We run around like we are wearing self-important Star Trek Blue Shirts alright. We just don't realize that our shirts are actually Red.
Try to hack my 31337 firewall!
When was the last time anyone heard of a TV Network going dark for an hour?
Hmm, let me think. How about yesterday?
People would die if an ER lost power, and that would cost the hospital and doctors an absolute fortune. Same with TV stations: they'd lose all that ad revenue. But this isn't unheard of. I've seen plenty of screwed up channels in the last few years.
IT depts rarely have the budgets to allow for decent failover systems. Just a bunch of UPSes and the odd backup generator to allow graceful shutdowns. You can bet hospital servers are the same; they're not on the same circuits as medical equipment. You can see special power sockets for the decent supplies, with labels warning they shouldn't be used for anything other than the intended life support equipment.
Really? Going subterranean with the infrastructure is your example of how to do things right? Doesn't the underground part of California periodically move around in unpredictable & dramatic ways? And not just 10 o'clock news dramatic, but the kind of Earth ripping that scoffs at the works of man.
It wasn't a Google street mapping car was it? Now THAT would be a good story.
Who do you think manages those TV Networks? Those Hospital Emergency Rooms? It's IT guys. The up-time tends to meet the needs. The stock markets are kept up quite well. I would be willing to bet the internal systems at Goldman Sachs, Fidelity, Bank Of America, etc. are all quite stable. Some little $100k revenue business website I run on EC2 with no geographical load balancing on different trunks is *not* as important. And usually it's not the IT guys that set the up-times, it's the management/budgeting process that makes those decisions. It's usually not worth the cost to maintain the uptimes you're talking about.
It's not a matter of I.T. guys not taking the proper steps.
It's a matter of price versus "what if". YOU try to convince a pointy haired boss to spend thousands and thousands of extra dollars on something that "may" happen.
It's often hard enough to convince higher ups to just upgrade old infrastructures that are maxed out on resources. Even if you have proof of issues or near failures. The ONLY time they will happily spend money on upgrades and making your infrastructure more robust is after there has been a critical failure and they actually see their bottom line being hurt and even then if you don't get the approval and dollars fast enough, you run the risk of "What are the chances THAT will happen again?"
More often than not, infrastructure is patches built on patches, one I.T. guy coming in trying to "correct" mistakes of his/her predecessor (who they then realize was working with an underwhelming budget), THEN realizing that it's such a mish mash of bubblegum and duct tape, that any serious fixes would require serious downtime with a complete overhaul. Otherwise you run the risk of the whole thing imploding like a blackhole.
How many I.T. guys seriously have the guts to walk up to their boss after being on the job for only a week and say, "I need 50k and your network will be going up and down for two weeks as I rebuild and fix it all"?
I tried it. I, however, had the ammunition that my company went from 3 people to 40 people in 18 months with another 20 predicted in the next 6 months and that the two box servers were maxed out AND that we were renovating a newly purchased building so we could plan everything from cabling, to telephony to security and future planning for 250+ people.
It also didn't hurt that my boss knows that I.T. is an investment when done right and NOT an expense. Even then with everything on my side it still took 3 months of planning, proving, mapping, designing and quoting from vendor after vendor before approval went through.
Good.. Bad.. I'm the guy with the gun.
I think money plays a huge part in this too. I love it when I tell someone that making their current $200k setup 24-7 with five 9's (or greater) uptime will cost millions (usually due to ridiculous network costs across sites), and they quickly sign off on keeping things the way they are. And yet I still take the flak when it does go down.
Or worse, they don't realize that a 5 nines uptime doesn't mean that the system never comes down for maintenance. Then failure ensues from that disaster.
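For a sense of scale, here is a back-of-the-envelope downtime budget. It is only a sketch: it assumes availability is measured over an average 365.25-day year and that planned maintenance counts against the budget like any other outage. The function name is made up.

```python
MINUTES_PER_YEAR = 365.25 * 24 * 60  # average year, leap days included


def allowed_downtime_minutes(availability_percent):
    """Minutes per year a system may be down and still meet the target."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


for target in (99.9, 99.99, 99.999):
    minutes = allowed_downtime_minutes(target)
    print(f"{target}% uptime allows about {minutes:.1f} minutes of downtime a year")
```

Five nines works out to a little over five minutes a year, so a single one-hour maintenance window blows the budget roughly eleven times over.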
So a car just happens to take out a strategic Amazon datacenter? By any chance, was the car a Mini Cooper? Were the paramedics able to attach a neck-brace to the driver over his black turtleneck? And what's with the strange email sent moments before? "The JOB will be done in a flash [sent from my iPhone]"
UTF-8: There and Back Again
People would die if an ER lost power, and that would cost the hospital and doctors an absolute fortune.
True? People die in hospitals from hospital-borne infections all the time. I'm not seeing a tidal wave of lawsuits that would motivate them to clean up.
Clearly this was a test run for an upcoming terrorist attack! If one downed pole can bring down a data center, imagine what 19 downed poles (the number of hijackers on 9/11) could do! It would destroy the economy and lead to famine in the land! Clearly, we need to restrict who can drive a car!! We need government tracking of all cars!!! Background checks for everybody requesting a driver's license!!!! A kill switch on every vehicle!!!!! It's a dirty job, but someone's gotta do it!
He who laughs last, thinks slowest.
Usually, TV stations (that get fined for being off the air for not using their spectrum) and hospitals (which, you know, you can die at if the power goes out depending on your circumstances) have an easier time getting money for redundancy because the bad results are more expensive than if LOLcats is down.
IT guys always run around like self-important Star Trek Blue Shirts, but they never seem to take the proper steps to ensure -- really ensure -- their uptime.
Never send someone from medical to do an engineer's job.
When was the last time anyone heard of a TV Network going dark for an hour? A Hospital Emergency Room? IT guys always run around like self-important Star Trek Blue Shirts, but they never seem to take the proper steps to ensure -- really ensure -- their uptime.
Budget.
I'm sure the IT guys could accomplish it, but it's a matter of priorities on where the money goes. Redundant grids and generators don't grow on trees.
The major hospitals where I live are usually connected to 2 and often 3 different grids--the electrical grid is / was often designed around their needs (some have been around for 100+ years, so they were around even before the grid was). Hell, my house was built before electrical service (there are still some pipes in the walls that were used for gas lighting).
It's all about risk analysis.
When was the last time anyone heard of a TV Network going dark for an hour? A Hospital Emergency Room? IT guys always run around like self-important Star Trek Blue Shirts, but they never seem to take the proper steps to ensure -- really ensure -- their uptime.
I'm sure there are exceptions, but it just seems that they have a ways to go, compared to the real "critical systems" industries to which they are so fond of comparing themselves. Is it money, arrogance, or ignorance?
Probably a combination of them, depending on the location, but I have seen many, many times where the money is not there to do it right - and because that happens so often, too many IT admins don't *know* what *right* is.
Well for one thing power companies go way out of their way to keep hospitals up. Also I did work in Hospital IT. Guess what?
The backup gen set didn't have the power to handle the AC! We had to be in the machine room ready to do a power down on the System 38 if the temp got too high. That was just during a test!
During a real power outage we were to shut down the S38 to keep the lab system on the DG Eclipse up and running.
Yes it was a long time ago.
See my blog http://ilovecookes.blogspot.com/ for light hearted technical information.
When was the last time anyone heard of a TV Network going dark for an hour? A Hospital Emergency Room? The people who set the budget for IT guys always run around like self-important Star Trek Blue Shirts, but they never seem to set the proper priorities to ensure -- really ensure -- their uptime.
There. Fixed that for you.
The reason you rarely see an ER go down for want of power is that, knowing that lives depend on it, the people responsible for providing for it are willing to spend what it takes, in capital investment and in manpower for ongoing maintenance and operation so that an acceptable level of availability is guaranteed. Amazon and (last year) Rackspace, not so much.
In a world with Windows clients who needs uptime?
For anyone too lazy to read all the comments, let alone the article, allow me to summarize them:
1) "I am a guy or know a guy in a datacenter who once sneezed and took down the entire eastern seaboard. This is perfectly understandable."
2) "I am a guy or know a guy in a datacenter. You could set off a nuclear bomb inside it and everyone could keep playing Farmville without missing a beat. This is unacceptable."
3) "This is capitalism's fault. The United States is a terrible country. People are fat and lazy."
4) "I'm pretending I don't know what cloud computing is so I can make a pun about meteorological clouds."
5) "I honestly don't even know which thread I'm posting in. The comment I'm responding to is already entirely off topic."
"The cloud" doesn't solve everything. Film at 11.
Actually, "The Cloud" totally bails you out in cases like this. Consider the rackspace outage mentioned in TFA, or ThePlanet's huge outage back in '08. If you were affected by one of those events, you were totally hosed.
On the other hand, had your app been running in EC2, you could simply relaunch your dead instance in another datacenter. You can use Amazon's automated service to do this, or you can roll your own, if you'd like.
EC2 users who had adverse outcomes due to the power outage simply failed to architect their application for their underlying hardware. Amazon is frank with users (it's all over their user guides and FAQs) that they are providing cheap instances on nodes built with commodity hardware. If you run your app on EC2, the redundancy and failover is the responsibility of your app, because AWS is not providing this. AWS is very clear about this: any given node might, at any given point in time, simply vanish. In practice, you get pretty decent uptime with EC2, but you cannot depend on this!
With a cluster of EC2 instances running across different Availability Zones and some decent monitoring/failover, it's actually pretty easy and cheap (compared with running your app in multiple physical datacenters) to achieve respectable uptime for your mission-critical apps. On the other hand, if you have single points of failure all over the place, or if you (gasp) just run your app on a single instance with no automated monitoring/failover (in other words, you have architected your app exactly how AWS recommends against), you are going to be really disappointed.
But even if you do have an application running on a single instance, if you use an EBS-backed instance, you should be able to relaunch your instance, and be back up and running as though nothing happened. Obviously your app would be down in the meantime, but you have way more flexibility than if your physical node goes down in a traditional datacenter.
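The monitoring/failover idea can be sketched in a few lines. This is only an illustration of the decision logic, not Amazon's API: the zone names, the `is_healthy` callback, and `choose_zone` are all hypothetical, and the actual relaunch would go through whatever provider tooling you use.

```python
def choose_zone(zones, is_healthy, current=None):
    """Pick the zone an instance should run in.

    Stay put while the current zone passes its health check; otherwise
    fall through to the first healthy zone in the list, or None if the
    whole list is down.
    """
    if current is not None and is_healthy(current):
        return current
    for zone in zones:
        if is_healthy(zone):
            return zone
    return None


# Hypothetical zones; pretend a car just took out the power to zone-a.
zones = ["zone-a", "zone-b", "zone-c"]
down = {"zone-a"}
print(choose_zone(zones, lambda z: z not in down, current="zone-a"))  # zone-b
```

A monitoring loop would call something like this on every failed health check and relaunch from the EBS volume or latest snapshot in whichever zone comes back.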
They don't grade fathers, but if your daughter's a stripper, you fucked up. --Chris Rock
Insightful, more like troll... Do you have any idea how hard it is to run an ISP/Data Center? Do you work in one? I do, and let me tell you what we always say "If it was easy, everyone would do it".
TV Network going dark... yes satellite stations in markets do go down for hours when their towers are hit.
Emergency Rooms... how many times have you heard of them having to be evacuated due to power loss in severe weather. It happens.
Self Important Blue Shirts?... come on man... WTF does that even mean... you just had to add that extra bit of holier than thou...
I get so tired of all these arm chair IT experts getting modded insightful etc... when it's nothing more than a rant about something they really have no clue about. Running an ISP isn't like plugging your SOHO router in at home. ( it aint like dusting crops back home kid, tit for tat on the pun )
Granted this company obviously doesn't run proper disaster recovery drills monthly like we do, or maybe they do and simply had a coincidental switchover fail.
However, to make a lump statement then add "I'm sure there are exceptions" in a holier than thou statement and be modded insightful, just goes all over me, and if you haven't noticed the Internet IS a 'critical system' in today's world. Not everyone uses it for just gaming and torrents like so many on here think the Internet is solely built for. Most people use it to run their business, pay their bills, use their phone, go to school, check their bank/credit card accounts, etc... To me that is a VERY "critical system" and your comment to me is nothing more than arrogant and ignorant.
Took our site down battleempire.com
War is not determined by who is right, but who is left.
When was the last time anyone heard of a TV Network going dark for an hour?
Hmm, let me think. How about yesterday?
This was probably the result of a server or application failure!
"really ensure -- their uptime."
That's because blueshirt IT guys are really redshirts who are expendable.
But seriously, most IT guys can't put into dollar and cents what the cost of data and power redundancy really is.
When failure happens, the PHBs want to know "how much does this cost"; the problem is getting the PHB to realize what it costs BEFORE failure happens.
And while you can't prevent failure 100% of the time, you can have contingencies in place to deal with failures and mitigate against more common types of failures.
The actual cost of stuff not running during power failures is higher than most people know, but they don't ever think about it in the proper way.
Where I work, we lose power for periods of longer than an hour a couple, three times a year. Often it can last several hours and occasionally a day or two.
While the power is gone, NOBODY works. ANYWHERE. The man power costs alone are huge, but hidden. Until IT can put that cost into REAL numbers (average salary per hour x hours average per outage), the PHB will never realize the need for redundancy UNTIL it's too late.
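Putting that into REAL numbers is a one-liner. A minimal sketch: the salary-times-hours formula is the one above, while multiplying by headcount and outages per year is my own assumption about how to scale it, and every figure below is invented.

```python
def outage_labor_cost(headcount, avg_hourly_salary, hours_per_outage, outages_per_year=1):
    """Idle payroll burned while nobody can work.

    Counts salary only; lost sales, missed deadlines and recovery
    overtime would come on top of this.
    """
    return headcount * avg_hourly_salary * hours_per_outage * outages_per_year


# Hypothetical shop: 40 people at $30/hour, three 4-hour outages a year.
print(outage_labor_cost(40, 30.0, 4, outages_per_year=3))  # 14400.0
```

Set that figure next to the quote for a generator and the conversation with the PHB gets a lot shorter.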
UPSes were once a solution for graceful shutdowns; now they are not, as shutting systems down is not really an option at all. We've grown dependent upon the technology to be there ... always.
Agent K: A *person* is smart. People are dumb, stupid, panicky animals, and you know it.
What kind of masochists would work in datacenters? You get no notice at all when things go right, but you get reamed when things go wrong - even things outside of your power (no pun intended). I hope you all get paid well. :)
I worked on electronic equipment made by a major player that went into network TV stations. They closely measured 'seconds of dead air' caused by their equipment. Apparently the networks thought that was important to not annoy their viewers - since many viewers have their fingers poised on remote control channel changing buttons.
I have never once read about a data center catastrophe where a fancy shmancy failover system --or ANY kind of failover system for that matter-- actually worked. Does anyone test or maintain this stuff?
Anyone got any success stories to share?
Nothing is inexplicable; only unexplained -Tom Baker, Doctor Who
My spelling is perfect but I still get my grammer wrong sometimes.
Perhaps your grammar got run over by a reindeer?
"Slashdot - News and Chat Sites Deviant". (Click "homepage" link above for details).
If anyone actually thought that "Cloud Computing" wasn't just a synonym for "co-location in a datacenter", then they're a fucking fool.
If the TV station is out due to a power failure, so are the sets in all the homes. It is time for the FM radio to be turned on. In Montreal, one station runs at 500 watts and covers around a 10km radius from its antenna.
Leslie Satenstein Montreal Quebec Canada
Everyone has a budget to work within. Even Amazon.
This stuff probably works 90% of the time. We lose utility power several times a year for one reason or another (usually some construction gaffe). It's a small DC in a downtown Chicago high-rise, so not a lot of the complexity of mega-facilities, but the basic pieces are the same. UPSs and/or other equipment functions as designed, and nobody notices. Who would read a blog post titled "UPS and generator not a waste of money"?
That said, so many external parties are involved and the complexity is so high that failures do happen. Our longest outage in this facility was caused by a plumber working on another floor for another tenant. He had all the credentials to get into the right spaces, and actually fixed what he was there to fix, but also somehow managed to interrupt the redundant chilled water supply on both sides of the building at the same time.