Car Hits Utility Pole, Takes Out EC2 Datacenter
1sockchuck writes "An Amazon cloud computing data center lost power Tuesday when a vehicle struck a nearby utility pole. When utility power was lost, a transfer switch in the data center failed to properly manage the shift to backup power. Amazon said a "small number" of EC2 customers lost service for about an hour, but the downtime followed three power outages last week at data centers supporting EC2 customers. Tuesday's incident is reminiscent of a 2007 outage at a Dallas data center when a truck crash took out a power transformer."
And, as a result, Farmville/Mafiawars updates on Facebook temporarily stop.
Nothing of value was lost.
Kriston
"The cloud" doesn't solve everything. Film at 11.
Amazon for not load-testing their emergency backup power on a regular basis, not having more than one connection to the power grid, and the power grid for not having redundancies. Our aging power grid is really beginning to show its age on so many levels that this is going to become a lot more common over the coming years.
"There might be intelligent beings created by God in outer space even if there are none here on Earth." -Anonymous
Utility poles clearly need countermeasures. Hellfire missiles and such. That'll teach 'em to mess with a poor defenseless pole.
Rhymes that keep their secrets will unfold behind the clouds.There upon the rainbow is the answer to a neverending story
Seriously, Amazon screwed up in a fairly major way with this.
What's more upsetting is this: If Amazon doesn't have working disaster recovery, what do other websites/companies have?
Answer: Nothing. You'd be surprised how many US small-to-medium sized businesses are one fire/tornado/earthquake/hurricane away from bankruptcy. I'd bet it's over 80% of them.
The classic in my last job was when we had a security contractor in on the weekend hooking something up and he looped off a hot breaker in the computer room, slipped, and shorted the white phase to ground. This blew the 100A fuses both before and after the UPS and somehow caused the generator set to fault so that while we had power from the batteries, that was all we had.
It also blew the power supply on an alphaserver and put a nice burn mark in the breaker panel. So the UPS guy comes out and he doesn't have two of the right sort of fuse. Fortunately 100A fuses are just strips of steel with two holes drilled in them and he had a file, and a drill, etc. So we got going in the end.
http://michaelsmith.id.au
This is why I long ago resolved to never, ever, ever correct someone else's grammar on slashdot. The risk in inadvertently failing to grammar is unacceptable.
If I have seen further it is by stealing the Intellectual Property of giants.
I expect this is just a scaled up version of the problems I deal with every day. And I'm sure I'm not the only one. Users have grown so dependent on system services and management has grown so apart from the trenches that completely unreasonable expectations are the norm. Where I work for instance it's almost impossible to even *test* backup power and failover mechanisms and procedures because users consider even minor outages in the middle of the night unacceptable and managers either don't have the clout or don't understand the problem well enough to put limits to such expectations. As a result often times the only tests such systems get happen during real emergencies, when they are actually needed. I don't know how, but I feel we should start educating our users and managers better, not to mention being realistic about risks and expectations.
Stop building those things so fucking close to the roads, maybe?
What about your power supply? Is that not allowed to go along a road? I am all for underground power BTW but I know that if you operate a digger and you want to find the owner of a cable the easiest way is to break it and wait for the complaints.
http://michaelsmith.id.au
The DC that my company colos a few racks in had this same thing happen about a year ago (not a car crash, just a transformer blew out). But the transfer switch failed to switch to backup power, and the DC lost power for 3 hours.
What is up with these transfer switches? Do the DCs just not test them? Or is it the sudden loss of power that freaks them out vs a controlled "ok we're cutting to backup power now" that would occur during a test? Someone with more knowledge of DC power systems might enlighten me...
It's a good thing that oil rigs are better managed than data centers. Who knows what might happen if one of them ever had a problem like this?
Why couldn't they just get power from the cloud?
Your hair look like poop, Bob! - Wanker.
All a fuse is is a piece of metal that will melt fairly quickly when a given amount of current is passed through it. Idea being that it heats up and melts before the wires can. So, the bigger the current, the more robust the metal connecting it. A 100A fuse is usually a fairly large strip of steel.
Now I'll admit that just grabbing an approximate size of steel and placing it in as the GP did isn't going to yield a nice precise fuse. It may have been too high a current. However, it'd work for getting things running again and probably provide a modicum of protection in the event of a short.
Often, mods will give a funny post "insightful" instead of "funny" because it gives the user positive karma (whereas funny does not affect karma). Not a use intended by CmdrTaco, I'd imagine, but it's a common practice.
What about your power supply? Is that not allowed to go along a road? I am all for underground power BTW but I know that if you operate a digger and you want to find the owner of a cable the easiest way is to break it and wait for the complaints.
That's also the fastest way to get rescued off a desert island or out in the woods, and why you should always carry a piece of fiber in your pocket. Should you get stranded, you simply bury the fiber, and some asshole with a backhoe will be along in about five minutes to cut it. Ask him to rescue you.
John
Redundancy costs money. If it costs more than downtime, you don't get it.
Sent from my PDP-11
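A rough sketch of that trade-off, for anyone who wants to put numbers on it. Everything here is hypothetical: the function names, the outage rate, and the dollar figures are made up for illustration, and it assumes the redundancy prevents every outage it is bought for, which the story above shows is optimistic.

```python
def expected_downtime_cost(outages_per_year, hours_per_outage, cost_per_hour):
    """Expected yearly loss from outages, in the same currency as cost_per_hour."""
    return outages_per_year * hours_per_outage * cost_per_hour


def redundancy_pays_off(redundancy_cost_per_year, outages_per_year,
                        hours_per_outage, cost_per_hour):
    """True when redundancy costs less per year than the downtime it is
    assumed (hypothetically) to prevent entirely."""
    expected_loss = expected_downtime_cost(
        outages_per_year, hours_per_outage, cost_per_hour)
    return redundancy_cost_per_year < expected_loss


# Made-up figures: one outage every two years, one hour each.
small_site = redundancy_pays_off(50_000, 0.5, 1, 2_000)
large_site = redundancy_pays_off(50_000, 0.5, 1, 500_000)
```

With those invented figures the small site does not get its redundancy and the large one does, which is the whole point of the comment.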
For years, I co-located at the top-rated 365 Main data center in San Francisco, CA until they had a power failure a few years ago. Despite having 5x redundant power that was regularly tested, it apparently wasn't tested against a *brown out*. So when Pacific Gas and Electric had a brownout, it failed to trigger 2 of the 5 redundant generators. Unfortunately, the system was designed to tolerate the failure of only *one* of the redundant generators, not two.
So power was in a brownout condition, the voltage dropped from the usual 120 volts or so down to 90. Many power supplies have brownout detectors and will shut off. Many did, until the total system load dropped to the point where normal power was restored. All of this happened within a few seconds, and the brownout was fixed in just a few minutes. But at the end of it all, there was perhaps 20% of all the systems in the building shut down. The "24x7 hot hands" were beyond swamped. Techies all around the San Francisco area were pulled from whatever they were doing to converge on downtown SF. And me, 4 hours drive away, managed to restore our public-facing services on the one server (of four) I had that survived the voltage spikes before driving in. (Alas, my servers had the "higher end" power supplies with brownout detection)
And so it was a long chain of almost success of well-tested, high-quality equipment that failed all in sequence because real life didn't happen to behave like the frequently performed tests did.
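The generator arithmetic above can be sketched in a few lines. This is only an illustration under an assumption: that "any one could fail" means 4 of the 5 generators were needed to carry the load (an N+1 design). The function names are hypothetical.

```python
def tolerated_failures(total_generators, required_generators):
    """How many generators can fail before the load can no longer be carried."""
    return total_generators - required_generators


def carries_load(total_generators, required_generators, failed_generators):
    """True if the generators still running are enough for the load."""
    running = total_generators - failed_generators
    return running >= required_generators


# Assumed N+1 layout: 5 installed, 4 required.
TOTAL, REQUIRED = 5, 4
one_failure_ok = carries_load(TOTAL, REQUIRED, failed_generators=1)
two_failures_ok = carries_load(TOTAL, REQUIRED, failed_generators=2)
```

Under that assumption one failed generator is survivable and two are not, no matter how often the single-failure case was tested.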
When I did finally arrive, the normally quiet, meticulously clean facility was a shambles. Bits of network cable, boxes of freshly purchased computer equipment, pizza boxes, and other refuse were to be found in every corner. The aisles were crowded with techies performing disk checks and chattering tersely on cell phones. It was other-worldly.
All of my systems came up normally; simply pushing the power switch and letting the fsck run did the trick, we were fully back up and all tests performed (and the system configuration returned to normal) in about an hour.
Upon reflection, I realized that even though I had some down time, I was really in a pretty good position:
1) I had backup hosting elsewhere, with a backup from the previous night. I could have switched over, but decided not to because we had current data on one system and we figured it was better not to have anybody lose any data than to have everybody lose the morning's work.
2) I had good quality equipment; the fact that none of my equipment was damaged from the event may have been partly due to the brownout detection in the power supplies of my servers.
3) At no point did I have any less than two backups off site in two different locations, so I had multiple, recent data snapshots off site. As long as the daisy chains of failure can be, it would be freakishly rare to have all of these points go down at once.
4) Even with 75% of my hosting capacity taken offline, we were able to maintain uptime throughout all this because our configuration has full redundancy within our cluster - everything is stored in at least 2 places onsite.
Moral of the story? Never, EVER have all your eggs in one basket.
I have no problem with your religion until you decide it's reason to deprive others of the truth.
Karma and Murphys law, a deadly combination.
"It's ok, I'm completely secure as long as my iron is off"
Funny thing, I thought "cloud" computing meant that you're placed into an automatically redundant network of machines, so if there's a site-wide outage it doesn't interfere with operations.
Now I see that Amazon's definition of "cloud" simply means "hosting provider". I guess in this case it means hosting provider with no DC power room, N+1 generators and regular testing to ensure the fallback systems actually work.
That kind of reminds me of a company (who will remain nameless) who did tape backups, but never verified their tapes. When the data was lost, a good percentage of the tapes didn't work.
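A minimal sketch of the verification step that company skipped, assuming ordinary files stand in for tapes. The function names are hypothetical, and comparing checksums only shows the copy reads back intact today; it is not a substitute for an actual test restore.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large backups do not have to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_and_verify(source, backup):
    """Copy source to backup, then read both back and compare checksums."""
    shutil.copyfile(source, backup)
    return sha256_of(source) == sha256_of(backup)
```

Keeping the checksum alongside the backup also lets you re-check old copies later, when the original is gone.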
I worked near a good datacenter. Out on smoke breaks late at night, you could hear them test fire their generators once a week. I was in there helping someone one night during a thunderstorm that sounded like it would rip the roof off, when I heard the generators spin up. The inside of the datacenter didn't miss a beat. When I left an hour later, I saw that there was no power (street lights, traffic lights, and normally illuminated buildings) for about 1/2 mile around it. The power company had it fixed by morning though. When I came back in the morning, everything was fine. Well, except my workstation in the office that didn't have redundant power.
Serious? Seriousness is well above my pay grade.
Doesn't EC2 let you request hosts in any of several particular datacentres (which they call "availability zones") just so you can plan around such location-specific catastrophes? No matter how good the redundant systems, some day a meteor will hit one datacentre and you'll be S.O.L. no matter what if you put all your proverbial eggs in that basket.
Only a fool cares about a single-datacentre outage. This is why it's called "*distributed*-systems engineering", folks.
Cherish. Live. Dream.
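A minimal sketch of planning around a single-datacentre outage, in plain Python with no AWS calls. The zone names, instance names, and function names are all hypothetical; it only shows the idea of spreading hosts across zones and checking that losing any one zone still leaves enough of them running.

```python
from collections import Counter
from itertools import cycle


def spread_across_zones(instance_ids, zones):
    """Assign instances to zones round-robin so no zone holds far more than its share."""
    return {inst: zone for inst, zone in zip(instance_ids, cycle(zones))}


def survives_any_single_zone_loss(placement, min_instances):
    """True if at least min_instances remain after the worst single-zone outage."""
    per_zone = Counter(placement.values())
    worst_loss = max(per_zone.values(), default=0)
    return len(placement) - worst_loss >= min_instances


# Hypothetical names: four web hosts over two zones.
placement = spread_across_zones(
    ["web-1", "web-2", "web-3", "web-4"], ["zone-a", "zone-b"])
all_in_one = spread_across_zones(["web-1", "web-2"], ["zone-a"])
```

Here `placement` keeps two hosts alive through the loss of either zone, while `all_in_one` keeps none, which is the eggs-in-one-basket case.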
What about your power supply? Is that not allowed to go along a road? I am all for underground power BTW but I know that if you operate a digger and you want to find the owner of a cable the easiest way is to break it and wait for the complaints.
That's also the fastest way to get rescued off a desert island or out in the woods, and why you should always carry a piece of fiber in your pocket. Should you get stranded, you simply bury the fiber, and some asshole with a backhoe will be along in about five minutes to cut it. Ask him to rescue you.
In that same job we had a bunch of CCTV cameras on St Kilda road in Melbourne right outside the arts center. It's a mess of tram gear and traffic signals, and there is a lot of fibre under the road.
Ever stuck your fork into a plate of spaghetti, then spun it around? This guy had to bore a hole straight down right in the middle of the road. There is a number where you can "dial before you dig" but he omitted that. He wound up with ~50 metres of fibre wrapped around his borer. Quite a mess.
http://michaelsmith.id.au
When was the last time anyone heard of a TV Network going dark for an hour? A Hospital Emergency Room? IT guys always run around like self-important Star Trek Blue Shirts, but they never seem to take the proper steps to ensure -- really ensure -- their uptime.
I'm sure there are exceptions, but it just seems that they have a ways to go, compared to the real "critical systems" industries to which they are so fond of comparing themselves. Is it money, arrogance, or ignorance?
When was the last time anyone heard of a TV Network going dark for an hour?
Hmm, let me think. How about yesterday?
It's not a matter of I.T. guys not taking the proper steps.
It's a matter of price versus "what if". YOU try to convince a pointy haired boss to spend thousands and thousands of extra dollars on something that "may" happen.
It's often hard enough to convince higher ups to just upgrade old infrastructures that are maxed out on resources, even if you have proof of issues or near failures. The ONLY time they will happily spend money on upgrades and making your infrastructure more robust is after there has been a critical failure and they actually see their bottom line being hurt. Even then, if you don't get the approval and dollars fast enough, you run the risk of hearing, "What are the chances THAT will happen again?"
More often than not, infrastructure is patches built on patches, one I.T. guy coming in trying to "correct" mistakes of his/her predecessor (who they then realize was working with an underwhelming budget), THEN realizing that it's such a mish mash of bubblegum and duct tape, that any serious fixes would require serious downtime with a complete overhaul. Otherwise you run the risk of the whole thing imploding like a blackhole.
How many I.T. guys seriously have the guts to walk up to their boss after being on the job for only a week and say, "I need 50k and your network will be going up and down for two weeks as I rebuild and fix it all"?
I tried it. I, however, had the ammunition that my company went from 3 people to 40 people in 18 months with another 20 predicted in the next 6 months and that the two box servers were maxed out AND that we were renovating a newly purchased building so we could plan everything from cabling, to telephony to security and future planning for 250+ people.
It also didn't hurt that my boss knows that I.T. is an investment when done right and NOT an expense. Even then with everything on my side it still took 3 months of planning, proving, mapping, designing and quoting from vendor after vendor before approval went through.
Good.. Bad.. I'm the guy with the gun.
Usually, TV stations (that get fined for being off the air for not using their spectrum) and hospitals (which, you know, you can die at if the power goes out depending on your circumstances) have an easier time getting money for redundancy because the bad results are more expensive than if LOLcats is down.
TLDR ; lol?
Why is it so hard to only have politicians for a few years, then have them go away?
When was the last time anyone heard of a TV Network going dark for an hour? A Hospital Emergency Room? The people who set the budget for IT guys always run around like self-important Star Trek Blue Shirts, but they never seem to set the proper priorities to ensure -- really ensure -- their uptime.
There. Fixed that for you.
The reason you rarely see an ER go down for want of power is that, knowing that lives depend on it, the people responsible for providing for it are willing to spend what it takes, in capital investment and in manpower for ongoing maintenance and operation so that an acceptable level of availability is guaranteed. Amazon and (last year) Rackspace, not so much.
Here's a wacky thing: the plural form of someone else is actually someone's else.
Ah, I can see the reason for your disclaimer about not having good grammar. "Someone else's" isn't plural, it's possessive! Still an interesting fact though.. does it mean the possessive form of someone else is someone's else? Looks pretty wrong to me...
Yes, I certainly meant "possessive," not "plural," and I don't claim any expertise at all with language. (I'm a math professor in part because I was always so bad at writing.)
Anyway, an English professor whom I asked about the puzzle explained to me that the correct, although archaic form is indeed someone's else. I pointed out to her that many on-line references use someone else's as the possessive form, and she explained that many on-line references are written by individuals who are catering toward the "business writer."
Evidently, the business audience isn't so much concerned with what is correct grammatically as opposed to what sounds correct because it is used most frequently. Hence, sites like dictionary.com will often list the most common usage even if it isn't technically correct.
For example, if you want to refer to the car belonging to the attorney general, it would be the attorney's general car not the attorney general's car. However, most readers would find the first form off-putting, so a business writer would prefer the second.
Of course, this leads to an endless digression as to grammar being a fixed set of rules to hold the language together as a standard or an amorphous description of common usage which must change with the times.
Well, I should probably stop commenting on this before I get too many more "offtopic" mods.