When the Power Goes Out At Google
1sockchuck writes "What happens when the power goes out in one of Google's mighty data centers? The company has issued an incident report on a Feb. 24 outage for Google App Engine, which went offline when an entire data center lost power. The post-mortem outlines what went wrong and why, lessons learned and steps taken, which include additional training and documentation for staff and new datastore configurations for App Engine. Google is earning strong reviews for its openness, which is being hailed as an excellent model for industry outage reports. At the other end of the spectrum is Australian host Datacom, where executives are denying that a Melbourne data center experienced water damage during weekend flooding, forcing tech media to document the outage via photos, user stories and emails from the NOC."
Aren't there any people in the data center to tell them that yes, there has been a power outage, such-and-such machines are affected, etc.? Sounds like all they have is remote monitoring, and if something happens then someone has to drive to the location to see what's wrong.
My lifestream was interrupted and I didn't even notice! (see http://tech.slashdot.org/story/10/03/08/0024205/Time-To-Take-the-Internet-Seriously for reference)
I pity EvilMuppet. Guy is a tool. There are contractual agreements that are in place to prevent pictures, aka the "rules" but when the data center blatantly LIES they are breaking the trust and violating the agreement. Case Law exists where contracts can be violated when one accuses the other of violating said contract.
That's what happened. The data center was lying about what happened to avoid responsibility for the equipment it was being paid to host. Pictures were taken and are being used to prove the company did violate the trust of the contract.
You can argue the semantics and legality of it but if this goes to court the pictures will be admissible and the data center will lose.
Obviously if the power goes out, and the service goes offline, then it WASN'T a cloud. If it's a cloud, it can't go down. If it goes down, it wasn't a cloud.
What's there to get?
Glen Beck, is that you!?
...but it was stored on Google Docs.
Even a cloud isn't effective if all the nodes go down, it's not magic.
The fact there was one.
A new option for higher availability using synchronous replication for reads and writes, at the cost of significantly higher latency
Anyone know some numbers around what "significantly higher latency" means? The current performance looks to be about 200ms on average. Assuming this higher availability model doesn't commit a DB transaction until it's written to two separate datacenters, is this around 300 - 400ms for each put to the datastore?
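A back-of-the-envelope sketch of that guess, in Python. It assumes the put is not acknowledged until a second datacenter confirms the write, and that the remote write starts only after the local one finishes, so the inter-datacenter round trip adds directly to the roughly 200ms quoted above. The round-trip values are hypothetical, and nothing in the post-mortem confirms that the replication actually works this way.

```python
def estimated_put_latency_ms(local_put_ms, inter_dc_rtt_ms):
    """Estimate put latency when a second datacenter must confirm the write.

    Assumption (not from the post-mortem): the remote write is sequential,
    so its round trip is simply added to the local put latency.
    """
    return local_put_ms + inter_dc_rtt_ms


if __name__ == "__main__":
    # 200 ms is the average quoted in the comment; the RTTs are made-up inputs.
    for rtt_ms in (50, 100, 200):
        print(rtt_ms, estimated_put_latency_ms(200, rtt_ms))
```

Under those assumptions, a 300 - 400ms put corresponds to a 100 - 200ms round trip between sites. If the local and remote writes were issued in parallel, the estimate would be closer to the larger of the two figures than to their sum.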
Want to improve your Karma? Instead of "Post Anonymously", try the "Post Humously" option.
App Engine must be Google's absolutely most poorly run project. It has been suffering from outages almost weekly (the status page doesn't tell the whole truth unfortunately), unexplainable performance degradations, data corruption (!!!), stale indexes and random weirdness for as long as it has been run. I am one of those who tried for a really long time to make it work, but had to give up despite it being Google and despite all the really cool technology in it. I pity the fool who pays money for that.
The engineers who work with it are really helpful and approachable both on mailing lists and irc, and the documentation is excellent. But it doesn't help when the infrastructure around it is so flaky.
Dude RTFA (I know, I know, shame on me). The backup generators kicked in, but 25% of the machines in data center did not receive power before crashing.
This should be standard practice... It's like the good bits of ISO9001 with a bit more openness. When done right, ISO9001 is a good model to follow.
i don't run a data center, but manage systems that rely on the data center 18 hrs/day 6 days/week. we pass upwards of $300m through my systems. I've yet to get a satisfactory answer as to exactly what would happen if - say - a water line breaks and floods all the electrical (including the dual redundant UPS systems) in the data center.
Whoosh.
OMFG! There's swinging at an outside pitch and there's try to hit one that was thrown in the fuckin' stands!!
I scream. You scream. I assume that means we're both acquainted with the problem. We proceed.
How did I end up in this article? Ah!!!
Epic fail.
Any data center worth its weight in dirt must have UPS devices sufficient to power all servers, all network and infrastructure equipment, and the HVAC systems for at least 2 full hours on batteries, in case the backup generators have difficulty getting started up and online.
Any data center without both adequate battery-UPS systems plus diesel (or natural gas or propane powered) generators is a rinky-dink, mickey-mouse amateur operation.
Sounds more like fog to me.
Davo -- Free speech, free software, AND free beer.
...a fairy dies.
Of COURSE there are people onsite. Most likely they have anywhere from a dozen to a hundred people onsite. But what's that going to do for you in the case of a large-scale problem?
The otherwise top rated 365 Main facility in San Francisco went down a few years ago. They had all the shizz, multipoint redundant power, multiple data feeds, earthquake-resistant building, the works. Yet, their equipment wasn't well equipped to handle what actually took them down - a recurring brown-out. It confused their equipment, which failed to "see" the situation as one requiring emergency power, causing the whole building to go dark.
So there you are, with perhaps 25 staff in a 4-story building with tens of thousands of servers, the power is out, nobody can figure out why, and the phone lines are so loaded they're worthless. Even when the power comes back on, it's not like you are going to get "hot hands" in anything less than a week!
Hey, even with all the best planning, disasters like this DO happen! I had to spend 2 wracking days driving to S.F. (several hours drive) to witness a disaster zone. HUNDREDS of techs just like myself carefully nursing their servers back to health, running disk checks, talking in tense tones on cell phones, etc.
But what pissed me off (and why I don't host with them anymore) was the overly terse statement that was obviously carefully reviewed to make it damned hard to sue them. Was I ever going to sue them? Probably not, maybe just ask for a break on that month's hosting or something. I mean, I just want the damned stuff to work, and I appreciate that even in the best of situations, things *can* go wrong.
So now I host with Herakles data center, which is just as nice as the S.F. facility, except that it's closer, and it's even noticeably cheaper. Redundant power, redundant network feeds, just like 365 Main. (Better: they had redundancy all the way into my cage; 365 Main just had redundancy to the cage's main power feed.)
And, after a year or two of hosting with Herakles, they had a "brown-out" situation, where one of their main Cisco routers went partially dark, working well enough that their redundant router didn't kick in right away, leaving some routes up and others down while they tried to figure out what was going on.
When all was said and done, they simply sent out a statement of "Here's what happened, it violates some of your TOS agreements, and here's a claim form". It was so nice, and so open, that out of sheer goodwill, I didn't bother to fill out a claim form, and can't praise them highly enough!
I have no problem with your religion until you decide it's reason to deprive others of the truth.
Try hiring some staff with telco experience instead of kids with perfect GPAs from Stanford, and design the fraking thing better!
I think it would do them good, considering the recent downtime with Assassin's Creed 2. Has anyone seen any info on that outage?
lol no UPS = fail
I support the Slashcott and will not be reading or commenting from 2/10/14 to 2/17/14. Beta is steaming pile of dog shit
That's the downside: any time you acknowledge a mistake, you end up looking like you make more of them than the idiots who have hundreds of mistakes that they don't disclose until they're caught.
So, rewatching season 2? The addiction is terrible. I recommend a dose of Flashforward ...
WTF am I doing replying to an AC at 5 A.M on a Friday night?
lol you = don't know how datacenters work
Do not fold, spindle or mutilate.
Actually no, Google doesn't use UPS systems if this is one of their designs that uses one small sealed lead acid battery per server.
this is my sig
Power failures are expected, what you can do is have plans for when they occur - batteries, generators, service migration to other sites, etc, etc
Too small scale, too complex, too much human intervention and too unreliable. Minimum of 2 datacenters on opposite sides of the world and you only send half the traffic to each. When the first vanishes the second picks up the traffic. The exact mechanism depends on the level of service you want to provide.
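A minimal Python sketch of the "half the traffic to each, survivor picks up the rest" idea. The site names and the `healthy` map are hypothetical. As the comment says, the real mechanism (DNS, anycast, load balancers) depends on the level of service you want, so this only illustrates the selection logic.

```python
import random


def pick_site(sites, healthy, rng=random):
    """Pick a site for one request, spreading load evenly over healthy sites.

    `sites` is a list of site names; `healthy` maps site name -> bool.
    Both are hypothetical inputs, e.g. fed by external health checks.
    """
    candidates = [site for site in sites if healthy.get(site, False)]
    if not candidates:
        raise RuntimeError("no healthy site available")
    # With two healthy sites each gets about half the traffic;
    # if one vanishes, the other receives all of it.
    return rng.choice(candidates)


if __name__ == "__main__":
    sites = ["dc-a", "dc-b"]  # hypothetical names
    print(pick_site(sites, {"dc-a": True, "dc-b": True}))
    print(pick_site(sites, {"dc-a": False, "dc-b": True}))
```

Note that running each site at half load only works if either site alone has the capacity for all of the traffic.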
Don't have all your shit in one data center, maybe? I'd have thought that one would be pretty fundamental. Of course, knowing Google they're going to decide that what they really need is power generation right on site, then they'll just pop off and invent nuclear fusion before lunch.
I'm trying to teach myself to set people on fire with my mind... Is it hot in here?
I have to say, I think ElectricTurtle is right. If the generators came online as they're claiming, how could it be that 25% of the load dropped during transfer?? There's more to this story than is being told, and instead, they're focusing on how they came back online rather than why they went offline in the first place. I'd be willing to bet you that heads are rolling behind closed doors. If there were properly functioning UPSs in the building (either the large ones or the server-mounted batteries Google sometimes likes), then there shouldn't have been any outage on the transfer to generator.
I've heard a few rumors that they're re-thinking this strategy. I'm betting this event might keep those conversations going.
Did you ever actually see a big flood? Freaking awesome power, like a fleet of bulldozers. Smashes stuff, rips houses off foundations, knocks huge trees over, will tumble multiple ton boulders ahead of it, etc. Just depends on how big the flood is. We had one late last year here, six inches of rain in a couple of hours, just tore stuff up all over. The "building" that can withstand a flood of significant size exists, it is called a submarine. Most buildings of the normal kind just aren't designed to deal with anything that destructive. Some can resist minor floods, but not too many.
We decided to move three of our divisions into one facility; those included two business-facing units and the I.T. division.
I was charged with laying out the design for data, telecom and electrical for the project. Also had engineering of our little NOC.
Nice setup - redundant power in the I.T. division, nice big APC UPS for the entire room, had its own 480V power drop, dual HVAC units, a natural gas fired generator. It's nice to have the money to do this.
Since we were a state agency we had to use state DNS services. And one day the city had a massive power outage. We were up and running happy as a clam but we found the Achilles heel in all our plans. Without DNS we couldn't get in or out. I had floated the idea of maintaining our own DNS server but nobody wanted to hear that. We had the decent network connection, and the redundant power (Yes, we even placed a UPS/Generator backed up outlet in the MDF for Cox's Marconi router) so why the hell not replicate the state DNS services?
Let that be a lesson. We tried to plan for all contingencies and we completely missed our dependence on an outside state agency. Of course since a river runs right behind we also raised the NOC floor by about a foot.
Depends on who's doing the shopping.
If you're looking for a serious hosting facility, then incident response should be one of the things you look at. If they haven't had an incident*, then you have no idea how they'll handle it when (not if!) one happens. They can hand you all the documentation in the world, but that can't speak to execution.
* that they've admitted to
You are also assuming that all datacenters have and need UPSes. This is simply not the case. More and more facilities are going to flywheel generators, as maintaining batteries for transfer time between mains and generator power is insanely expensive in floor space, labor, and replacement costs. Nothing in any of the linked content says what kind of generators they have, or anything about a UPS. Based on the simple fact that Google can afford, and makes it a priority, to hire top-notch talent and build things the right way, are you really telling me that you believe you and ElectricTurtle are smarter than the combined brainpower set loose by Google for building and maintaining this facility?
That's a nice try at another troll.
You demonstrated that you don't know enough about modern data center design based on your 4 word comment. No further information was necessary.
Plenty of people who have worked in data centers wouldn't know this, so the fact that you may have worked in one is a moot point.
See the reply to the guy who also doesn't know this stuff that was trying to stick up for you. http://slashdot.org/comments.pl?sid=1575066&cid=31403320
However, all of this is moot, since even if they had a flywheel setup as you're speculating, it still doesn't explain why 25% of the floor went down. If the equipment was installed, maintained and loaded properly, they should've been able to get to the generators with no problem.
are you really telling me that you believe you and ElectricTurtle are smarter than the combined brainpower set loose by Google for building and maintaining this facility?
No, I'm telling you that I manage a data center, and I know first hand how they work (or in this case, should work). I fail to see an adequate explanation of how this was unavoidable.
Obviously if the power goes out, and the service goes offline, then it WASN'T a cloud. If it's a cloud, it can't go down. If it goes down, it wasn't a cloud.
The cloud got too big and it rained.
$Political_Pundit_I_Disagree_With, is that you!?
Fixed that for you
No, I think I got it right this time.
Posts not to be taken literally. Almost everything is sarcasm.
You would do better to see his reply to your reply. He's already putting you in your place so well that any similar effort by me would be redundant.
The argument is simply that going without adequate battery power to handle transfer switching is asinine, and you seem to think that's normal data-center behavior. You would be the only one who thinks that would be proper redundancy; all the data centers I'm in have battery-backed transformers to handle the load while they switch to alternate power.
The most expensive data center I'm in even goes so far as to have an hour of battery time to handle generator failures during a power outage.
ElectricTurtle and Critical Facilities both have comments that mirror my own experience and echo every data center best practice. People without this power are asking for problems. Google tried something against best practice, and although none of us individually has more brain power than Google, collectively the likes of IBM, Microsoft, and every other large corporation with many large data centers have come to this conclusion. Many (and I'm looking at those lovely Texas data centers) keep trying to buck the best practice and, surprise surprise, it bites them in the ass.
That said, Google has a great track record, so I'm not going to call any of their practices into question. It sounds like the event was mishandled, and that's why there was a service outage. Sometimes events are mishandled due to unforeseen circumstances, or because someone didn't have their morning cup of coffee. That's why companies do post-mortems, and the fact that Google was so open about it is a good sign that the same situation won't lead to another outage, which is what matters given their stellar uptime.
That's because they are focussing on what went wrong. Power losses, including ones that take down the whole data center, are accepted risks and part of the reason they have redundant data centers and failover procedures.
The failure wasn't that they had a partial loss at a datacenter. The failure was that the impact of that loss wasn't mitigated properly by the systems that were supposed to be in place to do that.
I read the post-mortem and I think they completely missed the mark. Power failed to some machines. They only noticed because "...traffic has problems..." They should have been monitoring the power to detect this situation. They didn't say whether they have the data center power supply on a UPS or not. If it was, it was dying and no one noticed. If they had been monitoring the power they might have avoided the whole mess.
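The kind of power monitoring meant here could look roughly like the Python sketch below. It assumes each machine (or its PDU) reports a simple "power OK" flag. The report format, the machine names and the 5% threshold are all hypothetical, not anything described in the post-mortem.

```python
def power_alert(machine_power_ok, threshold=0.05):
    """Return (alert, fraction_down) for a mapping of machine name -> bool.

    Alerts when the fraction of machines reporting a power loss reaches
    `threshold`. An empty report is treated as an alert, on the assumption
    that hearing from no machines at all is itself a bad sign.
    """
    if not machine_power_ok:
        return True, 1.0
    down = sum(1 for ok in machine_power_ok.values() if not ok)
    fraction_down = down / len(machine_power_ok)
    return fraction_down >= threshold, fraction_down


if __name__ == "__main__":
    # Hypothetical report: 100 machines, 25 of them without power.
    report = {f"machine-{i}": i >= 25 for i in range(100)}
    print(power_alert(report))
```

The point is only that a partial power loss shows up directly in a signal like this, instead of being inferred later from traffic problems.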
What I inferred was that the real problem wasn't that they failed - complete failure they would have recovered from. Unfortunately, they did not understand what their state was when only some of them failed, and did not figure out how to recover.
Repeat to yourself: "All is well, All is well, All is well" and everything will be exactly like you wish it to be.
Note originators of response model are not responsible for anyone being taken away to a psychiatric facility because of a belief response model user is psychotic
-------- This space intentionally left blank --------
Power losses, including ones that take down the whole data center, are accepted risks and part of the reason they have redundant data centers and failover procedures. The failure wasn't that they had a partial loss at a datacenter. The failure was that the impact of that loss wasn't mitigated properly by the systems that were supposed to be in place to do that.
I must respectfully disagree. Power losses that take down the whole data center are definitely NOT accepted risks. The entire reasoning for spending millions upon millions of dollars to have UPS systems, Static Switches, Automatic Throwover Switches, Diesel Generators with thousands of gallons of fuel, etc isn't because you think downtime is acceptable, it's because downtime is not an option.
We almost agree though. I do agree that the failure was in improper mitigation of the risk as opposed to mitigation of the outage once it happened. There is no reason given that explains why 25% of their floor(s) went down, and in a properly run data center (at least a Tier 3 or higher), there is no reason any of the Critical Load should ever go down.
I was under the impression that Google's servers all had small individual batteries in each chassis to provide power during generator spin-up in lieu of full-on UPSs. Maybe some of them didn't last as long as they were supposed to? Or maybe the generator took longer to warm up than it should have?
Why, no, I haven't meta-moderated lately. Thanks for asking!
They'll come out with it when Apple releases iFusion...
Yeah, I've seen those reports, and I am curious as to whether or not they have the server-mounted batteries at all of the data centers, or just some of them. You and I are thinking right along the same lines. If indeed they did have the server-mounted batteries, I'd be curious to know why they didn't hold all the load. It seems to me that either the server-mounted battery strategy is less reliable than traditional UPS Systems, or as you suggested, perhaps something happened with the generators (those generators should have spun up and been carrying load within 15 seconds). Either way, something didn't work as designed, and TFA doesn't touch on any of it.
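To make the two hypotheses concrete, here is a small Python sketch of the arithmetic. It assumes each server rides through the transfer on its own battery and drops if that battery runs out before the generators carry the load. All runtimes below are invented for illustration, and this is not a claim about what actually happened at Google.

```python
def fraction_dropped(battery_runtimes_s, generator_ready_s):
    """Fraction of servers whose battery runs out before generators take over.

    `battery_runtimes_s` is a list of per-server battery runtimes in seconds
    (hypothetical values); `generator_ready_s` is the time until the
    generators are carrying the load.
    """
    if not battery_runtimes_s:
        return 0.0
    dropped = sum(1 for runtime in battery_runtimes_s if runtime < generator_ready_s)
    return dropped / len(battery_runtimes_s)


if __name__ == "__main__":
    # Hypothetical fleet: three healthy batteries and one degraded one.
    runtimes = [60, 60, 60, 10]
    print(fraction_dropped(runtimes, 15))  # generators ready in 15 seconds
    print(fraction_dropped(runtimes, 90))  # generators slow to start
```

Either weak batteries or a slow generator start produces a partial drop in this model, which is why the two explanations can't be told apart without more detail than TFA gives.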
Yep. I mean, as it's been stated in other comments, I think Google's way of hedging its bets is to have redundant data centers, so I think they correctly focused on the procedural issues.
However... as a current programmer and former IT guy, I'd like to know more about what caused the failures in the first place.