Slashdot Mirror


When the Power Goes Out At Google

1sockchuck writes "What happens when the power goes out in one of Google's mighty data centers? The company has issued an incident report on a Feb. 24 outage for Google App Engine, which went offline when an entire data center lost power. The post-mortem outlines what went wrong and why, lessons learned and steps taken, which include additional training and documentation for staff and new datastore configurations for App Engine. Google is earning strong reviews for its openness, which is being hailed as an excellent model for industry outage reports. At the other end of the spectrum is Australian host Datacom, where executives are denying that a Melbourne data center experienced water damage during weekend flooding, forcing tech media to document the outage via photos, user stories and emails from the NOC."

28 of 135 comments

  1. what about having people onsite? by alen · · Score: 2, Insightful

    aren't there any people in the data center to tell them that yes there has been a power outage, so and so machines are affected, etc? sounds like all they have is remote monitoring and if something happens then someone has to drive to the location to see what's wrong

    1. Re:what about having people onsite? by nedlohs · · Score: 3, Insightful

      Who cares?

      Power failures are expected, what you can do is have plans for when they occur - batteries, generators, service migration to other sites, etc, etc. Those plans (and the execution of them) are what they had problems with.

    2. Re:what about having people onsite? by hedwards · · Score: 2, Insightful

      My parents once lost power for several hours because a crow got fried in one of the transformers down the street. People around here lose power from time to time when a tree falls on a line. Unplanned power outages are going to happen. Even though line reliability is probably higher now than at any time in the past, it still happens and companies like Google that rely upon it being always there should have plans.

      This isn't just about keeping the people that use Google services informed, this is an admission that there's something to fix and that they're going to fix what they can. There isn't any particular reason why they need to disclose such plans beyond being a huge player and not wanting to scare away the numerous people that count on them for important work.

    3. Re:what about having people onsite? by vakuona · · Score: 3, Insightful

      Cheap doesn't mean not properly designed! Google doesn't do redundancy on a micro scale. For them it's pointless. In fact, from what I know, Google knows their hardware will fail, so they have written their software to handle hardware failures gracefully. When something like this happens, they write a report, and get someone to work out a fix so that the outage doesn't recur.

  2. Read the comments by RaigetheFury · · Score: 5, Insightful

    I pity EvilMuppet. Guy is a tool. There are contractual agreements that are in place to prevent pictures, aka the "rules" but when the data center blatantly LIES they are breaking the trust and violating the agreement. Case Law exists where contracts can be violated when one accuses the other of violating said contract.

    That's what happened. The data center was lying about what happened to avoid responsibility for the equipment it was being paid to host. Pictures were taken and are being used to prove the company did violate the trust of the contract.

    You can argue the semantics and legality of it but if this goes to court the pictures will be admissible and the data center will lose.

  3. Re:and what about openess during the incident? by theIsovist · · Score: 3, Funny

    Glenn Beck, is that you!?

  4. They had a perfect contingency plan for this case by juanjux · · Score: 5, Funny

    ...but it was stored on Google Docs.

  5. Significantly higher latency? by nacturation · · Score: 2, Interesting

    A new option for higher availability using synchronous replication for reads and writes, at the cost of significantly higher latency

    Anyone know some numbers around what "significantly higher latency" means? The current performance looks to be about 200ms on average. Assuming this higher availability model doesn't commit a DB transaction until it's written to two separate datacenters, is this around 300 - 400ms for each put to the datastore?
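
    A rough back-of-the-envelope sketch of the question above, in Python. The 200 ms figure is the one quoted in the comment; the inter-datacenter round-trip times are made-up illustrative values, and the model itself (local commit plus one round trip to a second datacenter) is an assumption, not anything Google has published.

```python
# Back-of-the-envelope model of a synchronously replicated write.
# ASSUMPTION: a put commits only after the remote datacenter acknowledges,
# so the added cost is roughly one inter-datacenter round trip.

CURRENT_PUT_MS = 200.0  # average put latency quoted in the comment above


def sync_put_latency_ms(local_ms: float, inter_dc_rtt_ms: float) -> float:
    """Estimated put latency when the write waits for one remote ack."""
    return local_ms + inter_dc_rtt_ms


if __name__ == "__main__":
    # Hypothetical round-trip times; real values depend on site distance.
    for rtt in (50.0, 100.0, 200.0):
        total = sync_put_latency_ms(CURRENT_PUT_MS, rtt)
        print(f"RTT {rtt:5.0f} ms -> estimated put {total:5.0f} ms")
```

    Under this simple model, the 300 - 400ms guess corresponds to a round trip of 100 - 200 ms between sites; any extra coordination rounds would push the number higher.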

    --
    Want to improve your Karma? Instead of "Post Anonymously", try the "Post Humously" option.
  6. App Engine down again? by bjourne · · Score: 2, Insightful

    App Engine must be Google's absolutely most poorly run project. It has been suffering from outages almost weekly (the status page doesn't tell the whole truth unfortunately), unexplainable performance degradations, data corruption (!!!), stale indexes and random weirdness for as long as it has been run. I am one of those who tried for a really long time to make it work, but had to give up despite it being Google and despite all the really cool technology in it. I pity the fool who pays money for that.

    The engineers who work with it are really helpful and approachable both on mailing lists and irc, and the documentation is excellent. But it doesn't help when the infrastructure around it is so flaky.

  7. Re:Don't they have by johnncyber · · Score: 5, Informative

    Dude RTFA (I know, I know, shame on me). The backup generators kicked in, but 25% of the machines in the data center did not receive power before crashing.

  8. the worst nightmare of data center peeps by filesiteguy · · Score: 3, Interesting

    i don't run a data center, but manage systems that rely on the data center 18 hrs/day 6 days/week. we pass upwards of $300m through my systems. I've yet to get a satisfactory answer as to exactly what would happen if - say - a water line breaks and floods all the electrical (including the dual redundant UPS systems) in the data center.

    1. Re:the worst nightmare of data center peeps by SmilingBoy · · Score: 2, Informative

      First, your servers will shut down ungracefully, and then, they will be destroyed with little chance of recovery. You will then have to rebuild your systems, and restore the data from the offsite backup. This will of course take time. If this is too much of a risk, you should run an alternate datacentre mirroring your primary databases that can go live within minutes.

  9. Huh? by SlappyBastard · · Score: 2, Informative

    How did I end up in this article? Ah!!!

    --
    I scream. You scream. I assume that means we're both acquainted with the problem. We proceed.
  10. Generators plus UPS FTMFW by Anonymous Coward · · Score: 2, Insightful

    Epic fail.

    Any data center worth its weight in dirt must have UPS devices sufficient to power all servers plus all network and infrastructure equipment, as well as the HVAC systems too, for a minimum of 2 full hours on batteries, in case the backup generators have difficulty in getting started up and online.

    Any data center without both adequate battery-UPS systems plus diesel (or natural gas or propane powered) generators is a rinky-dink, mickey-mouse amateur operation.

    1. Re:Generators plus UPS FTMFW by Tynin · · Score: 3, Insightful

      You are so cute. I know very little about UPS systems, but when I was working in a datacenter that housed 5000 servers we had a two story room that was twice the size of most houses (~2000 sq ft) with rows and rows of batteries. I was told that in the event of a power outage, we had 22 minutes of battery power before everything went out. The idea of having enough for 2 hours would have been an interesting setup considering how monstrously large this one already was. Besides, I'm unsure why you'd ever need more than that 22min since that is plenty of time for our on site staff to gracefully power down any of our major servers if the backup generator failed to kick in.
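
      To put a rough number on the 22-minute versus 2-hour comparison above, here is a small Python sketch. It assumes battery capacity and floor area scale linearly with runtime at constant load (real discharge curves do not), and it takes the ~2000 sq ft figure as the battery room's footprint, which is only one reading of the comment.

```python
# Rough scaling of a battery room with required runtime.
# ASSUMPTION: at constant load, battery capacity (and floor area) grows
# linearly with runtime; real battery discharge curves are not linear.

BASELINE_MINUTES = 22.0      # runtime quoted in the comment above
BASELINE_AREA_SQFT = 2000.0  # ASSUMPTION: read as the room's footprint


def scale_factor(target_minutes: float,
                 baseline_minutes: float = BASELINE_MINUTES) -> float:
    """How many times more battery capacity the target runtime needs."""
    if baseline_minutes <= 0:
        raise ValueError("baseline_minutes must be positive")
    return target_minutes / baseline_minutes


def scaled_area_sqft(target_minutes: float) -> float:
    """Floor area needed under the linear-scaling assumption."""
    return BASELINE_AREA_SQFT * scale_factor(target_minutes)


if __name__ == "__main__":
    factor = scale_factor(120.0)
    print(f"2 hours needs about {factor:.1f}x the batteries")
    print(f"roughly {scaled_area_sqft(120.0):,.0f} sq ft of battery room")
```

      Under these assumptions, a 2-hour requirement means a battery plant more than five times the size of the one described.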

    2. Re:Generators plus UPS FTMFW by Darth_brooks · · Score: 2, Funny

      Yeah, and when the guys at the Jesus Christ of Datacenters that you describe have to do something like, say, switch from generator to utility power manually, and the document that details that process is 18 months old and refers to electrical panels that don't exist anymore, you get what you had here. A failure of fail-over procedures. If the lowliest help desk / operator can't at least understand the documentation you've written, then you've failed.

      The only equipment failure listed is a "power failure." Granted, that can be as simple as "car hits a telephone pole and knocks out a chunk of the grid, leaving your office in the dark", which should be an easily survivable event. But how do you handle a failure like "50kva inline UPS shits the bed leaving nothing but a smoking chassis that no one wants to go anywhere near?" or "HVAC unit fails on christmas eve when only a skeleton staff is on duty and fills the raised floor with 8 inches of water, shorting everything within an inch of its life and making it impossible to bring any hosted services back online?"

      There's nothing like a little bit of "we had no idea these three or four unrelated circumstances could happen simultaneously" disaster porn to make you realize that A. Outage / DR / fail-over planning is more than just throwing money at stuff (UPS's, generators, redundant lines, etc) and B. No matter how good your plan is, it will never be 100% effective.

      --
      There are some people that if they don't know, you can't tell 'em.
    3. Re:Generators plus UPS FTMFW by Richard_at_work · · Score: 2, Informative

      In your rush to criticise 'Microsoft land', you must have overlooked his closing statement regarding 'if the backup generator failed to kick in'.

      You cannot have uptime without power. A mains outage coupled with an unexpected generator failure *will* result in downtime - your decision now is whether you wish your servers to be gracefully shut down, or just have the rug pulled from under them and hours or days of potential angst as a result. Which is it?

      And before you suggest larger UPSes for longer protection, consider why you have both a generator and a UPS in the first place - UPSes cost a lot, they cost a lot to buy, and they cost a lot to maintain, and then they cost a lot to replace after only a few years. A generator in comparison costs a lot less all round.
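
      The "graceful shutdown or rug pulled" decision above can be sketched as a simple rule. This is a minimal illustration in Python; the `PowerState` fields, the function name and the thresholds are all hypothetical, not taken from any real UPS management software.

```python
# Sketch of a "graceful shutdown or ride it out" decision for a site
# running on battery. All names and thresholds here are hypothetical.
from dataclasses import dataclass


@dataclass
class PowerState:
    mains_ok: bool               # utility feed is healthy
    generator_online: bool       # generator is carrying the load
    battery_minutes_left: float  # estimated UPS runtime remaining


def should_shut_down(state: PowerState, shutdown_minutes: float,
                     safety_margin_minutes: float = 2.0) -> bool:
    """True when a graceful shutdown must start now to finish on battery."""
    if state.mains_ok or state.generator_online:
        return False  # the load is not draining the batteries
    needed = shutdown_minutes + safety_margin_minutes
    return state.battery_minutes_left <= needed


if __name__ == "__main__":
    # Mains lost, generator failed to start, 22 minutes of battery.
    state = PowerState(mains_ok=False, generator_online=False,
                       battery_minutes_left=22.0)
    print(should_shut_down(state, shutdown_minutes=10.0))
```

      The point of the rule is that the UPS only has to cover the time needed to shut down cleanly, not the whole outage.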

  11. When the Power Goes Out At Google... by binaryseraph · · Score: 3, Informative

    ...a fairy dies.

  12. Useless for large scale problems by mcrbids · · Score: 5, Interesting

    Of COURSE there are people onsite. Most likely they have anywhere from a dozen to a hundred people onsite. But what's that going to do for you in the case of a large-scale problem?

    The otherwise top rated 365 Main facility in San Francisco went down a few years ago. They had all the shizz, multipoint redundant power, multiple data feeds, earthquake-resistant building, the works. Yet, their equipment wasn't well equipped to handle what actually took them down - a recurring brown-out. It confused their equipment, which failed to "see" the situation as one requiring emergency power, causing the whole building to go dark.

    So there you are, with perhaps 25 staff in a 4-story building with tens of thousands of servers, the power is out, nobody can figure out why, and the phone lines are so loaded it's worthless. Even when the power comes back on, it's not like you are going to get "hot hands" in anything less than a week!

    Hey, even with all the best planning, disasters like this DO happen! I had to spend 2 nerve-wracking days driving to S.F. (several hours' drive) to witness a disaster zone. HUNDREDS of techs just like myself carefully nursing their servers back to health, running disk checks, talking in tense tones on cell phones, etc.

    But what pissed me off (and why I don't host with them anymore) was the overly terse statement that was obviously carefully reviewed to make it damned hard to sue them. Was I ever going to sue them? Probably not, maybe just ask for a break on that month's hosting or something. I mean, I just want the damned stuff to work, and I appreciate that even in the best of situations, things *can* go wrong.

    So now I host with Herakles data center which is just as nice as the S.F. facility, except that it's closer, and it's even noticeably cheaper. Redundant power, redundant network feeds, just like 365 Main. (Better: they had redundancy all the way into my cage, 365 Main just had redundancy to the cage's main power feed)

    And, after a year or two of hosting with Herakles, they had a "brown-out" situation, where one of their main Cisco routers went partially dark, working well enough that their redundant router didn't kick in right away, leaving some routes up and others down while they tried to figure out what was going on.

    When all was said and done, they simply sent out a statement of "Here's what happened, it violates some of your TOS agreements, and here's a claim form". It was so nice, and so open, that out of sheer goodwill, I didn't bother to fill out a claim form, and can't praise them highly enough!

    --
    I have no problem with your religion until you decide it's reason to deprive others of the truth.
    1. Re:Useless for large scale problems by Critical+Facilities · · Score: 4, Insightful

      The otherwise top rated 365 Main [365main.com] facility in San Francisco went down a few years ago. They had all the shizz, multipoint redundant power, multiple data feeds, earthquake-resistant building, the works. Yet, their equipment wasn't well equipped to handle what actually took them down - a recurring brown-out. It confused their equipment, which failed to "see" the situation as one requiring emergency power, causing the whole building to go dark.

      I think you made the right decision in changing providers. I remember that story about the 365 outage, and while I am too lazy to look up the details again, I recall it being as you're telling it. To that end, I'd simply say that they most certainly did have the proper equipment to handle the brown out, but obviously not the proper management. If you're having regular (if intermittent) power problems (brown outs, phase imbalances, voltage harmonic anomalies, spikes, etc), just roll to generator, that's what they're there for.

      I'm sick of people making the assumption that the operators of the facility were just at the mercy of a power quality issue because they have redundant power feeds and automatic transfer switches. Yes, in a perfect world, all the PLCs will function as designed, and the critical load will stay online by itself. However, it takes some foresight and some common sense sometimes to make a decision to mitigate where necessary. I direct all my guys to pre-emptively transfer to our generators if there are frequent irregularities on both of our power feeds (i.e. during a violent thunderstorm, simultaneous utility problems, etc).

      In other words, I'm agreeing with you that the service you received was unacceptable. Along with that (and in rebuttal to the parent post), I'm saying that it's not enough to talk about how they came back from the dead, but why they got there in the first place.
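
      The pre-emptive transfer policy described above (roll to generator when both utility feeds show frequent irregularities) could be sketched roughly as follows. The class, the 10-minute window and the event threshold are hypothetical choices for illustration; in practice this is an operator's judgment call, not necessarily an automated rule.

```python
# Sketch of a pre-emptive "roll to generator" rule: transfer when BOTH
# utility feeds show frequent power-quality events in a recent window.
# Window length and event threshold are hypothetical values.
from collections import deque


class FeedMonitor:
    """Tracks timestamps (in seconds) of power-quality events on one feed."""

    def __init__(self, window_s: float = 600.0):
        self.window_s = window_s
        self.events = deque()

    def record_event(self, t: float) -> None:
        self.events.append(t)

    def recent_count(self, now: float) -> int:
        # Drop events older than the window, then count what is left.
        while self.events and now - self.events[0] > self.window_s:
            self.events.popleft()
        return len(self.events)


def should_transfer(feed_a: FeedMonitor, feed_b: FeedMonitor,
                    now: float, max_events: int = 3) -> bool:
    """True when both feeds have seen at least max_events recent events."""
    return (feed_a.recent_count(now) >= max_events
            and feed_b.recent_count(now) >= max_events)


if __name__ == "__main__":
    a, b = FeedMonitor(), FeedMonitor()
    for t in (0.0, 100.0, 200.0):  # sags or spikes seen on both feeds
        a.record_event(t)
        b.record_event(t)
    print(should_transfer(a, b, now=250.0))
```

      Requiring trouble on both feeds keeps a single flaky feed from forcing a generator run when the other feed can still carry the load.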

  13. try employing the right people by mjwalshe · · Score: 2, Interesting

    try hiring some staff with telco experience instead of kids with perfect GPA scores from stanford and design the fraking thing better !

  14. Re:Isn't this part of their SLA? by hedwards · · Score: 2, Insightful

    That's the downside, anytime you acknowledge a mistake you then look like you have more mistakes than the idiots that have hundreds of mistakes that they don't disclose until caught making.

  15. Lessons Like by Greyfox · · Score: 2, Funny

    Don't have all your shit in one data center, maybe? I'd have thought that one would be pretty fundamental. Of course, knowing Google they're going to decide that what they really need is power generation right on site, then they'll just pop off and invent nuclear fusion before lunch.

    --

    I'm trying to teach myself to set people on fire with my mind... Is it hot in here?

  16. floods by zogger · · Score: 2, Insightful

    Did you ever actually see a big flood? Freaking awesome power, like a fleet of bulldozers. Smashes stuff, rips houses off foundations, knocks huge trees over, will tumble multiple ton boulders ahead of it, etc. Just depends on how big the flood is. We had one late last year here, six inches of rain in a couple of hours, just tore stuff up all over. The "building" that can withstand a flood of significant size exists, it is called a submarine. Most buildings of the normal kind just aren't designed to deal with anything that destructive. Some can resist minor floods, but not too many.

    1. Re:floods by DragonWriter · · Score: 2, Informative

      The structure that can withstand a flood has existed for a lot longer than submersible warships - it's called a "hill". If you don't have one conveniently nearby to use you can even build an artificial one.

      An "artificial hill" intended to protect an area from floods is usually called a "levee", and while certainly extremely useful for their intended purpose, they aren't exactly an ironclad guarantee. So having contingency plans for the case where they fail isn't a bad idea.

  17. Re:Don't they have by Critical+Facilities · · Score: 3, Interesting
    First of all, the "flywheel generators" you're referring to are actually either standalone UPS systems or a part of a DRUPS (Diesel Rotary UPS). Here is some information on one of the leading manufacturers of such equipment.

    However, all of this is moot, since even if they had a flywheel setup as you're speculating, it still doesn't explain why 25% of the floor went down. If the equipment was installed, maintained and loaded properly, they should've been able to get to the generators with no problem.

    are you really telling me that you believe you and ElectricTurtle are smarter than the combined brainpower set loose by Google for building and maintaining this facility?

    No, I'm telling you that I manage a data center, and I know first hand how they work (or in this case, should work). I fail to see an adequate explanation of how this was unavoidable.

  18. Re:Don't they have by DragonWriter · · Score: 2, Insightful

    There's more to this story than is being told, and instead, they're focusing on how they came back online rather than why they went offline in the first place.

    That's because they are focussing on what went wrong. Power losses, including ones that take down the whole data center, are accepted risks and part of the reason they have redundant data centers and failover procedures.

    The failure wasn't that they had a partial loss at a datacenter. The failure was that the impact of that loss wasn't mitigated properly by the systems that were supposed to be in place to do that.

  19. Re:Back around 2005... by Richard_at_work · · Score: 3, Insightful

    Let me add my own little story, which happened back in the good old days of June 2009.

    The company had spent the past year rearchitecting the entire IT infrastructure, as the complete core application suite for the business was, other than your standard peripheral utilities like Office et al, green screen based, using a proprietary language from the early 1980s that was barely still maintained and wasn't going anywhere fast.

    It was my job to handle the systems infrastructure side of the deal, while another team handled software development and I was way ahead of them - the core business applications were still in the planning stages while the infrastructure to handle and host them was well advanced. The platform we chose was well designed, with onsite redundancy built into the base cost and easily scalable - dare I say it myself, it was a good job. The only thing I had no hand in on the hardware side was the actual building infrastructure, as we had moved to custom built offices about 5 years prior, and there was someone else on the team that handled telecoms and the building. But we had a UPS and a generator, so all seemed well in the world.

    Alongside the new infrastructure came the new business continuity plan. Well, I say 'new' - I can't really say there was an 'old' BCP. Sure, we rented space at a major BC facilities provider, but there had never been any test, and there wasn't even any written documentation as to what to do.

    Here is where I must admit my first failure - the BCP was not treated as an integral, tied-in-like-a-knot part of the infrastructure, it was a separate project running alongside. Sure, the new infrastructure was designed to take a local server failure through redundancy, or even allow ease of moving to an offsite location. That part of it was all in place. My failure was in not ensuring that the offsite location actually existed as the new infrastructure grew.

    However, by the start of 2009, the basic infrastructure needs of the BCP were well known, costed and presented to the company board of directors. And there it sat. Every month I would ask them if it had been signed off, if I could spend the money. Every month I received a negative answer, it just hadn't been discussed at these busy directors meetings.

    And that was my second failure. I had no sponsor in those meetings, there was basically no IT representation (the IT director had resigned after the modernisation was pushed through, he wanted no part in it as he had not been taking the business forward himself). With no sponsor, no one wanted to raise the potential spending of a hundred thousand pounds themselves. And so it sat.

    Then one day in June, we had a routine fan replacement on the UPS. The engineer was signed in, did the replacement under the watchful eye of a senior helpdesk technician, and flipped the UPS back from maintenance bypass to full protected mains. And that was when the first bang happened.

    And all the lights went dark. All the whirring stopped. All the phones stopped ringing. All the people stopped talking.

    It was blissfully quiet for a few precious seconds. And then it was painfully quiet for about another 5. And then all hell broke loose.

    The core business applications did not fare well. The 30 year old architecture essentially had no failsafe for database writes, and as the server had quit in the midst of several thousand writes, we knew we had just lost a significant amount of data.

    It's worth taking several seconds out to explain how the core application language does its job. Firstly, there is no database server, it's all C-ISAM datafiles directly read from and written to by each individual application. Locks are handled by each application internally, with OS level locking preventing concurrent writes to the same record in the data file. No database engine, no transaction logging, no roll backs, no error correction, nothing. There was nothing in the language to protect those poor l