Slashdot Mirror


When the Power Goes Out At Google

1sockchuck writes "What happens when the power goes out in one of Google's mighty data centers? The company has issued an incident report on a Feb. 24 outage for Google App Engine, which went offline when an entire data center lost power. The post-mortem outlines what went wrong and why, lessons learned and steps taken, which include additional training and documentation for staff and new datastore configurations for App Engine. Google is earning strong reviews for its openness, which is being hailed as an excellent model for industry outage reports. At the other end of the spectrum is Australian host Datacom, where executives are denying that a Melbourne data center experienced water damage during weekend flooding, forcing tech media to document the outage via photos, user stories and emails from the NOC."

5 of 135 comments (clear)

  1. Read the comments by RaigetheFury · · Score: 5, Insightful

    I pity EvilMuppet. Guy is a tool. There are contractual agreements that are in place to prevent pictures, aka the "rules" but when the data center blatantly LIES they are breaking the trust and violating the agreement. Case Law exists where contracts can be violated when one accuses the other of violating said contract.

    That's what happened. The data center was lying about what happened to avoid responsibility for the equipment it was being paid to host. Pictures were taken and are being used to prove the company did violate the trust of the contract.

    You can argue the semantics and legality of it but if this goes to court the pictures will be admissible and the data center will lose.

  2. They had a perfect contingency plan for this case by juanjux · · Score: 5, Funny

    ...but it was stored on Google Docs.

  3. Re:Don't they have by johnncyber · · Score: 5, Informative

    Dude RTFA (I know, I know, shame on me). The backup generators kicked in, but 25% of the machines in data center did not receive power before crashing.

  4. Useless for large scale problems by mcrbids · · Score: 5, Interesting

    Of COURSE there are people onsite. Most likely they have anywhere from a dozen to a hundred people onsite. But what's that going to do for you in the case of a large-scale problem?

    The otherwise top rated 365 Main facility in San Francisco went down a few years ago. They had all the shizz, multipoint redundant power, multiple data feeds, earthquake-resistant building, the works. Yet, their equipment wasn't well equipped to handle what actually took them down - a recurring brown-out. It confused their equipment, which failed to "see" the situation as one requiring emergency power, causing the whole building to go dark.

    So there you are, with perhaps 25 staff a 4-story building with tens of thousands of servers, the power is out, nobody can figure out why, and the phone lines are so loaded it's worthless. Even when the power comes back on, it's not like you are going to get "hot hands" in anything less than a week!

    Hey, even with all the best planning, disasters like this DO happen! I had to spend 2 wracking days driving to S.F. (several hours drive) to witness a disaster zone. HUNDREDS of techs just like myself carefully nursing their servers back to health, running disk checks, talking in tense tones on cell phones, etc.

    But what pissed me off (and why I don't host with them anymore) was the overly terse statement that was obviously carefully reviewed to make it damned hard to sue them. Was I ever going to sue them? Probably not, maybe just ask for a break on that month's hosting or something. I mean, I just want the damned stuff to work, and I appreciate that even in the best of situations, things *can* go wrong.

    So now I host with Herakles data center which is just as nice as the S.F. facility, except that it's closer, and it's even noticably cheaper. Redundant power, redundant network feeds, just like 365 main. (Better: they had redundancy all the way into my cage, 365 Main just had redundancy to the cage's main power feed)

    And, after a year or two of hosting with Herakles, they had a "brown-out" situation, where one of their main Cisco routers went partially dark, working well enough that their redundant router didn't kick in right away, leaving some routes up and others down while they tried to figure out what was going on.

    When all was said and done, they simply sent out a statement of "Here's what happened, it violates some of your TOS agreements, and here's a claim form". It was so nice, and so open, that out of sheer goodwill, I didn't bother to fill out a claim form, and can't praise them highly enough!

    --
    I have no problem with your religion until you decide it's reason to deprive others of the truth.
    1. Re:Useless for large scale problems by Critical+Facilities · · Score: 4, Insightful

      The otherwise top rated 365 Main [365main.com] facility in San Francisco went down a few years ago. They had all the shizz, multipoint redundant power, multiple data feeds, earthquake-resistant building, the works. Yet, their equipment wasn't well equipped to handle what actually took them down - a recurring brown-out. It confused their equipment, which failed to "see" the situation as one requiring emergency power, causing the whole building to go dark.

      I think you made the right decision in changing providers. I remember that story about the 365 outage, and while I am too lazy to look up the details again, I recall it being as you're telling it. To that end, I'd simply say that they most certainly did have the proper equipment to handle the brown out, but obviously not the proper management. If you're having regular (if intermittent) power problems (brown outs, phase imbalances, voltage harmonic anomolies, spikes, etc), just roll to generator, that's what they're there for.

      I'm sick of people making the assumption that the operators of the facility were just at the mercy of a power quality issue because they have redundant power feeds and automatic transfer switches. Yes, in a perfect world, all the PLCs will function as designed, and the critical load will stay online by itself. However, it takes some foresight and some common sense sometimes to make a decision to mitigate where necessary. I direct all my guys to pre-emptively transfer to our generators if there are frequent irregularities on both of our power feeds (i.e. during a violent thunderstorm, simultaneous utility problems, etc).

      In other words, I'm agreeing with you that the service you received was unacceptable. Along with that (and in rebuttal to the parent post), I'm saying that it's not enough to talk about how they came back from the dead, but why they got there in the first place.