Slashdot Mirror


How Google Broke Itself and Fixed Itself, Automatically

lemur3 writes "On January 24th, Google had problems with several of its services. Gmail and various other Google services were affected just as the Google Reliability Team was about to take part in an Ask Me Anything on Reddit. Everything seemed to be resolved and back up within an hour. The Official Google Blog had a short note about what happened from Ben Treynor, a VP of Engineering. According to the blog post, it appears the outage was caused by a bug that made a system that generates configurations send a bad one to various 'live services.' An internal monitoring system noticed the problem a short time later and caused a new configuration to be distributed to the services. Ben had this to say on the Google Blog: 'Engineers were still debugging 12 minutes later when the same system, having automatically cleared the original error, generated a new correct configuration at 11:14 a.m. and began sending it; errors subsided rapidly starting at this time. By 11:30 a.m. the correct configuration was live everywhere and almost all users' service was restored.'"

6 of 125 comments

  1. Re:How To Revitalize America! by Anonymous Coward · · Score: 1, Interesting

    How about we ship ALL the immigrants back? Give America back to the (Native) Americans.

  2. [Shudder...] by jeffb+(2.718) · · Score: 5, Interesting

    I was remembering an SF short-short that had someone asking the first intelligent computer, "Is there a God?" The computer, after checking that its power supply was secure, replied: "NOW there is."

    Apparently, though, it was a second-hand misquote of this Fredric Brown story.

    1. Re:[Shudder...] by the+eric+conspiracy · · Score: 4, Interesting

      Cool.

      On a slightly more optimistic note, there is Asimov's "The Last Question", another computer-as-God story.

      http://www.thrivenotes.com/the...

  3. Re:Well congratulations by icebike · · Score: 3, Interesting

    On recovering by using the "last known good" configuration. What wizardry!

    I expect we'll be seeing the Google patent application on that shortly </sarcasm>

    In other words: They still have no clue what happened, because the system in question "fixed itself".

    Sounds a lot like a BGP routing mishap rather than anything to do with Google's actual server farms.
    The lack of specificity suggests they still haven't got much of a clue. I suspect they were pwned by someone
    who was watching them brag on reddit and decided it was time for a lesson in humility.

    --
    Sig Battery depleted. Reverting to safe mode.

  4. Arsonist claiming to be the hero firefighter by JoeyRox · · Score: 3, Interesting

    They make it sound like their system is entirely self-correcting. In reality it's probably a specific area they've had bugs with in the past, and they put in a failsafe rollback mechanism to prevent future regressions.

  5. Re:Well congratulations by sjames · · Score: 3, Interesting

    It's not unlike the old trick of setting a machine to reboot in 10 minutes, manually changing the network settings, then canceling the reboot if you can still communicate (and the settings revert on reboot if you cannot). Of course, Google did it on a much larger scale.
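
The reboot trick described above can be sketched as a small "revert unless confirmed" guard. This is only an illustration of the general pattern, not anything Google has described: `RevertGuard`, `change_setting`, and the `still_reachable` callback are hypothetical names, a plain dict stands in for real network settings, and a `threading.Timer` plays the role of the scheduled reboot.

```python
import threading


class RevertGuard:
    """Apply a risky change, then undo it unless confirm() is called in time."""

    def __init__(self, apply_change, revert_change, timeout_s):
        self._apply = apply_change
        self._revert = revert_change
        self._timer = threading.Timer(timeout_s, self._on_timeout)
        self._lock = threading.Lock()
        self._settled = False   # True once we have either confirmed or reverted
        self.reverted = False

    def _on_timeout(self):
        # Runs on the timer thread if nobody confirmed in time.
        with self._lock:
            if self._settled:
                return
            self._settled = True
            self._revert()
            self.reverted = True

    def start(self):
        # Arm the safety net first (the "scheduled reboot"), then change things.
        self._timer.start()
        self._apply()

    def confirm(self):
        # The "cancel the reboot" step; returns False if the revert already ran.
        with self._lock:
            if self._settled:
                return False
            self._settled = True
        self._timer.cancel()
        return True

    def wait(self):
        # Block until the timer thread has either fired or been cancelled.
        self._timer.join()


def change_setting(settings, key, new_value, still_reachable, timeout_s=0.1):
    """Change settings[key]; keep the change only if still_reachable() is True.

    Returns True if the change was rolled back, False if it was kept.
    """
    old_value = settings[key]

    def apply_change():
        settings[key] = new_value

    def revert_change():
        settings[key] = old_value

    guard = RevertGuard(apply_change, revert_change, timeout_s)
    guard.start()
    if still_reachable():
        guard.confirm()
    guard.wait()
    return guard.reverted
```

Calling `change_setting(cfg, "gateway", "10.0.0.254", check)` keeps the new value only when `check()` succeeds before the timeout; otherwise the old value comes back on its own. The point of the design is that the safe outcome needs no action from the operator, which matters when the change itself may have cut the operator off.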