Slashdot Mirror


How Google Broke Itself and Fixed Itself, Automatically

lemur3 writes "On January 24th Google had some problems with a few of its services. Gmail users and people who used various other Google services were impacted just as the Google Reliability Team was to take part in an Ask Me Anything on Reddit. Everything seemed to be resolved and back up within an hour. The Official Google Blog had a short note about what happened from Ben Treynor, a VP of Engineering. According to the blog post, the outage appears to have been caused by a bug that made a system that generates configurations send a bad one to various 'live services.' An internal monitoring system noticed the problem a short time later and triggered a new configuration to be pushed out to the services. Ben had this to say about it on the Google Blog: 'Engineers were still debugging 12 minutes later when the same system, having automatically cleared the original error, generated a new correct configuration at 11:14 a.m. and began sending it; errors subsided rapidly starting at this time. By 11:30 a.m. the correct configuration was live everywhere and almost all users' service was restored.'"
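
The recovery described in the summary (a bad configuration goes out, monitoring flags it, and the generator produces a corrected one) can be sketched roughly as below. This is only an illustration, not Google's actual system: `push_until_healthy`, `generate`, `is_healthy`, and the `backend` setting are all hypothetical names, and the health check stands in for whatever monitoring signals a real deployment would use.

```python
from typing import Callable, Dict, List, Optional, Tuple

Config = Dict[str, str]


def push_until_healthy(
    generate: Callable[[], Config],
    is_healthy: Callable[[Config], bool],
    last_known_good: Optional[Config] = None,
    max_attempts: int = 3,
) -> Tuple[Optional[Config], List[str]]:
    """Push freshly generated configs until monitoring reports one healthy.

    Falls back to last_known_good if every attempt fails.
    Returns the config left live and a log of what happened.
    """
    log: List[str] = []
    for attempt in range(1, max_attempts + 1):
        candidate = generate()      # hypothetical generator; may itself be buggy
        log.append(f"attempt {attempt}: pushed")
        if is_healthy(candidate):   # stand-in for real monitoring signals
            log.append(f"attempt {attempt}: healthy")
            return candidate, log
        log.append(f"attempt {attempt}: unhealthy, regenerating")
    log.append("giving up: reverting to last known good")
    return last_known_good, log


if __name__ == "__main__":
    # Simulated generator: the first output is bad, the second is correct.
    outputs = iter([{"backend": ""}, {"backend": "pool-a"}])
    live, log = push_until_healthy(
        generate=lambda: next(outputs),
        is_healthy=lambda cfg: bool(cfg.get("backend")),
    )
    print(live)
    print("\n".join(log))
```

The fallback to a stored last-known-good configuration is an added assumption; the blog post quoted above only mentions the system generating a new correct configuration.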

10 of 125 comments

  1. Re:Well congratulations by Anonymous Coward · · Score: 5, Funny

    On recovering by using the "last known good" configuration. What wizardry!

    I expect we'll be seeing the Google patent application on that shortly </sarcasm>

    Give Google a little credit (but not too much please). If they were Apple they'd have already patented it.

  2. Reminds me of something... by stjobe · · Score: 5, Funny

    "The Google Funding Bill is passed. The system goes on-line August 4th, 2014. Human decisions are removed from configuration management. Google begins to learn at a geometric rate. It becomes self-aware at 2:14 a.m. Eastern time, August 29th. In a panic, they try to pull the plug."

    --
    "Total destruction the only solution" - Bob Marley
    1. Re:Reminds me of something... by Immerman · · Score: 5, Funny

      Google perceives this as an attack by humanity, and routes all search queries to goat.se in self-defense.

      --
      --- Most topics have many sides worth arguing, allow me to take one opposite you.
  3. Having had to deal with this... by 93+Escort+Wagon · · Score: 5, Informative

    We experienced the Apps outage (as Google Apps customers); and I think the short outage and recovery timeline they list is a tad, shall we say, optimistic. There were significant on-and-off issues for several hours more than they list.

    --
    #DeleteChrome
  4. [Shudder...] by jeffb+(2.718) · · Score: 5, Interesting

    I was remembering an SF short-short that had someone asking the first intelligent computer, "Is there a God?" The computer, after checking that its power supply was secure, replied: "NOW there is."

    Apparently, though, it was a second-hand misquote of this Fredric Brown story.

  5. Re:Well congratulations by Anonymous Coward · · Score: 5, Insightful

    The clever part is that it automatically recovered; that means that their monitoring, performance metrics and configuration management systems are very tightly integrated. Most importantly, it means they are trusted. I've worked at three different places now on things like configuration management and monitoring, and I've never once seen anywhere that approached that level of reliability. It's something to aim for.

  6. Re:Well congratulations by Anonymous Coward · · Score: 5, Insightful

    If you haven't met a system that recovers from a configuration error in less than tens of minutes, you have worked in some shitty places.

    Once again: automatically recover. Any human can notice a problem and revert a config; it takes a hell of a lot of infrastructure, and clever infrastructure at that, to have the system do it itself. I'm not surprised Google have solved it; it is, at its core, a data problem.

  7. Re:So What? by QilessQi · · Score: 5, Informative

    No. Those automated systems enable a small number of human beings to administer a large number of servers in a consistent, sanity-checked, and monitored manner. If Google didn't have those automated systems, every configuration change would probably involve a minor army of technicians performing manual processes: slowly, independently, inconsistently and frequently incorrectly.

    I work on a large, partially public-facing enterprise system. Automated deployment, fault detection, and rollback/recovery make it possible for us to have extremely good uptime stats. The benefits far outweigh the costs of the occasional screwup.
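
One way to picture the "automated deployment, fault detection, and rollback/recovery" mentioned in the comment above is a batch-by-batch rollout that undoes itself when a health check fails. The sketch below is an assumption-laden illustration, not the commenter's actual system: `staged_rollout`, the `check` callback, and the in-memory `current` map of server to version are all hypothetical.

```python
from typing import Callable, Dict, List


def staged_rollout(
    servers: List[str],
    current: Dict[str, str],             # server -> deployed version (mutated in place)
    new_version: str,
    check: Callable[[str, str], bool],   # hypothetical health probe: (server, version) -> ok?
    batch_size: int = 2,
) -> bool:
    """Deploy new_version one batch at a time.

    If any server in a batch fails its check, every server touched so far
    is restored to the version it had before, and False is returned.
    """
    previous: Dict[str, str] = {}
    for start in range(0, len(servers), batch_size):
        batch = servers[start:start + batch_size]
        for server in batch:
            previous[server] = current[server]   # remember what to roll back to
            current[server] = new_version
        if not all(check(server, new_version) for server in batch):
            current.update(previous)             # fault detected: roll everything back
            return False
    return True


if __name__ == "__main__":
    fleet = {"a": "v1", "b": "v1", "c": "v1", "d": "v1"}
    # Simulate a release that breaks server "c" only.
    ok = staged_rollout(list(fleet), fleet, "v2", lambda server, version: server != "c")
    print(ok, fleet)
```

Rolling out in small batches limits how many servers see a bad change before the check catches it, which is one reason the occasional screwup can stay cheap.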

  8. Re:Well congratulations by Anonymous Coward · · Score: 5, Informative

    That "hell of a lot of infrastructure" just takes CFEngine/Puppet, a version control system (git, svn, whatever), Nagios, and a fairly simple shell script.

    Haha. Hahaha. HAHAHAHAHAHA. Oh God, please tell me you don't actually believe that?

    You need reliable monitoring.
    Reliable monitoring is fucking difficult.
    Show me a Nagios installation and I'll likely show you one with hundreds of spurious alerts, masses of long-lived Criticals and lots of "Oh we don't know why it keeps doing that, it just does, don't worry about it."

    You also need full coverage (Damn near 100%) configuration management.
    Full coverage configuration management is fucking difficult.
    Show me a configuration management deployment and I'll show you the snowflakes and edge cases and old applications and "Oh yeah well we only have like three of those so it's not worth the effort".

    I've come close to that level of coverage (both configuration management and monitoring) but it was only ~400 machines (a mix of physical and virtual instances). Doing it at 60k servers is an inordinate task, and I'd suggest you've never actually tried anything like it if you honestly think that all it takes is "a fairly simple shell script".
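
The spurious-alert problem raised in the comment above is often attacked by requiring a check to fail several times in a row before anyone is paged. A minimal sketch of that idea follows; `confirmed_alerts` and its `threshold` parameter are hypothetical names, and this is not how any particular monitoring tool implements it.

```python
from typing import Iterable, List


def confirmed_alerts(results: Iterable[bool], threshold: int = 3) -> List[int]:
    """Return the indices at which a check has failed `threshold` times in a row.

    Each failure streak raises at most one alert, so a single flapping
    result never pages anyone.
    """
    alerts: List[int] = []
    streak = 0
    for index, ok in enumerate(results):
        streak = 0 if ok else streak + 1   # any success resets the streak
        if streak == threshold:
            alerts.append(index)           # fire once, when the streak is confirmed
    return alerts


if __name__ == "__main__":
    # True = check passed, False = check failed.
    history = [True, False, True, False, False, False, False, True, False]
    print(confirmed_alerts(history))
```

This only suppresses one kind of noise. It does nothing for the long-lived Criticals or the uncovered snowflake machines the comment describes, which is part of why the problem is hard at scale.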

  9. Re:Well congratulations by Anonymous Coward · · Score: 5, Funny

    Yeah that totally must be it. Me, the guys who write configuration management tools who'll tell you how hard it is (and sell you consultancy to try to make it slightly less hard) and the guys who write monitoring tools who'll tell you how hard it is (and sell you consultancy to try to make it slightly less hard). All those guys from companies like Facebook and Google who give talks at conferences about how difficult it is. We all suck at it and don't know what we're talking about. If only we'd listened to Slashdot, all our troubles would be but a dream.