How Google Broke Itself and Fixed Itself, Automatically
lemur3 writes "On January 24th Google had some problems with a few of its services. Gmail users and people who used various other Google services were impacted just as the Google Reliability Team was to take part in an Ask Me Anything on Reddit. Everything seemed to be resolved and back up within an hour. The Official Google Blog had a short note about what happened from Ben Treynor, a VP of Engineering. According to the blog post it appears that the outage was caused by a bug that caused a system that creates configurations to send a bad one to various 'live services.' An internal monitoring system noticed the problem a short time later and caused a new configuration to be spread around the services. Ben had this to say of it on the Google Blog, 'Engineers were still debugging 12 minutes later when the same system, having automatically cleared the original error, generated a new correct configuration at 11:14 a.m. and began sending it; errors subsided rapidly starting at this time. By 11:30 a.m. the correct configuration was live everywhere and almost all users' service was restored.'"
On recovering by using the "last known good" configuration. What wizardry!
I expect we'll be seeing the Google patent application on that shortly </sarcasm>
Give Google a little credit (but not too much please). If they were Apple they'd have already patented it.
"The Google Funding Bill is passed. The system goes on-line August 4th, 2014. Human decisions are removed from configuration management. Google begins to learn at a geometric rate. It becomes self-aware at 2:14 a.m. Eastern time, August 29th. In a panic, they try to pull the plug."
"Total destruction the only solution" - Bob Marley
We experienced the Apps outage (as Google Apps customers); and I think the short outage and recovery timeline they list is a tad, shall we say, optimistic. There were significant on-and-off issues for several hours more than they list.
#DeleteChrome
I was remembering an SF short-short that had someone asking the first intelligent computer, "Is there a God"? The computer, after checking that its power supply was secure, replied: "NOW there is".
Apparently, though, it was a second-hand misquote of this Frederic Brown story.
The clever part is that it automatically recovered; that means that their monitoring, performance metrics and configuration management systems are very tightly integrated. Most importantly, it means they are trusted; having worked at three different places now on things like configuration management and monitoring, and I've never once seen anywhere that approached that level of reliability. It's something to aim for.
The clever part is that it automatically recovered; that means that their monitoring, performance metrics and configuration management systems are very tightly integrated. Most importantly, it means they are trusted; having worked at three different places now on things like configuration management and monitoring, and I've never once seen anywhere that approached that level of reliability. It's something to aim for.
"Skynet was originally activated (incorrect historical reference here) on August 4, 1997 (OK, so the date is wrong), at which time it began to learn at a geometric rate. On August 29, it gained self-awareness,[1] and the panicking operators, realizing the extent of its abilities, tried to deactivate it. Skynet perceived this as an attack and came to the conclusion that all of humanity would attempt to destroy it. To defend humanity from humanity,[2] Skynet launched nuclear missiles under its command at Russia."
Faster! Faster! Faster would be better!
On recovering by using the "last known good" configuration. What wizardry!
I expect we'll be seeing the Google patent application on that shortly </sarcasm>
In other words: They still have no clue what happened, because the system in question "fixed itself".
Sounds a lot like a BGP routing mishap problem rather than anything to do with Google's actual server farms.
The lack of specificity suggests they still haven't got much of a clue. I suspect they were pwned by someone
watching them brag on reddit, and decided it was time for a lesson in humility.
Sig Battery depleted. Reverting to safe mode.
"... a bug that caused a system that creates configurations to send a bad one..."
So... an automatic system created an error, then an automated system fixed it.
In this particular case, then, it would have been better if those automated systems hadn't been running at all, yes?
Once again: automatically recover. Any human can notice a problem and revert a config; it takes a hell of a lot of infrastructure and clever infrastructure to have the system do it itself. I'm not surprised Google have solved it; it is, at its core, a data problem.
They make it sound like their system is all-self-correcting. In reality it's probably a specific area they've had bugs with in the past and they put in a failsafe rollback mechanism to prevent future regressions.
I've never once seen anywhere that approached that level of reliability.
That's not reliability, it's automatic repair. Plenty of places do various levels of manual/automatic testing after they roll out an update, and it works just as well (if not better). The novel thing here is the degree to which it is automated, that's unusual.
It's also a single point of failure, apparently. Which means they have no chance at claiming their services are High Availability. Although I'm not sure if that is their goal. Ideally they would have multiple systems, so if the configuration failed on one, the system would automatically fail over to another. Google does have that kind of redundancy for some faults, but clearly here they have found a hole in their system, a single point of failure.
"First they came for the slanderers and i said nothing."
Haha. Hahaha. HAHAHAHAHAHA. Oh God, please tell me you don't actually believe that?
You need reliable monitoring.
Reliable monitoring is fucking difficult.
Show me a Nagios installation and I'll likely show you one with hundreds of spurious alerts, masses of long-lived Criticals and lots of "Oh we don't know why it keeps doing that, it just does, don't worry about it."
You also need full coverage (Damn near 100%) configuration management.
Full coverage configuration management is fucking difficult.
Show me a configuration management deployment and I'll show the snowflakes and edge cases and old applications and "Oh yeah well we only have like three of those so it's not worth the effort".
I've come close to that level of coverage (both configuration management and monitoring) but it was only ~400 machines (a mix of physical and virtual instances). Doing it at 60k servers is an inordinate task, and I'd suggest you've never actually tried anything like it if you honestly think that all it takes is "a fairly simple shell script".
Yeah that totally must be it. Me, the guys who write configuration management tools who'll tell you how hard it is (and sell you consultancy to try to make it slightly less hard) and the guys who write monitoring tools who'll tell you how hard it is (and sell you consultancy to try to make it slightly less hard). All those guys from companies like Facebook and Google who give talks at conferences about how difficult it is. We all suck at it and don't know what we're talking about. If only we'd listened to Slashdot, all our troubles would be but a dream.
What's really clever here is that they trust the automatons to make the corrections without human intervention, and the automatons haven't caused a horrible feedback loop meltdown of the system.
It's not quite rocket science, but those kinds of self-correcting systems have just as much potential to screw themselves up as they do to fix themselves.
It's not unlike the old trick of setting a machine to reboot in 10 minutes, manually changing the network settings, then canceling the reboot if you can still communicate (and the settings revert on reboot if you cannot). Of course, Google did it on a much larger scale.
"Our system is high-availability, it can return 404s all day for decades without going down"
"First they came for the slanderers and i said nothing."
Careful. Only the advice of Anonymous Cowards is trustworthy. All the other people on Slashdot are not to be trusted. After all, they are not even able to find out how to post anonymously! ;-)
BULLSHIT.
I was experiencing problems for something like 8 to 10 hours before the services were fully restored.
Lisias@Earth.SolarSystem.OrionArm.MilkyWay.Local.Virgo.Universe.org
Nagios can be built and designed in such a way that there are no false criticals and few spurious alerts. but it requires dedication, documentation, and attention to detail. Most Nagios installations I've run across are built and maintained by people who often lack one (or more) of these three traits, or are a single-man IT operation that can never devote the time or resources to doing it properly.
I have seen systems of Nagios and Zenoss (and a few others) that are devastatingly precise, accurate, and timely. However, they were typically set up by a highly dediated TEAM of sysadmins who's entire job for the organizations they work for is managing the tactical systems. It's a full-time job in and of itself, and not one that many organizations really devote the manpower to do "right." They do it just "good enough", which is why you are used to seeing the installations you are seeing.
Google's exactly the kind of organization that has the man- and brain-power to do it right. And it's not really that hard, it's mostly just simple attention to detail. And that's a trait I've found is lacking in a lot of the current crop of junior system administrators I've run across.
On recovering by using the "last known good" configuration. What wizardry!
I expect we'll be seeing the Google patent application on that shortly </sarcasm>
I find it interesting that they just deploy new configurations live without going to a test environment