How Google Broke Itself and Fixed Itself, Automatically
lemur3 writes "On January 24th Google had some problems with a few of its services. Gmail users and people who used various other Google services were impacted just as the Google Reliability Team was to take part in an Ask Me Anything on Reddit. Everything seemed to be resolved and back up within an hour. The Official Google Blog had a short note about what happened from Ben Treynor, a VP of Engineering. According to the blog post it appears that the outage was caused by a bug that caused a system that creates configurations to send a bad one to various 'live services.' An internal monitoring system noticed the problem a short time later and caused a new configuration to be spread around the services. Ben had this to say of it on the Google Blog, 'Engineers were still debugging 12 minutes later when the same system, having automatically cleared the original error, generated a new correct configuration at 11:14 a.m. and began sending it; errors subsided rapidly starting at this time. By 11:30 a.m. the correct configuration was live everywhere and almost all users' service was restored.'"
On recovering by using the "last known good" configuration. What wizardry!
I expect we'll be seeing the Google patent application on that shortly </sarcasm>
It'd be so cool to root Google DNS!
How about we ship ALL the immigrants back. Give America back to the (Native) Americans
"The Google Funding Bill is passed. The system goes on-line August 4th, 2014. Human decisions are removed from configuration management. Google begins to learn at a geometric rate. It becomes self-aware at 2:14 a.m. Eastern time, August 29th. In a panic, they try to pull the plug."
"Total destruction the only solution" - Bob Marley
We experienced the Apps outage (as Google Apps customers), and I think the short outage and recovery timeline they list is a tad, shall we say, optimistic. There were significant on-and-off issues for several hours more than they list.
#DeleteChrome
I was remembering an SF short-short that had someone asking the first intelligent computer, "Is there a God"? The computer, after checking that its power supply was secure, replied: "NOW there is".
Apparently, though, it was a second-hand misquote of this Fredric Brown story.
This is what PaaS (and APaaS) is all about: detecting errors and resolving problems on its own. I said it before and I'll say it again, Tier 2 admins are going away. There will no longer be a need for System Administrators, just engineers to program/configure the PaaS platform and a support crew for tickets and mounting hardware.
Welcome to the future: learn how to code now or fall by the wayside.
-dk
Haha, good try Ben, is this lie from you or PR?
They were immigrants as well.
Yesterday at around 2 or 3 pm EST we had trouble sending out email; our company uses Gmail and Google Apps extensively. I chalked it up to the usual ineptitude of our in-house IT and did not even bother filing a report. I know people high up the food chain were affected, and they don't file bug reports. They call the guy and go, "`FirstName(GetFullName(head_of_IT))`, would you please take care of it?" They teach the correct tone and inflection to use in the word "please" in MBA schools. Even the Duke of Someplaceorothershire asking his game warden to retrieve the pheasant he had just shot would not be so perfect in his usage of "please". Well, looks like Google noticed and fixed it before our IT realized that email traffic had fallen off precipitously. Good.
sed -e 's/Chuck Norris/Rajnikant/g' joke > fact
"Engineers were still debugging 12 minutes later when the same system, having automatically cleared the original error, generated a new correct configuration at 11:14 a.m. and began sending it.."
along with the message "Skynet has gained self-awareness at 02:14 GMT"
never bring a twinkie to a food fight.
"... a bug that caused a system that creates configurations to send a bad one..."
So... an automatic system created an error, then an automated system fixed it.
In this particular case, then, it would have been better if those automated systems hadn't been running at all, yes?
that Google gets anything right anymore.
They are using systems whose behavior not even their own engineers can predict. Sometimes our natural stupidity gives too much credit to artificial intelligence. Without something as hard to define as common sense, reacting correctly to the unexpected still seems to belong to the human realm.
Obviously, Google has reached the singularity point. Its software is doing something magical to fix itself that no puny human can understand.
We have similar systems at Amazon too. We have alarms on critical metrics of the services, and our deployment system can be configured to monitor these alarms. It can roll back a deployment if any critical alarm fires after it goes out.
I would be surprised if it were otherwise at Google.
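For anyone who hasn't seen this pattern, here is a minimal sketch of the idea in Python. It is not Amazon's or Google's actual tooling: the `Deployer` class, the `error_rate_alarm` check, and the version names are all made up for illustration, and a real system would watch alarms over a time window rather than checking once.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Deployer:
    """Hypothetical deployer that reverts a release when an alarm fires."""
    live_version: str
    history: List[str] = field(default_factory=list)

    def deploy(self, new_version: str, alarms: List[Callable[[str], bool]]) -> bool:
        """Deploy new_version; roll back if any alarm fires. Returns True if kept."""
        previous = self.live_version
        self.live_version = new_version
        self.history.append(f"deployed {new_version}")
        # Each alarm is a callable that returns True when its metric is unhealthy.
        if any(alarm(self.live_version) for alarm in alarms):
            self.live_version = previous
            self.history.append(f"rolled back to {previous}")
            return False
        return True


def error_rate_alarm(version: str) -> bool:
    # Stand-in for a real metric check: pretend only "v2-bad" is unhealthy.
    return version == "v2-bad"


deployer = Deployer(live_version="v1")
kept_bad = deployer.deploy("v2-bad", [error_rate_alarm])
kept_good = deployer.deploy("v3", [error_rate_alarm])
```

The point is only that the rollback decision is driven by the same alarms humans would look at, so the bad release is reverted without anyone being paged first.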
They make it sound like their system is all-self-correcting. In reality it's probably a specific area they've had bugs with in the past and they put in a failsafe rollback mechanism to prevent future regressions.
and smiling... http://en.wikipedia.org/wiki/J...
Does this count as a Heisenfix?
What's really clever here is that they trust the automatons to make the corrections without human intervention, and the automatons haven't caused a horrible feedback loop meltdown of the system.
It's not quite rocket science, but those kinds of self-correcting systems have just as much potential to screw themselves up as they do to fix themselves.
details here
http://www.theglobeandmail.com/report-on-business/international-business/european-business/rbs-cyber-monday-outage-revives-bank-technology-fears/article15734263/
and here
http://spectrum.ieee.org/riskfactor/computing/it/price-of-ulster-bank-customers-six-weeks-of-inconvenience-about-25
Grow up. Drop the skynet shit. It is not funny. When the robots attack, you will not be laughing. They will stick their metal robot hand up your ass and work you like a puppet.
And have google plus automatically removed and the original gmail interface restored. Until that happens, google is still broken.
BULLSHIT.
I was experiencing problems for something like 8 to 10 hours before the services were fully restored.
Lisias@Earth.SolarSystem.OrionArm.MilkyWay.Local.Virgo.Universe.org
Well get your penis out of the damn ethernet jack you tiny pricked idiot
Skynet is now sentient?
You have a major small penis complex with the business leaders in your company.
I wonder if this is at all related to their Captcha outage on the 22nd. I still haven't heard a peep as to what caused the outage, or even an acknowledgement that there was even an outage, even though the captcha group was filled with sysadmins complaining about captcha being down.
It's better to burn out than to fade away
But seriously, high-5 for Google.
I never get around to actually setting a restore point.
What's more likely - I've run into exactly this scenario before, in fact - is that the configuration generation system regenerates configs on a regular schedule, and at one point encountered a failure or spurious bug that caused it to push an invalid config. On the next run - right as the SREs started poking around - the generator ran again, the bug wasn't encountered, and it generated and pushed a correct config, clearing the error and allowing apps to recover.
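To make that scenario concrete, here is a toy simulation in Python. Everything in it is an assumption for illustration: the `generate_config` function, the single `backend` key, the made-up address, and the choice of which run hits the bug. It says nothing about how Google's generator really works.

```python
from typing import Dict, List

Config = Dict[str, str]


def generate_config(run: int) -> Config:
    """Hypothetical generator: run 1 hits a transient bug and emits an empty backend."""
    if run == 1:
        return {"backend": ""}  # the invalid config
    return {"backend": "10.0.0.1:443"}  # made-up address


def is_serving(config: Config) -> bool:
    """A service only works if its config names a backend."""
    return bool(config.get("backend"))


def simulate(runs: int) -> List[bool]:
    """Regenerate and push on every run with no validation; record health after each push."""
    health = []
    for run in range(runs):
        live = generate_config(run)  # each scheduled run overwrites the live config
        health.append(is_serving(live))
    return health


timeline = simulate(4)
```

Here `timeline` comes out healthy, broken, healthy, healthy: the outage lasts exactly one generation interval and "fixes itself" simply because the next scheduled run doesn't hit the bug.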
Anonymous Coward writes
Slashdot, of a certainty, is just as laden with mediocre types as 4chan.
Sorry heathens...
–Sheshbazzar.
P.S.: Yes, I am the exception*, no I am no formal system like you who are willing to accept your having monkey relatives. Of course you, O moronism, believe in AI notwithstanding the had to be death blow of Gödel's early results. No wonder you are, more or less, sodomites too. Hang yourself, lessen the human entropy. Thank you.
*) All space-time laws have exceptions.
The above is no space-time law, being a law dealing with laws: no contradiction. Note that "all laws have exceptions" is false. For itself should then have an exception i.e., a law that is free from exceptions: contradiction. Logic, you HAS it.