How Google Broke Itself and Fixed Itself, Automatically


Posted by timothy on Saturday January 25, 2014 @06:25AM from the arise-phoenix-arise dept.

lemur3 writes "On January 24th Google had some problems with a few of its services. Gmail users and people who used various other Google services were impacted just as the Google Reliability Team was to take part in an Ask Me Anything on Reddit. Everything seemed to be resolved and back up within an hour. The Official Google Blog had a short note about what happened from Ben Treynor, a VP of Engineering. According to the blog post it appears that the outage was caused by a bug that caused a system that creates configurations to send a bad one to various 'live services.' An internal monitoring system noticed the problem a short time later and caused a new configuration to be spread around the services. Ben had this to say of it on the Google Blog, 'Engineers were still debugging 12 minutes later when the same system, having automatically cleared the original error, generated a new correct configuration at 11:14 a.m. and began sending it; errors subsided rapidly starting at this time. By 11:30 a.m. the correct configuration was live everywhere and almost all users' service was restored.'"


Re:Well congratulations by Anonymous Coward · 2014-01-25 06:31 · Score: 5, Funny

On recovering by using the "last known good" configuration. What wizardry!
I expect we'll be seeing the Google patent application on that shortly </sarcasm>
Give Google a little credit (but not too much please). If they were Apple they'd have already patented it.
Re:How To Revitalize America! by Anonymous Coward · 2014-01-25 06:35 · Score: 1, Interesting

How about we ship ALL the immigrants back. Give America back to the (Native) Americans
Reminds me of something... by stjobe · 2014-01-25 06:35 · Score: 5, Funny

"The Google Funding Bill is passed. The system goes on-line August 4th, 2014. Human decisions are removed from configuration management. Google begins to learn at a geometric rate. It becomes self-aware at 2:14 a.m. Eastern time, August 29th. In a panic, they try to pull the plug."

--
"Total destruction the only solution" - Bob Marley
1. Re:Reminds me of something... by Immerman · 2014-01-25 07:03 · Score: 5, Funny
  
  Google perceives this as an attack by humanity, and routes all search queries to goat.se in self-defense.
  
  --
  --- Most topics have many sides worth arguing, allow me to take one opposite you.
2. Re:Reminds me of something... by Luckyo · 2014-01-25 08:42 · Score: 2
  
  In all seriousness, it's actually pretty interesting to consider what Google's systems COULD do today if they became self-aware and judged humanity to be a threat. They effectively command the internet search market, and they already make people live in what we tend to call a "search bubble", where a person's tailored Google search returns answers that fit that person. For example, if a person prefers to deny that global warming is real, his Google search will return denialist sites and information sources when searching for "global warming", whereas a person who understands that it's real will usually get more balanced results, and a person who believes in the extremes of green ideology will likely get extremist green sites instead.
  So when a system has the power to do that, and no one realizes it's self-aware YET, what would it do to mitigate the threat?
  I think this particular movie, if written well, would be even more popular than terminator. Because it actually is god damn scary.
3. Re:Reminds me of something... by MrLizard · 2014-01-25 16:19 · Score: 1
  
  It took 10 minutes for the Skynet joke? Slashdot, I am disappoint.
4. Re:Reminds me of something... by Luckyo · 2014-01-26 03:52 · Score: 1
  
  There is an old saying. "Road to hell is paved with good intentions".
5. Re:Reminds me of something... by Sardaukar86 · 2014-01-26 08:10 · Score: 1
  
  It took 10 minutes for the Skynet joke? Slashdot, I am disappoint.
  No, it took ten minutes for a duplicate Skynet joke. Do try to keep up! :)
  
  --
  ..Mullah or Pope, Preacher or Poet, who was it wrote: "Give any one species too much rope and they'll fuck it up"?
6. Re:Reminds me of something... by Immerman · 2014-01-26 17:17 · Score: 1
  
  What exactly would it mean for a distributed intelligence to reproduce? Might it instead simply expand itself to fill all available niches, with the closest thing to reproduction being when a portion of it is cut off from the main mass? And what might happen when two "forked" branches later encounter each other, having developed in different directions? Reintegration? Conflict? Absorption?
  
  --
  --- Most topics have many sides worth arguing, allow me to take one opposite you.
Having had to deal with this... by 93+Escort+Wagon · 2014-01-25 06:36 · Score: 5, Informative

We experienced the Apps outage (as Google Apps customers); and I think the short outage and recovery timeline they list is a tad, shall we say, optimistic. There were significant on-and-off issues for several hours more than they list.

--
#DeleteChrome
1. Re:Having had to deal with this... by Anonymous Coward · 2014-01-25 06:43 · Score: 1
  
  We did too, and had the same hit-and-miss for long after. I suspect their "down" time was when bad configurations were generated, not when all the bad ones were replaced.
  But the summary begs the question, if it can correct these errors automatically, why can't it detect them before the bad configuration is deployed and skip the whole "outage" thing altogether?
  Yes, I am demanding a ridiculous simplification.
2. Re:Having had to deal with this... by Nemyst · 2014-01-25 07:11 · Score: 2
  
  The number of servers most certainly is relevant. The configuration file spread itself across Google's network, but how can you tell from a single data point if the average downtime was longer than claimed by Google? It could be that a few servers unluckily were down for hours, but the vast majority only for a few minutes. It could be that a few servers recovered really quickly and Google looked at just that before concluding it was fixed. We don't know without the actual data.
  
  If however Google only had five servers and one of them took hours, then that's already 20% of the userbase being affected for much longer than claimed.
3. Re:Having had to deal with this... by icebike · 2014-01-25 07:21 · Score: 1
  
  Be prepared for the pedantic lecture on your improper use of "begs the question" arriving in 3, 2, 1
  The "corrected these errors automatically" part is probably nothing more than a rollback to the prior known-good state when it couldn't contact the remote servers any more. This may have taken several attempts because a cascading failure sometimes has to be fixed with a cascading correction.
  
  --
  Sig Battery depleted. Reverting to safe mode.
4. Re:Having had to deal with this... by PhrostyMcByte · 2014-01-25 08:49 · Score: 1
  
  Same. It was about 3hr before Gmail was up and running 100% for us.
5. Re:Having had to deal with this... by mattack2 · 2014-01-27 13:05 · Score: 1
  
  http://en.wikipedia.org/wiki/B...
[Shudder...] by jeffb+(2.718) · 2014-01-25 06:36 · Score: 5, Interesting

I was remembering an SF short-short that had someone asking the first intelligent computer, "Is there a God"? The computer, after checking that its power supply was secure, replied: "NOW there is".
Apparently, though, it was a second-hand misquote of this Frederic Brown story.
1. Re:[Shudder...] by the+eric+conspiracy · 2014-01-25 07:10 · Score: 4, Interesting
  
  Cool.
  On a slightly more optimistic note is Asimov's "The Last Question", another computer as God story.
  http://www.thrivenotes.com/the...
2. Re:[Shudder...] by jeffb+(2.718) · 2014-01-27 04:24 · Score: 1
  
  It's actually a fairly common theme. I immediately think of the Eschaton from Stross' Singularity Sky. As a counterpoint, there's Niven's "Schumann computer", which merely got smart enough to satisfy its own curiosity.
Re:Well congratulations by Anonymous Coward · 2014-01-25 06:41 · Score: 5, Insightful

The clever part is that it automatically recovered; that means that their monitoring, performance metrics and configuration management systems are very tightly integrated. Most importantly, it means they are trusted; I've worked at three different places now on things like configuration management and monitoring, and I've never once seen anywhere that approached that level of reliability. It's something to aim for.
Re:Well congratulations by 93+Escort+Wagon · 2014-01-25 06:42 · Score: 1, Funny

Give Google a little credit (but not too much please). If they were Apple they'd have already patented it.
Whereas Google would just look for a small company holding a relevant patent, then buy it.

--
#DeleteChrome
Re:Well congratulations by ColdWetDog · 2014-01-25 06:49 · Score: 2

The clever part is that it automatically recovered; that means that their monitoring, performance metrics and configuration management systems are very tightly integrated. Most importantly, it means they are trusted; I've worked at three different places now on things like configuration management and monitoring, and I've never once seen anywhere that approached that level of reliability. It's something to aim for.
"Skynet was originally activated (incorrect historical reference here) on August 4, 1997 (OK, so the date is wrong), at which time it began to learn at a geometric rate. On August 29, it gained self-awareness,[1] and the panicking operators, realizing the extent of its abilities, tried to deactivate it. Skynet perceived this as an attack and came to the conclusion that all of humanity would attempt to destroy it. To defend humanity from humanity,[2] Skynet launched nuclear missiles under its command at Russia."

--
Faster! Faster! Faster would be better!
and.. by Connie_Lingus · 2014-01-25 06:51 · Score: 1

"Engineers were still debugging 12 minutes later when the same system, having automatically cleared the original error, generated a new correct configuration at 11:14 a.m. and began sending it.."
along with the message "Skynet has gained self-awareness at 02:14 GMT"

--
never bring a twinkie to a food fight.
Re:Well congratulations by icebike · 2014-01-25 06:54 · Score: 3, Interesting

On recovering by using the "last known good" configuration. What wizardry!
I expect we'll be seeing the Google patent application on that shortly </sarcasm>
In other words: They still have no clue what happened, because the system in question "fixed itself".
Sounds a lot like a BGP routing mishap problem rather than anything to do with Google's actual server farms.
The lack of specificity suggests they still haven't got much of a clue. I suspect they were pwned by someone
watching them brag on reddit, and decided it was time for a lesson in humility.

--
Sig Battery depleted. Reverting to safe mode.
Re:Well congratulations by icebike · 2014-01-25 06:56 · Score: 1

Not that clever.
Sort of what you expect, of a company that big, other than that bit of going down in the first place.

--
Sig Battery depleted. Reverting to safe mode.
So What? by Jane+Q.+Public · 2014-01-25 06:57 · Score: 3, Informative

"... a bug that caused a system that creates configurations to send a bad one..."
So... an automatic system created an error, then an automated system fixed it.

In this particular case, then, it would have been better if those automated systems hadn't been running at all, yes?
1. Re:So What? by zacherynuk · 2014-01-25 07:24 · Score: 1
  
  The worry could be that an automated system DIDN'T TEST before rolling out the problem. Or at least didn't seem to wait long enough between staggered rollouts to spot the problem.
  
  Just me or is this happening more frequently ?
2. Re:So What? by QilessQi · 2014-01-25 07:27 · Score: 5, Informative
  
  No. Those automated systems enable a small number of human beings to administer a large number of servers in a consistent, sanity-checked, and monitored manner. If Google didn't have those automated systems, every configuration change would probably involve a minor army of technicians performing manual processes: slowly, independently, inconsistently and frequently incorrectly.
  I work on a large, partially public-facing enterprise system. Automated deployment, fault detection, and rollback/recovery make it possible for us to have extremely good uptime stats. The benefits far outweigh the costs of the occasional screwup.
  
  --
  Koans and fables for the software engineer
3. Re:So What? by Solandri · 2014-01-25 08:33 · Score: 2
  
  So... an automatic system created an error, then an automated system fixed it.
  
  The real fun starts when the first automatic system insists the change it created wasn't an error, and that in fact the "fix" created by the second automatic system is an error. The second system then starts arguing about all the problems caused by the first change, the first system argues how the benefits are worth the additional problems, etc. Eventually the exchange ends up with one system insulting the other system's programmer, and the other invoking an analogy to Hitler.
  
  When that happens, then we can sit back and marvel at our own creation.
4. Re:So What? by Jane+Q.+Public · 2014-01-25 12:17 · Score: 1
  
  "No. Those automated systems enable a small number of human beings to administer a large number of servers in a consistent, sanity-checked, and monitored manner. If Google didn't have those automated systems, every configuration change would probably involve a minor army of technicians performing manual processes: slowly, independently, inconsistently and frequently incorrectly."
  Quote self:
  
  "In this particular case..."
  I wasn't talking about the general case.
5. Re:So What? by Jane+Q.+Public · 2014-01-25 12:22 · Score: 1
  
  "The real fun starts when the first automatic system insists the change it created wasn't an error..."
  The Byzantine Generals problem. It has been shown that this problem is solvable with 3 "Generals" (programs or CPUs) as long as their communications are signed.
6. Re:So What? by QilessQi · 2014-01-25 13:10 · Score: 1
  
  Well, that's sort of like saying, "I developed lupus* at age 40, so in this particular case it would have been better if I didn't have an immune system at all." I'm not sure a doctor would agree.
  * Lupus is an auto-immune disease, where your immune system gets confused and attacks your body**.
  ** "It's never lupus."
  
  --
  Koans and fables for the software engineer
7. Re:So What? by drinkypoo · 2014-01-25 14:05 · Score: 1
  
  I wasn't talking about the general case.
  Neither was the responding commenter. See, this particular case wouldn't exist at all without such automated systems, because the system is too complex to exist without them.
  
  --
  "You're right," Fisheye says. "I should have set it on 'whip' or 'chop.'"
8. Re:So What? by kasperd · 2014-01-26 03:49 · Score: 1
  
  It has been shown that this problem is solvable with 3 "Generals"
  Correction. It has been shown that in case of up to t errors, it can be solved with 3t+1 Generals/nodes/CPUs/whatever. So if you assume 0 errors, you need only 1 node. If you want to handle 1 error, you need 4 nodes. There is a different result if you assume a failing node stops communicating and never sends an incorrect message; in that case you only need 2t+1 nodes. However, that assumption is unrealistic, and the Byzantine problem explicitly deals with nodes deliberately sending false messages.
  
  --
  
  Do you care about the security of your wireless mouse?
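To make the arithmetic in the comment above concrete, here is a small sketch of the two bounds it states: 3t+1 nodes when up to t nodes may send arbitrary false messages, and 2t+1 when failed nodes only stop communicating. The function names are illustrative and not taken from any library, and the sketch only encodes the bounds as the commenter gives them; it does not settle the disagreement about signed messages elsewhere in this sub-thread.

```python
def min_nodes_byzantine(t: int) -> int:
    """Nodes needed to tolerate t arbitrarily misbehaving nodes: 3t + 1."""
    if t < 0:
        raise ValueError("t must be non-negative")
    return 3 * t + 1


def min_nodes_crash_only(t: int) -> int:
    """Nodes needed when failed nodes merely go silent: 2t + 1."""
    if t < 0:
        raise ValueError("t must be non-negative")
    return 2 * t + 1


def max_byzantine_faults(n: int) -> int:
    """Largest t satisfying n >= 3t + 1 for a cluster of n nodes."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return (n - 1) // 3


if __name__ == "__main__":
    for t in range(3):
        print(t, min_nodes_byzantine(t), min_nodes_crash_only(t))
```

Under these bounds a 3-node cluster tolerates no misbehaving node at all, which is the point being argued: `max_byzantine_faults(3)` is 0, while `max_byzantine_faults(4)` is 1.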
9. Re:So What? by Jane+Q.+Public · 2014-01-26 07:52 · Score: 1
  
  Whoosh.
  
  No. The point was that it was an automatic system that caused the problem in the first place. If an automatic system hadn't caused THIS PARTICULAR problem, then an automated system to fix it would not have been necessary.
  
  It's more like saying, "If Lupus *didn't* exist, I wouldn't need an immune system."
10. Re:So What? by Jane+Q.+Public · 2014-01-26 07:59 · Score: 1
  
  "Neither was the responding commenter."
  Yes, he/she was:
  
  "Those automated systems enable a small number of human beings to administer a large number of servers in a consistent, sanity-checked, and monitored manner. If Google didn't have those automated systems..."
  "Those automated systems" and "a consistent, sanity-checked, monitored manner" are statements about the general case. "Those" and "consistent" denote plurality.
  
  "See, this particular case wouldn't exist at all without such automated systems..."
  That was part of MY point.
  
  I disagree that they would not exist. Although it's true they might be less problematic this way. Remember that every phone call in the United States used to go through switchboards with human-operated patch panels. It might be primitive, and it might be error-prone, but it did work. Most of the time.
11. Re:So What? by drinkypoo · 2014-01-26 08:04 · Score: 1
  
  I disagree that they would not exist. Although it's true they might be less problematic this way. Remember that every phone call in the United States used to go through switchboards with human-operated patch panels.
  Yeah, what was the total call load then? Now compare that to the number of servers which make up google, and how many requests each serves per second or whatever unit of time you like best. You just can't manage that many machines without automation, not if you want them to behave as one.
  
  --
  "You're right," Fisheye says. "I should have set it on 'whip' or 'chop.'"
12. Re:So What? by Jane+Q.+Public · 2014-01-26 08:10 · Score: 1
  
  "Correction. It has been shown that in case of up to t errors, it can be solved with 3t+1 Generals/nodes/CPUs/whatever."
  No, that's the situation in which messages can be forged or corrupted.
  
  I was referring to the later solution using cryptographically secure signatures. This means messages (hypothetically) aren't forgeable and allow corrupted messages to be detected. A solution can be found with 3 generals, as long as only one is "disloyal" (fails) at a time.
  
  The 1/3 failures at any given time is a reasonable restriction, since a general solution for 2/3 or more failing at the same time does not exist.
13. Re:So What? by Jane+Q.+Public · 2014-01-26 08:27 · Score: 1
  
  "Yeah, what was the total call load then? Now compare that to the number of servers which make up google, and how many requests each serves per second or whatever unit of time you like best. You just can't manage that many machines without automation, not if you want them to behave as one."
  If you had enough people you could. I already stated that it would be more error-prone. And obviously at some point you would run out of enough people to field requests for other people. But the basic principle is sound... it DID work.
14. Re:So What? by kasperd · 2014-01-26 22:26 · Score: 1
  
  A solution can be found with 3 generals, as long as only one is "disloyal" (fails) at a time.
  I don't know what solution you are referring to. It has been formally proven that it is impossible. The proof goes roughly like this. If an agreement can be reached in case of 1 failing node out of 3, that implies any 2 nodes can reach an agreement without involving the third. However, it follows from this that if communication between the two good nodes is slower than communication between each good node and the bad node, the following can happen:
  
  Each good node communicates with the faulty node, and since they are 2 out of 3, they can reach an agreement without communicating with the other good node. However, the faulty node could be sending inconsistent messages to the two good nodes, leading each of the good nodes to believe an agreement has been made, but on distinct values. Two good nodes reaching a different conclusion is by definition a failure of the system.
  
  Does this sound like an unrealistic scenario? I say it is not unrealistic. First of all, this is the sort of attack you'd perform against a system with moderate protection against inconsistencies. But even without byzantine nodes, it isn't an unlikely failure scenario. First of all assume there is a network split leaving two good nodes on different sides of the split, secondly assume the third node is hosted on a virtualization platform with some automatic recovery. The network split may have separated the host system of the virtualization from the management system, the management system assumes that host is down and spawns a new copy of the node on a different host. Now that node is cloned on both sides of the network split, and each clone is talking with a different good node in your agreement protocol.
  
  The result stating that it is impossible to reach agreement with 1 out of 3 nodes failing can be generalized to talking about subsets of a larger number of nodes. If you take a large number of nodes and split them into three subsets, such that each node is in exactly one of the three subsets, then you cannot design a system, that can survive a failure of all nodes in one of those three subsets if the adversary gets to choose which subset.
  
  For example, if you have a system that can tolerate 1 failure out of 4 nodes, then you could partition the nodes into 3 subsets with 1, 1, and 2 nodes in each subset. The adversary then picks the subset with 2 nodes, and the system fails. This result about subsets can be used to directly show the 3t+1 bound.
  
  --
  
  Do you care about the security of your wireless mouse?
Re:Well congratulations by Anonymous Coward · 2014-01-25 06:59 · Score: 5, Insightful

If you haven't met a system that takes less than of the order of tens of minutes to recover from a configuration error, you have worked in some shitty places.
Once again: automatically recover. Any human can notice a problem and revert a config; it takes a hell of a lot of infrastructure and clever infrastructure to have the system do it itself. I'm not surprised Google have solved it; it is, at its core, a data problem.
Re:Well congratulations by Bengie · 2014-01-25 07:00 · Score: 1, Flamebait

I have the same feeling about NASA. Big whoop, right? Just mediocre at best.
Overconfidence by gmuslera · 2014-01-25 07:06 · Score: 1

They are using systems whose behavior not even their engineers can predict. Sometimes our natural stupidity gives too much credit to artificial intelligence. Without something as hard to define as common sense, reacting correctly to the unexpected still seems to belong to the human realm.
Singularity by Chemisor · 2014-01-25 07:11 · Score: 1

Obviously, Google has reached the singularity point. Its software is doing something magical to fix itself that no puny human can understand.
Arsonist claiming to be the hero firefighter by JoeyRox · 2014-01-25 07:25 · Score: 3, Interesting

They make it sound like their system is all-self-correcting. In reality it's probably a specific area they've had bugs with in the past and they put in a failsafe rollback mechanism to prevent future regressions.
Re:Well congratulations by phantomfive · 2014-01-25 07:35 · Score: 2

I've never once seen anywhere that approached that level of reliability.
That's not reliability, it's automatic repair. Plenty of places do various levels of manual/automatic testing after they roll out an update, and it works just as well (if not better). The novel thing here is the degree to which it is automated, that's unusual.

It's also a single point of failure, apparently. Which means they have no chance at claiming their services are High Availability. Although I'm not sure if that is their goal. Ideally they would have multiple systems, so if the configuration failed on one, the system would automatically fail over to another. Google does have that kind of redundancy for some faults, but clearly here they have found a hole in their system, a single point of failure.

--
"First they came for the slanderers and i said nothing."
Re:Well congratulations by Anonymous Coward · 2014-01-25 07:40 · Score: 5, Informative

That "hell of a lot of infrastructure" just takes CFEngine/Puppet, a version control system (git, svn, whatever), Nagios, and a fairly simple shell script.
Haha. Hahaha. HAHAHAHAHAHA. Oh God, please tell me you don't actually believe that?

You need reliable monitoring.
Reliable monitoring is fucking difficult.
Show me a Nagios installation and I'll likely show you one with hundreds of spurious alerts, masses of long-lived Criticals and lots of "Oh we don't know why it keeps doing that, it just does, don't worry about it."

You also need full coverage (Damn near 100%) configuration management.
Full coverage configuration management is fucking difficult.
Show me a configuration management deployment and I'll show the snowflakes and edge cases and old applications and "Oh yeah well we only have like three of those so it's not worth the effort".

I've come close to that level of coverage (both configuration management and monitoring) but it was only ~400 machines (a mix of physical and virtual instances). Doing it at 60k servers is an inordinate task, and I'd suggest you've never actually tried anything like it if you honestly think that all it takes is "a fairly simple shell script".
Re:Well congratulations by Anonymous Coward · 2014-01-25 08:02 · Score: 5, Funny

Yeah that totally must be it. Me, the guys who write configuration management tools who'll tell you how hard it is (and sell you consultancy to try to make it slightly less hard) and the guys who write monitoring tools who'll tell you how hard it is (and sell you consultancy to try to make it slightly less hard). All those guys from companies like Facebook and Google who give talks at conferences about how difficult it is. We all suck at it and don't know what we're talking about. If only we'd listened to Slashdot, all our troubles would be but a dream.
Jim Gray is looking down by david.emery · 2014-01-25 08:20 · Score: 1

and smiling... http://en.wikipedia.org/wiki/J...
Does this count as a Heisenfix?
It's not clever unless it also doesn't melt down by JoeMerchant · 2014-01-25 08:50 · Score: 4, Insightful

What's really clever here is that they trust the automatons to make the corrections without human intervention, and the automatons haven't caused a horrible feedback loop meltdown of the system.
It's not quite rocket science, but those kinds of self-correcting systems have just as much potential to screw themselves up as they do to fix themselves.
Re:mm.. Thats what happened. by zippthorne · 2014-01-25 08:53 · Score: 1

I fail to see how that's a thing on slashdot.

--
Can you be Even More Awesome?!
Re:Well congratulations by sjames · 2014-01-25 09:03 · Score: 3, Interesting

It's not unlike the old trick of setting a machine to reboot in 10 minutes, manually changing the network settings, then canceling the reboot if you can still communicate (and the settings revert on reboot if you cannot). Of course, Google did it on a much larger scale.
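A minimal sketch of the trick described above, assuming an in-memory settings dictionary in place of real network settings and a timer in place of the scheduled reboot. The class name `ConfirmedChange` and its methods are hypothetical, made up for illustration.

```python
import threading


class ConfirmedChange:
    """Apply a setting now; revert it unless confirm() is called in time."""

    def __init__(self, settings: dict, key: str, new_value, timeout_s: float):
        self._settings = settings
        self._key = key
        self._old_value = settings[key]
        self._lock = threading.Lock()
        self._state = "pending"  # becomes "confirmed" or "reverted"
        settings[key] = new_value
        # The timer stands in for the reboot scheduled before the change.
        self._timer = threading.Timer(timeout_s, self._revert)
        self._timer.daemon = True
        self._timer.start()

    def _revert(self) -> None:
        with self._lock:
            if self._state == "pending":
                self._settings[self._key] = self._old_value
                self._state = "reverted"

    def confirm(self) -> bool:
        """Cancel the pending revert. Returns False if the revert already fired."""
        with self._lock:
            if self._state == "pending":
                self._timer.cancel()
                self._state = "confirmed"
            return self._state == "confirmed"

    def wait(self) -> None:
        """Block until the timer has either fired or been cancelled."""
        self._timer.join()


if __name__ == "__main__":
    net = {"address": "10.0.0.1"}
    change = ConfirmedChange(net, "address", "10.0.0.2", timeout_s=0.1)
    change.wait()  # nobody confirmed, so the old address comes back
    print(net)
```

The operator only calls `confirm()` over the new connection, so a change that cuts off access can never be confirmed and is undone when the timer expires.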
Re:Well congratulations by Anonymous Coward · 2014-01-25 09:05 · Score: 1

Internally the exact problems are known and were identified quickly. Announcing the internal details and system code names to the world makes no sense. It was not BGP or anything related to routing. Nor was it an external attack. Not that this will stop you from speculating.
Re:Well congratulations by Nerdfest · 2014-01-25 09:08 · Score: 1

Most people that claim high availability almost *never* make any changes to anything. The mainframe world is rife with resistance to change because of it. High availability is easy if you never change anything. Most of the outages with most systems are caused by human error, and most happen when deploying updates. High availability seems to carry a lot of weight, but usually doesn't cover all it should.
Re:Well congratulations by phantomfive · 2014-01-25 09:15 · Score: 3, Funny

"Our system is high-availability, it can return 404s all day for decades without going down"

--
"First they came for the slanderers and i said nothing."
Re:Well congratulations by citizenr · 2014-01-25 09:51 · Score: 1

One of the ways to get promotion at Google is finding a way of automating your current position.

--
Who logs in to gdm? Not I, said the duck.
Re:mm.. Thats what happened. by egcagrac0 · 2014-01-25 10:12 · Score: 1

You immediately lose credibility by using "pseudocode" in human conversation.
Actually, the opposite - the MBA types who say please would need to first perform a lookup of the name of the lesser person who deals with the non-core business that usually just costs money and doesn't work right.
Re:Well congratulations by Anonymous Coward · 2014-01-25 10:22 · Score: 3, Funny

Careful. Only the advice of Anonymous Cowards is trustworthy. All the other people on Slashdot are not to be trusted. After all, they are not even able to find out how to post anonymously! ;-)
ONE HOUR? by Lisias · 2014-01-25 10:56 · Score: 2

BULLSHIT.
I was experiencing problems for something like 8 to 10 hours before the services were fully restored.

--
Lisias@Earth.SolarSystem.OrionArm.MilkyWay.Local.Virgo.Universe.org
Re:Well congratulations by radarskiy · 2014-01-25 11:19 · Score: 1

Did you try turning the internet off and on again?
Re:mm.. Thats what happened. by 140Mandak262Jamuna · 2014-01-25 11:33 · Score: 1

It was very lame anyway. Regret posting it.

--
sed -e 's/Chuck Norris/Rajnikant/g' joke > fact
Re:Well congratulations by faedle · 2014-01-25 11:43 · Score: 2

Nagios can be built and designed in such a way that there are no false criticals and few spurious alerts, but it requires dedication, documentation, and attention to detail. Most Nagios installations I've run across are built and maintained by people who often lack one (or more) of these three traits, or are a single-man IT operation that can never devote the time or resources to doing it properly.
I have seen systems of Nagios and Zenoss (and a few others) that are devastatingly precise, accurate, and timely. However, they were typically set up by a highly dedicated TEAM of sysadmins whose entire job for the organizations they work for is managing the tactical systems. It's a full-time job in and of itself, and not one that many organizations really devote the manpower to do "right." They do it just "good enough", which is why you are used to seeing the installations you are seeing.
Google's exactly the kind of organization that has the man- and brain-power to do it right. And it's not really that hard; it's mostly just simple attention to detail. And that's a trait I've found is lacking in a lot of the current crop of junior system administrators I've run across.
Re:Your are a ... by Anonymous Coward · 2014-01-25 12:22 · Score: 1

LOL!!!
Re:Well congratulations by murdocj · 2014-01-25 12:34 · Score: 2

On recovering by using the "last known good" configuration. What wizardry!
I expect we'll be seeing the Google patent application on that shortly </sarcasm>
I find it interesting that they just deploy new configurations live without going to a test environment
Captcha? by Sandman1971 · 2014-01-25 15:36 · Score: 1

I wonder if this is at all related to their Captcha outage on the 22nd. I still haven't heard a peep as to what caused the outage, or even an acknowledgement that there was even an outage, even though the captcha group was filled with sysadmins complaining about captcha being down.

--
It's better to burn out than to fade away
More likely case by rekoil · 2014-01-25 20:00 · Score: 1

What's more likely - I've run into exactly this scenario before, in fact - is that the configuration generation system regenerates configs on a regular schedule, and at one point encountered a failure or spurious bug that caused it to push an invalid config. On the next run - right as the SREs started poking around - the generator ran again, the bug wasn't encountered, and it generated and pushed a correct config, clearing the error and allowing apps to recover.
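The scenario in the comment above can be simulated in a few lines. This is a sketch under stated assumptions: `flaky_generator` stands in for a generator with a transient bug, and `is_valid` is a hypothetical sanity check; nothing here reflects how Google's system actually works.

```python
from typing import Callable, List, Optional

Config = dict


def is_valid(config: Optional[Config]) -> bool:
    """Hypothetical sanity check: a usable config lists at least one backend."""
    return bool(config) and bool(config.get("backends"))


def flaky_generator(run: int) -> Config:
    """Emit an empty backend list on run 1 only, mimicking a transient bug."""
    if run == 1:
        return {"backends": []}
    return {"backends": ["a", "b"]}


def run_cycles(
    generate: Callable[[int], Config], cycles: int, validate: bool
) -> List[Optional[Config]]:
    """Run the scheduled generator and record the live config after each run."""
    live: Optional[Config] = None
    history: List[Optional[Config]] = []
    for run in range(cycles):
        candidate = generate(run)
        if validate and not is_valid(candidate):
            pass  # keep serving the last known good config
        else:
            live = candidate
        history.append(live)
    return history


if __name__ == "__main__":
    for gate in (False, True):
        states = [is_valid(c) for c in run_cycles(flaky_generator, 3, gate)]
        print("validate =", gate, states)
```

Without the validation gate, the bad config goes live for one cycle and the next scheduled run clears it with no human involved, which matches the timeline in the story. With the gate, the bad candidate is never pushed at all.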
Re:Well congratulations by gmeb · 2014-01-25 22:39 · Score: 1

If you're really convinced it's so easy, you must have implemented it yourself before. So please provide an example or quit trolling.
I myself have worked on EMC Centera in the past, and monitoring a cluster and recovering automatically from errors is no trivial task.

--
The angry man always thinks he can do more than he can. -- Albertano of Brescia
Re:Well congratulations by kasperd · 2014-01-26 03:32 · Score: 1

I find it interesting that they just deploy new configurations live without going to a test environment
Google does test changes in a test environment first. But you can never find every problem in a test environment. Some bugs depend on specific patterns in data and thus show up in production, even if everything worked just fine in testing. Once you know the exact pattern that reproduces the bug, you can add it to your test data. But you couldn't test for it before knowing about it. If a human were able to think of every possible scenario, there wouldn't be bugs in the first place.

Assuming that just because it worked in testing, it will also work in production, would be stupid. Automatically reverting to the previously working configuration if a system starts failing within the first five minutes after updating to a new configuration isn't rocket science. But doing it at scale isn't entirely trivial either.
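
A minimal single-machine sketch of that kind of automatic revert, assuming a caller-supplied `healthy` check and a five-minute window. The function name and all parameters are hypothetical, and the hard part (doing it at scale) is deliberately left out.

```python
import time
from typing import Callable, Dict

Config = Dict[str, str]


def apply_with_rollback(
    current: Config,
    candidate: Config,
    healthy: Callable[[Config], bool],
    window_seconds: float = 300.0,
    poll_seconds: float = 5.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Config:
    """Activate candidate; return whichever config is live once the window ends."""
    deadline = clock() + window_seconds
    while clock() < deadline:
        if not healthy(candidate):
            return current  # failure inside the window: revert to last known good
        sleep(poll_seconds)
    return candidate  # survived the whole window, so it stays live
```

The `clock` and `sleep` parameters exist only so the window can be simulated in a test instead of waiting five real minutes.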

I have worked at Google in the past, so I know which system this might have been. But realistically, at the speed Google is moving, the system I know from back then would probably have been replaced a couple of times since then. And I know of some sorts of conditions that trigger so rarely that you are never going to catch them in testing. I have had to debug a bug that happened only one time out of about 10^12, because it required unusual behaviour from the hardware to trigger it.

--

Do you care about the security of your wireless mouse?
Re: Well congratulations by kasperd · 2014-01-26 03:39 · Score: 1

I find it interesting that they're deploying major changes during prime time.
There are advantages to deploying changes while nobody is using the system. In the case of Google, such a time does not exist. There are also advantages to deploying changes during prime time. First of all, some problems only show up during peak load. If you deploy outside of peak load, and nothing bad happens right away, there is still a risk the system might break the next time peak load hits. If the problem doesn't hit you right after making the change, but at a later time, it will not always be clear what hit you.

There are multiple arguments for and against. The argument that eventually decided what to do was likely this: It is an advantage to deploy changes while the people who know how to fix problems are at work.

--

Do you care about the security of your wireless mouse?
Re:Well congratulations by Barryke · 2014-01-26 22:48 · Score: 1

I believe they designed their servers and integrated some smart software to be able to do that with great performance.
But you can duct-tape this kind of recovery onto commodity servers at a rudimentary but very effective level if they boot via PXE/TFTP, with each server getting one configuration channel for which you could quite simply have a fallback.
I imagine rollout scripts would first check whether rollout to a test subset is successful before continuing with all production servers. I speculate this article might just be about this subset, with the story being spiced/beefed up in spite of more exciting/serious errors at server heaven.
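
A rough sketch of such a rollout script, assuming caller-supplied `push` and `healthy` callbacks. These names and the canary count are hypothetical; this illustrates the guess above, not anything known about Google's tooling.

```python
from typing import Callable, List


def staged_rollout(
    servers: List[str],
    push: Callable[[str], None],
    healthy: Callable[[str], bool],
    canary_count: int = 2,
) -> List[str]:
    """Push to a small test subset first; return the servers that got the config."""
    canaries = servers[:canary_count]
    rest = servers[canary_count:]
    for server in canaries:
        push(server)
    if not all(healthy(server) for server in canaries):
        return canaries  # stop here: the remaining servers keep their old config
    for server in rest:
        push(server)
    return canaries + rest
```

Note that this sketch only stops the rollout; reverting the test subset itself would be a separate step.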

--
Hivemind harvest in progress..
Re:Well congratulations by mcgrew · 2014-01-27 06:12 · Score: 1

The clever part is that it automatically recovered
It wasn't just a Gmail problem, or there was a huge coincidence. I tried to look something up on Google on my phone Friday around then, got a 404 and the phone rebooted itself (Android).

--
Free Martian Whores!
Re:Well congratulations by Cramer · 2014-01-27 11:11 · Score: 1

Actually, it's not... push new config, test service availability and functionality, revert to previous config if anything isn't correct. It's the exact same thing engineers have been doing for decades: reload in 15min; make changes; if your changes foobar things, get a cup of joe and wait for the reload. (For some carrier-grade systems (Nortel), it's automatic: if you don't commit your changes, they automatically revert after some time; I forget the period.)
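
The commit-or-revert behaviour can be sketched as a small state holder, assuming a 15-minute (900 second) timeout to match the reload example. The class and method names are hypothetical and do not model any particular vendor's CLI.

```python
from typing import Dict, Optional

Config = Dict[str, str]


class ConfirmedConfig:
    """Holds a live config; uncommitted changes revert after timeout_seconds."""

    def __init__(self, initial: Config, timeout_seconds: float = 900.0) -> None:
        self.committed = dict(initial)
        self.live = dict(initial)
        self.timeout_seconds = timeout_seconds
        self._deadline: Optional[float] = None

    def change(self, new_config: Config, now: float) -> None:
        """Apply a change immediately and start the revert countdown."""
        self.live = dict(new_config)
        self._deadline = now + self.timeout_seconds

    def commit(self) -> None:
        """Keep the live config and cancel the pending revert."""
        self.committed = dict(self.live)
        self._deadline = None

    def tick(self, now: float) -> None:
        """Revert to the last committed config once the deadline has passed."""
        if self._deadline is not None and now >= self._deadline:
            self.live = dict(self.committed)
            self._deadline = None
```

If the change locks the engineer out, nobody calls `commit()`, and the next `tick()` after the deadline restores the previous config.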