How Google Routes Around Outages
1sockchuck writes "Making changes to Google's search infrastructure is akin to 'changing the tires on a car while you're going at 60 down the freeway,' according to Urs Holzle, who oversees the company's massive data center operations. In a Q-and-A with Data Center Knowledge, Holzle discusses Google's infrastructure, how it has engineered its system to route around hardware failures, and how it responds when something goes awry. These updates usually go unnoticed, but during system maintenance last month a software bug triggered an outage for Gmail."
Was it just me or did anyone else spend a few minutes contemplating how you actually could make a car that did allow you to change a flat while moving?
It just treats the damage as censorship and routes around it, right?
For those looking for a more in-depth description, check out the technical paper on the Google File System:
http://labs.google.com/papers/gfs.html
Had to read it for a search engines course in college; it's pretty darn spiffy.
Excellent use of the car analogy, especially since it is possible to change a tire while driving a car. YouTube video at 1:48.
Slightly..ahem... OT so posting anon.
Analyzing the reason first is a very good step. It could have saved me two hours today :/
Extreme Programming - Redundant Array of Inexpensive Developers
You know, the article read like a press release. Hasn't Slashdot whored itself out enough lately on these kinds of things? Google is so ultra-reliable, blah blah, 24x7, blah blah, commitment, blah blah, premier service partner, blah blah... I get that kind of talk enough in staff meetings. Where's the meat already!?
Why not write an article with some nice graphics saying what happens to my request from the time I hit "Search" to the time I click a result. List off all the servers it goes through, their roles, how they're monitored, etc. Give examples of failure and show the mode decisions the software makes (and where this software is running) -- show the latencies and other performance impacts as my request bounces over failure after failure. That's what I expect when I pull up an article entitled "How Google Routes Around Outages". Something useful, professionally enriching, intellectually stimulating, etc. In short, tell me why I (should) never see a "500 Internal Server Error" from Google, but I do from just about every other major website I've used.
#fuckbeta #iamslashdot #dicemustdie
The key point:
When they get an outage, they check how it was caught, and if it wasn't caught automatically, they figure out how to catch it automatically next time. Simple rule: they learn from their mistakes and don't put all their eggs in one basket.
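For anyone who wants the "don't put all their eggs in one basket" part spelled out, here is a minimal sketch of failing over across redundant replicas. To be clear, this is not Google's code or anything described in the interview: `query_with_failover`, `AllReplicasFailed`, and the two toy replicas are hypothetical names made up for illustration, and a real system would add health checks, timeouts, and monitoring on top.

```python
from typing import Callable, List, Sequence


class AllReplicasFailed(Exception):
    """Raised when every replica in the list failed (hypothetical name)."""


def query_with_failover(replicas: Sequence[Callable[[str], str]], request: str) -> str:
    """Try each replica in order and return the first successful answer."""
    errors: List[Exception] = []
    for replica in replicas:
        try:
            return replica(request)
        except Exception as exc:
            # Remember the failure and route around it by trying the next replica.
            errors.append(exc)
    raise AllReplicasFailed(f"{len(errors)} replica(s) failed for {request!r}")


# Toy replicas, purely illustrative assumptions.
def broken_replica(request: str) -> str:
    raise ConnectionError("replica down")


def healthy_replica(request: str) -> str:
    return f"results for {request}"


if __name__ == "__main__":
    print(query_with_failover([broken_replica, healthy_replica], "slashdot"))
```

The caller never sees the first replica's failure; it only gets an error if every basket is empty at once.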
+1 IDisagreeSoHeMustBeATrollOrAnAstroturferOrAShill
akin to 'changing the tires on a car while you're going at 60 down the freeway,'
This is not so hard. Just design the car with 4 axles instead of 2 and lift one off the road at a time. Helps if it can swivel for easy access to the lugnuts.
Moderating "-1, Disagree" is simple censorship. Have the guts to post your opinion.
Isn't this how the *internet* is (at least in theory) supposed to work anyhow? Instead, we have 90% of the cables that connect the Middle East and Europe running through the same canal. And I know of VERY few ISPs who actually make their systems redundant anymore. /sadface
OK, granted they are not travelling at 60 mph, but this is still pretty impressive. I consider this on-topic, because maybe it is possible to do what the summary suggests (replace a wheel on a moving car). :)
Watch from 1:55 to 2:35:
YouTube video of guys replacing a wheel on a car while it is moving.
New webcomic updated on Sundays: HERE