Making Facebook Self Healing
New submitter djeps writes "I used to achieve some degree of automated problem resolution with Nagios Event Handler scripts and RabbitMQ, but Facebook has done it on a far larger scale than my old days of sysadmin. Quoting: 'When your infrastructure is the size of Facebook's, there are always broken servers and pieces of software that have gone down or are generally misbehaving. In most cases, our systems are engineered such that these issues cause little or no impact to people using the site. But sometimes small outages can become bigger outages, causing errors or poor performance on the site. If a piece of broken software or hardware does impact the site, then it's important that we fix it or replace it as quickly as possible. ... We had to find an automated way to handle these sorts of issues so that the human engineers could focus on solving and preventing the larger, more complex outages. So, I started writing scripts when I had time to automate the fixes for various types of broken servers and pieces of software.'"
We had to find an automated way to handle these sorts of issues so that the human engineers could focus on solving and preventing the larger, more complex outages.
This seems backwards to me. Surely the "larger, more complex outages" are caused by an accumulation of, or interaction between, the smaller, less complex problems/situations. If all of the smaller problems are well understood and dealt with, then those more complex problems should not arise. I think it's dangerous to assume that, because the smaller problems can be transiently resolved by a script with minimal human intervention, the underlying causes need less exploration. Sure, scripts to handle the less complex issues are great, but this should not shift the attention of the human engineers to "solving and preventing complex outages" alone; solving those often (always?) means solving the less complex issues.
How are we supposed to kill it if it's self-healing? Now it will never die!
"None can love freedom heartily, but good men; the rest love not freedom, but license." --John Milton
If it were true, Facebook would just self-destruct.
"Today, the FBAR service is developed and maintained by two full time engineers, but according to the most recent metrics, it’s doing the work of approximately 200 full time system administrators".
Given how glitchy Facebook was in the past, I can't help but be reminded of this comic.
...given how broken most of the site is on a daily basis.
Could they do the world a favour and write scripts to make it self-terminate instead?
I was rolling out Big Brother Network Monitor a decade ago. It was well capable of doing this.
Today, I'd use a relational database that stored output from Perl DBI cron jobs running on each machine, plus another job that checked the database and made sure everything that ought to be happening had recently reported in successfully. Anything that hadn't would trigger an email to someone to look into it.
Easy to develop, implement, extend, and maintain.
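For anyone who wants to see the shape of that, here is a minimal sketch in Python rather than Perl, using SQLite as a stand-in for the relational database. The table name `heartbeat`, the function names, and the staleness threshold are all made up for illustration; the email step is left out and would hang off whatever `overdue` returns.

```python
import sqlite3
import time


def init_db(conn):
    # One row per (host, job); hypothetical schema for this sketch.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS heartbeat ("
        "host TEXT, job TEXT, ok INTEGER, reported_at REAL, "
        "PRIMARY KEY (host, job))"
    )


def report(conn, host, job, ok, now=None):
    """Called by the cron job on each machine after it runs."""
    now = time.time() if now is None else now
    conn.execute(
        "INSERT OR REPLACE INTO heartbeat VALUES (?, ?, ?, ?)",
        (host, job, int(ok), now),
    )


def overdue(conn, expected, max_age, now=None):
    """Return the (host, job) pairs that failed, went stale, or never reported."""
    now = time.time() if now is None else now
    rows = conn.execute(
        "SELECT host, job, ok, reported_at FROM heartbeat"
    ).fetchall()
    seen = {(host, job): (ok, ts) for host, job, ok, ts in rows}
    problems = []
    for key in expected:
        ok, ts = seen.get(key, (0, None))
        if ts is None or not ok or now - ts > max_age:
            problems.append(key)
    return problems
```

The checker job would call `overdue` with the list of jobs that ought to be running and mail the result to a human if it is non-empty. Note that a machine that never reports at all is caught too, which is the case plain per-host scripts tend to miss.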
No, I don't want to connect to FB just to read the article. Post it somewhere else if you want it read.
"Tongue tied and twisted, just an Earth bound misfit
Damn, if I weren't so averse to soul-crushing rejection, I'd apply.
This guy was insightful and informative, so I believe what is quoted above.
And I'm surprised: I figured Facebook would be either more bureaucratic (like MS) or kinda dickishly autocratic (like Zuckerberg is rumoured to be).
I mean, when was the last time something on Facebook actually worked?
Auto-ticketed errors, I am amazed. If you did not detect sarcasm, please enter a problem ticket. You don't think that shit's automated, do you?
So this is basically a script that restarts dead daemons, right?
What's the difference between this and Upstart?
http://upstart.ubuntu.com/faq.html
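For reference, the "restart dead daemons" part on its own is small. Below is a rough Python sketch of a respawn loop that gives up after too many failures in a time window, at which point a human (or a smarter tool) has to take over. This is a generic illustration, not how Upstart or Facebook's tooling is implemented; `supervise`, `GaveUp`, and all parameter names are hypothetical.

```python
import time


class GaveUp(Exception):
    """Raised when the process keeps dying and respawning is pointless."""


def supervise(run, max_restarts=3, window=60.0, delay=1.0,
              clock=time.monotonic, sleep=time.sleep):
    """Call run() until it exits with 0; return the number of attempts.

    run is any callable that returns an exit code. If more than
    max_restarts failures land within `window` seconds, raise GaveUp
    instead of restarting forever.
    """
    failures = []
    attempts = 0
    while True:
        attempts += 1
        if run() == 0:
            return attempts  # clean exit, nothing left to do
        now = clock()
        # Keep only failures that are still inside the window, plus this one.
        failures = [t for t in failures if now - t <= window] + [now]
        if len(failures) > max_restarts:
            raise GaveUp(f"{len(failures)} failures within {window} seconds")
        sleep(delay)
```

In real use `run` could be something like `lambda: subprocess.call(["/path/to/daemon"])` (path is a placeholder). The interesting question in the thread is what happens after `GaveUp`: a plain respawner stops there, while a remediation system would move on to other fixes or open a ticket.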
I'm not a lawyer, but I play one on the Internet.
I was thinking more in terms of "assisted suicide".
If you want news from today, you have to come back tomorrow.
The MSP I work for has been doing this for at least 3 years. It can also call people and give them a menu of actions if such is a requirement.
Part of the reason Facebook and Google can "self heal" is because failures are mostly not noticeable by end users. If a Facebook or Google machine fails, unless you are getting a 404 or a service failure message there is little to no way for you to know that the web page you have been served up is wrong, partial or out of date. This failure ambiguity provides a lot of leeway on the methods and speed required to fix a failure.
For most other services where there is a definite correct and incorrect output - like file systems or financial services - a broken service has immediate impact and fixing it is much harder.
How come friends keep disappearing, only to send a request again and say I dropped them? Either it's buggy or broken...
Sounds more like an API built to issue service tickets. They broke down the API access by group, so the individual groups can create their own resolutions to known problems.
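If that reading is right, the core could look something like the sketch below: each group registers a fix for a problem type it owns, and anything without a working fix falls through to a ticket. This is a guess at the general pattern only, not Facebook's actual API; the class `Remediator`, its methods, and the problem names are invented for the example.

```python
class Remediator:
    """Routes known problems to the owning group's fix; tickets the rest."""

    def __init__(self):
        self._handlers = {}  # problem name -> (owner, fix function)
        self.tickets = []    # (problem, host, reason) tuples for humans

    def register(self, problem, owner):
        """Decorator a group uses to attach its fix to a problem type."""
        def decorator(fn):
            self._handlers[problem] = (owner, fn)
            return fn
        return decorator

    def handle(self, problem, host):
        entry = self._handlers.get(problem)
        if entry is None:
            self.tickets.append((problem, host, "no handler registered"))
            return "ticket"
        owner, fix = entry
        try:
            fixed = fix(host)
        except Exception as exc:
            # A crashing fix is treated like a failed fix.
            self.tickets.append((problem, host, f"{owner} handler error: {exc}"))
            return "ticket"
        if fixed:
            return "fixed"
        self.tickets.append((problem, host, f"{owner} handler could not fix it"))
        return "ticket"
```

A storage group might then write `@r.register("disk_full", owner="storage")` above a function that cleans up a host and returns `True` on success. The ticket list is what the human engineers would actually see.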
Google doesn't have nearly as many such problems. I'd think Google simply pays more for better people, rather than hiring dirt cheap PHP morons.
From the sounds of this article, Facebook and Google go about this VERY differently.
The Facebook way, it seems, is that every node in the infrastructure is possibly important. So they write and maintain all these healing scripts to deal with problems like broken processes or failed hard drives.
Google goes about the same problem in a very different way. Google's system is architected such that no node is important. Everything is massively parallel and redundant, such that you could take any server, any set of servers, even an entire data centre, blow it up with a bomb, and aside from performance issues, no one would notice.
From an admin's point of view - I would much prefer Google's system. Something doesn't look right on a box? Yank it out TOTALLY, put in a new one, investigate some other time.
I would like to suggest a subtle change to the posting system: Make it so the first post on any article cannot be done as "Anonymous Coward".
I know Slashdot has a tradition of being a "free-for-all, run through a blender" but I don't think there has ever been an AC first post that has ever been anything but either:
- So lame that you wonder how a person manages to survive such a terminal case of lack of personality or creativity... or
- There is no real reason it couldn't have been posted under a login.
Stupidity really should be viciously stamped out but if we can use automated steps to reduce the "background stupid" we can then focus more energy on more invasive cases of dumb.
s/cosmonaut/confidant/
Maybe you have confused Zuckerberg with Guy Laliberté or Mark Shuttleworth. Or perhaps with Richard Branson, who builds space tourism vehicles.
The song, however, has nothing to do with space travelers.
I have a friend who works there. According to him, Facebook devotes over 100 physical servers to every 35,000 users. That is incredibly inefficient in terms of power and hardware costs. Seems to me they just throw a lot of money around rather than coming up with elegant technical solutions.
And Google dwarfs Facebook in terms of bandwidth / number of servers / database / you-name-it.
When you've got a lot of servers, it has long been known that "failure is the norm" and that your infrastructure should be designed around that *fact*.