Making Facebook Self Healing
New submitter djeps writes "I used to achieve some degree of automated problem resolution with Nagios Event Handler scripts and RabbitMQ, but Facebook has done it on a far larger scale than anything I managed in my sysadmin days. Quoting: 'When your infrastructure is the size of Facebook's, there are always broken servers and pieces of software that have gone down or are generally misbehaving. In most cases, our systems are engineered such that these issues cause little or no impact to people using the site. But sometimes small outages can become bigger outages, causing errors or poor performance on the site. If a piece of broken software or hardware does impact the site, then it's important that we fix it or replace it as quickly as possible. ... We had to find an automated way to handle these sorts of issues so that the human engineers could focus on solving and preventing the larger, more complex outages. So, I started writing scripts when I had time to automate the fixes for various types of broken servers and pieces of software.'"
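For anyone who hasn't written this kind of thing, here is a rough sketch of the general pattern the summary describes: map a known problem to a scripted fix, and hand it to a human once the fix stops working. This is not Facebook's system or the Nagios event handler API. The `Remediator` class, its method names, and the problem labels are all hypothetical, made up purely for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

Key = Tuple[str, str]  # (host, problem); both labels are hypothetical


@dataclass
class Remediator:
    """Toy dispatcher: try a scripted fix, escalate when out of options."""
    max_attempts: int = 2
    handlers: Dict[str, Callable[[str], bool]] = field(default_factory=dict)
    attempts: Dict[Key, int] = field(default_factory=dict)
    escalated: List[Key] = field(default_factory=list)

    def register(self, problem: str, fix: Callable[[str], bool]) -> None:
        # A fix takes a host name and returns True if the problem went away.
        self.handlers[problem] = fix

    def handle(self, host: str, problem: str) -> str:
        key = (host, problem)
        fix = self.handlers.get(problem)
        # No known fix, or the fix has already failed too often:
        # stop retrying and leave it for a human.
        if fix is None or self.attempts.get(key, 0) >= self.max_attempts:
            self.escalated.append(key)
            return "escalated"
        self.attempts[key] = self.attempts.get(key, 0) + 1
        if fix(host):
            self.attempts.pop(key, None)  # reset the counter on success
            return "fixed"
        return "retry"
```

In use, you would register something like `r.register("service_down", restart_service)` and call `r.handle("web1", "service_down")` from whatever monitoring hook fires on an alert. The attempt cap is the important part: without it, a fix that never works would loop forever and the real problem would never reach a person.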
How are we supposed to kill it if it's self-healing? Now it will never die!
"None can love freedom heartily, but good men; the rest love not freedom, but license." --John Milton
I disagree. Larger outages in an infrastructure like Facebook's are only rarely an accumulation of smaller issues. Think about it: what's a more likely scenario for a major site-wide issue, thousands of web servers whose hard drives die simultaneously, or a flapping route caused by a configuration issue on a router?
Think of it like your body: every day, you suffer thousands of tiny injuries and insults that your immune system and skin deal with and that you never know about. This frees you up to drive yourself to the doctor if you notice a lingering cough or to call an ambulance if you sever a limb. You wouldn't argue against an immune system because it might hide larger issues from conscious attention, would you?
If I wanted a sig I would have filled in that stupid box.