Making Facebook Self-Healing
New submitter djeps writes "I used to achieve some degree of automated problem resolution with Nagios Event Handler scripts and RabbitMQ, but Facebook has done it on a far larger scale than anything from my old sysadmin days. Quoting: 'When your infrastructure is the size of Facebook's, there are always broken servers and pieces of software that have gone down or are generally misbehaving. In most cases, our systems are engineered such that these issues cause little or no impact to people using the site. But sometimes small outages can become bigger outages, causing errors or poor performance on the site. If a piece of broken software or hardware does impact the site, then it's important that we fix it or replace it as quickly as possible. ... We had to find an automated way to handle these sorts of issues so that the human engineers could focus on solving and preventing the larger, more complex outages. So, I started writing scripts when I had time to automate the fixes for various types of broken servers and pieces of software.'"
I would like to suggest a subtle change to the posting system: make it so the first post on any article cannot be made as "Anonymous Coward".
I know Slashdot has a tradition of being a "free-for-all, run through a blender", but I don't think there has ever been an AC first post that was anything but either:
- So lame that you wonder how a person manages to survive such a terminal case of lack of personality or creativity... or
- Something there was no real reason not to post under a login.
Stupidity really should be viciously stamped out, but if we can use automated steps to reduce the "background stupid", we can then focus more energy on the more invasive cases of dumb.