Slashdot Mirror


Making Facebook Self Healing

New submitter djeps writes "I used to achieve some degree of automated problem resolution with Nagios Event Handler scripts and RabbitMQ, but Facebook has done it on a far larger scale than I ever did in my old sysadmin days. Quoting: 'When your infrastructure is the size of Facebook's, there are always broken servers and pieces of software that have gone down or are generally misbehaving. In most cases, our systems are engineered such that these issues cause little or no impact to people using the site. But sometimes small outages can become bigger outages, causing errors or poor performance on the site. If a piece of broken software or hardware does impact the site, then it's important that we fix it or replace it as quickly as possible. ... We had to find an automated way to handle these sorts of issues so that the human engineers could focus on solving and preventing the larger, more complex outages. So, I started writing scripts when I had time to automate the fixes for various types of broken servers and pieces of software.'"

9 of 74 comments

  1. Complexity arising from simplicity by Psychotria · · Score: 2, Insightful

    We had to find an automated way to handle these sorts of issues so that the human engineers could focus on solving and preventing the larger, more complex outages.

    This seems backwards to me. Surely the "larger, more complex outages" are caused by an accumulation of, or interaction between, the smaller, less complex problems/situations. If all of the smaller problems are well understood and dealt with, then those more complex problems should not arise. I think it's dangerous to assume that because the smaller problems can be transiently resolved by a script with minimal human intervention that the more complex problems need less exploration. Sure, scripts to handle the less complex issues are great, but this should not shift the focus of the human engineers to "focus on solving and preventing complex outages"; solving those often (always?) means solving the less complex issues.

    1. Re:Complexity arising from simplicity by aiken_d · · Score: 5, Insightful

      I disagree. Larger outages in an infrastructure like Facebook's are only rarely an accumulation of smaller issues. Think about it: what's a more likely scenario for a major site-wide issue, thousands of web servers whose hard drives die simultaneously, or a flapping route caused by a configuration issue on a router?

      Think of it like our body: every day, you suffer thousands of tiny injuries and insults that your immune system and skin deal with and that you never know about. This frees you up to drive yourself to the doctor if you notice a lingering cough or to call the ambulance if you sever a limb. You wouldn't argue against an immune system because it might hide larger issues from conscious attention, would you?

      --
      If I wanted a sig I would have filled in that stupid box.
    2. Re:Complexity arising from simplicity by mclearn · · Score: 3, Informative

      TFA specifically uses an example of a failed hard drive to describe the workflow. You can see that a failed hard drive is something small, easily diagnosable, and -- in the greater scheme of things -- easily fixable.

      Now, if you recall what happened with AWS in April, they had a low-bandwidth management network that all of a sudden had all primary EBS API traffic shunted to it. This was caused by a human flipping a network switch when they shouldn't have. Something like this does not happen all the time, has few, if any, diagnosable features, is not well-defined enough to have a proper workflow attached to it, and needs human engineers to correct. This is an example of a complex, large-scale problem.

      Read the article, it's actually quite interesting.
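The split described in this comment (small, well-defined faults get a scripted workflow; anything else goes to a human) can be sketched roughly as below. This is only an illustration, not the article's actual system: the fault names, the handler functions, and the `PLAYBOOK` table are all hypothetical.

```python
from typing import Callable, Dict


# Hypothetical handlers, for illustration only. A real one would
# drive ticketing or service-management tooling.
def replace_disk(host: str) -> str:
    return f"{host}: disk replacement ticket filed"


def restart_service(host: str) -> str:
    return f"{host}: service restarted"


# Known, well-defined faults mapped to their scripted fix.
PLAYBOOK: Dict[str, Callable[[str], str]] = {
    "disk_failed": replace_disk,
    "service_down": restart_service,
}


def remediate(host: str, fault: str) -> str:
    """Run the scripted fix for a known fault, or escalate to a human."""
    handler = PLAYBOOK.get(fault)
    if handler is None:
        # No workflow exists for this fault, so a person has to look at it.
        return f"{host}: escalated to on-call engineer ({fault})"
    return handler(host)
```

A failed disk would be handled by `remediate("web01", "disk_failed")`, while something like a saturated management network has no playbook entry and falls through to escalation.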

    3. Re:Complexity arising from simplicity by hardtofindanick · · Score: 3, Insightful

      It seems to me like you are creating hypothetical scenarios of total failure. Most of the practical failure scenarios can be handled gracefully when you have Facebook's resources under your command. After all they are not sending men to Mars. We have studied distributed database problems for more than 30 years and now understand them well. There is pretty much nothing technologically interesting about Facebook (and Twitter for that matter).

      The sad part is someone writes his ramblings and puts a flow chart or two and it becomes a story on /.

  2. NOOOOOO!! by Baloroth · · Score: 4, Funny

    How are we supposed to kill it if it's self-healing? Now it will never die!

    --
    "None can love freedom heartily, but good men; the rest love not freedom, but license." --John Milton
    1. Re:NOOOOOO!! by piripiri · · Score: 2

      Double tap for safety.

  3. Sounds like a good place to work by Maow · · Score: 3, Interesting

    Facebook is an amazing place to work for many reasons but I think my favorite part of the job is that engineers like me are encouraged to come up with our own ideas and implement them. Management here is very technical and there is very little bureaucracy, so when someone builds something that works, it gets adopted quickly. Even though Facebook is one of the biggest websites in the world it still feels like a start-up work environment because there's so much room for individual employees to have a huge impact.

    Like building infrastructure? Facebook is hiring infrastructure engineers. Apply here.

    Damn, if I weren't so averse to soul-crushing rejection, I'd apply.

    This guy was insightful and informative, so I believe what is quoted above.

    And I'm surprised: I figured Facebook would be either more bureaucratic (like MS) or kinda dickishly autocratic (like Zuckerberg is rumoured to be).

  4. Re:Upstart? by FooBarWidget · · Score: 2

    Did you even read the article? It talks about things like broken hard drives.

  5. They do it very differently by brunes69 · · Score: 2

    From the sounds of this article, Facebook and Google go about this VERY differently.

    The Facebook way, it seems, is that every node in the infrastructure is possibly important. So they write and maintain all these healing scripts to deal with problems like broken processes or failed hard drives.

    Google goes about the same problem in a very different way. Google's system is architected such that no node is important. Everything is massively parallel and redundant - such that you could take and destroy any server, any set of servers, even an entire data centre and blow it up with a bomb, and aside from performance issues, no one would notice.

    From an admin's point of view - I would much prefer Google's system. Something doesn't look right on a box? Yank it out TOTALLY, put in a new one, investigate some other time.
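The "yank it out, put in a new one, investigate later" approach from this comment might look something like the sketch below. It is a minimal illustration under stated assumptions, not a description of how Google or Facebook actually do it: the `Pool` class and its fields are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Pool:
    """Hypothetical pool of interchangeable servers."""
    active: List[str]
    spares: List[str]
    quarantined: List[str] = field(default_factory=list)

    def replace(self, bad_host: str) -> Optional[str]:
        """Pull a suspect host out of service and promote a spare.

        Returns the promoted spare, or None if no spare is left.
        The pulled host is kept in quarantine for later diagnosis.
        """
        if bad_host not in self.active:
            raise ValueError(f"{bad_host} is not in service")
        self.active.remove(bad_host)
        self.quarantined.append(bad_host)
        if not self.spares:
            return None
        spare = self.spares.pop(0)
        self.active.append(spare)
        return spare
```

No diagnosis happens at replacement time; the suspect host just sits in `quarantined` until someone gets around to it, which only works if no single node matters.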