Slashdot Mirror


Facebook Engineers Crash Data Centers In Real-World Stress Test (ieee.org)

An anonymous reader writes: In a report via IEEE Spectrum, Facebook's VP of Engineering Jay Parikh described the company's "Project Storm" -- regular takedowns of Facebook's data center intended to stress test the company's disaster recovery efforts. The first few didn't go so well, he reports. (Perhaps doing a test during a World Cup final was not such a good idea.) Months and months of planning went into the initial effort, though up until the actual moment, other Facebook leaders didn't think he'd actually take out an active data center. "In 2014, Parikh decided Project Storm was ready for a real-world test: The team would take down an actual data center during a normal working day and see if they could orchestrate the traffic shift smoothly," reports IEEE Spectrum. Parikh recalls: "I was having coffee with a colleague just before the first drill. He said, 'You're not going to go through with it; you've done all the prep work, so you're done, right?' I told him, 'There's only one way to find out'" if it works. (Parikh made the remarks at this week's @Scale conference in San Jose.) Parikh says there never seemed to be a good time to perform the live takedowns. "Something always ended up happening in the world or the company. One was during the World Cup final, another during a major product launch." The report adds, "The live takedowns continue today, with the Project Storm team members coming up with crazier and crazier ambitions for just what to take offline, Parikh says."

10 of 52 comments

  1. Worth it by 110010001000 · · Score: 5, Funny

    This is totally worth it. What would happen if there was a REAL disaster (like a nuclear strike) and people couldn't check their facebook feed and post "thoughts and prayers" messages? Too terrible to think about.

    1. Re: Worth it by johnsmithperson123 · · Score: 5, Insightful

      Considering that Facebook is arguably the world's biggest news service, it actually is sort of important.

    2. Re: Worth it by bill_mcgonigle · · Score: 3, Insightful

      News DISTRIBUTION service. It's not like they provide any original content like AP, Reuters, etc.

      In that AP and Reuters are just distribution services, Facebook is arguably a larger source of original news distribution than those two.

      And kudos to their engineering team for not just paying lip service to reliability.

      --
      My God, it's Full of Source!
      OUTSIDE_IP=$(dig +short my.ip @outsideip.net)
    3. Re:Worth it by thegarbz · · Score: 2

      This is totally worth it. What would happen if there was a REAL disaster (like a nuclear strike) and people couldn't check their facebook feed and post "thoughts and prayers" messages? Too terrible to think about.

      Or maybe, given Facebook's system for announcing on your feed to your friends and family that you are in fact okay, thus reducing panic situations, it's much more important than your prejudices make it out to be.

    4. Re:Worth it by 110010001000 · · Score: 2

      You are right. "Lolz guys..TOTALLY not incinerated in the nuclear strike today...", check out this cool cat video.

    5. Re: Worth it by johnsmithperson123 · · Score: 2

      No, it's because I find it easier to sign up for things with a Gmail account. Insecure, but my internet commenting accounts are not exactly high on my security priority list. Trust me, I've never touched Google Plus.

    6. Re:Worth it by thegarbz · · Score: 2

      Yes, that's exactly what I was saying, and not everyone is in a position in every case to help someone. The idea that an entire city of people will suddenly flock to another to "scramble to help" is simply absurd. The world will keep turning and no one can do 100% all the time, so criticizing people for being on Facebook is not really thinking ahead.

      Now on the flip side, Italy had an earthquake the other day. My sister was in Italy; I didn't know where she was, just that she was travelling through. My first reaction was to jump on Facebook, and I was greeted with a lovely message of "Shaken, but we're fine". My mother, on the other hand, went into frigging panic mode because she couldn't reach my sister after trying about 10 times (the mobile phone wasn't working, for a reason that turned out to be unrelated to the earthquake, but the internet connection in the hotel was unaffected, hence the Facebook post). Anyway, when my mother eventually called me I said she's fine; no, I haven't talked to her, but she's posting pictures of rubble on Facebook.

      Now this is just one anecdote, but there are countless scenarios where someone may have:
      a) access to internet but no telephone
      b) desire to post to everyone at once that they are fine, rather than having to service 20 individual calls from friends and relatives.

      and both of those together are of benefit to those people affected by the disaster as it reduces the load on the local infrastructure.

  2. Somebody Finally Gets It! by chill · · Score: 5, Insightful

    Good for him! Most DR exercises I've seen are planned weeks, if not months, in advance. They are more of a scheduled fail-over to a redundant site and not an actual disaster recovery test.

    In the event of an actual disaster, there would be no recovery.

    I'm heartened to see SOMEONE does it right.

    --
    Learning HOW to think is more important than learning WHAT to think.
  3. Netflix Simian Army and Microservice Architectures by MikeMoore2291 · · Score: 4, Interesting

    So Netflix has been doing this for years now... it's called Chaos Monkey, and it's part of their "Simian Army", which performs this kind of function but *all the time* with no schedule. This is not something FB came up with, but this post seems to give them credit for the innovation. More interesting than the lack of credit to Netflix, though, is this adoption of a method that heavily favors a Microservice Architecture. Seeing more and more of this flexible, scalable, and highly resilient architecture and methodology being put out in industry is certainly encouraging.
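
For readers unfamiliar with the idea in the comment above, here is a rough sketch of random failure injection: knock out one instance at random and check that requests still get served by the rest. It is a toy simulation, not Netflix's or Facebook's actual tooling; the names `Instance`, `terminate_random_instance`, and `route` are hypothetical and exist only for this illustration.

```python
import random
from dataclasses import dataclass


@dataclass
class Instance:
    """Hypothetical stand-in for one service instance in a pool."""
    name: str
    healthy: bool = True


def terminate_random_instance(pool, rng):
    """Mark one randomly chosen healthy instance as down.

    Returns the instance that was taken down, or None if none were healthy.
    """
    candidates = [inst for inst in pool if inst.healthy]
    if not candidates:
        return None
    victim = rng.choice(candidates)
    victim.healthy = False
    return victim


def route(request_id, pool):
    """Pick a healthy instance for a request (simple round-robin by id)."""
    healthy = [inst for inst in pool if inst.healthy]
    if not healthy:
        raise RuntimeError("no healthy instances left")
    return healthy[request_id % len(healthy)].name


if __name__ == "__main__":
    rng = random.Random()  # unseeded: a different victim on each run
    pool = [Instance(f"svc-{i}") for i in range(4)]

    victim = terminate_random_instance(pool, rng)
    print("terminated:", victim.name)

    # Traffic should shift to the survivors without any request failing.
    served_by = {route(i, pool) for i in range(100)}
    print("still serving:", sorted(served_by))
```

The point of the exercise is the same as in the drills described in the story: if `route` ever raises, or the terminated instance still shows up in `served_by`, the failover logic is what needs fixing, and it is better to learn that on a chosen day than during a real outage.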

  4. And nothing of value was lost by RogueWarrior65 · · Score: 3, Insightful

    Pity.