Slashdot Mirror


Facebook Engineers Crash Data Centers In Real-World Stress Test (ieee.org)

An anonymous reader writes: In a report via IEEE Spectrum, Facebook's VP of Engineering Jay Parikh described the company's "Project Storm" -- regular takedowns of Facebook's data centers intended to stress test the company's disaster recovery efforts. The first few didn't go so well, he reports. (Perhaps doing a test during a World Cup final was not such a good idea.) Months and months of planning went into the initial effort, though up until the actual moment, other Facebook leaders didn't think he'd actually take out an active data center. "In 2014, Parikh decided Project Storm was ready for a real-world test: The team would take down an actual data center during a normal working day and see if they could orchestrate the traffic shift smoothly," reports IEEE Spectrum. Parikh recalls: "I was having coffee with a colleague just before the first drill. He said, 'You're not going to go through with it; you've done all the prep work, so you're done, right?' I told him, 'There's only one way to find out'" if it works. (Parikh made the remarks at this week's @Scale conference in San Jose.) Parikh says there never seemed to be a good time to perform the live takedowns. "Something always ended up happening in the world or the company. One was during the World Cup final, another during a major product launch." The report adds, "The live takedowns continue today, with the Project Storm team members coming up with crazier and crazier ambitions for just what to take offline," Parikh says.
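
For readers wondering what a "traffic shift" involves, here is a rough sketch of the basic arithmetic. It is not how Facebook does it; the report gives no implementation details. The function name `drain_site`, the site names, the numbers, and the proportional-to-spare-capacity rule are all assumptions made up for illustration.

```python
def drain_site(load, capacity, target):
    """Return the new per-site load after taking `target` offline.

    Hypothetical model: the drained traffic is split across the remaining
    sites in proportion to each site's spare capacity.
    """
    # Spare room on every site except the one being taken down.
    spare = {s: capacity[s] - load[s] for s in load if s != target}
    total_spare = sum(spare.values())
    shifted = load[target]
    if shifted > total_spare:
        raise RuntimeError("remaining sites cannot absorb the drained traffic")

    new_load = {}
    for site, room in spare.items():
        # Guard against dividing by zero when nothing needs to move.
        share = room / total_spare if total_spare else 0.0
        new_load[site] = load[site] + shifted * share
    new_load[target] = 0.0
    return new_load


# Made-up numbers, in arbitrary traffic units.
load = {"dc_a": 40.0, "dc_b": 30.0, "dc_c": 30.0}
capacity = {"dc_a": 100.0, "dc_b": 60.0, "dc_c": 90.0}
after = drain_site(load, capacity, "dc_a")
```

A real drill would presumably also have to deal with caches, stateful services, and gradual ramp-down, none of which this toy model covers.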

6 of 52 comments

  1. Worth it by 110010001000 · · Score: 5, Funny

    This is totally worth it. What would happen if there was a REAL disaster (like a nuclear strike) and people couldn't check their Facebook feed and post "thoughts and prayers" messages? Too terrible to think about.

    1. Re: Worth it by johnsmithperson123 · · Score: 5, Insightful

      Considering that Facebook is arguably the world's biggest news service, it actually is sort of important.

    2. Re: Worth it by bill_mcgonigle · · Score: 3, Insightful

      News DISTRIBUTION service. It's not like they provide any original content like AP, Reuters, etc.

      In that AP and Reuters are just distribution services, Facebook is arguably a larger source of original news distribution than those two.

      And kudos to their engineering team for not just paying lip service to reliability.

      --
      My God, it's Full of Source!
      OUTSIDE_IP=$(dig +short my.ip @outsideip.net)
  2. Somebody Finally Gets It! by chill · · Score: 5, Insightful

    Good for him! Most DR exercises I've seen are planned weeks, if not months, in advance. They are more of a scheduled fail-over to a redundant site than an actual disaster recovery test.

    In the event of an actual disaster, there would be no recovery.

    I'm heartened to see SOMEONE does it right.

    --
    Learning HOW to think is more important than learning WHAT to think.
  3. Netflix Simian Army and Microservice Architectures by MikeMoore2291 · · Score: 4, Interesting

    Netflix has been doing this for years now. It's called Chaos Monkey, part of their "Simian Army," and it performs this kind of function *all the time*, with no schedule. This is not something FB came up with, but this post seems to give them credit for the innovation. More interesting than the lack of credit to Netflix, though, is the adoption of a method that heavily favors a microservice architecture. Seeing more and more of this flexible, scalable, and highly resilient architecture and methodology being put out in industry is certainly encouraging.
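
The "kill things at random, all the time" idea can be illustrated with a toy simulation. This is only a sketch of the concept, not Netflix's actual tool or its API; `chaos_round`, `service_available`, the instance names, and the kill probability are hypothetical.

```python
import random


def chaos_round(instances, kill_probability, rng):
    """Randomly terminate instances; return (survivors, terminated)."""
    survivors, terminated = [], []
    for inst in instances:
        # Each instance is independently picked for termination.
        if rng.random() < kill_probability:
            terminated.append(inst)
        else:
            survivors.append(inst)
    return survivors, terminated


def service_available(survivors, min_healthy):
    """Assumed health rule: the service is up if enough instances remain."""
    return len(survivors) >= min_healthy


rng = random.Random(0)  # seeded so a run can be repeated
fleet = [f"web-{i}" for i in range(10)]
survivors, terminated = chaos_round(fleet, 0.3, rng)
print("terminated:", terminated)
print("still serving:", service_available(survivors, min_healthy=5))
```

The point of running something like this continuously is that a service which only stays up when nothing fails gets found out quickly, which is why the approach pairs naturally with many small, redundant services.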

  4. And nothing of value was lost by RogueWarrior65 · · Score: 3, Insightful

    Pity.