Slashdot Mirror


Facebook Engineers Crash Data Centers In Real-World Stress Test (ieee.org)

An anonymous reader writes: In a report via IEEE Spectrum, Facebook's VP of Engineering Jay Parikh described the company's "Project Storm" -- regular takedowns of Facebook's data center intended to stress test the company's disaster recovery efforts. The first few didn't go so well, he reports. (Perhaps doing a test during a World Cup final was not such a good idea). Months and months of planning went into the initial effort, though up until the actual moment, other Facebook leaders didn't think he'd actually take out an active data center. "In 2014, Parikh decided Project Storm was ready for a real-world test: The team would take down an actual data center during a normal working day and see if they could orchestrate the traffic shift smoothly," reports IEEE Spectrum. Parikh recalls: "I was having coffee with a colleague just before the first drill. He said, 'You're not going to go through with it; you've done all the prep work, so you're done, right?' I told him, 'There's only one way to find out'" if it works. (Parikh made the remarks at this week's @Scale conference in San Jose.) Parikh says there never seemed to be a good time to perform the live takedowns. "Something always ended up happening in the world or the company. One was during the World Cup final, another during a major product launch." The report adds, "The live takedowns continue today, with the Project Storm team members coming up with crazier and crazier ambitions for just what to take offline, Parikh says."

52 comments

  1. Worth it by 110010001000 · · Score: 5, Funny

    This is totally worth it. What would happen if there was a REAL disaster (like a nuclear strike) and people couldn't check their facebook feed and post "thoughts and prayers" messages? Too terrible to think about.

    1. Re: Worth it by johnsmithperson123 · · Score: 5, Insightful

      Considering that Facebook is arguably the world's biggest news service, it actually is sort of important.

    2. Re: Worth it by Anonymous Coward · · Score: 0

      Your twist of logic and bile doesn't stop him being correct ya silly bastard. As much as I personally loathe Facebook I can't ignore the reality of the situation.

    3. Re: Worth it by Rick Zeman · · Score: 1

      Considering that Facebook is arguably the world's biggest news service, it actually is sort of important.

      News DISTRIBUTION service. It's not like they provide any original content like AP, Reuters, etc.

    4. Re: Worth it by RandomSurfer314 · · Score: 1

      I'm not sure you really know what "news service" means ...

    5. Re:Worth it by flappinbooger · · Score: 1

      This is totally worth it. What would happen if there was a REAL disaster (like a nuclear strike) and people couldn't check their facebook feed and post "thoughts and prayers" messages? Too terrible to think about.

      After 10,000 likes the radiation poisoning gets better

      --
      Flappinbooger isn't my real name
    6. Re: Worth it by bill_mcgonigle · · Score: 3, Insightful

      News DISTRIBUTION service. It's not like they provide any original content like AP, Reuters, etc.

      In that AP and Reuters are just distribution services, Facebook is arguably a larger source of original news distribution than those two.

      And kudos to their engineering team for not just paying lip service to reliability.

      --
      My God, it's Full of Source!
      OUTSIDE_IP=$(dig +short my.ip @outsideip.net)
    7. Re:Worth it by thegarbz · · Score: 2

      This is totally worth it. What would happen if there was a REAL disaster (like a nuclear strike) and people couldn't check their facebook feed and post "thoughts and prayers" messages? Too terrible to think about.

      Or maybe given Facebook's system of being able to announce on your feed to your friends and family that you are in fact okay thus reducing panic situations, it's much more important than your prejudices make it out to be.

    8. Re: Worth it by 110010001000 · · Score: 1

      News Service for Morons. Kinda like G+ users.

    9. Re:Worth it by 110010001000 · · Score: 2

      You are right. "Lolz guys..TOTALLY not incinerated in the nuclear strike today...", check out this cool cat video.

    10. Re: Worth it by johnsmithperson123 · · Score: 2

      No, it's because I find it easier to sign up for things with a Gmail account. Insecure, but my internet commenting accounts are not exactly high on my security priority list. Trust me, I've never touched Google Plus.

    11. Re:Worth it by Anonymous Coward · · Score: 0

      Facebook is a massive gossip site with live sniping allowed. Nothing more.

    12. Re: Worth it by Gr8Apes · · Score: 1

      News DISTRIBUTION service. It's not like they provide any original content like AP, Reuters, etc.

      In that AP and Reuters are just distribution services, Facebook is arguably a larger source of original news distribution than those two.

      AP and Reuters both have reporters in their employ. FB does not, AFAIK. I'd guess FB also technically could probably be accused of copyright violations regarding reposting AP/Reuters stories. FB is nothing more than a massive "look at me" and gossip site.

      And kudos to their engineering team for not just paying lip service to reliability.

      I would agree with this. Running real world tests is the only way to be sure.

      --
      The cesspool just got a check and balance.
    13. Re:Worth it by BringsApples · · Score: 1

      I think that what Mr. 110010001000 is pointing out is:
      People think it's easier to get an internet connection (and assume that others have an internet connection) and view/post about being ok than it is to simply pick up a phone and call your loved ones, or even stranger, leaving the house and actually checking on them.

      --
      Politics; n. : A religion whereby man is god.
    14. Re:Worth it by CharlieG · · Score: 1

      Back in the 70s (Before I was in the field professionally, but knew 'enough') I was brought by my dad to Bunker Ramo, who did all the stock market data. (My dad worked on their HVAC)
      At the one site, they had TWO mainframes (yes, this was pre IBM PC era) with DRUM memory (Yes, I've seen operational drum memory!! - and I have one word of memory from the computer - discrete transistors!!!)
      Anyway, it was a cluster! One computer could take over for the other. Guy said "Oh, that's nothing, there are 2 more in Midtown (we were in the Wall St area), then 2 more in Chicago, London, Tokyo, Paris, and the last pair? Alice Springs - Just in case of Nuclear War"
      Yes folks, there are companies out there that DO invest in DR plans that include "Global Nuclear War"

      --
      -- 73 de KG2V For the Children - RKBA! "You are what you do when it counts" - the Masso
    15. Re: Worth it by plopez · · Score: 1

      "Considering that Facebook is arguably the world's biggest marketing service, it actually is sort of important."

      There you go. Fixed that for you.

      --
      putting the 'B' in LGBTQ+
    16. Re: Worth it by Anonymous Coward · · Score: 0

      I have been in that kind of situation, and the voice network was overloaded while the Internet still worked. So yeah, it sometimes is easier, plus you reach a whole bunch more people at once.

    17. Re:Worth it by thegarbz · · Score: 1

      In disasters where internet connections are major issues, so are typically phone calls. But this isn't the scenario I was talking about. In every disaster there are a higher portion of people not affected than are affected. Getting a message out casually on social media can help reduce the stress on the phone network which is better served for the actual emergencies going on at the time.

    18. Re:Worth it by BringsApples · · Score: 1

      I don't think I understand. Do you mean that if a nuclear strike or earthquake hits California, and it destroys a lot, as far out as Nevada, then the folks in Utah should still be able to say "I'm ok" to the folks in New York, via facebook? Because what I got out of what 110010001000 said was that the folks in Utah should be scrambling to help those in Nevada and California, rather than browsing facebook, and cat videos. I may be reading too much into it though.

      --
      Politics; n. : A religion whereby man is god.
    19. Re:Worth it by thegarbz · · Score: 2

      Yes that's exactly what I was saying, and not everyone is in position in every case to help someone. The idea that an entire city of people will suddenly flock to another to "scramble to help" is simply absurd. The world will keep turning and no one can do 100% all the time so criticizing people for being on facebook is not really thinking ahead.

      Now on the flip side Italy had an earthquake the other day. My sister was in Italy, I don't know where she was, just that she was travelling through. My first reaction was to jump on Facebook and I was greeted with a lovely message of "Shaken, but we're fine". My mother on the other hand went into frigging panic mode because she couldn't call my sister after trying about 10 times (mobile phone wasn't working for what turned out to be unrelated to the earthquake, but internet connection in the hotel was unaffected hence the facebook post). Anyway when my mother eventually called me I said she's fine, no I haven't talked to her, but she's posting on Facebook pictures of rubble.

      Now this is just one anecdote, but there are countless scenarios where someone may have:
      a) access to internet but no telephone
      b) desire to post to everyone at once that they are fine, rather than having to service 20 individual calls from friends and relatives.

      and both of those together are of benefit to those people affected by the disaster as it reduces the load on the local infrastructure.

    20. Re:Worth it by BringsApples · · Score: 1
      I like your reply, really I do. You have a great point, and using the real-world example from Italy helps. However I'd like to point out that...

      The idea that an entire city of people will suddenly flock to another to "scramble to help" is simply absurd.

      ...is simply absurd. Not sure where you live, but I'm very near to Mississippi (a place where we used to get multiple hurricanes every year). When Hurricane Katrina hit, it was not possible to reach anyone there without traveling to the area (southern-most parts of Louisiana to Mississippi). The only government help in place was positioned at the Walmart(s) to prevent anyone from looting. Many, and I can't stress enough how many, people came from Alabama and Texas to help. Hell, even Mexico sent its army.

      To think that there were people living in Alabama, very near to the devastation experienced by people just a few miles away, that may have been too busy on facebook (which wasn't big at the time) to help out, would, in many people's opinion, be worse than absurd. I'm willing to bet that those folks in Italy would agree.

      So maybe we're both correct, and we can conclude that one should, if one can, post "I'm ok" to facebook, then, if one can, go help out those in need.

      --
      Politics; n. : A religion whereby man is god.
  2. Well that explains the SpaceX fiasco by Anonymous Coward · · Score: 0

    Real world stress test! See if the other satellites can pick up the slack.

  3. Somebody Finally Gets It! by chill · · Score: 5, Insightful

    Good for him! Most DR exercises I've seen are planned weeks, if not months in advance. They are more of a scheduled fail-over to a redundant site and not an actual disaster recovery test.

    In the event of an actual disaster, there would be no recovery.

    I'm heartened to see SOMEONE does it right.

    --
    Learning HOW to think is more important than learning WHAT to think.
    1. Re: Somebody Finally Gets It! by Anonymous Coward · · Score: 0

      This was not done right then.

      They should do a DR test where all of FB fails and does not come back.
      Ever. Hopefully.

    2. Re:Somebody Finally Gets It! by aaarrrgggh · · Score: 1

      Most large banks I have worked with do full DR exercises, and have since the 90's at least. Smaller banks will simulate typically, but one bank I know of actually shifted mainframes to their DR warehouse and brought things up from there.

      Now with hot-hot sites, the activity is much more trivial, but it is obviously not a universal thing across all organizations.

    3. Re:Somebody Finally Gets It! by Anonymous Coward · · Score: 0

      I worked in a different environment, where our concern was more, if we completely lose everything, can we build it all back up and get things running again.

  4. Disaster won't wait until you're not busy to strike by lapm · · Score: 1

    That's how you should do it. Disasters don't wait around until you're not busy doing something else; they hit when they want to. That's why I think live testing is important, to see whether the recovery plan works or needs another iteration of improvement... The best test is usually when disaster strikes right when you have your hands full with something else... No time to dig up manuals, etc...

  5. DR Test are Simple by Anonymous Coward · · Score: 0

    My last DR Test Plan only had 2 lines in it
    Power=OFF
    Execute DR Plan.

    1. Re:DR Test are Simple by turbidostato · · Score: 1

      And this is, sir, why IT is in its current lame shape: allowing incompetent people to take key decisions just because they happen to relate to the right people.

      Now, for a real world working DR plan:
      10 REM 'DR Master Plan'
      20 LET Power = OFF
      30 PRINT 'AIIIIEEEEEEEE!'
      40 GOTO 30

      Now: *THAT'S* what professionalism looks like.

  6. Re:facebook engineering.... by 0100010001010011 · · Score: 1

    facebook and engineering aren't something I'd put into a sentence together.

    And how do you think they got to handle the amount of data they did? The same comments rolled in when Walmart released its cloud service. Of course they have an engineering department.

    Facebook engineers are working on a scale that most people will never see.

    What's this World coming to when twitterbook has to be protected from natural disasters.

    A ubiquitous service that most people have access to is one that speeds up disaster recovery. People have already used groups to organize disaster recovery efforts on small and large scales.

    No one says you have to use it to upload food selfies.

  7. Really? by Anonymous Coward · · Score: 0

    I didn't know that Facebook still exists. Haven't most people gone somewhere else for their 'waste of time' needs already? But I suppose it's still important to stress test FB. Someone's grandma would be really sad if she couldn't post her cookie recipe after Yellowstone had blown up or an accidental thermonuclear war had broken out.

  8. Netflix Simian Army and Microservice Architectures by MikeMoore2291 · · Score: 4, Interesting

    So Netflix has been doing this for years now... it's called the Chaos Monkey, and it's part of their "Simian Army", which performs this kind of function but *all the time* with no schedule. This is not something FB came up with, but this post seems to give them credit for this innovation. More interesting than the lack of credit to Netflix, though, is this adoption of a method that heavily favors a Microservice Architecture. Seeing more and more of this flexible, scalable, and highly resilient architecture and methodology being put out in industry is certainly encouraging.

  9. And nothing of value was lost by RogueWarrior65 · · Score: 3, Insightful

    Pity.

  10. Fault-tolerant privacy invading data centers by JoeyRox · · Score: 1, Funny

    Just when you thought your privacy could be restored by the massive failure of a data center, a new center rises up from the ashes of the other to take the helm of stealing every personal detail about your life and the life of people you love.

    1. Re:Fault-tolerant privacy invading data centers by Anonymous Coward · · Score: 0

      stealing

      It's not really stealing if you give it to them. Constantly. For years.

  11. Couldn't have been too successful by smooth wombat · · Score: 0

    Facebook is still up and running.

    --
    We will bankrupt ourselves in the vain search for absolute security. -- Dwight D. Eisenhower
  12. Testing ideas by WaffleMonster · · Score: 1

    It would be very useful for Facebook to stop announcing ASN 32934 for a few centuries as an experiment just to see what would happen.

    Or permanently remove authority records for facebook.com.

    You know just to see what would happen.

  13. Who's on first? by fahrbot-bot · · Score: 1

    Months and months of planning went into the initial effort, ...

    Into the take down or recovery? 'Cause the former just requires pulling a cable of some sort. :-) TFS says the team would take down a site and try to migrate the traffic, but wouldn't it be better if the disaster group and the recovery group were different teams for a "real world" stress test?

    --
    It must have been something you assimilated. . . .
  14. Re:Netflix Simian Army and Microservice Architectu by epine · · Score: 1

    This is not something FB came up with but this post seems to give them credit for this innovation.

    Actually, this is one of the better story summaries I've seen here.

    I knew what it was talking about (no unexplained mayfly buzzwords), I knew who the protagonists were, and I knew what was at stake. The only implied innovation was one of personal chutzpah, against the backdrop of an organization notorious for taking all things in collective stride (these being very, very short strides).

    Working at Facebook Sounds Like Joining a Cult

    At some level, I think we do indigenous people a disservice by referring to them as First Nations, freezing them in the amber of the era, as if they couldn't (and hadn't) kicked the shit out of their neighbours every bit as ruthlessly as the Spanish, the Dutch, or the British (secretly, it's a badge of honour, isn't it, to have kick-ass forbears?)—the main difference being that the European cultures brought with them a written language—so long, illiterate heathens—then, however they found the table set is assigned a positive integer (let's not even grant them "zero") to functionally signify that no form of tomahawk displacement came before.

    So of course Netflix didn't invent this technique.

    Clubbing a bunch of strange-looking men and taking their women is an idea that never owed much to the example of recent history, no matter how grand and savage an example your nearest neighbour might have set in the memorable recency.

  15. Been there. Broke that. by Anonymous Coward · · Score: 0

    First day at the new job they turned me loose on the product, told me to poke around the software and get familiar with the GUI. I took down the entire system within minutes after running a report that didn't place limits on how far back you could search. Needless to say, that got fixed fairly quickly.

  16. Google has been doing this for many, many years by melted · · Score: 1

    There it's called "DiRT" (stands for "Disaster Recovery Test").

  17. Re: Netflix Simian Army and Microservice Architect by Anonymous Coward · · Score: 0

    O btw, the Simian Army is open sourced!

  18. Tests marketing by Anonymous Coward · · Score: 0

    But at least HP really did blow something up:
    https://www.youtube.com/watch?v=bUwthF9x210

  19. crazier ambitions for just what to take offline.. by Anonymous Coward · · Score: 0

    here's an idea..... how about the WHOLE FUCKING THING

    and we really don't give a shit if you ever get it all back online either, so.. no pressure at all. just DOIT

  20. Netflix Simian Army by shentino · · Score: 1

    Reminds me of "Chaos Monkey" from the netflix simian army.

  21. Re: Google has been doing this for many, many year by Anonymous Coward · · Score: 0

    As does pretty much any large company. I'm honestly not sure why this is worthy of being news? Any organisation that relies on IT for its business will have appropriate DR facilities and be regularly testing. Financials are required by regulation to perform such tests. What makes Farcebook's test any different?

  22. Playhouse by peawormsworth · · Score: 1

    This reminds me of when Pee-wee Herman fell off his bike and then got up and said: "I meant to do that"