Facebook Engineers Crash Data Centers In Real-World Stress Test (ieee.org)
An anonymous reader writes: In a report via IEEE Spectrum, Facebook's VP of Engineering Jay Parikh described the company's "Project Storm" -- regular takedowns of Facebook's data centers intended to stress test the company's disaster recovery efforts. The first few didn't go so well, he reports. (Perhaps doing a test during a World Cup final was not such a good idea.) Months and months of planning went into the initial effort, though up until the actual moment, other Facebook leaders didn't think he'd actually take out an active data center. "In 2014, Parikh decided Project Storm was ready for a real-world test: The team would take down an actual data center during a normal working day and see if they could orchestrate the traffic shift smoothly," reports IEEE Spectrum. Parikh recalls: "I was having coffee with a colleague just before the first drill. He said, 'You're not going to go through with it; you've done all the prep work, so you're done, right?' I told him, 'There's only one way to find out'" if it works. (Parikh made the remarks at this week's @Scale conference in San Jose.) Parikh says there never seemed to be a good time to perform the live takedowns. "Something always ended up happening in the world or the company. One was during the World Cup final, another during a major product launch." The report adds, "The live takedowns continue today, with the Project Storm team members coming up with crazier and crazier ambitions for just what to take offline, Parikh says."
This is totally worth it. What would happen if there was a REAL disaster (like a nuclear strike) and people couldn't check their Facebook feed and post "thoughts and prayers" messages? Too terrible to think about.
Real world stress test! See if the other satellites can pick up the slack.
Good for him! Most DR exercises I've seen are planned weeks, if not months in advance. They are more of a scheduled fail-over to a redundant site and not an actual disaster recovery test.
In the event of an actual disaster, there would be no recovery.
I'm heartened to see SOMEONE does it right.
Learning HOW to think is more important than learning WHAT to think.
That's how you should do it. Disasters don't wait around until you're not busy doing something else; they hit when they want to. That's why I think live testing is important, to see whether the recovery plan works or needs another iteration of improvement... The best test is usually when disaster strikes right when you have your hands full with something else... No time to dig up manuals, etc...
Trying to disseCt recent Sys Admin only way to go: gloves, condoms More gay than they give other people BUWLA, or BSD
My last DR Test Plan only had 2 lines in it
Power=OFF
Execute DR Plan.
facebook and engineering aren't something I'd put into a sentence together.
Facebook’s Project Storm originated in the wake of 2012’s Hurricane Sandy, Parikh reported. The superstorm threatened two of Facebook’s data centers, each carrying tens of terabits of traffic. Both got through Sandy unscathed, Parikh said, but watching the storm’s progress led the engineering team to consider what exactly would be the impact on Facebook’s global services if the company did indeed suddenly lose a data center or an entire region.
Jesus, get over yourselves. What's this world coming to when twitterbook has to be protected from natural disasters?
And please spare me this nonsense that it's needed. If you can post on facetwit you can call your Mom or whomever and let them know you're alright - if you are actually there. The rest of twitterface users need to get a life. Having twitterbook knocked out for a few hours would be good for them.
I didn't know that Facebook still exists. Haven't most people gone somewhere else for their 'waste of time' needs already? But I suppose it's still important to stress test FB. Someone's grandma would be really sad if she couldn't post her cookie recipe after Yellowstone had blown up or an accidental thermonuclear war had broken out.
So Netflix has been doing this for years now... it's called the Chaos Monkey and part of their "Simian Army" that performs this kind of function but *all the time* with no schedule. This is not something FB came up with but this post seems to give them credit for this innovation. More interesting than the lack of credit to Netflix though is this adoption of a method that heavily favors a Microservice Architecture. Seeing more and more of this flexible, scalable, and highly resilient architecture and methodology being put out in industry is certainly encouraging.
Pity.
Just when you thought your privacy could be restored by the massive failure of a data center, a new center rises up from the ashes of the other to take the helm of stealing every personal detail about your life and the life of people you love.
Facebook is still up and running.
We will bankrupt ourselves in the vain search for absolute security. -- Dwight D. Eisenhower
It would be very useful for Facebook to stop announcing ASN 32934 for a few centuries as an experiment just to see what would happen.
Or permanently remove authority records for facebook.com.
You know just to see what would happen.
Months and months of planning went into the initial effort, ...
Into the take down or recovery? 'Cause the former just requires pulling a cable of some sort. :-) TFS says the team would take down a site and try to migrate the traffic, but wouldn't it be better if the disaster group and the recovery group were different teams for a "real world" stress test?
It must have been something you assimilated. . . .
Actually, this is one of the better story summaries I've seen here.
I knew what it was talking about (no unexplained mayfly buzzwords), I knew who the protagonists were, and I knew what was at stake. The only implied innovation was one of personal chutzpah, against the backdrop of an organization notorious for taking all things in collective stride (these being very, very short strides).
Working at Facebook Sounds Like Joining a Cult
At some level, I think we do indigenous people a disservice by referring to them as First Nations, freezing them in the amber of the era, as if they couldn't (and hadn't) kicked the shit out of their neighbours every bit as ruthlessly as the Spanish, the Dutch, or the British (secretly, it's a badge of honour, isn't it, to have kick-ass forbears?)—the main difference being that the European cultures brought with them a written language—so long, illiterate heathens—then, however they found the table set is assigned a positive integer (let's not even grant them "zero") to functionally signify that no form of tomahawk displacement came before.
So of course Netflix didn't invent this technique.
Clubbing a bunch of strange-looking men and taking their women is an idea that never owed much to the example of recent history, no matter how grand and savage an example your nearest neighbour might have set in the memorable recency.
First day at the new job they turned me loose on the product, told me to poke around the software and get familiar with the GUI. I took down the entire system within minutes after running a report that didn't place limits on how far back you could search. Needless to say, that got fixed fairly quickly.
There it's called "DiRT" (stands for "Disaster Recovery Test").
O btw, the Simian Army is open sourced!
But at least HP really did blow something up:
https://www.youtube.com/watch?v=bUwthF9x210
here's an idea..... how about the WHOLE FUCKING THING
and we really don't give a shit if you ever get it all back online either, so.. no pressure at all. just DOIT
Reminds me of "Chaos Monkey" from the Netflix Simian Army.
As does pretty much any large company. I'm honestly not sure why this is worthy of being news. Any organisation that relies on IT for its business will have appropriate DR facilities and be regularly testing them. Financials are required by regulation to perform such tests. What makes Farcebook's test any different?
This reminds me of when Pee-wee Herman fell off his bike and then got up and said: "I meant to do that"