The Simian Army and the Antifragile Organization
CowboyRobot writes "ACM has an article about how Netflix conducts its resilience testing. Instead of the GameDays used by sites such as Amazon and Google, Netflix uses what they call The Simian Army, based on the philosophy that 'Resilience can be improved by increasing the frequency and variety of failure and evolving the system to deal better with each new-found failure, thereby increasing anti-fragility.' While GameDay exercises are like a fire-drill, with scheduled exercises where failure is manually introduced or simulated, the Simian Army relies on failure in the live environment induced by autonomous agents known as 'monkeys.' Chaos Monkey randomly terminates virtual instances in a production environment that are serving live customer traffic. Chaos Gorilla causes an entire Amazon Availability Zone to fail. And Chaos Kong will take down an entire region of zones. 'What doesn't kill you makes you stronger' and Netflix hopes that by constantly protecting itself from internal onslaught, they will become increasingly 'anti-fragile — growing stronger from each successive stressor, disturbance, and failure.'"
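For readers who want a concrete picture of the "monkey" idea in the summary, here is a minimal sketch in Python. It is not Netflix's code and calls no real cloud API: `Fleet`, `chaos_monkey`, and the instance ids are hypothetical names made up for illustration, and the fleet is just an in-memory set.

```python
import random


class Fleet:
    """Hypothetical in-memory stand-in for a pool of service instances."""

    def __init__(self, instance_ids):
        self.running = set(instance_ids)

    def terminate(self, instance_id):
        # Removing the id simulates an instance being killed.
        self.running.discard(instance_id)

    def is_serving(self, min_healthy=1):
        # The service counts as "up" while enough instances remain.
        return len(self.running) >= min_healthy


def chaos_monkey(fleet, rng, probability=1.0):
    """Terminate one randomly chosen running instance.

    Returns the id of the terminated instance, or None if nothing was killed.
    """
    if not fleet.running or rng.random() >= probability:
        return None
    victim = rng.choice(sorted(fleet.running))
    fleet.terminate(victim)
    return victim


if __name__ == "__main__":
    fleet = Fleet(["i-1", "i-2", "i-3"])
    killed = chaos_monkey(fleet, random.Random())
    print("terminated:", killed, "still serving:", fleet.is_serving(min_healthy=2))
```

The point of the exercise is the check after the kill: if `is_serving` goes false after losing a single instance, the system was fragile and needs to evolve.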
This is the example that I use when explaining antifragility to my colleagues. I highly recommend Nassim Nicholas Taleb's book, "Antifragile" - at least chapters 1, 2, and 7.
No wonder nobody takes Netflix seriously. What kind of tech company worries about things like reliability and robustness? That's soooo 20th century. Everyone knows that if you have more than 90% availability or too low of a bug rate it means you're not agile enough and you can't be one of those amazingly innovative social networking outfits.
Yeah, I liked that video too.
But how do you write the big spec for that? The PMI would never approve.
So this is the ability to use whatever resources are available for graceful failover, allowing masses of cheap, consumer-grade equipment to be used instead of small amounts of expensive, reliable enterprise gear.
Sounds like a winning strategy.
It's hard to explain, but antifragility is best built on layers of fragility, meaning cells in a body are fragile but the body itself gets stronger when stressed (lifting weights, etc.). The Netflix example is good; it's a bit like randomly pulling parts off a plane in flight and then, after the crash, making the next planes stronger. It also leads to antifragility, but it's a strong stressor.
The Black Swan chaos, Government/Hollywood takeover, the billion-dollar-plus lawsuit, EMP bombs, mass worldwide migration to Internet2, Yellowstone and, of course, the Cthulhu Chaos. The insider-threat chaos probably gets around all of these options.
I was looking forward to hearing about this army full of primates.
This just sounds like automated testing with a new name. Testing on live networks is maybe a little bit "innovative"; but it's really just automated testing. Now let's go synergize some more paradigms.
...catastrophic management failure, such as when executives decide to spin off half the company into an independent service called Qwikster...
The problem with this is that it's still programmed failure. In my experience, hardware or software faults, or combinations of both, are not nearly as effective as plain old human stupidity. Oh, and government action. There is no disaster recovery plan for "Here's a warrant. Give us all your shit." There is a similar lack of recovery options for human stupidity. And let's be honest: it's more abundant in the universe than hydrogen, and infinitely harder to defend against, precisely because stupidity is far more cunning and unpredictable than intelligence could ever hope to be.
I think part of the reason for their heavy focus on reliability is that they are competing with the mature television industry and thus have a lot of concern for finicky customers that are considering cutting their cable/satellite plan.
Why reiterate this again?
"Chaos Monkey" sounds like it ought to be the name of the next iteration of Firefox's Javascript subsystem.
Hang on.... "Chaos Monkey is a piece of software that deliberately takes out random parts of your live production system".... hmmmm.... maybe it *is* the Firefox Javascript subsystem?
(Spudley Strikes Again!)
I wonder how this behaves in the eyes of the customer.
From cluster solutions I know that some of the people maintaining them mistake a redundant system for zero downtime.
The problem is that if you take down a server, all connections to it go down. Some applications gracefully switch to another server. Some applications, however, first have to time out. Some applications crash.
The question is, do those interruptions get reported correctly, or do people just blame the app and restart their PC?
Very few of those user problems actually get reported; the first-line help desk just instructs them to restart, and since by then another VM or region has taken over, everything works. But doing this on purpose is not a good user experience.
Just remember, zero downtime does not mean that there are no interruptions; to minimize these you need a different mindset.
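To make the "graceful switch" case concrete, here is a rough Python sketch of a client that fails over to the next server and also keeps a record of which servers failed, so the interruption can be reported instead of silently vanishing. Everything in it is an assumption for illustration: `ServerDown`, `call_with_failover`, and the `send` callback are hypothetical names, not any real library's API.

```python
class ServerDown(Exception):
    """Hypothetical error raised when a server refuses the connection."""


def call_with_failover(servers, request, send):
    """Try each server in order and return (response, failed_servers).

    `send(server, request)` is a caller-supplied function that either returns
    a response or raises ServerDown / TimeoutError.
    """
    failed = []
    last_error = None
    for server in servers:
        try:
            response = send(server, request)
            return response, failed
        except (ServerDown, TimeoutError) as exc:
            # Remember the failure so it can be logged or reported later.
            failed.append(server)
            last_error = exc
    raise RuntimeError("all servers failed") from last_error


if __name__ == "__main__":
    def fake_send(server, request):
        # Pretend server "a" was just taken out by a monkey.
        if server == "a":
            raise ServerDown(server)
        return f"{server}:{request}"

    response, failed = call_with_failover(["a", "b"], "ping", fake_send)
    print(response, "after failures on", failed)
```

The user still sees one slower request while "a" fails, which is the interruption described above; the `failed` list is what would let an operator see it too.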
Yes sir, I like uptimes of 1 to 2 years.
I originally read that as the "Syrian Army".
I'm a good cook. I'm a fantastic eater. - Steven Brust
netflix gets all this great PR for this approach - and at least in theory it's a good one - but as a customer of netflix's, the results i've experienced are actually pretty poor.
think about it, they go around shooting nodes in the head during business hours. In the long run, that's great, they can be prepared for anything, but it's still madness.
Oh and separation of services? Great. But who the hell wants to browse the netflix directory when the streaming service is down? Not me, for one.
Maybe they do this with their PC client? They surely don't seem to care about the robustness of their Android client. I think they must develop and test that monster on the latest, most powerful hardware that a corporation can buy. Then they fill it full of graphics and video until it almost breaks, thus ensuring that it runs like crap on anything less. I would drop Netflix like a ton of bricks except they have licensed most of the content that I would actually want to watch, while Hulu, the only competitor I am aware of, has just about nothing for me.
Chaos Kong ate my cloud! I hope Mario can save my QoS levels by rescuing Pauline.
The article appears to be a slightly pretentious way of saying that Netflix does reliability testing on its live systems. They can get away with this only because it is not critically important for Netflix to be highly robust: the downside of failure is merely a degree of temporary irritation. Don't try this in the financial markets or life-support systems.