Slashdot Mirror


Chaos Monkey Released Into the Wild

Quince alPillan writes "Netflix revealed today that they've released Chaos Monkey, an open source Amazon Web Service testing tool that will randomly turn off instances in Auto Scaling Groups. 'We have found that the best defense against major unexpected failures is to fail often. By frequently causing failures, we force our services to be built in a way that is more resilient. We are excited to make a long-awaited announcement today that will help others who embrace this approach. ...source code for the founding member of the Simian Army, Chaos Monkey, is available to the community.'"
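
For readers who want a feel for the behaviour the summary describes, here is a minimal sketch in plain Python. It is not Netflix's code and makes no AWS calls: the group names, the instance IDs, and the `pick_victims` and `unleash` helpers are all hypothetical, and the "one instance per selected group" rule is an assumption made for illustration.

```python
import random


def pick_victims(groups, probability, rng):
    """Choose at most one instance to terminate in each group.

    groups: dict mapping a (hypothetical) group name to a list of instance IDs.
    probability: chance that any given non-empty group is hit this round.
    """
    victims = {}
    for name, instances in groups.items():
        # Empty groups are skipped; others are hit with the given probability.
        if instances and rng.random() < probability:
            victims[name] = rng.choice(instances)
    return victims


def unleash(groups, probability=0.5, seed=None):
    """Simulate one round: return (victims, survivors) without touching anything real."""
    rng = random.Random(seed)
    victims = pick_victims(groups, probability, rng)
    survivors = {
        name: [i for i in instances if i != victims.get(name)]
        for name, instances in groups.items()
    }
    return victims, survivors


if __name__ == "__main__":
    demo = {"web": ["i-1", "i-2", "i-3"], "api": ["i-4", "i-5"]}
    print(unleash(demo, probability=0.5))
```

The point of the exercise is what happens afterwards: if the service built on `demo` cannot cope with the survivors alone, you find out on your own schedule rather than during a real outage.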

18 of 76 comments

  1. Into the wild? by dubl-u · · Score: 4, Informative

    And by "into the wild", they mean they're now letting it run on other people's sites.

    1. Re:Into the wild? by jcoy42 · · Score: 5, Funny

      This is why we don't let you write headlines.

      --
      Never trust an atom. They make up everything.
    2. Re:Into the wild? by inKubus · · Score: 5, Insightful

      Sound idea, sure. But not a substitute for good engineering. You see this issue come up again and again with these cloud services. The pressure from sales and marketing to move quickly and monetize the idea (and support lots of subscribers quickly) is not conducive to building a solid infrastructure. Netflix's approach is actually the exact opposite of Amazon's. Amazon's system is highly engineered and designed to resist failures that take down Amazon.com for its customers. That is their number one goal. Amazon.com has not been down for a long time. AWS is an offshoot of that effort to resell their extra cycles, but it's not nearly as engineered as the Amazon.com application built on top, which redirects around the globe and does lots of other things. It seems that AWS always has some new service coming out, but think about this: all those services were probably made by Amazon 3 years ago and they are just now releasing them to you.

      Netflix, on the other hand, seems to be just hacking together a site, if this is really what they primarily used to QA their application. What you're doing with this random failure thing is just statistically creating errors and finding bugs in failure handling code statistically. This means there's _up to_ an infinite number of bugs that will *not* be found with this method because they are unlikely or the tester is unlucky.
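
      To put rough numbers on that argument, here is a back-of-the-envelope sketch. It assumes (an idealization, not something from the post) that each random kill independently triggers a given bug with probability p, so the chance the bug is still undiscovered after n kills is (1 - p)^n. The function names are made up for illustration.

```python
import math


def miss_probability(p, n):
    """Chance that a bug triggered with probability p per kill survives n kills."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must be strictly between 0 and 1")
    return (1.0 - p) ** n


def kills_needed(p, confidence):
    """Smallest number of kills that finds the bug with at least `confidence`."""
    if not 0.0 < p < 1.0 or not 0.0 < confidence < 1.0:
        raise ValueError("p and confidence must be strictly between 0 and 1")
    # Solve (1 - p)^n <= 1 - confidence for n.
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p))


if __name__ == "__main__":
    print(miss_probability(0.01, 100))  # about 0.37
    print(kills_needed(0.01, 0.95))
```

      Under that assumption, a bug that shows up in 1% of kills still has roughly a 37% chance of going unnoticed after 100 kills. Rare failure modes need a lot of monkey time, which cuts both ways: it supports the point that random testing alone misses things, and it is also a reason to run the monkey continuously.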

      It certainly has to do with the math of it, but it also has to do with the human culture that arises when working like this. See, with this brute force iterative programming, you are building a nest of patches. So what you are going to end up with is going to be more complicated and less functional than if you do the hard work. And that's the issue. Thinking about stuff in terms of thousands or millions of nodes is "too hard" so the aforementioned cloud providers keep coming up with "creative solutions" like this. (I remember reading about Facebook hacking MySQL a few years back and shaking my head as well...) But, like "creative accounting", it may not be illegal but it may get you into trouble. You're never going to be absolutely sure the application will stay up and available. Ok, fine, so if Netflix goes down no one's going to die, but still... there's millions of dollars and subscriber goodwill at stake and that's not nothing.

      Anyway, don't think that I'm railing against creative testing, but they shouldn't think they are so clever as the release seems to imply they think they are ;)

      --
      Cool! Amazing Toys.
    3. Re:Into the wild? by arkhan_jg · · Score: 3, Informative

      It seems to be more that they've just published the source code on GitHub under the Apache licence.

      So you can run your own Chaos Monkey on your own Amazon cloud systems, or modify it to run on your private cloud, or whatever.

      --
      Remember kids, it's all fun and games until someone commits wholesale galactic genocide.
    4. Re:Into the wild? by dave420 · · Score: 2

      That's a lot of guesswork... I don't see many links backing your positions up.

    5. Re:Into the wild? by eyrieowl · · Score: 3, Insightful

      There are a lot of things that can go wrong in failover scenarios. Unless and until they are tested in real world situations, you can't be certain the system works. I happen to know of many systems which had failover processes which were "tested", and sounded fine on paper, but when it came to the real world, they had failed to account for this or that unexpected condition which ended up leading to far more downtime than was expected. If Chaos Monkey is their ONLY way of arriving at a resilient service, then sure, they have a deeper issue. But if they've spent time trying to design a solid system and then they're using Chaos Monkey to make sure it's as bullet-proof as they think it is, then it's good, solid engineering for the real world. I am reminded of the book "Inviting Disaster", on technology failures. All the systems described in the book which failed were well engineered systems. But due to a series of events working in concert, disaster happened. Any one link in the chain of failures wouldn't be enough; and it is not possible to fully engineer that out of your system; and certainly not possible to test for that in controlled testing environments. But if you can start causing failures in the real world (which is a luxury you have with systems that don't actually keep people alive), you have the opportunity to eliminate those sorts of weaknesses from the system. That's what I think is the value of something like this.

    6. Re:Into the wild? by metrometro · · Score: 2

      If Netflix is hacking together a site, why is their HD streaming more reliably pleasing than any other online service, including places like Comcast, which presumably has 100x the engineers on hand? Maybe they are good at teh hacking?

    7. Re:Into the wild? by JackieBrown · · Score: 2

      It is better than Amazon's on the PS3 for me. Amazon gets stuck buffering a lot for me.

    8. Re:Into the wild? by rwa2 · · Score: 2

      Meh, what's the point of good engineering if you never test it? I've heard of quite a few wonderfully expensive and over-engineered UPS and RAID deployments that failed completely because they never bothered to actually test the procedures. The last company I worked at would often have regular "emergency power off" events where they'd do a complete shutdown of the entire datacenter triggered by various environmental factors. And you know what? More times than not they'd still find a system that somehow missed the trap and didn't get shut down properly, and plenty of caveats with the enterprise-grade UPS infrastructure.

      At one of the first companies I worked at, the idea was to engineer a cluster with no SPOF, so we'd actually invite customers (/monkeys) to go to the back and rip out / unplug something, anything, while the cluster was doing something like a distributed POVRay render. It was a pretty simple, elegant test, and a great mindset to have when designing any HA system, not just for fault tolerance, but also to architecturally enable on-line upgradeability, scalability, and some other niceties.

  2. Very Erlang-y by Anonymous Coward · · Score: 3, Informative

    "We have found that the best defense against major unexpected failures is to fail often. By frequently causing failures, we force our services to be built in a way that is more resilient."

    Sounds like what has been common in Erlang for decades.

    Off topic: when I watch the /. homepage, I am logged in. As soon as I click on a story, I become an Anonymous Coward. Did anybody else experience this bug too?

  3. Misleading title by valentinas · · Score: 4, Funny

    I thought this was about monkeys...

    1. Re:Misleading title by hawguy · · Score: 2

      "they just released the source to something that can be turned into a ddos tool in like 5 minutes? seriously?"

      If someone else has the private keys that let them control your EC2 instances, then you probably have more to worry about than a tool that will randomly shut down your running instances.

  4. Obligatory... by CODiNE · · Score: 3, Funny
    --
    Cwm, fjord-bank glyphs vext quiz
  5. I love this thing by ghostdoc · · Score: 3, Interesting

    Not only for the idea that a serious company lets a masturbating-and-throwing-poo grinning idiot loose in their sensitive vitals, but also because it draws so many parallels with other resilient systems.

    Allergies cured by parasitical worms? Chaos Monkey Effect - you need something attacking your defences for your system to stay healthy

    Ecosystem that relies on bushfires to clear old vegetation? Chaos Monkey Effect

    Something almost Zen about not only turning an attacker's violence against them, but deliberately introducing new attackers so your system is strengthened by them.

    Well done chaps, carry on.

    --
    Business/App ideas are like arseholes: everyone's got one, they're mostly shit, but very rarely they contain a diamond
  6. Java, meh by codepunk · · Score: 4, Funny

    Leave it to some Java developers to write 100k lines of code to do a shutdown -h now.

    --
    Got Code?
    1. Re:Java, meh by TubeSteak · · Score: 3, Funny

      1 line to do shutdown -h now.
      99,999 lines to build a GUI.

      That sounds about right.

      --
      [Fuck Beta]
      o0t!
  7. A Cure for "Unexpected" by retroworks · · Score: 3, Funny

    "We have found that the best defense against major unexpected failures is to fail often."

    In other words, you'll never be disappointed if you expect total incompetence. I've already achieved this same thing on my own with my Netflix account, by completely and utterly lowering my expectations.

    --
    Gently reply
  8. For a bit more background about Chaos Monkey by ZorroXXX · · Score: 2

    Jeff Atwood has a blog post, "Working with the Chaos Monkey".

    --
    When you are sure of something, you probably are wrong (search for "Unskilled and Unaware of It").