Slashdot Mirror


Chaos Monkey Released Into the Wild

Quince alPillan writes "Netflix revealed today that they've released Chaos Monkey, an open source Amazon Web Service testing tool that will randomly turn off instances in Auto Scaling Groups. 'We have found that the best defense against major unexpected failures is to fail often. By frequently causing failures, we force our services to be built in a way that is more resilient. We are excited to make a long-awaited announcement today that will help others who embrace this approach. ...source code for the founding member of the Simian Army, Chaos Monkey, is available to the community.'"
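As a rough illustration of the behavior the summary describes (randomly turning off instances in Auto Scaling Groups), here is a minimal in-memory sketch. It is not the released Chaos Monkey code and makes no AWS calls; the group names, instance IDs, and the `pick_victims` and `terminate` helpers are hypothetical, and the one-victim-per-group rule is an assumption made for the example.

```python
import random


def pick_victims(groups, probability, rng):
    """Return {group_name: instance_id} for the groups selected this round.

    groups: dict mapping a group name to a list of instance IDs (hypothetical model).
    probability: chance (0.0 to 1.0) that a given group loses one instance.
    rng: a random.Random, passed in so a run can be reproduced.
    """
    victims = {}
    for name, instances in sorted(groups.items()):
        if not instances:
            continue  # nothing to terminate in an empty group
        if rng.random() < probability:
            victims[name] = rng.choice(instances)
    return victims


def terminate(groups, victims):
    """Remove the chosen instances from the in-memory model (no real API calls)."""
    for name, instance_id in victims.items():
        groups[name].remove(instance_id)


if __name__ == "__main__":
    groups = {
        "api": ["i-api-1", "i-api-2", "i-api-3"],
        "web": ["i-web-1", "i-web-2"],
        "batch": [],
    }
    rng = random.Random(42)
    victims = pick_victims(groups, probability=0.5, rng=rng)
    terminate(groups, victims)
    print("terminated:", victims)
    print("remaining:", groups)
```

Passing the random generator in explicitly keeps a run repeatable, which helps when a particular termination exposes a bug that needs to be reproduced.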

49 of 76 comments

  1. Into the wild? by dubl-u · · Score: 4, Informative

    And by "into the wild", they mean they're now letting it run on other people's sites.

    1. Re:Into the wild? by jcoy42 · · Score: 5, Funny

      This is why we don't let you write headlines.

      --
      Never trust an atom. They make up everything.
    2. Re:Into the wild? by Anonymous Coward · · Score: 1

      I think the concept is good: if the goal is to withstand failures of an unforeseen nature, then test with random failures and observe how the software reacts.

      In practice, it will likely get you fired in a production environment, but I still think the idea behind it is sound.

    3. Re:Into the wild? by inKubus · · Score: 5, Insightful

      Sound idea, sure. But not a substitute for good engineering. You see this issue come up again and again with these cloud services. The pressure from sales and marketing to move quickly and monetize the idea (and support lots of subscribers quickly) is not conducive to building a solid infrastructure. Netflix's approach is actually the exact opposite of Amazon's. Amazon's system is highly engineered and designed to resist failures that take down Amazon.com for its customers. That is their number one goal. Amazon.com has not been down for a long time. AWS is an offshoot of that effort to resell their extra cycles but it's not nearly as engineered as the Amazon.com application built on top, which redirects around the globe and does lots of other things. It seems that AWS always has some new service coming out, but think about this: all those services were probably made by Amazon 3 years ago and they are just now releasing them to you.

      Netflix, on the other hand, seems to be just hacking together a site, if this is really what they primarily used to QA their application. What you're doing with this random failure thing is just statistically creating errors and finding bugs in failure handling code statistically. This means there's _up to_ an infinite number of bugs that will *not* be found with this method because they are unlikely or the tester is unlucky.
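      The statistical point above can be put in numbers. This is a minimal sketch that assumes each random test run is independent and hits a given failure mode with a fixed probability p; real failures are rarely that tidy, and both function names are made up for the example.

```python
import math


def detection_probability(p, n):
    """Chance that a fault with per-run probability p shows up at least once in n independent runs."""
    if not 0.0 <= p <= 1.0 or n < 0:
        raise ValueError("need 0 <= p <= 1 and n >= 0")
    return 1.0 - (1.0 - p) ** n


def trials_needed(p, confidence):
    """Smallest n for which detection_probability(p, n) >= confidence."""
    if not 0.0 < p < 1.0 or not 0.0 < confidence < 1.0:
        raise ValueError("need 0 < p < 1 and 0 < confidence < 1")
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p))


if __name__ == "__main__":
    for p in (0.01, 0.0001, 0.000001):
        print(p, trials_needed(p, 0.95))
```

      Under these assumptions, a failure mode that shows up in 1% of runs needs 299 runs to be seen at least once with 95% confidence, and a one-in-a-million mode needs roughly three million.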

      It certainly has to do with the math of it, but it also has to do with the human culture that arises when working like this. See, with this brute force iterative programming, you are building a nest of patches. So what you are going to end up with is going to be more complicated and less functional than if you do the hard work. And that's the issue. Thinking about stuff in terms of thousands or millions of nodes is "too hard" so the aforementioned cloud providers keep coming up with "creative solutions" like this. (I remember reading about Facebook hacking MySQL a few years back and shaking my head as well.) But, like "creative accounting", it may not be illegal but it may get you into trouble. You're never going to be absolutely sure the application will stay up and available. Ok, fine, so if Netflix goes down no one's going to die, but still... there's millions of dollars and subscriber goodwill at stake and that's not nothing.

      Anyway, don't think that I'm railing against creative testing, but they shouldn't think they are so clever as the release seems to imply they think they are ;)

      --
      Cool! Amazing Toys.
    4. Re:Into the wild? by arkhan_jg · · Score: 3, Informative

      Seems more about that they've just published the source code on github under the Apache licence.

      So you can run your own chaos monkey on your own amazon cloud systems, or modify it to run on your private cloud, or whatever.

      --
      Remember kids, it's all fun and games until someone commits wholesale galactic genocide.
    5. Re:Into the wild? by dave420 · · Score: 2

      That's a lot of guesswork... I don't see many links backing your positions up.

    6. Re:Into the wild? by eyrieowl · · Score: 3, Insightful

      There are a lot of things that can go wrong in failover scenarios. Unless and until they are tested in real world situations, you can't be certain the system works. I happen to know of many systems which had failover processes which were "tested", and sounded fine on paper, but when it came to the real world, they had failed to account for this or that unexpected condition which ended up leading to far more downtime than was expected. If Chaos Monkey is their ONLY way of arriving at a resilient service, then sure, they have a deeper issue. But if they've spent time trying to design a solid system and then they're using Chaos Monkey to make sure it's as bullet-proof as they think it is, then it's good, solid engineering for the real world. I am reminded of the book "Inviting Disaster", on technology failures. All the systems described in the book which failed were well engineered systems. But due to a series of events working in concert, disaster happened. Any one link in the chain of failures wouldn't be enough; and it is not possible to fully engineer that out of your system; and certainly not possible to test for that in controlled testing environments. But if you can start causing failures in the real world (which is a luxury you have with systems that don't actually keep people alive), you have the opportunity to eliminate those sorts of weaknesses from the system. That's what I think is the value to something like this.

    7. Re:Into the wild? by Anonymous Coward · · Score: 1

      The superiority of Amazon's engineering culture over Netflix must be the reason why during the last major EC2 outage, Netflix managed to stay up and operational...

    8. Re:Into the wild? by Anonymous Coward · · Score: 1

      I've been using Netflix for years now - and I've only had trouble with the streaming service once, maybe twice in that time. Hulu often freaks out, Vudu the same - so I think the proof is in the pudding. Lastly, this is just an additional testing piece, and frankly it is very cool. Nowhere did they say it was a substitute for good engineering.

    9. Re:Into the wild? by metrometro · · Score: 2

      If Netflix is hacking together a site, why is their HD streaming more reliably pleasing than any other online service, including places like Comcast, which presumably has 100x the engineers on hand? Maybe they are good at teh hacking?

    10. Re:Into the wild? by JackieBrown · · Score: 2

      It is better than Amazon's on the PS3 for me. Amazon gets stuck buffering a lot for me.

    11. Re:Into the wild? by rwa2 · · Score: 2

      Meh, what's the point of good engineering if you never test it? I've heard of quite a few wonderfully expensive and over-engineered UPS and RAID deployments that failed completely because they never bothered to actually test the procedures. The last company I worked at would often have regular "emergency power off" events where they'd do a complete shutdown of the entire datacenter triggered by various environmental factors. And you know what? More times than not they'd still find a system that somehow missed the trap and didn't get shut down properly, and plenty of caveats with the enterprise-grade UPS infrastructure.

      At one of the first companies I worked at, the idea was to engineer a cluster with no SPOF, so we'd actually invite customers (/monkeys) to go to the back and rip out / unplug something, anything, while the cluster was doing something like a distributed POVRay render. It was a pretty simple, elegant test, and a great mindset to have when designing any HA system, not just for fault tolerance, but also to architecturally enable for on-line upgradeability, scalability, and some other niceties.

    12. Re:Into the wild? by luis_a_espinal · · Score: 1

      That this post was modded 5 is a sad testament to slashdot.

      Sound idea, sure. But not a substitute for good engineering.

      That argument only makes sense if it were the case that Netflix is using it in lieu of good engineering. But, it isn't, so...

      Also, this is a false dichotomy. Chaos Monkey is in great part a form of fault injection, which itself is part of good engineering.

      You see this issue come up again and again with these cloud services.

      Like Amazon EC2?

      The pressure from sales and marketing to move quickly and monetize the idea (and support lots of subscribers quickly) is not conducive to building a solid infrastructure. Netflix's approach is actually the exact opposite of Amazon's.

      You know this for a fact, or is it pure speculation?

      Amazon's system is highly engineered and designed to resist failures that take down Amazon.com for its customers. That is their number one goal. Amazon.com has not been down for a long time. AWS is an offshoot of that effort to resell their extra cycles but it's not nearly as engineered as the Amazon.com application built on top, which redirects around the globe and does lots of other things. It seems that AWS always has some new service coming out, but think about this: all those services were probably made by Amazon 3 years ago and they are just now releasing them to you.

      Great non sequitur.

      Netflix, on the other hand, seems to be just hacking together a site, if this is really what they primarily used to QA their application.

      Seems? Seems? First you state in very certain terms that Netflix is doing the exact opposite of Amazon. And then you say that Netflix's modus operandi seems hacky? You built an entire argument against Netflix from what it seems to you?

      What you're doing with this random failure thing is just statistically creating errors and finding bugs in failure handling code statistically.

      And this is bad because? Ever heard of fault injection? I don't know man, but fault injection has always been part of good engineering workbooks.

      This means there's _up to_ an infinite number of bugs that will *not* be found with this method because they are unlikely or the tester is unlucky.

      1. This method (and fault injection in general) is not meant to discover all the bugs, nor is it being billed by Netflix for that purpose. The argument makes for a good strawman, though.

      2. Fault injection or not, for any large piece of software built, there will always be bugs that will remain undiscovered. Always. This is independent of whether fault injection is used or not as part of development/QA processes.

      When you use a fault injection method independent of a developer's POV, the objective is to create scenarios where bugs will manifest themselves during the development process. This is distinct from a QA/Tester that verifies software according to established test scenarios. It is equally distinct from stress/load testing.

      How different is this from manually injecting a fault in a system to see if it can cope with it? Say, kill -9 your test database while your app writes to it to see how it handles the error and brings an appropriate error page to the user (as opposed to an ugly http server 500 page)? Bring it back to see if your app can reconnect to it for future transactions? Kill your LDAP server while users are logging into your app to verify that already logged-in users are not affected by login failures (they shouldn't be, but most systems fail miserably at this.) Force your thread/connection pool to be size 1 and flood it with requests, injecting out-of-capacity failure, to see how your system manages it? Does it drop dead? Can it recover?
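      The first scenario above (kill the database mid-write, check the error handling, bring it back, check recovery) can be sketched without a real database. Everything here is a stand-in: `FakeDatabase` and `handle_request` are hypothetical names, and a real test would kill an actual process rather than flip a flag.

```python
class FakeDatabase:
    """Stand-in for a real database; kill() and restart() toggle availability."""

    def __init__(self):
        self.up = True
        self.rows = []

    def kill(self):
        self.up = False  # simulates `kill -9` on the database process

    def restart(self):
        self.up = True

    def write(self, row):
        if not self.up:
            raise ConnectionError("database is down")
        self.rows.append(row)


def handle_request(db, row):
    """Application code under test: turn a backend failure into a friendly response."""
    try:
        db.write(row)
    except ConnectionError:
        # Degrade gracefully instead of leaking a raw server error page.
        return 503, "Service temporarily unavailable, please retry."
    return 200, "saved"


if __name__ == "__main__":
    db = FakeDatabase()
    print(handle_request(db, "order-1"))  # normal operation
    db.kill()                             # inject the fault
    print(handle_request(db, "order-2"))  # should fail gracefully
    db.restart()                          # bring the backend back
    print(handle_request(db, "order-3"))  # should work again
```

      The same shape works for the LDAP and connection-pool scenarios: inject the fault, assert on what the user sees, undo the fault, assert that the system recovers.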

      What this Netflix tool is doing is simply automating the process of fault injection. What develope

    13. Re:Into the wild? by luis_a_espinal · · Score: 1

      That's a lot of guesswork... I don't see many links backing your positions up.

      His positions are superficial and emotional, that's all.

    14. Re:Into the wild? by inKubus · · Score: 1

      Thanks for taking the time to reply to my post, I appreciate it.

      --
      Cool! Amazing Toys.
    15. Re:Into the wild? by inKubus · · Score: 1

      To clarify what I specifically wrote in my post, Amazon.com (Amazon's application, where they make the money), has not been down in a long time. The Virginia EC2 outage only affected the excess capacity they resell to AWS customers. I'm not singling out Netflix and I'm not saying that this is a bad or horrible or un-useful tool. I appreciate all the stuff Netflix is open-sourcing.

      --
      Cool! Amazing Toys.
  2. Re:This ... by Anonymous Coward · · Score: 1

    panic(cpu 0): Enraged Monkey Error: Out of bananas!

  3. Very Erlang-y by Anonymous Coward · · Score: 3, Informative

    We have found that the best defense against major unexpected failures is to fail often. By frequently causing failures, we force our services to be built in a way that is more resilient.

    Sounds like what has been common in Erlang for decades.

    Off topic: when I watch the /. homepage, I am logged in. As soon as I click on a story, I become an Anonymous Coward. Did anybody else experience this bug too?

    1. Re:Very Erlang-y by colinrichardday · · Score: 1

      I don't have such a problem.

    2. Re:Very Erlang-y by NVW55V · · Score: 1

      Sounds like a cookie permission problem to me. Get rid of saved cookies, look through Tools-Options-Privacy-History section-Exceptions button, find relevant Slashdot entries, check their status, delete or modify as needed.

    3. Re:Very Erlang-y by FatdogHaiku · · Score: 1

      Off topic: when I watch the /. homepage, I am logged in. As soon as I click on a story, I become an Anonymous Coward. Did anybody else experience this bug too?

      Some would see this as a super power... of course they're already trolls, but mighty trolls with super powers.
      Seriously, I've seen something like this in FF with some privacy plugins, but it's been a while.

      --
      You have the right to remain sentient. If you give up the right to remain sentient, you will be elected to public office
    4. Re:Very Erlang-y by 19thNervousBreakdown · · Score: 1

      You're probably disabling subdomain cookies. For instance right now we're not on slashdot.org, we're on it.slashdot.org.
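      Why a subdomain matters can be shown with a simplified version of the cookie domain-matching rule from RFC 6265. This sketch assumes the login cookie is either host-only or carries a Domain attribute; it skips the IP-address and public-suffix checks a real browser performs, and whether this is actually the poster's problem is only a guess.

```python
def domain_match(host, cookie_domain):
    """Simplified RFC 6265 domain-match: equal, or host ends with '.' + cookie_domain."""
    host = host.lower()
    cookie_domain = cookie_domain.lower().lstrip(".")  # a leading dot is ignored
    if host == cookie_domain:
        return True
    return host.endswith("." + cookie_domain)


def cookie_is_sent(request_host, cookie_domain, host_only):
    """Decide whether a cookie goes out with a request to request_host.

    host_only=True models a cookie set without a Domain attribute, or a
    browser setting that refuses to share cookies with subdomains.
    """
    if host_only:
        return request_host.lower() == cookie_domain.lower()
    return domain_match(request_host, cookie_domain)


if __name__ == "__main__":
    # Logged in on the homepage, then clicking through to a story:
    print(cookie_is_sent("slashdot.org", "slashdot.org", host_only=True))
    print(cookie_is_sent("it.slashdot.org", "slashdot.org", host_only=True))
    print(cookie_is_sent("it.slashdot.org", "slashdot.org", host_only=False))
```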

      --
      <xml><I><am><so><damn>Web 2.0</damn></so></am></I></xml>
  4. Missleading title by valentinas · · Score: 4, Funny

    I thought this was about monkeys...

    1. Re:Missleading title by Anonymous Coward · · Score: 1

      I'm just wondering what makes Chaos Monkey different than Timetwister, and how much mana it costs.

    2. Re:Missleading title by gman003 · · Score: 1

      Freakin' PING can be turned into a DDOS tool in like five seconds. Doesn't mean it shouldn't be distributed.

      Also, I imagine it needs some form of authentication to actually turn your site off. Which means you'd have to already have a privileged username/password for each site you want to attack. Pretty poor DDOS tool.

    3. Re:Missleading title by hawguy · · Score: 2

      they just released the source to something that can be turned into a ddos tool in like 5 minutes? seriously?

      If someone else has the private keys that let them control your EC2 instances, then you probably have more to worry about than a tool that will randomly shut down your running instances.

    4. Re:Missleading title by azalin · · Score: 1

      No.
      Please try to read the summary again. If anyone has gained the access level required to run this software, you are already f*cked beyond rescue.

    5. Re:Missleading title by slashmydots · · Score: 1

      You would get approximately the same result if you let an actual monkey loose in your server room though.

  5. Chaos Monkey? by Anonymous Coward · · Score: 1

    One black, one red, one green, one blue and one white mana + X, where X is random. Throw any in play instant through the room: if it lands face-down eat a banana. If it lands face-up the instant is played as normal.

  6. The Truth by paleo2002 · · Score: 1, Informative

    War, famine, violence, addiction, pollution . . . truly, WE are the Chaos Monkeys!

  7. Better go underground ... by geofgibson · · Score: 1

    Now we see the beginning of the Army of the 12 Monkeys. We're doomed ...

  8. Obligatory... by CODiNE · · Score: 3, Funny
    --
    Cwm, fjord-bank glyphs vext quiz
    1. Re:Obligatory... by honestmonkey · · Score: 1

      Don't know if you noticed this:

      We kept our system flags in an area of very low memory reserved for the system globals, starting at address 256 ($100 in hexadecimal)

      100 bucks for an address?

      Cool story, though.

      --
      Everything you know is wrong, Just forget the words and sing along.
  9. I love this thing by ghostdoc · · Score: 3, Interesting

    Not only for the idea that a serious company lets a masturbating-and-throwing-poo grinning idiot loose in their sensitive vitals, but also because it draws so many parallels with other resilient systems.

    Allergies cured by parasitical worms? Chaos Monkey Effect - you need something attacking your defences for your system to stay healthy

    Ecosystem that relies on bushfires to clear old vegetation? Chaos Monkey Effect

    Something almost Zen about not only turning an attacker's violence against them, but deliberately introducing new attackers so your system is strengthened by them.

    Well done chaps, carry on.

    --
    Business/App ideas are like arseholes: everyone's got one, they're mostly shit, but very rarely they contain a diamond
    1. Re:I love this thing by bdabautcb · · Score: 1

      I was going to attack your attack of bushfires... until I re-read your allergy attack sentence and realized you have it right. Good work, fellow ecology nerd.

      --
      Koalas. They're telepathic. Plus, they control the weather. -Margaret
  10. Java, meh by codepunk · · Score: 4, Funny

    Leave it to some java developers to write 100k lines of code to do a shutdown -h now.

    --
    Got Code?
    1. Re:Java, meh by TubeSteak · · Score: 3, Funny

      1 line to do shutdown -h now.
      99,999 lines to build a GUI.

      That sounds about right.

      --
      [Fuck Beta]
      o0t!
    2. Re:Java, meh by codepunk · · Score: 1

      Personally I am shocked they did not write it in Scala.

      --
      Got Code?
  11. oh..THAT chaos monkey by pablo_max · · Score: 1

    at first I was thinking the article was about this chaos monkey.
    http://www.wtop.com/681/2859976/Rock-throwing-chimp-plans-complex-attacks-on-visitors

  12. Excellent idea and great work by Anonymous Coward · · Score: 1

    Congrats team, give yourselves a slap in the face!

  13. Good idea for mobile devs too... by Kelson · · Score: 1

    Except they need to randomly turn off the network connection in their test environment. It's amazing how many mobile apps assume you'll always have a solid connection and never be in an elevator, or walking between tall buildings, or the basement of a convention center, or any other place with a spotty or overloaded signal.
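    A test environment along those lines could wrap the network layer in something that drops requests at random. This is a minimal sketch, not any real mobile framework's API: `FlakyNetwork`, `fetch_with_fallback`, the URL, and the retry-then-cache policy are all assumptions made for the example.

```python
import random


class FlakyNetwork:
    """Test double for a network layer that fails a configurable share of requests."""

    def __init__(self, failure_rate, rng):
        self.failure_rate = failure_rate  # 0.0 = perfect signal, 1.0 = elevator
        self.rng = rng

    def fetch(self, url):
        if self.rng.random() < self.failure_rate:
            raise ConnectionError("no signal")
        return "payload for " + url


def fetch_with_fallback(network, url, cache, retries=3):
    """App code under test: retry a few times, then fall back to cached data."""
    for _ in range(retries):
        try:
            data = network.fetch(url)
        except ConnectionError:
            continue  # try again; a real app would also back off between attempts
        cache[url] = data
        return data, "live"
    if url in cache:
        return cache[url], "cached"
    return None, "offline"


if __name__ == "__main__":
    cache = {}
    network = FlakyNetwork(failure_rate=0.5, rng=random.Random(7))
    for _ in range(5):
        print(fetch_with_fallback(network, "https://example.com/feed", cache))
```

    Setting the failure rate to 1.0 is the quickest way to find the screens that assume the request always succeeds.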

  14. Wait.. by moniker127 · · Score: 1

    I didn't tell anyone about the chaos monkey.... Oh. It's just some program. Carry on then.

  15. The shot heard 'round the Web... by BenJCarter · · Score: 1

    The media war is getting serious. Chaos Monkeys? How about you get Stars back?

    --
    For in politics, as in religion, it is equally absurd to aim at making proselytes by fire and sword. - Publius
  16. A Cure for "Unexpected" by retroworks · · Score: 3, Funny

    "We have found that the best defense against major unexpected failures is to fail often."

    In other words, you'll never be disappointed if you expect total incompetence. I've already achieved this same thing on my own with my Netflix account, by completely and utterly lowering my expectations.

    --
    Gently reply
  17. In other news... by jaymemaurice · · Score: 1

    Script kiddies are released on the internet to improve security by exploiting unchecked buffers and unsanitized inputs...
    Security of information at all time high.

    Errrr...

    --
    120 characters ought to be enough for anyone
  18. For a bit more background about Chaos Monkey by ZorroXXX · · Score: 2

    Jeff Atwood has a blog post, Working with the Chaos Monkey.

    --
    When you are sure of something, you probably are wrong (search for "Unskilled and Unaware of It").
  19. God, title sounded WAY more awesome by PJ6 · · Score: 1

    than it actually was.

    I was picturing a wild, multicolored, gene-spliced ball of fur tearing around, shoving badgers in lion's ears 'n shit.

  20. Re:Chaos Reigns. by jimi1x · · Score: 1

    In the future there is only war.

  21. Chaos Monkey start up this morning... by TheLoneGundam · · Score: 1

    Chaos Monkey start up get working Chaos Monkey is a hoot Chaos Monkey's preferred new target is instance of a group... (the rest is left as an exercise for Coulton fans)