Slashdot Mirror


Netflix Gives Data Center Tools To Fail

Nerval's Lobster writes "Netflix has released Hystrix, a library designed for managing interactions between distributed systems, complete with 'fallback' options for when those systems inevitably fail. The code for Hystrix, which Netflix tested on its own systems, can be downloaded from GitHub; documentation, a getting-started guide, and operations examples are also available. Hystrix evolved out of Netflix's need to manage an increasing rate of calls to its APIs, and (according to the company) a 'dramatic improvement in uptime and resilience has been achieved through its use.' The Netflix API receives more than 1 billion incoming calls per day, which translates into several billion outgoing calls (averaging a ratio of 1:6) to dozens of underlying systems, with peaks of over 100,000 dependency requests per second. That's according to Netflix engineer Ben Christensen, who described the incredible loads on the company's infrastructure in a February blog posting. The vast majority of those calls serve the discovery user interfaces (UIs) of the more than 800 different devices supported by Netflix."
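
As a rough sanity check on the figures quoted in the summary, the arithmetic can be sketched as below. It assumes exactly 1 billion incoming calls per day and a flat 1:6 fan-out, so it gives a lower bound on the average rate; the variable names are illustrative only.

```python
# Back-of-the-envelope check of the load figures quoted in the summary.
# Assumes exactly 1 billion incoming calls/day and a flat 1:6 fan-out.
incoming_per_day = 1_000_000_000
fan_out = 6                      # outgoing dependency calls per incoming call
seconds_per_day = 24 * 60 * 60

outgoing_per_day = incoming_per_day * fan_out
average_outgoing_per_second = outgoing_per_day / seconds_per_day

# Quoted peak of "over 100,000" dependency requests per second.
quoted_peak_per_second = 100_000
peak_to_average = quoted_peak_per_second / average_outgoing_per_second

print(f"outgoing per day:        {outgoing_per_day:,}")
print(f"average outgoing/second: {average_outgoing_per_second:,.0f}")
print(f"peak / average:          {peak_to_average:.2f}")
```

Under those assumptions the average works out to roughly 69,000 dependency requests per second, so the quoted peak is only about one and a half times the average.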

10 of 75 comments

  1. Thank you Netflix! by siDDis · · Score: 4, Interesting

    Not only have you created an amazing tool, it is open source and the best part...it's actually well documented! Christmas came early this year!

    1. Re:Thank you Netflix! by Anonymous Coward · · Score: 5, Insightful

      (Netflix employee here, so forgive the AC)

      We don't use GPL code (and, assuming we were using GPLv2 code, given that we don't ship out server code, we wouldn't need to share it anyway), but:

      1. Netflix uses a ton of open-source technology. It's nice to be polite and give back;
      2. It's good publicity, which helps when we recruit people (which is something we do all the time);
      3. If it's good, then we'll have other people contribute to the software engineering efforts, which lowers the cost we pay to maintain and improve the software.

  2. 800 devices supported by interkin3tic · · Score: 2, Insightful

    But they can't possibly manage to bring it to Linux.

    1. Re:800 devices supported by msk · · Score: 2

      They also haven't had a version of the Android player that's worked well on the LG Optimus V since v1.2 and that version won't work any more.

    2. Re:800 devices supported by bill_mcgonigle · · Score: 3, Informative

      Probably has something to do with the Silverlight deal with Microsoft.

      Close, but 'confusing cause and effect'.

      Silverlight was a facet of the DRM deal that Netflix made with the Studios. So is not releasing a Linux client (because then, y'know, there would be Netflix rippers and movies on bittorrent...).

      Amazon plays movies on Flash on Linux, so Netflix made a bad deal (or perhaps Amazon benefited from not being 'first', same as when Apple pioneered online music with iTunes and got AES AAC while Amazon later had plain MP3). There's also a libnetflixplayer.so ELF-32 on Chromebook, so there's no technical obstacle.

      Presumably those contracts have a renewal period. Accept that there's no technical problem and focus on the legal (government) problems instead.

      --
      My God, it's Full of Source!
      OUTSIDE_IP=$(dig +short my.ip @outsideip.net)
    3. Re:800 devices supported by interkin3tic · · Score: 2

      I'd say it actually has more to do with their CEO being on the board of MS until recently.

  3. Plan for failure of components. by concealment · · Score: 4, Insightful

    One of the best changes in "design philosophy" of the past 20 years is that, instead of treating a product as a fortress that cannot fail, products are now designed to expect their components to fail and to recover gracefully when they do.

    This leads to a more flexible and resilient product. It reminds me of the military approach, where every system has at least two backups or alternates.
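
A minimal sketch of that "expect failure and recover gracefully" idea, in the spirit of the fallback behaviour the summary describes. This is not Hystrix's API or implementation: `GuardedCall`, `max_failures`, and the two example functions are hypothetical names invented for illustration.

```python
from typing import Callable, List


class GuardedCall:
    """Wrap a dependency call with a fallback and a simple failure counter.

    Hypothetical illustration only; not Hystrix's API or implementation.
    """

    def __init__(self, primary: Callable[[], List[str]],
                 fallback: Callable[[], List[str]],
                 max_failures: int = 3) -> None:
        self.primary = primary
        self.fallback = fallback
        self.max_failures = max_failures
        self.consecutive_failures = 0

    def execute(self) -> List[str]:
        # After too many consecutive failures, stop calling the dependency
        # and answer from the fallback straight away.
        if self.consecutive_failures >= self.max_failures:
            return self.fallback()
        try:
            result = self.primary()
        except Exception:
            self.consecutive_failures += 1
            return self.fallback()
        self.consecutive_failures = 0
        return result


primary_calls = []  # records each attempt against the failing dependency


def flaky_recommendations() -> List[str]:
    primary_calls.append(1)
    raise ConnectionError("recommendation service unavailable")


def default_recommendations() -> List[str]:
    return ["popular-title-1", "popular-title-2"]


guarded = GuardedCall(flaky_recommendations, default_recommendations,
                      max_failures=2)
results = [guarded.execute() for _ in range(4)]
```

Every caller still gets an answer, and the failing dependency is only attempted twice before the wrapper stops calling it. This sketch never re-tries the dependency once it has given up; a production-grade breaker would normally probe it again after a cool-down period.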

  4. Re:But does it include the chaos monkey? by CrankyFool · · Score: 5, Informative

    Hystrix does not include Chaos Monkey, but Chaos Monkey was open-sourced some time ago.

    (I work at Netflix)

  5. Re:Sounds like Netflix is a mess by curunir · · Score: 4, Insightful

    Your critique seems overly simplistic. An HTTP load balancer is great for HTTP calls, but not everything in a complex infrastructure is HTTP. There are queues, data stores, caches, RPC, filesystem access (SAN, NAS or local) and more that shouldn't run behind an HTTP interface. This tool helps solve the problem and gives you health check monitoring and metrics in the process. On initial inspection, my only complaint is that it requires too much modification of application code. However, it seems like it should be pretty simple to integrate with the various IoC frameworks and use AOP proxies to apply the tool declaratively based on annotations.
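
To illustrate what applying such a wrapper declaratively could look like, here is a minimal sketch that uses a Python decorator in place of the Java annotations and AOP proxies mentioned above. It is not part of Hystrix or of any IoC framework; `with_fallback`, `cached_profile`, and `fetch_profile` are hypothetical names.

```python
import functools
from typing import Any, Callable


def with_fallback(fallback: Callable[..., Any]) -> Callable[[Callable[..., Any]], Callable[..., Any]]:
    """Return a decorator that routes failures of the wrapped call to `fallback`.

    Hypothetical illustration; the fallback receives the same arguments.
    """
    def decorator(func: Callable[..., Any]) -> Callable[..., Any]:
        @functools.wraps(func)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            try:
                return func(*args, **kwargs)
            except Exception:
                # The application code stays unchanged; the failure handling
                # lives entirely in the wrapper.
                return fallback(*args, **kwargs)
        return wrapper
    return decorator


def cached_profile(user_id: int) -> dict:
    # Stand-in for a stale but usable answer, e.g. from a local cache.
    return {"id": user_id, "name": "unknown", "stale": True}


@with_fallback(cached_profile)
def fetch_profile(user_id: int) -> dict:
    # Stand-in for a remote call that is currently failing.
    raise TimeoutError("profile service timed out")


profile = fetch_profile(42)
```

The business function carries only the one-line marker, which is the kind of low-touch integration the comment is asking for.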

    And you do realize that you followed up a weak critique of a backend scalability tool with a critique about a failing of their front-end application, right? What relevance does that have?

    --
    "Don't blame me, I voted for Kodos!"
  6. Re:Confusing Title by bws111 · · Score: 2

    So what you are saying is that in the absence of these tools the data centers would NOT fail? That is just stupid. With the tools, the data centers are RECOVERING from failure, or AVOIDING failure, or some such. Take out those important words, and you convey the exact opposite meaning from that which was intended. And that is pretty much the definition of a really crappy headline.