Netflix Gives Data Center Tools To Fail
Nerval's Lobster writes "Netflix has released Hystrix, a library designed for managing interactions between distributed systems, complete with 'fallback' options for when those systems inevitably fail. The code for Hystrix, which Netflix tested on its own systems, can be downloaded at GitHub, with documentation available here, in addition to a getting-started guide and operations examples, among others. Hystrix evolved out of Netflix's need to manage an increasing rate of calls to its APIs; according to the company, a 'dramatic improvement in uptime and resilience has been achieved through its use.' The Netflix API receives more than 1 billion incoming calls per day, which translates into several billion outgoing calls (averaging a ratio of 1:6) to dozens of underlying systems, with peaks of over 100,000 dependency requests per second. That's according to Netflix engineer Ben Christensen, who described the incredible loads on the company's infrastructure in a February blog posting. The vast majority of those calls serve the discovery user interfaces (UIs) of the more than 800 different devices supported by Netflix."
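For anyone wondering what "fallback options for when those systems fail" could look like in practice, here is a rough sketch of the general circuit-breaker-with-fallback pattern in Python. To be clear, this is not Hystrix's API (Hystrix is a Java library); the class name `CircuitBreaker`, the `max_failures` and `reset_after` parameters, and the injectable `clock` are all made up for illustration.

```python
import time


class CircuitBreaker:
    """Toy circuit breaker (hypothetical, not the Hystrix API).

    After `max_failures` consecutive failures the circuit opens and calls
    go straight to the fallback until `reset_after` seconds have passed.
    """

    def __init__(self, max_failures=3, reset_after=5.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock          # injectable so the timing can be tested
        self.failures = 0
        self.opened_at = None       # None means the circuit is closed

    def call(self, primary, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                # Open: answer from the fallback, leave the dependency alone.
                return fallback()
            # Cool-down elapsed: let one trial call through (half-open).
        try:
            result = primary()
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.max_failures:
                # Open the circuit, or re-open it after a failed trial call.
                self.opened_at = self.clock()
            return fallback()
        # Success: close the circuit and reset the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

You would wrap each dependency call as something like `breaker.call(fetch_recommendations, lambda: cached_recommendations)`, where both names are placeholders. The point of the pattern is that once a dependency is clearly struggling, callers stop piling more requests onto it and serve a degraded answer instead.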
Happens all the time here, so kettle, meet black !!
Not only have you created an amazing tool, it is open source and the best part...it's actually well documented! Christmas came early this year!
I hate to say it, but the only thing I take away from this is that Netflix's software is such an unwieldy mess that they need a library just to enforce application separation and provide default fall-backs when a service call does fail.
FWIW, my preferred "circuit breaker" is a load balancer... All possible requests are network calls that go through load balancers, where it goes to the most responsive server, and if your admins screw up and none of the servers are responding quickly enough to answer the health check in time, a standard HTTP service unavailable response is supplied, without hammering the busy back-end.
And with all that complexity and effort, Netflix still can't handle two movies in your queue being assigned the same number, or a mixture of reordering and deleting titles at the same time... Things it handled just fine years ago, when it was a much smaller, web-1.0 service that didn't even require javascript.
Slashdot gets worse every day... Pipedot: News for nerds, without the corporate slant
But they can't possibly manage to bring it to Linux.
One of the best changes in "design philosophy" that has happened in the past 20 years is that instead of the idea of any product as a fortress that cannot fail, products are designed to expect their components to fail, and to recover gracefully from it.
This leads to a more flexible and resilient product. It reminds me of the military approach, where every system has at least two backups or alternates.
I read that as Hysterix:
One becomes Hysterixical when their data center components fail.
Hystrix does not include Chaos Monkey, but Chaos Monkey was opensourced some time ago.
(I work at Netflix)
I can't be the only one having trouble parsing the title of this article, "Netflix Gives Data Center Tools To Fail". What does it mean to "give something to fail"? I thought "fail" was a verb, and it doesn't make sense as the object of the verb "give". I've heard of the phrase "given to failure", but that doesn't seem to be what's being implied here.
Yeesh.
Hystrix does not include Chaos Monkey, but Chaos Monkey was opensourced some time ago.
(I work at Netflix)
It was even covered on Slashdot.
... that goes down every time someone breaks wind in an AWS datacenter, right?
"They that can give up essential liberty to obtain a little temporary safety deserve neither liberty nor safety."
I wonder how many pointy-haired bosses have used Chaos Monkey to load test their own Amazon setup but accidentally hit someone else's servers (or is it somehow PHB proof?)
It depends on your definition of "someone else's". It uses your AWS credentials to kill an instance, so in the worst case, the PHB of group A in company Z could kill instances of group B in company Z; company Y would still be safe.
Good point, sounds like it might be PHB-proof