Collaborative Filtering and the Rise of Ensembles
igrigorik writes "First the Netflix challenge was won with the help of ensemble techniques, and now the GitHub challenge is over, and more than half of the top entries are also based on ensembles. Good knowledge of statistics, psychology and algorithms is still crucial, but the ensemble technique alone has the potential to make the collaborative filtering space a lot more, well, collaborative! Here's a look at the basic theory behind ensembles, how they shaped the results of the GitHub challenge, and how this pattern can be used in the future."
So can ensembles be used to create more sophisticated forms of direct democracy? That is, where everyone has input into decision-making, but where that input is vastly more complex than simple majority rule by the mob?
FWIW, this is called open source governance.
Of course having a group of people working together is a strength. If you are having a bad day or just feel like slacking off, someone else is there to pick up the slack and keep the project moving. See also Division of labour http://en.wikipedia.org/wiki/Division_of_labour
There's a lot of argument over why ensemble techniques work well in general, when using them on well-posed statistical problems. But in the collaborative filtering case, they work well at least in part because there's not a canonical way of posing the problem statistically that's also tractable; there are instead multiple ways to view the problem, which expose different information. Aggregating those views is a pretty straightforward way of getting more information.
For example, you can see the Netflix prize as a few different standard statistical problems. One is a per-movie regression: predict what Person A will rate Movie B, given the ratings vector of Person A and the ratings vectors of everyone who's already rated Movie B [the per-person ratings vectors excluding B are the X's, and the ratings on B are the Y's]. Or you can slice the movie-ratings matrix the other way, with per-movie ratings vectors as the X's. Add in some other views (those are the two most straightforward), aggregate all the info you get from them, and you do better than with any one approach alone.
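To make the "two slices of the same matrix" idea concrete, here is a minimal sketch. It is not the regression setup described above and not anything from the actual Netflix entries: it swaps in a simple similarity-weighted average, and the ratings table, `predict_by_rows`, and `blend` are all hypothetical names made up for illustration. The same function is run once on the person-by-movie table and once on its transpose, and the two predictions are averaged.

```python
from math import sqrt

# Hypothetical toy data: RATINGS[person][movie] = stars (1-5).
RATINGS = {
    "ann": {"alien": 5, "brazil": 4, "clue": 1},
    "bob": {"alien": 4, "brazil": 5, "clue": 2, "dune": 5},
    "cat": {"alien": 1, "brazil": 2, "clue": 5, "dune": 1},
}

def cosine(a, b):
    """Cosine similarity over the keys that two rating dicts share."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[k] * b[k] for k in shared)
    norm_a = sqrt(sum(a[k] ** 2 for k in shared))
    norm_b = sqrt(sum(b[k] ** 2 for k in shared))
    return dot / (norm_a * norm_b)

def transpose(table):
    """Turn table[row][col] into table[col][row]."""
    out = {}
    for row_key, row in table.items():
        for col_key, value in row.items():
            out.setdefault(col_key, {})[row_key] = value
    return out

def predict_by_rows(table, target_row, column):
    """Similarity-weighted average of the other rows' values in `column`."""
    me = table[target_row]
    num = den = 0.0
    for key, row in table.items():
        if key == target_row or column not in row:
            continue
        weight = cosine(me, row)
        num += weight * row[column]
        den += weight
    return num / den if den else None

def blend(predictions):
    """Ensemble step: plain average of whichever views produced a value."""
    usable = [p for p in predictions if p is not None]
    return sum(usable) / len(usable) if usable else None

# View 1: rows are people (who rates like ann?).
person_view = predict_by_rows(RATINGS, "ann", "dune")
# View 2: rows are movies (which movies are rated like dune?).
movie_view = predict_by_rows(transpose(RATINGS), "dune", "ann")
combined = blend([person_view, movie_view])
```

A real ensemble would learn the blending weights on held-out ratings rather than averaging equally; the equal weights here are just the simplest assumption.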
10 PRINT CHR$(205.5+RND(1)); : GOTO 10
One of the difficulties of ensemble development is weighting the logic that is being developed. For instance, one of the problems we deal with at my job is matching incoming text to its cleaned value. We have a list of approved words ['happy', 'sad', 'angry', 'sleepy'], and a text input of 'hap'. We need to determine which valid word 'hap' should match. Some rules I can think of for properly matching are:
:D
1.) Length of input compared to cleaned word.
2.) Number of nonpositional letter matches.
3.) Number of positional letter matches.
How the rules are weighted determines what the answer will be (either 'sad' or 'happy'). I know at my job this weighting process requires very careful politicking.
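The three rules and the weighting problem can be sketched roughly like this. The scoring formulas, the normalization to a 0..1 range, and the function names are all assumptions for illustration; the comment above only names the rules, not how they are computed. The input is assumed to be non-empty.

```python
from collections import Counter

APPROVED = ["happy", "sad", "angry", "sleepy"]

def length_score(text, word):
    """Rule 1: 1.0 when lengths are equal, lower as they diverge."""
    return 1 - abs(len(text) - len(word)) / max(len(text), len(word))

def nonpositional_score(text, word):
    """Rule 2: fraction of input letters found anywhere in the word."""
    shared = Counter(text) & Counter(word)  # multiset intersection
    return sum(shared.values()) / len(text)

def positional_score(text, word):
    """Rule 3: fraction of input letters matching at the same index."""
    return sum(a == b for a, b in zip(text, word)) / len(text)

RULES = (length_score, nonpositional_score, positional_score)

def best_match(text, weights, words=APPROVED):
    """Return the approved word with the highest weighted rule total."""
    def total(word):
        return sum(w * rule(text, word) for w, rule in zip(weights, RULES))
    return max(words, key=total)

# Equal weights favor the letter matches; a heavy length weight flips it.
equal_weights_pick = best_match("hap", (1, 1, 1))
length_heavy_pick = best_match("hap", (10, 1, 1))
```

With equal weights the letter-match rules dominate and 'hap' goes to 'happy'; weight the length rule heavily enough and the three-letter 'sad' wins instead, which is exactly the sensitivity described above.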
OK, I must admit that the contents of this article were way over my simplistic head. Can someone give me the management summary in layman's terms?
I want that in my jquery to be bundled with windows 7
I have been a developer since the 80's. There are so many trendy cute buzzwords flying around (cloud computing...what the ****?), "mashups", "ensemble", etc. Can't we just call it what it is instead of this marketing crap? So tired of it.
Isn't this about encoding data in several different ways and then using a combined result?
Machine learning ensembles sound a lot like Monte Carlo tree search (MCTS) techniques (UCT is the best-known variant), which are used in computer Go (and more and more other AI problems) with great success.
The idea is that instead of trying to analyze a board position (which can be really, really difficult) using clever algorithms, you ask a random/simplistic algorithm to play out the rest of the game thousands upon thousands of times and see how many of those games it wins. The more it wins, the better the position.
Sounds crazy, but it actually works better than anything else.
(MCTS is usually thought of as using just one playout algorithm, with many random parameters; but that is still the same basic idea as ensembles using a bunch of different algorithms/models.)
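Here is a rough sketch of the playout idea on a toy game instead of Go. Note that this is flat Monte Carlo evaluation only: there is no search tree and no UCT-style move selection, so it is not a full MCTS. The game (a Nim variant where you take 1 to 3 stones and taking the last stone wins), the function names, and the playout count are all assumptions picked to keep the example short.

```python
import random

def legal_moves(stones):
    """You may take 1, 2, or 3 stones, but never more than remain."""
    return range(1, min(3, stones) + 1)

def random_playout(stones, rng):
    """Play uniformly random moves to the end of the game.

    Returns True if the player who moves first from `stones` takes the
    last stone (and so wins). Requires stones > 0.
    """
    first_player_to_move = True
    while True:
        stones -= rng.choice(list(legal_moves(stones)))
        if stones == 0:
            return first_player_to_move
        first_player_to_move = not first_player_to_move

def win_rates(stones, playouts=2000, seed=0):
    """Estimate, for each legal move, how often random play then wins."""
    rng = random.Random(seed)
    rates = {}
    for move in legal_moves(stones):
        left = stones - move
        if left == 0:
            rates[move] = 1.0  # taking the last stone wins outright
            continue
        # The opponent moves first from `left`; we win when they lose.
        wins = sum(not random_playout(left, rng) for _ in range(playouts))
        rates[move] = wins / playouts
    return rates

rates = win_rates(5)
best_move = max(rates, key=rates.get)
```

From 5 stones the random playouts end up preferring to take 1 stone, which also happens to be the game-theoretically correct move (it leaves the opponent on a multiple of 4). The playout policy knows nothing about that rule; the preference falls out of the win counts.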