Collaborative Filtering and the Rise of Ensembles
igrigorik writes "First the Netflix challenge was won with the help of ensemble techniques, and now the GitHub challenge is over, and more than half of the top entries are also based on ensembles. Good knowledge of statistics, psychology and algorithms is still crucial, but the ensemble technique alone has the potential to make the collaborative filtering space a lot more, well, collaborative! Here's a look at the basic theory behind ensembles, how they shaped the results of the GitHub challenge, and how this pattern can be used in the future."
So can ensembles be used to create more sophisticated forms of direct democracy? That is, where everyone has input into decision-making, but where that input is vastly more complex than simple majority rule by the mob?
FWIW, this is called open source governance.
Of course having a group of people working together is a strength. If you are having a bad day or just feel like slacking off some one else is there to pick up the slack and keep the project moving. See also Division of labour http://en.wikipedia.org/wiki/Division_of_labour
There's a lot of argument over why ensemble techniques work well in general, when using them on well-posed statistical problems. But in the collaborative filtering case, they work well at least in part because there's not a canonical way of posing the problem statistically that's also tractable--- there are instead multiple ways to view the problem, which expose different information. Aggregating those views is a pretty straightforward way of getting more information.
For example, you can see the Netflix prize as a few different standard statistical problems. As a per-movie regression, predicting what Person A will rate Movie B, given ratings vector of Person A and the ratings vectors of everyone who's already rated Movie B [the per-person ratings vectors excluding B are the X's, and the ratings on B are the Y's]. Or you slice the movie-ratings matrix the other way, with per-movie ratings vectors as the X's. Add in some other views (those are the two most straightforward), aggregate all the info you get from them, and you do better than any one approach alone.
10 PRINT CHR$(205.5+RND(1)); : GOTO 10
One of the difficulties of ensemble development is weighting the logic that is being develeped. For instance, one of the problems we deal with at my job is matching incoming text to it's cleaned value. We have a list of approved words ['happy', 'sad', 'angry', 'sleepy'], and a text input of 'hap'. We need to determine which valid word 'hap' should match. Some rules I can think of for properly matching are:
:D
1.)Length of input compared to cleaned word.
2.)Number of nonpositional letter matches.
3.)Number of positional letter matches.
Depending on how rules are weighted determines what the answer will be (either sad or happy). I know at my job this weighting process requires very careful politicking.
Here's a stab at 3 sentences:
Making a prediction by running multiple statistical prediction algorithms and combining their results often seems to work well. This is called an "ensemble method". Ensemble methods seem to work particularly well on collaborative filtering problems.
10 PRINT CHR$(205.5+RND(1)); : GOTO 10
Yeah, the term dates back at least to the 1990s. The classic survey paper (over 1000 citations!) on the subject is "Ensemble Methods in Machine Learning" [pdf] by Tom Dietterich (2000), for those who want to glance through a survey. Though be warned that some of its specific conclusions are now dated--- e.g. there's been a *lot* written in both statistics and machine learning since then on what boosting "really" is and why it works.
Dietterich presents the more machine-learning view of it, focused on algorithms, combination of predictions, iterative refinement, etc. The best survey from a statistical approach is probably Ch. 16 of this book by three Stanford profs, which you can probably read some of on Google Books.
10 PRINT CHR$(205.5+RND(1)); : GOTO 10
Algorithms produce better results working in committees, unlike you.