Collaborative Filtering and the Rise of Ensembles
igrigorik writes "First the Netflix challenge was won with the help of ensemble techniques, and now the GitHub challenge is over, and more than half of the top entries are also based on ensembles. Good knowledge of statistics, psychology and algorithms is still crucial, but the ensemble technique alone has the potential to make the collaborative filtering space a lot more, well, collaborative! Here's a look at the basic theory behind ensembles, how they shaped the results of the GitHub challenge, and how this pattern can be used in the future."
There's a lot of argument over why ensemble techniques work well in general, when using them on well-posed statistical problems. But in the collaborative filtering case, they work well at least in part because there's not a canonical way of posing the problem statistically that's also tractable--- there are instead multiple ways to view the problem, which expose different information. Aggregating those views is a pretty straightforward way of getting more information.
For example, you can see the Netflix prize as a few different standard statistical problems. As a per-movie regression, predicting what Person A will rate Movie B, given ratings vector of Person A and the ratings vectors of everyone who's already rated Movie B [the per-person ratings vectors excluding B are the X's, and the ratings on B are the Y's]. Or you slice the movie-ratings matrix the other way, with per-movie ratings vectors as the X's. Add in some other views (those are the two most straightforward), aggregate all the info you get from them, and you do better than any one approach alone.
10 PRINT CHR$(205.5+RND(1)); : GOTO 10
Although that's true with humans, it's a bit curious why it'd be true with algorithms. After all, the aggregation of 3 algorithms is still just an algorithm. It's not even totally clear which algorithms are ensembles and which aren't--- some non-ensemble methods could be re-analyzed using ensemble terminology, and some ensemble methods could be rewritten as unified iterative loops that don't look very ensemble-y. The jury's still out on the whole subject, as far as I can tell (I'm not an ML person, but I'm an AI person whose research bleeds into ML).
An exception is when you're aggregating information from truly different statistical problems, in which case you inherently have an ensemble problem, until someone comes up with the theory (plus tractable implementation) to view the problem as one unified statistical problem. I think collaborative filtering is currently in that stage--- there's no canonical way to pose the problem in the terminology of statistical regression/etc. that captures all aspects of it.
10 PRINT CHR$(205.5+RND(1)); : GOTO 10
"Fitness" doesn't always have to be related to the output; it can be related to the quality of a guessed input.
Consider the corollary of a poll test: a model in which "trusted" voters receive extra votes while everyone else still gets on vote. You can determine "trustworthiness" (or "karma", if you will) the same way Slashdot does - through moderation and meta-moderation, or you can use a more objective "de minimis research" criteria (like a poll test but without the punishment for failure.)
So someone voting on a school board bond election who can correctly answer questions about the stated usage of that bond, or the school district's financial bond rating, or who attends a school board meeting discussing the bond, could get 2 votes for the price of one.
This would a) allow "passionate" (albeit informed) voters to have more of a say than someone who is indifferent, and b) encourage people to do research and get involved in politics.
In a way, it's anti-democratic, but if you are going to insert any sort of elitism into the system, it might as well be a meritocracy.
Yeah, the term dates back at least to the 1990s. The classic survey paper (over 1000 citations!) on the subject is "Ensemble Methods in Machine Learning" [pdf] by Tom Dietterich (2000), for those who want to glance through a survey. Though be warned that some of its specific conclusions are now dated--- e.g. there's been a *lot* written in both statistics and machine learning since then on what boosting "really" is and why it works.
Dietterich presents the more machine-learning view of it, focused on algorithms, combination of predictions, iterative refinement, etc. The best survey from a statistical approach is probably Ch. 16 of this book by three Stanford profs, which you can probably read some of on Google Books.
10 PRINT CHR$(205.5+RND(1)); : GOTO 10