Collaborative Filtering and the Rise of Ensembles
igrigorik writes "First the Netflix challenge was won with the help of ensemble techniques, and now the GitHub challenge is over, and more than half of the top entries are also based on ensembles. Good knowledge of statistics, psychology and algorithms is still crucial, but the ensemble technique alone has the potential to make the collaborative filtering space a lot more, well, collaborative! Here's a look at the basic theory behind ensembles, how they shaped the results of the GitHub challenge, and how this pattern can be used in the future."
So can ensembles be used to create more sophisticated forms of direct democracy? That is, where everyone has input into decision-making, but where that input is vastly more complex than simple majority rule by the mob?
FWIW, this is called open source governance.
There's a lot of argument over why ensemble techniques work well in general, when using them on well-posed statistical problems. But in the collaborative filtering case, they work well at least in part because there's not a canonical way of posing the problem statistically that's also tractable. There are instead multiple ways to view the problem, which expose different information. Aggregating those views is a pretty straightforward way of getting more information.
For example, you can see the Netflix prize as a few different standard statistical problems. One is a per-movie regression: predict what Person A will rate Movie B, given the ratings vector of Person A and the ratings vectors of everyone who's already rated Movie B [the per-person ratings vectors excluding B are the X's, and the ratings on B are the Y's]. Or you can slice the movie-ratings matrix the other way, with per-movie ratings vectors as the X's. Add in some other views (those are the two most straightforward), aggregate all the info you get from them, and you do better than with any one approach alone.
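To make the "two slices of the same matrix" idea concrete, here's a rough Python sketch. It is not what any Netflix or GitHub contestant actually ran: the ratings table is made up, the helper names (`neighbor_predict`, `person_view`, `movie_view`, `ensemble`) are hypothetical, and I've swapped the regression for a simpler similarity-weighted neighbor average so it fits in a comment. The point is only that the same routine, run on the matrix and on its transpose, gives two different views that can be averaged.

```python
from math import sqrt

# Toy ratings matrix (made-up data): rows are people, columns are movies.
# None marks a rating we do not have.
RATINGS = [
    [5, 4, None, 1],
    [4, 5, 2, 1],
    [1, 2, 5, 4],
    [2, 1, 4, 5],
]


def cosine(u, v):
    """Cosine similarity over the positions where both vectors have a rating."""
    pairs = [(a, b) for a, b in zip(u, v) if a is not None and b is not None]
    if not pairs:
        return 0.0
    dot = sum(a * b for a, b in pairs)
    norm_u = sqrt(sum(a * a for a, _ in pairs))
    norm_v = sqrt(sum(b * b for _, b in pairs))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def neighbor_predict(matrix, row, col):
    """Similarity-weighted average of the other rows' values in column `col`.

    A stand-in for the regression described above (an assumption of this sketch).
    """
    # Leave the target column out so it cannot leak into the similarity.
    target = [x for j, x in enumerate(matrix[row]) if j != col]
    num = den = 0.0
    for i, other in enumerate(matrix):
        if i == row or other[col] is None:
            continue
        weight = cosine(target, [x for j, x in enumerate(other) if j != col])
        num += weight * other[col]
        den += weight
    return num / den if den else None


def transpose(matrix):
    return [list(column) for column in zip(*matrix)]


def person_view(person, movie):
    """View 1: compare people, borrow their ratings of this movie."""
    return neighbor_predict(RATINGS, person, movie)


def movie_view(person, movie):
    """View 2: same routine on the transposed matrix, comparing movies."""
    return neighbor_predict(transpose(RATINGS), movie, person)


def ensemble(person, movie, views=(person_view, movie_view)):
    """Aggregate the views with a plain unweighted mean."""
    preds = [p for p in (view(person, movie) for view in views) if p is not None]
    return sum(preds) / len(preds) if preds else None


if __name__ == "__main__":
    print("person view:", person_view(0, 2))
    print("movie view: ", movie_view(0, 2))
    print("ensemble:   ", ensemble(0, 2))
```

The unweighted mean is the crudest possible aggregation; a real blend would presumably fit the weights on held-out ratings, and would add more views than these two.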
10 PRINT CHR$(205.5+RND(1)); : GOTO 10
Although that's true with humans, it's a bit curious why it'd be true with algorithms. After all, the aggregation of three algorithms is still just an algorithm. It's not even totally clear which algorithms are ensembles and which aren't: some non-ensemble methods could be re-analyzed using ensemble terminology, and some ensemble methods could be rewritten as unified iterative loops that don't look very ensemble-y. The jury's still out on the whole subject, as far as I can tell (I'm not an ML person, but I'm an AI person whose research bleeds into ML).
An exception is when you're aggregating information from truly different statistical problems, in which case you inherently have an ensemble problem, until someone comes up with the theory (plus a tractable implementation) to view the problem as one unified statistical problem. I think collaborative filtering is currently at that stage: there's no canonical way to pose the problem in the terminology of statistical regression, etc., that captures all aspects of it.