BellKor Wins Netflix $1 Million By 20 Minutes

Posted by kdawson on Monday September 21, 2009 @05:58PM from the seven-guys-and-a-million-bucks dept.

eldavojohn writes "As we discussed at the time, there was a strange development at the end of Netflix's competition in which The Ensemble passed BellKor's Pragmatic Chaos by 0.01% a mere twenty minutes after BellKor had submitted results past the ten percent mark required to win the million dollars. Unfortunately for The Ensemble, BellKor was declared the victor this morning because of that twenty-minute margin. For those of you following the story, The New York Times reports on how teams merged to form BellKor's Pragmatic Chaos and take the lead, which sparked an arms race of teams merging their algorithms to produce better results. Now the Netflix Prize 2 competition has been announced." The Times blog quotes Greg McAlpin, a software consultant and a leader of the Ensemble: "Having these big collaborations may be great for innovation, but it's very, very difficult. Out of thousands, you have only two that succeeded. The big lesson for me was that most of those collaborations don't work."

25 of 104 comments

Bad Summary by Anonymous Coward · 2009-09-21 18:11 · Score: 5, Informative

The Ensemble beat BellKor by 0.01%... by their own reporting. According to Netflix, it was a tie. In the case of a tie, the first posted result wins.
1. Re:Bad Summary by tangent3 · 2009-09-21 20:39 · Score: 5, Informative
  
  The Ensemble beat BellKor by 0.01% on the quiz set. Basically there are 2.8 million records in the qualifying set whose ratings the teams must predict. Half of the records (which half is known only to Netflix) form the quiz set, the other half form the test set. Teams can submit their predictions at most once a day to get a result on the quiz set, but the final decision of who won is made on the result of the test set.
  So even though Ensemble beat BellKor on the quiz set, the test set results came back dead even.
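
The quiz/test mechanism described above can be sketched in a few lines. This is only a toy illustration, not Netflix's actual scoring code: the two teams, the noise level, the set size, and the helper names (`rmse`, `split_quiz_test`, `score`) are all made up for the example. It shows how two equally good sets of predictions can get slightly different scores on a hidden quiz half and a hidden test half, so a tiny lead on one half says little about the other.

```python
import math
import random


def rmse(predicted, actual):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))


def split_quiz_test(n_records, rng):
    """Randomly assign each record index to the quiz half or the test half."""
    indices = list(range(n_records))
    rng.shuffle(indices)
    half = n_records // 2
    return sorted(indices[:half]), sorted(indices[half:])


def score(predicted, actual, subset):
    """RMSE restricted to the record indices in `subset`."""
    return rmse([predicted[i] for i in subset], [actual[i] for i in subset])


rng = random.Random(0)
n = 10_000  # toy size; the thread says the real qualifying set had 2.8 million records
actual = [rng.randint(1, 5) for _ in range(n)]

# Two hypothetical teams whose predictions have the same noise level.
team_a = [a + rng.gauss(0, 0.85) for a in actual]
team_b = [a + rng.gauss(0, 0.85) for a in actual]

quiz, test = split_quiz_test(n, rng)
for name, preds in (("A", team_a), ("B", team_b)):
    print(name, "quiz:", round(score(preds, actual, quiz), 4),
          "test:", round(score(preds, actual, test), 4))
```

Only the organizer knows which indices are in `quiz` and which are in `test`, which is why teams could see their quiz score daily without being able to tune directly against the deciding half.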
It was a tie... by rm999 · 2009-09-21 18:18 · Score: 2, Interesting

It was a tie...
In football, I can see how a 20 second difference makes the difference between winning the superbowl. In a contest like this that took thousands of man hours of some brilliant people, calling Ensemble "second place" due to a 20 second difference is just wrong. I don't know if there was a better solution, but something just seems wrong about it all.
1. Re:It was a tie... by dingen · 2009-09-21 18:21 · Score: 3, Informative
  
  Although I think the actual 20-minute difference instead of your imaginary 20-second difference is a little less harsh, you're still right.
  
  --
  Pretty good is actually pretty bad.
2. Re:It was a tie... by aywwts4 · 2009-09-21 19:25 · Score: 3, Insightful
  
  Most football games didn't start in 2006, so proportionally 20 seconds is far too long. You didn't exaggerate nearly enough; someone else can do the math, though. (I'm real sleepy, but the imaginary football game came down to roughly 45 milliseconds?)
  I'm really surprised Netflix didn't offer 2 million dollars to the two winning teams, or at least some sort of consolation prize, as it was effectively a tie in a culmination of years of work.
  These people did so much work that even at a million dollars they would likely have earned below minimum wage. Netflix has come a long way since 2006, and this kind of research would have cost many millions, so they really can't lose here. Unless the contest took so long that the code isn't useful and they have already surpassed 10% in-house.
  
  --
  Web Developers: Celebrate to our roots! Animated Gifs and Tiled Backgrounds, dont let our history die!
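
For anyone who wants to do the math asked for above, here is a rough sketch. The contest dates are an assumption on my part (the thread only says the contest started in 2006 and ran about three and a half years); I'm using October 2, 2006 as the start and July 26, 2009 as the final deadline. The function name `scaled_margin_ms` is just for this example.

```python
from datetime import date

# Assumed contest window (not stated in the thread itself).
contest_start = date(2006, 10, 2)
contest_end = date(2009, 7, 26)

contest_minutes = (contest_end - contest_start).days * 24 * 60
margin_fraction = 20 / contest_minutes  # the 20-minute margin as a share of the contest


def scaled_margin_ms(game_minutes):
    """The same proportional margin applied to a game of the given length, in ms."""
    return margin_fraction * game_minutes * 60 * 1000


print(round(scaled_margin_ms(60), 1))  # 60 minutes of regulation play
print(round(scaled_margin_ms(90), 1))  # a 90-minute match
```

Under those assumptions the margin scales to a few tens of milliseconds of game time, the same order of magnitude as the sleepy estimate above.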
3. Re:It was a tie... by dingen · 2009-09-21 19:47 · Score: 2, Insightful
  
  Most football games last for a few minutes more than the standard 90 minutes, depending on the number of incidents during the match. The game would never be terminated in the middle of an interesting action and no proper referee cares about a few seconds.
  
  --
  Pretty good is actually pretty bad.
nonsense by wizardforce · 2009-09-21 18:34 · Score: 5, Insightful

"The big lesson for me was that most of those collaborations don't work."
Setting an arbitrary goal that only .2% of competitors could meet does not mean that most collaborations don't work. If 90% of the teams met the target, you probably wouldn't be so quick to claim that the vast majority of collaborations do work but rather that the goal wasn't high enough.

--
Sigs are too short to say anything truly profound so read the above post instead.
Funny, I learned a different lesson... by Squiggle · 2009-09-21 18:37 · Score: 5, Insightful

The big lesson for me was that big collaborations were the most successful.
In creating solutions for hard problems most of everything fails and is horribly difficult. No big surprise there. Kinda odd that was the quoted lesson...

--
Complexity Happens
1. Re:Funny, I learned a different lesson... by misnohmer · 2009-09-21 20:56 · Score: 3, Interesting
  
  I was just about to post the very same comment. By the contest rules, the contest ends once someone comes up with a winning solution. The fact that there were 2 solutions meeting the requirement so close together and both resulting from collaborations would rather suggest the collaborations worked really well. The other collaborations simply stopped once there was a winner. Concluding from this that collaborations don't work would be like concluding that the training athletes go through prior to the Olympic games doesn't work - after all, from all these entrants training hard only 1 wins in each event.
Well at least.... by russ1337 · 2009-09-21 18:49 · Score: 2, Insightful

it's still good for the CV.....
The Rules are the Rules... by Anonymous Coward · 2009-09-21 18:49 · Score: 5, Interesting

I agree that Ensemble "losing" because they posted 20 minutes later is a harsh result. However, those were the rules that Netflix set forth and Ensemble, intentionally or not, was making a risky gamble by waiting until right before the deadline to submit their project. And, perhaps the "tie goes to the earlier poster" rule makes some sense because it encourages making your submission earlier than you would otherwise and not "sniping" unless you're absolutely sure your project is better than the rest. At least as far as I can understand, the rule set forth the proper tradeoff -- Ensemble got to see the score to beat (BellKor's) before it posted; however, in exchange for that, its score needed to have been better in order to win. Had Ensemble wanted the first-mover's advantage and the win in event of a tie, it could have posted earlier than BellKor. The fact that BellKor posted only 20 minutes before the end of the competition suggests that Ensemble could have easily posted earlier without compromising its entry. That is, how much significant tinkering could have possibly been done in the last half hour of this multi-year competition?
1. Re:The Rules are the Rules... by martin-boundary · 2009-09-21 19:56 · Score: 2, Insightful
  
  I think it would qualify as harsh if the runner up had a simple algorithm, but in this case all the teams which qualified for the 10% threshold did so with complicated blends of many algorithms. There's really no way to identify whose work is more valuable and deserved most to win, from a scientific perspective.
Re:Anonymous Coward by ksatyr · 2009-09-21 18:53 · Score: 3, Insightful

The whole thing confuses me. Why are these extremely intelligent people doing research work for NetFlix that would otherwise cost them many times the price of the prize if they paid them in-house? Are there at least share options down the road? I hope the ultimate solution(s) end up in the public domain.
Re:The Objective by __aasqbs9791 · 2009-09-21 19:05 · Score: 3, Informative

IIRC, it was to improve the prediction algorithm for ratings. Basically, if you rated these movies at this level, then Netflix tries to predict you will rate those movies at this many stars each, or something to that effect. I've found the old method they used seems to generally work pretty well for me, though there are times I've been surprised. Though I'm not convinced my ratings are really all that accurate anyway. I'm pretty sure if I'm in a certain mode before I see some movies I'd rate them quite a bit differently than other times, though without some way to wipe my memories of seeing it the first time, I'm not sure how I'd actually test that.
Re:The Objective by martin-boundary · 2009-09-21 20:17 · Score: 4, Informative

Though I'm not convinced my ratings are really all that accurate anyway. I'm pretty sure if I'm in a certain mode before I see some movies I'd rate them quite a bit differently than other times, though without some way to wipe my memories of seeing it the first time, I'm not sure how I'd actually test that.

If you phrase it like that, you're somewhat missing the point. The target was to minimize an average prediction error over a large number of people, not the prediction error for a single person (eg you).
Here's an analogy which might help: Suppose you play the lottery and you try to predict 6 numbers exactly, then you'll have a vanishingly small chance of getting them right. But suppose you submit millions of sets of predictions, all different, then your chance is much larger of getting the actual 6 right.
Now the Netflix contest required predicting a few million ratings, and even if any one rating might be very far off the target, the task only required making sure that a large proportion of the predictions were pretty close to each of their targets and the remaining ones were not too far off.
The winners were able to make several million predictions such that they were, on average (in the RMSE sense used a lot in engineering), a distance of about 0.85 from the real rating.
This holds even if in some instances their predictions were off by 4 (i.e. predicting 1 when it is 5). For example, with 4 million predictions, if 1% of their predictions are off by 4, that's 40,000 instances of being off by 4, but this has to be compensated by several percent of the predictions being off by 0 (or close to it) if you want to get 0.85 on average.
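
To put a number on that compensation, here is a small sketch of the arithmetic. It is my own illustration rather than anything from the contest; the function name `required_rest_rmse` is made up. Since RMSE is the square root of the mean squared error, the outliers' share of the squared error can be subtracted from the overall budget to see how good the remaining predictions have to be.

```python
import math


def required_rest_rmse(target_rmse, outlier_fraction, outlier_error):
    """RMSE the non-outlier predictions need so that the overall RMSE equals the target."""
    target_mse = target_rmse ** 2
    outlier_mse = outlier_fraction * outlier_error ** 2  # outliers' share of the total MSE
    rest_mse = (target_mse - outlier_mse) / (1 - outlier_fraction)
    if rest_mse < 0:
        raise ValueError("the outliers alone already exceed the target RMSE")
    return math.sqrt(rest_mse)


n_predictions = 4_000_000
print(n_predictions // 100)                         # predictions off by 4 (1% of the total)
print(round(required_rest_rmse(0.85, 0.01, 4), 3))  # RMSE needed on the other 99%
```

With 1% of predictions off by 4, the other 99% need an RMSE of roughly 0.75 to land at 0.85 overall, because squaring makes a few big misses very expensive.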
Re:The Objective by crunchyeyeball · 2009-09-21 20:59 · Score: 5, Informative

Basically, you were asked to predict how a number of users would rate a number of movies, based on their previous ratings of other movies.
You were supplied with 100 million previous ratings (UserID, MovieID, Rating, DateOfRating), with the rating being a number between 1 and 5 (5=best), and asked to make predictions for a separate ("hidden") set comprising roughly 10% of the original data. You could then post a set of predictions to their website which would be automatically scored, and you'd receive an RMSE (Root Mean Squared Error) by email.
To avoid the possibility of tuning your predictions based on the RMSE, you could only post one submission per day, and the final competition-winning results would be scored against a separate hidden set, independent of the daily scoring set.
It really was a fantastic competition, and anyone with a little coding knowledge (or SQL knowledge) could have a decent go at it. Personally, I scored an RMSE of 0.8969, or a 5.73% improvement over Netflix's benchmark Cinematch algorithm, having learnt a huge amount based on the published papers and forum postings of others in the contest, and my own incoherent theories.
In a way, everyone wins. Netflix gets a truly world-class prediction system based on the work of tens of thousands of researchers around the world hammering away for years at a time. Machine learning research moves a big step forward. BellKor et al get a big juicy cheque, and enthusiastic amateurs like myself get access to a huge amount of real-world research and data.
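
For readers wondering how an RMSE turns into a percentage improvement, here is a minimal sketch. The baseline value of 0.9514 for Cinematch is an assumption: it is not stated in this thread, though it is consistent with the 0.8969 and 5.73% figures quoted above. The function name `improvement_pct` is made up for the example.

```python
def improvement_pct(baseline_rmse, rmse):
    """Percentage reduction in RMSE relative to a baseline."""
    return 100.0 * (baseline_rmse - rmse) / baseline_rmse


CINEMATCH_RMSE = 0.9514  # assumed baseline, not given in the thread

print(round(improvement_pct(CINEMATCH_RMSE, 0.8969), 2))  # improvement for an RMSE of 0.8969
print(round(CINEMATCH_RMSE * 0.90, 4))                    # RMSE needed for a 10% improvement
```

Under that assumed baseline, the 10% threshold works out to an RMSE a little above 0.856, which shows how much harder the last few percent were than the first five.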
Re:Anonymous Coward by kelnos · 2009-09-21 21:17 · Score: 2, Interesting

It is? I only see a bit in a question about licensing (somewhat tangential) that suggests that Netflix hopes that participants will be able to build a business out of the algorithm they design, but that sounds pretty weak, and doesn't have all that much to do with what the participants got, aside from the prize money.

The contest has been going on for three and a half years, and the winning team of seven will be splitting a cool million, which gives each person just under $145k, minus taxes. Now, I don't know how much time these guys spent on it, but even if they only worked a year's worth of regular work hours over the 3.5 years, $145k per year each for seven developers is a pretty damn good bargain from Netflix's perspective for what they got (not just the new algorithm, but a lot of good PR and buzz).

I'm not saying the BellKor guys got the shaft; they were certainly compensated (not just monetarily; I'm sure their employability went up as well), and I'm sure a big part of their desire to compete was the challenge itself. But I'd bet that Netflix would've had to pay quite a bit more.

And it's not like the BellKor team did all the work; all the other teams did some of the same work independently. I imagine many (most?) of them didn't stand a chance, but let's just throw out a conservative number and say the top 5% of teams managed to improve on Netflix's existing algorithm (even if not by 10%). It's conceivable that an in-house team of paid developers/researchers would end up doing an analogous iterative process, achieving smaller gains, eventually reaching the 10% goal. Depending on Netflix's hiring skills, it's possible they wouldn't reach a 10% increase without many more man-years of work.

This contest was a very smart move on Netflix's part: their only real downside is that their -- self-imposed -- competition terms will allow the contest participants to competitively license their implementations to other companies.

--
Xfce: Lighter than some, heavier than others. Just right.
I think it's a gloss on prizes as innovation-spurs by langelgjm · 2009-09-21 23:29 · Score: 2, Interesting

I think he's pointing to one of the inefficiencies of prize systems as a way to spur innovation. Thousands of people tried, spending tens or hundreds of thousands of work-hours and other resources, and only a fraction got "winning results" (yes, according to the arbitrary way that winning was defined). But the point is that the prize probably resulted in a very inefficient use of resources. We could hypothesize that the same result might have been achieved with only 25% of the resources spent on the prize - for example, by making the cost of entry non-zero, you could have eliminated teams with no chance of winning from participating.
Basically prize systems benefit from people's inability to accurately assess their real chances of winning - or put another way, prize systems free ride off of people's self-delusion.
Of course there are other factors to be considered, e.g., what would those wasted resources have gone to if they were not being used for the competition, perhaps there are incidental rewards to those resources having been used, perhaps people competed for reasons other than simply winning the prize, etc.

--
"Anyone who [rips a CD] is probably engaging in copyright infringement." - David O. Carson
Re:The Objective by daybot · 2009-09-21 23:54 · Score: 2, Funny

I'm in a certain mode before I see some movies I'd rate them quite a bit differently
Absolutely. Every single film I first saw on a plane ranks very low for me.
Re:I think it's a gloss on prizes as innovation-sp by martin-boundary · 2009-09-22 00:20 · Score: 3, Interesting

for example, by making the cost of entry non-zero, you could have eliminated teams with no chance of winning from participating.

This doesn't work. If you make the entry cost nonzero, you'll be much less efficient at doing *science*. Remember, the journey is much more important than the result. The benefits to society in disseminating knowledge of data mining technologies and good datasets largely dwarfs the knowledge of the winning entry (think Metcalfe's law).
Re:I think it's a gloss on prizes as innovation-sp by ostrich2 · 2009-09-22 01:50 · Score: 2, Funny

Your experience was very different from mine.
I found an obvious solution and wrote it down in the margin of a book. I even discovered a proof of this, but the margin was too narrow to contain it.
Re:I think it's a gloss on prizes as innovation-sp by langelgjm · 2009-09-22 01:53 · Score: 2, Insightful

The benefits to society in disseminating knowledge of data mining technologies and good datasets largely dwarfs the knowledge of the winning entry (think Metcalfe's law).
You're only considering the benefits to society that result from this particular competition. The argument about prize systems being inefficient has to do with the fact that while they generate huge interest in a particular topic (and yes, generate more returns than simply the winning entry), they also result in an inefficient allocation of resources to that one particular topic.
I.e., some of the entrants would likely have benefited society more by flipping burgers or sweeping sidewalks than by wasting their time on the Netflix prize.
The problem is somewhat reduced if you have a large number of prizes on various topics, because then people can devote their time to areas where they have more of a chance of winning, or if you make the cost of entry non-zero (it can still be very low - anyone with any real interest and talent will not be turned off by a $1 or $5 registration fee, or by a simple test to assess their capabilities).

--
"Anyone who [rips a CD] is probably engaging in copyright infringement." - David O. Carson
Re:ratings systems by retchdog · 2009-09-22 02:26 · Score: 3, Insightful

I'm sure that every schmuck with a Netflix account would be willing to adhere to your stupid rules, and saddened by your unwillingness to pontificate on how you'd change human behavior.
Seriously, this is what Netflix would be if it were invented by Stalin.

--
"They were pure niggers." – Noam Chomsky
Re:The Objective by LordKronos · 2009-09-22 02:57 · Score: 2, Interesting

Well it might not affect the average prediction as it relates to everybody else. However, from a user's perspective, the whole point of the system is to try to figure out what my taste is for movies based on how I rated those movies, match it up to other people's ratings, and try to predict what other movies I'd like. You can't statistically average out my ratings, as my ratings are the only significant factor on one side of the equation. There are no other users you can use to balance out what my tastes are. It has to go by my ratings, and if my ratings are anomalous, the results are going to suffer.
Your lottery analogy is pointless, because it demonstrates a different issue. There, the 6 actual numbers against which we are rating your submissions are a factual matter. They aren't affected by your feelings and interpretations. They are going to be the same 6 numbers, no matter whether you just got a promotion at work or your spouse was just murdered. However, your rating of movie X would probably be different after the promotion than it would be after the murder of your spouse (we're assuming you actually liked your spouse and didn't hire a hitman).
Re:ratings systems by Geoff-with-a-G · 2009-09-22 05:19 · Score: 3, Insightful

Your proposed solution would only make sense if people were forced to watch a completely random selection of movies. Once you factor in the fact that people are allowed to select which movies they want to watch, it makes sense that their ratings would cluster towards the high end of the spectrum. That is, in fact, the whole point of this ratings prediction system: to tell you, in advance, which movies you will like. If it worked perfectly, you'd never have to rate a movie below average, because you could avoid ever renting a movie which you wouldn't like.