Domain: netflixprize.com
Stories and comments across the archive that link to netflixprize.com.
Comments · 34
-
Netflix didn't just anonymize the data
Via http://www.stat.columbia.edu/~cook/movabletype/archives/2009/12/privacy_vs_know.html:
I'm not sure whether the litigators have read this particular section of the Netflix prize rules:
To prevent certain inferences being drawn about the Netflix customer base, some of the rating data for some customers in the training and qualifying sets have been deliberately perturbed in one or more of the following ways: deleting ratings; inserting alternative ratings and dates; and modifying rating dates.
So yes, you can match a set of reviews with someone else, but how will you know that it's really a person and not a random coincidence? 0.5 million review traces give plenty of opportunity for a false positive match. Netflix learned from AOL's data release disaster, which resulted in a few people getting fired.
-
Re:The Objective
People's daily moods do affect their movie ratings, and the winning algorithms account for that with parameters that vary by person and day. You can read about it in the winners' algorithm description.
-
Re:As the one who made up the names
It didn't take you long to crow about this on Netflix's Prize Forum.
-
Here's the Netflix thread where Koren leaks he won
http://www.netflixprize.com/community/viewtopic.php?id=1498
"Thanks. In fact, this is a very happy day for us - our team is top contender for winning the Grand Prize, as we have a better Test score than The Ensemble. (Probably this is the first post revealing this in the forum smile)"
Also, Yehuda Koren is at Yahoo now, not AT&T. -
Re:What contest?
The contest has been going on for several years straight, and /. has had several stories about it. The article takes knowledge of the contest as a given.
See Wikipedia and Netflix's own site for details. -
Re:Why now?
In fact, according to the second post by Yehuda Koren in this thread, it looks like BelKor does have the best test error rate and will be declared the winner. http://www.netflixprize.com/community/viewtopic.php?id=1498
-
There is more than 1 day left
Call me crazy, but if you actually *read* the rules, they say the contest is going until at least October 2nd, 2011.
Terms and Conditions in a Nutshell
* Contest begins October 2, 2006 and continues through at least October 2, 2011.
-
Re:Breaking News
Guess what guys. What you did was basic. You remembered my song history.
Not so: they remembered your song history and then recommended / played you music which they thought you might like. Less like twitter, more like Cinematch. It worked quite well, IMO.
And now your users don't trust you...
Yeah, also, if you don't live in the US, the UK or Germany it costs you money to use the good part of their service. This means that, from Canada, I can still send them my musical habits, which they will apparently send along to the RIAA, but I can't benefit from doing so by learning about cool new music from them.
I hope they go down. This kind of behaviour is just stupid.
-
Re:Ever been to grad school?
A million dollars? This is what happens when business people dabble in science. Artificial Intelligence grad students and professors have been studying these kinds of problems for decades.
I think that is the point - academia has been studying this for decades and has yet to produce meaningful results. I'm not saying that universities haven't contributed their fair share of technological advances through the years, but doing so in a practical and timely manner isn't exactly what they're known for. When business and/or money gets thrown into the mix, the pace of progress tends to rapidly accelerate.
X Prize Foundation
Millennium Problems
2008 Templeton Prize
Netflix could have saved a boatload of money by throwing some cash at a university with an established AI group and asking them to research the current state-of-the-art
According to the Netflix site there are currently 35558 contestants on 29326 teams from 170 different countries. They could have thrown any amount of money at any university and still not received the kind of effort they've seen to date. I'd say their million dollars is money well spent. -
Re:How is it quantified
Which improvements? The Netflix competition?
They basically have a large dataset consisting of User, Movie, Rating. They split this set into two data sets. From the smaller subset they removed the ratings and didn't release them to the public. They didn't modify the larger subset at all. They had Cinematch make predictions on the smaller subset (without having been told the real ratings) and use this as the baseline. Next, people who compete in the competition make predictions on the missing data, and improvements can be calculated. They calculate the percent improvement as 100 * (1 - [Submission's Error] / [Cinematch's Error])
There are a number of ways to calculate the error, but for the Netflix competition they use RMSE (Root Mean Squared Error). Basically you take the sum of the squared differences between what was predicted and what the real rating was, divide it by the number of ratings, then take the square root.
Detailed information can be found on the Netflix Prize rules page and there are a number of good posts on the forums as well.
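As a rough illustration of the scoring described above, here is a minimal sketch that computes RMSE (the measure named in the contest rules quoted elsewhere on this page) and the percent improvement over a baseline. The function names and the toy ratings are made up for the example; this is not Netflix's scoring code.

```python
import math

def rmse(predicted, actual):
    """Root mean squared error between two equal-length rating lists."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("need two non-empty lists of equal length")
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))

def percent_improvement(submission_rmse, baseline_rmse):
    """Percent by which a submission's RMSE is lower than the baseline's."""
    return 100.0 * (1.0 - submission_rmse / baseline_rmse)

# Toy data (hypothetical): held-out ratings and two sets of predictions.
actual = [4, 3, 5, 2]
baseline = [3, 3, 4, 3]      # stand-in for Cinematch's predictions
submission = [4, 3, 4, 3]    # stand-in for a contestant's predictions

baseline_error = rmse(baseline, actual)
submission_error = rmse(submission, actual)
improvement = percent_improvement(submission_error, baseline_error)
```

A submission with the same error as the baseline scores 0% here, and a lower error gives a positive percentage, which matches how the leaderboard figures are quoted.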
-
Re:Numbers?
The numbers in the summary don't match up with the numbers on Netflix's leaderboard:
leaderboard has changed since the book was published...
-
Numbers?
The numbers in the summary don't match up with the numbers on Netflix's leaderboard:
BellKor: 9.08%
Gravity/Dinosaurs: 8.82%
BigChaos: 8.80% -
against the terms of the prize
yes this data is useful, but you can't use it in the contest:
http://www.netflix.com/TermsOfUse
see also:
http://www.netflixprize.com/community/viewtopic.php?id=98
http://www.netflixprize.com/community/viewtopic.php?id=20
http://www.netflixprize.com/community/viewtopic.php?id=14
note that this makes sense. more/better data would help ANY decent algorithm. they want a better one, and they're judging you on a baseline. so they'd naturally limit your input options. -
Re:Beating everyone?
Currently at 8th place. I've seen this story all over and I don't know why it does not link to the actual standings: http://www.netflixprize.com/leaderboard Oh wait, I know why. Because it makes the story look dated!
-
Re:no breaktrough - just blending
from: http://www.netflixprize.com//community/viewtopic.php?id=799 We have also updated the Prize leaderboard to reflect the award of the 2007 Progress Prize and have established the new accuracy requirement to qualify for the 2008 Progress Prize. Again, in accord with the Rules, the new Prize level reflects a 1% improvement over team KorBell's verified submission, requiring a 9.34% improvement over the original Cinematch accuracy level.
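The arithmetic behind that threshold can be sketched as follows. Two assumptions are made here that the announcement does not spell out: that KorBell's verified improvement was 8.43% (the figure implied by the "1.57% to go" remark in the reply below), and that "a 1% improvement" means an error 1% lower than KorBell's. The names are hypothetical.

```python
# Assumed: KorBell's verified improvement over Cinematch, in percent.
KORBELL_IMPROVEMENT = 8.43

def next_threshold(previous_improvement, step_percent=1.0):
    """Improvement over the baseline needed to beat the previous error by step_percent."""
    # Error of the previous winner as a fraction of the baseline error.
    error_ratio = 1.0 - previous_improvement / 100.0
    # Lower that error by step_percent, then convert back to an improvement.
    target_ratio = error_ratio * (1.0 - step_percent / 100.0)
    return 100.0 * (1.0 - target_ratio)

required = next_threshold(KORBELL_IMPROVEMENT)
```

Under these assumptions the result is about 9.35, close to the 9.34% quoted; the small gap would be consistent with rounding in the published figures.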
-
Re:no breaktrough - just blending
Also a clarification on the progress prize: to get it you need to have at least 1% improvement over the previous result. Considering that there is only 1.57% to go there is room for only one more progress prize until it hits the Grand Prize (10% improvement over the original results).
Where did you get that? The rules (http://www.netflixprize.com/) state:
To qualify for a year's $50,000 Progress Prize the accuracy of any of your submitted predictions that year must be less than or equal to the accuracy value established by the judges the preceding year.
You just have to be better. -
Re:Moving target?
No. The rules state: To qualify for the $1,000,000 Grand Prize, the accuracy of your submitted predictions on the qualifying set must be at least 10% better than the accuracy Cinematch can achieve on the same training data set at the start of the Contest. The official contest site can be found on http://www.netflixprize.com/
-
This could be a boon for the Netflix Prize
This could really boost the NVIDIA Compute Unified Device Architecture (CUDA), and hence number-crunching projects like the Netflix Prize, which needs a second wind right about now due to a slowdown of progress in recent weeks. This depends, of course, on the detected error not being critical to the programs compiled for the 8800 hardware by CUDA, and on NVIDIA making the returned 8800s available for CUDA programmers.
-
Significance levels and missing data
I wrote a primitive version of such a site several years ago which I called Laboratory of the States since the goal was to gather lots of demographic variables by State and present ecological correlations.
Shortly thereafter, a site called Nation Master cropped up, with a bit flashier and simpler user interface, but focused on CIA World Fact Book data, rather than the States of the US. (The same folks later did State Master using similar UI technology.)
Finally, Google tested Gapminder with an even spiffier and simpler UI -- again focusing on by Nation correlations.
Aside from the usual complaints about "The Ecological Fallacy" (a fallacy that cuts both ways BTW) there are two big pitfalls for this stuff:
- Dealing with missing data.
- Estimating statistical significance.
What I did about missing data was simply eliminate any data points where data was missing from one or both of the variables being correlated. This reduces the sample size, hence statistical significance, but it bypasses arguments over what sort of missing data should be used. The Netflix Prize is coming up with really good algorithms to compute missing data efficiently and accurately so maybe there is hope for something more effective here.
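The "eliminate any data points where data was missing" approach described above can be sketched like this. The helper names and the per-state numbers are hypothetical, and the correlation is a plain Pearson coefficient; the original site's code is not available, so this is only an illustration of the idea.

```python
import math

def drop_missing_pairs(xs, ys):
    """Keep only positions where both variables have a value (None = missing)."""
    return [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]

def pearson(pairs):
    """Pearson correlation coefficient of a list of (x, y) pairs."""
    n = len(pairs)
    if n < 2:
        raise ValueError("need at least two complete pairs")
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant variable")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical per-state values; None marks a state with no data.
income = [1.0, 2.0, None, 4.0, 5.0]
literacy = [2.0, 4.0, 6.0, None, 10.0]

complete = drop_missing_pairs(income, literacy)  # two of the five states are dropped
r = pearson(complete)
```

Note how the sample shrinks from five states to three, which is the loss of statistical significance mentioned above.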
Statistical significance is more difficult to deal with. Usually one must look at tables for statistical significance of correlations under the assumption that the variables each follow a normal distribution. Unfortunately, many variables follow polynomial (like squared) or exponential distributions, so you have to do things like take the sqrt or log of one or both of the variables to try to normalize them. However, when you are looking for correlations, sometimes it is the relationship that is polynomial or exponential -- in which case you can apply sqrt or log to get the maximum correlation coefficient at the sacrifice of normality of one or both of the variables. Unfortunately, there is no simple arithmetic formula for calculating the significance level of a correlation given a non-normal distribution -- you can't just plug in the skewness, kurtosis, etc. as well as sample size and correlation coefficient, and get out a valid statistical significance. Therefore it is hard to make good statements about many very important correlations without watering them down to meaninglessness.
Also, a complaint about the "simple" user interfaces:
Some of the worst reporting from news media comes when they refuse to report statistics in terms remotely related to anything meaningful -- for example you will frequently hear statements to the effect that "California has the most orange trees in the nation." or some such. Such statistics are nonsense for the purposes of correlation studies since the size of the ecology (California state) is all you are really measuring with such statements. You have to divide by the population or divide by the total GDP or something to rationalize the ecology against other ecologies.
In Laboratory of the States, I did this with all my variables but I also left the raw variables around and allowed people to do arithmetic on them -- like dividing them -- to get their own rational comparisons if for some reason my choices were not adequate. This problem isn't as bad with Gapminder as it is with Nation Master and State Master -- but Gapm
-
Re:Who cares about prizes?
The leading team changes every few days in the Netflix prize. For the longest time, it was a guy from U Toronto called NIPS Reject, then it was the whole ML team at the same uni, then it was wxyzconsulting.com, and now it's Team Gravity. It's come to the point where successive improvements are incremental and hardly significant over the previous leader. What should be interesting now is if anyone has the big breakthrough that actually wins the prize. Check out the actual Netflix leaderboard
-
Insane
The Netflix Prize, originally scheduled to run for 5 years, has within one month already achieved nearly half of its goal of 10% improvement on their flagship technology, Cinematch -- and so what does DARPA do?
Abandon technology prizes.
They are insane.
-
Re:banned in Quebec
From the FAQ: "Most of those countries are on the U.S. Treasury Office of Foreign Assets Control's list of embargoed countries for which we cannot provide economic assistance. If this list changes, we'll post a change to the rules and let you know. Quebec has other reasons." Here's why Quebec is on the list.
-
Never-ending competition
From the rules:
Contest begins October 2, 2006 and continues through at least October 2, 2011.
your submitted predictions that year must be less than or equal to the accuracy value established by the judges the preceding year.
Okay, so the contest doesn't have an ending date, and the judges can modify the winning criteria as they see fit. I hope whoever spends time on this enjoys having others profit financially from their free labor.
Dan East -
Re:Seems like a free gift for Netflix to me...
If you read NetFlix' prize site, you'll find that they give clear cut statistical requirements for winning that are well defined. It's actually quite impressive the detail into which they go; it's clear that they want real engineers on this, and that they're willing to get seriously specific in order to make sure people know what's what.
-
Official site
Press release: http://www.netflix.com/MediaCenter?id=5368
Official registration and competition information for the Prize: http://www.netflixprize.com/ -
What's wrong with Quebec?
From the rules at http://www.netflixprize.com/rules : "Residents of the province of Quebec in Canada are ineligible to participate." They go on to list the remaining Axis of Evil and some other countries the U.S. doesn't much care for.
But what's wrong with Quebec? I would presume that they passed some sort of IP law that would make it problematic for NetFlix if the winner were based there. -
Re:how to enter the contest?
From their press release:
Complete details for registering and competing for the Netflix Prize are available at www.netflixprize.com
-
Re:Seems like a free gift for Netflix to me...
The contest site has a rules page that tells what accuracy you have to beat. You have to train your system with the dataset they provide and then test against what those people actually liked for the year after the dataset ends.
I could not find anything about whether you get to keep the rights to your code if you are not a winner. -
Re:Privacy issues?
From the netflix rules:
"To protect customer privacy, all personal information identifying individual customers has been removed and all customer ids have been replaced by randomly-assigned ids. The date of each rating and the title and year of release for each movie are provided. No other customer or movie information is provided."
Sounds familiar...
And later:
"To prevent certain inferences being drawn about the Netflix customer base, some of the rating data for some customers in the training and qualifying sets have been deliberately perturbed in one or more of the following ways: deleting ratings; inserting alternative ratings and dates; and modifying rating dates. However, the Cinematch RMSE measured on the final, perturbed dataset does not differ significantly from the RMSE measured on the unperturbed dataset for the purposes of Grand or Progress Prize qualification described below. The RMSE values reported below represent the RMSE measured on both the perturbed and unperturbed datasets to the precision specified above."
Not sure how much that actually protects end users, but they tried a bit more than AOL. -
Re:Privacy issues?
From http://www.netflixprize.com/ :
To prevent certain inferences being drawn about the Netflix customer base, some of the rating data for some customers in the training and qualifying sets have been deliberately perturbed in one or more of the following ways: deleting ratings; inserting alternative ratings and dates; and modifying rating dates.
Plus all the usual replacing of IDs and such you'd expect. Looks like they're trying to avoid a repeat of the AOL debacle at least. -
Re:database?
http://www.netflixprize.com/
I submitted this to /. with a link, but apparently wasn't the first. -
Re:database?