Slashdot Mirror


Anonymity of Netflix Prize Dataset Broken

KentuckyFC writes "The anonymity of the Netflix Prize dataset has been broken by a pair of computer scientists from the University of Texas, according to a report from the physics arXivblog. It turns out that an individual's set of ratings and the dates on which they were made are pretty unique, particularly if the ratings involve films outside the most popular 100 movies. So it's straightforward to find a match by comparing the anonymized data against publicly available ratings on the Internet Movie Database (IMDb) (abstract on the physics arxiv). The researchers used this method to find how individuals on the IMDb privately rated films on Netflix, in the process possibly working out their political affiliation, sexual preferences and a number of other personal details"

5 of 164 comments (clear)

  1. Probabilities by dj245 · · Score: 4, Insightful

    The researchers used this method to find how individuals on the IMDb privately rated films on Netflix, in the process possibly working out their political affiliation, sexual preferences and a number of other personal details"

    This is a loaded statement. The most you can determine is that if a person likes movie A, B, C and D but hated E and F, there is a higher probability they are a guy. If they liked Z but didn't like X, there is a higher probability they might be a republican than not. You're still anonymous.

    Unless, of course, you're one of the three people that liked "Glitter". Then I think they might have something on you.

    --
    Even those who arrange and design shrubberies are under considerable economic stress at this period in history.
    1. Re:Probabilities by Chapter80 · · Score: 5, Insightful
      I think you're missing the point.

      If you rate a handful of movies on ImDB, under the persona "MyNickname12345" and that can be traced to your personal MySpace page, you have made that choice. No problem.

      If you then submit 100 movie ratings to Netflix, assuming that it is PRIVATE information that will not be linked back to you, and then Netflix releases the data to the public, now the 100 movies can be correlated to you, and your name can be revealed. Researchers have shown how PRIVATE DATA released to the public can be linked to already public information. PROBLEM!

  2. Do what now? by faloi · · Score: 4, Insightful

    It doesn't sound like the anonymity of the prize set was broken through any fault of NetFlix. It sounds like some sampling of users made the mistake of rating movies on a site where the info is publicly available, and a site where it's not. All they did was correlate the two.

    So the lesson is, basically, don't post stuff that you don't want to be public to a website that makes it public, right? This is sounds roughly like blaming the DMV for figuring out a car owners likely political leanings by the bumper stickers on their car.

    --
    "It is a miracle that curiosity survives formal education." -Albert Einstein
    1. Re:Do what now? by IBBoard · · Score: 4, Insightful

      Exactly - all they did was found that there was a correlation that might mean that the people are the same on IMDB and NetFlix. There's also the possibility that they're different people and that they just voted similar on different places.

      Besides, this all relies on people voting for a) really obscure films so they can be easily identified and b) voting similarly or identically on lots of films so that they can get a better idea as to whether it is the same person based on them liking the same films the same amounts.

      Just because two people from two different data sets both like (and are the only people in the data sets to like) lemon and custard jam as well as peanut butter with chips doesn't mean they're the same person, it just means they could be the same person and have similar tastes in obscure foods.

  3. What are you rating in IMDB vs Netflix by SmallFurryCreature · · Score: 4, Insightful

    As far as I know in IMDB you are rating the overall quality of the movie, not I agree with it OR I want to see more like this.

    One example, Shindlers list, great movie, do NOT want to see it again. Same with Grave of the fireflies. Some movies just ain't for multiple viewings. They are my "favorite movies I never want to see again".

    On the other hand I got movies I can watch any day of the week, but that I would NEVER rate as highly. Cannonbal run is one such movie. It watch it far too often, but I wouldn't call it a good movie. You can always fine me ready for a Jacky Chan movie or a spagethi western.

    Is the netflix rating system a "I liked this movie and want to see more like it" system or a "This movie was brilliant and I would highly recommend it too everyone else" type of rating system?

    Granted some people get it confused, probably the same people that use the slashdot moderation system to silence views they don't like, but that only makes basing conclusions on user ratings even more problematic.

    I can rate a movie highly even if I do not agree with it, simply because it is good. And I can rate a movie I really like to watch as crap simply because I know I like watching crap.

    I don't like the godfather movies, I can see they are high quality, I just don't like them. So my rating them would be fairly high as for quality, but low for 'I want to see more like this'.

    I thought that the netflix system was "I want to see more like this" based. Surely nobody is so stupid as to think a quality rating and a "i like this" rating system are the same? Or am I completly in the wrong in seeing a difference between the two? Am I insane in thinking that you can see a movie as being a great artwork and still not liking it or viceversa?

    --

    MMO Quests are like orgasms:

    You may solo them, I prefer them in a group.