Slashdot Mirror


Anonymity of Netflix Prize Dataset Broken

KentuckyFC writes "The anonymity of the Netflix Prize dataset has been broken by a pair of computer scientists from the University of Texas, according to a report from the physics arXiv blog. It turns out that an individual's set of ratings and the dates on which they were made are pretty unique, particularly if the ratings involve films outside the most popular 100 movies. So it's straightforward to find a match by comparing the anonymized data against publicly available ratings on the Internet Movie Database (IMDb) (abstract on the physics arXiv). The researchers used this method to find how individuals on the IMDb privately rated films on Netflix, in the process possibly working out their political affiliation, sexual preferences and a number of other personal details."
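
The matching idea described in the summary can be sketched roughly as follows. This is a toy illustration only, not the researchers' actual algorithm: the records, the function names (`movie_weights`, `score`, `best_match`), the tolerances and the `min_gap` threshold are all made up for the example, and it assumes both sites use the same rating scale. Each overlapping movie counts as a hit when the rating and date are close, rarely rated movies count for more, and a match is only reported when the best candidate clearly beats the runner-up.

```python
from collections import Counter
from datetime import date
from math import log

# Hypothetical toy data: user id -> {movie: (rating, date rated)}.
anonymized = {
    "user_001": {"Popular Hit": (5, date(2005, 3, 1)),
                 "Obscure Film": (4, date(2005, 3, 2)),
                 "Rare Documentary": (2, date(2005, 4, 10))},
    "user_002": {"Popular Hit": (5, date(2005, 3, 1)),
                 "Big Sequel": (3, date(2005, 6, 5))},
    "user_003": {"Popular Hit": (4, date(2005, 1, 20)),
                 "Big Sequel": (3, date(2005, 6, 7))},
}

# Hypothetical public profile, e.g. ratings someone posted under a nickname.
public_profile = {"Popular Hit": (5, date(2005, 3, 3)),
                  "Obscure Film": (4, date(2005, 3, 4)),
                  "Rare Documentary": (2, date(2005, 4, 12))}


def movie_weights(records):
    """Give rarely rated movies a larger weight than popular ones."""
    counts = Counter(movie for rec in records.values() for movie in rec)
    return {movie: 1.0 / log(1 + n) for movie, n in counts.items()}


def score(public, candidate, weights, rating_tol=1, day_tol=14):
    """Sum the weights of movies whose rating and date roughly agree."""
    total = 0.0
    for movie, (rating, when) in public.items():
        if movie not in candidate:
            continue
        cand_rating, cand_when = candidate[movie]
        close_rating = abs(rating - cand_rating) <= rating_tol
        close_date = abs((when - cand_when).days) <= day_tol
        if close_rating and close_date:
            total += weights.get(movie, 0.0)
    return total


def best_match(public, records, min_gap=1.0):
    """Return the best-scoring user id, or None if the lead is too small.

    Assumes at least two candidate records.
    """
    weights = movie_weights(records)
    ranked = sorted(((score(public, rec, weights), uid)
                     for uid, rec in records.items()), reverse=True)
    (top, uid), (second, _) = ranked[0], ranked[1]
    return uid if top - second >= min_gap else None


print(best_match(public_profile, anonymized))
```

With this toy data the profile that includes the two obscure titles stands out, while a profile containing only the popular title ties between two candidates and is rejected. The gap check is one simple way to avoid reporting two different people with similar tastes as the same person.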

12 of 164 comments

  1. Sexual preferences? by tygerstripes · · Score: 4, Funny

    Who goes out of their way to rate "Anal Whores 3" online?

    --
    Meta will eat itself
    1. Re:Sexual preferences? by mh1997 · · Score: 5, Funny

      Who goes out of their way to rate "Anal Whores 3" online?
      The good thing about porn flicks, as a general rule, is that they're too bland to have really bad plots. The search for good dialogue strays too far off the beaten path established by the social mores of the target market, be that old men, college students, or perverts out on dates. There are pornos with solid plots, just rarely pornos with complicated plots.

      What they generally aren't is full of capers designed by crackheads in search of sexual relief, or a dominatrix dying to destroy the gold market with a Da Vinci alchemy machine only a cat burglar from Hoboken could steal.

      Yes, the plot of Anal Whores 3 is as convoluted as it is kitschy. Mercedes and Veronica Diamond forcibly enlist the help of happy-go-lucky and half-a-second-out-of-prison pizza delivery man Hawk (Peter North) to steal the pieces to a machine that turns lead vibrators into gold. Hawk isn't halfway to a cup of coffee with his wisecracking cohort, Tommy (Johnny Cockring), when he finds himself back in the burglary game. Casing out a heist, he meets nun/professional patron of the arts/double agent/love interest Jessie Jane (vows of bestiality can put the kibosh on even the best of cinematic love interests). When you throw in a CIA agent (Dick Coburn) and a couple of double dildos, you've managed to make the world's most convoluted porno....

    2. Re:Sexual preferences? by styryx · · Score: 4, Interesting

      That's the plot of Hudson Hawk. Good flick.

    3. Re:Sexual preferences? by Minwee · · Score: 5, Funny

      Yes, they would have to have watched Hudson Hawk to do that. That narrows the field considerably.

  2. Probabilities by dj245 · · Score: 4, Insightful

    The researchers used this method to find how individuals on the IMDb privately rated films on Netflix, in the process possibly working out their political affiliation, sexual preferences and a number of other personal details.

    This is a loaded statement. The most you can determine is that if a person likes movies A, B, C and D but hated E and F, there is a higher probability they are a guy. If they liked Z but didn't like X, there is a higher probability they might be a Republican than not. You're still anonymous.

    Unless, of course, you're one of the three people that liked "Glitter". Then I think they might have something on you.

    --
    Even those who arrange and design shrubberies are under considerable economic stress at this period in history.
    1. Re:Probabilities by Chapter80 · · Score: 5, Insightful

      I think you're missing the point.

      If you rate a handful of movies on IMDb under the persona "MyNickname12345", and that can be traced to your personal MySpace page, you have made that choice. No problem.

      If you then submit 100 movie ratings to Netflix, assuming that it is PRIVATE information that will not be linked back to you, and then Netflix releases the data to the public, now the 100 movies can be correlated to you, and your name can be revealed. Researchers have shown how PRIVATE DATA released to the public can be linked to already public information. PROBLEM!

  3. Do what now? by faloi · · Score: 4, Insightful

    It doesn't sound like the anonymity of the prize set was broken through any fault of Netflix. It sounds like some sampling of users made the mistake of rating movies both on a site where the info is publicly available and on a site where it's not. All the researchers did was correlate the two.

    So the lesson is, basically, don't post stuff that you don't want to be public to a website that makes it public, right? This sounds roughly like blaming the DMV for someone figuring out a car owner's likely political leanings from the bumper stickers on their car.

    --
    "It is a miracle that curiosity survives formal education." -Albert Einstein
    1. Re:Do what now? by IBBoard · · Score: 4, Insightful

      Exactly - all they did was find a correlation that might mean that the people are the same on IMDb and Netflix. There's also the possibility that they're different people who just voted similarly in different places.

      Besides, this all relies on people a) voting for really obscure films so they can be easily identified and b) voting similarly or identically on lots of films, so that you can get a better idea as to whether it is the same person based on them liking the same films the same amounts.

      Just because two people from two different data sets both like (and are the only people in the data sets to like) lemon and custard jam as well as peanut butter with chips doesn't mean they're the same person, it just means they could be the same person and have similar tastes in obscure foods.

  4. Data-mining and the actual problem by Anonymous Coward · · Score: 4, Interesting

    There are two things going on here. One, many people are asking how you could identify any personal information about people based on their movie preferences. The answer is data-mining. Very sophisticated techniques exist to do things exactly like this, i.e. take a data set and find out about the people.

    The second problem is that by deanonymizing the Netflix data, you can start to cheat on the Netflix Prize. The requirement to win $1 million is that your recommendation engine is 10% better than the one they are currently using. However, if you can learn the exact preferences of some users in the dataset (i.e. by finding the rest of their ratings on IMDb) then you can hardcode that into your recommendation engine and get the recommendations for these users exactly right. This can boost your score even though your actual system is no better than the existing one. This is known as over-fitting to the data.

    Finally, this paper is over a year old. Can we please have some new news?

  5. Easy solution by Thanshin · · Score: 4, Funny

    Every time you feel the need to vote 10 in Glitter, also vote 10 to The Godfather.
    Every time you cheer for Brokeback Mountain, also put a 10 in Huge Knockers MXII.
    Every time you want to express your love for Dersu Uzala, vote a 10 in Spice World, with added commentaries.

    That way, everybody will know you're a security conscious computer scientist. Or a schizophrenic moron.

  6. What are you rating in IMDb vs Netflix by SmallFurryCreature · · Score: 4, Insightful

    As far as I know, in IMDb you are rating the overall quality of the movie, not "I agree with it" OR "I want to see more like this".

    One example, Schindler's List, great movie, do NOT want to see it again. Same with Grave of the Fireflies. Some movies just ain't for multiple viewings. They are my "favorite movies I never want to see again".

    On the other hand I got movies I can watch any day of the week, but that I would NEVER rate as highly. Cannonball Run is one such movie. I watch it far too often, but I wouldn't call it a good movie. You can always find me ready for a Jackie Chan movie or a spaghetti western.

    Is the Netflix rating system an "I liked this movie and want to see more like it" system or a "This movie was brilliant and I would highly recommend it to everyone else" type of rating system?

    Granted, some people get it confused, probably the same people that use the Slashdot moderation system to silence views they don't like, but that only makes basing conclusions on user ratings even more problematic.

    I can rate a movie highly even if I do not agree with it, simply because it is good. And I can rate a movie I really like to watch as crap simply because I know I like watching crap.

    I don't like the Godfather movies. I can see they are high quality, I just don't like them. So my rating for them would be fairly high for quality, but low for "I want to see more like this".

    I thought that the Netflix system was "I want to see more like this" based. Surely nobody is so stupid as to think a quality rating and an "I like this" rating system are the same? Or am I completely in the wrong in seeing a difference between the two? Am I insane in thinking that you can see a movie as being a great artwork and still not like it, or vice versa?

    --

    MMO Quests are like orgasms:

    You may solo them, I prefer them in a group.

  7. Re:only a matter of time by phobos13013 · · Score: 4, Interesting

    Actually TFA seems to suggest that the more obscure and pretentious we are, the easier it is to track us. If we become homogeneous drones voting on the top 100 films, we are safe! Even so, I don't plan to become a homogeneous drone...

    --
    ...and it should be known by now