Anonymity of Netflix Prize Dataset Broken
KentuckyFC writes "The anonymity of the Netflix Prize dataset has been broken by a pair of computer scientists from the University of Texas, according to a report from the physics arXiv blog. It turns out that an individual's set of ratings and the dates on which they were made are pretty unique, particularly if the ratings involve films outside the most popular 100 movies. So it's straightforward to find a match by comparing the anonymized data against publicly available ratings on the Internet Movie Database (IMDb) (abstract on the physics arXiv). The researchers used this method to find how individuals on the IMDb privately rated films on Netflix, in the process possibly working out their political affiliation, sexual preferences and a number of other personal details."
Who goes out of their way to rate "Anal Whores 3" online?
Meta will eat itself
The researchers used this method to find how individuals on the IMDb privately rated films on Netflix, in the process possibly working out their political affiliation, sexual preferences and a number of other personal details"
This is a loaded statement. The most you can determine is that if a person likes movie A, B, C and D but hated E and F, there is a higher probability they are a guy. If they liked Z but didn't like X, there is a higher probability they might be a republican than not. You're still anonymous.
Unless, of course, you're one of the three people that liked "Glitter". Then I think they might have something on you.
Even those who arrange and design shrubberies are under considerable economic stress at this period in history.
The German Police will be pleased.
Privacy is becoming a fleeting thing in this interconnected world. Perhaps we should reanalyze our perspective on it all?
Karma Whoring for Fun and Profit.
It doesn't sound like the anonymity of the prize set was broken through any fault of NetFlix. It sounds like some sampling of users made the mistake of rating movies on a site where the info is publicly available, and a site where it's not. All they did was correlate the two.
So the lesson is, basically, don't post stuff that you don't want to be public to a website that makes it public, right? This sounds roughly like blaming the DMV for figuring out a car owner's likely political leanings by the bumper stickers on their car.
"It is a miracle that curiosity survives formal education." -Albert Einstein
Seems like it was only broken because the identity of the people was posted somewhere else, along with the ratings. My only question is how they connected the rankings on Netflix to the rankings on IMDB. Does Netflix take the liberty of submitting all the users' rankings to IMDB for them, and also include their name with this data? If you just have anonymous dataset A and anonymous dataset B, you could match up users from both and figure out which person in A is the same person in B, but you still wouldn't know who the person is. However, if dataset B is not anonymous, then it's not too difficult to compare movie ratings and find out who the people are.
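To make the A-versus-B matching above concrete, here is a rough Python sketch. It is not the researchers' actual method; the tolerances, the user ids, and the assumption that both sites use the same rating scale are all made up for illustration.

```python
from datetime import date

def match_score(public, candidate, max_days=14, max_rating_gap=1):
    """Count public ratings that have a close counterpart in a candidate record.

    Both arguments map movie title -> (rating, date). The tolerances are
    arbitrary illustrative defaults, not values from the paper.
    """
    hits = 0
    for movie, (rating, day) in public.items():
        if movie not in candidate:
            continue
        cand_rating, cand_day = candidate[movie]
        close_rating = abs(rating - cand_rating) <= max_rating_gap
        close_date = abs((day - cand_day).days) <= max_days
        if close_rating and close_date:
            hits += 1
    return hits

def rank_candidates(public, anonymized):
    """Return (user_id, score) pairs, best match first."""
    scores = {uid: match_score(public, rec) for uid, rec in anonymized.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical toy data: one named public profile, two anonymized records.
named_profile = {
    "Dersu Uzala": (5, date(2005, 3, 1)),
    "Hudson Hawk": (4, date(2005, 3, 9)),
    "Glitter": (1, date(2005, 4, 2)),
}
anonymized = {
    "user_17": {
        "Dersu Uzala": (5, date(2005, 3, 3)),
        "Hudson Hawk": (4, date(2005, 3, 10)),
        "Glitter": (2, date(2005, 4, 2)),
        "The Godfather": (5, date(2005, 5, 1)),
    },
    "user_42": {
        "The Godfather": (5, date(2005, 1, 1)),
        "Glitter": (4, date(2004, 12, 25)),
    },
}

ranking = rank_candidates(named_profile, anonymized)
print(ranking)
```

The anonymized record that lines up best with the named profile inherits the name, and with it every other rating in that record.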
Anthropic principle: We see the universe the way it is because if it were different we would not be here to see it.
{tongueincheek}Yeah, but the question is, will knowing those personal facts generate better movie recommendations?{/tongueincheek}
When there's a significant prize at stake, researchers can try all sorts of slimy tricks to win. (I'm not saying that's the motive behind this report, but there are many "researchers" going for the prize.) And when there's significant profits at stake, a corporation will damn-fire-certainly use whatever means they can use to maximize those profits, regardless of whether it might be "ethical."
For those who haven't rated movies on IMDB, such as myself - and I imagine a large proportion of subscribers.
There are two things going on here. One, many people are asking how you could identify any personal information about people based on their movie preferences. The answer is data-mining. Very sophisticated techniques exist to do things exactly like this, i.e. take a data set and find out about the people.
The second problem is that by deanonymizing the NetFlix data, you can start to cheat on the NetFlix prize. The requirement to win $1 million is that your recommendation engine is 10% better than the one they are currently using. However, if you can learn the exact preferences of some users in the dataset (i.e. by finding the rest of their ratings on IMDB) then you can hardcode that into your recommendation engine and get the recommendations for these users exactly right. This can boost your score even though your actual system is no better than the existing one. This is known as over-fitting to the data.
Finally, this paper is over a year old. Can we please have some new news?
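A toy illustration of the cheating scenario described above, assuming the score is root-mean-square error over held-out ratings. All names and numbers here are hypothetical; the point is only that overriding predictions with leaked true ratings lowers the error without the model getting any smarter.

```python
import math

def rmse(predictions, truth):
    """Root-mean-square error over every (user, movie) key in truth."""
    squared = [(predictions[key] - truth[key]) ** 2 for key in truth]
    return math.sqrt(sum(squared) / len(squared))

def patch_with_leaked(predictions, leaked):
    """Copy the model's predictions, then overwrite any rating we already know."""
    patched = dict(predictions)
    patched.update(leaked)
    return patched

# Hypothetical held-out ratings the contest would score against.
truth = {("u1", "A"): 5, ("u1", "B"): 1, ("u2", "A"): 4, ("u2", "B"): 2}

# A deliberately dumb "model" that predicts 3 stars for everything.
baseline = {key: 3.0 for key in truth}

# Ratings for user u1 recovered from a public site (assumed, for illustration).
leaked = {("u1", "A"): 5, ("u1", "B"): 1}

baseline_rmse = rmse(baseline, truth)
cheat_rmse = rmse(patch_with_leaked(baseline, leaked), truth)
print(baseline_rmse, cheat_rmse)
```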
Every time you feel the need to vote 10 in Glitter, also vote 10 to The Godfather.
Every time you cheer for Brokeback Mountain, also put a 10 in Huge Knockers MXII.
Every time you want to express your love for Dersu Uzala, vote a 10 in Spice World, with added commentaries.
That way, everybody will know you're a security conscious computer scientist. Or a schizophrenic moron.
The summary is somewhat misleading: the only accounts that can be identified are those that belong to people who also rate on IMDB and who have thus chosen to make at least some of their ratings public. If person X rates 1000 movies on Netflix and has made 20 or so ratings on IMDB publicly available, then it is possible to infer with some small uncertainty which of the anonymized individuals in the Netflix database they are. Thus you have possibly figured out their ratings of the other 980 movies they rated for Netflix but did not post on IMDB. Interesting, but not earth-shattering or a serious breach of privacy, I would say.
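One way to picture the "some small uncertainty" part is to weight matches on obscure titles more heavily than matches on blockbusters, then look at how far the best candidate is ahead of the runner-up. The sketch below is only an illustration under assumed data; the 1/log weighting, the popularity counts, and the user ids are my own choices, not something taken from the paper or the summary.

```python
import math

def weighted_score(public, candidate, popularity):
    """Sum a rarity weight for every movie where the two ratings agree exactly.

    popularity maps movie -> number of raters (assumed >= 1); a rarely rated
    title contributes more than a blockbuster.
    """
    total = 0.0
    for movie, rating in public.items():
        if candidate.get(movie) == rating:
            total += 1.0 / math.log(1 + popularity[movie])
    return total

def top_two_gap(public, anonymized, popularity):
    """Return the best-scoring user id and its lead over the runner-up.

    Needs at least two candidate records. A small lead means the match is
    ambiguous and should not be trusted.
    """
    scores = sorted(
        ((weighted_score(public, rec, popularity), uid)
         for uid, rec in anonymized.items()),
        reverse=True,
    )
    (best, best_uid), (second, _) = scores[0], scores[1]
    return best_uid, best - second

# Hypothetical rater counts and records.
popularity = {"The Godfather": 100000, "Dersu Uzala": 300, "Hudson Hawk": 2000}
public = {"The Godfather": 5, "Dersu Uzala": 5, "Hudson Hawk": 4}
anonymized = {
    "user_a": {"The Godfather": 5, "Dersu Uzala": 5, "Hudson Hawk": 4},
    "user_b": {"The Godfather": 5, "Dersu Uzala": 2},
    "user_c": {"The Godfather": 5},
}

best_uid, gap = top_two_gap(public, anonymized, popularity)
print(best_uid, gap)
```

Everybody gave The Godfather five stars, so agreeing on it says almost nothing; agreeing on Dersu Uzala narrows the field a lot.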
It's psychosomatic. You need a lobotomy. I'll get a saw.
This is total hyperbole.
All the researchers are saying is that they can deduce some of your preferences based on your other preferences. Of COURSE you can do that, that was the whole point of the contest Netflix put up.
What they are _not_ saying is that they now know who you are, where you live, or anything uniquely identifying about you. So basically, you are still anonymous.
I'm starting to tire of news headlines that claim the world is on fire when someone actually just does something slightly derivative from the norm and thinks they are brilliant. The noise from these non-events masks actual brilliant achievements and makes it seem that everyone is doing banal work.
As far as I know in IMDB you are rating the overall quality of the movie, not I agree with it OR I want to see more like this.
One example: Schindler's List. Great movie, do NOT want to see it again. Same with Grave of the Fireflies. Some movies just ain't for multiple viewings. They are my "favorite movies I never want to see again".
On the other hand I got movies I can watch any day of the week, but that I would NEVER rate as highly. Cannonball Run is one such movie. I watch it far too often, but I wouldn't call it a good movie. You can always find me ready for a Jackie Chan movie or a spaghetti western.
Is the Netflix rating system an "I liked this movie and want to see more like it" system or a "This movie was brilliant and I would highly recommend it to everyone else" type of rating system?
Granted some people get it confused, probably the same people that use the slashdot moderation system to silence views they don't like, but that only makes basing conclusions on user ratings even more problematic.
I can rate a movie highly even if I do not agree with it, simply because it is good. And I can rate a movie I really like to watch as crap simply because I know I like watching crap.
I don't like the godfather movies, I can see they are high quality, I just don't like them. So my rating them would be fairly high as for quality, but low for 'I want to see more like this'.
I thought that the Netflix system was "I want to see more like this" based. Surely nobody is so stupid as to think a quality rating and an "I like this" rating system are the same? Or am I completely in the wrong in seeing a difference between the two? Am I insane in thinking that you can see a movie as being a great artwork and still not like it, or vice versa?
MMO Quests are like orgasms:
You may solo them, I prefer them in a group.
...would be a lot more appreciative of this proof of concept if someone trawled Slashdot threads to see how often you feed trolls by responding to comments with a "-1" rating... :P
It is by my will alone my thoughts acquire motion; it is by the juice of the coffee bean that the thoughts acquire speed
Finding a paragraph like this in a research paper makes me call into question the motives and intentions of the 'researchers.' They seem sort of like the Jerry Springer of research (since he's just trying to help the families he has on his show...).
They imply that the person didn't like "Super Size Me" because he's probably fat (or are they trying to imply that he has a problem with gaining weight and is jealous?).
Also, they imply that because he rated two "predominantly gay theme" items as poor he must not be homosexual. Or are they implying that because he rented/rated these that he must be gay (because who would ever rent them otherwise).
The fact that they use the "there's more juicy stuff about this guy, but we can't tell because we're serious researchers" line at the end is the pièce de résistance that really shows what motivates these researchers.
Wait - you mean if I enjoyed a movie with a gay theme, people are going to assume I'm gay?
Anyone think the IMDB rating of Brokeback Mountain is going to plummet dramatically? (It is 7.8 today.)
And of course, if it does, we will be able to correlate the timing of the sudden drop with the publishing of this slashdot article, allowing us to link the slashdot readership with imdb users. Now we have your Netflix ratings, IMDB ratings, AND slashdot postings all correlated...
Now do you see the POWAH inherent in that knowledge?
If you were blocking sigs, you wouldn't have to read this.
Because it isn't a Credit Card # or SSN it isn't serious?
A) Some people would rather go to jail or commit suicide than admit to something embarrassing they'd rather keep private. Privacy isn't (just) about hiding (illegal) things from the Government.
B) Demographic information is something you can never take back and can never change.
At least I can get a new credit card & SSN.
[Fuck Beta]
o0t!
I think the real problem here is being buried in misguided analysis about the meaning of anonymity and associating movie preferences to political affiliation, etc.
Here's what's really been demonstrated: private information about users of some IMDB accounts who have rated movies on both IMDB and Netflix has been made public, despite Netflix's implicit assertion that releasing anonymous data is "safe." The user himself has not really been compromised -- nobody knows his address, phone number, names of family members, etc. -- but people now know more about the IMDB account than was intentionally published. When the user publicly posted his opinions about 5 (say) favorite movies, he did not expect his private opinions about 100 others, as expressed in Netflix, to also be publicly associated with that account.
The practical impact isn't clear. If the private information were conveniently published by IMDB, so that nobody had to work very hard to view it, it might sway how likely readers are to trust a certain reviewer. The impact of that change in trust doesn't seem very meaningful, though, and in any case, the private information *isn't* conveniently published. If, under similar circumstances, there were a correlation between private information and an eBay account, then there could be a real financial impact.
Another concern is that, if other factors have already made it possible to correlate an IMDB account with a real person, then someone can make the jump to associating all this private data with that individual. For example, I might link to my IMDB profile from my blog so all my coworkers can see my public reviews, not realizing that it's now possible for them to determine what movies I've privately watched.
Then we can have this discussion again over more prevalent movies that are controversial. Whoever did this research paper really should have waited until Golden Compass comes out. I use GC as an example because when the movie comes out a lot of people are not going to be pleased with it. Also Prince Caspian comes out in '08 as well. While totally off base, those two movies are going to be what defines the new IMDB in the given year.
I can say that whatever offbeat movies come out are going to have substantial ratings as well, but Golden Compass is going to have the most impact, just because it's going to be very disliked because of the whole history surrounding it (and yes, there will more than likely be boycotts opening day) and its producers.
enough said? I think so.
Re: B, you can usually change any sort of non-biological (and, using extreme measures, some biological ones too) demographic information about yourself. There's nothing that says you can't suddenly turn from liberal to conservative or vice versa, or get married (or turn gay/lesbian), etc.
OT: is there a way to escape greaterthan/lessthan signs?
"I think an etch-a-sketch with an ethernet port would beat IE7 in web standards compliance."
If you think you can determine political affiliation based on how someone rates movies, especially in America, then you're just plain retarded.
To take an example, a left-winger might rate Michael Moore flicks poorly because one thing about Moore's stuff is he almost always seems to avoid more effective ways of making his points. They agree with the message, disagree with the methodology or style of film. On the other side of things, a libertarian, Goldwater Republican, "conservative", etc., might rate Moore's Sicko highly, because there is undoubtedly something wrong and shameful with health care in America whether you believe in socialized medicine or not.
But you know what - it wouldn't surprise me if the day came when your movie ratings came back to haunt you. America and other countries do seem to be headed in that direction.
ampersand-lt-semicolon results in <
ampersand-gt-semicolon results in >
(no spaces or dashes.)
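If you'd rather not type the entities by hand, Python's standard library will generate them. A minimal sketch; the sample string is made up.

```python
import html

# A made-up comment line containing characters a comment form would swallow.
raw = "if a < b and b > c: print('a & c')"

# quote=False leaves ' and " alone and escapes only &, < and >.
escaped = html.escape(raw, quote=False)
print(escaped)

# html.unescape reverses the substitution.
restored = html.unescape(escaped)
```

Paste the escaped text into the comment box and the signs show up literally instead of being treated as markup.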
The comment "favorite movie I never want to see again" is one I got from a review of Grave of the Fireflies that I just happened to totally agree with. Don't read the reviews, just watch it yourself, and if you are not into anime just set that aside for the duration of the movie, then ask yourself again if you can understand that comment.
It is a powerful movie, like Schindler's List, but not a happy tale. I am not talking a tear jerker movie here, I am talking a "we will all burn in hell for this" movie. Tear jerkers I can take, Christmas in August is one. Sad tale, nicely told but ultimately human. It makes you sad, not sick of humanity.
Perhaps I am just too emotional about this kinda stuff; one reason might be that I grew up with half-understood tales of "that was where your great-uncle was picked up". When you realize just why your grandmother had 9 brothers and sisters yet you never met any. I got one aunt, my grandparents had 3 kids; a starvation story like GotF hits a lot closer with a history like that. (The Dutch hunger winter.)
I enjoy all kinds of movies and would NOT have NOT watched these two, but that doesn't mean I want to see them again. There are some people who list Schindler's List as a feel good movie because it 'ends well'. I suppose you might see it that way, I don't.
I can recognize your statements that the photography is nice and the screenwriting is well done, but the plot is interesting? To you it is a plot, to me it is a sickening part of history that I am far too close to.
Perhaps it is a bit like how Richard Pryor's monologue about the 200th celebration of the US was not exactly all that cheerful.
Terry Pratchett's Nanny Ogg describes at one point the difference between merry and mirth (or something like that): she describes how she was joyful when her child was being born but she wasn't exactly chuckling at the time. Enjoying a movie and enjoying it are two different things, at least for me. I can't describe it any clearer.
From what I'm getting out of this, all they've really shown is they can guess the identities of a few people by correlating movie ratings in content and time between imdb and netflix? That doesn't sound like much of an ID to me.
Finally someone else that gets it!! I use almost exactly the same words to describe Schindler's List. It's a really good movie, and I think everyone should see it once, but I never want to see it again. Ever.
And actually I'm starting to think I should avoid re-watching any movie that has ever caused such intense emotions, because I've discovered that you just can't duplicate the original experience in repeated viewings. When I was a teenager, the end of Last of the Mohicans was incredibly moving, but more recent watchings leave me feeling mostly empty -- partly because I'm now closer to Cora's age, so I'm not as attached to Alice as I was back in 1992, but also because I know what's going to happen. Ditto for Fearless. I left the movie theater a changed person, but when I saw it again on DVD a few years ago I got nothing. Meanwhile, this year's Stardust gave me that same changed person feeling, but I'm afraid to watch it again so I won't ruin the memory of the original experience.
On the "oh shit" side of the spectrum, the Usual Suspects and Fight Club aren't quite as good the second time around, but they're still very watchable and very good movies. However, the Sixth Sense is only good for two viewings (* the second is only necessary if you weren't paying close attention the first time around), and I have absolutely no desire to ever see it or any other M. Night Shyamalan movie ever again.
> The researchers used this method to find how individuals on the IMDb privately rated films on Netflix,
> in the process possibly working out their political affiliation, sexual preferences and a number of
> other personal details"
How about working out the other ratings the people made so they can include it in a trivial predictive app and submit it for the million dollar prize?
(-1: Post disagrees with my already-settled worldview) is not a valid mod option.
Bruce Willis as Hudson Hawk: "Bunny, ball ball." (mimicking Sandra Bernhard's line earlier in the movie)
[Bunny the dog looks up and squeals excitedly]: "rrrrrrr?"
[Hawk launches the rocket propelled tennis ball at the dog]
[Bunny flies out the window of the castle]
That scene alone is worth the price of admission, dammit!
(Disclaimer: I own the DVD, and I think it's easily one of the 10 best comedies ever. Personally I'd give the movie an 8/10, but for other people I'd give it about a 7/10.)
It should really read "non-anonymous ratings of movies in IMDB gives away your movie preferences".
Because really, that's all the meat their "research" has.
And in fact, if you rate movies in IMDB, and your handle can be tracked, who needs the netflix data?
This whole thing is a non-issue, and the paper is so content-slim I doubt it will be accepted anywhere (well, maybe "New Scientist" will print it...)
HI O WISE PRINCE. WHT TOOK U SO DAM LONG?
Nothing was broken.
All they did was notice that you can match up review scores and dates pretty well with stuff on IMDB.
This means nothing - all you get is a possible link to a username.
A username does not equal a person.
Most usernames/accounts on IMDB/Netflix correspond to several people.
Even if you could directly say "username x is a single person and this is their complete ratings history", username x is still anonymous. It's not like anyone hacked into netflix and knows what ratings I, as a person, gave out.
More "research" that means nothing, more sensationalist crap.
There are statistical associations everywhere. Just because you can't see them doesn't mean a computer can't. Anonymity is impossible -- identity will always leak through in some form.
OT: is there a way to escape greaterthan/lessthan signs?
You have to use the HTML escape codes, which are &lt; and &gt; .
--
Promoting critical thinking since 1994.
To the issue of your anonymity being shattered, puh-lease. If you post information in a public forum such as IMDB and it can be correlated to information from MySpace, it wasn't a giant leap into your privacy. It was just gathering already public information. What's the big deal?
You choose to post that stuff where it could be publicly viewed. The fact that it lines up with data from Netflix only proves that NF did in fact provide a quality dataset. Big deal.
I scream. You scream. I assume that means we're both acquainted with the problem. We proceed.
I'm sorry but the continuity mistakes where Peter North's socks are on before the money shot then clearly off afterwards and when Mercedes says "Fuck me deeper!" before she is fucked ruined the film for me.
Actually, that's hardly any breach of privacy. Your ratings might be made public only if you made them public yourself on IMDB. Theoretically, your Netflix profile will include more movies, but then it would not be a close match, and identification will not occur. Even if it is the closest match, it's a game of probabilities then: it gives the person grounds for plausible deniability. In any case, that blog write-up is complete bullshit: it assumes that people rate the same on IMDB and Netflix. That's not the case, not only because people expect the latter to be private, but 1) because the scales are different, and people interpret different scales differently; 2) if you don't rate at the same time on both sites, your experience changes, and you rate the same movie differently; 3) I use IMDB to remember which movies I saw, and use Netflix to recommend me movies, which means my reasons for rating are different, which makes the ratings differ. Another assumption mentioned on that blog is the date match: the same movies rated on the same day. If I represent the average person, I don't usually rate movies on both sites at the same moment; I do them in batches. My take on that "research": it's a non-conclusive load of crap.
The Fifth Element. I think it's because it's a mixed-genre movie. Not action, not comedy, somewhere in between.
Both star Bruce Willis, interestingly enough.
"Hey mister, are you gonna die?"
"Do you know what it's like to be called Chlamydia for a year?"
"You are a slender reed compared to that guard"
Both HH and 5E are in my top 10 movies. And the commentary on Hudson Hawk is great - they talk about how they hired the narrator from Rocky & Bullwinkle, so that you'd know the tone they were taking. Fun stuff.
"Sometimes a woman is a kind of religion, she can save your soul & set you free from all your sins" - Bad Examples
&amp; results in &
:)
&gt; => >, &lt; => <
crymore slashdot faggot
Wow, it is surprising that the RIAA/MPAA has not yet offered them a job. With several lawyers and an infallible rhetoric, I am sure they would find a way to explain to a judge that if you have rated a movie on Netflix without owning the DVD or having paper proof that you bought a theater ticket for the movie, then you must have downloaded it illegally... Isn't it obvious??
Now tell me, how many people would rate on both systems and use the same rating? This implies you remember your rating. Maybe I don't know enough about the system but the system is highly suspect unless the user has rated a large quantity of movies that are in both systems and includes many far outside of the mainstream movies. Honestly, if you are doing things publicly, forget about privacy. More than this can be used to identify you. Cheers
None of the mainstream media picked up on it, but I remember thinking this sort of thing might be possible with the data lost by HMRC too. I bet Tesco would love to get their hands on it for planning where to put new stores and what to stock etc. Combined with their Clubcard database, of course.
const int one = 65536; (Silvermoon, Texture.cs)
SJW, n: "Someone I don't like, and by the way I'm a fuckwit" - AC
For most people, it wouldn't be a serious breach of privacy. However, you can imagine a scenario where it would be.
Imagine a pastor who uses a recognizable username for many sites, including both IMDB and his church's web forums. He uses Netflix as a way to feed his secret love of movies with sexual content which his church would publicly denounce. Now these researchers could link his username to ratings for all these movies, and post the information online.
All it would take then is for a curious church member to google the pastor's username, and his previously secret habit would now be public knowledge. He could lose his reputation and his career.
I think that hypocritical religious leaders deserve to be exposed because of their chosen place in society, so this sounds fan-fucking-tastic to me.
If you have a hypocritical pastor, I'll buy the domain and host the site.
Alright, in that case you like the outcome. What about an officer in a "Don't ask, don't tell" military who rents gay movies? What about an employee who rents an informational DVD about how to change careers, and posts a review which they believed was anonymous, in which they reveal their plans to leave their job?
The issue isn't really about movie reviews. It's about the expectation of privacy at any website. If a site is offering privacy and anonymity, they need to be responsible with whatever information is entrusted to them, no matter how unimportant that information may seem. We're discovering that even anonymizing data before releasing it isn't adequate.