Anonymity of Netflix Prize Dataset Broken
KentuckyFC writes "The anonymity of the Netflix Prize dataset has been broken by a pair of computer scientists from the University of Texas, according to a report from the physics arXiv blog. It turns out that an individual's set of ratings and the dates on which they were made are pretty unique, particularly if the ratings involve films outside the most popular 100 movies. So it's straightforward to find a match by comparing the anonymized data against publicly available ratings on the Internet Movie Database (IMDb) (abstract on the physics arXiv). The researchers used this method to find how individuals on the IMDb privately rated films on Netflix, in the process possibly working out their political affiliation, sexual preferences and a number of other personal details."
The summary is somewhat misleading: the only accounts that can be identified are those that belong to people who also rate on IMDb and who have thus chosen to make at least some of their ratings public. If person X rates 1000 movies on Netflix and has made 20 or so ratings publicly available on IMDb, then it is possible to infer, with some small uncertainty, which of the anonymized individuals in the Netflix database they are. Thus you have possibly figured out their ratings of the other 980 movies they rated for Netflix but did not post on IMDb. Interesting, but not earth-shattering or a serious breach of privacy, I would say.
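To make the matching idea concrete, here is a rough Python sketch of that kind of linkage. It is only a toy under my own assumptions, not the researchers' actual algorithm: the names (`similarity`, `best_match`), the tolerances (`rating_tol`, `day_tol`), the `min_gap` rule and the sample data are all made up for illustration. Each public rating counts as a hit if the anonymized record has the same film with a rating and date that are close enough, and a match is only reported when the best record clearly beats the runner-up.

```python
from datetime import date

def similarity(public, record, rating_tol=1, day_tol=14):
    """Count public ratings that the anonymized record approximately matches."""
    score = 0
    for movie, (rating, day) in public.items():
        if movie in record:
            rec_rating, rec_day = record[movie]
            close_rating = abs(rec_rating - rating) <= rating_tol
            close_date = abs((rec_day - day).days) <= day_tol
            if close_rating and close_date:
                score += 1
    return score

def best_match(public, dataset, min_gap=2):
    """Return the id of the best-scoring record, or None if it does not
    beat the runner-up by at least min_gap hits (hypothetical rule)."""
    scores = sorted(
        ((similarity(public, rec), rid) for rid, rec in dataset.items()),
        reverse=True,
    )
    if not scores:
        return None
    top_score, top_id = scores[0]
    runner_up = scores[1][0] if len(scores) > 1 else 0
    if top_score - runner_up < min_gap:
        return None  # too ambiguous to call
    return top_id

# Made-up "anonymized" records: film -> (stars, date rated)
dataset = {
    "anon_1": {"Film A": (5, date(2005, 3, 1)), "Film B": (2, date(2005, 3, 4)),
               "Film C": (4, date(2005, 4, 10)), "Film D": (1, date(2005, 5, 2))},
    "anon_2": {"Film A": (5, date(2005, 7, 1)), "Film E": (3, date(2005, 7, 2))},
    "anon_3": {"Film B": (4, date(2004, 1, 1)), "Film C": (4, date(2004, 1, 15))},
}

# Made-up public ratings for one person, slightly off in stars and dates
public = {"Film A": (5, date(2005, 3, 3)), "Film B": (3, date(2005, 3, 5)),
          "Film C": (4, date(2005, 4, 12))}

print(best_match(public, dataset))
```

In this toy, the three public ratings single out one record, and that record's rating of "Film D" is the part the person never posted publicly. With only one public rating the gap rule refuses to name anyone.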
It's psychosomatic. You need a lobotomy. I'll get a saw.
If I had mod-points, I'd mod you up insightful. I didn't think someone would spot where I copied the review from so fast.
Not true -- obscure films help a little bit, but not too much. We put up a recent draft of our paper in which the dependence on obscure movies is much reduced.
"b) voting similarly or identically on lots of films so that they can get a better idea as to whether it is the same person based on them liking the same films the same amounts."
Again, not true at all. One of the main claims of our paper is that our method is tolerant to an INCREDIBLE amount of noise. We have the math to back this up.
--Arvind Narayanan