Slashdot Mirror


AOL, Netflix and the End of Open Research

An anonymous reader writes "In 2006, heads rolled at AOL after the company released anonymized logs of user searches. With last week's announcement that researchers had been able to learn the identities of users in the scrubbed Netflix dataset, could the days of companies sharing data with academic researchers be numbered? Shortly after the AOL incident, Google's Eric Schmidt called the data release 'a terrible thing,' and assured the public that 'this kind of thing could not happen at Google.' Will any high tech company ever take this kind of chance again? If not, how will this impact research and the development of future technologies that could have come from the study of real data?"

17 of 85 comments

  1. Correlations by Lachryma · · Score: 5, Insightful
    The identities were learned because the users shared their movie preference information with IMDB.

    I don't see this as a problem, yet.

  2. k-anonymity and l-diversity by omnirealm · · Score: 5, Informative

    There exist effective techniques that can anonymize the data in order to thwart attempts to correlate identities, while still preserving the statistical properties of the data that make it useful to researchers. They include k-anonymity and l-diversity:

    http://privacy.cs.cmu.edu/people/sweeney/kanonymity.html

    http://www.cs.cornell.edu/~dkifer/papers/ldiversity.pdf
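
    As a rough illustration of what the two properties measure, here is a minimal sketch in Python. The records, field names, and helper functions are hypothetical and are not taken from the linked papers. It computes the largest k for which a table is k-anonymous (the size of the smallest group sharing the same quasi-identifiers) and the largest l for which it is distinct l-diverse (the fewest distinct sensitive values inside any such group).

```python
from collections import Counter, defaultdict


def k_anonymity(records, quasi_identifiers):
    """Size of the smallest group of records with identical quasi-identifiers."""
    if not records:
        return 0
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(groups.values())


def l_diversity(records, quasi_identifiers, sensitive):
    """Fewest distinct sensitive values found in any quasi-identifier group."""
    if not records:
        return 0
    values = defaultdict(set)
    for r in records:
        values[tuple(r[q] for q in quasi_identifiers)].add(r[sensitive])
    return min(len(v) for v in values.values())


# Hypothetical toy records, invented for illustration only.
records = [
    {"zip": "15213", "age": 34, "sex": "F", "diagnosis": "flu"},
    {"zip": "15213", "age": 34, "sex": "F", "diagnosis": "asthma"},
    {"zip": "15217", "age": 51, "sex": "M", "diagnosis": "flu"},
]

qi = ["zip", "age", "sex"]
k = k_anonymity(records, qi)               # third record is unique
l = l_diversity(records, qi, "diagnosis")  # its group has one diagnosis
```

    A result of k = 1 means at least one record is unique on its quasi-identifiers, which is exactly the kind of record that can be matched against an outside database.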

    --
    An unjust law is no law at all. - St. Augustine
    1. Re:k-anonymity and l-diversity by stranger_to_himself · · Score: 3, Insightful

      From scanning those articles it looks as if they are just methods for defining levels of anonymity in a dataset, rather than providing any effective means of achieving it (please correct me if I'm wrong).

      For example, if I am planning a study of small-area (i.e., zip-code-level) variation in the levels of some disease, adjusting for, say, age, sex, and ethnicity, I can't see how I could do so without a dataset that includes all of these items. How could you make the records less unique without throwing away the data?
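
      For what it's worth, the usual answer in the k-anonymity literature is generalization (coarsening values) and suppression (dropping outliers), and it does throw away detail. A minimal sketch, assuming hypothetical records and an arbitrary choice of 3-digit zip prefixes and 10-year age bands:

```python
from collections import Counter


def generalize(record):
    """Coarsen quasi-identifiers: 3-digit zip prefix and a 10-year age band."""
    low = (record["age"] // 10) * 10
    return {
        "zip": record["zip"][:3] + "**",
        "age": f"{low}-{low + 9}",
        "sex": record["sex"],
    }


def smallest_group(records):
    """Size of the smallest group sharing the same (zip, age, sex) values."""
    groups = Counter((r["zip"], r["age"], r["sex"]) for r in records)
    return min(groups.values())


# Hypothetical toy records, invented for illustration only.
records = [
    {"zip": "15213", "age": 34, "sex": "F"},
    {"zip": "15217", "age": 38, "sex": "F"},
    {"zip": "15232", "age": 31, "sex": "F"},
]

before = smallest_group(records)                          # every record unique
after = smallest_group([generalize(r) for r in records])  # all fall in one group
```

      The records become indistinguishable from one another, but the zip-level variation the study wanted is gone, which is the trade-off in question.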

      We have to accept that if we want meaningful research to happen, then some amount of data sharing and linking needs to occur. We need to rely, in medicine at least, on ethics committees to represent our best interests when it comes to striking the balance.

      It seems to me that the trend for guarding personal data like it's the family silver is a relatively modern thing. If it continues, then reliable, unbiased medical research, especially disease monitoring and control, will become impossible.

  3. The Impact by flaming+error · · Score: 4, Insightful

    > how will this impact research and the development
    > of future technologies that could have come from the
    > study of real data?

    It's definitely a hindrance. Kind of like not letting cops search houses without permission.

    1. Re:The Impact by mabhatter654 · · Score: 2, Informative

      But it's the company's data, not yours. Once they strip out your name and such, your privacy claims are limited. Not that people won't piece things back together using an outside database; this is what happened in the Netflix case. They were able to guess user #3956's name at ANOTHER website. They could probably keep the info off the net at large by only letting researchers use their equipment under NDA, so not everybody has this info.
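
      To make "piecing things back together" concrete, here is a deliberately simplified sketch of a linkage attack. It is not the researchers' actual algorithm; the IDs, names, titles, and the `min_overlap` threshold are all hypothetical. It matches each anonymous ID to the public profile that shares the most identical (title, rating) pairs.

```python
def link(anonymous_ratings, public_reviews, min_overlap=3):
    """Match anonymous IDs to public names sharing >= min_overlap ratings."""
    matches = {}
    for anon_id, ratings in anonymous_ratings.items():
        best_name, best_overlap = None, 0
        for name, reviews in public_reviews.items():
            overlap = len(ratings & reviews)  # identical (title, rating) pairs
            if overlap > best_overlap:
                best_name, best_overlap = name, overlap
        if best_overlap >= min_overlap:
            matches[anon_id] = best_name
    return matches


# Hypothetical "scrubbed" dataset: user IDs with no names attached.
anonymous_ratings = {
    3956: {("Movie A", 5), ("Movie B", 2), ("Movie C", 4), ("Movie D", 1)},
    1207: {("Movie A", 4), ("Movie E", 3)},
}

# Hypothetical public reviews posted under real or traceable names elsewhere.
public_reviews = {
    "alice_example": {("Movie A", 5), ("Movie B", 2), ("Movie C", 4)},
    "bob_example": {("Movie A", 4), ("Movie F", 5)},
}

matches = link(anonymous_ratings, public_reviews)
```

      Only user 3956 is matched here, because three ratings coincide exactly with one public profile. User 1207 shares a single rating with anyone, which is too little to go on under this threshold.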

      As far as "legal searching" goes, they already do this... legally, they just pay money for private access to these databases with your name still attached! Anything tied to banking or SSN the govt already has in spades.

  4. Opt-in by chiasmus1 · · Score: 4, Interesting

    There are people who do not really care if their search results are added to the collection that is released. If Google had an opt-in option for data that they were going to release to academic researchers, I would opt-in. I imagine that there are other people who do not care who is looking at their searches. Something that companies might consider if they wanted to release search results is the option for the users to see what information gets released.

    1. Re:Opt-in by houstonbofh · · Score: 2, Insightful

      But how many would? There are "Chilling Effects" all over the place. For example, I don't want to share my data because it may not be deleted (Gmail and Facebook), and I don't want you to share my data because I don't know what you will do with it (RIAA), and no one wants to approach the line because lawyers are too damn expensive. I think we need to reinstitute "Trial by Combat" as a defense. Nothing else has stopped frivolous legal shenanigans...

    2. Re:Opt-in by kcwhitta · · Score: 5, Insightful

      The problem with opt-in statistical gathering is that it can skew a sample, subtly biasing it. This would invalidate a lot of scientific research.
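
      A small worked example of that skew, with made-up numbers: assume 1,000 users, 400 of whom make searches they would rather keep private, and assume those users opt in at 10% while everyone else opts in at 50%. The rates are invented purely for illustration.

```python
# Hypothetical population: users 0..399 make "sensitive" searches.
users = [{"id": i, "sensitive": i < 400} for i in range(1000)]


def opts_in(user):
    """Assumed behavior: 1 in 10 sensitive users opts in, 1 in 2 of the rest."""
    step = 10 if user["sensitive"] else 2
    return user["id"] % step == 0


def sensitive_rate(group):
    """Fraction of a group that makes sensitive searches."""
    return sum(u["sensitive"] for u in group) / len(group)


sample = [u for u in users if opts_in(u)]

true_rate = sensitive_rate(users)     # share in the whole population
sample_rate = sensitive_rate(sample)  # share among those who opted in
```

      Under these assumptions the opt-in sample contains 40 sensitive users out of 340, so a researcher would see roughly 12% where the true figure is 40%.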

    3. Re:Opt-in by ZombieRoboNinja · · Score: 2, Insightful

      i.e., it might come as a surprise when researchers discover that NOBODY (who opted in) searches the internet for pornography, music torrents, Paris Hilton...

      Hell, out of Google's top 20 searches, you might get maybe 3 listed?

  5. Inviting drama by Rob+T+Firefly · · Score: 5, Funny

    Google's Eric Schmidt called the data release 'a terrible thing,' and assured the public that 'this kind of thing could not happen at Google.' Eric, you fool! Have you no concept of the world's tendency toward drama and hilarity? Loudly declaring "this kind of thing could never happen at Google" is like saying "at least it's not raining" or "it's a million-to-one chance" or some other damn fool thing that will prove you wrong nine times out of ten.
  6. research for the sake of? by BlowChunx · · Score: 4, Insightful

    I love this quote from TFA:
    "Companies do not make money by giving researchers access to data. "

    Wrong! Netflix released data to get a better recommendation system. The better they can pick movies for you, the more you will like their service. The $1 million prize is peanuts compared to the increase in revenue a better system can bring.

    I wonder if anyone has estimated the value of the man hours invested in this contest?

  7. Responsibility and rewards. by palegray.net · · Score: 2, Interesting

    If companies don't do a thorough enough job of sanitizing statistical data before releasing it, they have to be prepared to deal with the consequences. I'm all for maintaining research access to large volumes of real-world data, but it does need to be obtained through responsible channels.

    All that said, I think an interesting question is: How can we build systems that appropriately compensate companies for access to their data, with strict enforcement of measures designed to thwart misuse of the data? Posters above have given links to research that provides frameworks for making sure data is safe for release; how would a good wrapper for such a system work to incorporate rewards for companies who participate?

  8. So... by thatskinnyguy · · Score: 3, Funny

    Shortly after the AOL incident, Google's Eric Schmidt called the data release 'a terrible thing,' and assured the public that 'this kind of thing could not happen at Google.' So...
    AOL = evil
    Netflix = evil
    Facebook = evil
    Google != evil

    Thank you Eric for giving us the warm and fuzzies that Google is not evil with your two cents.
    --
    The game.
  9. Re:Researchers are to blame. by palegray.net · · Score: 2, Informative

    This is kinda like saying security researchers are to blame for discovering and publishing weaknesses in software. Responsible citizens just pretend everything is fine and wait for someone really bad to discover the same weaknesses and exploit them. Because it's so much easier chasing down criminals than it is to fix problems in the first place by adopting better security practices. I guess we could just arrest all researchers who publicize uncomfortable truths. What's the number to Adobe's legal department? I'm sure they still have a few district attorneys on speed dial...

  10. Medical records? by CheeseTroll · · Score: 3, Interesting

    This puts the idea of analyzing "anonymous" electronic medical records in an interesting light. Even without a name, SSN, or other ID that explicitly links a record to a specific person, could researchers cross-reference the data with other databases well enough to identify people via patterns in their health record? I'm guessing yes.

    For the record, it's not my intent to troll, but I do think it's something that future researchers will need to take into account to ensure people's privacy.

    --
    A post a day keeps productivity at bay.
  11. locking all people up doesn't prevent all crime by sethawoolley · · Score: 2, Insightful

    You're probably just trolling, but in case you aren't, seeing the rampant crime that is institutionalized in modern prisons, I think your argument falls flat on its face.

    Liberty doesn't have security as its price. Liberty and security are often positively correlated, not inversely correlated as you assume.

    As more people are free to do things that don't infringe on others' security, security often goes up, as the people who would be breaking security systems for their own benefit have plenty of other "acceptable" ways to reap goods, with far fewer risks to boot.

  12. ISPs already sell all your web browsing logs by Mal+Reynolds · · Score: 2, Informative

    This is just the tip of the iceberg. If you live in the US, it's likely that logs of all your web activity are being sold to clickstream companies. The data logs being sold by the ISPs seem to use the exact same sort of inadequate anonymity practices as were used by AOL.

    The problem is that no matter how well the data is cloaked, a user's browsing habits can easily make the anonymity worthless. As has been seen in the cases of Netflix and AOL, it's easy to figure out who a person is by simply looking at anonymized logs. A single visit to a social networking site is often enough to make a good guess. But when a specific anonymized IP address visits the same page on social networking sites, or edits a profile at a social networking site, or reviews an item at a vendor site, the real identity of that "anonymized" IP address is completely confirmed.
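
    A minimal sketch of how little it takes, assuming a hypothetical log format of (pseudonymous ID, URL) pairs and a made-up `/profile/edit?user=...` URL pattern. Real sites and real clickstream logs will differ; the point is only that the URLs themselves can carry the identity.

```python
from urllib.parse import urlparse, parse_qs


def identify(log):
    """Map each pseudonymous ID to usernames leaked by the URLs it visited."""
    leaks = {}
    for pseudo_id, url in log:
        parsed = urlparse(url)
        # Assumed pattern: profile edits expose the account name in the query.
        if parsed.path.endswith("/profile/edit"):
            for user in parse_qs(parsed.query).get("user", []):
                leaks.setdefault(pseudo_id, set()).add(user)
    return leaks


# Hypothetical "anonymized" clickstream: the IP is replaced by a token.
log = [
    ("a91f", "http://social.example/profile/edit?user=jdoe"),
    ("a91f", "http://news.example/story/42"),
    ("7c20", "http://news.example/story/42"),
]

leaks = identify(log)
```

    Once token "a91f" is tied to one account name, every other line logged under that token is tied to it as well.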

    Simply cloaking an IP address will never provide anonymity. But the companies that purchase your web surfing logs would have no use for logs that weren't attached to a single user. Unless the ISPs were to keep track of and filter out every single vendor site which revealed a user's real name, there would seem to be no safe way to anonymize user logs. Since there are countless numbers of web forums, vendors, and social networking sites, it would seem technically impossible to truly provide any safe level of anonymity for user logs. Selling these logs is just a bad practice that needs to be stopped.

    I can only wonder why the EFF and other organizations haven't made a bigger deal about this. These ISPs are selling all of their users' web logs. I cannot imagine any effective way the ISPs could ever anonymize this data. More info:

    http://wanderingstan.com/2007-03-19/is_comcast_selling_your_clickstream_audio_transcript

    http://arstechnica.com/news.ars/post/20070315-your-isp-may-be-selling-your-web-clicks.html