Slashdot Mirror


AOL, Netflix and the End of Open Research

An anonymous reader writes "In 2006, heads rolled at AOL after the company released anonymized logs of user searches. With last week's announcement that researchers had been able to learn the identities of users in the scrubbed Netflix dataset, could the days of companies sharing data with academic researchers be numbered? Shortly after the AOL incident, Google's Eric Schmidt called the data release 'a terrible thing,' and assured the public that 'this kind of thing could not happen at Google.' Will any high-tech company ever take this kind of chance again? If not, how will this impact research and the development of future technologies that could have come from the study of real data?"

10 of 85 comments

  1. Correlations by Lachryma · · Score: 5, Insightful
    The identities were learned because the users shared their movie preference information with IMDB.

    I don't see this as a problem, yet.

  2. k-anonymity and l-diversity by omnirealm · · Score: 5, Informative

    There exist effective techniques that can anonymize the data in order to thwart attempts to correlate identities, while still preserving the statistical properties of the data that make it useful to researchers. They include k-anonymity and l-diversity:

    http://privacy.cs.cmu.edu/people/sweeney/kanonymity.html

    http://www.cs.cornell.edu/~dkifer/papers/ldiversity.pdf
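
    As a rough illustration of what the two properties measure, here is a minimal Python sketch. The records, the field names, and the choice of `zip` and `age` as quasi-identifiers are all hypothetical, and `l_diversity` implements only the simplest "distinct values" reading of l-diversity, not the stronger variants.

```python
from collections import defaultdict

def equivalence_classes(records, quasi_identifiers):
    """Group records that share the same quasi-identifier values."""
    groups = defaultdict(list)
    for record in records:
        key = tuple(record[q] for q in quasi_identifiers)
        groups[key].append(record)
    return groups

def k_anonymity(records, quasi_identifiers):
    """Size of the smallest group: the data is k-anonymous for this k."""
    groups = equivalence_classes(records, quasi_identifiers)
    return min(len(group) for group in groups.values())

def l_diversity(records, quasi_identifiers, sensitive):
    """Fewest distinct sensitive values found in any single group."""
    groups = equivalence_classes(records, quasi_identifiers)
    return min(len({r[sensitive] for r in group}) for group in groups.values())

# Hypothetical toy records, invented purely for illustration.
records = [
    {"zip": "148**", "age": "20-29", "diagnosis": "flu"},
    {"zip": "148**", "age": "20-29", "diagnosis": "asthma"},
    {"zip": "130**", "age": "30-39", "diagnosis": "flu"},
    {"zip": "130**", "age": "30-39", "diagnosis": "flu"},
]
QUASI = ["zip", "age"]

print(k_anonymity(records, QUASI))               # 2
print(l_diversity(records, QUASI, "diagnosis"))  # 1
```

    In this toy table every record hides among at least two, yet the second group shares a single diagnosis, so knowing someone is in that group still reveals the sensitive value. That gap is what l-diversity is meant to catch.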

    --
    An unjust law is no law at all. - St. Augustine
    1. Re:k-anonymity and l-diversity by stranger_to_himself · · Score: 3, Insightful

      From scanning those articles it looks as if they are just methods for defining levels of anonymity in a dataset, rather than providing any effective means of achieving it (please correct me if I'm wrong).

      I can't see how, for example, if I am planning a study of small-area (i.e., zip-code-level) variation in the levels of some disease or other, while adjusting for, say, age, sex, and ethnicity, I could do so without a dataset that included all of these items. How could you make the records less unique without throwing away the data?

      We have to accept that if we want meaningful research to happen, then some amount of data sharing and linking needs to occur. We need to rely, in medicine at least, on ethics committees to represent our best interests when it comes to striking the balance.

      It seems to me that the trend for guarding personal data like it's the family silver is a relatively modern thing. If it continues, then reliable, unbiased medical research, especially disease monitoring and control, will become impossible.
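
      The usual way to make records less unique is generalization: coarsening attributes rather than dropping them. The Python sketch below is only an illustration under assumed, made-up records and field names. It truncates zip codes to a three-digit prefix and buckets ages into ten-year bands, which is exactly the kind of lost resolution that a zip-code-level study would have to weigh.

```python
from collections import Counter

def generalize(record):
    """Coarsen quasi-identifiers: 3-digit zip prefix, 10-year age band."""
    low = (record["age"] // 10) * 10
    return {
        "zip": record["zip"][:3] + "**",
        "age": f"{low}-{low + 9}",
        "sex": record["sex"],
    }

def smallest_group(records):
    """Size of the smallest set of records with identical attribute values."""
    counts = Counter(tuple(sorted(r.items())) for r in records)
    return min(counts.values())

# Hypothetical records, invented purely for illustration.
raw = [
    {"zip": "14850", "age": 23, "sex": "F"},
    {"zip": "14853", "age": 27, "sex": "F"},
    {"zip": "13053", "age": 34, "sex": "M"},
    {"zip": "13068", "age": 38, "sex": "M"},
]
coarse = [generalize(r) for r in raw]

print(smallest_group(raw))     # 1: every raw record is unique
print(smallest_group(coarse))  # 2: each coarse record matches one other
```

      The data is not thrown away, but it is blurred: the coarse table can no longer support an analysis at full zip-code resolution, which is the trade-off raised above.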

  3. The Impact by flaming+error · · Score: 4, Insightful

    > how will this impact research and the development
    > of future technologies that could have come from the
    > study of real data?

    It's definitely a hindrance. Kind of like not letting cops search houses without permission.

  4. Opt-in by chiasmus1 · · Score: 4, Interesting

    There are people who do not really care if their search results are added to the collection that is released. If Google had an opt-in option for data that they were going to release to academic researchers, I would opt in. I imagine that there are other people who do not care who is looking at their searches. Something that companies might consider if they wanted to release search results is the option for the users to see what information gets released.

    1. Re:Opt-in by kcwhitta · · Score: 5, Insightful

      The problem with opt-in statistical gathering is that it can skew a sample, subtly biasing it. This would invalidate a lot of scientific research.

  5. Inviting drama by Rob+T+Firefly · · Score: 5, Funny

    > Google's Eric Schmidt called the data release 'a terrible thing,' and assured the public that 'this kind of thing could not happen at Google.'

    Eric, you fool! Have you no concept of the world's tendency toward drama and hilarity? Loudly declaring "this kind of thing could never happen at Google" is like saying "at least it's not raining" or "it's a million-to-one chance" or some other damn fool thing that will prove you wrong nine times out of ten.

  6. research for the sake of? by BlowChunx · · Score: 4, Insightful

    I love this quote from TFA:
    "Companies do not make money by giving researchers access to data."

    Wrong! Netflix released data to get a better recommendation system. The better they can pick movies for you, the more you will like their service. The $1 million prize is peanuts compared to the increase in revenue a better system can bring.

    I wonder if anyone has estimated the value of the man-hours invested in this contest?

  7. So... by thatskinnyguy · · Score: 3, Funny

    > Shortly after the AOL incident, Google's Eric Schmidt called the data release 'a terrible thing,' and assured the public that 'this kind of thing could not happen at Google.'

    So...
    AOL = evil
    Netflix = evil
    Facebook = evil
    Google != evil

    Thank you, Eric, for giving us the warm and fuzzies that Google is not evil with your two cents.
    --
    The game.
  8. Medical records? by CheeseTroll · · Score: 3, Interesting

    This puts the idea of analyzing "anonymous" electronic medical records in an interesting light. Even without a name, SSN, or other ID that explicitly links a record to a specific person, could researchers cross-reference the data with other databases well enough to identify people via patterns in their health record? I'm guessing yes.

    For the record, it's not my intent to troll, but I do think it's something that future researchers will need to take into account to ensure people's privacy.
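
    The kind of cross-referencing described above can be sketched in a few lines of Python. Everything here is hypothetical: the names, the records, and the assumption that a public list (a voter roll, say) shares the fields `zip`, `birth_year`, and `sex` with the "anonymous" medical data.

```python
def link(anonymized, public, quasi_identifiers):
    """Pair each anonymized record with a name when exactly one public entry matches."""
    index = {}
    for person in public:
        key = tuple(person[q] for q in quasi_identifiers)
        index.setdefault(key, []).append(person["name"])

    matches = []
    for record in anonymized:
        key = tuple(record[q] for q in quasi_identifiers)
        names = index.get(key, [])
        if len(names) == 1:  # a unique match re-identifies the record
            matches.append((names[0], record))
    return matches

# Hypothetical public list, invented purely for illustration.
public = [
    {"name": "Alice", "zip": "14850", "birth_year": 1970, "sex": "F"},
    {"name": "Bob", "zip": "14850", "birth_year": 1982, "sex": "M"},
    {"name": "Carol", "zip": "13053", "birth_year": 1982, "sex": "F"},
    {"name": "Dana", "zip": "13053", "birth_year": 1982, "sex": "F"},
]

# Hypothetical "anonymous" medical records: no name, no SSN.
medical = [
    {"zip": "14850", "birth_year": 1970, "sex": "F", "diagnosis": "asthma"},
    {"zip": "13053", "birth_year": 1982, "sex": "F", "diagnosis": "flu"},
]

result = link(medical, public, ["zip", "birth_year", "sex"])
print([name for name, _ in result])  # ['Alice']
```

    The first record matches only one person and is re-identified; the second matches two people and stays ambiguous. The more attributes a record carries, the more often the match is unique.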

    --
    A post a day keeps productivity at bay.