Slashdot Mirror


AOL, Netflix and the End of Open Research

An anonymous reader writes "In 2006, heads rolled at AOL after the company released anonymized logs of user searches. With last week's announcement that researchers had been able to learn the identities of users in the scrubbed Netflix dataset, could the days of companies sharing data with academic researchers be numbered? Shortly after the AOL incident, Google's Eric Schmidt called the data release 'a terrible thing,' and assured the public that 'this kind of thing could not happen at Google.' Will any high tech company ever take this kind of chance again? If not, how will this impact research and the development of future technologies that could have come from the study of real data?"

85 comments

  1. Correlations by Lachryma · · Score: 5, Insightful
    The identities were learned because the users shared their movie preference information with IMDB.

    I don't see this as a problem, yet.

    1. Re:Correlations by ShieldW0lf · · Score: 0, Flamebait

      Do I care that information about me might have been released amongst all the rest? Not at all.

      Do I care that massive companies and governments get to amass all this data and not share it with the rest of us? A great deal.

      There is too much privacy. No one cares about your guilty little sexual encounters, no one cares what the doctor says is going to kill you, and there are truly evil people hiding terrible things while you concern yourself with such trivialities.

      Get over yourself. Stop fighting for secrecy and start fighting against ignorance and the hypocrisies it breeds.

      --
      -1 Uncomfortable Truth
    2. Re:Correlations by jfengel · · Score: 1

      They shared some of their movie preference information with the IMDB, but they may have intended to keep the rest of it private. Some of those private ratings have now slipped out.

      I don't know if anything really important came of it, but it's extremely illustrative: even anonymized data can become known if you can tie it in to a public data source. Movie ratings data may be unimportant, or at most slightly embarrassing ("you LIKED Ghost Dad? Ewwww!") but it could easily have been worse if the data had been important.

      It's a message about privacy: secret stuff has to stay secret, and just failing to release the names can't be counted on to be sufficient.

      It's also a warning that you yourself need to be more circumspect. Even public information can be used to deduce private stuff.

    3. Re:Correlations by moderatorrater · · Score: 1

      The only problem I could see is if they didn't get the users' permission first. If the users give their permission, then where's the problem? If not, what the hell was Netflix thinking? I understand selection bias, but that doesn't change the fact that users care about their privacy.

    4. Re:Correlations by Anonymous Coward · · Score: 0

      I'm pretty sure that my wife cares about my guilty little sexual encounters and my employer (or pretty much anyone I sign a long-term contract with) cares about what the doctor says is going to kill me. But I guess being divorced and unemployed is trivial, right?

    5. Re:Correlations by Wescotte · · Score: 1

      Find a wife who allows you to do what you want. If you're not interested in being faithful to a wife who requires it, then don't get married to her. While a company can't fire you for what you do outside the office, you can also choose to work for a company that is more open-minded.

    6. Re:Correlations by Anonymous Coward · · Score: 0

      Find a non controlling wife? Where?!

      BTW, given the sexual prowess of all you strapping lads here I'm just going to throw out that if it has tits and walks on two legs it's probably in the realm of fair game.

    7. Re:Correlations by yali · · Score: 1

      Here's what I wrote (with minor editing) in a comment to the earlier article...

      Suppose that you want to keep your political attitudes private -- for whatever reason, you decided it's nobody else's business. On IMDb, publicly linked to your real identity, you choose to only rate movies with non-political content, which you don't mind anybody knowing your opinion about. On Netflix, you believe that your ratings will be kept private, and you want to take advantage of their recommendations. So you rate all the same movies that you rated on IMDb, but you also post your ratings of politically charged movies like Fahrenheit 9/11, The Corporation, etc. With the method described in the arXiv paper, somebody could potentially link your supposedly anonymized political ratings back to your real identity and try to infer your political attitudes from your pattern of ratings.
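      The scenario above can be sketched in a few lines. This is only a toy illustration, not the algorithm from the arXiv paper: the profile names, the titles, the `tolerance` of one star and the `min_score` threshold are all made-up assumptions, and a real attack would have to cope with far noisier data.

```python
def match_score(anon_ratings, public_ratings, tolerance=1):
    """Count titles rated in both profiles whose stars agree within `tolerance`."""
    shared = anon_ratings.keys() & public_ratings.keys()
    return sum(
        1 for title in shared
        if abs(anon_ratings[title] - public_ratings[title]) <= tolerance
    )


def link(anon_profiles, public_profiles, min_score=3):
    """Map each public name to its best-matching anonymous ID.

    Returns None for a name when the best score is below `min_score`
    or when two anonymous profiles tie for first place.
    """
    links = {}
    for name, public in public_profiles.items():
        scored = sorted(
            ((match_score(anon, public), anon_id)
             for anon_id, anon in anon_profiles.items()),
            reverse=True,
        )
        best_score, best_id = scored[0]
        tied = len(scored) > 1 and scored[1][0] == best_score
        links[name] = best_id if best_score >= min_score and not tied else None
    return links


# Hypothetical "anonymized" release: IDs only, but full rating histories.
anon_profiles = {
    "user_3956": {"Ghost Dad": 4, "Alien": 5, "Heat": 3,
                  "Fahrenheit 9/11": 5, "The Corporation": 5},
    "user_1207": {"Ghost Dad": 1, "Alien": 5, "Titanic": 2},
}

# Hypothetical public profile: real name, non-political ratings only.
public_profiles = {
    "alice_on_imdb": {"Ghost Dad": 4, "Alien": 5, "Heat": 3},
}

links = link(anon_profiles, public_profiles)

# Ratings the person never published, now tied to a name.
exposed = set(anon_profiles[links["alice_on_imdb"]]) - set(public_profiles["alice_on_imdb"])
```

      The overlap on three harmless titles is what gives the game away; the two political ratings come along for free.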

    8. Re:Correlations by Anonymous Coward · · Score: 0

      Some people prefer criteria that rule out she-males.

    9. Re:Correlations by Anonymous Coward · · Score: 0

      We take what we can get.

  2. k-anonymity and l-diversity by omnirealm · · Score: 5, Informative

    There exist effective techniques that can anonymize the data in order to thwart attempts to correlate identities, while still preserving the statistical properties of the data that make it useful to researchers. They include k-anonymity and l-diversity:

    http://privacy.cs.cmu.edu/people/sweeney/kanonymity.html

    http://www.cs.cornell.edu/~dkifer/papers/ldiversity.pdf

    --
    An unjust law is no law at all. - St. Augustine
    1. Re:k-anonymity and l-diversity by stranger_to_himself · · Score: 3, Insightful

      From scanning those articles it looks as if they are just methods for defining levels of anonymity in a dataset, rather than providing any effective means of achieving it (please correct me if I'm wrong).

      I can't see how, for example, if I am planning a study of small area (ie zip code level) variation in the levels of some disease or other, while adjusting for, say, age, sex, and ethnicity, that I could do so without a dataset that included all of these items. How could you make the records less unique without throwing away the data?
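      For what it's worth, here is a rough sketch of both halves of that question: measuring k for a table, and raising it by generalizing the quasi-identifiers. The records, the field names and the coarsening rules (3-digit ZIP prefix, decade age bands) are invented for illustration and are not taken from the linked papers.

```python
from collections import Counter


def k_of(records, quasi_identifiers):
    """Return the size of the smallest group sharing the same quasi-identifier values."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(groups.values())


def generalize(record):
    """Coarsen a record: keep a 3-digit ZIP prefix and bucket age into decades."""
    out = dict(record)
    out["zip"] = record["zip"][:3] + "**"
    decade = record["age"] // 10 * 10
    out["age"] = f"{decade}-{decade + 9}"
    return out


# Hypothetical patient records; "diagnosis" is the sensitive attribute.
records = [
    {"zip": "15213", "age": 34, "sex": "F", "diagnosis": "asthma"},
    {"zip": "15217", "age": 37, "sex": "F", "diagnosis": "flu"},
    {"zip": "15213", "age": 52, "sex": "M", "diagnosis": "diabetes"},
    {"zip": "15219", "age": 58, "sex": "M", "diagnosis": "asthma"},
]
QUASI_IDENTIFIERS = ["zip", "age", "sex"]

k_before = k_of(records, QUASI_IDENTIFIERS)   # every record is unique
coarse = [generalize(r) for r in records]
k_after = k_of(coarse, QUASI_IDENTIFIERS)     # records now come in pairs
```

      Note the cost: after generalizing, the ZIP-code-level variation is gone, which is exactly the trade-off being asked about.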

      We have to accept that if we want meaningful research to happen, then some amount of data sharing and linking needs to occur. We need to rely, in medicine at least, on ethics committees to represent our best interests when it comes to striking the balance.

      It seems to me that the trend for guarding personal data like it's the family silver is a relatively modern thing. If it continues, then reliable unbiased medical research, especially disease monitoring and control, will become impossible.

    2. Re:k-anonymity and l-diversity by Julie188 · · Score: 1
      Yes, seems like there is no reason a technical solution couldn't solve the problem of balancing privacy with data sharing. There is still plenty to be learned if the data sharing were general enough. If researchers knew my age, sex, weight -- do they really need to know my name and address? At the same time, the irony is that if we all released every single detail about ourselves to researchers, the world would be fine -- it's not the researchers that are the bad guys. It is the storing of the data somewhere where the bad guys can break in and get it. Really, the grocery store knows as much about me as anyone -- sharing my grocery habits (which includes my zipcode info) with food researchers isn't the problem. Someone can hack into the store's database, no researcher required.

      Microsoft Subnet: the independent voice of Microsoft customers

    3. Re:k-anonymity and l-diversity by Anonymous Coward · · Score: 0

      Please read these papers and the paper that exploits the Netflix IDs. You will see that k-anonymity does not work in Netflix's case. Put simply: there are just too many unique rating and ID combinations. K-anonymity will either be too expensive to apply to this kind of high-dimensional, sparse data, or it will produce a huge dataset that is useless for data mining.

  3. The Impact by flaming+error · · Score: 4, Insightful

    > how will this impact research and the development
    > of future technologies that could have come from the
    > study of real data?

    It's definitely a hindrance. Kind of like not letting cops search houses without permission.

    1. Re:The Impact by mabhatter654 · · Score: 2, Informative

      But it's the company's data, not yours. Once they strip out your name and such, your privacy claims are limited. Not that people won't piece things back together using an outside database. This is what happened in the Netflix case: they were able to guess user #3956's name at ANOTHER website. They could probably keep the info off the net-at-large by only letting researchers use their equipment under NDA so not everybody has this info.

      As far as "legal searching" goes, they already do this... legally, they just pay money for private access to these databases with your name still attached! Anything tied to banking or SSN the govt already has in spades.

    2. Re:The Impact by Stradivarius · · Score: 1

      Arguably the data is not truly anonymized if a third party can reconnect your name to the data. So claiming that it's not really "your" data, simply because they did some form of obfuscation, is a bit bogus.

      What these companies really should do is just ask people when they sign up for the service "Hey, we might someday want to provide academic researchers with data on our customers' purchase habits. We will do our best to anonymize this data before providing it to the researchers, but if you've provided similar information publicly somewhere, someone may be able to compare the two sets of information to guess that they both came from the same person. Given this condition, would you like to participate in such research?" Then people can opt in or out, and if they opt in then there's little ground for complaints later.

    3. Re:The Impact by moderatorrater · · Score: 1

      So you're saying that this would all be pointless speculation if they let their users choose whether to participate or not?

    4. Re:The Impact by coaxial · · Score: 1

      You're being sarcastic, but lack of real data is a hindrance. In information retrieval, data sets with real users are hard to get. You need real user data because that's how you evaluate if your algorithm is any good at helping real people find things. People are noisy. People do dumb things. People aren't optimal! Heuristics for mimicking user behavior can work for something like this, but ultimately you have to test against real user data. Otherwise you're optimizing your system for a user that doesn't exist.

  4. Opt-in by chiasmus1 · · Score: 4, Interesting

    There are people who do not really care if their search results are added to the collection that is released. If Google had an opt-in option for data that they were going to release to academic researchers, I would opt-in. I imagine that there are other people who do not care who is looking at their searches. Something that companies might consider if they wanted to release search results is the option for the users to see what information gets released.

    1. Re:Opt-in by houstonbofh · · Score: 2, Insightful

      But how many would? There are "Chilling Effects" all over the place. For example, I don't want to share my data because it may not be deleted, (Gmail and facebook) and I don't want you to share my data because I don't know what you will do with it, (RIAA) and no one wants to approach the line because lawyers are too damn expensive. I think we need to reinstitute "Trial by Combat" as a defense. Nothing else has stopped frivolous legal shenanigans...

    2. Re:Opt-in by kcwhitta · · Score: 5, Insightful

      The problem with opt-in statistical gathering is that it can skew a sample, subtly biasing it. This would invalidate a lot of scientific research.
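      A tiny worked example of the skew, with entirely made-up numbers: assume the users who search most for a sensitive topic are also the ones least likely to opt in.

```python
def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


# Hypothetical users: (weekly searches on a sensitive topic, opted in?)
population = [
    (0, True), (1, True), (0, True), (2, True),
    (8, False), (9, False), (1, True), (10, False),
]

true_mean = mean([searches for searches, _ in population])
optin_mean = mean([searches for searches, opted_in in population if opted_in])

# How far off a researcher would be using only the opt-in sample.
bias = optin_mean - true_mean
```

      Here the opt-in sample reports well under one search per week while the full population averages nearly four, so any conclusion about "typical" behavior drawn from the volunteers alone would be far too low.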

    3. Re:Opt-in by wizardforce · · Score: 1

      That's what should happen, not exactly what could happen here. Search companies could just attempt to assume the rights to the searches, since they were, after all, done on their servers.

      --
      Sigs are too short to say anything truly profound so read the above post instead.
    4. Re:Opt-in by ZombieRoboNinja · · Score: 2, Insightful

      i.e., it might come as a surprise when researchers discover that NOBODY (who opted in) searches the internet for pornography, music torrents, Paris Hilton...

      Hell, out of Google's top 20 searches, you might get maybe 3 listed?

    5. Re:Opt-in by Red+Flayer · · Score: 1

      I think we need to reinstitute "Trial by Combat" as a defense.
      That'd better be trial-by-combat-no-proxies-allowed. What makes you think that the MPAA wouldn't be able to afford the services of Chuck Norris?
      --
      "Trolls they were, but filled with the evil will of their master: a fell race..." -- J.R.R. Tolkien on Olog-hai
    6. Re:Opt-in by KevMar · · Score: 1

      Naw, make it an opt-out that you have to update every 6 months. We will call it the do-not-release-my-info registry.

      I can see a lot of value in letting information like this be released, but there should be some rules attached to it.

      First, make it by-request instead of open-access data. People requesting access should sign privacy contracts that only allow them to publish the results as long as the results don't identify anyone.

      I don't mind the research saying "23 can be identified by name that searched for the RIAA and wet girls" instead of saying "Kevin, Joe, Bill, Smith searched for the RIAA and wet girls".

      Treat it like HIPAA data. Make the researchers follow the same rules and regulations.

      Or we just accept that the internet is not private. Just know that our names and habits are public. Just how long would it take someone to figure out I'm a tech guy and play World of Warcraft? Not long at all.

      --
      Im a gamer, not a grammer major. This post is full of spelling and grammer mistakes.
    7. Re:Opt-in by CodeBuster · · Score: 0, Offtopic

      What makes you think that the MPAA wouldn't be able to afford the services of Chuck Norris?

      Chuck Norris? Isn't he getting a bit long in the tooth? They would probably prefer someone like Chuck "The Iceman" Liddell or some other professional mixed martial arts fighter instead...

    8. Re:Opt-in by danzona · · Score: 1

      In a fight between Batman and Darth Vader, the winner would be Chuck Norris.

      http://www.chucknorrisfacts.com/

    9. Re:Opt-in by Stradivarius · · Score: 1

      It wouldn't invalidate it completely. Worst case, it means the research is only applicable to the subset of people who agree to participate, rather than the user population as a whole. It may still yield useful insight for that subset, and (if the self-selection bias isn't too bad) possibly the larger user population too. So while not perfect, the opt-in data may still be good enough for some uses.

      Another important note is that the data gathering itself is not opt-in. It's the publishing of "anonymous" versions of that data that would be opt-in. So, while the public research on the public data may potentially be biased, the entity that gathered the full private dataset can verify any research results against the private data as a sanity check.

      Realize too that for cases like our Netflix example, self-selection bias is *already* at work. Traditionally, movies are rented at a store, not online, so you've already got a different, probably not generally representative, population being studied. E.g. online populations tend to be younger and more educated than the population as a whole, even today.

    10. Re:Opt-in by moderatorrater · · Score: 1

      Then for research where it's more important to perform the research in a valid way than it is to protect the users' privacy, they can release the data to that researcher alone. For things such as improving the recommendation of movies, the bias should be okay.

    11. Re:Opt-in by porcupine8 · · Score: 1
      Considering that the majority of psychological studies are performed on college freshmen taking a Psych 101 class, the reality is that "getting an unbiased random sample" is an ideal that researchers rarely worry too much about living up to.

      Not that I'm defending this practice, but I do think that a very large sample of Google/AOL users who opted-in would actually be more generalizable than the average study.

      --
      Warning: Apple/Nintendo fangirl. Likes her electronics cute & cuddly. May be rabid.
    12. Re:Opt-in by Anonymous Coward · · Score: 0

      I say between the lawyers by default. It'd be great when all of a sudden those kickboxing classes became equivalent to a 95th percentile LSAT score at Harvard Law.

    13. Re:Opt-in by Anonymous Coward · · Score: 0

      If Google had an opt-in option for data that they were going to release to academic researchers, I would opt-in.
      Didn't Google's CEO just go on record saying they don't release this data? Maybe you meant AOL or Netflix?
    14. Re:Opt-in by stranger_to_himself · · Score: 1

      Considering that the majority of psychological studies are performed on college freshmen taking a Psych 101 class, the reality is that "getting an unbiased random sample" is an ideal that researchers rarely worry too much about living up to.

      That's not really true. It might be true for toy studies for students or pilots where instruments are being tested (I've done a few myself in that context), but all serious psychological studies spend a good deal of time trying to get their sample right.
    15. Re:Opt-in by porcupine8 · · Score: 1
      Pssh, you haven't read the studies I have, then. Let me just grab a few sitting on my desk (not counting the ones done on children) and list their participants:

      36 respondents at the U of MN, 107 students at U of MN (journal of consumer research)
      90 U of MD undergrads (j of personality and social psych)
      100 Columbia undergrads, 60 Columbia undergrads, 29 Columbia undergrads (j of personality and social psych)
      114 UCSB undergrads, 16 female UCSB grad students (applied cog psych)
      26 males with normal or corrected-to-normal vision, 20 non-video-game-players (Cognition)
      20 male U of Rochester undergrads, 32 non-video-game-players (psychological science)
      48 undergrads at U Toronto, 20 undergrads at U Toronto (psychological science)

      My not-very-random sample shows only one paper that did not specifically use undergrads for at least one of its studies, and 4 of 7 that used students for all of the studies in the paper (and none of the others were at all specific as to where their subjects were from, they just didn't use the word student). And these are mostly in decent journals, some of them are very highly cited studies. In many places (including where I'm at now for grad school), Intro to Psych students are *required* to spend X hours participating in experiments for course credit.

      --
      Warning: Apple/Nintendo fangirl. Likes her electronics cute & cuddly. May be rabid.
    16. Re:Opt-in by stranger_to_himself · · Score: 1

      Okay, I stand corrected. That's quite shocking. I do more neurological/psychiatric stuff I suppose, which is a bit more medical so the samples have to be more population representative, and our study staff spend a lot of time sampling electoral registers and randomly knocking on doors. We even worry sometimes when we see samples all recruited from the same town.

  5. Inviting drama by Rob+T+Firefly · · Score: 5, Funny

    Google's Eric Schmidt called the data release 'a terrible thing,' and assured the public that 'this kind of thing could not happen at Google.'
    Eric, you fool! Have you no concept of the world's tendency toward drama and hilarity? Loudly declaring "this kind of thing could never happen at Google" is like saying "at least it's not raining" or "it's a million-to-one chance" or some other damn fool thing that will prove you wrong nine times out of ten.
    1. Re:Inviting drama by Danny+Rathjens · · Score: 1

      He knows it's not likely to happen at his company because they are already monetizing this type of data mining research themselves in house and don't want to let anyone else do it. :)

    2. Re:Inviting drama by 3ryon · · Score: 1

      Eric, you fool! Have you no concept of the world's tendency toward drama and hilarity? Loudly declaring "this kind of thing could never happen at Google" is like saying "at least it's not raining" or "it's a million-to-one chance" or some other damn fool thing that will prove you wrong nine times out of ten.

      I'll never win the lottery, I'll never win the lottery! Do you hear me god? I'll never win the lottery!

  6. Pot calling the kettle black by UberHoser · · Score: 0

    Didn't Google get their balls twisted for outing a Chinese blogger? /golfclap

    --
    Guns are for wimps... Use a crossbow.. this way you can pin them to their chair when you go postal.
    1. Re:Pot calling the kettle black by Anonymous Coward · · Score: 0

      Didn't Google get their balls twisted for outing a Chinese blogger? /golfclap
      No.

      Seriously, is it that hard to tell the difference between Yahoo and Google? What's next?: "I heard Linus Torvalds threw a chair across the room when one of his top maintainers said he was leaving to go work at Microsoft."
  7. privacy: you can't have your cake, and eat it too by Anonymous Coward · · Score: 1, Insightful

    The final question regarding "what research opportunities will be lost" because of data privacy is pretty horrible. It is analogous to "what crime prevention successes will be sacrificed, because society was not willing to live as a collective prisoner to the state". I.e. duh, yes, you can prevent crime by locking everyone up. But there are *more important values* to be achieved by not presuming everyone guilty and locking them up ahead of time. In the same way, yeah, you could have all kinds of great research if companies abandoned any attempt at restricting the dissemination of information they have about their consumers. But again, there are things of greater value. ... It's just another form of the fact that liberty isn't free. It has a price. Those unwilling to pay that price won't get liberty.

    So in other words, shut up about your lost research opportunities. Go take a walk outside and cherish what liberty and privacy you have.

  8. research for the sake of? by BlowChunx · · Score: 4, Insightful

    I love this quote from TFA:
    "Companies do not make money by giving researchers access to data. "

    Wrong! Netflix released data to get a better recommendation system. The better they can pick movies for you, the more you will like their service. The $1million prize is peanuts compared to the increase in revenue a better system can bring.

    I wonder if anyone has estimated the value of the man hours invested in this contest?

    1. Re:research for the sake of? by Dare+nMc · · Score: 1

      Netflix released data to get a better recommendation system.

      Yes, but netflix has a very good system already. Now Tivo, Blockbuster, inteliflix, Dish, etc. have a well-researched starting point to catch up, and they have more data than the researchers, thanks to Netflix data combined with their in-house data.

      Granted, the winning solution looks way too computationally and data intensive to run on a Tivo box in real time. I guess those units with the regular phone connection could have it processed off-line and receive it later.

      What Netflix did gain was publicity, now everyone knows they have a kick ass system of recommending movies. I know I started giving more movie feedback because of the press.
    2. Re:research for the sake of? by BlowChunx · · Score: 1

      No, I don't think that Netflix had a "very good system already". I don't do pattern recognition for a living (my field is CFD), and I had a system that beat Netflix after about 1 month of reading papers and figuring out how to compute the SVD for a large sparse matrix.

      What they *really* need is a good way to filter the errors out of the data that they have. Errors in the data introduce larger errors in your predictions...
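      For readers wondering what "SVD for a large sparse matrix" looks like in this setting, below is a minimal sketch of an SVD-style latent-factor model fitted by stochastic gradient descent over only the observed ratings. It is not the poster's system and not a true SVD; the ratings, the factor count `k`, the learning rate `lr` and the regularization `reg` are all assumed values chosen for illustration.

```python
import random


def predict(P, Q, user, item):
    """Predicted rating: dot product of the user and item factor vectors."""
    return sum(pu * qi for pu, qi in zip(P[user], Q[item]))


def rmse(ratings, P, Q):
    """Root-mean-square error over the observed (user, item, stars) triples."""
    total = sum((stars - predict(P, Q, u, i)) ** 2 for u, i, stars in ratings)
    return (total / len(ratings)) ** 0.5


def factorize(ratings, n_users, n_items, k=2, lr=0.02, reg=0.02,
              epochs=300, seed=0):
    """Fit user factors P and item factors Q by SGD on observed ratings only."""
    rng = random.Random(seed)
    P = [[rng.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(n_users)]
    Q = [[rng.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(n_items)]
    history = [rmse(ratings, P, Q)]
    for _ in range(epochs):
        for u, i, stars in ratings:
            err = stars - predict(P, Q, u, i)
            for f in range(k):
                pu, qi = P[u][f], Q[i][f]
                # Gradient step with L2 regularization on both factors.
                P[u][f] += lr * (err * qi - reg * pu)
                Q[i][f] += lr * (err * pu - reg * qi)
        history.append(rmse(ratings, P, Q))
    return P, Q, history


# Hypothetical sparse ratings: (user index, movie index, stars).
ratings = [
    (0, 0, 5), (0, 1, 4), (0, 3, 1),
    (1, 0, 4), (1, 2, 2), (1, 3, 1),
    (2, 1, 1), (2, 2, 5), (2, 3, 4),
    (3, 0, 1), (3, 2, 4), (3, 3, 5),
]

P, Q, history = factorize(ratings, n_users=4, n_items=4)
```

      The missing entries are simply never visited, which is what makes this approach workable on a matrix as sparse as the Netflix one. Noisy ratings still end up baked into the factors, which is the point about filtering errors above.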

    3. Re:research for the sake of? by Anonymous Coward · · Score: 0

      Yes, but netflix has a very good system already.
      I'd say they have a passable system already. It doesn't handle limited data points well. For instance, it's constantly recommending stand-up comic shows. It's no doubt doing this because almost everyone who rented them rated them highly and since I've rated comedies highly in the past, it thinks I'd rate them highly. But it doesn't take into account that those rentals have less appeal to the people who aren't specifically interested in that genre. It's annoying enough that I've adapted my ratings to account for the behaviors of the recommendation system.

      To give another for instance, I originally rated "The Corporation" 5 stars. Suddenly my recommendations are nearly all for documentaries about how screwy America has become. It seems that it assumes that I've rated all the movies I've seen from this category 5 stars, so I'd naturally want to see all the rest of the movies from this category. The problem is that I have only so much interest in those kinds of movies and, while I'm likely to rate whatever movie I see from that category as 5-stars, I'm not likely to have any interest in renting 95% of the movies in this category.

      So now I think about what kind of recommendations are likely to result from a non-3-star rating and only give my true rating if I'm interested in how it will affect the recommendations I'm given. I shouldn't have to think this way. The recommendations should take more account of the types of movies I queue in addition to the ratings I give. If 95% of the movies in my queue are from the last year or are from a certain genre, the recommendation engine should suggest predominantly those types of movies.
  9. Responsibility and rewards. by palegray.net · · Score: 2, Interesting

    If companies don't do a thorough enough job of sanitizing statistical data before releasing it, they have to be prepared to deal with the consequences. I'm all for maintaining research access to large volumes of real-world data, but it does need to be obtained through responsible channels.

    All that said, I think an interesting question is: How can we build systems that appropriately compensate companies for access to their data, with strict enforcement of measures designed to thwart misuse of the data? Posters above have given links to research that provides frameworks for making sure data is safe for release; how would a good wrapper for such a system work to incorporate rewards for companies who participate?

  10. So... by thatskinnyguy · · Score: 3, Funny

    Shortly after the AOL incident, Google's Eric Schmidt called the data release 'a terrible thing,' and assured the public that 'this kind of thing could not happen at Google.'
    So...
    AOL = evil
    Netflix = evil
    Facebook = evil
    Goolgle != evil

    Thank you Eric for giving us the warm and fuzzies that Google is not evil with your two cents.
    --
    The game.
    1. Re:So... by palegray.net · · Score: 1

      Goolgle != evil

      You've got it all wrong... "Goolgle" is clearly evil, since whoever they are, they're obviously trying to get rich quick on typos made by the most awesome company on the planet (Google).

      Google would never stoop to such measures, another shining example of why "Google" ne "evil"

    2. Re:So... by retupmoca · · Score: 1

      "Google" ne "evil"
      But "Google" == "evil", so google is both evil and not evil at the same time.

      Aw crap, my cat just pooped in some box...
    3. Re:So... by palegray.net · · Score: 1

      My cat crapped in the litter box about half an hour ago. We'd better be careful to carefully consider the implications of this statistically correlated data and guard against its improper release and use!

    4. Re:So... by Anonymous Coward · · Score: 0

      should be
      AOL == evil
      Netflix == evil
      Facebook == evil

  11. Research by Anonymous Coward · · Score: 0

    From TFA:

    "So, what if companies require researchers to sign agreements before the firms hand over anonymized user data? Isn't that a good way to protect users, yet still enable researchers to do their thing? Unfortunately, research is rarely respected by the community when the data comes with strings."

    Of course this is how research can continue. Do you think the "anonymized" medical data of patients in medical research are posted on the internet? Obviously not. It will add more bureaucracy and likely reduce the amount of research done, but it won't spell the end of it.

  12. they have the same problem in pharmaceuticals by circletimessquare · · Score: 1

    you can't just randomly give people untested drugs, you need to try it out on rats first

    so obviously, in the future, rats will use aol and we will get human usage pattern information from that

    --
    intellectual property law is philosophically incoherent. it is your moral duty to ignore it or sabotage it
    1. Re:they have the same problem in pharmaceuticals by sm62704 · · Score: 1

      you can't just randomly give people untested drugs, you need to try it out on rats first

      You can if you're the US Military. I can tell you that from experience. Speaking of drugs...

      --
      mcgrew's razor: Never attribute to stupidity that which can be explained by greedy self-interest
    2. Re:they have the same problem in pharmaceuticals by bornwaysouth · · Score: 1

      What's with the 'in the future' bit. It seems to me that human PC users are rats.

      : Strong preference for an urban environment
      : Operate at night
      : Pink eyes
      : Sexually promiscuous
      : Tunnel vision
      : Socially organised into rat packs.
      : Cautious omnivores, who warn fellow rats of toxins
      : Proud champions of Darwin
      : Those who work in labs have white coats.
      : Worried about drain brains, etc...

      It's all going to turn to custard when humans who venture outdoors have better PC capability on their mobile phones. Those tree hugging ozone breathers will be the death knell of current software.

  13. Researchers are to blame. by Clockwurk · · Score: 1

    If researchers use it irresponsibly, then they can't be trusted with access to it. Way to ruin a good thing guys.

    1. Re:Researchers are to blame. by palegray.net · · Score: 2, Informative

      This is kinda like saying security researchers are to blame for discovering and publishing weaknesses in software. Responsible citizens just pretend everything is fine and wait for someone really bad to discover the same weaknesses and exploit them. Because it's so much easier chasing down criminals than it is to fix problems in the first place by adopting better security practices. I guess we could just arrest all researchers who publicize uncomfortable truths. What's the number to Adobe's legal department? I'm sure they still have a few district attorneys on speed dial...

    2. Re:Researchers are to blame. by Clockwurk · · Score: 1

      If researchers want better access to data, then they need to play by the rules. Netflix data was to be used to figure out a better recommendation algorithm, not to crosslink to IMDB in an attempt to expose people's identities. I would have hoped that the researchers would have had more respect for people than that.

    3. Re:Researchers are to blame. by palegray.net · · Score: 1

      I think you're missing the point of my post. It's one thing to say that researchers should be responsible and "play by the rules." If a researcher's intent is to turn a profit through abuse of statistical data, that's bad. If a researcher's aim is to expose how an unethical person would be able to turn a profit through misuse of the data, that's entirely different (i.e. the difference between security researchers and crackers). We need researchers to point out flaws in data sources, to prevent abuse of the system.

      Let me give you a parallel. Let's say CompanyX makes an operating system, and discovers that it has a critical flaw that allows remote privilege escalation. CompanyX decides to keep that information private, and takes their sweet time developing and releasing a patch for the issue. Meanwhile, system administrators aren't even aware that there's an issue, and some unethical piece of shit cracker releases a rootkit that exploits the vulnerability. If the sysadmins were aware of the problem, they could have at least taken measures to protect the weak point (turn off the service, etc) until a suitable patch was available. Instead their boxes are p0wned by some script kiddie in Eastern Europe (no offense to anyone of Eastern European descent, just happens that most crack attempts on my servers have been originating from there recently).

      We need companies who provide volumes of statistical data to researchers to take strong measures to prevent misuse of the data, and we need researchers focused on analyzing statistical correlations between disparate data sets to continue to publicize weaknesses in scrubbing algorithms that could be used to "broaden the scope" of analysis in improper ways.

  14. Medical records? by CheeseTroll · · Score: 3, Interesting

    This puts the idea of analyzing "anonymous" electronic medical records in an interesting light. Even without a name, SSN, or other ID that explicitly links a record to a specific person, could researchers cross-reference the data with other databases well enough to identify people via patterns in their health record? I'm guessing yes.
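
    For what it's worth, here is a rough Python sketch of the kind of cross-referencing I'm guessing at. Every record, field name, and value below is made up for illustration; the only idea shown is joining a "scrubbed" table against a public one on a few shared attributes (zip, birth year, sex) and keeping the cases where exactly one named person fits.

```python
from collections import defaultdict

# Hypothetical "anonymous" health records: no name, no SSN.
health = [
    {"zip": "12345", "birth_year": 1950, "sex": "F", "diagnosis": "asthma"},
    {"zip": "12345", "birth_year": 1971, "sex": "M", "diagnosis": "diabetes"},
    {"zip": "12399", "birth_year": 1971, "sex": "M", "diagnosis": "flu"},
]

# Hypothetical public list that does carry names.
public = [
    {"name": "Alice Example", "zip": "12345", "birth_year": 1950, "sex": "F"},
    {"name": "Bob Example", "zip": "12345", "birth_year": 1971, "sex": "M"},
    {"name": "Carl Example", "zip": "12399", "birth_year": 1971, "sex": "M"},
    {"name": "Dave Example", "zip": "12399", "birth_year": 1971, "sex": "M"},
]

# Attributes present in both tables (an assumption of this sketch).
QUASI_IDS = ("zip", "birth_year", "sex")


def key(record):
    """Build the join key from the shared attributes."""
    return tuple(record[field] for field in QUASI_IDS)


def link(anon_rows, named_rows):
    """Return {name: diagnosis} for anonymous rows matching exactly one name."""
    names_by_key = defaultdict(list)
    for row in named_rows:
        names_by_key[key(row)].append(row["name"])

    matches = {}
    for row in anon_rows:
        candidates = names_by_key.get(key(row), [])
        if len(candidates) == 1:  # an ambiguous match stays anonymous
            matches[candidates[0]] = row["diagnosis"]
    return matches


reidentified = link(health, public)
```

    In this toy data two of the three records resolve to a single name, while the third hides behind two people who share all three attributes. The more attributes a record carries, the fewer people are left to hide behind.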

    For the record, it's not my intent to troll, but I do think it's something that future researchers will need to take into account to ensure people's privacy.

    --
    A post a day keeps productivity at bay.
    1. Re:Medical records? by guruevi · · Score: 1

      The problem is not cross-referencing patterns, because neither machines nor humans can detect patterns well enough to match up two datasets that are related only by patterns, especially among a large population. So technically, to be totally perfect, you should release the anonymized data of EVERYBODY and filter out cases that are truly unique (can be done simply with keywords).

      The main issue with AOL's and Netflix is that they released data that was self-referencing the user using substitution to replace names with uid's. This way, you can group the data by uid and then look for unique name/words in the set, then match it against an existing database of known AOL/Netflix users to look for matches. If AOL/Netflix would've done the same thing and replaced those unique words/names with a generic word, the researchers would've had much more trouble matching up the users. The same is true for phone numbers, addresses, bank cards and zip codes. Heck even a few simple perl scripts could've done that.
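
      A minimal sketch of that kind of script, in Python rather than perl. The patterns and the placeholder tokens are my own assumptions, deliberately simple and US-centric, and they only catch identifiers with a recognizable shape (card numbers, phone numbers, zip codes). Names and other unique words would need a dictionary or keyword list on top of this.

```python
import re

# (pattern, replacement) pairs, applied in order. Longer digit runs go first
# so a card number is not partly eaten by the zip code pattern.
PATTERNS = [
    (re.compile(r"\b(?:\d{4}[ -]?){3}\d{4}\b"), "<CARD>"),
    (re.compile(r"\b\d{3}[-. ]\d{3}[-. ]\d{4}\b"), "<PHONE>"),
    (re.compile(r"\b\d{5}(?:-\d{4})?\b"), "<ZIP>"),
]


def scrub(text):
    """Replace pattern-shaped identifiers with generic tokens."""
    for pattern, token in PATTERNS:
        text = pattern.sub(token, text)
    return text


query = scrub("pizza near 12345 call 555-123-4567")
card = scrub("card 4111 1111 1111 1111")
```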

      --
      Custom electronics and digital signage for your business: www.evcircuits.com
    2. Re:Medical records? by Anonymous Coward · · Score: 0

      That's actually a tricky problem. If you have somebody in your dataset who has some unique characteristic (e.g. 108 years old, 8'2" tall, etc.), it can be very easy to determine who they are with no other data.

      Generally, though, there are rules about zip codes and populations, and how big a population has to be before you can release the data. For example, if a zipcode 12345 has a population of 50,000, you can release data about it. But if it's got a population of 100, you might only be able to indicate that the zip code is in a range 12300-12399.
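
      As a sketch only (Python, with a threshold I picked for illustration rather than one taken from any particular regulation), the zip code rule looks something like this:

```python
# Assumed cutoff for this example; the real number depends on the rules
# that apply to the dataset being released.
MIN_POPULATION = 20000


def generalize_zip(zip_code, population_by_zip, min_population=MIN_POPULATION):
    """Release the full zip only if enough people live there.

    Otherwise fall back to the range covered by its first three digits.
    Zip codes missing from the population table are treated as too small.
    """
    if population_by_zip.get(zip_code, 0) >= min_population:
        return zip_code
    prefix = zip_code[:3]
    return f"{prefix}00-{prefix}99"


# Hypothetical population table.
populations = {"12345": 50000, "12346": 100}

big = generalize_zip("12345", populations)
small = generalize_zip("12346", populations)
```

      Here `big` stays as the full zip code and `small` is widened to the 12300-12399 range.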

      dom

    3. Re:Medical records? by vrmlguy · · Score: 1

      "If AOL/Netflix would've done the same thing and replaced those unique words/names with a generic word, the researchers would've had much more trouble matching up the users."

      The problem is, if you replace words with other words, you're destroying the semantic meaning of the text, and IIRC the winning algorithm used that semantic meaning to assign scores. Specifically, they looked for common words in movie titles; if you rent several movies with "pirate" in the title, it's reasonable that you might be interested in other movies with "pirate" in the title. Now, you could build a hash-table that consistently replaced words with meaningless strings (i.e., "pirate" becomes "nhy6mju7ki8") but that (a) destroys your ability to compare "pirate" with "pirates", and (b) just adds a step to the de-anonymizing process (i.e. building a reverse hash-table based on word frequencies).
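
      To make (a) and (b) concrete, here is a small Python sketch. The key, the titles, and the helper names are all hypothetical, and I'm using a keyed hash from the standard library in place of a literal hash-table. The "attack" is nothing more than lining up tokens by frequency rank against a public list of titles.

```python
import hashlib
import hmac
from collections import Counter

SECRET = b"hypothetical-secret-key"  # known only to whoever releases the data


def pseudonymize(word):
    """Consistently map a word to a meaningless 11-character string."""
    digest = hmac.new(SECRET, word.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:11]


def frequency_ranks(token_lists):
    """Return the distinct tokens, most frequent first."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    return [tok for tok, _ in counts.most_common()]


# What gets released: titles with every word replaced by its pseudonym.
released_titles = ["pirate ship king", "pirate ship", "pirate"]
released = [[pseudonymize(w) for w in t.split()] for t in released_titles]

# What the attacker already has: a public title list with similar word usage.
public_titles = ["pirate", "ship pirate", "king pirate ship"]
reference = [t.split() for t in public_titles]

# (b) The reverse table: the n-th most common pseudonym is guessed to be
# the n-th most common public word.
guess = dict(zip(frequency_ranks(released), frequency_ranks(reference)))
```

      On this toy data the guess table recovers every word, while `pseudonymize("pirate")` and `pseudonymize("pirates")` share nothing, which is point (a). Real data has ties and noise, so the attacker would need more than raw rank order, but the extra step is about this small.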
      --
      Nothing for 6-digit uids?
    4. Re:Medical records? by copdk4 · · Score: 1
      Unlike the data from for-profit companies (AOL, Google etc), medical records are not necessarily "owned" by hospitals. In the US, the data is owned by the PATIENTs versus in the UK it belongs to the NHS Reference

      So at least in the US, hospitals can get into legal trouble for even disclosing an anonymized dataset without consent from each and every patient (although several hospitals make patients sign a form waiving their rights to ownership)

  15. It has to be said. by palegray.net · · Score: 1, Offtopic

    Why depend on Fortune 500 companies to provide large volumes of data to researchers? They provide data comprised of alphanumeric character sequences, punctuation, etc, right? There's a better way that provides that plus a more complete representation of the entire character set! Every UNIX-based machine comes with a built in data generator: /dev/random

    (depending on your machine, your mileage may vary with the quality of the data).

    1. Re:It has to be said. by DerekLyons · · Score: 1

      That's great - if your goal is to analyze the statistical properties of the RNG. It kinda sucks if your goal is to conduct research or marketing in the real world.

  16. locking all people up doesn't prevent all crime by sethawoolley · · Score: 2, Insightful

    You're probably just trolling, but in case you aren't, seeing the rampant crime that is institutionalized in modern prisons, I think your argument falls flat on its face.

    Liberty doesn't have security as its price. Liberty and Security are often correlated, not directly correlated inversely as you assume.

    As more people are free to do things that don't infringe on others' security, security often goes up as the people who would be breaking security systems for their own benefit have plenty of other "acceptable" ways to reap goods, with much fewer risks to boot.

  17. ISP's already sell all your web browsing logs by Mal+Reynolds · · Score: 2, Informative

    This is just the tip of the iceberg. If you live in the US, it's likely that logs of all your web activity are being sold to clickstream companies. The data logs being sold by the ISPs seem to use the exact same sort of inadequate anonymity practices as were used by AOL.

    The problem is that no matter how well the data is cloaked, a user's browsing habits can easily make the anonymity worthless. As has been seen in the case of NetFlix and AOL, it's easy to figure out who a person is by simply looking at anonymized logs. A single visit to a social networking site is often enough to make a good guess. But when a specific anonymized IP address visits the same page of social networking sites, or edits their profile at a social networking site, or reviews an item at a vendor site, the real identity of that "anonymized" IP address is completely confirmed.

    Simply cloaking an IP address will never provide anonymity. But the companies that purchase your web surfing logs would have no use for logs that weren't attached to a single user. Unless the ISPs were to keep track of and filter out every single vendor site which revealed a user's real name, there would seem to be no safe way to anonymize user logs. Since there are countless numbers of web forums, vendors, and social networking sites, it would seem technically impossible to truly provide any safe level of anonymity for user logs. Selling these logs is just a bad practice that needs to be stopped.

    I can only wonder why the EFF and other organizations haven't made a bigger deal about this. These ISPs are selling all of their users' web logs. I cannot imagine any effective way the ISPs could ever anonymize this data. More info: http://wanderingstan.com/2007-03-19/is_comcast_selling_your_clickstream_audio_transcript http://arstechnica.com/news.ars/post/20070315-your-isp-may-be-selling-your-web-clicks.html

  18. Private Research != Privacy by Anonymous Coward · · Score: 0

    It has always been the claim that aggregate data is shared, but "no personally identifying information" is released. When correlations like these are made, "personally identifying information" is released in an indirect way. These unintentional leaks have proven that aggregated data can be used to weaken or remove one's privacy, something staunch privacy advocates have voiced for years. Doing such research in private doesn't change the data that's being shared, it only keeps it in the hands of organizations that have paid for it. Those organizations could then turn around and attempt to perform analysis to identify individual users, just as the open researchers have, and no one would be the wiser. I see this as analogous to open source software and the "many eyes" approach to software security, except one might argue that private aggregate data research is worse because at least with closed-source software third-parties generally aren't able to purchase the source code. We are simply talking about a different type of data, and keeping the aggregate data private will only decrease the privacy of the users. Based on his view of aggregate user data, I'm surprised Google's Eric Schmidt isn't a proponent of the closed-source software model.

    I think responsible companies should institute opt-in policies, as some have mentioned, and give researchers open access to such data indefinitely. Once methods of providing true non-identifying aggregate data are available, companies could resume their opt-out policies. Going forward, the open researchers could serve as stewards of the data and alert the companies and the users to possible privacy degrading data sets. The whole process could be modeled after software vulnerability reporting practices to give companies a chance to release a new data set before user information is exposed for all to see.

    A final word on the matter -- if aggregate data can't be used to identify users, then companies like Google should have no problem releasing that data for all to see.

  19. Quoth the unbeatable Pratchett by pcgabe · · Score: 1

    Colon: "So it'd only work if it's your actual million-to-one chance."
    Nobby: "I suppose that's right."
    Colon: "So 999,943-to-one, for example--"
    Carrot: "Wouldn't have a hope. No-one ever said 'It's a 999,943-to-one chance but it might just work.'"

    --
    Don't put advice in your sig.
  20. This could happen to Google, of course by Anonymous Coward · · Score: 0
    While working for Google I did a search on my name in Google's intranet (called MOMA). I was surprised to find several datasets (and multiple copies of some of them) including my name on them. They corresponded to user requests to delete USENET messages. Years before I had sent several of these requests to Google in order to have several USENET messages removed from Google Groups. It seems that not only do they save all that information, but they use it for some research purposes with extremely relaxed confidentiality regulations. They were just plain text files with thousands of entries. I could have just copied them and put them on the Internet with no problem before leaving the company.

    So I don't believe Eric Schmidt. I cannot see how Google can prevent this sort of thing when they have all kinds of user information freely lying around in their intranet.

  21. I don't know what you're talking about by Anonymous Coward · · Score: 0

    100% of people in my opt-in survey said they'd opt-in.

  22. Keeping it anonymous is effectively impossible by Mal+Reynolds · · Score: 1

    No, I don't think there is a technically feasible way to retain anonymity while providing the type of data wanted by researchers and clickstream corporations.

    The reason is because the researchers and clickstream companies don't just want the raw data of what is occurring on a given network. They want to be able to track individual web browsing habits of particular users. They don't need to know who "user 123" is, but they need to be able to differentiate "user 123's" web browsing habits from "user 999's".

    The ISPs deliver this so-called anonymity by replacing a user's IP address with a random code. But this sort of IP replacement provides only a facade of anonymity because the code stays the same for all of any given user's web searches. And many typical web surfing habits can easily reveal the real name of the 'anonymized' user. Doing so gives anyone with this 'anonymized' data the real name and real web browsing habits of most any person in the data.

    For instance, when 'anonymized' user-123 visits his or her home page at a social networking site, they typically log in. A search through the 'anonymized' data for such log-in strings could immediately identify the real name of the 'anonymized' user. The same could happen when user-123 reviews a movie at Amazon or writes a post in any web forum. Even if user-123 used pseudonyms everywhere on the internet, his or her identity could be obtained in other ways. If user-123 were to search for a variety of local services, restaurants, shops, or services, a social engineer could probably work out their real identity. Simply using an online mapping service to get directions from one's house would remove the anonymity and link a real name to all the browsing history of that 'anonymized' user.
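
    Here is a rough Python sketch of the first case. The log format, the site names, and the URL pattern are all invented for illustration; I have not seen what the real logs look like. The point is only that one identifying URL is enough to put a name on every other line carrying the same code.

```python
import re
from collections import defaultdict

# Hypothetical log entries: (code standing in for the IP, requested URL).
log = [
    ("user-123", "http://social.example/profile/jane.doe/edit"),
    ("user-123", "http://maps.example/directions?from=12+Elm+St"),
    ("user-999", "http://news.example/front-page"),
    ("user-123", "http://shop.example/search?q=umbrella"),
]

# Assumed shape of a "user edits their own profile" URL on a made-up site.
PROFILE_EDIT = re.compile(r"^http://social\.example/profile/([^/]+)/edit$")


def unmask(entries):
    """Return {real_name: full browsing history} for every code that leaks a name."""
    names = {}
    history = defaultdict(list)
    for code, url in entries:
        history[code].append(url)
        match = PROFILE_EDIT.match(url)
        if match:
            names[code] = match.group(1)
    return {names[code]: urls for code, urls in history.items() if code in names}


exposed = unmask(log)
```

    In this toy log, user-123's whole history, including the map directions, ends up filed under the profile name, while user-999 never hits an identifying URL and stays a code.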

    Few internet users consider that companies are analyzing every single move they make on the internet. But US based ISP's routinely sell all of this 'anonymized' data to a variety of Clickstream companies.

    Yes, I suppose the ISPs could try to screen out information from social networking sites. But could they remove all reference to all sites with web forums? Could they filter all sites where users write product reviews? There are so many of those sorts of sites and they change so frequently that filtering all that content from the 'anonymized' logs would seem completely unfeasible.

    Those types of sites often make up the majority of many users' browsing habits. So if visits to 'identifier' sites were removed from the data, the minimal remaining data would probably be of little use to the clickstream companies and researchers.

    The fact is that users of this data are really analyzing the web browsing habits of specific, individual users. Because of this, I cannot think of any feasible method to keep the data useful for clickstream companies and researchers while guaranteeing any real level of user anonymity. Your ISP is probably selling your web browsing logs today, and this data is so poorly anonymized that anyone with the data could probably figure out exactly who you are.

  23. Open Marketing Initiatives by nektra · · Score: 1

    People can agree to share information publicly, like movie or product rankings. This decision will drive down the price of costly marketing studies and will democratize insightful information.
    To balance the protection and sharing of information, more complex social network infrastructure is required; maybe projects like OpenQabal can help.

  24. Well Gee Wally by bratwiz · · Score: 1



    Well Gee Wally, they share our data with everybody damned else.

  25. Re:privacy: you can't have your cake, and eat it t by mojotoad · · Score: 1

    Nodal networks are interesting things. There's research to be had there, regardless of what a 'node' is. This article is about cleansing real world data in such a way that the 'nodes' can be used for such research regardless of nodal identity. So, yes, real and interesting anonymous data can be gleaned. But so can meta-data associated with a 'node'.

    Just hope that you don't become too AdNoid while your AdNodes are tonsured.

    Cheers,
    Matt

  26. Of course by Anonymous Coward · · Score: 0

    "Even without a name, SSN, or other ID that explicitly links a record to a specific person, could researchers cross-reference the data with other databases well enough to identify people via patterns in their health record? I'm guessing yes."

    That's why even "anonymous" medical records are still confidential. In medical research (which I do), names and other obvious identifiers are left out when patient records are extracted into databases, but the intent is not to make it impossible to identify people. It is just a matter of only giving access to the confidential information that is needed for the task at hand.

    I don't think the issue is that "anonymization" isn't anonymous enough - it is a matter of maintaining confidentiality.

  27. throw in the feds by conspirator57 · · Score: 1

    So how does this acceptance of the professional responsibility of researchers change when one acknowledges that at any moment homeland security or the like can issue a national security letter to obtain access to the dataset? They could use it to identify potential troublemakers, and moreover to uncover people's secrets to blackmail them with. Or they could uncover minor crimes and selectively enforce the laws on people they suspect of whatever they aren't able to prove. Worse, they could employ these tactics on political enemies. Imagine a McCarthy with access to such a cheap wealth of actionable information.

    --
    "If still these truths be held to be
    Self evident."
    -Edna St. Vincent Millay
    1. Re:throw in the feds by stranger_to_himself · · Score: 1

      I understand these arguments, and they're important, but I don't think the answer lies in trying to hide things. The powerful guys will always find a way to get what they want, and it's really just the smaller people with legitimate uses going through the correct procedures that you'll hinder.

      People can't blackmail you with public domain information. And if the data on minor crimes is available on everybody, then you can point to the selective prosecution. Information is a powerful tool in the hands of both good and bad people.

  28. Anonymizing personal data == DRM by knorthern+knight · · Score: 1
    We have a major problem...
    1. Music-DRM protects the RIAA's data and tries to prevent end-users from deriving an unprotected version of the data in the file.

      Personal-data-DRM (anonymization) protects the ISP's or hospital's data and tries to prevent end-users (researchers) from deriving an unprotected (unanonymized) version of the data in the file.

    2. Music-DRM makes things more difficult for legitimate customers who legally purchased the data files (music).

      Personal-data-DRM (anonymization) makes things more difficult for legitimate researchers who legally obtained the data files.

    3. Music-DRM has always been defeated.

      Personal-data-DRM (anonymization) will always be defeated.
    We can't have our cake and eat it too. You either have usable data for researchers, or else you have privacy for people. It also occurs to me that data-mining technologies can be applied to breaking music-DRM in the same manner as they can be applied to breaking personal-data-DRM.
    --

    I'm not repeating myself
    I'm an X window user; I'm an ex-Windows user
  29. clumsy by conspirator57 · · Score: 1

    powerful weapons, particularly early in their history are invariably clumsy and prone to lots of collateral damage.

    i'm not sure society would accept the cost of that damage in exchange for the benefit. Even if you claimed it would only be for a transition period. Heck, i'm not sure you could convince me any such transition period would ever end.

    --
    "If still these truths be held to be
    Self evident."
    -Edna St. Vincent Millay