AOL, Netflix and the End of Open Research
An anonymous reader writes "In 2006, heads rolled at AOL after the company released anonymized logs of user searches. With last week's announcement that researchers had been able to learn the identities of users in the scrubbed Netflix dataset, could the days of companies sharing data with academic researchers be numbered? Shortly after the AOL incident, Google's Eric Schmidt called the data release 'a terrible thing,' and assured the public that 'this kind of thing could not happen at Google.' Will any high tech company ever take this kind of chance again? If not, how will this impact research and the development of future technologies that could have come from the study of real data?"
I don't see this as a problem, yet.
There exist effective techniques that can anonymize the data in order to thwart attempts to correlate identities, while still preserving the statistical properties of the data that make it useful to researchers. They include k-anonymity and l-diversity:
http://privacy.cs.cmu.edu/people/sweeney/kanonymity.html
http://www.cs.cornell.edu/~dkifer/papers/ldiversity.pdf
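To make those two terms concrete, here is a minimal sketch of how one might measure them on a table that has already been generalized. The field names (`zip`, `age`, `rental`) and the sample rows are made up for illustration, and the l-diversity check is only the simplest "distinct values" variant, not the stronger definitions in the paper.

```python
from collections import defaultdict

def group_by_quasi_identifiers(records, quasi_ids):
    """Bucket records that share the same quasi-identifier values."""
    groups = defaultdict(list)
    for rec in records:
        key = tuple(rec[q] for q in quasi_ids)
        groups[key].append(rec)
    return groups

def k_anonymity(records, quasi_ids):
    """Largest k such that every record is indistinguishable from k-1 others."""
    groups = group_by_quasi_identifiers(records, quasi_ids)
    return min(len(g) for g in groups.values())

def l_diversity(records, quasi_ids, sensitive):
    """Smallest number of distinct sensitive values found in any group."""
    groups = group_by_quasi_identifiers(records, quasi_ids)
    return min(len({r[sensitive] for r in g}) for g in groups.values())

# Hypothetical, already-generalized rows (ZIP truncated, age bucketed).
records = [
    {"zip": "021**", "age": "20-29", "rental": "Drama"},
    {"zip": "021**", "age": "20-29", "rental": "Comedy"},
    {"zip": "021**", "age": "30-39", "rental": "Horror"},
    {"zip": "021**", "age": "30-39", "rental": "Horror"},
]

k = k_anonymity(records, ["zip", "age"])
l = l_diversity(records, ["zip", "age"], "rental")
print(k, l)
```

In this toy table every group has two rows, so it is 2-anonymous, but the second group has only one distinct sensitive value. Knowing someone is in that group tells you their rental, which is the gap l-diversity is meant to close.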
An unjust law is no law at all. - St. Augustine
> how will this impact research and the development
> of future technologies that could have come from the
> study of real data?
It's definitely a hindrance. Kind of like not letting cops search houses without permission.
There are people who do not really care if their searches are added to a collection that is released. If Google had an opt-in option for data that it was going to release to academic researchers, I would opt in. I imagine that there are other people who do not care who is looking at their searches. Something companies might consider, if they want to release search data, is giving users the option to see what information gets released.
Slashdot Burying Stories About Slashdot Media Owned
Didn't Google get their balls twisted for outing a Chinese blogger? /golfclap
Guns are for wimps... Use a crossbow.. this way you can pin them to their chair when you go postal.
The final question regarding "what research opportunities will be lost" because of data privacy is pretty horrible. It is analogous to "what crime prevention successes will be sacrificed, because society was not willing to live as a collective prisoner to the state". I.e. duh- yes, you can prevent crime by locking everyone up. But there are *more important values* to be achieved by not presuming everyone guilty and locking them up ahead of time. In the same way, yeah, you could have all kinds of great research if companies abandoned any attempt at restricting the dissemination of information they have about their consumers. But again, there are things of greater value. ... It's just another form of the fact that liberty isn't free. It has a price. Those unwilling to pay that price won't get liberty.
So in other words, shut up about your lost research opportunities. Go take a walk outside and cherish what liberty and privacy you have.
I love this quote from TFA:
"Companies do not make money by giving researchers access to data. "
Wrong! Netflix released data to get a better recommendation system. The better they can pick movies for you, the more you will like their service. The $1 million prize is peanuts compared to the increase in revenue a better system can bring.
I wonder if anyone has estimated the value of the man hours invested in this contest?
If companies don't do a thorough enough job of sanitizing statistical data before releasing it, they have to be prepared to deal with the consequences. I'm all for maintaining research access to large volumes of real-world data, but it does need to be obtained through responsible channels.
All that said, I think an interesting question is: How can we build systems that appropriately compensate companies for access to their data, with strict enforcement of measures designed to thwart misuse of the data? Posters above have given links to research that provides frameworks for making sure data is safe for release; how would a good wrapper for such a system work to incorporate rewards for companies who participate?
512 MB RAM, 20 GB disk, 200 GB transfer, five datacenters. $19.95/month.
AOL = evil
Netflix = evil
Facebook = evil
Google != evil
Thank you Eric for giving us the warm and fuzzies that Google is not evil with your two cents.
The game.
From TFA:
"So, what if companies require researchers to sign agreements before the firms hand over anonymized user data? Isn't that a good way to protect users, yet still enable researchers to do their thing? Unfortunately, research is rarely respected by the community when the data comes with strings."
Of course this is how research can continue. Do you think the "anonymized" medical data of patients in medical research is posted on the internet? Obviously not. It will add more bureaucracy and likely reduce the amount of research done, but it won't spell the end of it.
you can't just randomly give people untested drugs, you need to try it out on rats first
so obviously, in the future, rats will use aol and we will get human usage pattern information from that
intellectual property law is philosophically incoherent. it is your moral duty to ignore it or sabotage it
If researchers use it irresponsibly, then they can't be trusted with access to it. Way to ruin a good thing guys.
This puts the idea of analyzing "anonymous" electronic medical records in an interesting light. Even without a name, SSN, or other ID that explicitly links a record to a specific person, could researchers cross-reference the data with other databases well enough to identify people via patterns in their health record? I'm guessing yes.
For the record, it's not my intent to troll, but I do think it's something that future researchers will need to take into account to ensure people's privacy.
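A rough sketch of the kind of cross-referencing I mean, assuming the "anonymous" records still carry ZIP code, birth date, and sex, and that some public list (a voter roll, say) has the same three fields next to a name. All names, fields, and rows below are invented for illustration.

```python
def link_records(anonymized, public, keys):
    """Match nameless records to named ones on shared quasi-identifiers.

    Returns {index of anonymized record: name} for unambiguous matches only.
    """
    # Index the public list by the quasi-identifier tuple.
    index = {}
    for person in public:
        index.setdefault(tuple(person[k] for k in keys), []).append(person["name"])

    matches = {}
    for i, rec in enumerate(anonymized):
        candidates = index.get(tuple(rec[k] for k in keys), [])
        if len(candidates) == 1:  # exactly one person fits: re-identified
            matches[i] = candidates[0]
    return matches

# Hypothetical health records with names and SSNs stripped.
health = [
    {"zip": "02138", "birth": "1961-07-04", "sex": "F", "diagnosis": "asthma"},
    {"zip": "02139", "birth": "1975-01-15", "sex": "M", "diagnosis": "diabetes"},
]

# Hypothetical public list that includes names.
voters = [
    {"name": "A. Example", "zip": "02138", "birth": "1961-07-04", "sex": "F"},
    {"name": "B. Sample", "zip": "02139", "birth": "1975-01-15", "sex": "M"},
    {"name": "C. Sample", "zip": "02139", "birth": "1975-01-15", "sex": "M"},
]

matches = link_records(health, voters, ["zip", "birth", "sex"])
print(matches)
```

The first record matches exactly one named person, so the diagnosis is no longer anonymous. The second stays ambiguous only because two people happen to share the same three values.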
A post a day keeps productivity at bay.
Why depend on Fortune 500 companies to provide large volumes of data to researchers? They provide data comprised of alphanumeric character sequences, punctuation, etc, right? There's a better way that provides that plus a more complete representation of the entire character set! Every UNIX-based machine comes with a built in data generator: /dev/random
(depending on your machine, your mileage may vary with the quality of the data).
512 MB RAM, 20 GB disk, 200 GB transfer, five datacenters. $19.95/month.
You're probably just trolling, but in case you aren't, seeing the rampant crime that is institutionalized in modern prisons, I think your argument falls flat on its face.
Liberty doesn't have security as its price. Liberty and security are often positively correlated, not inversely correlated as you assume.
As more people are free to do things that don't infringe on others' security, security often goes up, because the people who would otherwise be breaking security systems for their own benefit have plenty of other "acceptable" ways to reap goods, with far fewer risks to boot.
This is just the tip of the iceberg. If you live in the US, it's likely that logs of all your web activity are being sold to clickstream companies. The data logs being sold by the ISPs seem to use the exact same sort of inadequate anonymity practices as were used by AOL.
The problem is that no matter how well the data is cloaked, a user's browsing habits can easily make the anonymity worthless. As has been seen in the cases of Netflix and AOL, it's easy to figure out who a person is by simply looking at anonymized logs. A single visit to a social networking site is often enough to make a good guess. But when a specific anonymized IP address keeps visiting the same pages of social networking sites, edits a profile at a social networking site, or reviews an item at a vendor site, the real identity of that "anonymized" IP address is completely confirmed.
Simply cloaking an IP address will never provide anonymity. But the companies that purchase your web surfing logs would have no use for logs that weren't attached to a single user. Unless the ISPs were to keep track of and filter out every single vendor site which revealed a user's real name, there would seem to be no safe way to anonymize user logs. Since there are countless numbers of web forums, vendors, and social networking sites, it would seem technically impossible to truly provide any safe level of anonymity for user logs. Selling these logs is just a bad practice that needs to be stopped.
I can only wonder why the EFF and other organizations haven't made a bigger deal about this. These ISPs are selling all of their users' web logs. I cannot imagine any effective way the ISPs could ever anonymize this data. More info: http://wanderingstan.com/2007-03-19/is_comcast_selling_your_clickstream_audio_transcript http://arstechnica.com/news.ars/post/20070315-your-isp-may-be-selling-your-web-clicks.html
It has always been the claim that aggregate data is shared, but "no personally identifying information" is released. When correlations like these are made, "personally identifying information" is released in an indirect way. These unintentional leaks have proven that aggregated data can be used to weaken or remove one's privacy, something staunch privacy advocates have voiced for years. Doing such research in private doesn't change the data that's being shared, it only keeps it in the hands of organizations that have paid for it. Those organizations could then turn around and attempt to perform analysis to identify individual users, just as the open researchers have, and no one would be the wiser. I see this as analogous to open source software and the "many eyes" approach to software security, except one might argue that private aggregate data research is worse because at least with closed-source software third-parties generally aren't able to purchase the source code. We are simply talking about a different type of data, and keeping the aggregate data private will only decrease the privacy of the users. Based on his view of aggregate user data, I'm surprised Google's Eric Schmidt isn't a proponent of the closed-source software model.
I think responsible companies should institute opt-in policies, as some have mentioned, and give researchers open access to such data indefinitely. Once methods of providing true non-identifying aggregate data are available, companies could resume their opt-out policies. Going forward, the open researchers could serve as stewards of the data and alert the companies and the users to possible privacy degrading data sets. The whole process could be modeled after software vulnerability reporting practices to give companies a chance to release a new data set before user information is exposed for all to see.
A final word on the matter -- if aggregate data can't be used to identify users, then companies like Google should have no problem releasing that data for all to see.
Colon: "So it'd only work if it's your actual million-to-one chance."
Nobby: "I suppose that's right."
Colon: "So 999,943-to-one, for example--"
Carrot: "Wouldn't have a hope. No-one ever said 'It's a 999,943-to-one chance but it might just work.'"
Don't put advice in your sig.
So I don't believe Eric Schmidt. I cannot see how Google can prevent this sort of thing when they have all kinds of user information freely lying around in their intranet.
100% of people in my opt-in survey said they'd opt-in.
No, I don't think there is a technically feasible way to retain anonymity while providing the type of data wanted by researchers and clickstream corporations.
The reason is that the researchers and clickstream companies don't just want the raw data of what is occurring on a given network. They want to be able to track individual web browsing habits of particular users. They don't need to know who "user 123" is, but they need to be able to differentiate "user 123's" web browsing habits from "user 999's".
The ISPs deliver this so-called anonymity by replacing a user's IP address with a random code. But this sort of IP replacement provides only a facade of anonymity because the code stays the same for all of any given user's web searches. And many typical web surfing habits can easily reveal the real name of the 'anonymized' user. That gives anyone with this 'anonymized' data the real name and real web browsing habits of almost any person in the data.
For instance, when 'anonymized' user-123 visits his or her home page at a social networking site, they typically log in. A search through the 'anonymized' data for such log-in strings could immediately identify the real name of the 'anonymized' user. The same could happen when user-123 reviews a movie at Amazon or writes a post in any web forum. Even if user-123 used pseudonyms everywhere on the internet, his or her identity could be obtained in other ways. If user-123 were to search for a variety of local services, restaurants, or shops, a social engineer could probably work out their real identity. Simply using an online mapping service to get directions from one's house would remove the anonymity and link a real name to all the browsing history of that 'anonymized' user.
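Here is a toy sketch of that failure mode. I have no idea how any ISP actually does the replacement, so `pseudonymize` is just my assumption of "swap each IP for a random code that stays fixed per user", and the URLs and the profile-page pattern in `name_from_profile_url` are invented.

```python
import secrets

def pseudonymize(log):
    """Replace each IP with a random token that is stable for that IP."""
    mapping = {}
    out = []
    for ip, url in log:
        token = mapping.setdefault(ip, "user-" + secrets.token_hex(4))
        out.append((token, url))
    return out

def name_from_profile_url(url):
    """Hypothetical rule: a profile URL on a made-up site embeds the user's name."""
    prefix = "https://social.example/profile/"
    return url[len(prefix):] if url.startswith(prefix) else None

def reidentify(pseudo_log, extract_name):
    """One identifying request unmasks every request with the same token."""
    names = {}
    for token, url in pseudo_log:
        name = extract_name(url)
        if name is not None:
            names[token] = name
    return [(names.get(token, token), url) for token, url in pseudo_log]

# Invented log of (IP address, requested URL) pairs.
raw_log = [
    ("203.0.113.7", "https://maps.example/directions?from=12+Elm+St"),
    ("203.0.113.7", "https://social.example/profile/alice-smith"),
    ("203.0.113.7", "https://clinic.example/appointments"),
    ("198.51.100.9", "https://news.example/"),
]

pseudo = pseudonymize(raw_log)
linked = reidentify(pseudo, name_from_profile_url)
for who, url in linked:
    print(who, url)
```

Because the token is the same on every line for a given user, the single profile visit puts a name on the map lookup and the clinic visit too. The second user stays pseudonymous only because nothing in their log gave a name away.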
Few internet users consider that companies are analyzing every single move they make on the internet. But US-based ISPs routinely sell all of this 'anonymized' data to a variety of clickstream companies.
Yes, I suppose the ISPs could try to screen out information from social networking sites. But could they remove all reference to all sites with web forums? Could they filter all sites where users write product reviews? There are so many of those sorts of sites and they change so frequently that filtering all that content from the 'anonymized' logs would seem completely infeasible.
Those types of sites often make up the majority of many users' browsing habits. So if visits to 'identifier' sites were removed from the data, the minimal remaining data would probably be of little use to the clickstream companies and researchers.
The fact is that users of this data are really analyzing the web browsing habits of specific, individual users. Because of this, I cannot think of any feasible method to keep the data useful for clickstream companies and researchers while guaranteeing any real level of user anonymity. Your ISP is probably selling your web browsing logs today, and this data is so poorly anonymized that anyone with the data could probably figure out exactly who you are.
People can agree to share information publicly, such as movie or product rankings. Doing so will drive down the price of costly marketing studies and democratize insightful information.
To balance the protection and sharing of information, more complex social network infrastructure is required; maybe projects like OpenQabal can help.
Well Gee Wally, they share our data with everybody damned else.
Nodal networks are interesting things. There's research to be had there, regardless of what a 'node' is. This article is about cleansing real world data in such a way that the 'nodes' can be used for such research regardless of nodal identity. So, yes, real and interesting anonymous data can be gleaned. But so can meta-data associated with a 'node'.
Just hope that you don't become too AdNoid while your AdNodes are tonsured.
Cheers,
Matt
"Even without a name, SSN, or other ID that explicitly links a record to a specific person, could researchers cross-reference the data with other databases well enough to identify people via patterns in their health record? I'm guessing yes."
That's why even "anonymous" medical records are still confidential. In medical research (which I do), names and other obvious identifiers are left out when patient records are extracted into databases, but the intent is not to make it impossible to identify people. It is just a matter of only giving access to the confidential information that is needed for the task at hand.
I don't think the issue is that "anonymization" isn't anonymous enough - it is a matter of maintaining confidentiality.
So how does this acceptance of the professional responsibility of researchers change when one acknowledges that at any moment homeland security or the like can issue a national security letter to obtain access to the dataset? They could use it to identify potential troublemakers, and moreover to uncover people's secrets to blackmail them with. Or they could uncover minor crimes and selectively enforce the laws on people they suspect of whatever they aren't able to prove. Worse, they could employ these tactics on political enemies. Imagine a McCarthy with access to such a cheap wealth of actionable information.
"If still these truths be held to be
Self evident."
-Edna St. Vincent Millay
- Music-DRM protects the RIAA's data and tries to prevent end-users from deriving an unprotected version of the data in the file.
- Music-DRM makes things more difficult for legitimate customers who legally purchased the data files (music).
- Music-DRM has always been defeated.
We can't have our cake and eat it too. You either have usable data for researchers, or else you have privacy for people. It also occurs to me that data-mining technologies can be applied to breaking music-DRM in the same manner as they can be applied to breaking personal-data-DRM.
- Personal-data-DRM (anonymization) protects the ISP's or hospital's data and tries to prevent end-users (researchers) from deriving an unprotected (unanonymized) version of the data in the file.
- Personal-data-DRM (anonymization) makes things more difficult for legitimate researchers who legally obtained the data files.
- Personal-data-DRM (anonymization) will always be defeated.
I'm not repeating myself
I'm an X window user; I'm an ex-Windows user
Powerful weapons, particularly early in their history, are invariably clumsy and prone to lots of collateral damage.
I'm not sure society would accept the cost of that damage in exchange for the benefit, even if you claimed it would only be for a transition period. Heck, I'm not sure you could convince me any such transition period would ever end.
"If still these truths be held to be
Self evident."
-Edna St. Vincent Millay