NSF Funds Data Anonymization Project
Trailrunner7 writes "A group of researchers from Purdue University has been awarded $1.5 million from the National Science Foundation to help fund an ongoing project that's investigating how well current techniques for anonymizing data are working and whether there's a need for better methods. The grant will help to further research from computer scientists and linguists, who are looking at ways in which people can still be identified through textual clues even after explicitly identifiable data has been removed. The Purdue anonymization project has been ongoing for some time, and also includes researchers from a number of other institutions, including Indiana University and the Kinsey Institute."
It works!
Can I pick up my grant check now?
I wonder if they could get a larger grant from Google or Facebook or the NSA or [insert large organization name here] to get a guaranteed result of "things are just fine, nothing to see here"?
NSFW?
"Common sense will be the death of us all"
"The grant will help to further research from computer scientists and linguists, who are looking at ways in which people can still be identified through textual clues even after explicitly identifiable data has been removed." SHOULD READ
"The grant will help to further research from computer scientists, linguists, AND the N.S.A. who are looking at ways in which people can still be identified through textual clues even after explicitly identifiable data has been removed."
Yours In Krasnoyarsk,
Kilgore T.
Headline had me thinking the science grants were returned Non-Sufficient Funds... that's a sign of a really bad economy.
The research is actually into data mining, not some new forms of encryption/anonymization.
I'm sure the results will provide insight that may lead to better anonymization, but I bet framing the whole thing around the more popular side of that spectrum makes it sell better.
So, is this a good development or a bad development? If finding better ways to identify people leads to better ways to remove that information, then is it better?
Or is it better because it will help us not remain anonymous when we donate to our favorite cause and that organization is in some way involved in US politics?
... in other words, meatloaf!
http://www.nsf.gov/awardsearch/showAward.do?AwardNumber=1012208
above is direct link to award
And I don't see how you can have meaningful data if you removed all the information that would enable you to recreate the individual sample.
You can still segregate them into groups even if you can't identify the individual sample - which is essentially what happens already. Data miners go and determine "People who like Penny Arcade also like video games - so let's put an ad for Fable 3 up on the main page". Whether that is Penny Arcade's decision to get more click revenue or whether they just let an ad server handle that obvious piece of info is irrelevant; you are still using relevant data with meaning to market to a large group of people instead of an individual.
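To make the group-level point concrete, here is a minimal sketch of that kind of mining, assuming each record is just an anonymous set of interests with no name or ID attached. The function name `co_interest_rates` and the sample data are made up for illustration; this is not how any particular ad server works.

```python
from collections import Counter

def co_interest_rates(records, anchor):
    """For anonymous records (each a set of interests), return the share of
    people who like `anchor` that also like each other interest."""
    fans = [r for r in records if anchor in r]
    if not fans:
        return {}
    # Count every other interest that appears alongside the anchor.
    counts = Counter(i for r in fans for i in r if i != anchor)
    return {interest: n / len(fans) for interest, n in counts.items()}

# Hypothetical, made-up records: no identifiers, only interests.
records = [
    {"Penny Arcade", "video games"},
    {"Penny Arcade", "video games", "Linux"},
    {"Linux"},
    {"Penny Arcade"},
]
rates = co_interest_rates(records, "Penny Arcade")
```

Nothing here identifies an individual, yet the rates are still enough to decide which ad to show to the whole group.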
Now - this article brings up the idea of whether I can retain my anonymity online. If it were up to me to run this experiment, I would do exactly as you said: some advanced behaviour analysis technique. First we'll start off here: I'm on Slashdot. You have an alias, and you have a few of my posts. You can tell that they tend to get a little long-winded sometimes, easily getting to 3 or more paragraphs if there isn't an immediate punchline in sight, or when responding to a question. You also get what stories I usually respond to - I often don't have much to say about Linux releases, but I am often avid in the gaming area.
So you go to a lot of the other sites that you can infer Slashdotters might frequent: all the tech news sites, and then those matching my posting habits, a lot of gaming sites, yadda yadda yadda. The first thing you are looking for is similar aliases; then you cross-reference the posts on different sites to see the similarities. How many Monkeedudes are there on the Gamespy forums? Do any of them make really long posts? He's mentioned on Slashdot that he is Canadian - do any of the other sites have public profile info that says he's Canadian?
And so on and so forth. This is all automated - so it's much quicker than a person trying to build this file. After it's all built, a human can quickly skim the data and knock off any outliers that might have seemed similar to the computer.
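A rough sketch of what that automated pass could look like, using only the three signals mentioned above (similar alias, similar post length, matching profile country). Everything here is an assumption for illustration: the function names, the dictionary layout, the sample profiles, and especially the 0.5/0.3/0.2 weights, which are arbitrary and not taken from any real system.

```python
from difflib import SequenceMatcher
from statistics import mean

def alias_similarity(a, b):
    """Case-insensitive string similarity between two aliases, 0.0 to 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def length_similarity(posts_a, posts_b):
    """Ratio of the average post lengths (in words), 0.0 to 1.0."""
    la = mean(len(p.split()) for p in posts_a)
    lb = mean(len(p.split()) for p in posts_b)
    longest = max(la, lb)
    return min(la, lb) / longest if longest else 0.0

def link_score(known, candidate):
    """Combine the signals with arbitrary, illustrative weights."""
    score = 0.5 * alias_similarity(known["alias"], candidate["alias"])
    score += 0.3 * length_similarity(known["posts"], candidate["posts"])
    # A stated country only counts when both profiles list the same one.
    if known.get("country") and known.get("country") == candidate.get("country"):
        score += 0.2
    return score

def rank_candidates(known, candidates):
    """Best match first; a human would still skim the top of this list."""
    return sorted(candidates, key=lambda c: link_score(known, c), reverse=True)

# Hypothetical, made-up profiles.
known = {"alias": "Monkeedude", "country": "Canada",
         "posts": ["a fairly long post " * 40, "another long reply " * 30]}
candidates = [
    {"alias": "xXsniperXx", "country": None, "posts": ["lol", "first post"]},
    {"alias": "monkeedude1212", "country": "Canada",
     "posts": ["a long rant about games " * 25]},
]
ranked = rank_candidates(known, candidates)
```

The ranking only narrows the field. Knocking off the look-alikes that fooled the scoring is still the human step described above.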
Now - have I ever mentioned my name anywhere in all the data collected? My age? My city? Can you infer my age given the relative maturity of my posts - and my registration dates and other posts online? Can you infer my city based on my jokes about the weather around here? How hard would it be to nail me to a Facebook page with various likes and dislikes - if that information were available to you (either publicly or for sale)?
It's a scary world we live in. I don't know if any such systems exist, but I see it as definitely technically feasible. It also seems like a great product I could market and make lots of money off of - but I definitely don't believe in progressing that side of the internet.
government entity (CIA+NSA) * national security + keylogger or trojan = we ownz all your base (where base = data). Anonymization, HAH.
"We are just a war away from Amerikastan. When god vs god the undoing of man." Dave Mustaine
For better anonymization, you could run the data through Google Translate a few times. That'll guarantee that it's anonymized.