Why Anonymized Data Isn't
Ars has a review of recent research, and a summary of the history, in the field of reidentification — identifying people from anonymized data. Paul Ohm's recent paper is an elaboration of what Ohm terms a central reality of data collection: "Data can either be useful or perfectly anonymous but never both." "...in 2000, [researcher Latanya Sweeney] showed that 87 percent of all Americans could be uniquely identified using only three bits of information: ZIP code, birthdate, and sex. ... For almost every person on earth, there is at least one fact about them stored in a computer database that an adversary could use to blackmail, discriminate against, harass, or steal the identity of him or her. I mean more than mere embarrassment or inconvenience; I mean legally cognizable harm. ... Reidentification science disrupts the privacy policy landscape by undermining the faith that we have placed in anonymization."
For almost every person on earth, there is at least one fact about them stored in a computer database that an adversary could use to blackmail, discriminate against, harass, or steal the identity of him or her. I mean more than mere embarrassment or inconvenience; I mean legally cognizable harm.
...And this is the first thing that the author(s) thought of regarding data-mining? Okay, but how would this happen? Why go through all the trouble to gather all that data when you could just hire a P.I. or know (or bribe) a law-enforcement official or an ISP employee? It reminds me of a conversation I had with a guy who bragged that he could get anybody's info because a very good friend of his worked at the DMV. There were a couple of semi-high-profile firings at the State Department because some employees snooped through celebrities' records for no reason other than voyeurism..er..curiosity.
Those types, the ones with the direct access to the info, are the weakest link. They're only human. "Hey, Bob, there's this guy I really hate. Look up his IP logs and tell me what you see!"
It all boils down to voyeurism. People would rather bring others down than bring their own lives up. It's the nature of the beast! Pathetic.
The only way to make sure that data remains truly anonymous is for it to start out as anonymous data. "Scrubbed" data will always be traceable, and often the non-scrubbed source data will leak into the wild.
All hail the glorious Hypno-Google.
Great, another Ohm's law to learn.
They're on to me.
Am I the only one who always gives their birthday as 01/01/1970 and their zip code as 20500?
I mean, seriously. They don't need to know. Why would I give 'em the right numbers? They're lucky I even allow them to have rough demographic data.
ad logicam Claiming a proposition is false because it was presented as the conclusion of a fallacious argument.
See!
-- Anonymous Coward
It takes a man to suffer ignorance and smile
Be yourself no matter what they say
Holy hell forget about that anonymized data crap, I want to learn how she can compress that much data into three bits!
I've pretty much given up any hope of being anonymous. It's just going to get exponentially more difficult as time goes on.
I had my credit card stolen once. It was stolen from the CC company. How is a business supposed to entrust me with thousands of dollars in credit if they don't know who I am? How is a credit card company supposed to function without a worldwide network which authorizes transactions?
If someone wants to find me they'll find me.
If someone wants to use my identity to frame me for a crime then they're just going to encounter a mountain of evidence from numerous sources which contradict their fabrication.
"My G1 was on a Starbucks Wifi at the time of the crime. I used my CC to purchase the drink. I received a text from a nearby tower. I posted a comment on breaking news story that is written in my style of writing. I was seen on 8 security cameras walking to the starbucks from my car. I used an automatic toll card 5 miles away from the coffee shop...." Good luck coming up with a large mountain of evidence to put me somewhere else.
[citation needed]
I can't think of anything I've done online (even my shemale midget fetish on youpron) that could be used to blackmail me. Now, I get that others are more ashamed about what they do online, but "almost everybody"?
IranAir Flight 655 never forget!
Forget anonymity. I'm better off living in a glass house, so it's easier for me to know when I need to yell "Get off my lawn!"
"Ohm terms a central reality of data collection: 'Data can either be useful or perfectly anonymous but never both.'"
Okay, I just got finished anonymizing some data. What's going out is ID (incremented, starting at 1), total voxels, voxels increasing and voxels decreasing. The people who are getting the data think it is highly useful. According to Ohm's "law" that means it is not anonymous.
Unless someone (including Ohm) pipes up with a plausible means for identifying the original subjects, I call BS on Ohm's "law."
Even if the data is completely and irreversibly anonymized, it is still invasive. Look at the story yesterday about the marketers data-mining kids' online private conversations for consumer gadget preferences. Even if there's no way from that data to infer the preferences of any particular kid, they should still be able to talk to each other without having their conversation be part of a marketing survey.
Think also of a cafe that sells two kinds of food: apple pie (eaten by freedom-loving patriots), and felafel (eaten by terrorists and their supporters and sympathizers). Of course it would be invasive for the cafe to disclose which of its customers ordered which kind of food. But even releasing aggregate statistics is bad. An increase in felafel sales can lead to a bullshit FBI investigation even if individual customers aren't identified.
People sitting on private data constantly search for self-serving justifications to disclose as much as they can without getting clobbered by the sources of the data. It is bullshit. Private should mean no disclosure, not anonymized disclosure, not aggregate disclosure, just plain no disclosure period.
If you ever wonder why people view the privacy of your records in the hands of third parties as important, and don't just hop on the "privacy is dead" bandwagon, this is the sort of scenario they have in mind.
http://en.wikipedia.org/wiki/Mother_Earth_(magazine)
Mother Earth was an anarchist journal that described itself as "A Monthly Magazine Devoted to Social Science and Literature," edited by Emma Goldman. Alexander Berkman, another well-known anarchist, was the magazine's editor from 1907 to 1915. It published longer articles on a variety of anarchist topics including the labor movement, education, literature and the arts, state and government control, and women's emancipation, sexual freedom, and was an early supporter of birth control. Its subscribers and supporters formed a virtual "who's who" of the radical left in America in the years prior to 1920.
In 1917, Mother Earth began to openly call for opposition to American entry into World War I and specifically to disobey government laws on conscription and registration for the military draft. On June 15, 1917, Congress passed the Espionage Act. The law set punishments for acts of interference in foreign policy and espionage. The Act authorized stiff fines and prison terms of up to 20 years for anyone who obstructed the military draft or encouraged "disloyalty" against the U.S. government. After Emma Goldman and Alexander Berkman continued to advocate against conscription, Goldman's offices at Mother Earth were thoroughly searched, and volumes of files and detailed subscription lists from Mother Earth, along with Berkman's journal The Blast, were seized. As a Justice Department news release reported:
"A wagon load of anarchist records and propaganda material was seized, and included in the lot is what is believed to be a complete registry of anarchy's friends in the United States. A splendidly kept card index was found, which the Federal agents believe will greatly simplify their task of identifying persons mentioned in the various record books and papers. The subscription lists of Mother Earth and The Blast, which contain 10,000 names, were also seized."
Mother Earth remained in monthly circulation until August 1917. Berkman and Goldman were found guilty of violating the Espionage Act, were imprisoned for two years, and were later deported.
Prisencolinensinainciusol. Ol Rait!
Any particular reason you chose District of Columbia?
I understand that one bit alone will do to specify the sex but, how does one specify ZIP code with just one bit? One bit will tell you whether or not somebody has a ZIP code but, in order to specify a ZIP code completely we need - what, 16 bits?
All persons who understand encryption also understand that there is no such thing as perfect encryption. Anonymizing(sp?) data works using roughly the same methods as encryption, and there is no such thing as an unbreakable encryption. We can only hope for "acceptable". I'd assume the most acceptable means of anonymizing data would be to allow the user to first choose what gets scrubbed out, followed by a sort of data "blacklist" compiled by experts. The real problem here is that companies selling this data have a vested interest in never getting it quite right.
Where genius and insanity become confused true wisdom is found
So, despite the Birthday Paradox, they can still identify 87% of Americans? For some reason I'm under the impression that there are a lot more zip codes with more than 366 people (heck, even 1000 to call upon 3 or 4 duplicates that should cover gender differences) than there are zip codes under that amount.
More Twoson than Cupertino
Potential nitpick, but here goes.
The summary (not surprisingly for a /. summary) omits a couple of details that give the reader a rather partial picture.
For one, Paul Ohm is an Assistant Professor of law, and although the summary makes it sound like the linked article would be from a technical perspective, (mostly) it is not.
A quote like:
"Data can either be useful or perfectly anonymous but never both."
needs a bit of background about the qualification of the person making that claim. Why? Simply because it sounds like a rather technical remark. If some computer science researcher made this claim, I would tend to take it more at face value, otherwise I would take it with a grain of salt.
Now obviously this statement was not meant to be taken quite literally because the notion of "useful" is not precise. I can get reasonably useful information like "most of the people in my country like to buy branded stuff" or "most people who rent videos of actor X regularly, also rent the videos of actor Y regularly" without needing the underlying data to contain *any* personally identifiable information. The fact that extra data is stored is a different thing.
I personally believe that instead of claiming that some researcher has argued X, it can be more informative to actually say what kind of researcher it is who made a claim. Not because only researchers in a certain area can be trusted, but because a little bit of background puts the claims in the right perspective.
English is not my first language, so I probably didn't catch the whole meaning, but...
The idea was that everyone can be identified with only the birth date, gender and ZIP code? So... err... There is, in fact, not even one ZIP code that has two people living there of the same gender that happen to share a birthday? Sure, to have the year coincide would take a bit more than just the date itself but it's hard for me to imagine that this could be true.
So... what did I miss?
Data can either be useful or perfectly anonymous but never both
What a load of bolaks....
Supposing you have a list of -just- birth dates for every citizen at the census. You have -only- been given one piece of data per person, the date, nothing more. Just a huge list of dates, sorted chronologically.
1) The data has been totally anonymised.
2) You can do all kinds of meaningful analysis on the age demographics of the population. And make policy decisions based on that.
Fully anonymous data producing useful results.
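A minimal sketch of that kind of analysis, assuming nothing but a bare list of birth dates. The function name, the census day, and the sample dates are all made up for illustration; no real census data is involved.

```python
from collections import Counter
from datetime import date

def age_histogram(birth_dates, census_day, bucket_years=10):
    """Count people per age bucket, given nothing but birth dates."""
    counts = Counter()
    for born in birth_dates:
        # Age in whole years on the census day (subtract 1 if the
        # birthday has not happened yet that year).
        had_birthday = (census_day.month, census_day.day) >= (born.month, born.day)
        age = census_day.year - born.year - (0 if had_birthday else 1)
        counts[(age // bucket_years) * bucket_years] += 1
    return dict(sorted(counts.items()))

# Hypothetical sample, not real records.
sample = [date(1950, 3, 1), date(1955, 7, 9), date(1982, 1, 15), date(2001, 12, 31)]
histogram = age_histogram(sample, census_day=date(2009, 9, 1))
print(histogram)
```

The output maps the start of each age bucket to a head count, which is enough for the demographic questions described above. Whether such a list stays anonymous once it is joined with other columns is a separate question.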
"Oops, I always forget the purpose of competition is to divide people into winners and losers." - Hobbes
This is much too extreme. There are many good examples of useful data that is for almost all intents and purposes anonymous. Consider the example of anonymous lending libraries from my book, Translucent Databases.
The simplest version just pushes the book title through a one-way function. The more complex version also hides the name in a similar way.
Can the anonymity be stripped away? There are coincidences and connections as Sweeney's examples and the Netflix examples show, but they can be fought by adding some salt/nonce to the one-way function. We can also add passwords.
There are so many different ways to add bits of complexity to the results that there are many tradeoffs we can make between effective privacy and the complexity of using the systems. I think it's good to keep the weaknesses in mind, but I think it's more of a feasible engineering problem than something that should be dismissed out of hand. (The law review piece is also worth reading in its entirety because it's more concerned with the legal issues created by the existence of privacy-enhanced databases. It would be simpler for some issues if they didn't exist and so it helps to argue seriously.)
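A rough sketch of the lending-library idea, under stated assumptions: HMAC-SHA256 stands in for "a one-way function", and the secret salt, the `blind` helper, and the in-memory `loans` table are hypothetical names chosen for this example, not taken from the book.

```python
import hashlib
import hmac

# Assumption: a secret salt known only to the library's software.
SALT = b"hypothetical-library-secret"

def blind(value: str, salt: bytes = SALT) -> str:
    """Turn a name or title into a one-way token (salted HMAC-SHA256)."""
    normalized = value.strip().lower().encode("utf-8")
    return hmac.new(salt, normalized, hashlib.sha256).hexdigest()

# The table stores only tokens: patron token -> set of title tokens.
loans = {}

def check_out(patron: str, title: str) -> None:
    loans.setdefault(blind(patron), set()).add(blind(title))

def has_on_loan(patron: str, title: str) -> bool:
    # Anyone who can supply both values (and the salt) can verify a loan,
    # but the table cannot simply be read to list who borrowed what.
    return blind(title) in loans.get(blind(patron), set())

check_out("Alice", "Dune")
print(has_on_loan("alice", "Dune"), has_on_loan("Bob", "Dune"))
```

The salt matters because names and titles come from small, guessable sets; without it, an attacker could hash every title in a catalog and match the tokens.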
Data can either be useful or perfectly anonymous but never both
I'm not sure I entirely agree with this statement. While it's technically correct, I believe it's misleading...
It's perfectly possible to hash personally identifiable information into an MD5 sum, to ensure that your records are unique, and then to generate useful statistics based on the resulting aggregate data without releasing significant personal information.
For instance:
Key = Hash(Your name + Your Zip + Your Birthday)
Zipcode
Birth Decade
Hobbies
Household income (Averaged to the nearest $20K increment.)
This information is significantly anonymous, and still highly valuable market research. If you happen to submit your information twice, it will be caught by the unique hash.
Of course, the author describes 'perfect' anonymity. It's technically possible that you're the only person in your zip who is between the ages of 21 - 30, enjoys playing video games, and makes $60K-$80K a year... However, it's generalized enough to provide a great deal of plausible deniability.
The same basic statement about anonymity could be made for a person standing in a crowd: given enough detail, you could identify that person by their appearance, without knowing any unique identifying information about them.
What should you get from this? You aren't as anonymous online as you might expect. Who's really surprised?
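The record layout in this comment could be sketched roughly as follows. MD5 is used only because the comment names it; the field names, the ISO date format for the birthday, and the helper names are assumptions made for this example.

```python
import hashlib

def record_key(name: str, zip_code: str, birthday: str) -> str:
    """Key = Hash(name + zip + birthday), here as an MD5 hex digest."""
    raw = f"{name.strip().lower()}|{zip_code}|{birthday}"
    return hashlib.md5(raw.encode("utf-8")).hexdigest()

def survey_record(name, zip_code, birthday, hobbies, income):
    """Build the generalized record; birthday is assumed to be 'YYYY-MM-DD'."""
    birth_year = int(birthday[:4])
    return {
        "key": record_key(name, zip_code, birthday),
        "zipcode": zip_code,
        "birth_decade": birth_year // 10 * 10,
        "hobbies": sorted(hobbies),
        # Nearest $20K increment, as in the comment.
        "income_band": round(income / 20_000) * 20_000,
    }

# Storing by key means a second submission overwrites the first.
table = {}
for person in [
    ("Jane Doe", "12345", "1984-05-17", ["video games"], 64_000),
    (" jane doe ", "12345", "1984-05-17", ["video games"], 64_000),
]:
    rec = survey_record(*person)
    table[rec["key"]] = rec

print(len(table), rec["birth_decade"], rec["income_band"])
```

Note that the key is only as hard to reverse as its inputs are hard to guess, and that the stored ZIP code and decade still narrow down who a record can belong to, which is the "technically possible" case the comment mentions.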
I think the article should differentiate between an 'anonymous' dataset and one that is 'de-identified.' Anonymous as defined in the article is not the same thing as de-identified as defined by HIPAA. If a dataset is de-identified, you cannot include date of birth (except year), date of visit/service (except year), or anything more than the first three digits of the zip code (there are 15 other identifiers that aren't allowed, either).
The attack described in the article wouldn't work if the dataset were de-identified, or at the very least, would have been a lot more difficult.
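To see why the reduced fields make the attack harder, here is a small sketch that counts how many records are unique on {zip, dob, sex} before and after keeping only the birth year and the first three ZIP digits. The four records are invented, and the field names are assumptions for this example.

```python
from collections import Counter

def full_key(rec):
    return (rec["zip"], rec["dob"], rec["sex"])

def reduced_key(rec):
    # Keep only the first three ZIP digits and the birth year.
    return (rec["zip"][:3], rec["dob"][:4], rec["sex"])

def count_unique(records, key_fn):
    """Number of records whose key is shared with nobody else."""
    groups = Counter(key_fn(r) for r in records)
    return sum(1 for r in records if groups[key_fn(r)] == 1)

# Invented records, dob in 'YYYY-MM-DD' form.
records = [
    {"zip": "12345", "dob": "1970-01-01", "sex": "F"},
    {"zip": "12346", "dob": "1970-06-30", "sex": "F"},
    {"zip": "12399", "dob": "1970-11-02", "sex": "F"},
    {"zip": "12345", "dob": "1985-03-14", "sex": "M"},
]

unique_before = count_unique(records, full_key)
unique_after = count_unique(records, reduced_key)
print(unique_before, unique_after)
```

In this toy set every record is unique on the full fields, while the reduced fields leave only one record standing alone. Real datasets will behave differently, so this shows the mechanism, not a guarantee.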
" 87 percent of all Americans could be uniquely identified using only three bits of information: ZIP code, birthdate, and sex." I'll be generous and overlook the gross misuse of the term "bits" in this context and pretend that the author wrote "tidbits" instead. That said, I do not for a second believe that 87 percent of Americans were the only person of the same gender born in their particular zip code on the same day. That's just ludicrous. Now if "birthday" is actually referring to a more specific point in time, such as the exact second of birth, then sure...but that's not really common knowledge.
"in 2000, [researcher Latanya Sweeney] showed that 87 percent of all Americans could be uniquely identified using only three bits of information: ZIP code, birthdate, and sex"
That doesn't seem right. IIRC, there are somewhere around 60,000 zipcodes. (Obviously there are under 100000.) If the population is 300 million, that's an average of about 5000 people per zipcode. Male/female splits it in half, so you have 2500 birthdates to distribute uniquely over 365 days.
Looked at another way, 365 days *times* 2 sexes *times* 60000 zipcodes totals less than 44 million. How do you uniquely ID 300 million people?
Add the problem that many people could have given you either their work or home zipcode. How does she do that?
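A quick check of the arithmetic, assuming "birthdate" means the full date including the year, which is how the term is normally read. The 60,000 ZIP codes and 300 million people are this comment's own rough figures; the 80 birth years is an assumption added here.

```python
DAYS_PER_YEAR = 365
SEXES = 2
ZIP_CODES = 60_000         # rough figure from the comment
POPULATION = 300_000_000   # rough figure from the comment
BIRTH_YEARS = 80           # assumption: people alive span about 80 birth years

# Birthday only (month and day), as computed in the comment.
bins_birthday_only = DAYS_PER_YEAR * SEXES * ZIP_CODES

# Full birthdate (year, month, and day).
bins_full_birthdate = bins_birthday_only * BIRTH_YEARS

people_per_bin = POPULATION / bins_full_birthdate
print(bins_birthday_only, bins_full_birthdate, round(people_per_bin, 3))
```

With the year included there are far more bins than people, so most bins hold at most one person. The figures are crude (populations and birth dates are not spread evenly), but they show why the claim is not ruled out by counting alone.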
How is this any different than articles about rockets and space travel (after all, most of us will never travel into space, or work for NASA)? Or any other in a myriad of technical subjects that most of us are not, and will not be directly involved in or use directly.
People are curious. They are curious about everything. It's an exercise in futility to pick and choose useful information over non-useful information since none of us knows what tomorrow holds. If someone wants to read celebrity gossip more power to them. In truth, the gossip is more likely to be both true and useful than news about a new process that may produce titanium at half the cost or an article about NASA's next big toy. We on slashdot find the technical news more interesting, normal people who are interested in interpersonal relationships find the gossip more interesting. It's two sides of the same coin.
what if your family members' lives become endangered if you "come out"? true fucking story, Jack, of a friend of mine that was a lesbian. Someone threw her out of a second story window for even hinting at coming out. Parents were very important government officials of a not so understanding government. Details left out so they can all live.
I have worked with anonymized government data extensively, and birthdate and zipcode are always considered personally identifiable information. Sometimes birth year is available, and sometimes state or (rarely) county is available, but I have never even heard of a dataset with both. Datasets with month and day of birth are never considered to be anonymized, and are not released. The author of the paper is much overwrought.
not your personal army.
01/01/1901
What if anonymizer software is mandated to thoroughly replace details with random ones (random names where names are not important, random IPs where IPs are not important) or to blank them out?
This ought to be mandated and audited.
This can be done publicly - if the data is random, a public dump should not be able to harm anyone because there is proof that at that time, the said person was in an altogether different place doing something totally unrelated.
This is a rule that should be made compulsory for any system that holds public records that can leak onto the internet.
Sale of people's records will become that much more difficult if regular audits are needed or public dumps are needed.
Although all of this can be circumvented, the legal costs it attaches to the operation are now many times those of a mere accidental breach of privacy: not obfuscating data you do not need becomes a crime for corporations.
A law that helps the people should make sense given this obvious problem.
I have a twin brother living with me. Now try to identify me, Haha!
I do research in the field of anonymization and can say that I agree with a lot of his points, but he takes each of them too far and sounds very alarmist. He seems to see things in a very binary way. One can have anonymization that is effective at preventing reversal for 99% of individuals or certain types of attacks. For example, I may be able to release a data set that has almost a 0% chance of revealing any particular user but a 100% chance that someone could be revealed.
Anyway, one of the good points he brings out is how stupid the requirements in HIPAA are. One can anonymize with the safe harbor rules (from EU I think) which basically destroy information needed for most kinds of analysis, or they can get a statistician to certify that it has been statistically de-identified without any specific standard for what that means. So in practice you can get anything released if you hire the right statistician.
If the collector of data did something like an MD5 hash to verify an identity, that's very difficult to do an inverse operation on.
Individuals simply aren't capable of securing this information and protecting themselves. They aren't given the power to actually do anything.
The responsibility lies SOLELY with the businesses collecting this information and with law enforcement. They're greedy and stupid, which is why problems occur.
Business:
They're cheap. This is almost the SOLE reason security problems exist. If they actually paid for network hardening, dedicated security staff, BONDING for key individuals and customers, and most importantly, tight restrictions on selling information, these problems would disappear. It's actually pretty easy, it just costs you money (especially in opportunity cost as you can't sell to anyone who is even slightly suspicious). Certain businesses (like ChoicePoint) would be completely unprofitable under such rules, so should be regulated out of existence.
Law enforcement:
Law enforcement has completely dropped the ball on cybercrime because it's a little bit difficult to enforce and isn't "sexy" enough. It requires technical expertise and many of the worst offenders are overseas. Overseas, the problem has to be dealt with through trade sanctions. Sanctions should be imposed on Brazil, Russia, and China until they do something about cybercrime or let US law enforcement operate freely. All of these countries refuse to extradite, which is the core problem.
Collecting sensitive information like addresses, birthdays, etc.... is a HUGE violation of individual liberties anyway. In almost all cases the collector should be told to F OFF!
This decade is seriously becoming a game with all the types of rules we used to play for fun 10-15-20-25 years ago.
Does anyone else notice that we got a Patent Turbo-Troll who offers to report you to a legal thug, deliberately making data easy to steal, and "descrubbing" anonymous data all in the same few HOURS?
Come on gang, that's the TimeTwister Combo from MTG.
RIAA got grumpy because record data (songs) was "easy to (infringe)". Phish emails are ... (your verb here) your digital data from the less savvy types.
Someone appoint Richard Garfield as Special Consultant to the President so when stupid new "calls for X" show up on the President's desk, Garfield can take a 30 minute look at it and abuse the hell out of it so bad that it makes Goatse look like a Victorian Picnic.
My first Journal Entry ever, in 8 years! http://slashdot.org/journal/365947/aphelion-scifi-fantasy-horror-poetry-webzine
Have you ever thought about how a "cat" scan works? Forget the 3D aspects and let's just think about how the cross-sectional pictures work.
Every given reading is just shooting a ray through the target, and getting a single number out. This is analogous to how aggregate summaries relate to the personal details in data. You know the average income of people in zip code 12345, but no specifics. The trick is, later, just as that CT scan is going to shoot a ray through a certain point again from a different direction, your personal details are going to be summarized again by someone else, in a different way.
A picture will emerge. The CT scan is going to "see" the bone as distinct from the tissue right here at this pixel, and this person's data will be un-summarized. It just takes enough rays, and eventually all ambiguity goes away.
A long time ago (about 20 years ago, I think?) there was a neato explanation of a cat scan algorithm in Scientific American. I wish I could find it. Because I bet you could show that article to any "database guy" these days, and they'd nod and smile.
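The simplest version of the "second ray" is a differencing attack: two summaries that overlap everywhere except for one person. A toy sketch, with invented names and incomes and a hypothetical `publish_mean` helper standing in for whoever releases the aggregates:

```python
def publish_mean(incomes, group):
    """Release only an aggregate: (head count, mean income) for a group."""
    values = [incomes[name] for name in group]
    return len(values), sum(values) / len(values)

# Private data that is never released directly (invented values).
incomes = {"ann": 52_000, "bob": 61_000, "cy": 47_000, "dee": 250_000}

# Summary 1: everyone in the zip code.
count_a, mean_a = publish_mean(incomes, ["ann", "bob", "cy", "dee"])

# Summary 2, from someone else: the same zip code minus one person.
count_b, mean_b = publish_mean(incomes, ["ann", "bob", "cy"])

# Count times mean gives each total; the difference is one person's income.
recovered = round(count_a * mean_a - count_b * mean_b)
print(recovered)
```

Real reconstructions need many more summaries and some algebra, as with a CT scan, but the principle is the same: each extra aggregate removes some ambiguity.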
"Believe me!" -- Donald Trump
I don't care, it's the remaining 13% of me that's special.
Nullius in verba
Even Nigerians?
>>I think the solution is to have the concept of "intellectual property" work both ways.
I think you are absolutely right! But then that property would be abused just like every other company's. But at least it would give a person a legal standing. My life history (as partially divulged on Facebook) is a work of art and should be protected just like a painting or Hollywood film. It should be inherent and explicit that my persona should be protected under copyright laws, and anything stemming from that would fall under intellectual property rights jurisdiction. The only problem with all of this is that the entire concept of intellectual property is flawed, as we are all pooled from a collective consciousness that can't be dissected so easily. But in the relative world, having intellectual property law extend to any record created by our lives would be an improvement.
"I think, therefore I can't be."
-- TTNH
I just don't believe 87% of people can be uniquely identified by {zip,dob,sex} or 18% by {county,dob,sex}. dob only creates 366 bins. Adding sex makes it 732 bins. Even if all 100000 possible zip codes contained a population of 732, that would make for only 73 million people or less than 1/4th of the actual population. And date of birth will not be even close to being uniformly distributed.
From Census Bureau data, the median county population across the U.S. in 2008 was 26,000+. Again, without doing a more intensive analysis to be certain, it still seems unlikely that 18% of individuals would be uniquely identifiable by {county,dob,sex}.
Upon further inspection, 19 counties out of 3193 in the U.S. were estimated to have populations less than 732 in 2008.
NO FRIGGIN' WAY would 18% of all U.S. citizens be uniquely identifiable by {county,dob,sex}. NOT EVEN 1% would be so identifiable.
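A back-of-the-envelope check of this estimate, under loudly stated assumptions: "dob" is read as the full date of birth (year included), birth years span about 80 years, and people are spread uniformly over the bins. None of that is exactly true, so treat the numbers as orders of magnitude only. The function name is made up for this sketch.

```python
def expected_unique_fraction(population: int, bins: int) -> float:
    """Chance that nobody else lands in a given person's bin,
    assuming everyone is spread uniformly over the bins."""
    return (1 - 1 / bins) ** (population - 1)

BIRTH_YEARS = 80                         # assumption
BINS_PER_AREA = 365 * BIRTH_YEARS * 2    # dob (with year) x sex = 58,400

# Median county population quoted above: roughly 26,000.
median_county = expected_unique_fraction(26_000, BINS_PER_AREA)

# A large county of one million people, for contrast (illustrative figure).
large_county = expected_unique_fraction(1_000_000, BINS_PER_AREA)

print(round(median_county, 2), large_county)
```

Under these assumptions a majority of people in a median-sized county would be unique, while almost nobody in a very large county would be. Since much of the population lives in large counties, a national figure somewhere in between is at least plausible; with only 732 bins (month and day alone) the result would be close to zero, which is what the comment computes.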
"" Good luck coming up with a large mountain of evidence to put me somewhere else."
Wait until you sleep somewhere alone, steal your CC, and, if you really want to be not-too-subtle, steal some clothing you just wore but did not wash. Go use the CC near the crime scene. Use the old clothing to snag a small piece on a nail or other ragged surface. One should suffice. Steal the mobile phone too, and the driving license. Commit your crime. "Lose" the driving license nearby. Send a text message. Put everything back as quickly as possible in the apartment of the sleeping person. Et voila. You have no alibi, and your super system to show where you are just got turned against you.
C. Sagan : A demon haunted world:
http://www.amazon.com/gp/product/0345409469/
visit randi.org
I worked for a big insurance company in Europe, and our testing data sets were supposed to be anonymized. Well, one quick google search revealed they weren't. The company that did the anonymization thought it would be enough if the phone numbers and emails were false. Thank god the data sets were only used for internal development, but nevertheless we had access to very personal data, since the company also did health insurance. If we hadn't intervened, they would have used the same method to create data sets for load testing, and that would have meant hundreds of thousands of insurance records barely scrambled available to developers (well, we had NDAs and stuff) and also to lots of testers, usually underpaid students.
And this is only one of a few occasions I encountered such things. Don't get me started about banks and IT...
AC for a reason.
But is it Sufficiently Innovative and Non-Obvious?
Fact: The more information you give away about yourself, the more useful it will be to the one having access to it but the potential for abuse increases.
The huge problem we already experience today is the loss of control you have over your provided data. Who has access to what information? Right now, do you know on how many servers your birth date is stored? I certainly do not, but in my case I estimate it goes into the hundreds. How many of these copies of my birth date are actually used/needed on those servers? I guess the number is very low.
I don't really care about my birthdate, but what about my name, home address, credit card number, social security number? Do you know?
Hypothesis: In a perfect world without corrupted minds a secure, centralized server (or server farm) where all your data is stored (including personal data like bank account balance), never given away and accepts verification/change requests from the outside could solve all problems at once. The only public data would be Unique ID's associated with 1) personal data and 2) an operational code, eg verify age is >18 and decrease account balance by 100. Only you are able to approve associations and give away those UID's. There you go, centralized control over your data and what happens to it. Problem solved.
Unfortunately we're already too far down the road and have to live with the current mess, probably forever.
They (and many other "social" web sites) ask for exactly those 3 things. They have millions of users who post all sorts of stuff (mostly crap).
This presumes that one has to store the full birthdate. If the research requirement calls for knowing a person's age, then storing only the birth year would presumably take an enormous amount of precision out of the data, and make actual identification much harder.
Watch out 4chan
I'd rather search for the answers than just ask the questions.
Who could have seen it coming that Dr. Ohm would meet with so much ... resistance? "rimshot".
I send hundreds of patient records OUTSIDE our hospital every day. These records INCLUDE admission time and date, patient's town, state, zipcode, and birthDATE (with year).
How do I get around all those pesky HIPAA laws? I don't--I'm pretty sure this system is illegal. It is however REQUIRED by our state's government.
Oh, you thought laws to limit government power are actually followed? Now I see your confusion....