Anonymous Cowards, Deanonymized
mbstone writes "Arvind Narayanan writes: What if authors can be identified based on nothing but a comparison of the content they publish to other web content they have previously authored? Narayanan has a new paper to be presented at the 33rd IEEE Symposium on Security & Privacy. Just as individual telegraphers could be identified by other telegraphers from their 'fists,' Narayanan posits that an author's habitual choices of words, such as, for example, the frequency with which the author uses 'since' as opposed to 'because,' can be processed through an algorithm to identify the author's writing. Fortunately, and for now, manually altering one's writing style is effective as a countermeasure."
In this exploration the algorithm's first choice was correct 20% of the time, with the poster being in the top 20 guesses 35% of the time. Not amazing, but: "We find that we can improve precision from 20% to over 80% with only a halving of recall. In plain English, what these numbers mean is: the algorithm does not always attempt to identify an author, but when it does, it finds the right author 80% of the time. Overall, it identifies 10% (half of 20%) of authors correctly, i.e., 10,000 out of the 100,000 authors in our dataset. Strong as these numbers are, it is important to keep in mind that in a real-life deanonymization attack on a specific target, it is likely that confidence can be greatly improved through methods discussed above — topic, manual inspection, etc."
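The arithmetic behind those figures can be checked with a few lines of Python. This is only a sketch built from the numbers quoted in the summary (100,000 authors, 20% baseline accuracy, 80% precision at 10% recall); the function name is made up for illustration and nothing here comes from the paper's code.

```python
def attack_stats(total_authors, precision, recall):
    """Return (correct, attempted) counts implied by precision and recall.

    recall    = correct / total_authors   (share of all authors identified)
    precision = correct / attempted       (share of guesses that are right)
    """
    correct = recall * total_authors
    attempted = correct / precision
    return round(correct), round(attempted)


# Baseline: the algorithm guesses for every author and is right 20% of the time.
print(attack_stats(100_000, precision=0.20, recall=0.20))

# Thresholded: it only guesses when confident, so recall is halved to 10%.
print(attack_stats(100_000, precision=0.80, recall=0.10))
```

Read this way, "halving of recall" means the algorithm stays silent on most authors: it makes about 12,500 guesses instead of 100,000, and about 10,000 of those guesses are right.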
First! Analyze this anon comment, suckers!
This book is a very interesting read on this very topic: http://www.amazon.com/Author-Trail-Don-Foster/dp/0805063579
I merely function my observations by means of a thesaurus.
Clinton Ebadi, take that for de-anonymisation ...
How bout y'all mind your own business instead of breaching the basic expectation of privacy?
This is, of course, not really new.
A couple of years ago, there was some news (cannot find the link now) that some researchers tried this with a more statistical approach. As an implementation they used a compression algorithm.
I had a try with this on a forum. Somebody posted a long story anonymously, but I suspected the author. I gathered 10 posts from 5 authors, including the suspect. Then I cut the amount of text to equal length. Subsequently I added the anonymous text to each of the 10 samples and bzipped the resulting text.
The resulting zipped file was shortest in the case where I added the unknown text to the samples from the suspected author. The bzip algorithm apparently decided there was more similarity between the posts.
Although this was by no means a real scientific test, I turned out to be correct and was rather pleased with the result. Seems to me such an approach could also be useful for other things. Why log in on /. when it can just figure out who you are based on what you have just written?
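For anyone who wants to repeat the experiment, here is a rough sketch of the compression trick in Python, using the standard bz2 module. The function names and the dictionary layout are assumptions for illustration. One small variation from the description above: instead of comparing total compressed sizes, it measures how many extra bytes the unknown text adds to each author's sample, which removes the effect of some samples simply compressing better than others.

```python
import bz2


def compressed_size(text: str) -> int:
    """Size in bytes of the bzip2-compressed text."""
    return len(bz2.compress(text.encode("utf-8")))


def extra_bytes(reference: str, unknown: str) -> int:
    """Bytes the unknown text adds when compressed together with a reference sample."""
    return compressed_size(reference + "\n" + unknown) - compressed_size(reference)


def rank_candidates(samples: dict, unknown: str) -> list:
    """Rank authors by how cheaply their sample 'absorbs' the unknown text.

    samples maps an author name to that author's concatenated posts.
    All samples are cut to the length of the shortest one first.
    """
    limit = min(len(text) for text in samples.values())
    scores = {
        author: extra_bytes(text[:limit], unknown)
        for author, text in samples.items()
    }
    # Smallest increase first: the most similar writing style, by this measure.
    return sorted(scores.items(), key=lambda item: item[1])
```

Call it as `rank_candidates({"suspect": "...", "other": "..."}, anonymous_text)`. With only a handful of short posts the differences will be a few bytes at most, so treat the ranking as a hint, not as proof.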
To maintain anonymity you would just have to insert random shit into your posts.
Bonus points for the slashdotter who can deduce my identity based on the non-randomness of this post.
Would that be enough, however? I fear, though, that this might be the new handwriting analysis craze. Still, each person has quirks to their writing to some degree. For one, I think my usual quirk stands out quite well, yeah.
I exaggerated it for the sake of making it obvious. I wonder how good this system is at picking up things like this. Meaning, if I started talking like this:
Yo dawg, the meta-battle between anons and the man is heating up. Cool story bro, but we need fight this now. Our privacy is in danger of being shot down like a clay pigeon at a shoot out, yeah?
by Anonymous Coward: I, for one, welcome the shift from car analogies to pizza analogies. um.. overlords?
If it can identify you based on your idiosyncrasies, I suppose that means writers could use software based on these techniques to identify the idiosyncrasies in their own writing. From there, they can learn new ways to express themselves and write in a more colorful and varied manner.
Heck, it can even be a tool that teaches you to think in a more varied manner.
Democracy Now! - your daily, uncensored, corporate-free
But my humor is pretty unique so I guess you could track me through that.... but why?
So now they will write an application that accepts text and runs it through a re-anonymizer that uses a thesaurus/dictionary/translator to scramble the author's habits and make them impossible to detect. Or, even better, one that can determine the habits of some unsuspecting blogger and format the messages to make him look like the guilty party.
It's just another damn radar detector, detector, detector, detector.... detector!
Any stats on false positives?
But can someone explain what is meant by "halving of recall"? I can speculate on the Wikipedia link. I've tried searching via Google only to come back to the article mentioning it. But the phrase doesn't make a whole lot of sense to me. Do they mean to narrow down the potential correct answer by subsequent guesses? That is, eliminate half of the incorrect answers then proceed again?
just mix up teh syntax and add extra words as chaff BABA BOOEY! BABA BOOEY! (click)
....work on those in government creating fake identities for spying and provoking things that help them justify their pointless jobs.
Looking at the percentages... Hmmmmm...
Didn't the father of that girl who was the victim of that collar bomb hoax in Australia run a company which sells software which does stuff like this?
I am Spartacus!
sorry, bad pun...
Couldn't resist it.
Anonymous Cowards, Deanonymized. Posted by Unknown Lamer.
How many times is this going to be 'discovered' and featured on the front page of Slashdot? It's old news. We get it. No need to publish another story on the topic, there's been one a quarter or so for years.
I'm currently working on filters to do serious substitutions on text. Sort of like the set of filters that is available for *nix systems (do a "man filters" for more info).
Except that instead of humorous substitutions, it would do things like changing Britishisms to Americanisms (e.g. "colour" -> "color"), mess with spelling and grammar (e.g. "grammar" -> "grammer" and "who's" -> "whos") and similar. Now I guess I'll also have to add "since" -> "because" and "that" -> ", which" and others.
If you have suggestions for similar substitutions, please add!
E.g.
Orig: I'm currently working on filters to do serious substitutions on text.
Pirate: I'm currently workin' on filterrrrs t' do serious substitutions on text.
NYC: I'm currently wawhkin' on filters tuh do serious substitushuns on text. Okay?
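A minimal sketch of what such a "serious" filter could look like in Python. The rule table is hypothetical and only contains pairs mentioned in the post; a real filter would need a much larger table and some awareness of context, since blindly swapping "since" for "because" breaks sentences where "since" refers to time.

```python
import re

# Hypothetical rule table (lower-case keys); the pairs come from the post above.
SUBSTITUTIONS = {
    "colour": "color",
    "since": "because",
}


def normalize(text, rules=SUBSTITUTIONS):
    """Replace whole words according to rules, keeping an initial capital."""
    pattern = re.compile(
        r"\b(" + "|".join(map(re.escape, rules)) + r")\b",
        re.IGNORECASE,
    )

    def swap(match):
        word = match.group(0)
        replacement = rules[word.lower()]
        # Preserve sentence-initial capitalization.
        return replacement.capitalize() if word[0].isupper() else replacement

    return pattern.sub(swap, text)


print(normalize("Since the colour faded, I left."))
```

The word-boundary anchors keep the filter from touching substrings inside longer words, and the table can be extended without changing the code.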
Enemies of socialism found out because of/since hadoop analysis of Slashdot post!
If you're stupid enough not to change your posting style when trolling, that's your own bad.
is not the best way to keep a stable system / bandwidth, recently
Slashdot, fix the reply notifications... You won't get away with it...
Even going back to the day when forums on the internet were email lists, there would always be some immature person who was a regular who switched aliases.
It would be so obvious from their pattern of writing that their new alias was about as effective a disguise as putting on dark glasses.
OMG it can identify 1,000,000,000,000,000,000 out of 100,000,000,000,000,000,000 too!
... use a thesaurus.
So we should all post anonymously to this thread and see if we can be identified?
Sounds like a challenge.
Anybody know who I am?
What's interesting is that you'd want to. First we loathe Google and other companies for treating security trivially, then we start developing algorithms for rooting out the anonymous. It goes to show that we're all for something until we find a disagreement with it. As soon as an AC says something we disagree with and if we do so to a certain level of passion, we'll cede the moral high ground for the rich, creamy goodness of revenge.
60% of the time, it works every time.
Gem from a lost soul in my childhood:
"What was it when for you said there was maybe like a lot of there but there wasn't and you knew it?"
My first Journal Entry ever, in 8 years! http://slashdot.org/journal/365947/aphelion-scifi-fantasy-horror-poetry-webzine
I seem to remember a story about a team of Chinese researchers that did this about a year ago.
Well if you think about it the REALLY scary question is not how well this works but whether the courts will accept it. Anybody remember bullet fingerprinting? That was where they supposedly could match a bullet to a specific batch so they could tell if a bullet came from a certain pack of shells or not? We all know now it was total bullshit and that variations even in the same lots could be pretty wide simply because the bullet manufacturers simply weren't that anal retentive about purity as long as the round went straight but that junk science put untold numbers of people in PMITA prison.
Now what if the courts accept this as evidence? Some troll could copy pasta phrases from your actual posts and stitch them together to make them say something else, and if they can trip this thing, all this technobabble, like bullet fingerprinting, sells REAL well to juries who sit around watching CSI. Frankly after false flags like fast and furious I wouldn't even trust the feds not to decide to "frame the guilty man" or decide you must be the guy so make the evidence fit. Frankly this is why shows like CSI scare me, all this technobabble sells well to juries who frankly don't understand WTF this crap is, only that it looks high tech like something from CSI therefore it MUST be true.
ACs don't waste your time replying, your posts are never seen by me.
If it's all automated, writers don't need to learn new ways to express themselves. Software can do that for them!
I don't always attempt to identify an author, but when I do, I find the right author 80% of the time.
I think this can easily be avoided.
Considering a random association will be 50% accurate (right or wrong), the algorithm is doing a lot worse than a coin flip.
That's gay.
I was going to guess either Tom Womack or Baldrson, but I'm out of time and I don't think I'm right.
Schools already use programs like "White Smoke" (http://www.whitesmoke.com/) and "Style Writer" (http://www.stylewriter-usa.com/) to identify grammar errors and stylistic errors, and suggest corrections. These programs are able to identify active and passive voice, clarity and readability of writing, ambiguous words, gender specific words, cliches, and more. I'm not sure the use of such software is such a great idea. I guess it's OK as long as a teacher reviews the results. Then again, if the teacher doesn't do as good a job as the program does...
I do not mind. The most important reason that I post as an AC is that I find it too much of an effort to maintain hundreds of accounts to be able to post on frivolous websites like /.
This is why I practice non-redundancy. Redundancy is too redundant, so constantly repeating words and/or redundant phrases becomes a redundant factor in helping people to determine who you are on the internet when you post as an anonymous coward redundantly.
Remember, kids, practice redundant privacy measures to ensure you will never be exposed.
Sometimes I wish I had multiple personality disorder YOU'LL NEVER EXPOSE ME YOU BASTAGES
Lord, what fools these mortals be!
What's in a name? A blinking idiot! That which we call a rose by any other name would smell as sweet.
And thus I clothe my naked villany with odd old ends stol'n out of holy writ, and seem a saint, when most I play the devil.
...are those with multiple personalities immune to this sort of detection? :P
What do I know, I'm just an idiot, right?
Damn! I'll have to stop using floxinoxinihilipilification so much in my anonymous posts or people will know it's me!
Using the logic proposed in the article- can we assume that all the anonymous cowards using "the other f word" are all Samuel L Jackson?
"That's the way to do it" - Punch
Difs ho boy ne a rel jb lk klen n toulets dtn NuYk.
Just as individual telegraphers could be identified by other telegraphers from their 'fists,'
...anonymous posters can be identified by their Frists?
#naabhaprzrag, #sverubfr-000, #agi-fcbafberq, negvpyr[pynff*=' negvpyr-ary-'] { qvfcynl: abar !vzcbegnag; }
Oswald, anyone criticising you is practicing floccinoccinihilipilification. You are so wonderfull in every way. If only everyone were like you. .. the observant will notice a difference in spelling. ... the more observant will wonder if that were deliberate.
Tell us something new. This is how they caught Kaczynski the Unabomber. Analysis of word choice, word frequency, and sentence structure can and will identify you. And? Identifying a single person from their many anonymous messages online leads you back to anonymous.
Aside from that it's easy enough to alter your writing to fool the analysis if you want to. Please tell us something new that every single person on Slashdot doesn't already know.
around 2,000 users
#1. the smaller the town , the pettier the politics
#2. there is one user we keep banning, and they keep coming back under a new name, and you can always tell with 100% accuracy that it is the same person, based on sentence cadence and agenda, and overall personality and attitude
intellectual property law is philosophically incoherent. it is your moral duty to ignore it or sabotage it
I've mentioned this before, but it's worth repeating as more and more services no longer use their own identity systems, relying instead on Gravatar, or doing away with their own comments system by relying on Disqus (which uses Gravatar).
In the case of sites using Gravatar incorrectly*, which is pretty much all of them, 'anonymous' posts still have their Gravatar ID attached - which is just an MD5 of the person's e-mail address. All you then need to do is find that same MD5 on another site where the author opted not to post anonymously.
The main reason this ties into the story at hand is in getting reference material together. With e.g. Disqus, you can be reasonably assured (unless account sharing occurs) that the anonymous post with MD5 X on site A is authored by the same person as that of the anonymous post with MD5 X on site B, and you can include both in the pool of reference material.
( This also means there are issues with anonymity even if the author always posts 'anonymous'. )
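To make the linking step concrete, here is a small Python sketch. It assumes the classic Gravatar scheme (MD5 hex digest of the trimmed, lower-cased e-mail address); the tuple layout, site names and addresses are invented for the example, and a real attacker would scrape the hashes from avatar URLs in each page's HTML instead of computing them.

```python
import hashlib
from collections import defaultdict


def gravatar_hash(email: str) -> str:
    """MD5 hex digest of the trimmed, lower-cased address (classic Gravatar scheme)."""
    return hashlib.md5(email.strip().lower().encode("utf-8")).hexdigest()


def link_posts(posts):
    """Group (site, display_name, avatar_hash) tuples by avatar hash.

    Returns only the hashes that show up more than once, i.e. the posts
    that can be tied to each other without knowing the e-mail address.
    """
    by_hash = defaultdict(list)
    for site, name, avatar_hash in posts:
        by_hash[avatar_hash].append((site, name))
    return {h: seen for h, seen in by_hash.items() if len(seen) > 1}


# Made-up example: an 'anonymous' post on one site, a named post on another.
posts = [
    ("site-a", "Anonymous", gravatar_hash("alice@example.com")),
    ("site-b", "alice", gravatar_hash(" Alice@Example.com ")),
    ("site-b", "bob", gravatar_hash("bob@example.com")),
]
print(link_posts(posts))
```

Note that nobody has to reverse the MD5 for this to work. Matching the same hash across two sites is enough to merge the posts into one pool of reference material.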
* The worst part of this is the website owners. Aside from letting anonymous posts still grab their results from Gravatar (even if you don't have a Gravatar 'account', the e-mail address you use will be the MD5 in the HTML), some sites implement Gravatar as an afterthought. You could have been posting to a site for years behind a pseudonym, knowing that you're reasonably anonymous - and then find your pseudonym, and all the posts made, linked to other posts at other sites because the website owner decided to use Gravatar to display users' avatars of choice, using the e-mail address in their account.
Gravatar is a useful service, especially in that the website can save some bandwidth, and the users who do want it can just update a single avatar and have that immediately be used on any site that uses the service.
But I implore webmasters to consider seriously the ramifications of using Gravatar or Disqus, and at least:
1. Disallow Gravatar on posts, profiles, etc. that were created before your implementation of Gravatar.
2. Create an opt-in system for the use of Gravatar, per-profile.
3. Disable the Gravatar code when the post author has indicated that they want to post anonymously.
4. If implementing Disqus, make clear that its service may not adhere to your site's own privacy policies, and posting anonymously is a façade.
Much the same applies to other login, profile, and comment consolidation/aggregation/syndication systems (such as Facebook's), but especially in the case of Gravatar (which requires no user interaction such as a login or existing valid login state), it is all too easy to think only of the benefits.
Don't a good chunk of NLP courses begin with building Markov chains of people's word usage to identify, for example, which authors of the Federalist Papers wrote which ones? This isn't just old, it's really, really old, like when I was in school old.
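For those who never took such a course, the classroom exercise looks roughly like the Python sketch below: build a word-bigram Markov model per known author, then see which model assigns the disputed text the highest probability. Everything here (class name, tokenizer, add-one smoothing) is an illustrative assumption, not the method from the paper or from any particular textbook.

```python
import math
import re
from collections import Counter


def tokens(text):
    """Lower-case word tokens; punctuation is dropped."""
    return re.findall(r"[a-z']+", text.lower())


class BigramModel:
    """First-order Markov chain over words, built from one author's text."""

    def __init__(self, text):
        words = tokens(text)
        self.vocab = set(words)
        self.heads = Counter(words[:-1])               # times a word starts a bigram
        self.bigrams = Counter(zip(words, words[1:]))  # (previous, current) counts

    def log_likelihood(self, text, vocab_size):
        """Log-probability of the text's word transitions, with add-one smoothing."""
        words = tokens(text)
        total = 0.0
        for prev, cur in zip(words, words[1:]):
            numerator = self.bigrams[(prev, cur)] + 1
            denominator = self.heads[prev] + vocab_size
            total += math.log(numerator / denominator)
        return total


def attribute(samples, disputed):
    """Return (best_author, scores) for a disputed text.

    samples maps an author name to known writing by that author.
    """
    models = {author: BigramModel(text) for author, text in samples.items()}
    vocab = set(tokens(disputed))
    for model in models.values():
        vocab |= model.vocab
    scores = {
        author: model.log_likelihood(disputed, len(vocab))
        for author, model in models.items()
    }
    return max(scores, key=scores.get), scores
```

With realistic amounts of text the samples should be of comparable size, otherwise the author with the most training text tends to win by default.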
questions answered frequently.
- Pavlov's Shell
I'll take that in Bayesian filter format, please.
Aww hell that ain't nuthin, come down to the delta bottoms sometime and you'll find you'll probably need a translator to understand your fellow Americans. Hell, I've lived here all my life and even I need a translator when I get around Yazoo MS. There you'll find that ALL adjectives are replaceable with "Iz/Be/Been", so a sentence like "I iz be fixin ta get around ta doin what you axked, but I be busy, probably tamarraw" would be considered perfectly acceptable and actually more understandable than most. What some people call ebonics is nothing but the lower delta bottoms natural speech, and when it gets mixed in with Mexican and Creole slang good fricking luck understanding their asses.
Thanks for pointers to those programs (I'll try the free stylewriter soon) - I've wondered about programs like that ever since I tried my hand at sentence generators back in high school.
Have you tried these products and/or do you work in or have you benefited from that area of software development? (it's been too long - how do I PM you on slashdot??)
8-PP
I don't know why this is such a big deal. It is very easy to beat.
If you are right-handed, type with your left hand and if left-handed type with your right hand.
Easy!
Go all T.S. Eliot on their asses and build your posts entirely out of things other people have said. First post overlord gritsneal!
Sendou Wave Kick!!
This technology doesn't work. You can't isolate an author. You can only isolate word patterns. Word patterns, while a persuasive idea, are not unique. Unless you want to assume that "first post!" is all one guy. Maybe it's all good ol Max Metro after all, but I doubt it. Thing is, you'll never know any differently.
Stylometry on Wikipedia. Some linguists have been doing it for years and in some cases with more success, but apparently it's only newsworthy when someone outside of linguistics writes about it. (Why yes, I'm a linguist. How did you know?)
"Live free or don't."
There's a simple solution I reckon. Write your piece in English. Translate it to Arabic using Google or something. Translate it back to English.
Reminds me of a joke I heard. This procedure was used to translate the expression "Out of sight, out of mind". The dual translation result: "Invisible Idiot".
HB Gary had poor security.
An SQL injection into their website's custom-built CMS, which didn't salt any hashed passwords. One of the recovered passwords was also used for an email account (a lesson in not reusing the same password). Anonymous then logged into the email account and sent an email to the system administrator asking for the server's root password, in plain text email I might add. So the server's root password was emailed back to the attacker in the compromised account, and the rest is history.
That's all I can recall from memory, but HB Gary had so many security flaws they really were begging for it.
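For anyone wondering why the missing salt matters, here is a generic Python sketch using only the standard library. It is not a reconstruction of that CMS (I don't know which hash it used); the SHA-256 in the "unsalted" function and the iteration count are assumptions chosen for the example.

```python
import hashlib
import hmac
import os


def unsalted_hash(password: str) -> str:
    """What NOT to do: identical passwords always give identical hashes."""
    return hashlib.sha256(password.encode("utf-8")).hexdigest()


def hash_password(password: str, salt=None, iterations: int = 200_000):
    """Salted, deliberately slow hash (PBKDF2-HMAC-SHA256).

    A fresh random salt per user means precomputed tables are useless
    and two users with the same password get different digests.
    """
    if salt is None:
        salt = os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, iterations)
    return salt, digest


def verify_password(password: str, salt: bytes, expected: bytes, iterations: int = 200_000) -> bool:
    """Recompute the digest with the stored salt and compare in constant time."""
    _, digest = hash_password(password, salt, iterations)
    return hmac.compare_digest(digest, expected)
```

Store the salt next to the digest. It does not need to be secret, it only needs to be different for every account.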
An SQL query goes to a bar, walks up to a table and asks, "Mind if I join you?"
At least you're coherent!
Sounds also like the Jamaican dialect among the seasonal workers in my area. I'll sign off with:
"Me ha' go' way now co' me ha' simtn' fi' fix eh moi sumtn' fin yaum."
(aka "Catch ya later, I have to go fix a bunch of $hit and I'm fscking hungry." in /. lingo.)
While it has some similarities to Jamaican migrant workers, it's a LOT more slang heavy. In fact, what really makes it a bitch is that depending on the region you may have as many as FOUR different kinds of slang mixed in! You'll have black slang, poor white trash slang, and in MS you'll often get Creole slang mixed in there as well. I'd give you a sample but I'm afraid I wasn't joking, I actually DO have to have a translator if I stop around Yazoo, it's too slang heavy. At least with the migrant workers it's usually English they are mangling; with bottoms talk they are mangling Mexican slang, Creole slang, as well as white and black English. Hell, if you go down by chemical row they even have some African slang from the Gambian workers they have down there. It's so mangled it IMHO is more of a mess than a language.
It's a fascinating algorithm.. and I betcha it works.. just the consistent mis-spelling of woords, and capitalization Errors , plus idiosyncratic placements, of the commas.. would likely bring the "suspect" to light.. unless of course, the errors were .. intentional !
The author (Arvind Narayanan) writes a paper and then creates a story on /. linking to his paper.