Linguistics Identifies Anonymous Users
mask.of.sanity writes "Researchers have examined writing styles to identify previously anonymous carders and hackers operating on underground forums. Up to 80 percent of users who wrote at least 5000 words across their posts could be identified using linguistic techniques. Techniques such as stylometric analysis were used to track users who posted across different forums, and could even be used to unveil authors of thesis papers or blogs who had taken to underground networks."
Anonymous First Post... you'll never guess who I am
They know who I am. I will now have to type in random styles.
Little do you know the AC that posts here is in fact just one person.
Wait, 5000 words? I think I'm safe.
wg wgsedg wsewef awe fasd fsefawe fgwagasdg wae fasdf wsef awef sd fas fawe
Who am I?
Anonymous hackers now using tools to scramble their writing style so they stay anonymous.
I worked for a smallish (but not incredibly tiny, maybe 100 employees) company and wrote a letter to the CEO once. We'd been castigated by someone who'd taken over the local office because the company was doing poorly. A number of austerity measures were implemented. I did not find those to be that annoying because I realized it was either that or not have a job. But the castigation didn't sit well with me. We were in trouble because of the decisions of a few bad managers, not the behavior of average employees.
So I wrote a letter about it. He stripped my name off and presented it in an executive meeting to all the people directly under him. He asked "Why am I getting letters like this?". Everybody who worked in my office immediately knew who it was. I had a distinctive writing voice, and a strong reputation.
It did not lead to me being fired. I was actually highly respected there. It led to me being encouraged to have an honest sit-down talk with the new manager for our division (the guy who'd made the speech I wasn't happy about). I think we both came away from that meeting a lot happier about the other.
But that was a strong lesson to me. If I ever really want to be anonymous I'm going to have to purposely work on adopting a completely different writing style. And I will have to keep a wall up between styles and never 'slip'.
Need a Python, C++, Unix, Linux develop
In addition to these metrics, others can be added as well, e.g. post date, size, tabulation, punctuation, capitalization, regional vocabulary, etc. You can also add frequency-space analysis or naive Bayesian filters to increase precision, or to probe against other texts. Anyone interested in investing in text-rewriter technology to both detect similarities and automatically rewrite?
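For anyone wondering what counting markers like punctuation and capitalization looks like in practice, here is a minimal sketch. It is not the researchers' method. The function-word list, the feature names, and the choice of cosine similarity are all my own assumptions for illustration.

```python
import math
import re
from collections import Counter

# Assumption: a tiny hand-picked function-word list; real studies use far more.
FUNCTION_WORDS = ["the", "a", "an", "of", "and", "to", "in", "that", "is", "i"]

def style_features(text):
    """Return a dict of simple style markers for one text."""
    words = re.findall(r"[A-Za-z']+", text)
    n_words = max(len(words), 1)   # avoid division by zero on empty input
    n_chars = max(len(text), 1)
    counts = Counter(w.lower() for w in words)
    # Relative frequency of each function word.
    feats = {f"fw_{w}": counts[w] / n_words for w in FUNCTION_WORDS}
    feats["avg_word_len"] = sum(len(w) for w in words) / n_words
    feats["comma_rate"] = text.count(",") / n_chars
    feats["upper_rate"] = sum(c.isupper() for c in text) / n_chars
    return feats

def cosine(a, b):
    """Cosine similarity between two feature dicts (1.0 means same direction)."""
    keys = set(a) | set(b)
    dot = sum(a.get(k, 0.0) * b.get(k, 0.0) for k in keys)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0
```

In this raw form avg_word_len swamps the other features, so you would want to scale each feature before comparing anything for real.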
You could always type in Gangnam Style!
My first Journal Entry ever, in 8 years! http://slashdot.org/journal/365947/aphelion-scifi-fantasy-horror-poetry-webzine
It just seems brilliant to me, but I can tell you first hand that how I talk in my thesis papers is very different than how I speak or come across in my C++ blog, beer brewing forum, car forums, fishing forums, my backyard BBQ blog, etc, etc, etc. I wonder what the accuracy rating is.
I'd be rather surprised if someone else couldn't.
"Leetspeak, an alternative alphabet popular in some forum circles, cannot be translated."
*sigh* does this mean I must resent people that use this form of communication less?
I'm not so sure I can stoop so low.
This is so bad I don't know where to begin. There is nothing, ever, that excuses this. For every zodiac crazy serial killer or copyright scofflaw they try to apply this to (and fail) there will be thousands and thousands of people that will be persecuted by organizations and governments for expressing their opinions. While this won't have a big effect in the West for half a generation, oppressive governments are going to be all over this.
And then, in ten or fifteen years, the youth will have grown with this technology and become accustomed to it...accepting it. Just like facebook has been accepted.
I'd move to Mars when it's possible but some bureaucrat will analyze everything I've ever written on the interwebz (and I've been mostly not stupid about shit I've written online since 1995 or so) and make some arbitrary decision about how I'm not acceptable because I'm not a huge fan of authority or some such crap.
Way to go humanity.
"Leetspeak, an alternative alphabet popular in some forum circles, cannot be translated."
Looks like leetspeak actually has a use now: H4KK3rZ N 5Kr1p7 K1dd13z r3j01(3!!! LOLZ!
At least until they integrate OCR in to the software. Then it's useless again.
One way to change a bunch of the stylistic cues would be to convert your message to another language and back using Google Translate. Varying the intermediate language(s), and possibly using different translators, should neutralize some things.
and could even be used to unveil authors of thesis papers or blogs who had taken to underground networks.
... a good reason to do it like zu Guttenberg then... Nobody will tie any of his underground writings to his thesis...
Isn't this just the same software that colleges use to detect plagiarism and whether someone else wrote that essay for you? I thought it was in common use in academia.
The best is the enemy of the good
Hurry up, someone write a small shell script that maps all AC posts on slashdot to their respective authors!
I can conclude that Mr Peter "W.H. Smiths, the book store" used the highly efficient MS HTML (in Word et al.) converter to write that analysis page.
Well, I'll write everything in the style of my enemies from now on then.
double ROT-13 all my posts from now on!
The best weapon of a dictatorship is secrecy, but the best weapon of a democracy should be the weapon of openness.
I just love our wonderful friends, the Thought Police. They save us thoughtless boobs from the evil thought terrorists waging their unholy jihad against us using weapons of mass thought destruction!
Pass the beer and chips, I think American Idol is on. Oh wait, that can't be right. I can't think.
Pad all communications with cut/paste from various, unrelated news articles and such, for and aft, randomly alternating how much is padded on each side.
Or, you can do what I do and use a different font for each letter.
Why all the civil-liberties hand-wringing? Just how hard is it to read some of the papers on stylometric analysis to see what markers are used, then write a script that randomises them but preserves the sense of the text. Make it a Firefox plugin so it's done automatically. It's a better solution than using Google translate to go English to $language, $language to English.
For extra fun, change your text so its stylometric markers match up with E. L. James, or the leader writer of the Washington Post.
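To make the "write a script" idea concrete, here is a rough sketch. Note that it normalises a few markers rather than randomising them, which is a simpler swap-in. The contraction table and the function name flatten_style are made up for illustration, and "it's" is blindly expanded to "it is" even where it means "it has".

```python
import re

# Assumption: a tiny illustrative contraction table, nowhere near complete.
CONTRACTIONS = {
    "don't": "do not",
    "can't": "cannot",
    "won't": "will not",
    "it's": "it is",
    "i'm": "I am",
}

_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(c) for c in CONTRACTIONS) + r")\b",
    re.IGNORECASE,
)

def _expand(match):
    """Expand one contraction, keeping a leading capital if there was one."""
    found = match.group(0)
    out = CONTRACTIONS[found.lower()]
    return out[0].upper() + out[1:] if found[0].isupper() else out

def flatten_style(text):
    """Remove a few easy style markers while keeping the words' sense."""
    text = _PATTERN.sub(_expand, text)
    text = re.sub(r"([!?.])\1+", r"\1", text)  # "!!!" -> "!", "..." -> "."
    text = re.sub(r"\s+", " ", text).strip()   # collapse runs of whitespace
    return text
```

This only touches contractions, repeated punctuation and spacing. Word choice and sentence length, which the papers lean on heavily, are left alone.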
is Ebonics!
First rule of assuming a different identity: become the other personality. Develop different speaking patterns, writing style, habits, diet, associations. Keep separate from your other life. Method acting to the extreme!
Alternatively look up "project Monarch" to see how the three letter agencies have refined this technique :)
The climate change community has a lot of trouble with extremely articulate, anonymous climate deniers, who appear to show up in force and sabotage discussions of climate change on blogs, etc.
I should imagine that such an algorithm might enable researchers to build profiles over denialist astroturf, and correlate them with known people working for known rightwing think tanks. Employed properly, this might have a massive impact on the rightwing black PR industry.
No need for 5k
This same story keeps cropping up in various forms, but we've been doing this at least since the 80s or 90s. I don't know why it keeps being rehashed or why people continually seem surprised by it at this point.
Since I am the same person I should not use the royal we now my cover is blown by me. I need to stop me doing that by telling me off.
"Up to 80 percent of users who wrote at least 5000 words across their posts could be identified using linguistic techniques. Techniques such as stylometric analysis were used to track users who posted across different forums, and could even be used to unveil authors of thesis papers or blogs who had taken to underground networks."
Not really new. I heard about these techniques a long time ago, in the mid 90s, in the context of an MS-DOS tool which, unintentionally, could foil the identification methods.
It was designed for the Russian and Belarusian languages (but for English I gather the task should be even easier) and was a byproduct of a Prolog-based system for natural language processing and translation. This particular program let you improve or change writing style, e.g. simplify dry legalese or formalize spoken-like text. It wasn't particularly good at it: the meaning was occasionally changed, or the reformulated sentences sometimes made no sense. But still, it did the job of obfuscating the original writing style.
All hope abandon ye who enter here.
After reading TFA I cannot find any convincing experimental validation. I see a lot of "can" and conditional tense (maybe that's the author's style), but nothing on the validation of the approach. Where is the experimental data, including the number of anonymous users correctly and incorrectly identified on forums?
They didn't identify 80% of the users, they managed to make a guess in 80% of the cases, which they didn't even bother to try to verify. There's no proof that their technique actually works.
So now hackers use software that randomizes their writing style. Problem solved. Then problem solved.
Sur jurst wrurte lurke ur furkurn rurturd urnd yur bur furn. Durrrrrrrrrp.
I regularly, like, totally change my typing method between posts.
You could like totally try and figure out who I was even if I typed 5000 words in this post, but you would totally never find me, ye'know what I mean?
But really, this sort of thing is retarded in every way.
You can frame people easily using this crap if you just pick a target and adopt their typing patterns.
This strikes me as akin to a Lie Detector. I think an honest court would side with the accused 100% of the time as even this cannot absolutely prove they were the author.
Though sadly, a Roberts/Scalia/Thomas Supreme Court would rule against such an individual and for the corporation or state security organs. Dicks.
I now Master Yoda get to my identity mask.
That was the sound of the joke going over your head!
captcha: comics
Aren't those cunning linguists clever? The answer always seems to be right on the tip of their tongue. They don't diddle around. They seem to be able to lick any problem.
Tiller's Rule: Never use a word in written form that you've only heard and never read. You will end up looking foolish.
LOL. OMG. w/e
can i get some funding for shady studies of unreliable techniques too
YOLO
LOL
I know that my writing style is pretty easy to nail. Both the use of words, and commas, and spacing. I'm very aware of it, and for this reason, never, ever, ever, ever, post on any forum. Ever.
- Expect Us
This isn't new stuff. In college in the late 80s, during the AI boom :-), I wrote a paper about using linguistic and stylistic analysis to analyze whether or not Shakespeare wrote certain texts attributed to him. I think the downside of this analysis will be the reverse, meaning that it will be used to analyze someone's stuff and create fake texts that could "frame" persons who did not write the text. Then again, it could also be used to add noise or create utilities that modify your writing style to make this analysis useless in a lot of cases. Hey, or a booming new career as a ghost writer/blogger/commenter for the rest of us!
The new hacker's skill.
What a bunch of bullshit.
I could count the number of times I laughed so hard on the fingers of one hand.
welcome our stylistic overlords
Any guest worker system is indistinguishable from indentured servitude.
A lot of anonymous posters intentionally write differently, including using styles of the opposite sex, in order to counter things like this.
http://www.liwc.net/tryonline.php
3 different posts or emails. Samples included sarcasm, a call to arms for political action to my family, and a proposal for a solution to a complex problem.
Results:
LIWC Dimension                Your Data  Personal Texts  Formal Texts
Self-references (I, me, my)        2.00            11.4           4.2
Social words                       4.00             9.5           8.0
Positive emotions                  0.00             2.7           2.6
Negative emotions                  2.00             2.6           1.6
Overall cognitive words            4.00             7.8           5.4
Articles (a, an, the)             12.00             5.0           7.2
Big words (> 6 letters)           32.00            13.1          19.6
The text you submitted was 50 words in length.
LIWC Dimension                Your Data  Personal Texts  Formal Texts
Self-references (I, me, my)        7.72            11.4           4.2
Social words                       7.40             9.5           8.0
Positive emotions                  1.61             2.7           2.6
Negative emotions                  2.89             2.6           1.6
Overall cognitive words            4.82             7.8           5.4
Articles (a, an, the)              8.04             5.0           7.2
Big words (> 6 letters)           19.61            13.1          19.6
The text you submitted was 311 words in length.
LIWC Dimension                Your Data  Personal Texts  Formal Texts
Self-references (I, me, my)        1.63            11.4           4.2
Social words                       7.62             9.5           8.0
Positive emotions                  2.90             2.7           2.6
Negative emotions                  0.36             2.6           1.6
Overall cognitive words            3.27             7.8           5.4
Articles (a, an, the)             11.62             5.0           7.2
Big words (> 6 letters)           25.77            13.1          19.6
The text you submitted was 551 words in length.
LIWC Dimension                Your Data  Personal Texts  Formal Texts
Self-references (I, me, my)        0.49            11.4           4.2
Social words                       0.99             9.5           8.0
Positive emotions                  0.49             2.7           2.6
Negative emotions                  2.96             2.6           1.6
Overall cognitive words            6.40             7.8           5.4
Articles (a, an, the)             11.33             5.0           7.2
Big words (> 6 letters)           33.50            13.1          19.6
The text you submitted was 203 words in length.
If you RTFA, the Chaos Computer Club presentation seems to be on the topic of "carders forums" & other graynet black hat communities like SEO.
They had an opportunity to train the algorithm using knowledge that would typically be exclusive to the level of access available to a server administrator, large ISP, or expert witness allowed forensic recovery on seized equipment as inputs.
Based on my dataset, it seems like these tactics are context sensitive and will have a margin of error proportional to the length of the positive match outputs with the sample text length used as the input functioning as a ceiling on certainty. This explains why they had to assist the neural network by pre-filtering the datasets. If this is done by a human, then it biases the outcome in the same way that a bingo RNG can be biased by an operator with their eyes open.
Please RTFA because this only reflects on my comprehension of the topic.
Big words (> 6 letters) 20.92 13.1 19.6
The text you submitted was 196 words in length.
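If you want to sanity-check numbers like these yourself, here is a rough sketch of how such percentages could be computed. This is not LIWC's actual dictionary or code. The word sets are just the examples from the row labels, and liwc_like is a made-up name.

```python
import re

# Assumption: word sets taken only from the row labels above, not from LIWC itself.
SELF_REFS = {"i", "me", "my"}
ARTICLES = {"a", "an", "the"}

def liwc_like(text):
    """Return word count plus a few category percentages for one text."""
    words = [w.lower() for w in re.findall(r"[A-Za-z']+", text)]
    total = len(words)
    if total == 0:
        return {"words": 0, "self_refs": 0.0, "articles": 0.0, "big_words": 0.0}

    def pct(n):
        # Percentage of all words, rounded like the tables above.
        return round(100.0 * n / total, 2)

    return {
        "words": total,
        "self_refs": pct(sum(w in SELF_REFS for w in words)),
        "articles": pct(sum(w in ARTICLES for w in words)),
        "big_words": pct(sum(len(w) > 6 for w in words)),  # more than 6 letters
    }
```

With a 50-word sample each word is worth 2 percentage points, which is why the first table above is all round numbers. Short texts give very coarse results.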
Aylin Caliskan and Rachel Greenstadt. Translate once, translate twice, translate thrice and attribute: Identifying authors and machine translation tools in translated text. Sixth IEEE International Conference on Semantic Computing (ICSC 2012). https://www.cs.drexel.edu/~ac993/papers/Aylin_ICSC_2012.pdf
Use Google Translate to make yourself anonymous. It's that simple.
I can identify an APK post based on linguistics.
There are tiny timing differences as one types. These are quite distinctive between individuals if you collect enough data. It's related to how an individual learns to type: motor memory of word-phrases versus typing a new word for the first time. Even the pattern of common typing errors and recovery is distinctive.
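A minimal sketch of the timing idea, assuming you somehow have (key, press time) pairs for a typing session. The event format, the function names, and the mean-absolute-difference comparison are my own assumptions, not any particular product's method.

```python
from collections import defaultdict
from statistics import mean

def digraph_latencies(events):
    """events: list of (key, press_time_seconds) in typing order.

    Returns the mean delay for each pair of consecutive keys (digraph).
    """
    delays = defaultdict(list)
    for (k1, t1), (k2, t2) in zip(events, events[1:]):
        delays[k1 + k2].append(t2 - t1)
    return {pair: mean(vals) for pair, vals in delays.items()}

def profile_distance(p, q):
    """Mean absolute latency difference over digraphs both profiles share.

    Returns None when the two profiles have no digraph in common.
    """
    shared = set(p) & set(q)
    if not shared:
        return None
    return mean(abs(p[d] - q[d]) for d in shared)
```

A smaller distance means more similar typing rhythm. As the parent says, you need a lot of data before this means anything.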
The calc to know who i am gets harder when i choose most words from Basic English. I may not sound so school-smart when i'm forced to leave out the clique words, but it sure makes me change my style. I find it fun to do as well. I learned it from those Brit guys who sang all those songs and from Hemingway.
slashdotters also saw the lecture at the 29C3
Unabomber manifesto comes to mind.
Fuck Ajit Pai
I'm curious how this would apply to the Zodiac case. Oh wait, it doesn't:
* He used symbols in communication.
* Voice recognition didn't solve the case.
* DNA evidence didn't solve the case.
* Copycats functioned as noise, might've even given him credit.
WE DON'T NEED NO BLOG CONTROL.
I guess at least half the population like a Cunning Linguist ;-)
I think the Professor of Phonetics Henry Higgins in George Bernard Shaw's opening scene of Pygmalion (or My Fair Lady) could have told you this!
Tracy Johnson
Old fashioned text games hosted below:
http://empire.openmpe.com/
BT
Typing with your wrists crossed will boost your typo count for sure