Can Author Obfuscation Trump Forensic Linguistics? (webis.de)
An anonymous reader writes: Everyone possesses their own writing style, which may be used to identify authors even if they wish to remain anonymous: linguists employ stylometry to settle disputes over the authorship of historic texts as well as more recent cases, and are called to verify the authors of suicide notes or threatening letters. Computer linguists carry out research on software for forensic text analyses, and a recent study shows many of these approaches to be reproducible. Now, a competition has been announced to develop obfuscation software to hide an author's style with the task: "Given a document, paraphrase it so that its writing style does not match that of its original author, anymore." We'll see what comes out of that. Meanwhile, the question remains: Who will win in the long run? Forensic linguists, or obfuscation technology?
Want to obfuscate text? Just run it through a language or 5, then back to the original language using something like google translate. No paraphrasing needed.
Silence is a state of mime.
because of reasons which are not obvious and which I will not reveal although you already know them.
I read Trump as a noun and thought the title was nonsense.
Support my political activism on Patreon.
I doubt this is possible to do very well. Consider [1], where they were able to identify authors from compiled code. Not with close to 100% accuracy, but it's still surprising that your source code style is identifiable with optimization enabled and symbols stripped out.
[1] ftp://ftp.cs.wisc.edu/paradyn/...
Slashdot - News for Nerds, Stuff that Matters, in ISO-8859-1 Has just realised that beta makes this signature redundant
Back in the 1970s Stephen King wrote some novels under the pseudonym Richard Bachman. It worked for a while, but people were able to figure out that Bachman wrote in the same style as the famous Stephen King. Eventually the secret broke.
I wonder if those novels written under the pseudonym would make a good test of the system. Run them through the process, give the results to newer readers of King's known works, and see if they notice the similarities others did in the past.
If you think I voted for Trump because of this post, you're wrong. I voted for Dr. Jill Stein of the Green Party. Again.
If someone is serious about obfuscating their writing, they will be able to. Especially once they get access to the software that would be used to examine it.
However, most people are not going to even bother attempting to obfuscate.
Don't waste your vote! Vote for whoever you want, unless you live in a swing state it won't matter anyways
To quantify the degree of obfuscation, they have precise computational metrics based on their stylometric algorithms. But to judge the quality of the obfuscation, there are no objective metrics. Instead:
To measure soundness and properness, obfuscations will be sampled and handed out to participants for peer-review.
which seems to me to make the contest rather less meaningful. Why not just peer review the quality of all obfuscations exceeding some minimum standard?
As a trained linguist, though not an expert on forensic linguistics, I believe that successful automated obfuscation will win and be essentially unbeatable, but probably also detectable. By rewriting a text automatically, valuable information is destroyed that a forensic linguist has to rely upon. (When humans try to obfuscate text, on the other hand, they tend to add such information, potentially even making the task easier for the forensic linguist. For example, blackmailers commonly imitate foreign accents in phone calls, which are easy to detect and allow even more conclusions about the person than without this attempt to deceive.)
I'm skeptical about the feasibility of the software, though. Rewriting a text automatically while keeping it readable and stylistically acceptable seems almost as hard as automated translation. Anyway, depending on how the software works, it will very likely be detectable by the same methods as are already used for authorship detection.
You are looking for a tool that extracts the meaning from a text then re-writes it in a standardized, canonical format, or at least "washes" it into one of a list of possible formats such that if you take a bunch of random input from a bunch of different authors, you can't tell from the output who wrote what.
I expect this will be successful within 10 years if we work hard on it.
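For what it's worth, here is a toy sketch of the "washing" idea in Python. It only normalizes a few surface features (case, contractions, punctuation habits), which is nowhere near extracting meaning. The `CONTRACTIONS` table and the `wash` function are made up for illustration, not taken from any real tool.

```python
import re

# Hypothetical, hand-picked rewrite table; a real tool would need far more.
CONTRACTIONS = {
    "can't": "cannot",
    "won't": "will not",
    "don't": "do not",
    "i'm": "i am",
    "it's": "it is",
    "you're": "you are",
}

def wash(text: str) -> str:
    """Map a text onto one canonical surface form (toy example)."""
    text = text.lower()
    # Treat curly apostrophes the same as straight ones.
    text = text.replace("\u2019", "'")
    # Expand the contractions listed above.
    for short, full in CONTRACTIONS.items():
        text = re.sub(r"\b" + re.escape(short) + r"\b", full, text)
    # Collapse runs of the same end punctuation ("!!!" becomes "!").
    text = re.sub(r"([!?.])\1+", r"\1", text)
    # Flatten exclamation marks into plain periods.
    text = text.replace("!", ".")
    # Collapse whitespace.
    return re.sub(r"\s+", " ", text).strip()

# Two differently styled inputs end up as the same output.
print(wash("I'm sure it WON'T work!!!"))
print(wash("I am sure it will not work."))
```

Both calls print the same line, which is the whole point: after washing, those two surface habits no longer tell the authors apart. Word choice and sentence structure are untouched here, and that is where the hard part lives.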
Knowledge is how to play a game, intelligence is how to win, wisdom is knowing what game to play.
Of course it is, that's what obfuscation does.
All my liberal friends think I'm a conservative, all my conservative friends think I'm a liberal.
This strikes me as an extremely difficult task, assuming the tolerance for losing meaning is low. Maybe IBM Watson work applies.
-Dave
The sooner Trump is obfuscated, the better!
Sent from my ASR33 using ASCII
English has numerous words for the same thing. Try to say a guy is cute, handsome, beautiful, or hot in Portuguese and it all translates to "Bonito".
On the one hand you take life too seriously, and on the other, you do not take playful existence seriously enough. Seth
This has nothing to do with Trump.
We should all just move to newspeak to eliminate the detection / obfuscation arms race entirely.
This is very true. I can identify every single post by apk, even if he posts Anonymously. I must be a genius.
The TFA assumes that stylometry gives somewhat reliable results. It doesn't. Something as simple as an editor cleaning up a work can throw off the analysis.
Even in the optimal scenario (an unedited work by a single author who isn't trying to hide or imitate a different style), the best algorithms have abysmally high failure rates.
KNN (50 neighbors): 0.69 success, 0.28 fail
Decision Tree: 0.58 success, 0.42 fail
Mean Margins Tree: 0.65 success, 0.36 fail
Stylometry is reasonably effective at correctly identifying when two works by the same author have the same style. It is garbage when it comes to determining when two works have different authors. If I were to guess, I'd say the problem is that the variation in style between authors (compared to the variation within a single author's work) is not always wide enough to allow for reliable identification.
Stylometry is interesting, certainly, but the prospect of such an unreliable method being used for anything important is alarming.
Procrastination Man strikes again!
All the obfuscation software has to do is change things so it casts enough doubt. I assume the stylometry analysis doesn't return a 1 or 0, it probably returns a probability. Once the probability is below a certain threshold, the job is done. An example of obfuscating: How about a simple machine translation to another language?
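A rough Python sketch of that threshold idea, under heavy assumptions: the "stylometry" here is just cosine similarity over counts of a few function words, and the rewrite passes are trivial word swaps. `FUNCTION_WORDS`, `similarity`, `obfuscate_until`, and the passes are all invented for illustration; real analyzers use far richer features.

```python
import math
import re
from collections import Counter

# Assumed feature set: a handful of function words (real systems use many more).
FUNCTION_WORDS = ["the", "a", "of", "and", "to", "in", "that", "but", "however", "which"]

def profile(text):
    """Count how often each function word occurs in the text."""
    counts = Counter(re.findall(r"[a-z']+", text.lower()))
    return [counts[w] for w in FUNCTION_WORDS]

def similarity(a, b):
    """Cosine similarity of two function-word profiles (0.0 if either is empty)."""
    pa, pb = profile(a), profile(b)
    dot = sum(x * y for x, y in zip(pa, pb))
    norm = math.sqrt(sum(x * x for x in pa)) * math.sqrt(sum(y * y for y in pb))
    return dot / norm if norm else 0.0

def obfuscate_until(text, known_sample, passes, threshold):
    """Apply rewrite passes one at a time until the score drops below threshold."""
    score = similarity(text, known_sample)
    for rewrite in passes:
        if score < threshold:
            break  # enough doubt already, stop rewriting
        text = rewrite(text)
        score = similarity(text, known_sample)
    return text, score

# Hypothetical rewrite passes, applied in order.
passes = [
    lambda t: re.sub(r"\bthe\b", "this", t),
    lambda t: re.sub(r"\bbut\b", "yet", t),
]

known = "the cat and the dog but the bird"
draft = "the fox and the hen but the owl"
result, score = obfuscate_until(draft, known, passes, threshold=0.5)
print(result)
print(score < 0.5)
```

In this toy run the first pass is already enough, so the second one is never applied. Stopping as soon as the score is under the threshold keeps the text as close to the original as possible, which matters if the result still has to read well.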
That was the turning point of my life--I went from negative zero to positive zero.
Maybe a computer could make sense of Palin's word salad.
I don't read your sig. Why are you reading mine?
Clearly the English language has deteriorated into a hybrid of hillbilly, valleygirl, inner-city slang and various grunts.
It is for this reason I've started a style guide to clear English. This guide includes communicative, informative, and persuasive styles, with a subsection on expletives for persuasive writing and speaking.
Essentially, it's just Strunk and White, Dale Carnegie, and a few other pieces of broad research brought together. Informative style will provide the greatest difficulty, as I'll need to cobble it together from experience and abstract concepts, rather than other research. For example: SQ3R and its derivatives describe methods of study of informative texts (textbooks, essays, articles, etc.), and various books and papers on human memory have cited questioning and organization as ways to improve memorization; many writers incorporate these observations by asking and then answering questions--similar to the rhetorical question.
My target audience encompasses copywriters of books, pamphlets, blogs, and news sites. The book *does* target general consumption, but I particularly want an improvement in mass media. We've reached an era where every person constantly faces the words of an educated man; yet the educated man now talks as the common man, instead of speaking in a way which the common man can easily understand. When the common man's speech deteriorates, the media deteriorates as well.
It is perfectly fine for the media to use the language of the common man, but the common man is served best by structuring that language to a higher standard, taking a form best suited to convey information clearly rather than to socialize. The common man is a man of intelligence, even if he is not a man of intellect: he can understand and learn, and he will imitate those behaviors which produce the greatest effect upon him and others. Expose him to clear, concise, vibrant writing and he will begin to speak in clear, concise, vibrant language, even if he is disinclined to study the use of language in such a way.
Could it do anything for Trump's linguistics?
Brevity is the soul of obfuscation. "Can a program designed to obfuscate author identity defeat a program designed to verify author identity?"
Might as well face it I'm addicted to data.
Supposing it works (not saying it's likely), this would be a big problem for catching plagiarists. Copy somebody's text, run it through this, and then hand it in: boom, you're done. You could certainly have anti-plagiarism software that runs this in reverse (or you take your database of comparison docs and run them all through the obfuscator, something along those lines) but if they do it right and there's some degree of randomness, it introduces a massive dose of plausible deniability to any plagiarism case even with these efforts.
BTW, any typos, grammatical peculiarities, or other abnormalities with my post are due to my text obfuscation software. Don't blame me!
If the intent is to obfuscate the style, just run it through a few languages and back as someone already suggested. But I'm guessing they want something that doesn't look like word salad.
Yup, right there: proper. They're basically asking for someone to write the perfect Bayesian filter beater.
Nope, no sig
You touched a key point there, without actually saying it, which is that the ability of forensic linguistics to recognize a person is inversely proportional to the number of people who could have written the content.
For example, let's say that you're a native Russian speaker, and that your English grammar has certain linguistic quirks that are typical of Russian speakers writing English, e.g. missing all the definite and indefinite articles ("We read book, da?"). If exactly one Russian has access to some piece of information that is contained in the piece of writing, you're screwed. If there are a hundred Russians with access, those particular linguistic quirks no longer provide much help at identifying the author.
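To put rough numbers on that, here is a back-of-the-envelope Bayes sketch in Python. It assumes everyone with access is equally likely a priori, and the quirk rates (0.9 for a speaker with the habit, 0.01 for everyone else) are invented purely for illustration, as are the names.

```python
def posterior(quirk_rates: dict[str, float], suspect: str) -> float:
    """P(suspect wrote it | quirk observed), assuming a uniform prior
    over everyone with access to the information."""
    total = sum(quirk_rates.values())
    return quirk_rates[suspect] / total

# One Russian speaker among ten people with access (made-up rates).
small_pool = {"ivan": 0.9, **{f"other{i}": 0.01 for i in range(9)}}
# A hundred Russian speakers with access, all showing the same quirk.
large_pool = {f"speaker{i}": 0.9 for i in range(100)}

print(round(posterior(small_pool, "ivan"), 2))       # 0.91
print(round(posterior(large_pool, "speaker0"), 2))   # 0.01
```

Same quirk, same text, but the evidence against any one person drops from about 91% to about 1% once a hundred people share the habit.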
One possible takeaway is that the best way to leak something is to anonymously post evidence somewhere without comment, then separately anonymously report that you noticed it, and bring it to someone's attention. This potentially vastly broadens the pool of people with access to the information, and thus makes your linguistic quirks less meaningful. However, this requires a significant time delay between the two posts. Otherwise, one would still strongly suspect that the original poster made the "discovery". But if you can stand to wait a year or two, you're golden.
Check out my sci-fi/humor trilogy at PatriotsBooks.
Alternatively, delete all the definite and indefinite articles. Then they'll blame your one Russian coworker.
German also puts the verb(s) at the end of the sentence. Translate your work into proper German, have a computer make a literal translation back to English and you'll get much the same thing as Yoda-speak.
Good, inexpensive web hosting
Enough with all the Trump articles jeez!
For one example, see
"Obfuscating Document Stylometry to Preserve Author Anonymity"
Gary Kacmarcik & Michael Gamon
This technique is not an automated one, but hey, all you need is more software.
Twain.
John McAfee 'It was like that time I hired that Bangkok prostitute; to do my taxes, while I fucked my accountant'
I hope part of the competition is to retain meaning and have correct grammar. Because if not, you might as well just do content spinning and declare it done.
The book *does* target general consumption
I look forward to devouring it!
I, for one, also saw "Trump" in the title.
That makes two.
<blink>down the rabbit hole</blink>