Can Author Obfuscation Trump Forensic Linguistics? (webis.de)
An anonymous reader writes: Everyone possesses their own writing style, which may be used to identify authors even if they wish to remain anonymous: linguists employ stylometry to settle disputes over the authorship of historic texts as well as more recent cases, and are called on to verify the authors of suicide notes or threatening letters. Computational linguists carry out research on software for forensic text analysis, and a recent study shows many of these approaches to be reproducible. Now, a competition has been announced to develop obfuscation software that hides an author's style, with the task: "Given a document, paraphrase it so that its writing style does not match that of its original author, anymore." We'll see what comes out of that. Meanwhile, the question remains: Who will win in the long run? Forensic linguists, or obfuscation technology?
Want to obfuscate text? Just run it through a language or five, then back to the original language, using something like Google Translate. No paraphrasing needed.
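For anyone who wants to try it, the chain could be sketched roughly like this. The `translate` callable is a stand-in you would have to supply yourself (it is not a real Google Translate API), and its `(text, source, target)` signature is an assumption made for illustration.

```python
from typing import Callable, Sequence

# Hypothetical signature: translate(text, source_lang, target_lang) -> str.
# Plug in whatever translation service or library you actually use.
Translator = Callable[[str, str, str], str]


def round_trip(text: str, pivots: Sequence[str], translate: Translator,
               home: str = "en") -> str:
    """Pass text through each pivot language in order, then back to home."""
    current = home
    for lang in pivots:
        text = translate(text, current, lang)
        current = lang
    # Translate back only if we actually left the home language.
    if current != home:
        text = translate(text, current, home)
    return text
```

Whether the result still says what you meant, or still looks like your style, is a separate question that this sketch does not answer.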
Silence is a state of mime.
I read Trump as a noun and thought the title was nonsense.
Support my political activism on Patreon.
Back in the 1970s Stephen King wrote some novels under the pseudonym Richard Bachman. It worked for a while, but people were able to figure out that Bachman wrote in the same style as the famous Stephen King. Eventually the secret broke.
I wonder if those novels written under the pseudonym would make a good test of the system. Run them through the process, give the results to newer readers of King's known works, and see if they notice the similarities others did in the past.
If you think I voted for Trump because of this post, you're wrong. I voted for Dr. Jill Stein of the Green Party. Again.
Well, in this case the answer is fairly obvious. Since the question asked about the long run, it is safe to assume that machines which can comprehend natural language will eventually be used to obfuscate text. Once that happens, I would assume obfuscation will easily win. It would not only win; it could almost certainly produce false positives as well.
-- All that is necessary for the triumph of evil is that good men do nothing. -- Edmund Burke
Of course it is, that's what obfuscation does.
All my liberal friends think I'm a conservative, all my conservative friends think I'm a liberal.
This has nothing to do with Trump.
TFA assumes that stylometry gives somewhat reliable results. It doesn't. Something as simple as an editor cleaning up a work can throw off the analysis.
Even in the optimal scenario (an unedited work by a single author who isn't trying to hide or imitate a different style), the best algorithms have abysmally high failure rates.
KNN (50 neighbors): 0.69 success, 0.28 fail
Decision Tree: 0.58 success, 0.42 fail
Mean Margins Tree: 0.65 success, 0.36 fail
Stylometry is reasonably effective at correctly identifying when two works by the same author have the same style. It is garbage when it comes to determining when two works have different authors. If I were to guess, I'd say the problem is that the variation in style between authors (compared to the variation within a single author's work) is not always wide enough to allow for reliable identification.
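To make the within-author versus between-author point concrete, here is a toy sketch of one common stylometric idea: compare texts by how often they use a handful of function words. The word list and the Euclidean distance are my own assumptions for illustration; this is not the feature set or the classifiers behind the numbers above.

```python
import re
from collections import Counter
from math import sqrt

# Assumed, deliberately tiny function-word list; real systems use far more features.
FUNCTION_WORDS = ("the", "a", "an", "of", "and", "to", "in", "that", "it", "is")


def profile(text):
    """Relative frequency of each function word in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    total = len(words) or 1  # avoid dividing by zero on empty input
    return [counts[w] / total for w in FUNCTION_WORDS]


def distance(text_a, text_b):
    """Euclidean distance between two function-word profiles."""
    pa, pb = profile(text_a), profile(text_b)
    return sqrt(sum((x - y) ** 2 for x, y in zip(pa, pb)))
```

Attribution along these lines only works if the distances between different authors are reliably larger than the distances between two texts by the same author. When the two ranges overlap, no threshold separates them cleanly.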
Stylometry is interesting, certainly, but the prospect of such an unreliable method being used for anything important is alarming.
Procrastination Man strikes again!
You touched a key point there, without actually saying it, which is that the ability of forensic linguistics to recognize a person is inversely proportional to the number of people who could have written the content.
For example, let's say that you're a native Russian speaker, and that your English grammar has certain linguistic quirks that are typical of Russian speakers writing English, e.g. missing all the definite and indefinite articles ("We read book, da?"). If exactly one Russian has access to some piece of information that is contained in the piece of writing, you're screwed. If there are a hundred Russians with access, those particular linguistic quirks no longer provide much help at identifying the author.
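A back-of-the-envelope version of that argument, as a sketch: assume everyone with access is equally likely a priori to be the author, and give each person a made-up probability of producing the quirk. The names and numbers below are invented purely for illustration.

```python
from typing import Dict, Mapping


def posterior(likelihoods: Mapping[str, float]) -> Dict[str, float]:
    """Probability that each candidate wrote the text, given a uniform prior.

    likelihoods maps a candidate to the (assumed) chance that their writing
    would show the observed quirk.
    """
    total = sum(likelihoods.values())
    if total <= 0:
        raise ValueError("at least one candidate must be able to produce the quirk")
    return {name: p / total for name, p in likelihoods.items()}


# One Russian speaker among colleagues who rarely drop articles:
lone = posterior({"russian_speaker": 0.9, "colleague_a": 0.01, "colleague_b": 0.01})

# A hundred Russian speakers who all share the same quirk:
crowd = posterior({f"russian_speaker_{i}": 0.9 for i in range(100)})
```

In the first case nearly all of the suspicion lands on one person; in the second it is spread evenly across the hundred, so the quirk alone singles out nobody.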
One possible takeaway is that the best way to leak something is to anonymously post evidence somewhere without comment, then separately anonymously report that you noticed it, and bring it to someone's attention. This potentially vastly broadens the pool of people with access to the information, and thus makes your linguistic quirks less meaningful. However, this requires a significant time delay between the two posts. Otherwise, one would still strongly suspect that the original poster made the "discovery". But if you can stand to wait a year or two, you're golden.
Check out my sci-fi/humor trilogy at PatriotsBooks.
Alternatively, delete all the definite and indefinite articles. Then they'll blame your one Russian coworker.
Check out my sci-fi/humor trilogy at PatriotsBooks.