Can Author Obfuscation Trump Forensic Linguistics? (webis.de)
An anonymous reader writes: Everyone possesses their own writing style, which may be used to identify authors even if they wish to remain anonymous: linguists employ stylometry to settle disputes over the authorship of historic texts as well as more recent cases, and are called to verify the authors of suicide notes or threatening letters. Computer linguists carry out research on software for forensic text analyses, and a recent study shows many of these approaches to be reproducible. Now, a competition has been announced to develop obfuscation software to hide an author's style with the task: "Given a document, paraphrase it so that its writing style does not match that of its original author, anymore." We'll see what comes out of that. Meanwhile, the question remains: Who will win in the long run? Forensic linguists, or obfuscation technology?
Want to obfuscate text? Just run it through a language or 5, then back to the original language using something like Google Translate. No paraphrasing needed.
Silence is a state of mime.
And I thought for a moment it would be a different kind of article.
because of reasons which are not obvious and which I will not reveal although you already know them.
A thousand monkeys on a thousand typewriters.
It was the best of times, it was the blurst of times...
Stupid monkey!
OK, a few bugs to work out.
Also known as the perfect plagiarism defense.
First thing to come to mind: the Hemingway Editor, which helps "improve" your wording.
This is basically what I did to Wikipedia articles in high school to get past plagiarism filters and google searches while writing papers.
I doubt this is possible to do very well. Consider [1], where they were able to identify authors from compiled code. Not with close to 100% accuracy, but it's still surprising that your source code style is identifiable with optimization enabled and symbols stripped out.
[1] ftp://ftp.cs.wisc.edu/paradyn/...
Slashdot - News for Nerds, Stuff that Matters, in ISO-8859-1 Has just realised that beta makes this signature redundant
is belonging to us!
Back in the 1970s Stephen King wrote some novels under the pseudonym Richard Bachman. It worked for a while, but people were able to figure out that Bachman wrote in the same style as the famous Stephen King. Eventually the secret broke.
I wonder if those novels written under the pseudonym would make a good test of the system. Run them through the process, give the results to newer readers of King's known works, and see if they notice the similarities others did in the past.
If you think I voted for Trump because of this post, you're wrong. I voted for Dr. Jill Stein of the Green Party. Again.
Verbs, you must identify.
At end of sentence, they will be placed.
Yoda, the computer will suspect.
If someone is serious about obfuscating their writing, they will be able to. Especially once they get access to the software that would be used to examine it.
However, most people are not going to even bother attempting to obfuscate.
Don't waste your vote! Vote for whoever you want, unless you live in a swing state it won't matter anyways
To quantify the degree of obfuscation, they have precise computational metrics based on their stylometric algorithms. But to judge the quality of the obfuscation, there are no objective metrics. Instead:
To measure soundness and properness, obfuscations will be sampled and handed out to participants for peer-review.
which seems to me to make the contest rather less meaningful. Why not just peer review the quality of all obfuscations exceeding some minimum standard?
There's no way this guy can obfuscate anything he says. Unless, did he endorse himself rather than Palin doing it? hmmmmm
As a trained linguist, though not an expert on forensic linguistics, I believe that successful automated obfuscation will win and be essentially unbeatable, but probably also detectable. By rewriting a text automatically, valuable information is destroyed that a forensic linguist has to rely upon. (When humans try to obfuscate text, on the other hand, they tend to add such information, potentially even making the task easier for the forensic linguist. For example, blackmailers commonly imitate foreign accents in phone calls, which are easy to detect and allow even more conclusions about the person than without this attempt to deceive.)
I'm skeptical about the feasibility of the software, though. Rewriting a text automatically while keeping it readable and stylistically acceptable seems almost as hard as automated translation. Anyway, depending on how the software works, it will very likely be detectable by the same methods as are already used for authorship detection.
You are looking for a tool that extracts the meaning from a text then re-writes it in a standardized, canonical format, or at least "washes" it into one of a list of possible formats such that if you take a bunch of random input from a bunch of different authors, you can't tell from the output who wrote what.
I expect this will be successful within 10 years if we work hard on it.
Knowledge is how to play a game, intelligence is how to win, wisdom is knowing what game to play.
This strikes me as an extremely difficult task, assuming the tolerance for losing meaning is low. Maybe IBM Watson work applies.
-Dave
English has numerous words for the same thing. Try to say a guy is cute, handsome, beautiful, or hot in Portuguese and it all translates to "Bonito".
On the one hand you take life too seriously, and on the other, you do not take playful existence seriously enough. Seth
Have you ever seen a TV show where everyone talks the same?
The main characters are high school kids, and talk like high school kids, but so do their parents, the bad guys, the police, the foreign exchange students.
Some TV show was on the last time I went to the gym. I didn't recognize the show, but _everyone_ talked like a fortune cookie.
It drives me crazy. I can't suspend my disbelief, and it just kills the show for me. It's like the actors are just puppets, and I can hear the writer pulling the strings.
I think that many (even successful) writers are phenomenally bad at using so much as two distinct voices.
If I ever write a screenplay or book, I'm going to get a different friend to paraphrase the main characters that should logically sound different.
This has nothing to do with Trump.
We should all just move to newspeak to eliminate the detection / obfuscation arms race entirely.
This is very true. I can identify every single post by apk, even if he posts Anonymously. I must be a genius.
The TFA assumes that stylometry gives somewhat reliable results. It doesn't. Something as simple as an editor cleaning up a work can throw off the analysis.
Even in the optimal scenario (an unedited work by a single author who isn't trying to hide or imitate a different style), the best algorithms have abysmally high failure rates.
KNN (50 neighbors): 0.69 success, 0.28 fail
Decision Tree: 0.58 success, 0.42 fail
Mean Margins Tree: 0.65 success, 0.36 fail
Stylometry is reasonably effective at correctly identifying when two works by the same author have the same style. It is garbage when it comes to determining when two works have different authors. If I were to guess, I'd say the problem is that the variation in style between authors (compared to the variation within a single author's work) is not always wide enough to allow for reliable identification.
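To make that guess concrete, here is a minimal sketch of how one might compare within-author and between-author variation. It assumes each document has already been reduced to a numeric feature vector (how you do that is up to you), and the names `distance`, `mean`, and `within_and_between` are made up for illustration; this is not how any particular stylometry tool does it.

```python
from itertools import combinations
import math

def distance(p, q):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def mean(values):
    """Arithmetic mean, or 0.0 for an empty sequence."""
    values = list(values)
    return sum(values) / len(values) if values else 0.0

def within_and_between(samples):
    """samples maps an author name to a list of feature vectors, one per document.

    Returns (mean distance between documents by the same author,
             mean distance between documents by different authors).
    """
    within = [distance(p, q)
              for docs in samples.values()
              for p, q in combinations(docs, 2)]
    between = [distance(p, q)
               for (_, docs_a), (_, docs_b) in combinations(samples.items(), 2)
               for p in docs_a
               for q in docs_b]
    return mean(within), mean(between)
```

If the second number is not clearly larger than the first, the authors overlap in style and telling them apart is going to be unreliable, which is the situation described above.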
Stylometry is interesting, certainly, but the prospect of such an unreliable method being used for anything important is alarming.
Procrastination Man strikes again!
It would be a lot more interesting if they could do the reverse: create software to give an individual's writings the style of an arbitrary target. Misdirection instead of obfuscation.
The most effective obfuscation tool is Powerpoint. Even the most stunning and original ideas can be easily reduced to an unremarkable set of slides.
All the obfuscation software has to do is change things so it casts enough doubt. I assume the stylometry analysis doesn't return a 1 or 0, it probably returns a probability. Once the probability is below a certain threshold, the job is done. An example of obfuscating: How about a simple machine translation to another language?
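A rough sketch of the "get the score below a threshold" idea, with loud assumptions: the ten function words, the cosine similarity score, and the 0.8 cutoff are all made up for illustration, and a similarity score stands in here for whatever probability a real stylometry tool would return.

```python
from collections import Counter
import math
import re

# Assumed, illustrative feature set: a handful of common English function words.
FUNCTION_WORDS = ["the", "of", "and", "to", "a", "in", "that", "is", "it", "for"]

def profile(text):
    """Relative frequency of each function word in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    total = len(words) or 1
    return [counts[w] / total for w in FUNCTION_WORDS]

def cosine_similarity(p, q):
    """Cosine of the angle between two profiles; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return dot / norm if norm else 0.0

def obfuscated_enough(known_text, candidate_text, threshold=0.8):
    """True once the similarity to the author's known writing drops below
    the (hypothetical) threshold."""
    score = cosine_similarity(profile(known_text), profile(candidate_text))
    return score < threshold
```

The obfuscator would keep rewriting until `obfuscated_enough` returns True. It says nothing about whether the result is still readable.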
That was the turning point of my life--I went from negative zero to positive zero.
Take Les Liaisons dangereuses by Choderlos de Laclos. It is an example of polysemantics based on the changing register of the "writers", but with a single author: "fake polyphony" (the purpose of the contest is to do this without being detected).
At the other end of the spectrum, "the 1001 tales" tries to unify under a single tone stories that clearly do not share the same origins (another failure).
I do guess that collective work with misdirections of fake semantic changes might do the trick, with a language peculiarly engineered for that task: French.
If you cannot fuzz the tools, fuzz the input by using a carefully crafted language to make this detection hard.
http://beauty-of-imagination.b...
The goal doesn't need to be to obfuscate it to the point where it's completely distinct from the original author's style. The goal should be to obfuscate the style sufficiently that it electronically matches a sufficiently large pool of alternate authors at least as well as it matches the original author. Your goal shouldn't be described in terms of how far you get from the original author's style, but rather in terms of how many other people could have plausibly written the document.
For example, if I'm trying to leak a document I wrote at a state environmental protection agency to the press, what I really care about is sufficient obfuscation that it's hard to tell which employee of the agency wrote the document. I don't care if the obfuscation is "better than" that. In fact, it's probably WORSE to over-obfuscate, because any machine obfuscation has the potential to subtly change meaning. An obfuscation method that's so powerful that you can obfuscate the difference between me and someone in a remedial high school English class is not necessarily better at achieving my goal.
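Here is roughly how that success criterion could be scored, as a sketch only. It assumes some stylometry tool has already given you a distance from the obfuscated document to each candidate author's known writing (lower means a closer match); the function names and the `min_pool` parameter are hypothetical.

```python
def plausible_alternates(distances, true_author):
    """Other candidates who match the obfuscated document at least as well
    as the true author does (distance no larger than the true author's)."""
    own = distances[true_author]
    return sorted(name for name, d in distances.items()
                  if name != true_author and d <= own)

def sufficiently_obfuscated(distances, true_author, min_pool):
    """True when at least min_pool other people could plausibly have written it."""
    return len(plausible_alternates(distances, true_author)) >= min_pool
```

For the leak example, `distances` would cover the other employees of the agency, and `min_pool` is however many plausible co-suspects I need to feel safe. Nothing here rewards pushing my own distance any higher than necessary.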
Could it do anything for Trump's linguistics?
Brevity is the soul of obfuscation. "Can a program designed to obfuscate author identity defeat a program designed to verify author identity?"
Might as well face it I'm addicted to data.
Supposing it works (not saying it's likely), this would be a big problem for catching plagiarists. Copy somebody's text, run it through this, and then hand it in: boom, you're done. You could certainly have anti-plagiarism software that runs this in reverse (or you take your database of comparison docs and run them all through the obfuscator, something along those lines) but if they do it right and there's some degree of randomness, it introduces a massive dose of plausible deniability to any plagiarism case even with these efforts.
BTW, any typos, grammatical peculiarities, or other abnormalities with my post are due to my text obfuscation software. Don't blame me!
If the intent is to obfuscate the style, just run it through a few languages and back as someone already suggested. But I'm guessing they want something that doesn't look like word salad.
Yup, right there: proper. They're basically asking for someone to write the perfect Bayesian filter beater.
Nope, no sig
"Trumps" ... was the first word that caught my eye. What the Donald?! has he gone and done now?
Then I laughed. As for obfuscation, I have this completely secure home-brew algorithm that I came up with in my 1st year at college and have used ever since...
Joe_Dragon definatley does maybe that because no body else would wan't it.
Eferyone possesses zeir own v-r-ritingkt schtyle, vhich may be used to identify auzzors efen if zey visch to r-r-remain anonymous: lingktuists employ schtylometry to settle disputes ofer ze auzzorschip uff historic texts as vell as more r-r-recent kases, undt are kalled to ferify ze auzzors uff suicide nichtes or zreateningkt letters. Komputer lingktuists karry out r-r-research on software fur forensic text analyses, undt a r-r-recent schtudy schows many uff zese approaches to be r-r-reproducible. Now, a kompetition has been announced to defelop obfuscation software to hide an auzzor's schtyle mitt ze task: "Gifen a document, paraphrase it so zat its v-r-ritingkt schtyle does nicht match zat uff its original auzzor, anymore." Ve'll see vhat komes out uff zat. Meanvile, ze qfestion r-r-remains: Vho vill vin in ze long r-r-run? Forensic lingktuists, or obfuscation technology?
Enough with all the Trump articles jeez!
For one example, see
"Obfuscating Document Stylometry to Preserve Author Anonymity"
Gary Kacmarcik & Michael Gamon
This technique is not an automated one, but hey, all you need is more software.
Yes he can! The was read by a dyslexic first glance.
I hope part of the competition is to retain meaning and have correct grammar. Because if not, you might as well just do content spinning and declare it done.