Slashdot Mirror


Can Author Obfuscation Trump Forensic Linguistics? (webis.de)

An anonymous reader writes: Everyone possesses their own writing style, which may be used to identify authors even if they wish to remain anonymous: linguists employ stylometry to settle disputes over the authorship of historic texts as well as more recent cases, and are called to verify the authors of suicide notes or threatening letters. Computer linguists carry out research on software for forensic text analyses, and a recent study shows many of these approaches to be reproducible. Now, a competition has been announced to develop obfuscation software to hide an author's style with the task: "Given a document, paraphrase it so that its writing style does not match that of its original author, anymore." We'll see what comes out of that. Meanwhile, the question remains: Who will win in the long run? Forensic linguists, or obfuscation technology?

51 of 84 comments

  1. Ummm by wbr1 · · Score: 5, Funny

    Want to obfuscate text? Just run it through a language or 5, then back to the original language using something like google translate. No paraphrasing needed.

    --
    Silence is a state of mime.
    1. Re:Ummm by Anonymous Coward · · Score: 1

      I do something similar to encrypt my e-mails, I run it through ROT-13 6 times. It's foolproof.

    2. Re:Ummm by Anonymous Coward · · Score: 1

      Well, with the current quality of machine translation you'll lose a lot of content too.

    3. Re:Ummm by s.petry · · Score: 2

      I can one up you, because I use ROT6 13 times. Way more better!

      --

      -The wise argue that there are few absolutes, the fool argues that there are no probabilities.
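The arithmetic behind both ROT jokes above can be checked in a few lines. This is a minimal sketch, not anyone's actual mail setup: `rot` and `rot_repeated` are hypothetical helper names, and only ASCII letters are shifted. Repeated shifts add up modulo 26, so 13 x 6 and 6 x 13 both come to 78, which is a multiple of 26 and leaves the text unchanged.

```python
import string


def rot(text: str, shift: int) -> str:
    """Caesar-shift ASCII letters by `shift` positions; other characters pass through."""
    shift %= 26
    lower = string.ascii_lowercase
    upper = string.ascii_uppercase
    table = str.maketrans(
        lower + upper,
        lower[shift:] + lower[:shift] + upper[shift:] + upper[:shift],
    )
    return text.translate(table)


def rot_repeated(text: str, shift: int, times: int) -> str:
    """Apply the same shift `times` times in a row."""
    for _ in range(times):
        text = rot(text, shift)
    return text


message = "It's foolproof."
# 13 * 6 = 78 and 6 * 13 = 78; 78 % 26 == 0, so both round trips are the identity.
print(rot_repeated(message, 13, 6) == message)
print(rot_repeated(message, 6, 13) == message)
```

Both comparisons print True, which is the joke: the "encrypted" mail is the plaintext.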

    4. Re:Ummm by Golddess · · Score: 1

      Which is why you proofread it after the final pass, and make adjustments then. Yes, it might be possible that your adjustments can still be enough to identify you, but it seems much less likely to me.

      --
      "I'm not sure I like the fugnutish tone you used in your post!" -RogL (608926)-
    5. Re:Ummm by ohieaux · · Score: 2, Interesting

      English
      Want to obfuscate text? Just run it through a language or 5, then back to the original language using something like google translate. No paraphrasing needed.
      Afrikaans
      Wil teks verduisteren ? Net hardloop dit deur 'n taal of 5 , dan terug na die oorspronklike taal gebruik van iets soos Google vertaal. Geen parafrasering nodig .
      Albanian
      Dëshironi tekstin errët ? Vetëm të drejtuar atë nëpërmjet një gjuhe ose 5 , pastaj kthehet për të përdorur gjuhën origjinale e diçka si Google Translate . Nuk ka parafrazuar nevojshme .
      Arabic
      5 . .
      Armenian
      , . Just 5, Google . .
      English
      You want the text in the dark. Just run it through the language or 5 , then return to the original source using something like Google language translation . A quote is necessary .
      Note: international characters may not show in comment.

      --
      Where all think alike, no one thinks very much.
    6. Re:Ummm by drew_kime · · Score: 1

      That's pretty close, actually. Hmm ... are there languages with syntax sufficiently different from Romance languages to overcome this?

      --
      Nope, no sig
  2. Obfuscation always wins by deathcloset · · Score: 1

    because of reasons which are not obvious and which I will not reveal although you already know them.

    1. Re:Obfuscation always wins by ranton · · Score: 2

      Well, in this case the reason is fairly obvious. Since the question asked about the long run, it is safe to assume machines which can comprehend natural language will be used to obfuscate text in the long run. Once that happens, I would assume obfuscation will easily win. Not only could it win, it could almost certainly produce false positives as well.

      --
      -- All that is necessary for the triumph of evil is that good men do nothing. -- Edmund Burke
  3. Re:I saw "Trump" in the title by bluefoxlucid · · Score: 2

    I read Trump as a noun and thought the title was nonsense.

  4. Unlikely by DeathToBill · · Score: 1

    I doubt this is possible to do very well. Consider [1], where they were able to identify authors from compiled code. Not with close to 100% accuracy, but it's still surprising that your source code style is identifiable with optimization enabled and symbols stripped out.

    [1] ftp://ftp.cs.wisc.edu/paradyn/...

    --
    Slashdot - News for Nerds, Stuff that Matters, in ISO-8859-1 Has just realised that beta makes this signature redundant
    1. Re:Unlikely by avandesande · · Score: 1

      Someone should write a English compiler.

      --
      love is just extroverted narcissism
    2. Re:Unlikely by gstoddart · · Score: 1

      They'd fail utterly.

      Remember that Star Trek episode where the robots kept saying "Norma, coordinate" up until Kirk and Spock made his brain explode? Picture that.

      English is far too malleable and imprecise.

      --
      Lost at C:>. Found at C.
    3. Re:Unlikely by thoromyr · · Score: 1

      they succeeded with nothing like 100% using a small sample set which has the side effect of avoiding confusion.

      Put another way: face recognition seems promising with similar accuracy rate when limited to a small set of faces. But once you open the flood gates the accuracy goes way down.

      Proponents fall back on the "it works as a pre-filter" which, depending on the size of the population you are working with, might have sufficient true positive with a low enough false positive to make it workable. But it is also a far cry from the claims of identification.

    4. Re:Unlikely by worf_mo · · Score: 1

      Someone should write a English compiler.

      Your message wouldn't pass without a warning.

  5. Stephen King is not dead. by I'm+New+Around+Here · · Score: 4, Interesting

    Back in the 1970's Stephen King wrote some novels under the pseudonym Richard Bachman. It worked for a while, but people were able to figure out that Bachman wrote in the same style as the famous Stephen King. Eventually the secret broke.

    I wonder if those novels written under the pseudonym would make a good test of the system. Run them through the process, give the results to newer readers of King's known works, and see if they notice the similarities others did in the past.

    --
    If you think I voted for Trump because of this post, you're wrong. I voted for Dr. Jill Stein of the Green Party. Again.
    1. Re:Stephen King is not dead. by thoromyr · · Score: 1

      and then there are authors who have a diverse writing style. Try author identifying software | reader identification of anonymized works on a corpus including the work of Walter Jon Williams -- and I doubt that he is the only author to vary style.

    2. Re:Stephen King is not dead. by david_thornley · · Score: 1

      Heck, separate Lord of the Rings into narrative and dialog and compare those. Tolkien used different styles there. The time I remember that he tried using the dialog-type language in narrative and description, at the first formal dinner Frodo attends in Rivendell, it sounded ridiculous.

      --
      "When you have eliminated the unacceptable, whatever is left, however improbable, must be the truthiness" - Holmes
  6. Only for noobs by penguinoid · · Score: 1

    If someone is serious about obfuscating their writing, they will be able to. Especially once they get access to the software that would be used to examine it.

    However, most people are not going to even bother attempting to obfuscate.

    --
    Don't waste your vote! Vote for whoever you want, unless you live in a swing state it won't matter anyways
    1. Re:Only for noobs by EdwardFurlong · · Score: 1

      This seems pretty true, if I was writing something that I would not want traced back to me I would not trust some program anyway.

      Maybe if I was super paranoid about the NSA or Google somehow linking my random internet comments all to me, then a program might have some use.

      It would be interesting to see if the program could go through /. AC postings and see if they can match them up to a user.

    2. Re:Only for noobs by techno-vampire · · Score: 1

      I've actually done some obfuscation of my own communications. Years ago, I worked for a tech company where most of my co-workers were about half my age at best, and their word usage, grammar and syntax often made them look like high school dropouts, especially when compared to my writing. (No, I'm not bragging; it's just that unlike them, I cared about such things and tried harder than they did to get it right.)

      One of the ways we had for giving feedback was an internal website where we could "ask the suits." Officially, the questions were anonymous, but my writing style was distinctive enough to be a giveaway if the responder was familiar with me. To avoid that, I did my best to mimic the style, word-choice and syntax of the other techs, including one or two judicious spelling errors so that my questions looked about the same as anybody else's. I've no idea, of course, if this would fool a determined attempt to identify me, but I'm fairly sure that my identity wasn't obvious, and that's all that I needed.

      --
      Good, inexpensive web hosting
  7. Lacking objective quality metric by GlobalEcho · · Score: 1

    To quantify the degree of obfuscation, they have precise computational metrics based on their stylometric algorithms. But to judge the quality of the obfuscation, there are no objective metrics. Instead:

    To measure soundness and properness, obfuscations will be sampled and handed out to participants for peer-review.

    which seems to me to make the contest rather less meaningful. Why not just peer review the quality of all obfuscations exceeding some minimum standard?

  8. Obfuscation will win...if it works by Anonymous Coward · · Score: 1

    As a trained linguist, though not an expert on forensic linguistics, I believe that successful automated obfuscation will win and be essentially unbeatable, but probably also detectable. By rewriting a text automatically, valuable information is destroyed that a forensic linguist has to rely upon. (When humans try to obfuscate text, on the other hand, they tend to add such information, potentially even making the task easier for the forensic linguist. For example, blackmailers commonly imitate foreign accents in phone calls, which are easy to detect and allow even more conclusions about the person than without this attempt to deceive.)

    I'm skeptical about the feasibility of the software, though. Rewriting a text automatically while keeping it readable and stylistically acceptable seems almost as hard as automated translation. Anyway, depending on how the software works, it will very likely be detectable by the same methods as are already used for authorship detection.

  9. Basically you are looking for a translator by davidwr · · Score: 1

    You are looking for a tool that extracts the meaning from a text then re-writes it in a standardized, canonical format, or at least "washes" it into one of a list of possible formats such that if you take a bunch of random input from a bunch of different authors, you can't tell from the output who wrote what.

    I expect this will be successful within 10 years if we work hard on it.

    --
    Knowledge is how to play a game, intelligence is how to win, wisdom is knowing what game to play.
  10. Re:I saw "Trump" in the title by bondsbw · · Score: 2, Funny

    Of course it is, that's what obfuscation does.

    --
    All my liberal friends think I'm a conservative, all my conservative friends think I'm a liberal.
  11. hard by bigdavex · · Score: 1

    This strikes me as an extremely difficult task, assuming the tolerance for losing meaning is low. Maybe IBM Watson work applies.

    --
    -Dave
    1. Re:hard by umghhh · · Score: 1

      That may be true, but most of what is written anywhere in this world is meaningless drivel, so the problem of losing meaning does not really exist. Neither does the need to obfuscate, I admit.
      This leaves us with people that actually have something to say. I reckon there would be a tiny minority among them, that would want to have such service but that also means its production would most likely be economically unfeasible.
      Then there are trolls and 50c soldiers which could use the service of course but I guess it is easier to automate them than to obfuscate their activity.

  12. Re:I saw "Trump" in the title by Anne+Thwacks · · Score: 1

    The sooner Trump is obfuscated, the better!

    --
    Sent from my ASR33 using ASCII
  13. Depends by spaceman375 · · Score: 1

    This will depend heavily on which language the original and end documents are in. Or: Success relies strongly on source and target vernacular.

    English has numerous words for the same thing. Try to say a guy is cute, handsome, beautiful, or hot in Portuguese and it all translates to "Bonito".

    --
    On the one hand you take life too seriously, and on the other, you do not take playful existence seriously enough. Seth
  14. Click bait title by Anonymous Coward · · Score: 4, Funny

    This has nothing to do with Trump.

  15. Newspeak by techsoldaten · · Score: 1

    We should all just move to newspeak to eliminate the detection / obfuscation arms race entirely.

  16. apk by 110010001000 · · Score: 1

    This is very true. I can identify every single post by apk, even if he posts Anonymously. I must be a genius.

  17. Polygraph 2.0 by jheath314 · · Score: 2

    The TFA assumes that stylometry gives somewhat reliable results. It doesn't. Something as simple as an editor cleaning up a work can throw off the analysis.

    Even in the optimal scenario (an unedited work by a single author who isn't trying to hide or imitate a different style), the best algorithms have abysmally high failure rates.

    KNN (50 neighbors): 0.69 success, 0.28 fail
    Decision Tree 0.58 success, 0.42 fail
    Mean Margins Tree 0.65 success, 0.36 fail

    Stylometry is reasonably effective at correctly identifying when two works by the same author have the same style. It is garbage when it comes to determining when two works have different authors. If I were to guess, I'd say the problem is that the variation in style between authors (compared to the variation within a single author's work) is not always wide enough to allow for reliable identification.

    Stylometry is interesting, certainly, but the prospect of such an unreliable method being used for anything important is alarming.

    --
    Procrastination Man strikes again!
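The point above about between-author versus within-author variation can be made concrete with a toy measurement. This is a rough sketch under stated assumptions, not the method behind the figures quoted in the comment: the ten-word `FUNCTION_WORDS` list, the helper names, and the sample sentences are all made up for illustration, and real stylometry uses far richer features.

```python
import math
import re
from collections import Counter
from itertools import combinations, product

# Hypothetical, deliberately tiny feature set.
FUNCTION_WORDS = ["the", "a", "of", "and", "to", "in", "that", "it", "is", "but"]


def profile(text):
    """Relative frequency of each function word in `text`."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    total = len(words) or 1
    return [counts[w] / total for w in FUNCTION_WORDS]


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def mean_similarity(pairs):
    """Average profile similarity over an iterable of (text, text) pairs."""
    sims = [cosine_similarity(profile(a), profile(b)) for a, b in pairs]
    return sum(sims) / len(sims)


# Made-up sample texts standing in for two authors.
docs_a = [
    "It is the case that the wind is cold, but it is bright.",
    "The road to the sea is long, and it is steep.",
    "It is true that a map of the coast is of use.",
]
docs_b = [
    "Wind cold. Sky bright. We walked anyway.",
    "Long road, steep road, and no map in hand.",
]

within = mean_similarity(combinations(docs_a, 2))   # same author, different texts
between = mean_similarity(product(docs_a, docs_b))  # one text from each author
print(f"within-author: {within:.2f}  between-author: {between:.2f}")
```

Attribution only has something to work with when the within-author figure sits clearly above the between-author one; when the two overlap, a classifier built on these features cannot reliably tell the authors apart.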
    1. Re:Polygraph 2.0 by thoromyr · · Score: 1

      Indeed. I've been reading H. Beam Piper's "Fuzzy" stories to my kids, and it is quite amusing to have the "veridicator" play such a prominent role as an infallible method of separating truth from lies (although the narration admits the possibility of unintentional deception, wherein someone truly believes what they are saying, it emphatically rejects the possibility of deliberate deception).

      To the topic at hand, it is certainly interesting and even useful when applied intelligently. For example, it is well established that single "books" (e.g., in the Bible) have multiple authors. Some of this is trivial (anyone reading the Noah story as related in the Bible should be able to immediately tell that there is a minimum of two separate traditions being merged), but serious textual analysis is good for making finer discrimination -- with the caveat that it does not provide absolute answers.

  18. Seems simple to me by Billy+the+Mountain · · Score: 1

    All the obfuscation software has to do is change things so it casts enough doubt. I assume the stylometry analysis doesn't return a 1 or 0, it probably returns a probability. Once the probability is below a certain threshold, the job is done. An example of obfuscating: How about a simple machine translation to another language?

    --
    That was the turning point of my life--I went from negative zero to positive zero.
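The "stop once the probability is below a threshold" idea above might look like the loop below. It is only a sketch: `attribution_score` is a toy stand-in for whatever score a real stylometric classifier would return, and the marker phrases, rewrite list, and 0.5 threshold are invented for the example.

```python
def attribution_score(text, markers):
    """Toy stand-in for a classifier: share of the author's habitual
    marker phrases that still appear in `text` (0.0 to 1.0)."""
    lowered = text.lower()
    hits = sum(1 for marker in markers if marker in lowered)
    return hits / len(markers)


def obfuscate_until(text, markers, rewrites, threshold):
    """Apply rewrites one at a time, stopping once the score drops below `threshold`."""
    applied = 0
    for old, new in rewrites:
        if attribution_score(text, markers) < threshold:
            break
        text = text.replace(old, new)
        applied += 1
    return text, applied, attribution_score(text, markers)


# Hypothetical habits of one author, and substitutions that remove them.
markers = ["rather", "whilst", "indeed", "i reckon"]
rewrites = [
    ("whilst", "while"),
    ("Indeed, ", ""),
    ("I reckon", "I think"),
    ("rather", "quite"),
]
original = "Indeed, I reckon it is rather cold whilst the wind blows."

text, applied, score = obfuscate_until(original, markers, rewrites, threshold=0.5)
print(text)
print(applied, score)
```

With these inputs the loop stops after three of the four rewrites, leaving "rather" in place, because the score has already fallen to 0.25. Stopping early keeps more of the original wording, which matters if the result also has to stay readable.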
    1. Re:Seems simple to me by Hognoxious · · Score: 1

      An example of obfuscating: How about a simple machine translation to another language?

      That would certainly obfuscate it for people who didn't speak the other language.

      Were you suggesting a round trip? Things may have moved on, but I remember playing with this some years back and the results were changed way beyond style.

      --
      Confucius say, "Find worm in apple - bad. Find half a worm - worse."
  19. Re:I saw "Trump" in the title by mspohr · · Score: 1

    Maybe a computer could make sense of Palin's word salad.

    --
    I don't read your sig. Why are you reading mine?
  20. Re:I saw "Trump" in the title by balbeir · · Score: 1

    Clearly the English language has deteriorated into a hybrid of hillbilly, valleygirl, inner-city slang and various grunts.

  21. Re:I saw "Trump" in the title by bluefoxlucid · · Score: 1

    It is for this reason I've started a style guide to clear English. This guide includes communicative, informative, and persuasive styles, with a subsection on expletives for persuasive writing and speaking.

    Essentially, it's just Strunk and White, Dale Carnegie, and a few other pieces of broad research brought together. Informative style will provide the greatest difficulty, as I'll need to cobble it together from experience and abstract concepts, rather than other research. For example: SQ3R and its derivatives describe methods of study of informative texts (textbooks, essays, articles, etc.), and various books and papers on human memory have cited questioning and organization as ways to improve memorization; many writers incorporate these observations by asking and then answering questions--similar to the rhetorical question.

    My target audience encompasses copywriters of books, pamphlets, blogs, and news sites. The book *does* target general consumption, but I particularly want an improvement in mass media. We've reached an era where every person constantly faces the words of an educated man; yet the educated man now talks as the common man, instead of speaking in a way which the common man can easily understand. When the common man's speech deteriorates, the media deteriorates as well.

    It is perfectly well for the media to use the language of the common man, but the common man is served best by structuring that language to a higher standard, taking a form best suited to convey information clearly rather than to socialize. The common man is a man of intelligence, even if he is not a man of intellect: he can understand and learn, and he will imitate those behaviors which produce the greatest effect upon him and others. Expose him to clear, concise, vibrant writing and he will begin to speak in clear, concise, vibrant language, even if he is disinclined to study the use of language in such a way.

  22. Yes, but... by Bearhouse · · Score: 1

    Could it do anything for Trump's linguistics?

  23. Re:What does that headline say? by turning+in+circles · · Score: 1

    Brevity is the soul of obfuscation. "Can a program designed to obfuscate author identity defeat a program designed to verify author identity?"

    --
    Might as well face it I'm addicted to data.
  24. Supposing it works... by werepants · · Score: 1

    Supposing it works (not saying it's likely), this would be a big problem for catching plagiarists. Copy somebody's text, run it through this, and then hand it in: boom, you're done. You could certainly have anti-plagiarism software that runs this in reverse (or you take your database of comparison docs and run them all through the obfuscator, something along those lines) but if they do it right and there's some degree of randomness, it introduces a massive dose of plausible deniability to any plagiarism case even with these efforts.

    BTW, any typos, grammatical peculiarities, or other abnormalities with my post are due to my text obfuscation software. Don't blame me!

  25. What spam house is funding this? by drew_kime · · Score: 1

    If the intent is to obfuscate the style, just run it through a few languages and back as someone already suggested. But I'm guessing they want something that doesn't look like word salad.

    We call an obfuscation software

      1. safe, if a forensic analysis does not reveal the original author of its obfuscated texts,
      2. sound, if its obfuscated texts are textually entailed with their originals, and
      3. proper, if its obfuscated texts are inconspicuous.

    Yup, right there: proper. They're basically asking for someone to write the perfect Bayesian filter beater.

    --
    Nope, no sig
  26. Re:Wrong goal. by dgatwood · · Score: 2

    You touched a key point there, without actually saying it, which is that the ability of forensic linguistics to recognize a person is inversely proportional to the number of people who could have written the content.

    For example, let's say that you're a native Russian speaker, and that your English grammar has certain linguistic quirks that are typical of Russian speakers writing English, e.g. missing all the definite and indefinite articles ("We read book, da?"). If exactly one Russian has access to some piece of information that is contained in the piece of writing, you're screwed. If there are a hundred Russians with access, those particular linguistic quirks no longer provide much help at identifying the author.

    One possible takeaway is that the best way to leak something is to anonymously post evidence somewhere without comment, then separately anonymously report that you noticed it, and bring it to someone's attention. This potentially vastly broadens the pool of people with access to the information, and thus makes your linguistic quirks less meaningful. However, this requires a significant time delay between the two posts. Otherwise, one would still strongly suspect that the original poster made the "discovery". But if you can stand to wait a year or two, you're golden.

    --

    Check out my sci-fi/humor trilogy at PatriotsBooks.
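The "one Russian versus a hundred Russians" argument above is a base-rate calculation, and a small worked example shows how fast the evidence weakens. This is a sketch under simplifying assumptions that are not in the comment: every person with access is equally likely a priori, and the probabilities `p_quirk` and `p_other` (how often the quirk shows up in text by people who do and do not share it) are made-up numbers.

```python
def posterior_for_suspect(n_access, n_quirk, p_quirk=0.9, p_other=0.001):
    """P(a given quirk-sharing suspect wrote the text | the text shows the quirk).

    Bayes' rule with a uniform prior over `n_access` people, of whom `n_quirk`
    share the quirk. `p_quirk` and `p_other` are assumed likelihoods of the
    quirk appearing in text by someone with and without the habit.
    """
    evidence = n_quirk * p_quirk + (n_access - n_quirk) * p_other
    return p_quirk / evidence


# 200 people had access to the leaked information in both scenarios.
only_one = posterior_for_suspect(n_access=200, n_quirk=1)
one_of_hundred = posterior_for_suspect(n_access=200, n_quirk=100)
print(f"1 quirk-sharer:    {only_one:.3f}")
print(f"100 quirk-sharers: {one_of_hundred:.3f}")
```

With a single quirk-sharer the posterior is high; with a hundred it drops to roughly one in a hundred, since the quirk no longer separates the suspect from the other ninety-nine.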

  27. Re:Wrong goal. by dgatwood · · Score: 2

    Alternatively, delete all the definite and indefinite articles. Then they'll blame your one Russian coworker.

    --

    Check out my sci-fi/humor trilogy at PatriotsBooks.

  28. Re:verb you must identify by techno-vampire · · Score: 1

    German also puts the verb(s) at the end of the sentence. Translate your work into proper German, have a computer make a literal translation back to English and you'll get much the same thing as Yoda-speak.

    --
    Good, inexpensive web hosting
  29. Slashdot is going down the toilet by Soccerguy1832 · · Score: 1

    Enough with all the Trump articles jeez!

  30. Yes, plenty of research out there already by SlideRuleGuy · · Score: 1

    For one example, see

    "Obfuscating Document Stylometry to Preserve Author Anonymity"
    Gary Kacmarcik & Michael Gamon

    This technique is not an automated one, but hey, all you need is more software.

  31. Re:verb you must identify by HornWumpus · · Score: 1

    An average sentence, in a German newspaper, is a sublime and impressive curiosity; it occupies a quarter of a column; it contains all the ten parts of speech -- not in regular order, but mixed; it is built mainly of compound words constructed by the writer on the spot, and not to be found in any dictionary -- six or seven words compacted into one, without joint or seam -- that is, without hyphens; it treats of fourteen or fifteen different subjects, each inclosed in a parenthesis of its own, with here and there extra parentheses which reinclose three or four of the minor parentheses, making pens within pens: finally, all the parentheses and reparentheses are massed together between a couple of king-parentheses, one of which is placed in the first line of the majestic sentence and the other in the middle of the last line of it -- after which comes the VERB, and you find out for the first time what the man has been talking about; and after the verb -- merely by way of ornament, as far as I can make out -- the writer shovels in "haben sind gewesen gehabt haben geworden sein," or words to that effect, and the monument is finished. I suppose that this closing hurrah is in the nature of the flourish to a man's signature -- not necessary, but pretty. German books are easy enough to read when you hold them before the looking-glass or stand on your head -- so as to reverse the construction -- but I think that to learn to read and understand a German newspaper is a thing which must always remain an impossibility to a foreigner.

    Twain.

    --
    John McAfee 'It was like that time I hired that Bangkok prostitute; to do my taxes, while I fucked my accountant'
  32. Isn't This Content Spinning? by Tsu+Dho+Nimh · · Score: 1

    I hope part of the competition is to retain meaning and have correct grammar. Because if not, you might as well just do content spinning and declare it done.

  33. Re:I saw "Trump" in the title by TheRealHocusLocus · · Score: 1

    The book *does* target general consumption

    I look forward to devouring it!
    I, for one, also saw "Trump" in the title.
    That makes two.

    --
    <blink>down the rabbit hole</blink>