Slashdot Mirror


Anonymous No More: Your Coding Style Can Give You Away

itwbennett writes: Researchers from Drexel University, the University of Maryland, the University of Goettingen, and Princeton have developed a "code stylometry" that uses natural language processing and machine learning to determine the authors of source code based on coding style. To test how well their code stylometry works, the researchers gathered publicly available data from Google's Code Jam, an annual programming competition that attracts a wide range of programmers, from students to professionals to hobbyists. Looking at data from 250 coders over multiple years, averaging 630 lines of code per author, their code stylometry achieved 95% accuracy in identifying the author of anonymous code (PDF). Using a dataset with fewer programmers (30) but more lines of code per person (1,900), the identification accuracy rate reached 97%.

35 of 220 comments

  1. Can they do it with corporate code? by msobkow · · Score: 5, Interesting

    Can they do it with corporate code where there are naming and style standards in abundance, and code reviews to ensure those guidelines are followed?

    --
    I do not fail; I succeed at finding out what does not work.
    1. Re:Can they do it with corporate code? by Marginal+Coward · · Score: 4, Funny

      It seems like using the applicable features of the corporate version control system would be a lot easier - and possibly even better than 95% accurate.

    2. Re:Can they do it with corporate code? by Penguinisto · · Score: 2

      That's what "git blame" is for...

      /me ducks and runs like hell...

      --
      Quo usque tandem abutere, Nimbus, patientia nostra?
    3. Re:Can they do it with corporate code? by dark.nebulae · · Score: 2

      I've always found that even with style guidelines in place, developers will still leave their fingerprints all over it.

      Some devs will be verbose in their comments, some less. Some devs will embrace IoC where others shun it. Some devs will create a single method with all code in it, some will refactor the heck out of it with many methods. Heck, devs can't even agree sometimes on what should be public, protected, and private (and rarely will style guidelines dictate this kind of thing).

    4. Re:Can they do it with corporate code? by grimmjeeper · · Score: 4, Informative

      You obviously haven't had to work in an environment where code has to be certified. I can tell you from first-hand experience that coding in an RTCA DO-178B environment or similar demands strict adherence to some very pedantic coding requirements. You'll find this type of development in avionics systems (both civilian and military) as well as other industries like medical electronics where code safety is literally life-and-death.

      Outside of that type of environment, I do agree with you. You'd be lucky if even half of the developers have seen a company coding standard. You'd be hard pressed to find any developers who really adhere to it even when they know the document exists. But in those small niche markets, you'd be surprised at how strictly they adhere to arbitrary coding standards (whether they really impact code quality or safety or not).

    5. Re:Can they do it with corporate code? by jellomizer · · Score: 4, Interesting

      Perhaps not as well. If people are following the coding standards for the organization then the code for the most part looks far more similar.

      When I am working with a development team, I tend to adjust my own style to better match what everyone else is doing, even if it means using coding methods that I would normally disagree with.

      If the code tends to use a bunch of GOTOs instead of procedures or classes, I will use GOTOs too, not for my benefit but for the people who will maintain my code later on, so they won't have to change their mindset and debugging strategies to see what the program is doing when they make future corrections.

      I will go full Object Oriented if the group of people that I am working with do their coding full OO.

      My personal style would be more procedural than OO, not due to lack of knowledge or not realizing OO's advantages and disadvantages. But if I am to code on my own, I code in the way that my mind handles the requirements, and in the way I feel will make it easier for me to change and fix my code in the future.

      I think this method is best for identifying authors of personal code, as opposed to group corporate code, where a lot of your particular style is hidden.

      --
      If something is so important that you feel the need to post it on the internet... It probably isn't that important.
    6. Re:Can they do it with corporate code? by war4peace · · Score: 3, Insightful

      *raising hands slowly* Is there a problem, Coding Officer?

      --
      ...gis sdrawkcab (usually not responding to ACs; don't bother posting as AC)
    7. Re: Can they do it with corporate code? by Anonymous Coward · · Score: 2, Funny

      Drats! I was sure that everyone else wrote stuff like "if(user == 'dumbfuck'){exit 666};"

    8. Re:Can they do it with corporate code? by bhcompy · · Score: 2

      Why is it illegal?

    9. Re:Can they do it with corporate code? by Mr+Z · · Score: 2

      Did you read the part in the article where they're actually doing the matching based on the ASTs (abstract syntax trees), and so are able to identify authors even after the code goes through an obfuscator? Relevant quotes:

      Their real innovation, though, was in developing what they call “abstract syntax trees” which are similar to parse tree for sentences, and are derived from language-specific syntax and keywords. These trees capture a syntactic feature set which, the authors wrote, “was created to capture properties of coding style that are completely independent from writing style.” The upshot is that even if variable names, comments or spacing are changed, say in an effort to obfuscate, but the functionality is unaltered, the syntactic feature set won’t change.

      Accuracy rates weren’t statistically different when using an off-the-shelf C++ code obfuscators. Since these tools generally work by refactoring names and removing spaces and comments, the syntactic feature set wasn’t changed so author identification at similar rates was still possible.

      Regarding the first quote: The author of the article probably didn't realize that ASTs aren't a new thing; it's just this application of ASTs that's new. ASTs are as old as the hills. I learned about them from the Dragon Book, and by the time that was written they were old hat.
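
      The point in the quoted passage, that renaming variables and changing comments or spacing leaves the syntactic features alone, can be illustrated with a small sketch. It uses Python's standard `ast` module as a stand-in for a C++ parser, and the `syntactic_shape` helper is a made-up name; it is not the researchers' feature set, only a demonstration of why a tree-based view ignores surface changes.

```python
import ast

def syntactic_shape(source):
    """Return the sequence of AST node type names.

    Identifier names, comments, and whitespace play no part in the result;
    only the structure of the parse tree does.
    """
    tree = ast.parse(source)
    return [type(node).__name__ for node in ast.walk(tree)]

original = """
def total(values):
    result = 0
    for v in values:
        if v > 0:
            result += v
    return result
"""

# Same logic with new names, an added comment, and different layout.
renamed = """
def f(a):
    # comments and spacing changed too
    b = 0
    for c in a:
        if c > 0: b += c
    return b
"""

# Same behavior, but written with a different construct.
restructured = """
def total(values):
    return sum(v for v in values if v > 0)
"""

print(syntactic_shape(original) == syntactic_shape(renamed))       # True
print(syntactic_shape(original) == syntactic_shape(restructured))  # False
```

      Renaming and reformatting leave the node sequence identical, while rewriting the loop as a generator expression changes it, which is the kind of structural habit a name-scrambling obfuscator does not touch.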

    10. Re:Can they do it with corporate code? by RabidReindeer · · Score: 2

      A sonnet has strict rules, too.

      But I'd wager that someone could tell one of Shakespeare's from one of yours.

  2. Up next, automatic intelligence rating... by TWX · · Score: 4, Funny

    ...based on the quality of that code...

    --
    Do not look into laser with remaining eye.
    1. Re:Up next, automatic intelligence rating... by halivar · · Score: 4, Funny

      goto blah;
      ^^ Idiot.

      // If you don't know why this is here, don't fuck with it.
      goto blah;

      ^^ Code guru.

    2. Re:Up next, automatic intelligence rating... by lgw · · Score: 4, Insightful

      For lack of mod points let me just say: beautiful!

      It's like this in any engineering discipline:
      * The apprentice doesn't do things by the book, for he thinks himself clever
      * The journeyman does everything by the book, for he has learned the world of pain the book prevents
      * The master goes beyond the book, for he understands why every rule is there and no longer needs the rules

      Or put another way - the apprentice thinks he knows everything, the journeyman knows how little he knows, the master knows everything in the field, and still knows how little he knows.

      --
      Socialism: a lie told by totalitarians and believed by fools.
    3. Re:Up next, automatic intelligence rating... by russotto · · Score: 2

      The guru knows the novice knows more than the corporate enterprise architect, but won't let on lest the novice get a more-swelled head.

  3. Re:Lol? by TWX · · Score: 2

    Heh. If it's effective in a clusterfuck of copy/paste, then it should be really effective when the bulk of the code is original...

    Sounds like the solution is to use an entirely different language than the bulk of one's work is in, if one wants to anonymously write malicious or otherwise legally complicated code.

    --
    Do not look into laser with remaining eye.
  4. What about Bitcoin? by Anonymous Coward · · Score: 5, Funny

    Can we use this to find Satoshi?

  5. Re:Next thing you know by Anonymous Coward · · Score: 2, Funny

    Why would they even bother with an algorithm to process your ramblings? Every time I see you post, I instantly think "oh here's this jerk again".

  6. No Kidding by invid · · Score: 4, Insightful

    I can usually tell who wrote the code in the office by whether or not they put a space after their ifs: if(i == 0) vs if (i == 0); where they put their brackets, whether or not they replace their tabs with spaces, how they deal with bools: if (!var) vs if (var == false) and several other telling signs. There are so many combinations of variations no two programmers in the office (about 12 of us) have the same style.
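
    The tells listed above (space after `if`, brace placement, tabs vs. spaces, `!var` vs. `var == false`) are easy to count mechanically. The following is a rough sketch only: the `layout_features` name, the feature names, and the two sample snippets are invented for illustration, and the regular expressions are far cruder than a real tokenizer.

```python
import re

def layout_features(source):
    """Count a few surface-level style habits in C-like source text."""
    m = re.MULTILINE
    return {
        "if_with_space": len(re.findall(r"\bif \(", source)),
        "if_without_space": len(re.findall(r"\bif\(", source)),
        "negation_with_bang": len(re.findall(r"\bif\s*\(\s*!", source)),
        "compare_to_false": len(re.findall(r"==\s*false\b", source)),
        "brace_on_own_line": len(re.findall(r"^[ \t]*\{[ \t]*$", source, flags=m)),
        "tab_indented_lines": len(re.findall(r"^\t+", source, flags=m)),
        "space_indented_lines": len(re.findall(r"^ +", source, flags=m)),
    }

# Two hypothetical programmers writing the same logic.
sample_a = "if(i == 0) {\n\tdone = true;\n}\nif(!ready) {\n\twait();\n}\n"
sample_b = (
    "if (i == 0)\n{\n    done = true;\n}\n"
    "if (ready == false)\n{\n    wait();\n}\n"
)

for name, sample in (("a", sample_a), ("b", sample_b)):
    print(name, layout_features(sample))
```

    Even these seven counts separate the two samples completely, which fits the observation that a dozen people rarely share every habit.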

    --
    The Moore-Murphy Law: The number of things that will go wrong will double every 2 years.
    1. Re:No Kidding by ihtoit · · Score: 2

      coding to book (sans comments) will kill the process of identifying authors stone dead, I think. If everybody's "Hello World!" was identical, how do you tell the difference?

      --
      Political debates have me rolling my eyes so much I think I got optical whiplash. I should sue. - Foamy The Squirrel
    2. Re:No Kidding by disambiguated · · Score: 2

      Style guidelines should be about avoiding pitfalls of the language, using appropriate idioms, and not making life miserable for maintainers, not about where you put spaces and braces.

  7. Re:Demonstrates the need... by Anonymous Coward · · Score: 5, Insightful

    This is why people need to follow style guides, so that all source code is styled the same.

    There's a damn good chance 95% of coders are not criminals, nor would they care if someone identified their code.

    That said, this will become a legal nightmare when this kind of profiling can be used to frame another coder.

    And with the laws wanting to treat any "hacker" as a potential terrorist these days, the consequences of even being accused can be rather severe to deal with.

  8. That explains it by Tablizer · · Score: 2

    I suppose all those "// damn U bill gates!" comments gave me away

  9. Welcome to the party by meerling · · Score: 2

    When I was a kid in the 80s we figured out we could identify who wrote a particular piece of software by looking at its code. We used those individualistic and identifiable features to support the art side in the argument over whether programming is an art or a science.

    1. Re:Welcome to the party by Virtucon · · Score: 4, Insightful

      It's all about style. Writing software is very creative and it needs to have the author's fingerprints on it somewhere. If corporations don't like that they can suck the source code into a parser and spit out perfectly mundane crap that loses the intonation and the thoughts the original developer had for it.

      --
      Harrison's Postulate - "For every action there is an equal and opposite criticism"
  10. John Varley Press Enter by Crashmarik · · Score: 3, Informative

    1985 Hugo Winner

    Really, the fact that coding style is recognizable was so well known it made it into pop culture 30 years ago.

    Also, on the smaller sample size the program might just be recognizing the parts of the style that come from the corporate standards. It would be interesting to see if it could recognize code from people who all work at the same company.

  11. Re:Not my Frankencode... by Tablizer · · Score: 3, Funny

    ... a patchwork of open-source freebies.

    So, what's it like to work for FaceBook?

  12. Re:Demonstrates the need... by Impy+the+Impiuos+Imp · · Score: 5, Insightful

    You want scary? The same can be applied to general text on the Internet, tying posters on different sites together, including linking an anonymous avatar (one not under your real name) to a site that has your real name.

    Which the NSA probably has churning away on its databases. Which probably does little more than add confirmation of said links from watching and recording all traffic to any and all of a billion IP addresses.

    And I, for one, welcome our new panopticon overlords who won't abuse it, not one of their thousand agents, because they're supposed to check a got-a-warrant box on a piece of paper before choosing to abuse it.

    --
    (-1: Post disagrees with my already-settled worldview) is not a valid mod option.
  13. Bad Coders Can't Be Identified by TrollstonButterbeans · · Score: 3, Interesting

    If your coding is terrible and very newbie like, they can't single you out since your code is similar to the ocean of other terrible coders.

    So if you are a paranoid freak, the best way to ensure your safety and keep the government off your back is to write terrible code.

    --
    Priest: "Universe from nothing, no laws of physics, sped up time"+ huge discrepancies. Creationism? No. Big Bang Theory
  14. Oblig XKCD by Krazy+Kanuck · · Score: 2

    Not that many of us actually use comments.... http://xkcd.com/1421/

  15. Most programming isn't new code by jgotts · · Score: 3, Insightful

    Most programming isn't writing new code. Most programming is working on someone else's crap you inherited. Invariably, you're going to be using that person's style or else the result will look like garbage.

    There is also the problem that most non-trivial code is worked on by multiple people at the same time.

    Writing some code from scratch as an assignment is a very artificial exercise nowadays, unless you're in a classroom setting. Therefore, you're going to get a signature from a programmer doing atypical work.

  16. What complete and utter bullshit. by MouseTheLuckyDog · · Score: 2

    95% of 250 coders. That means that out of a million programmers they will misidentify 50,000.

    I suspect that there are too few variances in style to make any coder's style unique. For example, whether to use braces on a one-line statement after an if in C.

    With a few programmers it's likely to work, but when the possible source of programmers is the world...

    Not to mention emacs, Visual Studio and such enforcing some indentation standards and programming languages enforcing others.
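
    For reference, the back-of-the-envelope extrapolation works out as follows. This is a sketch under a loud assumption: that the 95% figure measured on 250 candidates would hold unchanged for a million, which nothing in the summary claims. The function name is made up.

```python
def naive_misidentified(population, accuracy_percent):
    """Misidentifications expected if the accuracy held unchanged at any scale.

    Integer arithmetic keeps the result exact (fractions are rounded down).
    """
    return population * (100 - accuracy_percent) // 100

print(naive_misidentified(1_000_000, 95))  # 50000
print(naive_misidentified(250, 95))        # 12 (12.5 rounded down)
```

    A 5% error rate gives 50,000 out of a million; whether the real rate at that scale would be anywhere near 5% is a separate question, since picking one author out of a million candidates is a much harder task than picking one out of 250.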

  17. So you could use this tool to make your code anon. by Maxo-Texas · · Score: 4, Interesting

    Write a version of pretty-printer that rerenders your code into a different style.

    Have a lexicon of misspelled words for each "personality".

    Another lexicon of variable names.
    a vs inta vs int_a vs x.

    Refactoring and unfactoring for subroutines.

    Run the comments through Google Translate and back to English.
    Ukrainian
    Japanese
    Chinese

    Synonym and antonym substitution in the comments.

    The mind dances at the possibilities to mess with this algorithm.
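
    A minimal sketch of the renaming and re-rendering steps, using Python's standard `ast` module instead of a C or C++ toolchain. The `RenameIdentifiers` class, the `rerender` helper, and the `persona` lexicon are all hypothetical names; `ast.unparse` needs Python 3.9 or later. Comments are dropped rather than translated, because the parser discards them.

```python
import ast

class RenameIdentifiers(ast.NodeTransformer):
    """Rename variables and parameters according to a per-persona lexicon."""

    def __init__(self, lexicon):
        self.lexicon = lexicon

    def visit_Name(self, node):
        # Variable reads and writes.
        node.id = self.lexicon.get(node.id, node.id)
        return node

    def visit_arg(self, node):
        # Function parameters.
        node.arg = self.lexicon.get(node.arg, node.arg)
        return node

def rerender(source, lexicon):
    """Parse, rename, and pretty-print in the unparser's uniform layout."""
    tree = RenameIdentifiers(lexicon).visit(ast.parse(source))
    return ast.unparse(tree)

source = (
    "def total(values):\n"
    "    result = 0  # running sum\n"
    "    for v in values:\n"
    "        result += v\n"
    "    return result\n"
)
persona = {"values": "xs", "result": "acc", "v": "x"}

disguised = rerender(source, persona)
print(disguised)

def shape(src):
    return [type(n).__name__ for n in ast.walk(ast.parse(src))]

# Names, comments, and layout are gone, but the tree shape is untouched.
print(shape(source) == shape(disguised))  # True
```

    The last line is the catch: renaming and reformatting alone leave the syntax tree as it was, so the "refactoring and unfactoring" step is the one that would actually have to do the work against a classifier built on syntactic features like the one described in the article.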

    --
    She was like chocolate when she drank... semi-sweet at first and then increasingly bitter.
  18. Hah. I write everything in Fortran.. by toonces33 · · Score: 2

    and then use F2C to convert it to C code before I check in.. Try analyzing that!

  19. Pointless, but no doubt true by Kittenman · · Score: 2

    Wouldn't any programmer worth their salt identify themselves in the comments, or (if not) be logged as the last guy in that code on such-and-such a date, while working on such-and-such a patch number? (E.g., 'kittenman was here, 1/Jan/15, fixing Steve's crap').

    But I hope my code is easily recognizable. I'm proud of it. It may not be the smartest, slickest, quickest there is, but it's mine. And it works.

    --
    "The greatest lesson in life is to know that even fools are right sometimes" - Winston Churchill