Slashdot Mirror


Coding Styles Survive Binary Compilation, Could Lead Investigators Back To Programmers (princeton.edu)

An anonymous reader writes: Researchers have created an algorithm that can accurately detect code written by different programmers (PDF), even if the code has been compiled into an executable binary. Because of open source coding repositories like GitHub, state agencies can build a database of all developers and their coding styles, and then easily compare the coding style used in "anti-establishment" software to detect the culprit. Despite all the privacy implications this research may have, the algorithm can also be used by security researchers to track down malware authors. We also discussed an earlier phase of this research.

22 of 164 comments

  1. Frist! by Anonymous Coward · · Score: 2, Insightful

    Going to be lots of false positives on this one.

    1. Re:Frist! by Aighearach · · Score: 2

      lol yeah. The blathering about governments is just somebody getting silly and running their mouth about stupid shit. Newsflash, if a person sounds like a conspiracy theorist? They're probably not a good data source.

      This is great technology for figuring out which one of 5 people wrote a particular method/function. And I have no doubt that governments will use this technology to mislead juries into believing it is like a fingerprint, by using the word "fingerprint" near the name of their test in sentences, but they'll only be using it to reinforce whatever evidence they used to find the person to accuse in the first place.

      This would be more likely to have real-world impact in the hands of a large corporation's recruiting department. You don't necessarily want to hire away all of your competitor's team, just the best few people. With this, you might be able to tease out who wrote which parts of their product, especially if you also have code samples from the 20% that applied with both companies, or from FLOSS code.

    2. Re:Frist! by ShanghaiBill · · Score: 4, Interesting

      False positives are not a problem if you deal with them rationally. If a woman is murdered, and the DNA matches one in a million, then in a country of 300 million, there will be 300 matches, and 299 false positives. But if only one lives in the same city, and it happens to be her ex-boyfriend, then the DNA match is useful information.
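
      The arithmetic above can be checked with a few lines of Python. This is only a sketch of the base-rate reasoning using the figures from the comment; the city population is a made-up number for illustration, and the model assumes exactly one true source.

```python
# Base-rate sketch using the figures quoted in the comment.
population = 300_000_000   # country size from the comment
one_in = 1_000_000         # a "one in a million" DNA match

# Expected number of people in the country whose DNA matches.
expected_matches = population // one_in

# Assumption: exactly one of those matches is the real source.
false_positives = expected_matches - 1

# Chance that a randomly chosen match is the real source,
# before any other evidence is considered.
chance_match_is_source = 1 / expected_matches

# Hypothetical city size (not from the comment): narrowing the pool
# shrinks the number of coincidental matches you would expect.
city_population = 500_000
expected_chance_matches_in_city = city_population / one_in

print(expected_matches, false_positives)
print(chance_match_is_source)
print(expected_chance_matches_in_city)
```

      The match alone is weak evidence across the whole country, but once the pool is narrowed by other facts, very few coincidental matches are expected.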

    3. Re:Frist! by tattood · · Score: 2

      Going to be lots of false positives on this one.

      So I can easily avoid this trap by never hosting any code on Github?

      --
      WTB [sig], PST!!!
  2. Privacy implications? by Registered+Coward+v2 · · Score: 3, Insightful

    People have been analyzing writing styles for a long time to try to identify authors. Expecting your coding style to be obfuscated by compiling it has proven to be as wrong as thinking your identity is shielded if you publish under a pseudonym. If you make your code publicly available you really shouldn't have any expectation of privacy.

    --
    I'm a consultant - I convert gibberish into cash-flow.
    1. Re:Privacy implications? by Anonymous Coward · · Score: 4, Insightful

      I doubt it. Your code once compiled will be very similar to most other similarly skilled programmers in that language, unless you go out of your way to obfuscate things - i.e. a poor coder. Compilers, libraries, APIs, language versions and proprietary extensions are beyond your coding style. This entire premise assumes there will be no false positives, which will be the vast majority of hits. So basically, they're casting nets, and claiming success when they get one, ignoring the other thousand. Once you're at the binary, coding style has all but gone (assuming you're not doing assembler, which even then, will come down to the same few solutions to a given functional requirement).

  3. Heck by vikingpower · · Score: 3, Funny

    gotta change my indentation style and public void( String s1 ) whitespace habit, now the guvnmunt automagically can also get these out of binaries built from my code. O gawd, I'm afraid now.

    --
    Religious speak to God. Insane are spoken to by God. When all shut up, one can finally hear Shostakovich in peace
    1. Re:Heck by JaredOfEuropa · · Score: 3, Interesting

      Ideally there is such an enforced coding standard, but I have worked in situations with merged teams or projects where coding styles were rather mixed. From what I could see, cosmetic stuff like braces and indentations caused some annoyance but it didn't really lead to much lost coding time, increased effort in fixing or changing things, or an increase in bugs.

      Anyway, brace placement won't survive compilation so this method is useless for rooting out the K&R traitors.

      --
      If construction was anything like programming, an incorrectly fitted lock would bring down the entire building...
  4. Fuck that. by Anonymous Coward · · Score: 3, Insightful

    Aren't we being tracked enough as it is?
    Why for fucks sake why?

    My New Year's resolution will be to remove all my code from all public repositories.

  5. Accuracy 52% with 600 programmers and 8 samples by El_Muerte_TDS · · Score: 4, Insightful

    Good luck when your programmer pool is a couple of thousand and your samples consist of obfuscated and underhanded software, which is often produced by malware creators.

  6. Stackoverflow is the culprit! by WarmBoota · · Score: 4, Funny

    Good luck tracking me!! I copy all of my code from Stackoverflow!

    --
    90% of everything is crap. Also, crap is relative.
  7. Oh really? by Viol8 · · Score: 4, Insightful

    If you RTFA it seems their sample size was 20 programmers. Occasionally they went up to 100 and they're getting something like 60-80% accuracy. BFD.

    Guys - when you've sampled the compiled, optimised binary output (with all debug info stripped) of a million coders all using different compilers on different architectures and are getting at least a 99% accuracy rate, get back to us. In the meantime, I'm sure you'll get some nice marks from your supervisors but I won't be losing any sleep.

    1. Re: Oh really? by corychristison · · Score: 2

      Don't forget different versions of the same compiler. E.g. gcc-1.0 may have a different binary output than gcc-2.0.

  8. Re:I doubt this by Lab+Rat+Jason · · Score: 4, Interesting

    This is why I steal most of my functional code from GitHub in the first place...

    --OR--

    Easy to avoid detection by simply NOT UPLOADING code to GitHub in the first place. The assumption that every dev does this is stupid.

    --
    Which has more power: the hammer, or the anvil?
  9. Re:I doubt this by Anonymous Coward · · Score: 3, Interesting

    As a systems admin I have been called upon at times to automate a few things. Doing so in my situation seemed easiest with vb.net. (Yes I know the actual programmers here are recoiling with horror. Stuff it, the program works, and saves me vast amounts of time)

    In that program somewhere around 5% are lines that I actually coded. Everything else is snippets of code from Microsoft's help files, question/answer sites, and similar opensource programs found online. Unless they are checking for things like the fact that I included no error handling (since I am the only one that uses said program) I fail to see how this would work at all.

    The variable naming conventions are many and varied, there are almost no unique lines of code in the program, and it uses only standard libraries. On top of that I don't have any other programs attributed to me floating around the interwebs, at least in vb.net, so there is nothing to compare it to.

    So how exactly are they going to tell you who wrote that monstrosity? In fact it sounds like their algorithm depends upon the author having other attributed works available out there. So if you don't put anything up on public code sharing sites, you have nothing to worry about.

  10. Re:I doubt this by unrtst · · Score: 2

    In that program somewhere around 5% are lines that I actually coded. Everything else is snippets of code from Microsoft's help files, question/answer sites, and similar opensource programs found online. Unless they are checking for things like the fact that I included no error handling (since I am the only one that uses said program) I fail to see how this would work at all.

    I strongly suspect this is precisely the kind of code that they will most easily be able to associate to individuals.
    Through their deep analysis of public code, I would strongly suspect that they have cached those segments, like any good search engine or data analysis would do. As such, they can diff and cut out any code that has been duplicated from elsewhere (just as they could with raw source code). Anything modified by you would remain.
    Because your coding style is, admittedly, quite different from that in the snippets, it will stand out as if it were glowing.

    That said, if you were a more competent** programmer, then it'd be more difficult to distinguish your code from everything else as it'd all follow best practices.
    ** or some other appropriate word that means you code in similar style to well established design patterns
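
    A rough sketch of the "diff and cut out" idea, in Python. This is my own illustration on source text, not anything the researchers describe: the function names, the whitespace/case normalization, and the sample VB.NET lines are all hypothetical, and a real system working on binaries would need something far more involved.

```python
def normalize(line: str) -> str:
    """Collapse whitespace and case so trivial reformatting does not hide a copy."""
    return " ".join(line.split()).lower()


def residual_lines(program: str, known_snippets: list[str]) -> list[str]:
    """Return the lines of `program` that do not appear in any known snippet."""
    known = {
        normalize(line)
        for snippet in known_snippets
        for line in snippet.splitlines()
        if line.strip()
    }
    return [
        line
        for line in program.splitlines()
        if line.strip() and normalize(line) not in known
    ]


# Hypothetical cached snippet, e.g. from a help file or Q&A site.
known_snippets = [
    "Dim reader As New StreamReader(path)\nDim text = reader.ReadToEnd()",
]

# Hypothetical program: two copied lines (one reindented) plus one original line.
program = (
    "Dim reader As New StreamReader(path)\n"
    "  dim text = reader.ReadToEnd()\n"
    "Dim myCount = text.Length * 2"
)

leftover = residual_lines(program, known_snippets)
print(leftover)
```

    Whatever survives the filter is the part most likely to carry the author's own habits.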

  11. Re:Volkswagen Code? by BitterKraut · · Score: 2

    At 32c3 https://www.youtube.com/watch?... , Daniel Lange and Felix Domke presented their analysis of Volkswagen's "Dieselgate" software. It seems that that one doesn't look like ordinary code at all, but rather like code patterns generated from tables that relate sensory data to engine control parameters. Think of one of the earliest motivations for building computing machines in the first place: To create parameter tables for artillery aiming!

  12. Re:I doubt this by Gr8Apes · · Score: 2, Interesting

    As such, they can diff and cut out any code that has been duplicated from elsewhere (just as they could with raw source code). Anything modified by you would remain. Because your coding style is, admittedly, quite different from that in the snippets, it will stand out as if it were glowing.

    The funniest thing about this is how wrong that statement is. Take me as an example: I've worked in multiple shops, several with different code formatting practices, not to mention potentially different languages. I generally configure my IDE to whatever code formatting requirements there are, so everything I add gets put into the current format. Naming practices are whatever is in the current codebase. So, essentially, from a source and binary perspective, my code will look like whatever the current code base should. Snippets cut from anywhere will always be refactored to fit my needs, and thus may not look at all like what was snipped.

    In short, this is a whole barrel of snake oil for anyone actually working professionally and not that rare lone wolf that only codes their own specific way.

    --
    The cesspool just got a check and balance.
  13. What if the style was "idiomatic"? by mark-t · · Score: 3, Interesting

    It seems to me like the easiest way to avoid being identified in this regard would be to write code that follows any published general style guidelines or otherwise very common conventions.

    As a side effect, it will make your source code more readable to others, which is beneficial if you are on a programming team.

  14. possible upside by Gravis+Zero · · Score: 3, Interesting

    while i don't think you'll be able to identify an exact person, i do think this technology could be used to identify code that is prone to error and exploitation or even code that is for exploitation.

    --
    Anons need not reply. Questions end with a question mark.
  15. Uh huh. So what happens when... by thermowax · · Score: 2

    ...you run the object code through a permuter like shikata ga nai?

    I suspect the successful detection rate may be a bit lower.

  16. Re:Of limited use, but an interesting comment on C by deodiaus2 · · Score: 2

    Did the same team that developed that code also run an accuracy assessment? Was there a "prize" (contract payment) associated with meeting certain accuracy? I remember reading about facial recognition systems which worked well in labs, but failed in the field.
    As soon as developers become aware that they might be identified, I think that they might do things (spoof, run beautify and strip comments) to throw such a system off.