Slashdot Mirror


Coding Styles Survive Binary Compilation, Could Lead Investigators Back To Programmers (princeton.edu)

An anonymous reader writes: Researchers have created an algorithm that can accurately detect code written by different programmers (PDF), even if the code has been compiled into an executable binary. Because of open source coding repositories like GitHub, state agencies can build a database of all developers and their coding styles, and then easily compare the coding style used in "anti-establishment" software to detect the culprit. Despite all the privacy implications this research may have, the algorithm can also be used by security researchers to track down malware authors. We also discussed an earlier phase of this research.

8 of 164 comments

  1. Re:Heck by JaredOfEuropa · · Score: 3, Interesting

    Ideally there is such an enforced coding standard, but I have worked in situations with merged teams or projects where coding styles were rather mixed. From what I could see, cosmetic stuff like braces and indentations caused some annoyance but it didn't really lead to much lost coding time, increased effort in fixing or changing things, or an increase in bugs.

    Anyway, brace placement won't survive compilation so this method is useless for rooting out the K&R traitors.

    --
    If construction was anything like programming, an incorrectly fitted lock would bring down the entire building...
  2. Re:Fuck that. by SQLGuru · · Score: 1, Interesting

    It's versioned.......and cloned.......and forked. Good luck with that.

    I think it's funny (ironic, not ha ha) that many of the people espousing Open Source as being perfect are generally the same ones that have the biggest desire for digital privacy. And because of their push for OSS, they will be some of the first to lose their privacy.

    ** I think OSS has its place, as does closed source. I also have a desire for some privacy but recognize that I have to give up some of that privacy in order to have some level of convenience. But I'm not an extremist in either direction on either issue.

  3. Re:I doubt this by Lab Rat Jason · · Score: 4, Interesting

    This is why I steal most of my functional code from GitHub in the first place...

    --OR--

    Easy to avoid detection by simply NOT UPLOADING code to GitHub in the first place. The assumption that every dev does this is stupid.

    --
    Which has more power: the hammer, or the anvil?
  4. Re:I doubt this by Anonymous Coward · · Score: 3, Interesting

    As a systems admin I have been called upon at times to automate a few things. Doing so in my situation seemed easiest with vb.net. (Yes, I know the actual programmers here are recoiling with horror. Stuff it, the program works, and saves me vast amounts of time.)

    In that program, only around 5% of the lines are ones that I actually coded. Everything else is snippets of code from Microsoft's help files, question/answer sites, and similar open source programs found online. Unless they are checking for things like the fact that I included no error handling (since I am the only one that uses said program), I fail to see how this would work at all.

    The variable naming conventions are many and varied, there are almost no unique lines of code in the program, and it uses only standard libraries. On top of that, I don't have any other programs attributed to me floating around the interwebs, at least in vb.net, so there is nothing to compare it to.

    So how exactly are they going to tell you who wrote that monstrosity? In fact it sounds like their algorithm depends upon the author having other attributed works available out there. So if you don't put anything up on public code sharing sites, you have nothing to worry about.

  5. Re:I doubt this by Gr8Apes · · Score: 2, Interesting

    As such, they can diff and cut out any code that has been duplicated from elsewhere (just as they could with raw source code). Anything modified by you would remain. Because your coding style is, admittedly, quite different from that in the snippets, it will stand out as if it were glowing.

    The funniest thing about this is how wrong that statement is. Take me as an example: I've worked in multiple shops, several with different code formatting practices, not to mention potentially different languages. I generally configure my IDE to whatever code formatting requirements there are, so everything I add gets put into the current format. Naming practices are whatever is in the current codebase. So, essentially, from a source and binary perspective, my code will look like whatever the current codebase should. Snippets cut from anywhere will always be refactored to fit my needs, and thus may not look at all like what was snipped.

    In short, this is a whole barrel of snake oil for anyone actually working professionally and not that rare lone wolf that only codes their own specific way.
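
    The diff-and-cut step quoted at the top of this comment can be sketched roughly as follows. This is only an illustration on source lines, not the researchers' method (which works on compiled binaries); `strip_known_lines` and the sample strings are hypothetical. It also shows the objection: once a copied snippet is refactored, an exact-match filter no longer recognizes it.

```python
def strip_known_lines(program: str, known_snippets: list[str]) -> list[str]:
    """Return the program's non-blank lines that do not appear verbatim
    (ignoring surrounding whitespace) in any known snippet."""
    known = {
        line.strip()
        for snippet in known_snippets
        for line in snippet.splitlines()
        if line.strip()
    }
    return [
        line.strip()
        for line in program.splitlines()
        if line.strip() and line.strip() not in known
    ]


# Hypothetical snippet copied from a public Q&A site.
snippet = "For Each f In files\n    total += f.Length\nNext"

# Program that pastes the snippet unchanged.
pasted = "Dim total As Long = 0\n" + snippet + "\nConsole.WriteLine(total)"

# Same program after the snippet was refactored (loop variable renamed).
refactored = pasted.replace("f In", "item In").replace("f.Length", "item.Length")

own_lines = strip_known_lines(pasted, [snippet])
own_lines_refactored = strip_known_lines(refactored, [snippet])
```

    With the unchanged paste, only the two hand-written lines are left over; after the rename, the loop lines are left over as well, so they would be counted as the author's own.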

    --
    The cesspool just got a check and balance.
  6. What if the style was "idiomatic"? by mark-t · · Score: 3, Interesting

    It seems to me like the easiest way to avoid being identified in this regard would be to write code that follows any published general style guidelines or otherwise very common conventions.

    As a side effect, it will make your source code more readable to others, which is beneficial if you are on a programming team.

  7. possible upside by Gravis Zero · · Score: 3, Interesting

    while i don't think you'll be able to identify an exact person, i do think this technology could be used to identify code that is prone to error and exploitation or even code that is for exploitation.

    --
    Anons need not reply. Questions end with a question mark.
  8. Re:Frist! by ShanghaiBill · · Score: 4, Interesting

    False positives are not a problem if you deal with them rationally. If a woman is murdered, and the DNA matches one in a million, then in a country of 300 million, there will be 300 matches, and 299 false positives. But if only one lives in the same city, and it happens to be her ex-boyfriend, then the DNA match is useful information.
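
    The arithmetic above can be checked with a short sketch. The country size and match rate come from the comment; the city population of one million is a made-up figure, and `expected_matches` is a hypothetical helper, not part of any real forensic tool.

```python
def expected_matches(population: int, one_in: int) -> int:
    """Expected number of people in a population who match a profile
    that occurs once per `one_in` people."""
    return population // one_in


country_population = 300_000_000   # from the comment
match_rate_one_in = 1_000_000      # "one in a million"

matches = expected_matches(country_population, match_rate_one_in)
false_positives = matches - 1      # assumes exactly one match is the real source

# Chance that a randomly chosen match is the source, given the DNA alone.
chance_from_dna_alone = 1 / matches

# Assumption for illustration: the city has one million residents,
# so about one match is expected there.
city_population = 1_000_000
matches_in_city = expected_matches(city_population, match_rate_one_in)

print(matches, false_positives, matches_in_city)
```

    On its own the match narrows the field to a few hundred people; combined with independent evidence such as location, the same match becomes far more informative.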