Coding Styles Survive Binary Compilation, Could Lead Investigators Back To Programmers (princeton.edu)
An anonymous reader writes: Researchers have created an algorithm that can accurately detect code written by different programmers (PDF), even if the code has been compiled into an executable binary. Because of open source coding repositories like GitHub, state agencies can build a database of all developers and their coding styles, and then easily compare the coding style used in "anti-establishment" software to detect the culprit. Despite all the privacy implications this research may have, the algorithm can also be used by security researchers to track down malware authors.
We also discussed an earlier phase of this research.
Good luck when your programmer pool is a couple of thousand and your samples consist of obfuscated and underhanded software, which is what malware creators often produce.
Good luck tracking me!! I copy all of my code from Stack Overflow!
90% of everything is crap. Also, crap is relative.
If you RTFA it seems their sample size was 20 programmers. Occasionally they went up to 100 and they're getting something like 60-80% accuracy. BFD.
Guys - when you've sampled the compiled, optimised binary output (with all debug info stripped) of a million coders all using different compilers on different architectures and are getting at least a 99% accuracy rate, get back to us. In the meantime, I'm sure you'll get some nice marks from your supervisors but I won't be losing any sleep.
I doubt it. Your code, once compiled, will be very similar to that of most other similarly skilled programmers in that language, unless you go out of your way to obfuscate things - i.e. you're a poor coder. Compilers, libraries, APIs, language versions and proprietary extensions are beyond your coding style. This entire premise assumes there will be no false positives, when in fact they will be the vast majority of hits. So basically, they're casting nets and claiming success when they get one hit, ignoring the other thousand. Once you're at the binary, coding style has all but gone (assuming you're not doing assembler, and even then it will come down to the same few solutions to a given functional requirement).
This is why I steal most of my functional code from GitHub in the first place...
--OR--
Easy to avoid detection by simply NOT UPLOADING code to GitHub in the first place. The assumption that every dev does this is stupid.
Which has more power: the hammer, or the anvil?
False positives are not a problem if you deal with them rationally. If a woman is murdered, and the DNA matches one in a million, then in a country of 300 million, there will be 300 matches, and 299 false positives. But if only one lives in the same city, and it happens to be her ex-boyfriend, then the DNA match is useful information.
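To make the arithmetic in that DNA example concrete, here is a rough Python sketch. The function names are made up for illustration, and it assumes the real culprit is among the matches and that, absent other evidence, every match is equally likely to be the culprit. It only shows how narrowing the candidate pool changes the odds; it says nothing about how the paper's classifier actually behaves.

```python
from fractions import Fraction


def expected_matches(population: int, one_in: int) -> Fraction:
    """Expected number of people whose profile matches at a rate of 1 in `one_in`."""
    return Fraction(population, one_in)


def chance_match_is_culprit(candidates: int) -> Fraction:
    """Odds that any one matching candidate is the culprit.

    Assumption: exactly one culprit among the candidates and no other
    evidence, so each candidate is equally likely.
    """
    return Fraction(1, candidates)


# Numbers from the comment: 1-in-a-million match rate, 300 million people.
matches = expected_matches(300_000_000, 1_000_000)
false_positives = matches - 1  # everyone who matches except the culprit

# With nothing but the match, a given hit is very unlikely to be the culprit.
nationwide = chance_match_is_culprit(int(matches))

# Side information (same city, known to the victim) shrinks the pool to one.
local = chance_match_is_culprit(1)

print(f"matches: {matches}, false positives: {false_positives}")
print(f"chance a random match is the culprit: {nationwide}")
print(f"chance after narrowing to one local candidate: {local}")
```

The same reasoning would apply to a code-style match: on its own it produces a long list of candidates, and it only becomes useful once other evidence cuts that list down.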