Coding Styles Survive Binary Compilation, Could Lead Investigators Back To Programmers (princeton.edu)
An anonymous reader writes: Researchers have created an algorithm that can accurately detect code written by different programmers (PDF), even if the code has been compiled into an executable binary. Because of open source coding repositories like GitHub, state agencies can build a database of all developers and their coding styles, and then easily compare the coding style used in "anti-establishment" software to detect the culprit. Despite all the privacy implications this research may have, the algorithm can also be used by security researchers to track down malware authors.
We also discussed an earlier phase of this research.
Going to be lots of false positives on this one.
This must have a lot of false positives. I'd be surprised if this works at all, but sure it would sell some product and get a few grants.
People have been analyzing writing styles for a long time to try to identify authors. Expecting your coding style to be obfuscated by compiling it has proven to be as wrong as thinking your identity is shielded if you publish under a pseudonym. If you make your code publicly available you really shouldn't have any expectation of privacy.
I'm a consultant - I convert gibberish into cash-flow.
gotta change my indentation style and public void( String s1 ) whitespace habit, now the guvnmunt automagically can also get these out of binaries built from my code. O gawd, I'm afraid now.
Religious speak to God. Insane are spoken to by God. When all shut up, one can finally hear Shostakovich in peace
Aren't we being tracked enough as it is?
Why for fucks sake why?
My New Year's resolution will be to remove all my code from all public repositories.
Good luck when your programmer pool is a couple of thousand and your samples consist of obfuscated and underhanded software, which is what malware creators often produce.
One of the somewhat longstanding assumptions in software development was that hand-optimization was no longer necessary because compiler optimization was sufficient to identify when to unroll loops, when to double-stack recursion, and all the other little performance boosting tricks.
I have no plans to read the study, but it sounds like at least one compiler is failing to do that. Since it is only using a few basic optimizations, different ways of coding the same behavior will still be observably different in the binary.
The real takehome is that we should all keep in mind the nature of the hardware and software we are coding for, and consider writing for ideal precision rather than how you would describe the process for another human.
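A minimal sketch of the claim that different ways of coding the same behavior can stay observably different after compilation. It uses CPython bytecode as a stand-in for a native binary, which is an assumption made purely for illustration: the study is about compiled executables, and the function names here are hypothetical.

```python
import dis


def sum_squares_loop(values):
    """Accumulate with an explicit loop."""
    total = 0
    for v in values:
        total += v * v
    return total


def sum_squares_builtin(values):
    """Same behavior, written with a generator expression."""
    return sum(v * v for v in values)


def opcode_names(func):
    """Return the opcode names CPython emitted for func (top level only)."""
    return [ins.opname for ins in dis.get_instructions(func)]


loop_ops = opcode_names(sum_squares_loop)
builtin_ops = opcode_names(sum_squares_builtin)

# Both functions compute the same value...
print(sum_squares_loop([1, 2, 3]) == sum_squares_builtin([1, 2, 3]))
# ...but the compiled instruction sequences are not the same.
print(loop_ops != builtin_ops)
```

The explicit loop leaves a FOR_ITER instruction in the function body, while the generator version pushes its loop into a nested code object. An optimizing native compiler may erase more of this than CPython does, so treat it as an analogy rather than evidence about any particular compiler.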
They can tell if you will commit a crime. Better to get you sooner than later. Already being done in places like Cook county.
So what happens when someone copies and pastes from 10 different authors to make a project?
Though it only works for people who write in fucking high level languages.
Optimized code in a low level language cannot, and never will, lead back to anyone but GOD!
Have fun!
However, a lot of people have similar enough coding styles, so you may be able to break it down to particular camps of styles. Also many people change their style based on the language they are coding in. Also over time their style may evolve and change.
In my career I try to keep my mind open: when I see another style of coding, rather than judging it inferior to mine, I try to understand it, and if I like it I incorporate it into my style.
But compiling your code will not hide how you coded it. I can usually tell how a program is written, and in what style, just by using the application, without even looking at the binary code. Some tasks take a while to run, while others seem quick. Some features are flexible and others are fixed. Different styles tend to tolerate particular tradeoffs.
If something is so important that you feel the need to post it on the internet... It probably isn't that important.
Good luck tracking me!! I copy all of my code from Stackoverflow!
90% of everything is crap. Also, crap is relative.
If you RTFA it seems their sample size was 20 programmers. Occasionally they went up to 100 and they're getting something like 60-80% accuracy. BFD.
Guys - when you've sampled the compiled, optimised binary output (with all debug info stripped) of a million coders all using different compilers on different architectures and are getting at least a 99% accuracy rate, get back to us. In the meantime, I'm sure you'll get some nice marks from your supervisors but I won't be losing any sleep.
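A rough sketch of why the jump from 20 or 100 programmers to a million matters so much. It treats "99% accuracy" as a 99% true-positive rate and a 1% false-positive rate per candidate, with exactly one real author in the pool. Those are simplifying assumptions for illustration, not how the paper reports its results, and the function names are hypothetical.

```python
def expected_false_matches(population, false_positive_rate):
    """Expected number of non-authors flagged when one true author is in the pool."""
    return (population - 1) * false_positive_rate


def chance_flagged_is_author(population, true_positive_rate, false_positive_rate):
    """Probability that a flagged candidate is the real author (Bayes, uniform prior)."""
    true_hits = true_positive_rate
    false_hits = expected_false_matches(population, false_positive_rate)
    return true_hits / (true_hits + false_hits)


for population in (20, 100, 1_000_000):
    false_hits = expected_false_matches(population, 0.01)
    p = chance_flagged_is_author(population, 0.99, 0.01)
    print(f"{population:>9} candidates: ~{false_hits:.2f} false matches, "
          f"P(flagged is author) = {p:.4f}")
```

Under these assumptions a pool of 100 already leaves a flagged candidate at even odds of being the author, and a pool of a million produces roughly ten thousand false matches for every real one.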
Well, thank God you can't change your coding style and thank God there aren't style standards for projects one has to adhere to.
Given this rather useless insight, how difficult would it be for someone to write a piece of code that randomly alters your style?
30 minutes?
Just run your app through an obfuscator and it's completely masked. Problem solved.
Visit the Arcade Restoration Workshop @ http://www.arcaderestoration.com
If the machine says Eketek encoded it, then Eketek alters the portions which get flagged as his own and builds a new assembly. Repeat until the machine blames someone for it. Problem solved.
Even their test size seemed to have low accuracy, but I wonder how well this even works over time. I know my code from 5-6 years ago looks nothing like code that I write today.
"People who think they know everything are very annoying to those of us who do."-Mark Twain
Problem 1.) Who wrote this https://de.wikipedia.org/wiki/... ? Problem 2.) In the movie First Blood, Part II (a.k.a. Rambo II), when the camera pans through the interiors of Marshall Murdock's CIA base building, parts of the code listing of some computer program can be seen scrolling through some of the screens there. Who wrote that code? Hint to Problem 2: The person in question is also a Slashdot member.
Seems like the evolution of this will be programs that obfuscate coding styles. But then could the obfuscator itself be tracked back to its author too? Echo, echo, echo...
This technology was used to determine how many coders were used for the Stuxnet attack.
That was back in 2010.
Using that, they determined that a team of 20 people were used, indicating a state-sponsored attack of remarkable complexity...
A pox on web designers who feel that window.innerWidth == screen.availWidth
then translate my code from c-> perl -> lisp -> c
No one will be able to trace me anymore!
Google translation: It's not a bug, it's a feature!
Captcha: rectum
I wonder how well this works with languages like Vala, which get translated to another language (C, in Vala's case) before that is then compiled. How much of your coding style is still visible on the other side then?
Any nontrivial programming exercise involves problem solving. Faced with a particular recurring problem, a programmer will learn methods to solve it. There are many choices. Most programmers, after having learned a small collection of 'good enough' solutions to common problems will continue to use them whenever 'good enough is good enough', the time and effort of relearning seeming unproductive.
This is no different, conceptually, than in sports when certain sportspeople play in a discernible style. Nobody is perfectly uniformly good at all aspects of a discipline. (It would be interesting to see if one could take a list of statistics from tennis matches and use them to identify the players.)
John_Chalisque
Maybe now we can track down the guy that wrote that Volkswagen code? I'll be right there... need to grab my pitchfork.
"Never give up, for that is just the time and place when the tide will change." -Harriet Beecher Stowe ^_^
Ah good thing I use perl-critic.
Did a couple of scripts in the quiet time between Xmas and New Year. Took the chance to move from Perl to Python. I use GitHub as it is there.
Think my rating will be 'dufus head'
Important part:
Finally, we do not consider executable binaries that are obfuscated to hinder reverse engineering. While simple systems, such as packers [2] or encryption stubs that merely restore the original executable binary into memory during execution may be analyzed by simply recovering the unpacked or decrypted executable binary from memory, more complex approaches are becoming increasingly commonplace, particularly in malware.
So, there are numerous issues here:
1.) getting the samples for training (e.g. the authors already mention this as a problem) => github and friends distribute source code, and it's not necessarily trivial to get the compiler and options right to recreate the correct binary.
2.) If you were to profile me online, for example, you'd learn from code repositories that I know Python, and you might infer from posts that I know other languages. My Python repositories will not help you identify my binaries built in C.
3.) And worst of all, the code where this deanonymization would be most useful, e.g. malware, is very hard to handle, as it's usually obfuscated to the max. Worse, malware has been known to mutate itself on replication to avoid leaving a signature for virus scanners.
Anyway, nice ML paper. ;)
Generally, any programming language has an upper limit on the number of commands it recognizes, which cannot be said of spoken or written languages. The only things that will actually be discovered are the differences in algorithms, not the number of unique programmers of a particular dialect.
The place where (and how) you catch nulls is very programmer-specific in my experience and often evades the style-check.
will make any such strategy useless in short order. Source code translators and syntax standardization tools might be another approach.
Anyway, it's a big yawn, however, some enterprising con artists will sell this to clueless government bureaucrats for big bucks. Bureaucrat will get his bonus. Con artist company will get their money. Win, win. It won't work, of course, but when has that ever mattered in the government world?
Please do not read this sig. Thank you.
Cut-and-paste other programmers' code!
Have you read my blog lately?
This study seems to have a high error rate. (70-80% correct, less for big programmer populations)
It might be useful for de-prioritizing some leads, but it seems a bit like a divining rod.
What is interesting is what it says about what programmers do.
They continually make choices as to how to implement things.
The choices are limited by their judgement and bag of tricks.
What they have seen, what has worked in the past, and what they manage to dream up.
Perhaps this research is actually creating and comparing inventories of these bags?
If so, then they are not just measuring the properties of one programmer, but a network of programmers trading tricks.
It should also produce some hints of which programmers have interacted in the past.
'Interact' does not mean they are actually aware of each other, but rather that they are aware of each other's code.
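One way to picture "comparing inventories of these bags" is as set overlap. The sketch below uses Jaccard similarity, which is a technique chosen for illustration and not something the study is known to use; the programmers and trick names are invented.

```python
def jaccard(a, b):
    """Share of tricks two programmers have in common, from 0.0 to 1.0."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


# Hypothetical inventories: every name and trick here is made up.
inventories = {
    "alice": {"early_return", "guard_clause", "fluent_builder", "throw_on_error"},
    "bob": {"early_return", "guard_clause", "fluent_builder", "error_codes"},
    "carol": {"single_exit", "out_params", "error_codes", "kitchen_sink_ctor"},
}


def closest_pairs(inv):
    """Rank every pair of programmers by how much their inventories overlap."""
    names = sorted(inv)
    pairs = [
        (jaccard(inv[x], inv[y]), x, y)
        for i, x in enumerate(names)
        for y in names[i + 1:]
    ]
    return sorted(pairs, reverse=True)


for score, x, y in closest_pairs(inventories):
    print(f"{x}-{y}: {score:.2f}")
```

A high overlap between two inventories would hint that the programmers have traded tricks, directly or through each other's code, without saying which of them wrote a given sample.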
It seems to me like the easiest way to avoid being identified in this regard would be to write code that follows any published general style guidelines or otherwise very common conventions.
As a side effect, it will make your source code more readable to others, which is beneficial if you are on a programming team.
File under 'M' for 'Manic ranting'
While I don't think you'll be able to identify an exact person, I do think this technology could be used to identify code that is prone to error and exploitation, or even code that is written for exploitation.
Anons need not reply. Questions end with a question mark.
There is no way this could be even close to conclusive, but the moral of the story is - if it is stupid, but a judge will call it probable cause, then it isn't stupid.
The truth is it doesn't need to be conclusive, it just has to look conclusive to a 60 year old law professional with no programming experience.
As someone who knows a fair amount about compilers and interpreters, I would be highly skeptical of that underlying statement to begin with. Going down this path is a road that stretches the bounds of credulity.
But I think I would also dispute the notion that programmers have unique coding styles in the age of widely accepted standards and practices.
Practices, I would concede, are coding styles.
Though, even then... it's not like we're talking about something like assembly, where the style you use would really be pronounced.
We're talking about mostly .NET and Java applications that use commodity skillsets and adhere to certain sets of rules and guidelines. You don't give up your best practices and behaviors when you decide to write something that breaks the law.
You don't change for GitHub, either. In general, most programmers work with whatever is acceptable for the platform they're using, and the rest falls into place.
The exception, of course, being amateurs, and possibly some hobbyists... who don't really know enough to do anything all that malicious to begin with.
This is a witch hunt, plain and simple.
Unless of course, they're actually analyzing writing styles, and they're trying to bring all of us advanced level programmers out by saying something idiotic, knowing that we'll all comment on it.
Fuckit. They're probably doing that anyway.
This signature has Super Cow Powers
01001001 00100000 01100011 01101111 01100100 01100101 00100000 01101001 01101110 00100000 01100010 01101001 01101110 01100001 01110010 01111001
“He’s not deformed, he’s just drunk!”
I suspect that the value is not in answering the question "who the hell wrote this - which programmer in Internet land?" but in identifying a programmer out of a small group of suspects, e.g. "was this written by the known malware team in Boston, Beijing or Kiev?". So it will further narrow the field within an already small group of suspects.
This has an interesting implication for GPL enforcement. Today if Nasty Corp Inc takes a large chunk of code from GitHub and makes it part of a proprietary product (e.g. sells it and does not provide source), then even if you suspect that they have taken your code it is hard to prove it; yes, you may be able to get disclosure by going to court, but that costs a lot of money and is hard if they are in a different jurisdiction. Now you will be able to get a good idea of whether the code is yours before spending significant time and money chasing Nasty Corp.
Perhaps a language that is heavily opinionated in both style and patterns (ie what is considered "idiomatic") can help frustrate this kind of analysis.
Another bullshit method to lump on the pile alongside handwriting analysis, lie detection and parallel construction.
Linus Torvalds has been indicted for creating numerous pieces of malware. "His coding style is unmistakable," prosecutors said, citing numerous code fixes he made after scolding commentaries on other people's coding style.
Custom electronics and digital signage for your business: www.evcircuits.com
I suspect the successful detection rate may be a bit lower.
Did the same team that developed that code also run an accuracy assessment? Was there a "prize" (contract payment) associated with meeting certain accuracy? I remember reading about facial recognition systems which worked well in labs, but fail in the field.
As soon as developers become aware that they might be identified, I think that they might do things (spoof, run beautify and strip comments) to throw such a system off.
That way, the smoking code leads back to the person who I stole it from, rather than to me.
-- Tigger warning: This post may contain tiggers! --
I'd think the only time this would work is if a programmer was a contributor to open source projects, then went bad and started writing software designed to commit crimes.
Say someone contributes to open-source projects and then contributes to the Android project. A U.S. appeals court found Android to infringe Oracle's copyright, pending a forthcoming phase of the trial to determine whether API interoperability is a valid rationale for fair use. Does Android count as "software designed to commit crimes" because copyright infringement is a crime?
If you aren't uploading to GitHub, you are an Alchemist, not a Scientist.
If by "alchemist" you mean "someone practicing obsolete practices worthy of derision", this sounds like you're trying to say GitHub ought to have a monopoly on hosting free software projects, as opposed to SourceForge which shares a parent company with Slashdot. Do you work for GitHub?
If we had better compilers that would optimize out this stuff, or compilers that did optimization by default to mask this, this wouldn't be an issue.
This one is easy, problem is it's like fingerprints, which aren't actually unique but provide a convenient way to find someone to persecute. (Not a typo that).
Coding style will be even less unique than fingerprints.
You could possibly use either to prove 'unlikely to be involved' but if it's used it'll most likely be used to 'prove' the innocent guilty.
I'm pretty sure my coding style has changed significantly over time, from project to project, due to experience, learning from past mistakes, influence from other programmers, etc. Also, I've worked on projects where 10 different programmers have touched the same code. Good luck trying to identify me from any two pieces of code.
assignment != equality != identity
n/t
Many, many years ago, I was reverse engineering software for a variety of microcontroller-based products. We ran PROM dumps through disassemblers, etc. It was easy to tell how many different programmers had a hand in the code, from things like lengths of subroutines, looping techniques, and how data was declared and laid out. Even though the original programmers were working in C, it's easy to tell.
This is no different than the typical 80-90% accuracy you can get on random samples of text from your emails or other writings. While 80% on any one sample isn't all that impressive, when you combine it with other evidence that is orthogonal, it's pretty easy to unambiguously identify an author, especially if you have a halfway decent sized corpus to work from.
The Enron case and the resulting discovery motions created a HUGE database of emails and memos to work with.
Software is no different.
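The argument that a modest per-sample accuracy becomes convincing once it is combined with orthogonal evidence can be sketched with Bayes' rule in odds form. This assumes the signals are truly independent, that each one points at the right person with the stated accuracy, and that there is a single suspect with a 50/50 prior. All of these are illustrative assumptions, and the function name is hypothetical.

```python
def posterior(prior, accuracies):
    """Posterior that the suspect is the author after several independent
    signals, each with the given accuracy, all point at the suspect."""
    odds = prior / (1.0 - prior)
    for acc in accuracies:
        # Likelihood ratio contributed by one signal that agrees.
        odds *= acc / (1.0 - acc)
    return odds / (1.0 + odds)


for n in (1, 2, 3, 5):
    print(n, "agreeing signals:", round(posterior(0.5, [0.8] * n), 4))
```

The same formula also shows the weak spot: if the signals are correlated rather than orthogonal, multiplying their likelihood ratios overstates the confidence.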
This is not going to hold up in any court, except Iran.
I have lost count of the times younger and lazier coders have told me that their high-level compiled code was every bit as efficient as assembly code. There seems to be an absolute conviction that compilers generate the tightest and most-optimized code. If that is so, however, then those compilers should generate the same code no matter who writes the high-level code. If the compilers truly generate different code with uniquely-identifiable characteristics depending upon the coding styles of the individual coders, then those compilers are NOT generating the tight highly-optimized results their supporters claim. Just a thought for you guys that never bother with low level code...
As with ANY tool in a toolbox, there's a time and a place for an assembler, a compiler, and an interpreter - NONE are ALWAYS the best. I'm certainly not an advocate for writing most apps in low-level code, but it's also not correct to assume that it's just fine to do everything in a high-level language and "let the compiler do all the hard work".
String foo()
{
    // Single exit: assign in each branch, return once at the end.
    String result;
    if (...)
        result = a;
    else if (...)
        result = b;
    else
        result = c;
    return result;
}
or
String foo()
{
    // Early return: each branch returns its own value.
    if (...)
        return a;
    else if (...)
        return b;
    else
        return c;
}
Both are common, but compile to different code. Do you code to 'a procedure should live on a page'? How about 'a procedure should have a purpose'? Return errors or throw exceptions? Return values or modified arguments? Kitchen sink constructors? Getters & setters or fluent builders? There's seven variables without stopping to think, dividing coders 128 ways, and I'm sure you could find another dozen or so, taking it to one-in-a-million level. There's no need to obfuscate...
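The arithmetic behind "128 ways" and "one-in-a-million" can be checked directly. The sketch assumes every style choice is binary, independent of the others, and split evenly across programmers, which real habits will not be; the function names are hypothetical.

```python
import math


def distinct_profiles(binary_choices):
    """Number of distinct style profiles from independent yes/no choices."""
    return 2 ** binary_choices


def choices_needed(population):
    """Smallest number of binary choices that can separate `population` coders."""
    return math.ceil(math.log2(population))


print(distinct_profiles(7))        # seven choices
print(distinct_profiles(7 + 12))   # seven plus another dozen
print(choices_needed(1_000_000))   # choices needed for one-in-a-million
```

Seven choices give 128 profiles, and seven plus a dozen gives 524,288, so "another dozen or so" lands in the right range: twenty even splits are enough to pass a million.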