Anonymous No More: Your Coding Style Can Give You Away
itwbennett writes: Researchers from Drexel University, the University of Maryland, the University of Goettingen, and Princeton have developed a "code stylometry" that uses natural language processing and machine learning to determine the authors of source code based on coding style. To test how well their code stylometry works, the researchers gathered publicly available data from Google's Code Jam, an annual programming competition that attracts a wide range of programmers, from students to professionals to hobbyists. Looking at data from 250 coders over multiple years, averaging 630 lines of code per author, their code stylometry achieved 95% accuracy in identifying the author of anonymous code (PDF). Using a dataset with fewer programmers (30) but more lines of code per person (1,900), the identification accuracy rate reached 97%.
Can they do it with corporate code where there are naming and style standards in abundance, and code reviews to ensure those guidelines are followed?
I do not fail; I succeed at finding out what does not work.
Next thing you know, they'll be able to use this text to determine my real /. login.
This is why people need to follow style guides, so that all source code is styled the same.
Really? Code Jam? The clusterfuck of copy/paste crap?
Go to something that has standardized formatting and enforced styled code, like Linux or even Qt and repeat your experiments.
Who releases source code without their name?
Let me know when you can determine the author from just the binary...
...based on the quality of that code...
Do not look into laser with remaining eye.
Using this technique, can they tell us if the NSA did write the Regin Malware now?
Can we use this to find Satoshi?
Are they just reading...
// sample.c - takes slashdot comments and replaces troll comments with additional troll comments
// by John Smith
// v1.0.1
or changelog.txt?
With coding standards to follow, and tools that uniform-ify your code, it should be easier to anonymize it than with regular prose. And regular prose is apparently trivial to anonymize: see "Practical Attacks Against Authorship Recognition Techniques" by Michael Brennan and Rachel Greenstadt.
This has always been obvious.
It's true for comments here, too. Only apk can craft a true apk comment. Others have tried, but they're never quite like the genuine thing.
But we should be careful with such analysis, too. In some cases it can be totally wrong.
There is a Slashdot-like site called Soylent News. There was once a guy over there who would claim that different posters were actually the same person, even when they weren't, and in some cases couldn't have been (one of the people he accused had died earlier).
How did he "know" they were the same people? He said he had a "complex" algorithm that used bzip2 and a comparison of the size of the compressed comment text. Of course, his allegations were correct about 0% of the time.
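For anyone curious what a bzip2-size comparison might look like, here is a minimal sketch of the usual "normalized compression distance" trick. This is a guess at the general idea, not that poster's actual algorithm; the function names `compressed_size` and `ncd` are made up for illustration. On short comments the compressor's fixed overhead swamps the signal, which may be part of why such allegations miss.

```python
import bz2


def compressed_size(text: str) -> int:
    """Return the bzip2-compressed size of text, in bytes."""
    return len(bz2.compress(text.encode("utf-8")))


def ncd(a: str, b: str) -> float:
    """Normalized compression distance between two texts.

    Values near 0 mean the texts share a lot of structure;
    values near 1 mean compressing them together saves almost nothing.
    """
    size_a = compressed_size(a)
    size_b = compressed_size(b)
    size_ab = compressed_size(a + b)
    return (size_ab - min(size_a, size_b)) / max(size_a, size_b)


if __name__ == "__main__":
    # Hypothetical comment texts, for illustration only.
    first = "I can usually tell who wrote the code by the brace style. " * 20
    second = "Fortran compilers emit very fast machine code for loops. " * 20
    print("same text:     ", ncd(first, first))
    print("different text:", ncd(first, second))
```

A low distance only says two texts compress well together. It says nothing about who wrote them, so treating it as proof of identity is a leap.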
Newfags can't triforce
I can usually tell who wrote the code in the office by whether or not they put a space after their ifs: if(i == 0) vs if (i == 0); where they put their brackets, whether or not they replace their tabs with spaces, how they deal with bools: if (!var) vs if (var == false) and several other telling signs. There are so many combinations of variations no two programmers in the office (about 12 of us) have the same style.
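The tells listed above are easy to count mechanically. Below is a rough sketch, assuming C-like source held in a string; the function name `style_features` and the feature names are invented here, and this is nowhere near the researchers' feature set. It just tallies the surface habits mentioned in this comment.

```python
import re


def style_features(source: str) -> dict:
    """Count a few surface-level layout habits in C-like source text."""
    lines = source.splitlines()
    return {
        # "if (" versus "if("
        "if_with_space": len(re.findall(r"\bif \(", source)),
        "if_without_space": len(re.findall(r"\bif\(", source)),
        # opening brace at the end of a statement line versus alone on a line
        "brace_same_line": sum(
            1 for l in lines if l.rstrip().endswith("{") and l.strip() != "{"
        ),
        "brace_own_line": sum(1 for l in lines if l.strip() == "{"),
        # tabs versus spaces for indentation
        "tab_indented_lines": sum(1 for l in lines if l.startswith("\t")),
        "space_indented_lines": sum(1 for l in lines if l.startswith(" ")),
        # "if (!var)" versus "if (var == false)"
        "negation_test": len(re.findall(r"\bif\s*\(\s*!\s*\w+\s*\)", source)),
        "explicit_false_test": len(
            re.findall(r"\bif\s*\(\s*\w+\s*==\s*false\s*\)", source)
        ),
    }


if __name__ == "__main__":
    # Two hypothetical coworkers writing the same logic.
    coder_a = "if(i == 0) {\n\tdo_it();\n}\nif(!done) {\n\tstop();\n}\n"
    coder_b = "if (i == 0)\n{\n    do_it();\n}\nif (done == false)\n{\n    stop();\n}\n"
    print(style_features(coder_a))
    print(style_features(coder_b))
```

With a dozen people in an office, a handful of counts like these may already separate everyone. A pretty-printer would flatten most of them.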
The Moore-Murphy Law: The number of things that will go wrong will double every 2 years.
... a patchwork of open-source freebies.
I suppose all those "// damn U bill gates!" comments gave me away
Table-ized A.I.
When I was a kid in the 80s we figured out we could identify who wrote a particular piece of software by looking at its code. We used those individualistic, identifiable features in the argument over whether programming is an art or a science, when we wanted to support the art side.
1985 Hugo Winner
Really, the fact that coding style is recognizable was so well known it made it into pop culture 30 years ago.
Also, on the smaller sample size the program might just be recognizing the parts of the style that come from the corporate standards. It would be interesting to see if it could recognize code from people who all work at the same company.
But I can't recall an instance.
I guess we can expect that source code repositories will be scanned and processed. And, for code written by multiple authors, the modified code (from commits) will be scanned and indexed as well.
But, I bet they will never figure out who writes the malware recently attributed to the three letter agencies. They should, however, be able to figure out which agency writes the stuff if they get a copy of the source code or maybe even from decompiling the binary.
Additionally, if written from .NET, the CLR code can be reflected back to VB, C# or any other .NET language to retrieve the source code.
"The key to this system being used is, of course, first obtaining the code stylometries for a wide range of developers. The authors didn't address how, say, a database of programmers’ styles would be compiled. Also, to identify the author of a piece code would require access to the source code, and not just executables, though the authors mention there is some evidence that style is preserved in binaries."
-> so once you post to github and similar 'they' can link every code you ever write to you,....
so you are a good month late with the news
BWAHAHAHAAAA!!!!
If your coding is terrible and very newbie like, they can't single you out since your code is similar to the ocean of other terrible coders.
So if you are a paranoid freak, the best way to ensure your safety and keep the government off your back is to write terrible code.
Priest: "Universe from nothing, no laws of physics, sped up time"+ huge discrepancies. Creationism? No. Big Bang Theory
Grab the systemd source and see how much of it seems to be written by Satan.
Not that many of us actually use comments.... http://xkcd.com/1421/
Most programming isn't writing new code. Most programming is working on someone else's crap you inherited. Invariably, you're going to be using that person's style or else the result will look like garbage.
There is also the problem that most non-trivial code is worked on by multiple people at the same time.
Writing some code from scratch as an assignment is a very artificial exercise nowadays, unless you're in a classroom setting. Therefore, you're going to get a signature from a programmer doing atypical work.
95% of 250 coders. That means that out of a million programmers they will misidentify 50,000.
I suspect that there are too few variances in style to make any coder's style unique. For example, whether to use braces on a one-line statement after an if in C.
With a few programmers it's likely to work, but when the possible source of programmers is the world...
Not to mention emacs, Visual Studio and such enforcing some indentation standards and programming languages enforcing others.
I recall some program years ago that claimed to be able to convert your prose into the style of Hemingway/Dickens/. I wonder how easily this tool could support a similar feature - convert your code to Linus' style! Code like RMS!
I even had a plan to add a formatter to the CVS to convert all code at checkin to a single style, so that diffs between versions would be guaranteed free of coder style quirk differences (tab size, spacing, brace placement). And I would have gotten away with it, if it weren't for that meddling C++ unparseability!
Write a version of pretty-printer that rerenders your code into a different style.
Have a lexicon of mipelled words for each "personality".
Another lexicon of variable names.
a vs inta vs int_a vs x.
Refactoring and unfactoring for subroutines.
Run the comments through google translate and back to english.
ukrainian
japanese
chinese
Synonym and antonym substitution in the comments.
The mind dances at the possibilities to mess with this algorithm.
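The variable-name lexicon part of the list above could be sketched roughly like this. Everything here is hypothetical: the function `rename_identifiers` and the sample lexicon are made up, and a plain regex pass like this also rewrites matches inside strings and comments, so a real tool would need a proper parser. The article quoted elsewhere in this thread also says renaming alone did not change identification rates, so treat this as one small piece of the idea.

```python
import re


def rename_identifiers(source: str, lexicon: dict) -> str:
    """Replace whole-word identifiers in source according to lexicon."""
    if not lexicon:
        return source
    # \b on both sides keeps "a" from matching inside "a1" or "data".
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(name) for name in lexicon) + r")\b"
    )
    return pattern.sub(lambda match: lexicon[match.group(1)], source)


if __name__ == "__main__":
    # One hypothetical "personality": verbose prefixed names.
    personality = {"a": "int_a", "b": "x"}
    print(rename_identifiers("int a = b + a1;", personality))
```

Swapping in a different lexicon per release would give a different surface style each time, though the underlying structure of the code stays the same.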
She was like chocolate when she drank... semi-sweet at first and then increasingly bitter.
and then use F2C to convert it to C code before I check in.. Try analyzing that!
Ditto. They also could have researched if submissions in a given (same) GCJ identity have been (or had a high probability of being...) written by two or more different coders...
;-)
The submissions' speed of top ranked coders seen in early stages of the GCJ contest always amazed me (compared, of course, with my turtle sluggishness...)
I can just imagine how unreadable such code would end up being, as any comments would look like they were written by some sort of AI tool.
Of course you could anonymize source code using an obfuscator.
But maybe the simpler way is to compile Java to bytecode, then decompile it back to Java. I suspect that's as effective as most obfuscators.
Fukkin great man! I get paid extremely well and I get all of the pussy I want. Granted, it's usually overweight, ugly pussy, but still better than what my last job provided.
Perhaps something like Artistic Style might help.
http://astyle.sourceforge.net/
Uh, Linux geek since 1999.
Someone just needs to write a tool that takes source code and translates it into an obfuscated form that only the CPU can understand. Is anyone working on this type of privacy tool?
I did something similar when I used to teach C. I neglected to realize that emacs and IDEs will often produce identical whitespace for simple programs.
Not sure if this works as well for compiled code as it does for source.
I bet there are templates and style-checkers in existence that would make source-code based author identification an issue.
Okay, it won't eliminate fingerprinting completely, but using Kolmogorov Style would reduce variation quite a bit.
For good measure, toss in "run bytecode through proguard" between compiling Java to bytecode and decompile it back to Java. :D
Wouldn't any programmer worth their salt identify themselves in the comments, or (if not) be logged as the last guy in that code on such-and-such a date, while working on such-and-such a patch number? (E.g., 'kittenman was here, 1/Jan/15, fixing Steve's crap').
But I hope my code is easily recognizable. I'm proud of it. It may not be the smartest, slickest, quickest there is, but it's mine. And it works.
"The greatest lesson in life is to know that even fools are right sometimes" - Winston Churchill
Generally speaking each project has a coding style that most code in the project adheres to, for the simple reason that it's easier to maintain when the code all looks more-or-less similar.
If one area uses lowercase with underscores, and the other area uses CamelCase, and one area typedefs the heck out of everything while the other is explicit, then for someone coming in and trying to understand the code it makes it harder than necessary to figure out what's going on.
So if you look at the linux kernel, or glibc, or firefox, or Chrome, or any other similarly large project, there will be some sort of coding style that applies. This is not to say that the style applies blindly. For example there are areas in the kernel where they basically imported a driver that is written in a different coding style. Since that driver is maintained out of the linux kernel tree and is largely self-contained, that was deemed to be acceptable. And even in that case, the driver used an internally-consistent coding style for all the files involved.
Yeah. Go throw the AI in jail. I dare ya.
I just use 'git blame' to figure out who to yell at....
I doubt it. Therefore, this is about as reliable as graphology (handwriting analysis).
If you take two programmers who code to book standard, how do you tell the difference between them when they are given the same strict problem?
Political debates have me rolling my eyes so much I think I got optical whiplash. I should sue. - Foamy The Squirrel
You can have/use this idea for free:
Before a system will build said code, have the build system verify the code not only by the public key/code hash, but as a secondary method - the code fingerprint of the author in question.
This turns a creepy idea into something worthwhile.
Lodragan Draoidh
The more you explain it, the more I don't understand it. - Mark Twain
That's one way to make your ForTran run slower
I don't even have mod points today! You're always sitting next to me posting this leftist crap and you take forever to type it!
Case in point, I am a guitar player, and so was my college roommate. We didn't necessarily play together much, but we both heard each other play a lot, over the course of years.
I'd be able to place his playing anywhere.
For that matter, we used to have a game where we'd try to stump each other by playing clips of guitar players and guessing who they were. This was often improvisational jamming, very obscure recordings from established artists. We usually had to go through 3-4 rounds before someone would get one wrong.
This isn't really much different than handwriting, speech patterns, writing patterns...
yeah right, sitting over there with those right fingers on the mouse while I'm forced to use the touchpad!
Which one of us pushed down the caps lock key when I was typing in the subject line up there? HEY STOP IT!
Hey, stop hitting yourself! uiopuip[';iulpol,.]]p??=` Stop hitting yourself! jkl;jkl;89uiopnm,.jkl Stop hitting yourself! mkpmkplkjklnnnnNNN BWAHAHA!
"Hey, you notice some odd grammar, word choice, and spelling variance in this code?"
"Oh yeah, must be Maxo-Texas. That's his anonymization software."
When things get complex, multiply by the complex conjugate.
If you did this every time, you'd be identified as the guy who runs his code through Google Translate prior to release.
Non-normal behavior is the easiest to single out. In order to avoid detection, you basically have to become noise. And if you're the only one, then even that is a pattern.
Sure, you could run some things through Google Translate and leave some things alone, but that'd be the equivalent of having two online personas.
"If a nation expects to be ignorant and free in a state of civilization, it expects what never was and never will be."
I noticed years ago that I could identify which of my coworkers wrote a piece of code simply by the style.
Absolutely- if you were the only one using the tool.
You mean like the people who wear masks of Guy Fawkes?
seems to be a very prolific coder
It runs a lot faster on platforms that don't have a Fortran compiler.
Of course you could read the article first, and see that "Accuracy rates weren’t statistically different when using an off-the-shelf C++ code obfuscators. Since these tools generally work by refactoring names and removing spaces and comments, the syntactic feature set wasn’t changed so author identification at similar rates was still possible."
Time to run this against the 7.2 version of Truecrypt.
The article reveals that: "Accuracy rates weren’t statistically different when using an off-the-shelf C++ code obfuscators. Since these tools generally work by refactoring names and removing spaces and comments, the syntactic feature set wasn’t changed so author identification at similar rates was still possible."
It would be interesting to see what it would take to reduce the probability of identification. I should probably RTFP, tho. Might be in there.
aye!
If everyone used it then we'd all be spartacus.
What I was implying also in my parent post was using the tool the article is about to confirm your code had reached the ambiguous level.
That's a good point. I also mentioned arbitrarily factoring and refactoring subroutines and I did not state clearly enough that i was suggesting using the tool mentioned in the article to confirm your code was giving a false result.
Wrong, the machine code emitted by one of the industry heavyweight Fortran compilers will kick the ass out of C's
Duh. They are, like, just seeing this today? We knew this back in the seventies... and I am sure that earlier programmers knew it too.
Used to be able to tell which student's code I was looking at towards the end of a semester, in the 1980s. No need to look at who submitted it. From time to time I'd find one student's work turned in by someone else. That would result in an inquiry and usually an action against that student. Ye old dumpster dive.
Years later I would do code reviews. In hardly any time I could tell you who wrote it, even if they had departed the company. Certain people do certain things predictably.