Anonymous No More: Your Coding Style Can Give You Away
itwbennett writes: Researchers from Drexel University, the University of Maryland, the University of Goettingen, and Princeton have developed a "code stylometry" that uses natural language processing and machine learning to determine the authors of source code based on coding style. To test how well their code stylometry works, the researchers gathered publicly available data from Google's Code Jam, an annual programming competition that attracts a wide range of programmers, from students to professionals to hobbyists. Looking at data from 250 coders over multiple years, averaging 630 lines of code per author, their code stylometry achieved 95% accuracy in identifying the author of anonymous code (PDF). Using a dataset with fewer programmers (30) but more lines of code per person (1,900), the identification accuracy rate reached 97%.
Can they do it with corporate code where there are naming and style standards in abundance, and code reviews to ensure those guidelines are followed?
I do not fail; I succeed at finding out what does not work.
This is why people need to follow style guides, so that all source code is styled the same.
Who releases source code without their name?
Let me know when you can determine the author from just the binary...
...based on the quality of that code...
Do not look into laser with remaining eye.
Using this technique, can they tell us if the NSA did write the Regin Malware now?
Heh. If it's effective in a clusterfuck of copy/paste, then it should be really effective when the bulk of the code is original...
Sounds like the solution is to use an entirely different language than the bulk of one's work is in, if one wants to anonymously write malicious or otherwise legally complicated code.
Can we use this to find Satoshi?
With coding standards to follow, and tools that uniform-ify your code, it should be easier to anonymize it than with regular prose. And regular prose is apparently trivial to anonymize: see "Practical Attacks Against Authorship Recognition Techniques" by Michael Brennan and Rachel Greenstadt.
This has always been obvious.
It's true for comments here, too. Only apk can craft a true apk comment. Others have tried, but they're never quite like the genuine thing.
But we should be careful with such analysis, too. In some cases it can be totally wrong.
There is a Slashdot-like site called Soylent News. There was once a guy over there who would claim that different posters were actually the same person, even when they weren't, and in some cases couldn't have been (one of the people he accused had died earlier).
How did he "know" they were the same people? He said he had a "complex" algorithm that used bzip2 and a comparison of the size of the compressed comment text. Of course, his allegations were correct about 0% of the time.
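For the curious, a compression-size comparison along those lines can be sketched in a few lines of Python. This is only a guess at what such a scheme might look like, since the poster's actual algorithm was never shown: it uses normalized compression distance (NCD), a known technique swapped in here for illustration, and the function names are made up.

```python
import bz2


def compressed_size(text: str) -> int:
    """Size in bytes of the bzip2-compressed UTF-8 encoding of text."""
    return len(bz2.compress(text.encode("utf-8")))


def ncd(a: str, b: str) -> float:
    """Normalized compression distance; lower means 'more similar'."""
    size_a = compressed_size(a)
    size_b = compressed_size(b)
    size_ab = compressed_size(a + b)
    return (size_ab - min(size_a, size_b)) / max(size_a, size_b)
```

On comment-sized inputs the fixed overhead of the compressed format tends to swamp any real signal, which would fit the results described above.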
Why would they even bother with an algorithm to process your ramblings? Every time I see you post, I instantly think "oh here's this jerk again".
That kind of depends on the stylesheets, pre-compiler style enforcement routines, and the fact that a shit-ton of corporate code is often improved incrementally by multiple authors.
'course, there's still the comments that you could use, but who does that?
Quo usque tandem abutere, Nimbus, patientia nostra?
I can usually tell who wrote the code in the office by whether or not they put a space after their ifs: if(i == 0) vs if (i == 0); where they put their brackets; whether or not they replace their tabs with spaces; how they deal with bools: if (!var) vs if (var == false); and several other telling signs. There are so many combinations of variations that no two programmers in the office (about 12 of us) have the same style.
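A few of those tells are easy to count mechanically. A rough sketch, assuming C-like source and using made-up feature names; the regexes are naive and will also match inside strings and comments:

```python
import re


def style_features(source: str) -> dict:
    """Count a handful of formatting habits in C-like source code."""
    lines = source.splitlines()
    return {
        # "if (" versus "if("
        "if_with_space": len(re.findall(r"\bif \(", source)),
        "if_without_space": len(re.findall(r"\bif\(", source)),
        # "(!var)" versus "var == false"
        "negation_bool": len(re.findall(r"\(\s*!\s*\w+\s*\)", source)),
        "compare_false": len(re.findall(r"==\s*false\b", source)),
        # tabs versus spaces for indentation
        "tab_indented_lines": sum(1 for l in lines if l.startswith("\t")),
        "space_indented_lines": sum(1 for l in lines if l.startswith(" ")),
        # opening brace on its own line versus at the end of a line
        "brace_on_own_line": sum(1 for l in lines if l.strip() == "{"),
        "brace_on_same_line": sum(
            1 for l in lines
            if l.rstrip().endswith("{") and l.strip() != "{"
        ),
    }
```

Two people writing the same logic can produce visibly different counts, which is the whole point; telling twelve coworkers apart would need many more features than this.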
The Moore-Murphy Law: The number of things that will go wrong will double every 2 years.
I suppose all those "// damn U bill gates!" comments gave me away
Table-ized A.I.
When I was a kid in the 80s we figured out we could identify who wrote a particular piece of software by looking at its code. We used those individualistic, identifiable features to support the art side in the argument over whether programming is an art or a science.
1985 Hugo Winner
Really, the fact that coding style is recognizable was so well known it made it into pop culture 30 years ago.
Also, on the smaller sample size the program might just be recognizing the parts of the style that come from the corporate standards. It would be interesting to see if it could recognize code from people who all work at the same company.
So, what's it like to work for Facebook?
But I can't recall an instance.
I guess we can expect that source code repositories will be scanned and processed. And, for code written by multiple authors, the modified code (from commits) will be scanned and indexed as well.
But, I bet they will never figure out who writes the malware recently attributed to the three letter agencies. They should, however, be able to figure out which agency writes the stuff if they get a copy of the source code or maybe even from decompiling the binary.
Additionally, if written from .NET, the CLR code can be reflected back to VB, C# or any other .NET language to retrieve the source code.
"The key to this system being used is, of course, first obtaining the code stylometries for a wide range of developers. The authors didn't address how, say, a database of programmers’ styles would be compiled. Also, to identify the author of a piece code would require access to the source code, and not just executables, though the authors mention there is some evidence that style is preserved in binaries."
So once you post to GitHub and similar sites, 'they' can link every piece of code you ever write to you...
so you are a good month late with the news
If your coding is terrible and very newbie-like, they can't single you out, since your code is similar to that of the ocean of other terrible coders.
So if you are a paranoid freak, the best way to ensure your safety and keep the government off your back is to write terrible code.
Priest: "Universe from nothing, no laws of physics, sped up time"+ huge discrepancies. Creationism? No. Big Bang Theory
Not that many of us actually use comments.... http://xkcd.com/1421/
Most programming isn't writing new code. Most programming is working on someone else's crap you inherited. Invariably, you're going to be using that person's style or else the result will look like garbage.
There is also the problem that most non-trivial code is worked on by multiple people at the same time.
Writing some code from scratch as an assignment is a very artificial exercise nowadays, unless you're in a classroom setting. Therefore, you're going to get a signature from a programmer doing atypical work.
95% of 250 coders. That means that out of a million programmers they will misidentify 50,000.
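A quick back-of-the-envelope check of that scaling, under the strong assumption that the 95% accuracy measured on 250 coders would carry over unchanged to a million (the function name is made up for illustration):

```python
def expected_misidentified(population: int, accuracy: float) -> int:
    """Expected number of wrong attributions at a fixed accuracy."""
    return round(population * (1 - accuracy))


# A flat 5% error rate over one million programmers.
print(expected_misidentified(1_000_000, 0.95))  # prints 50000
```

In practice accuracy would likely drop as the pool of candidate authors grows, so this is better read as a floor than an estimate.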
I suspect that there are too few variances in style to make any one coder's style unique. For example, whether to use braces on a one-line statement after an if in C.
With a few programmers it's likely to work, but when the possible source of programmers is the world...
Not to mention emacs, Visual Studio and such enforcing some indentation standards and programming languages enforcing others.
I hate following your rambling, Anonymous Coward. Sometimes you get extremely schizophrenic and contradict yourself!
"Never let your sense of morals prevent you from doing what is right" - Salvor Hardin
Write a version of pretty-printer that rerenders your code into a different style.
Have a lexicon of mipelled words for each "personality".
Another lexicon of variable names.
a vs inta vs int_a vs x.
Refactoring and unfactoring for subroutines.
Run the comments through Google Translate and back to English.
Ukrainian
Japanese
Chinese
Synonym and antonym substitution in the comments.
The mind dances at the possibilities to mess with this algorithm.
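The variable-name lexicon idea can be sketched for Python source with the standard tokenize module. A rough sketch only: rename_identifiers and the mapping are hypothetical names, the a to int_a substitution is borrowed from the list above, and there is no scope analysis, so attributes or keyword arguments that happen to share a name get renamed too.

```python
import io
import keyword
import tokenize


def rename_identifiers(source: str, mapping: dict) -> str:
    """Replace identifier tokens according to mapping, leaving layout alone."""
    lines = source.splitlines(keepends=True)
    edits = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        is_name = tok.type == tokenize.NAME and not keyword.iskeyword(tok.string)
        if is_name and tok.string in mapping:
            # Token rows are 1-based; a NAME token never spans lines.
            edits.append((tok.start[0] - 1, tok.start[1], tok.end[1],
                          mapping[tok.string]))
    # Apply right-to-left so earlier column offsets stay valid.
    for row, start, end, new_name in sorted(edits, reverse=True):
        line = lines[row]
        lines[row] = line[:start] + new_name + line[end:]
    return "".join(lines)
```

Because it works on tokens rather than raw text, the "a" inside "total" or inside a string literal is left untouched. Whether this is enough to fool a stylometry tool is another question, since renaming leaves the structure of the code as it was.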
She was like chocolate when she drank... semi-sweet at first and then increasingly bitter.
and then use F2C to convert it to C code before I check in.. Try analyzing that!
Ditto. They also could have researched whether submissions under a single GCJ identity were (or had a high probability of being) written by two or more different coders...
;-)
The submission speed of top-ranked coders in the early stages of the GCJ contest always amazed me (compared, of course, with my turtle sluggishness...)
I can just imagine how unreadable such code would end up being, as any comments would look like they were written by some sort of AI tool.
Of course you could anonymize source code using an obfuscator.
But maybe the simpler way is to compile Java to bytecode, then decompile it back to Java. I suspect that's as effective as most obfuscators.
Perhaps something like Artistic Style might help.
http://astyle.sourceforge.net/
Uh, Linux geek since 1999.
Someone just needs to write a tool that takes source code and translates it into an obfuscated form that only the CPU can understand. Is anyone working on this type of privacy tool?
Newfags can't triforce
Slashdot supports too few entities to do this right, and forget about UTF8. But you can get sorta close.
*
* *
Unless someone can do better?
Socialism: a lie told by totalitarians and believed by fools.
I had a Java instructor who informed the class that he talked to two students in private because their code was nearly identical except for one small detail: one used the x variable, the other used the y variable. The program was so simple that he couldn't flag the students for cheating.
Wouldn't any programmer worth their salt identify themselves in the comments, or (if not) be logged as the last guy in that code on such-and-such a date, while working on such-and-such a patch number? (E.g., 'kittenman was here, 1/Jan/15, fixing Steve's crap').
But I hope my code is easily recognizable. I'm proud of it. It may not be the smartest, slickest, quickest there is, but it's mine. And it works.
"The greatest lesson in life is to know that even fools are right sometimes" - Winston Churchill
Generally speaking each project has a coding style that most code in the project adheres to, for the simple reason that it's easier to maintain when the code all looks more-or-less similar.
If one area uses lowercase with underscores, and the other area uses CamelCase, and one area typedefs the heck out of everything while the other is explicit, then for someone coming in and trying to understand the code it makes it harder than necessary to figure out what's going on.
So if you look at the linux kernel, or glibc, or firefox, or Chrome, or any other similarly large project, there will be some sort of coding style that applies. This is not to say that the style applies blindly. For example there are areas in the kernel where they basically imported a driver that is written in a different coding style. Since that driver is maintained out of the linux kernel tree and is largely self-contained, that was deemed to be acceptable. And even in that case, the driver used an internally-consistent coding style for all the files involved.
I just use 'git blame' to figure out who to yell at....
I doubt it. Therefore, this is about as reliable as graphology (handwriting analysis).
If you take two programmers who code to book standard, how do you tell the difference between them when they are working on the same strict problem?
Political debates have me rolling my eyes so much I think I got optical whiplash. I should sue. - Foamy The Squirrel
You can have/use this idea for free:
Before a system will build said code, have the build system verify the code not only by the public key/code hash, but as a secondary method - the code fingerprint of the author in question.
This turns a creepy idea into something worthwhile.
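A sketch of what that two-step check might look like. Everything here is hypothetical: may_build, the feature-count dictionaries, and the 0.9 threshold are made up for illustration, the style vector is assumed to come from some separate feature extractor, and cosine similarity is just one plausible way to compare it against a stored author profile.

```python
import hashlib
import math


def sha256_hex(source: str) -> str:
    """Hex SHA-256 digest of the source text."""
    return hashlib.sha256(source.encode("utf-8")).hexdigest()


def cosine_similarity(u: dict, v: dict) -> float:
    """Cosine similarity between two sparse feature-count vectors."""
    keys = set(u) | set(v)
    dot = sum(u.get(k, 0.0) * v.get(k, 0.0) for k in keys)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def may_build(source, expected_hash, style_vector, author_profile,
              threshold=0.9):
    """Primary check: code hash. Secondary check: style fingerprint."""
    if sha256_hex(source) != expected_hash:
        return False
    return cosine_similarity(style_vector, author_profile) >= threshold
```

Given the error rates discussed elsewhere in this thread, the secondary check would be better treated as a warning than as a hard gate.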
Lodragan Draoidh
The more you explain it, the more I don't understand it. - Mark Twain
That's one way to make your ForTran run slower
Ever since that corpus callosotomy, I try to remember to type in nice things with my left hand but then my right hand logs in and mods it down...
Case in point, I am a guitar player, and so was my college roommate. We didn't necessarily play together much, but we both heard each other play a lot, over the course of years.
I'd be able to place his playing anywhere.
For that matter, we used to have a game where we'd try to stump each other by playing clips of guitar players and guessing who they were. This was often improvisational jamming, very obscure recordings from established artists. We usually had to go through 3-4 rounds before someone would get one wrong.
This isn't really much different than handwriting, speech patterns, writing patterns...
I once marked CS homework and uncovered cheating for an 'individual' assignment.
A group of students had debug comments in their code - the giveaway? spelling mistakes.
"Hey, you notice some odd grammar, word choice, and spelling variance in this code?"
"Oh yeah, must be Maxo-Texas. That's his anonymization software."
When things get complex, multiply by the complex conjugate.
If you did this every time, you'd be identified as the guy who runs his code through Google Translate prior to release.
Non-normal behavior is the easiest to single out. In order to avoid detection, you basically have to become noise. And if you're the only one, then even that is a pattern.
Sure, you could run some things through Google Translate and leave some things alone, but that'd be the equivalent of having two online personas.
"If a nation expects to be ignorant and free in a state of civilization, it expects what never was and never will be."
Absolutely- if you were the only one using the tool.
seems to be a very prolific coder
Time to run this against the 7.2 version of Truecrypt.
aye!
If everyone used it then we'd all be Spartacus.
What I was also implying in my parent post was using the tool the article describes to confirm that your code had reached the ambiguous level.
That's a good point. I also mentioned arbitrarily factoring and refactoring subroutines, and I did not state clearly enough that I was suggesting using the tool mentioned in the article to confirm your code was giving a false result.
Wrong, the machine code emitted by one of the industry heavyweight Fortran compilers will kick the ass out of C's
Duh. They are, like, just seeing this today? We knew this back in the seventies... and I am sure that earlier programmers knew it too.
Used to be able to tell which student's code I was looking at towards the end of a semester, in the 1980s. No need to look at who submitted it. From time to time I'd find one student's work turned in by someone else. That would result in an inquiry and usually an action against that student. Ye old dumpster dive.
Years later I would do code reviews. In hardly any time I could tell you who wrote it, even if they had departed the company. Certain people do certain things predictably.