Researchers Use Machine-Learning Techniques To De-Anonymize Coders (wired.com)
At the DefCon hacking conference on Friday, Rachel Greenstadt, an associate professor of computer science at Drexel University, and Aylin Caliskan, Greenstadt's former PhD student and now an assistant professor at George Washington University, presented a number of studies they've conducted using machine learning techniques to de-anonymize the authors of code samples. "Their work could be useful in a plagiarism dispute, for instance, but it could also have privacy implications, especially for the thousands of developers who contribute open source code to the world," reports Wired. From the report: First, the algorithm they designed identifies all the features found in a selection of code samples. That's a lot of different characteristics. Think of every aspect that exists in natural language: There's the words you choose, which way you put them together, sentence length, and so on. Greenstadt and Caliskan then narrowed the features to only include the ones that actually distinguish developers from each other, trimming the list from hundreds of thousands to around 50 or so. The researchers don't rely on low-level features, like how code was formatted. Instead, they create "abstract syntax trees," which reflect code's underlying structure, rather than its arbitrary components. Their technique is akin to prioritizing someone's sentence structure, instead of whether they indent each line in a paragraph.
The method also requires examples of someone's work to teach an algorithm to know when it spots another one of their code samples. If a random GitHub account pops up and publishes a code fragment, Greenstadt and Caliskan wouldn't necessarily be able to identify the person behind it, because they only have one sample to work with. (They could possibly tell that it was a developer they hadn't seen before.) Greenstadt and Caliskan, however, don't need your life's work to attribute code to you. It only takes a few short samples.
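To make the idea in the summary concrete, here is a minimal sketch, assuming Python source and the standard-library `ast` module. It is not the researchers' pipeline: it swaps in raw node-type counts and cosine similarity for their feature selection and classifier, and the names `ast_features`, `build_profiles`, and `attribute` are hypothetical. It only illustrates why known samples per author are needed before an unknown sample can be attributed.

```python
import ast
import math
from collections import Counter


def ast_features(source: str) -> Counter:
    """Count how often each AST node type appears in a piece of Python source."""
    tree = ast.parse(source)
    return Counter(type(node).__name__ for node in ast.walk(tree))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_profiles(samples: dict[str, list[str]]) -> dict[str, Counter]:
    """Sum the feature counts of each author's known samples into one profile."""
    profiles = {}
    for author, sources in samples.items():
        total = Counter()
        for src in sources:
            total.update(ast_features(src))
        profiles[author] = total
    return profiles


def attribute(source: str, profiles: dict[str, Counter]) -> str:
    """Return the known author whose profile is most similar to the sample."""
    feats = ast_features(source)
    return max(profiles, key=lambda author: cosine(feats, profiles[author]))
```

Note that `attribute` always returns one of the known authors. Flagging "a developer we have not seen before" would need an extra similarity threshold, which this sketch leaves out.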
... could be an interesting use case.
... of the real coder, and the alleged coder ... ... then they have correlation.
They would still need to exclude *all* other coders (via enough samples of their work) ... And have enough luck that no two coders are alike ... to get anything resembling certainty in actual causation.
Which, given that we all learned from the same sources, and that even in research many researchers often come up with the same idea independently, is not a scenario they are allowed to ignore.
I'm sorry ... did I break their pseudoscience? [insert crying pile of poop emoji]
Not that I'd be surprised, but ... [insert crying pile of poop emoji]. ...
Ohhh ... I see!
Identifying criminal malware authors is the obvious application but that stands in contrast to the victimhood complex around here
10 print "Hello"
20 goto 10
We need a new tool that parses code, builds a syntax tree, transforms it in ways that do the same tasks but mask the identity of the author, and re-emits it, anonymized.
Code de-anonymization tools could be used by regimes such as China's to find out who wrote an anti-censorship tool. It is very dangerous to prevent anonymous writing, anonymous code, anonymous anything.
Not to blame the researchers: it will be done if it can be done. But now... to protect.
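A rough sketch of the parse, transform, re-emit loop described above, assuming Python 3.9+ (for `ast.unparse`). The names `collect_defined_names`, `Renamer`, and `anonymize` are hypothetical. It only replaces identifiers the code itself defines and lets the round trip through the syntax tree drop comments and original layout. It handles simple scripts only: keyword arguments at call sites, class attributes, imports, and `global`/`nonlocal` statements are not rewritten.

```python
import ast


def collect_defined_names(tree: ast.AST) -> dict[str, str]:
    """Map every name the code itself defines to a neutral placeholder."""
    mapping: dict[str, str] = {}
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            name = node.name
        elif isinstance(node, ast.arg):
            name = node.arg
        elif isinstance(node, ast.Name) and isinstance(node.ctx, ast.Store):
            name = node.id
        else:
            continue
        mapping.setdefault(name, f"v{len(mapping)}")
    return mapping


class Renamer(ast.NodeTransformer):
    """Rewrite defined names; builtins and imported names are left untouched."""

    def __init__(self, mapping: dict[str, str]):
        self.mapping = mapping

    def visit_FunctionDef(self, node):
        node.name = self.mapping.get(node.name, node.name)
        self.generic_visit(node)
        return node

    visit_AsyncFunctionDef = visit_FunctionDef

    def visit_arg(self, node):
        node.arg = self.mapping.get(node.arg, node.arg)
        self.generic_visit(node)
        return node

    def visit_Name(self, node):
        node.id = self.mapping.get(node.id, node.id)
        return node


def anonymize(source: str) -> str:
    """Parse, rename, and re-emit; comments and formatting do not survive."""
    tree = ast.parse(source)
    tree = Renamer(collect_defined_names(tree)).visit(tree)
    return ast.unparse(tree)
```

Since the summary says the researchers rely on syntax-tree structure rather than formatting, renaming alone probably would not defeat their method. A real tool would also need structure-level rewrites, such as swapping equivalent loop forms.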
Of course I did not read the article!
In places I've worked development teams tend to develop similar coding styles depending on the problem they are trying to solve. So teams working on back end databases will all converge to similar coding styles, frequently influenced by use of common libraries. Does this technique account for that?
...there's code that just makes you wonder "how many authors, iterations and algorithms later?".
The latter is the future that'll take AI to sort out evolution
the police would show up wanting to know where the bodies were buried.
Hi! I make Firefox Plug-ins. Check 'em out @ https://addons.mozilla.org/en-US/firefox/addon/youtube-mp3-podcaster/
Some of my coding is so bad it would melt the eyeballs of anyone who knows proper coding.
Yet, I have seen code so bad that I can't even read a screen page of it.
I am sure the bad coders are easy to identify since they all make different mistakes.
But for the best coders, there are only a limited number of ways to make the best solutions and I think it would be harder to tell them apart.
E.C.P.
I occasionally contribute to open-source projects, but I do so under my full name anyway. Seeing that they are able to identify authors of compiled code too, it might be interesting to see if they can identify the authors of the viruses & malware that have been making the rounds the last decade. Who to sue . . .
Another use case might be the JavaScript found on web pages. A NoScript-like utility could ditch all JavaScript written by the wrong people - i.e. ad-related or spyware-related stuff. Lose it without the loss of functionality that full script blocking yields.
What you'd need to get past this is a seq2seq tool that shifts the distribution of your code features to another version. One could argue that style is very recognisable. But work with GANs shows that it is possible to isolate certain feature vectors and use them to tune the output. With this technique, you can get past this deanonymization technique, even if it is simply just cloaking yourself with someone else's style.
About half the time I code something, I end up grabbing a chunk of code that someone else has written which almost does what I want but not quite, copy/pasting it, and making a few tweaks to it so it'll do what I want.
That's kinda the whole reason software is different from crafting or manufacturing - zero cost of duplication. So there's no point doing duplicate work if someone else has already done it. In fact that's the fundamental rationale underlying open source.
I'd always add a comment regarding where it came from.
If I wrote it "The Usual Suspects" was listed when/if I had time to add comments.
So now you know, NSA/CIA/RIAA,,, /s
use it to show h1-b's copy-pasta code from GitHub?
It seems everything anybody does these days is inventing new ways to use the big data meat grinder.
Code reuse was always encouraged. Now, there's a backlash?
...for some former MS devs... IE6 and XP coders to be soon uncovered!
Slashdot, fix the reply notifications... You won't get away with it...
If I truly crave anonymity, I can use the identifier to effectively re-anonymize myself, by comparing the code I want to keep anonymous to published code that my name is publicly attached to. i.e. I can use the machine to tell me when my code does NOT resemble my published code to a de-anonymizing degree.
Bonus points for writing a babbler that substitutes certain concepts for functionally identical concepts as a means to defeat identification.
Yeah, you shouldn't need to worry then. From TFA:
Experienced developers appear easier to identify than novice ones. The more skilled you are, the more unique your work apparently becomes. That might be in part because beginner programmers often copy and paste code solutions from websites like Stack Overflow.
They're merely drones of the swarm lifeform headed by some "opinion makers".
All code will link back to some page on Stack Exchange - good luck with your profiles!
Once I've worked with a team for a while, I can generally recognise who coded something from their style.
There are plenty of stylistic elements that distinguish the actual coder, even in shops with tight coding standards. Some favour for loops, some unroll their code, some cram lots of logic on one line, while others aggressively decompose. Some will write very abstract code, others code tightly focused on the specific case. Some will use lots of getters and setters, others will favour tell don't ask; some will favour 'do { ... } while()', others will use while loops. Some write very short snappy functions, some longer functions; some use programming domain naming, others favour business domain naming. Some favour arrays, others favour collections.
I've often been approached by colleagues with comments such as 'this looks like your code', and they are usually right, so this is not some special skill I possess. It is absolutely realistic that an algorithm or AI could identify these elements with static analysis and metrics and a sufficient sample.
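A few of the habits listed above can be turned into numbers with very little code. The sketch below assumes Python source and the standard-library `ast` module; the function name `style_metrics` and the three metrics are illustrative choices, not anything taken from the study.

```python
import ast


def style_metrics(source: str) -> dict[str, float]:
    """Measure a few stylistic habits: loop preference, function length, comprehensions."""
    nodes = list(ast.walk(ast.parse(source)))

    def count(*types) -> int:
        return sum(isinstance(n, types) for n in nodes)

    for_loops = count(ast.For)
    while_loops = count(ast.While)
    loops = for_loops + while_loops
    # Function length in source lines, from the 'def' line to the last body line.
    lengths = [
        n.end_lineno - n.lineno + 1
        for n in nodes
        if isinstance(n, ast.FunctionDef)
    ]
    return {
        "for_share": for_loops / loops if loops else 0.0,
        "mean_function_lines": sum(lengths) / len(lengths) if lengths else 0.0,
        "comprehensions": float(
            count(ast.ListComp, ast.SetComp, ast.DictComp, ast.GeneratorExp)
        ),
    }
```

Comparing these numbers across a team's commits would be one crude way to check the "this looks like your code" intuition, though a handful of metrics is far less than what a trained classifier would use.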
I can instantly tell which developer within the company wrote the code I'm reviewing just by looking at it.
Do any of the identified programmers look like gorillas?
Isn't this essentially the same thing that researchers do with historical texts to show/disprove authorship?
This sounds very similar to an idea mentioned by science fiction writer John Varley in his novella "Press Enter".
In 1985.
These researchers, do conveniently named deserve lead/steel pipes to the kneecap. If they don't get the message the first time, the second time they get Christmas presents - concrete boots and a complimentary underwater sightseeing tours. There is no end where immoral morons will stop.
This will get abused on so many levels
Modern textual and linguistic analysis is remarkably good at identifying the author with short excerpts. It turns out that a popular training set is the enormous number of emails and other communications from the Enron case.
If you are a "published author" (of books), then there's enough of a corpus to allow your subsequent works to be rapidly and accurately identified (well above 80% accuracy - 10% false pos, 10% false neg). This makes it practically impossible for an author to publish under a pseudonym without being discovered (e.g. J.K.Rowling writing as Robert Galbraith was discovered within minutes)