Slashdot Mirror


Researchers Use Machine-Learning Techniques To De-Anonymize Coders (wired.com)

At the DefCon hacking conference on Friday, Rachel Greenstadt, an associate professor of computer science at Drexel University, and Aylin Caliskan, Greenstadt's former PhD student and now an assistant professor at George Washington University, presented a number of studies they've conducted using machine learning techniques to de-anonymize the authors of code samples. "Their work could be useful in a plagiarism dispute, for instance, but it could also have privacy implications, especially for the thousands of developers who contribute open source code to the world," reports Wired. From the report: First, the algorithm they designed identifies all the features found in a selection of code samples. That's a lot of different characteristics. Think of every aspect that exists in natural language: There's the words you choose, which way you put them together, sentence length, and so on. Greenstadt and Caliskan then narrowed the features to only include the ones that actually distinguish developers from each other, trimming the list from hundreds of thousands to around 50 or so. The researchers don't rely on low-level features, like how code was formatted. Instead, they create "abstract syntax trees," which reflect code's underlying structure, rather than its arbitrary components. Their technique is akin to prioritizing someone's sentence structure, instead of whether they indent each line in a paragraph.

The method also requires examples of someone's work to teach an algorithm to know when it spots another one of their code samples. If a random GitHub account pops up and publishes a code fragment, Greenstadt and Caliskan wouldn't necessarily be able to identify the person behind it, because they only have one sample to work with. (They could possibly tell that it was a developer they hadn't seen before.) Greenstadt and Caliskan, however, don't need your life's work to attribute code to you. It only takes a few short samples.
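For readers who want a feel for the pipeline the summary describes (parse code into a syntax tree, turn the tree into features, match against known samples), here is a rough sketch using Python's standard `ast` module. The feature set (plain node-type counts), the cosine-similarity matching, the toy samples, and the names `ast_features`, `cosine`, and `attribute` are all assumptions made for illustration. This is not the researchers' actual method, which per the summary narrows hundreds of thousands of candidate features down to around 50.

```python
import ast
import math
from collections import Counter


def ast_features(source: str) -> Counter:
    """Count how often each AST node type appears (layout is ignored)."""
    tree = ast.parse(source)
    return Counter(type(node).__name__ for node in ast.walk(tree))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def attribute(unknown: str, known: dict[str, list[str]]) -> str:
    """Return the known author whose pooled samples look most like `unknown`."""
    target = ast_features(unknown)
    profiles = {
        author: sum((ast_features(s) for s in samples), Counter())
        for author, samples in known.items()
    }
    return max(profiles, key=lambda author: cosine(target, profiles[author]))


# Hypothetical toy samples: "alice" favours comprehensions, "bob" explicit loops.
known = {
    "alice": ["def f(xs):\n    return [x * 2 for x in xs]\n"],
    "bob": [
        "def f(xs):\n    out = []\n    for x in xs:\n"
        "        out.append(x * 2)\n    return out\n"
    ],
}
# Odd spacing and indentation on purpose: the syntax tree does not record them.
unknown = "def   g(items):\n        return [i + 1 for i in items]\n"
print(attribute(unknown, known))
```

Because the features come from the parsed tree, reformatting the unknown sample changes nothing, which matches the summary's point that the approach does not depend on how code was formatted.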

66 comments

  1. Satoshi Nakamoto... by chrisvdb · · Score: 4, Interesting

    ... could be an interesting use case.

    1. Re:Satoshi Nakamoto... by Anonymous Coward · · Score: 0

      and also provides ample motivation for covert funding by certain government entities of this research.

    2. Re:Satoshi Nakamoto... by Anonymous Coward · · Score: 0

      Ha came here to write that

    3. Re:Satoshi Nakamoto... by hcs_$reboot · · Score: 1

      What do you do when there is actually a team of devs? SN could be a group of university students whose code is a mix of styles.

      --
      Slashdot, fix the reply notifications... You won't get away with it...

    4. Re: Satoshi Nakamoto... by WarJolt · · Score: 1

      You should be able to classify small code segments.

      I seriously doubt it will work well with pure functional programs. Functional programmers tend to converge on similar programs.

    5. Re: Satoshi Nakamoto... by Anonymous Coward · · Score: 0

      You kiddin, right? Functional programs have no function, without that extra "sauce".

      Anyways, this thing would look at tabs, whitespace, lineshifts, naming, casing, the works.

      Now, if you say "Go", then you have my ear. But you won't.

    6. Re:Satoshi Nakamoto... by Satoshi+Nakamoto · · Score: 1

      Oh shit.

    7. Re:Satoshi Nakamoto... by Anonymous Coward · · Score: 0

      This is dumb. Now you just take the model in your own AI to produce code that looks like someone else produced it or take code someone else produced and refactor it to look like you did, etc.

  2. Yeah, if they have enough samples ... by Anonymous Coward · · Score: 0

    ... of the real coder, and the alleged coder ... ... then they have correlation.

    They would still need to exclude *all* other coders (via enough samples of their work) ... And have enough luck that no two coders are alike ... to get anything resembling certainty in actual causation.

    Which, given that we all learned from the same sources, and that even in research, often many researchers came up with the same idea independently, is not a scenario they are allowed to ignore.

    I'm sorry ... did I break their pseudoscience? [insert crying pile of poop emoji]

    1. Re: Yeah, if they have enough samples ... by Anonymous Coward · · Score: 0

      The presenters just want to engage in rough gay buttsex on stage.

    2. Re: Yeah, if they have enough samples ... by Anonymous Coward · · Score: 0

      In the next iteration the software will link AC comments on Slashdot to their meatspace identities. 8-)

  3. Please tell me that isn't an actual emoji! by Anonymous Coward · · Score: 0

    Not that I'd be surprised, but ... [insert crying pile of poop emoji]. ...

    Ohhh ... I see!

  4. Malware authors by Anonymous Coward · · Score: 0

    Identifying criminal malware authors is the obvious application but that stands in contrast to the victimhood complex around here

    1. Re:Malware authors by ShanghaiBill · · Score: 1

      Identifying criminal malware authors is the obvious application

      How often do you have the source code to malware?

    2. Re:Malware authors by Kiwikwi · · Score: 1

      From TFA:

      it’s possible to de-anonymize a programmer using only their compiled binary code.

    3. Re:Malware authors by Aighearach · · Score: 3, Funny

      is the obvious application

      I just want to know what my old Perl code does. Maybe this can help!

  5. All I've got to say by Anonymous Coward · · Score: 0

    10 print "Hello"
    20 goto 10

    1. Re:All I've got to say by hcs_$reboot · · Score: 4, Funny

      No need of a complex AI engine. 1) Using line numbers smells an old dev used to early Basic stuff, 2) Using 10 and 20 confirms 1), 3) "Hello" not even "Hello, world" confirms that the dev has no other experience and is probably a lousy programmer, 4) print / goto shows a total lack of imagination, and 5) posted anonymously, so we're looking at an old degenerated pretending-programmer not really proud of his code, posting anonymously in the hope of getting a desperate funny mod while being actually almost certain to leave an unappreciated lousy post. That was easy.

      --
      Slashdot, fix the reply notifications... You won't get away with it...

    2. Re:All I've got to say by JustAnotherOldGuy · · Score: 1

      10 print "Hello"
      20 goto 10

      Based on an in-depth analysis of this code and its many unique "signature" elements, I can state with 100% certainty that it was written by a programmer named "Anonymous Coward".

      --
      Just cruising through this digital world at 33 1/3 rpm...

    3. Re:All I've got to say by Anonymous Coward · · Score: 0

      Damn... you found me out!

    4. Re:All I've got to say by Anonymous Coward · · Score: 0

      So, who is the guy then? CowBoyNeal?

    5. Re:All I've got to say by Anonymous Coward · · Score: 0

      While your analysis is good, it still matches a majority of the /. population, so it is rather worthless for de-anonymization.

  6. Arms race by Anonymous Coward · · Score: 1

    We need new tool to parse code, create syntax tree, transform in ways to do same tasks but masks the ident of the authors, and re-emits, anonymized.

    Code de-anon tools could be used by regimes such as Chinese to find who wrote anti-censorship tool. Very dangerous to prevent anonymous writing, anonymous code, anonymous anything.

    Not to blame researcher: it will be done if it can be done. But now... to protect.
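A minimal sketch of the parse / transform / re-emit idea, assuming Python source and the standard `ast` module. The `Neutralizer` class and `anonymize` function are hypothetical names invented here. The sketch only strips surface habits (comments, layout, variable and parameter names); it does not reshape the tree itself, which is what the summary says the researchers rely on, so treat it as the easy first step rather than a working defence. It would also break callers that pass renamed parameters by keyword.

```python
import ast


class Neutralizer(ast.NodeTransformer):
    """Rename every variable and parameter bound in the source to v0, v1, ..."""

    def __init__(self, bound_names):
        self.mapping = {name: f"v{i}" for i, name in enumerate(bound_names)}

    def visit_Name(self, node):
        node.id = self.mapping.get(node.id, node.id)
        return node

    def visit_arg(self, node):
        node.arg = self.mapping.get(node.arg, node.arg)
        return node


def anonymize(source: str) -> str:
    tree = ast.parse(source)
    # Pass 1: collect names the code itself binds (parameters and assignment
    # targets), so builtins and imported names are left alone.
    bound = []
    for node in ast.walk(tree):
        name = None
        if isinstance(node, ast.arg):
            name = node.arg
        elif isinstance(node, ast.Name) and isinstance(node.ctx, ast.Store):
            name = node.id
        if name is not None and name not in bound:
            bound.append(name)
    # Pass 2: rewrite the tree, then re-emit it. ast.unparse (Python 3.9+)
    # discards comments and the original layout.
    return ast.unparse(Neutralizer(bound).visit(tree))


source = (
    "def total(prices, tax):  # my secret comment\n"
    "    subtotal = sum(prices)\n"
    "    return subtotal * (1 + tax)\n"
)
print(anonymize(source))
```

The function name `total` is deliberately left untouched so that other code can still call it; a real tool would have to decide which public names it can afford to change.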

    1. Re:Arms race by Anonymous Coward · · Score: 0

      We need new tool to parse code, create syntax tree, transform in ways to do same tasks but masks the ident of the authors, and re-emits, anonymized.

      Code de-anon tools could be used by regimes such as Chinese to find who wrote anti-censorship tool. Very dangerous to prevent anonymous writing, anonymous code, anonymous anything.

      Not to blame researcher: it will be done if it can be done. But now... to protect.

      You don't have to do all that.

      Just write everything in VB! [ducks]

    2. Re:Arms race by Anonymous Coward · · Score: 0

      If someone isn't trying to hide something, then why should anyone care?

    3. Re:Arms race by JustAnotherOldGuy · · Score: 3, Interesting

      We need new tool to parse code, create syntax tree, transform in ways to do same tasks but masks the ident of the authors, and re-emits, anonymized.

      Pffft, just copy someone else's code, problem solved. If anything happens it'll get blamed on them.

      --
      Just cruising through this digital world at 33 1/3 rpm...

    4. Re: Arms race by Anonymous Coward · · Score: 0

      Once crypto is made illegal (again).

  7. Does it identify programmers or projects by Anonymous Coward · · Score: 0

    Of course I did not read the article!

    In places I've worked development teams tend to develop similar coding styles depending on the problem they are trying to solve. So teams working on back end databases will all converge to similar coding styles, frequently influenced by use of common libraries. Does this technique account for that?

  8. Obvious, Not-so-Obvious and Not Obvious-Oblivious by ElitistWhiner · · Score: 1

    ...there's code that just makes you wonder " how many authors, iterations and algorithms later?".

    The latter is the future that'll take AI to sort out evolution

  9. I'm pretty sure if you applied this to my code by rsilvergun · · Score: 1

    the police would show up wanting to know where the bodies were buried.

    --
    Hi! I make Firefox Plug-ins. Check 'em out @ https://addons.mozilla.org/en-US/firefox/addon/youtube-mp3-podcaster/

    1. Re:I'm pretty sure if you applied this to my code by Anonymous Coward · · Score: 0

      So, where are they?

    2. Re: I'm pretty sure if you applied this to my code by Anonymous Coward · · Score: 0

      C feet under.

    3. Re: I'm pretty sure if you applied this to my code by Aighearach · · Score: 1

      Give or take ++

  10. Bad coding by Anonymous Coward · · Score: 0

    Some of my coding is so bad it would melt the eyeballs of anyone who knows proper coding.

    Yet, I have seen code so bad that I can't even read a screen page of it.

    I am sure the bad coders are easy to identify since they all make different mistakes.

    But for the best coders, there are only a limited number of ways to make the best solutions and I think it would be harder to tell them apart.

    E.C.P.

  11. No worries for me by Anonymous Coward · · Score: 2, Interesting

    I occasionally contribute to open-source projects, but I do so under my full name anyway. Seeing that they are able to identify authors of compiled code too, it might be interesting to see if they can identify the authors of viruses & malware that have been making the rounds the last decade. Who to sue . . .

    Another use case might be the javascript found on web pages. A noscript-like utility could ditch all javascript written by the wrong people - i.e. ad-related or spyware-related stuff. Lose it without the loss of functionality that full script blocking yields.

  12. How to get past this by Anonymous Coward · · Score: 0

    What you'd need to get past this is to use a seq2seq tool to shift the distribution of your code features to another version. One could argue that style is very recognisable. But work with GANs shows that it is possible to isolate certain feature vectors and use them to tune the output. With this technique, you can get past this deanonymization technique, even if it is simply just cloaking yourself with someone else's style.

  13. Have these researchers actually written code? by Solandri · · Score: 4, Insightful

    About half the time I code something, I end up grabbing a chunk of code that someone else has written which almost does what I want but not quite, copy/pasting it, and making a few tweaks to it so it'll do what I want.

    That's kinda the whole reason software is different from crafting or manufacturing - zero cost of duplication. So there's no point doing duplicate work if someone else has already done it. In fact that's the fundamental rationale underlying open source.

    1. Re:Have these researchers actually written code? by jetkust · · Score: 4, Informative

      None of this contradicts what they are doing. They addressed copy and paste in the article.

      From Article:

      Greenstadt and Caliskan have also uncovered a number of interesting insights about the nature of programming. For example, they have found that experienced developers appear easier to identify than novice ones. The more skilled you are, the more unique your work apparently becomes. That might be in part because beginner programmers often copy and paste code solutions from websites like Stack Overflow.

    2. Re:Have these researchers actually written code? by NicknameUnavailable · · Score: 2

      I've been writing software for over 2 decades and I still routinely copy+paste key components straight off the first StackOverflow result and hit run without testing it. It's just faster and it works 99% of the time, the other 1% takes a few more minutes of tweaking it but it ends up looking largely the same. It's definitely not inexperience. In fact, if I'm picking up some bleeding-edge thing I'll tend to do that less because there aren't preexisting code samples.

    3. Re:Have these researchers actually written code? by Anonymous Coward · · Score: 1

      so there is also zero cost in replacing you, right?

      copy-paste programmers are not the best programmers. if i need to google and search how something is done, i have not fully understood the problem yet. once i do, coding is the easy part and often takes less time than searching and re-editing 'to make it work'.

    4. Re:Have these researchers actually written code? by Anonymous Coward · · Score: 0

      I tells you, this is why I copypasted SO for 35+ years.. It WAS genius!

    5. Re: Have these researchers actually written code? by Anonymous Coward · · Score: 0

      i wondered that, too. this is an academic project, i seriously doubt it works in real life. most shops have coding and naming standards, code formatters, code itself is quite rigid compared to natural language, there are millions of coders, etc.

    6. Re:Have these researchers actually written code? by jetkust · · Score: 1

      Yes. Everyone does it. Just depends on what you're working on. I rarely can copy and paste code directly like that, but if I could I would. The more copying and pasting you can actually do the less complicated that project likely is to program. I think it's not as much the skill level of the developer but the difficulty of what is being programmed. So I still see a lot of truth in the findings, as you can argue lower skilled (or less experienced) developers may be working on projects with less difficulty.

    7. Re:Have these researchers actually written code? by NicknameUnavailable · · Score: 1

      Getting the right result the first try is more about using the right keywords than the difficulty involved.

    8. Re:Have these researchers actually written code? by AHuxley · · Score: 1

      It's everything around the copy/paste parts that will stand out.
      Comments, style, format, something a university always suggested. . . .
      Even date, font, slang, US vs UK spelling. Useful comments, comments that are always off topic? Unique use of terms to invoke faith, spellings?
      Someone who worked on NSA, GCHQ, mil/contractor code and always has to keep their comments to a set bureaucratic style? Something that feels like billable hours?

      --
      Domestic spying is now "Benign Information Gathering"

  14. When I steal/borrow code by bobstreo · · Score: 1

    I'd always add a comment regarding where it came from.

    If I wrote it "The Usual Suspects" was listed when/if I had time to add comments.

    So now you know, NSA/CIA/RIAA,,, /s

  15. But can they.... by Anonymous Coward · · Score: 0

    use it to show h1-b's copy-pasta code from GitHub?

  16. Does anybody write code that is not for stalking? by Anonymous Coward · · Score: 0

    It seems everything anybody does these days is inventing new ways to use the big data meat grinder.

  17. Code Reuse by Anonymous Coward · · Score: 0

    Code reuse was always encouraged. Now, there's a backlash?

  18. Bad news... by hcs_$reboot · · Score: 1

    ...for some former MS devs... IE6 and XP coders to be soon uncovered!

    --
    Slashdot, fix the reply notifications... You won't get away with it...

  19. Self Defeating by Anonymous Coward · · Score: 0

    If I truly crave anonymity, I can use the identifier to effectively re-anonymize myself, by comparing the code I want to keep anonymous to published code that my name is publicly attached to. i.e. I can use the machine to tell me when my code does NOT resemble my published code to a de-anonymizing degree.

    Bonus points for writing a babbler that substitutes certain concepts for functionally identical concepts as a means to defeat identification.

  20. Have these commenters actually read the article? by Kiwikwi · · Score: 3, Informative

    Yeah, you shouldn't need to worry then. From TFA:

    Experienced developers appear easier to identify than novice ones. The more skilled you are, the more unique your work apparently becomes. That might be in part because beginner programmers often copy and paste code solutions from websites like Stack Overflow.

  21. Joke's on them. Most "people" have no identities. by Anonymous Coward · · Score: 0

    They're merely drones of the swarm lifeform headed by some "opinion makers".

  22. Stack Exchange? by TJHook3r · · Score: 1

    All code will link back to some page on Stack Exchange - good luck with your profiles!

  23. Recognising style by Martin+S. · · Score: 2

    Once I've worked with a team for a while, I can generally recognise who coded something from their style.

    There are plenty of stylistic elements that distinguish the actual coder, even in shops with tight coding standards. Some favour for loops, some unroll their code, some cram lots of logic on one line, while others aggressively decompose. Some will write very abstract code, others tightly focused on the specific case. Some will use lots of getters and setters, others will favour tell-don't-ask; some will favour 'do { ... } while()', others will use while loops. Some write very short snappy functions, some longer functions; some use programming domain naming, others favour business domain naming. Some favour arrays, others favour collections.

    I've often been approached by colleagues with comments such as 'this looks like your code', and they are usually right, so this is not some special skill I possess. It is absolutely realistic that an algorithm or AI could identify these elements with static analysis and metrics and a sufficient sample.
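To show what "static analysis and metrics" could look like for a few of the habits listed above, here is a small sketch. It assumes Python source and the standard `ast` module purely for convenience (so there is no do/while to count), and the metric names and the `style_metrics` helper are made up for this example. A real attribution system would need far more features and many samples per author.

```python
import ast


def style_metrics(source: str) -> dict[str, float]:
    """Compute a few hand-picked style indicators from the syntax tree."""
    tree = ast.parse(source)
    nodes = list(ast.walk(tree))
    functions = [n for n in nodes if isinstance(n, ast.FunctionDef)]
    # Function length in source lines, from the "def" line to the last body line.
    lengths = [f.end_lineno - f.lineno + 1 for f in functions]
    comprehension_types = (ast.ListComp, ast.SetComp, ast.DictComp, ast.GeneratorExp)
    return {
        "for_loops": sum(isinstance(n, ast.For) for n in nodes),
        "while_loops": sum(isinstance(n, ast.While) for n in nodes),
        "comprehensions": sum(isinstance(n, comprehension_types) for n in nodes),
        "functions": len(functions),
        "mean_function_lines": sum(lengths) / len(lengths) if lengths else 0.0,
    }


# Hypothetical snippet: one function uses a for loop, the other a while loop.
sample = (
    "def a(xs):\n"
    "    for x in xs:\n"
    "        print(x)\n"
    "\n"
    "def b(n):\n"
    "    while n:\n"
    "        n -= 1\n"
    "    return n\n"
)
print(style_metrics(sample))
```

Comparing such numbers across a team's commits is roughly the automated version of a colleague saying "this looks like your code".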

    1. Re:Recognising style by Anonymous Coward · · Score: 1

      People can recognize my perl code, however nobody has been able to understand it. Including myself.

    2. Re:Recognising style by rgmoore · · Score: 1

      The thing I always wonder about this kind of thing is how well it scales. Your coworkers can tell a code snippet is from you because there are only a relative handful of people contributing to your project. But if you trained a program on just your group and then asked it to find your work on GitHub, it would probably find a whole lot of false positives-- other programmers whose style is similar enough to yours that it's fooled.

      This actually shows up in the article. The researchers claim to be 96% accurate when looking at a group of 100 programmers, but only 83% accurate when looking at a group of 600. If they were looking at tens of thousands, they probably wouldn't be able to do better than to guess that it was one member from a small group. That can still be useful- being able to reduce the search space is always good- but they're far from being able to pick one person out of millions.

      --

      There's no point in questioning authority if you aren't going to listen to the answers.
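The scaling worry above can be made concrete with some back-of-the-envelope arithmetic. Only the two accuracy figures (96% for 100 programmers, 83% for 600) come from the comment's reading of the article. The per-candidate false-match rate and the larger pool sizes below are invented numbers chosen to illustrate the base-rate effect, and `expected_false_matches` is a hypothetical helper, not anything from the research.

```python
# Figures quoted in the comment: (number of candidate programmers, accuracy).
reported = [(100, 0.96), (600, 0.83)]

for candidates, accuracy in reported:
    chance = 1 / candidates  # accuracy of a blind guess among the candidates
    print(f"{candidates} candidates: {accuracy:.0%} reported vs {chance:.2%} by chance")


def expected_false_matches(pool_size: int, false_match_rate: float) -> float:
    """Expected number of wrong people flagged when every candidate is compared."""
    return (pool_size - 1) * false_match_rate


# ASSUMPTION: a hypothetical 0.1% chance that any one unrelated programmer
# looks like a match. This rate is invented for illustration only.
for pool in (600, 100_000, 10_000_000):
    print(pool, round(expected_false_matches(pool, 0.001), 1))
```

Even a small per-candidate error rate multiplies out to many wrong suspects once the pool reaches GitHub scale, which is the point about only being able to narrow the search space.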

  24. I don't need AI to do this. by devslash0 · · Score: 1

    I can instantly tell which developer within the company wrote the code I'm reviewing just by looking at it.

  25. gorillas? by Anonymous Coward · · Score: 0

    Do any of the identified programmers look like gorillas?

  26. Isn't this already a thing? by Anonymous Coward · · Score: 0

    Isn't this essentially the same thing that researchers do with historical texts to show/disprove authorship?

  27. Identifying code authors by Anonymous Coward · · Score: 0

    This sounds very similar to an idea mentioned by science fiction writer John Varley in his novella "Press Enter".

    In 1985.

  28. A pipe to the kneecap by Anonymous Coward · · Score: 0

    These researchers, do conveniently named deserve lead/steel pipes to the kneecap. If they don't get the message the first time, the second time they get Christmas presents - concrete boots and a complimentary underwater sightseeing tours. There is no end where immoral morons will stop.

  29. Nope by Anonymous Coward · · Score: 0

    This will get abused on so many levels

  30. No surprise, really by Anonymous Coward · · Score: 0

    Modern textual and linguistic analysis is remarkably good at identifying the author with short excerpts. It turns out that a popular training set is the enormous number of emails and other communications from the Enron case.

    If you are a "published author" (of books), then there's enough of a corpus to allow your subsequent works to be rapidly and accurately identified (well above 80% accuracy - 10% false pos, 10% false neg). This makes it practically impossible for an author to publish under a pseudonym without being discovered (e.g. J.K.Rowling writing as Robert Galbraith was discovered within minutes)