Slashdot Mirror


Anonymous No More: Your Coding Style Can Give You Away

itwbennett writes: Researchers from Drexel University, the University of Maryland, the University of Goettingen, and Princeton have developed a "code stylometry" that uses natural language processing and machine learning to determine the authors of source code based on coding style. To test how well their code stylometry works, the researchers gathered publicly available data from Google's Code Jam, an annual programming competition that attracts a wide range of programmers, from students to professionals to hobbyists. Looking at data from 250 coders over multiple years, averaging 630 lines of code per author, their code stylometry achieved 95% accuracy in identifying the author of anonymous code (PDF). Using a dataset with fewer programmers (30) but more lines of code per person (1,900), the identification accuracy rate reached 97%.

138 of 220 comments

  1. Can they do it with corporate code? by msobkow · · Score: 5, Interesting

    Can they do it with corporate code where there are naming and style standards in abundance, and code reviews to ensure those guidelines are followed?

    --
    I do not fail; I succeed at finding out what does not work.
    1. Re:Can they do it with corporate code? by Marginal+Coward · · Score: 4, Funny

      It seems like using the applicable features of the corporate version control system would be a lot easier - and possibly even better than 95% accurate.

    2. Re:Can they do it with corporate code? by Anonymous Coward · · Score: 1

      If your corporate code base has commits from anonymous developers, you're doing something wrong. If it doesn't, and you need this sort of analysis to determine who wrote a section of code, you're doing something wrong.

    3. Re:Can they do it with corporate code? by Penguinisto · · Score: 2

      That's what "git blame" is for...

      /me ducks and runs like hell...

      --
      Quo usque tandem abutere, Nimbus, patientia nostra?
    4. Re:Can they do it with corporate code? by TitusC3v5 · · Score: 1

      It's not just limited by corporate code. Good luck doing this on pep8 Python.

      --
      And the masses cried out, "09 F9 11 02 9D 74 E3 5B D8 41 56 C5 63 56 88 C0!"
    5. Re:Can they do it with corporate code? by dark.nebulae · · Score: 2

      I've always found that even with style guidelines in place, developers will still leave their fingerprints all over it.

      Some devs will be verbose in their comments, some less. Some devs will embrace IoC where others shun it. Some devs will create a single method with all code in it, some will refactor the heck out of it with many methods. Heck, devs can't even agree sometimes on what should be public, protected, and private (and rarely will style guidelines dictate this kind of thing).

    6. Re:Can they do it with corporate code? by MouseTheLuckyDog · · Score: 1

      They are talking about the corporate code as a baseline to compare to the anonymous code.

    7. Re:Can they do it with corporate code? by grimmjeeper · · Score: 4, Informative

      You obviously haven't had to work in an environment where code has to be certified. I can tell you from firsthand experience that coding in an RTCA DO-178B environment or similar demands strict adherence to some very pedantic coding requirements. You'll find this type of development in avionics systems (both civilian and military) as well as other industries like medical electronics where code safety is literally life-and-death.

      Outside of that type of environment, I do agree with you. You'd be lucky if even half of the developers have seen a company coding standard. You'd be hard pressed to find any developers who really adhere to it even when they know the document exists. But in those small niche markets, you'd be surprised at how strictly they adhere to arbitrary coding standards (whether they really impact code quality or safety or not).

    8. Re:Can they do it with corporate code? by jellomizer · · Score: 4, Interesting

      Perhaps not as well. If people are following the coding standards for the organization then the code for the most part looks far more similar.

      When I am working with a development team, I tend to adjust my unique style to better match what everyone else is doing, even if it means using coding methods that I would normally disagree with.

      If the code tends to use a bunch of GOTOs instead of procedures or classes, I will use those GOTOs, not for my benefit but for the people who will maintain my code later on, so they won't have to change their mindset and debugging strategies to see what the program is doing when they make future corrections.

      I will go full Object Oriented if the group of people that I am working with do their coding full OO.

      My personal style is more procedural than OO, not due to lack of knowledge or not realizing OO's advantages and disadvantages. But if I am coding on my own, I code in the way that my mind handles the requirements, and in the way I feel will make it easier for me to change and fix my code in the future.

      I think this method is best for ID based on personal code, vs group corporate code, where a lot of your particular style is hidden.

      --
      If something is so important that you feel the need to post it on the internet... It probably isn't that important.
    9. Re:Can they do it with corporate code? by ShanghaiBill · · Score: 1

      If it doesn't, and you need this sort of analysis to determine who wrote a section of code, you're doing something wrong.

      With pair programming, you may have two programmers sharing a keyboard, and alternating writing chunks of code.

      I can usually look at a section of code, and reliably know which of my coworkers wrote it, even when they follow the style guidelines. Do they use an if-else chain, or a switch statement? Do they use #define's or prefer enums? Bitfields, or masks? Often I can tell who wrote it just by looking at the comments. Some people are neurotic about grammar and using complete sentences. Others prefer minimally concise fragments.

    10. Re:Can they do it with corporate code? by AK+Marc · · Score: 1

      Even if they build up a database of 100% of written code, how can they identify me if I only copy and paste code from others?

    11. Re:Can they do it with corporate code? by rtb61 · · Score: 1

      Just curious: how are larger companies doing with algorithm libraries and variable naming rules that ensure maximum reusability of code (variables named by function rather than by application)? Has anything changed, or is most of it still done from scratch? Are there any fancy algorithm databases with search functions based on algorithm descriptors and software engineering? What about things like software language translators, or the same algorithms stored in different languages? Is there any shift away from writing code toward assembling algorithms that can be expanded or reduced and snapped together?

      --
      Chaos - everything, everywhere, everywhen
    12. Re:Can they do it with corporate code? by grimmjeeper · · Score: 1

      RC doesn't pay me at all. I haven't worked there for over 15 years now.

    13. Re:Can they do it with corporate code? by wolrahnaes · · Score: 1

      Similarly I was thinking this would probably be defeated by a "minifier", obfuscator, or anything along those lines. There are dozens to choose from for most languages and it would be trivial for anyone attempting to remain anonymous to use them on their releases.

      If you want the code to remain usable, there are tools to enforce a standard style instead, in which case just set it up with rules based on a popular project if your language of choice doesn't have a specific style. At that point you're down to comments and variable names. Don't get fancy with either and I'd bet the identifiability would go down significantly.

      --
      I used to get high on life, but I developed a tolerance. Now I need something stronger.
    14. Re:Can they do it with corporate code? by war4peace · · Score: 3, Insightful

      *raising hands slowly* Is there a problem, Coding Officer?

      --
      ...gis sdrawkcab (usually not responding to ACs; don't bother posting as AC)
    15. Re:Can they do it with corporate code? by rubycodez · · Score: 1

      "legal" of course meaning adhering to rules written and ratified by a group of power and money grubbing politicians in the pockets of large corporations.

    16. Re: Can they do it with corporate code? by Anonymous Coward · · Score: 2, Funny

      Drats! I was.sure that.everyone else wrote.stuff.like "if(user == 'dumbfuck"){exit 666};

    17. Re:Can they do it with corporate code? by bhcompy · · Score: 2

      Why is it illegal?

    18. Re:Can they do it with corporate code? by Gorobei · · Score: 1

      Can they do it with corporate code where there are naming and style standards in abundance, and code reviews to ensure those guidelines are followed?

      I was starting to wonder about that, then realized we at $BIGCORP are already generating ASTs from your input buffer, unifying those trees with a bunch of patterns, and telling your editor to flag questionable constructs. You type "if not foo in x" and 50ms later you get a proposed improved snippet. It's pretty rare to see quirky style in our codebase.

    19. Re:Can they do it with corporate code? by aliquis · · Score: 1

      Or what about in the real world, when the number of "coders" is 1,000 times larger?

      It's likely 1 out of how many?

      Also, what if everyone just replaced all function and variable names with a, b, c, d ... in order of occurrence and put it all on one line?

    20. Re:Can they do it with corporate code? by Mr+Z · · Score: 2

      Did you read the part in the article where they're actually doing the matching based on the ASTs (abstract syntax trees), and so are able to identify authors even after the code goes through an obfuscator? Relevant quotes:

      Their real innovation, though, was in developing what they call “abstract syntax trees” which are similar to parse tree for sentences, and are derived from language-specific syntax and keywords. These trees capture a syntactic feature set which, the authors wrote, “was created to capture properties of coding style that are completely independent from writing style.” The upshot is that even if variable names, comments or spacing are changed, say in an effort to obfuscate, but the functionality is unaltered, the syntactic feature set won’t change.

      Accuracy rates weren’t statistically different when using an off-the-shelf C++ code obfuscators. Since these tools generally work by refactoring names and removing spaces and comments, the syntactic feature set wasn’t changed so author identification at similar rates was still possible.

      Regarding the first quote: The author of the article probably didn't realize that ASTs aren't a new thing; it's just this application of ASTs that's new. ASTs are as old as the hills. I learned about them from the Dragon Book, and by the time that was written they were old hat.
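
      To make the "syntactic feature set won't change" point concrete, here is a minimal sketch in Python using the standard ast module. It is not the researchers' feature set or classifier (they worked on C++ and the details are in the paper); the function name syntactic_features and the three sample snippets are made up for illustration. It only shows that counts of parent/child node types survive renaming, comment stripping and respacing, while a structural rewrite changes them.

```python
import ast
from collections import Counter


def syntactic_features(source: str) -> Counter:
    """Count (parent node type, child node type) pairs in the AST.

    Identifier names, comments and whitespace never appear in these
    counts, only the shape of the tree. Hypothetical helper, for
    illustration only.
    """
    tree = ast.parse(source)
    counts = Counter()
    for parent in ast.walk(tree):
        for child in ast.iter_child_nodes(parent):
            counts[(type(parent).__name__, type(child).__name__)] += 1
    return counts


ORIGINAL = '''
def total_price(items, tax_rate):
    # add up the prices, then apply tax
    subtotal = 0
    for item in items:
        subtotal += item
    return subtotal * (1 + tax_rate)
'''

# Same logic after an imagined name-mangling, comment-stripping obfuscator.
OBFUSCATED = '''
def a(b,c):
    d=0
    for e in b:
        d+=e
    return d*(1+c)
'''

# Same result, but written with a different structure (a different "style").
RESTRUCTURED = '''
def total_price(items, tax_rate):
    return sum(items) * (1 + tax_rate)
'''

same_after_obfuscation = syntactic_features(ORIGINAL) == syntactic_features(OBFUSCATED)
same_after_restructuring = syntactic_features(ORIGINAL) == syntactic_features(RESTRUCTURED)
print(same_after_obfuscation, same_after_restructuring)
```

      A real system would feed counts like these into a classifier alongside other features; the sketch stops at showing which edits the tree shape does and does not notice.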

    21. Re:Can they do it with corporate code? by s.petry · · Score: 1

      It's not just these types of environments that are strict. Well established companies have the same practices, because the only way to have controlled growth is to adhere to a set of standards. Sure, standards change over time but not quickly. For posterity, controlled does not imply restricted.

      --

      -The wise argue that there are few absolutes, the fool argues that there are no probabilities.

    22. Re:Can they do it with corporate code? by Dashiva+Dan · · Score: 1

      I can tell who wrote it just by looking at the comments

      Yeah, my first thought on this was "how accurate would it be if you a) stripped out comments, and b) ran through a code formatter (many code editors auto-formatting to a standard on the fly)"

      I think including comments is basically cheating, as they're super distinguishable. You can tell what code I've worked on cause I consistently type "teh", spell words like "colour" with my local spelling, etc. But recognising just the actual code itself, that's more impressive.

      --
      "lt;dr" is the correct response to most of my posts.
    23. Re:Can they do it with corporate code? by hcs_$reboot · · Score: 1

      Indeed. During a Google Code Jam contest, one has to be fast and the prog has to be fast too! During the contest, a lot of devs 1. don't use the language they would normally use for other programs, 2. use tons of #defines to speed up typing, and 3. don't care at all about readability, maintenance, code style and the like. That makes the whole program unique in a way, a kind of signature, but hard to read. That identification algo would have a much harder time identifying devs based on corporate programs.

      --
      Slashdot, fix the reply notifications... You won't get away with it...
    24. Re: Can they do it with corporate code? by Anonymous Coward · · Score: 1

      I've already narrowed you down to a web developer

    25. Re:Can they do it with corporate code? by RabidReindeer · · Score: 1

      I could do it with corporate code without any analytical software at all.

      One guy I know consistently introduced bugs because he didn't understand assembly language (ironically, he was an assembly language bigot).

      Another caused people to complain because he never coded a subroutine where he could simply cut-and-paste code. And that was in a shop with all sorts of standards.

      Then there are the comments (or lack of them) and their distinctive, but not always professional observations.

      So definitely.

    26. Re:Can they do it with corporate code? by RabidReindeer · · Score: 2

      A sonnet has strict rules, too.

      But I'd wager that someone could tell one of Shakespeare's from one of yours.

    27. Re:Can they do it with corporate code? by Kiwikwi · · Score: 1

      Can they do it with corporate code where there are naming and style standards in abundance, and code reviews to ensure those guidelines are followed?

      Presumably, yes. Style guides are 95% formatting, and if one RTFA (I know, I know), they look only at the structure of the parsed AST, not variable names, comments and whitespace. From the article:

      Accuracy rates weren’t statistically different when using an off-the-shelf C++ code obfuscators. Since these tools generally work by refactoring names and removing spaces and comments, the syntactic feature set wasn’t changed so author identification at similar rates was still possible.

      Since they look at code structure, they've even found identifying patterns that survive compilation and end up in the binary.

      This is one of the coolest data mining results I've seen in quite a while.

    28. Re:Can they do it with corporate code? by jellomizer · · Score: 1

      Most companies don't.
      If it gets to a point where your program is changing its programming language for its code, chances are the entire workflow process will be evaluated, and will be coded from the start up. If there isn't a change in workflow, then there isn't a good need to change how the program is written, and they will just code the legacy system, in the style of the time.

      However, if your old COBOL or Fortran system is being migrated to a newer platform, the new workflow means a lot of the cool tricks from back then may be simplified down to a built-in language class, so that a module that took weeks to perfect may be as easy as x.dothis()

      --
      If something is so important that you feel the need to post it on the internet... It probably isn't that important.
    29. Re: Can they do it with corporate code? by HornWumpus · · Score: 1

      My code tell would be comments threatening to break all of the other coder's fingers. In extreme cases toes also, so the bastard can't code with his feet.

      --
      John McAfee 'It was like that time I hired that Bangkok prostitute; to do my taxes, while I fucked my accountant'
  2. Demonstrates the need... by JonSchell · · Score: 1

    This is why people need to follow style guides, so that all source code is styled the same.

    1. Re:Demonstrates the need... by Anonymous Coward · · Score: 5, Insightful

      This is why people need to follow style guides, so that all source code is styled the same.

      There's a damn good chance 95% of coders are not criminals, nor would they care if someone identified their code.

      That said, this will become a legal nightmare when this kind of profiling can be used to frame another coder.

      And with the laws wanting to treat any "hacker" as a potential terrorist these days, the consequences of even being accused can be rather severe to deal with.

    2. Re:Demonstrates the need... by Impy+the+Impiuos+Imp · · Score: 5, Insightful

      You want scary? The same can be applied to general text on the Internet, tying posters on different sites together, including tying an anonymous account (an avatar that isn't your real name) to a site with your real name.

      Which the NSA probably has churning away on its databases. Which probably does little more than add confirmation of said links from watching and recording all traffic to any and all of a billion IP addresses.

      And I, for one, welcome our new panopticon overlords who won't abuse it, not one of their thousand agents, because they're supposed to check a got-a-warrant box on a piece of paper before choosing to abuse it.

      --
      (-1: Post disagrees with my already-settled worldview) is not a valid mod option.
    3. Re:Demonstrates the need... by Anonymous Coward · · Score: 1

      The trouble with unbridled capitalism is that the government always ends up working on behalf of the powerful to preserve their status. De-embiggening government just means that less democratic power structures take its place. In the worst case, you have the state owning everything on behalf of its puppet-masters, like a single giant business ("state capitalism").

      As in all mature things, the solution is balance - an educated citizenry which trusts people to get on with their own thing except when they end up with too much power. But who in power wants an educated citizenry? That part must be preserved from the bottom.

    4. Re:Demonstrates the need... by grimmjeeper · · Score: 1

      This is why people need to follow style guides, so that all source code is styled the same.

      Why does all code need to be styled the same?

      I can see a need in a safety critical environment like avionics or medical devices that needs strict adherence to rules to ensure that the code has been written correctly and with as few bugs as possible. But what difference does it make outside of that kind of environment? I mean, so what if there's a thousand different coding standards in the Chrome source? What difference does it really make?

    5. Re:Demonstrates the need... by EvilIdler · · Score: 1

      I wonder how this works for Go, where style is stricter, and people tend to use a formatting tool. Only the comments and naming schemes left to identify by, I guess.

    6. Re:Demonstrates the need... by harperska · · Score: 1

      Even when following a coding style guide 100%, there is still generally enough leeway to allow for plenty of personal style. There's the words you use to name things, use of whitespace and grouping of statements, basically everything about a piece of source code that's lost if you compile and then decompile a program. Just like the prose from two different authors remains distinct, even if both go through the same copy editor to fit a publisher's style guide. And if your corporate style guide requires your code to be indistinguishable from decompiled code, you need to find a new job.

  3. Useless by Anonymous Coward · · Score: 1

    Who releases source code without their name?
    Let me know when you can determine the author from just the binary...

  4. Up next, automatic intelligence rating... by TWX · · Score: 4, Funny

    ...based on the quality of that code...

    --
    Do not look into laser with remaining eye.
    1. Re:Up next, automatic intelligence rating... by halivar · · Score: 4, Funny

      goto blah;
      ^^ Idiot.

      // If you don't know why this is here, don't fuck with it.
      goto blah;

      ^^ Code guru.

    2. Re:Up next, automatic intelligence rating... by Tablizer · · Score: 1

      But readable code is often preferred over clever code by team members.

    3. Re:Up next, automatic intelligence rating... by lgw · · Score: 4, Insightful

      For lack of mod points let me just say: beautiful!

      It's like this in any engineering discipline:
      * The apprentice doesn't do things by the book, for he thinks himself clever
      * The journeyman does everything by the book, for he has learned the world of pain the book prevents
      * The master goes beyond the book, for he understands why every rule is there and no longer needs the rules

      Or put another way - the apprentice thinks he knows everything, the journeyman knows how little he knows, the master knows everything in the field, and still knows how little he knows.

      --
      Socialism: a lie told by totalitarians and believed by fools.
    4. Re:Up next, automatic intelligence rating... by Anonymous Coward · · Score: 1

      And the guru (guru > master) knows he knows nothing in his field, but still knows more than the corporate enterprise architect

    5. Re:Up next, automatic intelligence rating... by halivar · · Score: 1

      It's like jazz. You have to know know rules before you can break them.

    6. Re:Up next, automatic intelligence rating... by halivar · · Score: 1

      And, I accidentally repeated repeated a word.

    7. Re:Up next, automatic intelligence rating... by c · · Score: 1


      try { ...
            throw BlahException("blah");
      } catch(Exception& blah) { ...
      }
      ^^ Idiot.

      --
      Log in or piss off.
    8. Re:Up next, automatic intelligence rating... by ihtoit · · Score: 1

      if I were the programmer (I'm not, not since primary school when I programmed the TURTLE to draw stuff on large sheets of cartridge paper) I'd be dropping //remarks in everywhere. Back to when I did TURTLE programming, I got berated for wasting time on comments but when it came down to 1000+ lines of code, it was nice to know which draw routines drew what part of the image. My TURTLE St. Paul's Cathedral was 7,700+ lines of code, probably 3/4 of that was comments. If it were stripped of comments it'd probably have ended up way less than 2,000 lines but nobody (not even me) would've had the first clue about what drew what.

      --
      Political debates have me rolling my eyes so much I think I got optical whiplash. I should sue. - Foamy The Squirrel
    9. Re:Up next, automatic intelligence rating... by ranton · · Score: 1

      This doesn't seem so far fetched. I'm not sure the field of natural language processing is that far away from being able to create metrics which would determine the skill of a developer by looking at their code. It could then be used by employers during the hiring process and during reviews.

      While that may sound like a nightmare scenario (and it very well could be), a more intelligent software system may even be able to show why it thinks the code is bad, and give an interviewer or reviewer the chance to ask why something was done. Taking 10,000 lines of code and narrowing it down to 100 lines that could help make the determination between good employee or bad employee could be useful.

      The big trick is how to train the system, since you would have to identify good and bad coders for supervised training. I doubt unsupervised training could do anything more than cluster like minded developers together. Although even that is useful, since you could identify a dozen good programmers manually and then have the system identify hundreds more by finding similar coding styles.

      --
      -- All that is necessary for the triumph of evil is that good men do nothing. -- Edmund Burke
    10. Re:Up next, automatic intelligence rating... by gstoddart · · Score: 1

      // exception was found
      // beyond here be dragons, run
      // make your escape now
      goto blah;

      ^^ code master

      --
      Lost at C:>. Found at C.
    11. Re:Up next, automatic intelligence rating... by russotto · · Score: 2

      The guru knows the novice knows more than the corporate enterprise architect, but won't let on lest the novice get a more-swelled head.

    12. Re:Up next, automatic intelligence rating... by danknight48 · · Score: 1

      goto blah;
      ^^ Idiot.

      // If you don't know why this is here, don't fuck with it.
      goto blah;

      ^^ Code guru.

      Yep, I hate them as well; I've only ever had to use one once. But there is a very rare case where goto is actually needed: nested loops.
      http://pastebin.com/FBQMDBme

    13. Re:Up next, automatic intelligence rating... by cellocgw · · Score: 1

      And you still got the song wrong. It goes "to know, know, know you, is to love, love, love you..." See? Ya gotta repeat twice (yeah I'm a grammar pedant: saying it 3 times is repeating twice :-) ) .

      --
      https://app.box.com/WitthoftResume Code: https://github.com/cellocgw
    14. Re:Up next, automatic intelligence rating... by atownsley · · Score: 1

      From someone who got his MSCS at Drexel: this is used to prevent someone copying code from the Internet and submitting it as their own. Sure, people have tried to rename variables and method / function names...but from what I have heard, they have been caught. The problem is the student didn't understand what they were doing, so structurally the code was the same; they just tried to change things. I don't know all the details of how it works, but the comparison is done at multiple levels (i.e. source, obj, and exe), and due to compiler optimizations patterns do emerge....

      None of the professors (at least in my experience) have a problem with students referencing outside source, figuring out what it does, and then writing THEIR OWN CODE to solve the problem.

      To lgw's post...
      * If you are a stupid apprentice trying to pass someone else's code as your own, it will catch you.
      * If you are a journeyman reading the book, trying to understand the concepts and use an internet code example for reference and then write your own code based on your readings and the code example for guidance, you are probably going to be fine
      * If you are a master, why do you need examples...you are a master...go code the damn thing yourself...

    15. Re:Up next, automatic intelligence rating... by Altus · · Score: 1

      Yeah but it's really about all the words you don't repeat.

      --

      "In America, first you get the sugar, then you get the power, then you get the women..." -H. Simpson

    16. Re:Up next, automatic intelligence rating... by ebvwfbw · · Score: 1

      You can't tune a tunafish, eh?

  5. Let's analyze the cyberspying code. by SeaFox · · Score: 1

    Using this technique, can they tell us if the NSA did write the Regin Malware now?

    1. Re:Let's analyze the cyberspying code. by blackomegax · · Score: 1

      I want to see it run Regin against sections of code in gnu/linux/systemd and see if the same NSA shills wrote any of it.

  6. Re:Lol? by TWX · · Score: 2

    Heh. If it's effective in a clusterfuck of copy/paste, then it should be really effective when the bulk of the code is original...

    Sounds like the solution is to use an entirely different language than the bulk of one's work is in, if one wants to anonymously write malicious or otherwise legally complicated code.

    --
    Do not look into laser with remaining eye.
  7. What about Bitcoin? by Anonymous Coward · · Score: 5, Funny

    Can we use this to find Satoshi?

  8. Shouldn't be hard to foil by SlideRuleGuy · · Score: 1

    With coding standards to follow, and tools that uniform-ify your code, it should be easier to anonymize it than with regular prose. And regular prose is apparently trivial to anonymize: see "Practical Attacks Against Authorship Recognition Techniques" by Michael Brennan and Rachel Greenstadt.

  9. This is true for /. comments, too. by Anonymous Coward · · Score: 1

    This has always been obvious.

    It's true for comments here, too. Only apk can craft a true apk comment. Others have tried, but they're never quite like the genuine thing.

    But we should be careful with such analysis, too. In some cases it can be totally wrong.

    There is a Slashdot-like site called Soylent News. There was once a guy over there who would claim that different posters were actually the same person, even when they weren't, and in some cases couldn't have been (one of the people he accused had died earlier).

    How did he "know" they were the same people? He said he had a "complex" algorithm that used bzip2 and a comparison of the size of the compressed comment text. Of course, his allegations were correct about 0% of the time.
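
    For the curious, a compression-size comparison along those lines can be sketched in a few lines of Python. The comment above doesn't say what the guy actually computed, so this is an assumption: it uses the well-known normalized compression distance (NCD) with the standard bz2 module, and the names compressed_size and ncd plus the two sample texts are made up for illustration. Lower scores mean "more alike", and on short comments the scores are noisy enough that they prove nothing about authorship.

```python
import bz2


def compressed_size(text: str) -> int:
    """Size in bytes of the bzip2-compressed UTF-8 text."""
    return len(bz2.compress(text.encode("utf-8")))


def ncd(a: str, b: str) -> float:
    """Normalized compression distance between two texts.

    Hypothetical helper: if b shares a lot of material with a, then
    compressing a + b costs little more than compressing a alone,
    and the score is small.
    """
    size_a, size_b = compressed_size(a), compressed_size(b)
    size_ab = compressed_size(a + b)
    return (size_ab - min(size_a, size_b)) / max(size_a, size_b)


# Illustrative inputs only.
COMMENT_A = (
    "I can usually look at a section of code and reliably know which of my "
    "coworkers wrote it, even when they follow the style guidelines. Some "
    "people are neurotic about grammar and complete sentences, while others "
    "prefer short fragments and never bother with punctuation at all."
)
COMMENT_B = (
    "Quarterly widget shipments rose sharply; freight by rail, barge, and "
    "truck exceeded budget. Zinc, copper, and jute prices fell. Harbor "
    "pilots voted on Tuesday to extend the dredging moratorium through "
    "August, pending a new survey of the estuary and its oyster beds."
)

distance_to_self = ncd(COMMENT_A, COMMENT_A)
distance_to_other = ncd(COMMENT_A, COMMENT_B)
print(distance_to_self, distance_to_other)
```

    All this measures is shared byte patterns, so two people quoting the same article, or writing about the same topic, will look "related" too.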

    1. Re:This is true for /. comments, too. by ihtoit · · Score: 1

      There's a wiki site (can't remember the name) that takes great joy in posting accusations without attribution or evidence, and when called on them, the admins sit there and claim that the person who posted the slander is the same person now trying to get a retraction, based on some sort of magic ring with a seekrit style decoder. Even when called out to post the evidence they claim to hold, they just dive straight into claiming knowledge they can't possibly have, for various reasons, not least of which is that said claimed evidence doesn't exist outside their imaginations.

      --
      Political debates have me rolling my eyes so much I think I got optical whiplash. I should sue. - Foamy The Squirrel
  10. Re:Next thing you know by Anonymous Coward · · Score: 2, Funny

    Why would they even bother with an algorithm to process your ramblings? Every time I see you post, I instantly think "oh here's this jerk again".

  11. Re:Lol? by Penguinisto · · Score: 1

    That kind of depends on the stylesheets, pre-compiler style enforcement routines, and the fact that a shit-ton of corporate code is often improved incrementally by multiple authors.

    'course, there's still the comments that you could use, but who does that?

    --
    Quo usque tandem abutere, Nimbus, patientia nostra?
  12. No Kidding by invid · · Score: 4, Insightful

    I can usually tell who wrote the code in the office by whether or not they put a space after their ifs: if(i == 0) vs if (i == 0); where they put their brackets, whether or not they replace their tabs with spaces, how they deal with bools: if (!var) vs if (var == false) and several other telling signs. There are so many combinations of variations no two programmers in the office (about 12 of us) have the same style.
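
    The tells listed above are easy to count mechanically. Here is a rough sketch in Python that tallies a few of them in C-style source using regular expressions. The function name surface_style, the feature names and both sample snippets are invented for illustration, and regexes like these will miscount inside strings and comments, so treat it as a toy rather than a real analyzer.

```python
import re


def surface_style(source: str) -> dict:
    """Count a few layout habits in C-like source text (toy example)."""
    multiline = re.MULTILINE
    return {
        # "if (" versus "if("
        "if_with_space": len(re.findall(r"\bif \(", source)),
        "if_without_space": len(re.findall(r"\bif\(", source)),
        # "(!var)" versus "var == false"
        "negation_bool": len(re.findall(r"\(\s*!\s*\w+\s*\)", source)),
        "compare_false": len(re.findall(r"==\s*false\b", source)),
        # tabs versus spaces for indentation
        "tab_indented_lines": len(re.findall(r"^\t+", source, flags=multiline)),
        "space_indented_lines": len(re.findall(r"^ +", source, flags=multiline)),
        # opening brace at the end of the line versus on its own line
        "brace_on_same_line": len(re.findall(r"\)[ \t]*\{[ \t]*$", source, flags=multiline)),
        "brace_on_own_line": len(re.findall(r"^[ \t]*\{[ \t]*$", source, flags=multiline)),
    }


# Two made-up coworkers writing the same logic.
SAMPLE_A = "if (count == 0) {\n    return;\n}\nif (!ready) {\n    wait();\n}\n"
SAMPLE_B = "if(count == 0)\n{\n\treturn;\n}\nif(ready == false)\n{\n\twait();\n}\n"

style_a = surface_style(SAMPLE_A)
style_b = surface_style(SAMPLE_B)
print(style_a)
print(style_b)
```

    With only eight yes/no habits there are already 256 combinations, which fits the observation that no two people in a 12-person office end up with the same style.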

    --
    The Moore-Murphy Law: The number of things that will go wrong will double every 2 years.
    1. Re:No Kidding by leonardluen · · Score: 1

      i could do the same. not only that, but i could often also tell who had originally trained that person, because part of the trainer's style often leaked into their style.

      i work at a university and we hire 100 level CS students. so we generally assumed they knew nothing and trained them from scratch.

    2. Re:No Kidding by invid · · Score: 1

      Actually they have recently introduced style cop, which enforces some things, but it ignores a number of discernible quirks.

      --
      The Moore-Murphy Law: The number of things that will go wrong will double every 2 years.
    3. Re:No Kidding by ThatsDrDangerToYou · · Score: 1
      Yeah, about that... I start twitching whenever my boss types: MyFunction (arg1, arg2) and so on. Who puts a space after the function name before the '('? People who must die, of course.

      OK, calming down now.. 1.. 2.. 3.. 4.. 5..

      No, I'm OK, really.

      I had an old boss who was a code style nazi. He was an asshole. And actually, my current boss is very cool, even if he codes like that.

    4. Re:No Kidding by Marginal+Coward · · Score: 1

      I once worked on a project that had a handful of developers, where each developer was in charge of the code for one of the software subsystems of the project. We didn't have much of a coding standard there - only about one page - but we ended up with a consensus coding style in the project that everybody could live with. Even so, you could always tell who wrote what by the personality shown around the edges of the coding style of a given module, function, or even over just a few lines.

    5. Re: No Kidding by Anonymous Coward · · Score: 1

      You should use an automatic JavaScript style tool, then. One of my favorites has a funny name - it's called "Obfuscator."

    6. Re:No Kidding by AK+Marc · · Score: 1

      If the whitespace is meaningless, it should be eliminated (carriage returns excepted). However, I can understand people who add in meaningless whitespace, as sometimes a + b is easier to read than a+b, even if they are interpreted the same.

    7. Re:No Kidding by OSULugan · · Score: 1

      Should use if (false == var) to avoid incidental issues of assignment. A good pre-processor will catch the inadvertent assignment and flag it for repair, but it is a good practice to be in for code that you don't run through a pre-processor.
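As a rough illustration of the kind of check described above, here is a toy scanner that flags a lone `=` inside an `if (...)` condition. It is a sketch under simplifying assumptions (one condition per line, no nested parentheses, no awareness of strings or comments), and the names `IF_CONDITION`, `LONE_ASSIGN` and `find_suspicious_ifs` are invented for this example; it is not how any real pre-processor or compiler does it.

```python
import re

# Text inside the parentheses of an "if (...)" with no nested parentheses.
IF_CONDITION = re.compile(r"\bif\s*\(([^()]*)\)")
# A single "=" that is not part of "==", "!=", "<=" or ">=".
LONE_ASSIGN = re.compile(r"(?<![=!<>])=(?!=)")


def find_suspicious_ifs(source: str) -> list:
    """Return (line_number, line) pairs where an if-condition assigns."""
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for match in IF_CONDITION.finditer(line):
            if LONE_ASSIGN.search(match.group(1)):
                hits.append((lineno, line.strip()))
    return hits


code = (
    "if (var = false) {\n"
    "}\n"
    "if (var == false) {\n"
    "}\n"
    "if (false == var) {\n"
    "}\n"
    "if (a <= b) {\n"
    "}\n"
)
hits = find_suspicious_ifs(code)
```

Only the first `if` is reported. The point of writing `false == var` is that the mistyped form `false = var` is rejected by the compiler outright, so no extra tool is needed to catch it.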

    8. Re:No Kidding by CannonballHead · · Score: 1

      So, you don't indent code? Or if you do, at what point is the indent meaningless (how many spaces/tabs) ... ? No spaces after semicolons? Or before/after braces? Or ...

      Readability should count as meaningful. It helps. And the compiler strips it out anyways, right, so ultimately it doesn't matter, just like comments, except in helping understand the code.

      I may be misunderstanding something completely in what you said... but I don't get why you would say it should be removed. Maybe in javascript for network performance reasons or something, but you should just minify or something in that case, because of variable and function name length and all that...

    9. Re:No Kidding by R3d+M3rcury · · Score: 1

      Actually, the one I hate is:

      if ($variable == false) {
            doSomethingInteresting($variable);
      }

      and one of my co-workers does:

      if ($variable == false)
            {
            doSomethingInteresting($variable);
            }

      Of course, my code is beautiful and everyone else's is terse and ugly and everyone should write code the same way that I do. Try suggesting that to a group of programmers and see how far it gets you. Generally, it's not worth the argument--you will waste tons of everyone's time trying to come up with an agreement.

      As the thread suggests, one advantage to different coding styles is that you can generally tell who wrote what and, if there seems to be a bug, you can track them down and tell them to fix it in that ugly mess. In our office, we have the rule that if you go around changing code style, you now own that code and are responsible for it. About the only issue we've run into is that people's styles evolve over time. So the guy right out of school may have a certain style that changes as he is exposed to more styles.

      My favorite story was when someone was trying to push variable naming standards. If it was a C string, the variable name should begin with "sz" (for string, zero terminated). I suggested that instead of doing that, maybe we should just put a dollar-sign at the end. Laughter ensued and that ended that.

    10. Re:No Kidding by PRMan · · Score: 1

      And in Visual Studio, I hit Ctrl+K Ctrl+D all the time, which puts my code into "Standard" Microsoft format. If everyone did this, I imagine the analyzer would drop to 50% or lower.

      --
      Peter predicted that you would "deliberately forget" creation 2000 years ago...
    11. Re:No Kidding by ihtoit · · Score: 2

      coding to book (sans comments) will kill the process of identifying authors stone dead, I think. If everybody's "Hello World!" was identical, how do you tell the difference?

      --
      Political debates have me rolling my eyes so much I think I got optical whiplash. I should sue. - Foamy The Squirrel
    12. Re:No Kidding by disambiguated · · Score: 1

      Use a diff tool that can ignore formatting changes. I'm a fan of Beyond Compare, but there are plenty of others.

    13. Re:No Kidding by disambiguated · · Score: 2

      Style guidelines should be about avoiding pitfalls of the language, using appropriate idioms, and not making life miserable for maintainers, not about where you put spaces and braces.

    14. Re:No Kidding by phantomfive · · Score: 1

      As the thread suggests, one advantage to different coding styles is that you can generally tell who wrote what and, if there seems to be a bug, you can track them down and tell them to fix it in that ugly mess. In our office, we have the rule that if you go around changing code style, you now own that code and are responsible for it. About the only issue we've run into is that people's styles evolve over time. So the guy right out of school may have a certain style that changes as he is exposed to more styles.

      git/cvs/svn/mercurial blame can tell you who wrote whatever code. Please tell me you are using some kind of source repository.......

      --
      "First they came for the slanderers and i said nothing."
    15. Re:No Kidding by AK+Marc · · Score: 1

      Indent isn't meaningless. But there's no reason to double-space an indent. It carries a reading meaning, related to nesting of code.

      Code "feels" smaller when it's compact. Also, having a single spacing method uniform across everyone makes for easier cut-and-paste sharing. Having one person space things differently than another will result in decreased readability.

    16. Re:No Kidding by wasteoid · · Score: 1

      if (false == var) prevents accidentally assigning false to var if you forget to use double equals

    17. Re:No Kidding by burbilog · · Score: 1
      I can usually tell who wrote the code in the office by whether or not they put a space after their ifs: if(i == 0) vs if (i == 0); where they put their brackets, whether or not they replace their tabs with spaces, how they deal with bools: if (!var) vs if (var == false) and several other telling signs. There are so many combinations of variations no two programmers in the office (about 12 of us) have the same style.

      Can you do the same after indent -kr?..

  13. That explains it by Tablizer · · Score: 2

    I suppose all those "// damn U bill gates!" comments gave me away

  14. Welcome to the party by meerling · · Score: 2

    When I was a kid in the 80s we figured out we could identify who wrote a particular piece of software by looking at its code. We used those individualistic, identifiable features in the argument over whether programming is an art or a science, when we wanted to support the art side.

    1. Re:Welcome to the party by Virtucon · · Score: 4, Insightful

      It's all about style. Writing software is very creative and it needs to have the author's fingerprints on it somewhere. If corporations don't like that they can suck the source code into a parser and spit out perfectly mundane crap that loses the intonation and the thoughts the original developer had for it.

      --
      Harrison's Postulate - "For every action there is an equal and opposite criticism"
  15. John Varley Press Enter by Crashmarik · · Score: 3, Informative

    1985 Hugo Winner

    Really, the fact that coding style is recognizable was so well known it made it into pop culture 30 years ago.

    Also, on the smaller sample size the program might just be recognizing the parts of the style that come from the corporate standards. It would be interesting to see if it could recognize code from people who all work at the same company.

  16. Re:Not my Frankencode... by Tablizer · · Score: 3, Funny

    ... a patchwork of open-source freebies.

    So, what's it like to work for FaceBook?

  17. Vernor Vinge probably beat him to it by Crashmarik · · Score: 1

    But I can't recall an instance.

    1. Re:Vernor Vinge probably beat him to it by AJWM · · Score: 1

      Vinge is considered one of the fathers of cyberpunk because of his "True Names", which did precede Varley's chilling (and Hugo-winning) "Press Enter[]" (1981 vs 1985).

      On the other hand, Varley's much earlier (1976) "Overdrawn at the Memory Bank" was also one of the seminal works of the field.

      Been a while since I've read it, but the warlocks (hackers) in "True Names" would never have let their identity (true name) be determined from their coding styles.

      --
      -- Alastair
  18. Source of Future Data by Ronin+Developer · · Score: 1

    I guess we can expect that source code repositories will be scanned and processed. And, for code written by multiple authors, the modified code (from commits) will be scanned and indexed as well.

    But, I bet they will never figure out who writes the malware recently attributed to the three letter agencies. They should, however, be able to figure out which agency writes the stuff if they get a copy of the source code or maybe even from decompiling the binary.

    Additionally, if it was written for .NET, the compiled CLR code can be reflected back to VB, C# or any other .NET language to retrieve the source code.

    1. Re:Source of Future Data by Shados · · Score: 1

      Back in the days of .NET 1~2, decompiling via Reflector or whatever other tool got you back pretty good stuff. Today, there's a LOT more sugar, from LINQ to async/await and everything in between. If you go back to the original language, good decompilers sometimes infer what the original sugar was from the output following certain conventions and patterns...but moving that to another language will give you unreadable garbage.

      Reading F# in C# , this>but,worse>

    2. Re:Source of Future Data by Shados · · Score: 1

      Bah, formatter messed things up. The last line was me joking about the crazy nested generic chains that F# types end up looking like in a language that doesn't support the same syntax sugar.

  19. The key to this system being used is, ...... by Selur · · Score: 1

    "The key to this system being used is, of course, first obtaining the code stylometries for a wide range of developers. The authors didn't address how, say, a database of programmers’ styles would be compiled. Also, to identify the author of a piece code would require access to the source code, and not just executables, though the authors mention there is some evidence that style is preserved in binaries."
    -> so once you post to github and similar 'they' can link every code you ever write to you,....

  20. Was discussed at 31c3 by YoungManKlaus · · Score: 1

    so you are a good month late with the news

    1. Re:Was discussed at 31c3 by ihtoit · · Score: 1

      are the podcasts/videocasts out for that yet?

      --
      Political debates have me rolling my eyes so much I think I got optical whiplash. I should sue. - Foamy The Squirrel
    2. Re:Was discussed at 31c3 by YoungManKlaus · · Score: 1

      of course, since like 2 days after the conference ended. http://media.ccc.de/browse/con...

    3. Re:Was discussed at 31c3 by ihtoit · · Score: 1

      silly me. Thanks for the link anyway :)

      --
      Political debates have me rolling my eyes so much I think I got optical whiplash. I should sue. - Foamy The Squirrel
  21. Bad Coders Can't Be Identified by TrollstonButterbeans · · Score: 3, Interesting

    If your coding is terrible and very newbie like, they can't single you out since your code is similar to the ocean of other terrible coders.

    So if you are a paranoid freak, the best way to ensure your safety and keep the government off your back is to write terrible code.

    --
    Priest: "Universe from nothing, no laws of physics, sped up time"+ huge discrepancies. Creationism? No. Big Bang Theory
    1. Re:Bad Coders Can't Be Identified by ThatsDrDangerToYou · · Score: 1

      Ah, my work here is done!

  22. Oblig XKCD by Krazy+Kanuck · · Score: 2

    Not that many of us actually use comments.... http://xkcd.com/1421/

  23. Most programming isn't new code by jgotts · · Score: 3, Insightful

    Most programming isn't writing new code. Most programming is working on someone else's crap you inherited. Invariably, you're going to be using that person's style or else the result will look like garbage.

    There is also the problem that most non-trivial code is worked on by multiple people at the same time.

    Writing some code from scratch as an assignment is a very artificial exercise nowadays, unless you're in a classroom setting. Therefore, you're going to get a signature from a programmer doing atypical work.

  24. What complete and utter bullshit. by MouseTheLuckyDog · · Score: 2

    95% of 250 coders. That means that out of a million programmers they will misidentify 200000.

    I suspect that there are too few variances in style to make any coder's style unique. For example, whether to use braces on a one-line statement after an if in C.

    With a few programmers it's likely to work, but when the possible source of programmers is the world...

    Not to mention emacs, Visual Studio and such enforcing some indentation standards and programming languages enforcing others.

    1. Re:What complete and utter bullshit. by Rinikusu · · Score: 1

      Okay, I just woke up from a nap, but could you show your math there? Maybe I'm missing something because I come up with.. 50k, not 200k...

      --
      If you were me, you'd be good lookin'. - six string samurai
    2. Re:What complete and utter bullshit. by Ksevio · · Score: 1

      I find the statistics dubious as well - they also dropped the dataset to nearly 1/10 while roughly doubling the code input and the results were 2% better, so it's possible if we follow the trend it will reach the 20% you seem to quote.

    3. Re:What complete and utter bullshit. by Kjella · · Score: 1

      What complete and utter bullshit.

      95% of 250 coders. That means that out of a million programmers they will misidentify 200000.

      You know it's not a contest to come up with the worst bullshit. If you're left with one person 95% of the time when you have 249 possible wrong answers, it's like being left with 4000 people when you have 999999 wrong answers. If all those are too close to tell apart you'll misidentify >99.9%.

      Imagine for example that you wanted to find people by height and weight, as measured to nearest cm and kilo. It might work decently on a small group, but if you scale it up to a million people there'll be a lot of duplicates and then you're just guessing, double the population and you halve the chance of being right.
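To put numbers on the two readings in this subthread, here is a small sketch. The first function is the naive extrapolation (a flat 5% error rate applied to a million programmers). The second is the bucket reading argued above, and it rests on an assumption that is not from the paper: that style can only tell apart roughly as many groups as the test had coders (250). The function names are made up for this example.

```python
def naive_misidentified(accuracy: float, population: int) -> int:
    """Flat extrapolation: the error rate stays the same at any scale."""
    return round((1 - accuracy) * population)


def bucket_model(distinguishable_styles: int, population: int):
    """Assume style only resolves a fixed number of buckets.

    Returns (people per bucket, chance of picking the right one by guessing
    within the bucket). This is an assumed model, not a measured result.
    """
    per_bucket = population / distinguishable_styles
    chance_right = 1 / per_bucket
    return per_bucket, chance_right


naive = naive_misidentified(0.95, 1_000_000)          # 50,000, not 200,000
per_bucket, chance = bucket_model(250, 1_000_000)     # 4,000 look-alikes each
_, chance_doubled = bucket_model(250, 2_000_000)      # half the chance
```

Under the bucket assumption, a million programmers leaves about 4,000 candidates per style, so a guess is right about 0.025% of the time, and doubling the population halves that again. Which of the two readings is closer to reality depends on how much discriminating power the features actually have, which the 250-coder result alone does not settle.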

      --
      Live today, because you never know what tomorrow brings
    4. Re:What complete and utter bullshit. by steelfood · · Score: 1

      It's 50,000.

      Or for the study, the 12 people who code exclusively in assembly.

      --
      "If a nation expects to be ignorant and free in a state of civilization, it expects what never was and never will be."
    5. Re:What complete and utter bullshit. by secret_squirrel_99 · · Score: 1

      yes, but when the possible set of coders is everyone in your class, and what you really want to see is if the same kid wrote 5 other students' submissions, this is perfect and is at least one of the obvious use cases.

      --
      If privacy had a tombstone it would read "We did it for your own good" . -- John Twelve Hawks
  25. Re:Next thing you know by Mordok-DestroyerOfWo · · Score: 1

    I hate following your rambling, Anonymous Coward. Sometimes you get extremely schizophrenic and contradict yourself!

    --
    "Never let your sense of morals prevent you from doing what is right" - Salvor Hardin
  26. So you could use this tool to make your code anon. by Maxo-Texas · · Score: 4, Interesting

    Write a version of pretty-printer that rerenders your code into a different style.

    Have a lexicon of mipelled words for each "personality".

    Another lexicon of variable names.
    a vs inta vs int_a vs x.

    Refactoring and unfactoring for subroutines.

    Run the comments through google translate and back to english.
    ukrainian
    japanese
    chinese

    Synonym and antonym substitution in the comments.

    The mind dances at the possibilities to mess with this algorithm.
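The variable-name lexicon idea can be sketched in a few lines. This is a deliberately crude version that assumes a plain whole-word text substitution is good enough; it will also rewrite matches inside strings and comments, and it knows nothing about scope. The function name `rename_identifiers` and the sample "personality" mapping are hypothetical.

```python
import re


def rename_identifiers(source: str, mapping: dict) -> str:
    """Replace whole-word identifiers according to a per-persona lexicon.

    Crude sketch: plain text substitution, no parsing, no scope handling.
    """
    if not mapping:
        return source
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(name) for name in mapping) + r")\b"
    )
    return pattern.sub(lambda m: mapping[m.group(1)], source)


# One hypothetical "personality": short names become verbose ones.
persona = {"a": "int_a", "i": "index"}
original = "for (i = 0; i < n; i++) { a = a + inta; }"
rewritten = rename_identifiers(original, persona)
```

Here `inta` is left alone because only whole words match. A real tool would want a proper parser for the target language, and, as suggested later in the thread, you would then run the stylometry tool on the result to check that it no longer points back at you.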

    --
    She was like chocolate when she drank... semi-sweet at first and then increasingly bitter.
  27. Hah. I write everything in Fortran.. by toonces33 · · Score: 2

    and then use F2C to convert it to C code before I check in.. Try analyzing that!

  28. Spotting GCJ cheating would be an interesting find by jasax · · Score: 1

    Ditto. They also could have researched if submissions in a given (same) GCJ identity have been (or had a high probability of being...) written by two or more different coders...

    The submissions' speed of top ranked coders seen in early stages of the GCJ contest always amazed me (compared, of course, with my turtle sluggishness...)

    ;-)

  29. Re:So you could use this tool to make your code an by toonces33 · · Score: 1

    I can just imagine how unreadable such code would end up being, as any comments would look like they were written by some sort of AI tool.

  30. Obfuscator? Or just translate A-B-A? by RandCraw · · Score: 1

    Of course you could anonymize source code using an obfuscator.

    But maybe the simpler way is to compile Java to bytecode, then decompile it back to Java. I suspect that's as effective as most obfuscators.

  31. Code beautifier by mrflash818 · · Score: 1

    Perhaps something like Artistic Style might help.

    http://astyle.sourceforge.net/

    --
    Uh, Linux geek since 1999.
  32. Easy Solution by marciot · · Score: 1

    Someone just needs to write a tool that takes source code and translates it into an obfuscated form that only the CPU can understand. Is anyone working on this type of privacy tool?

  33. Re:okay by lgw · · Score: 1

    Newfags can't triforce

    Slashdot supports too few entities to do this right, and forget about UTF8. But you can get sorta close.

      *
    * *

    Unless someone can do better?

    --
    Socialism: a lie told by totalitarians and believed by fools.
  34. Re:emacs by __aaclcg7560 · · Score: 1

    I had a Java instructor who informed the class that he talked to two students in private because their code was nearly identical except for one small detail: one used the x variable, the other used the y variable. The program was so simple that he couldn't flag the students for cheating.

  35. Pointless, but no doubt true by Kittenman · · Score: 2

    Wouldn't any programmer worth their salt identify themselves in the comments, or (if not) be logged as the last guy in that code on such-and-such a date, while working on such-and-such a patch number? (E.g. 'kittenman was here, 1/Jan/15, fixing Steve's crap').

    But I hope my code is easily recognizable. I'm proud of it. It may not be the smartest, slickest, quickest there is, but it's mine. And it works.

    --
    "The greatest lesson in life is to know that even fools are right sometimes" - Winston Churchill
    1. Re:Pointless, but no doubt true by Shados · · Score: 1

      People still use these stupid 90s style comments with authors and dates and shit? Really?

      Just use the source control system for that.

  36. harder to read if there is no consistency by Chirs · · Score: 1

    Generally speaking each project has a coding style that most code in the project adheres to, for the simple reason that it's easier to maintain when the code all looks more-or-less similar.

    If one area uses lowercase with underscores, and the other area uses CamelCase, and one area typedefs the heck out of everything while the other is explicit, then for someone coming in and trying to understand the code it makes it harder than necessary to figure out what's going on.

    So if you look at the linux kernel, or glibc, or firefox, or Chrome, or any other similarly large project, there will be some sort of coding style that applies. This is not to say that the style applies blindly. For example there are areas in the kernel where they basically imported a driver that is written in a different coding style. Since that driver is maintained out of the linux kernel tree and is largely self-contained, that was deemed to be acceptable. And even in that case, the driver used an internally-consistent coding style for all the files involved.

    1. Re:harder to read if there is no consistency by ChunderDownunder · · Score: 1

      Coding standard adoption can provoke holy wars but at the end of the day, you're a team. Though idiosyncratic decisions irk me, such as prefixing instance variables with underscore. Any decent editor will make such a distinction between scope via colours.

      Pretty printing tools and style checkers present in any decent editor will enforce coding standards with minimal fuss.

  37. There's an easier way... by senedane · · Score: 1

    I just use 'git blame' to figure out who to yell at....

  38. will they show the method? by ihtoit · · Score: 1

    I doubt it. Therefore, this is about as reliable as graphology (handwriting analysis).

    If you take two programmers who code to book standard, how do you tell the difference between them using the same strict problem?

    --
    Political debates have me rolling my eyes so much I think I got optical whiplash. I should sue. - Foamy The Squirrel
  39. Here's a great idea... by Lodragandraoidh · · Score: 1

    You can have/use this idea for free:

    Before a system will build said code, have the build system verify the code not only by the public key/code hash, but as a secondary method - the code fingerprint of the author in question.

    This turns a creepy idea into something worthwhile.

    --

    Lodragan Draoidh
    The more you explain it, the more I don't understand it. - Mark Twain
  40. Re:Hah. I write everything in Fortran.. by rubycodez · · Score: 1

    That's one way to make your ForTran run slower

  41. Re:Next thing you know by Anonymous Coward · · Score: 1

    Ever since that corpus callosotomy, I try to remember to type in nice things with my left hand but then my right hand logs in and mods it down...

  42. Fun fact, everything can be used to track you by Anonymous Coward · · Score: 1

    Case in point, I am a guitar player, and so was my college roommate. We didn't necessarily play together much, but we both heard each other play a lot, over the course of years.

    I'd be able to place his playing anywhere.

    For that matter, we used to have a game where we'd try to stump each other by playing clips of guitar players and guessing who they were. This was often improvisational jamming, very obscure recordings from established artists. We usually had to go through 3-4 rounds before someone would get one wrong.

    This isn't really much different than handwriting, speech patterns, writing patterns...

  43. Re:emacs by ChunderDownunder · · Score: 1

    I once marked CS homework and uncovered cheating for an 'individual' assignment.

    A group of students had debug comments in their code - the giveaway? spelling mistakes.

  44. Re:So you could use this tool to make your code an by physicsphairy · · Score: 1

    "Hey, you notice some odd grammar, word choice, and spelling variance in this code?"
    "Oh yeah, must be Maxo-Texas. That's his anonymization software."

  45. Re:So you could use this tool to make your code an by steelfood · · Score: 1

    If you did this every time, you'd be identified as the guy who runs his code through Google Translate prior to release.

    Non-normal behavior is the easiest to single out. In order to avoid detection, you basically have to become noise. And if you're the only one, then even that is a pattern.

    Sure, you could run some things through Google Translate and leave some things alone, but that'd be the equivalent of having two online personas.

    --
    "If a nation expects to be ignorant and free in a state of civilization, it expects what never was and never will be."
  46. Re:So you could use this tool to make your code an by Maxo-Texas · · Score: 1

    Absolutely- if you were the only one using the tool.

    --
    She was like chocolate when she drank... semi-sweet at first and then increasingly bitter.
  47. this perltidy sure gets around by wardk · · Score: 1

    seems to be a very prolific coder

  48. Truecrypt? by omnichad · · Score: 1

    Time to run this against the 7.2 version of Truecrypt.

  49. Re:So you could use this tool to make your code an by Maxo-Texas · · Score: 1

    aye!

    If everyone used it then we'd all be spartacus.

    What I was implying also in my parent post was using the tool the article is about to confirm your code had reached the ambiguous level.

    --
    She was like chocolate when she drank... semi-sweet at first and then increasingly bitter.
  50. Re:So you could use this tool to make your code an by Maxo-Texas · · Score: 1

    That's a good point. I also mentioned arbitrarily factoring and refactoring subroutines and I did not state clearly enough that i was suggesting using the tool mentioned in the article to confirm your code was giving a false result.

    --
    She was like chocolate when she drank... semi-sweet at first and then increasingly bitter.
  51. Re:Hah. I write everything in Fortran.. by rubycodez · · Score: 1

    Wrong, the machine code emitted by one of the industry heavyweight Fortran compilers will kick the ass out of C's

  52. Good Example of Reverse Chronological Chauvinism by Doctrinsograce · · Score: 1

    Duh. They are, like, just seeing this today? We knew this back in the seventies... and I am sure that earlier programmers knew it too.

  53. Fairly low tech by ebvwfbw · · Score: 1

    Used to be able to tell which student's code I was looking at towards the end of a semester, in the 1980s. No need to look at who submitted it. From time to time I'd find one student's work turned in by someone else. That would result in an inquiry and usually an action against that student. Ye old dumpster dive.

    Years later I would do code reviews. Hardly any time I could tell you who wrote it. Even if they had departed the company. Certain people do certain things predictably.