Slashdot Mirror


Coding Styles Survive Binary Compilation, Could Lead Investigators Back To Programmers (princeton.edu)

An anonymous reader writes: Researchers have created an algorithm that can accurately detect code written by different programmers (PDF), even if the code has been compiled into an executable binary. Because of open source coding repositories like GitHub, state agencies can build a database of all developers and their coding styles, and then easily compare the coding style used in "anti-establishment" software to detect the culprit. Despite all the privacy implications this research may have, the algorithm can also be used by security researchers to track down malware authors. We also discussed an earlier phase of this research.

164 comments

  1. Frist! by Anonymous Coward · · Score: 2, Insightful

    Going to be lots of false positives on this one.

    1. Re:Frist! by zr · · Score: 1

      if true, even if not definitive would provide useful leads.

    2. Re:Frist! by LifesABeach · · Score: 1

      Brilliant! These are the same people that cannot find a bad person unless they can stealthily break into cell tower transmissions and social networking sites, and slam one with lethal doses of X-rays.

    3. Re:Frist! by zr · · Score: 1

      quite a jump there, from analyzing code to irradiating people..

    4. Re:Frist! by myowntrueself · · Score: 1

      quite a jump there, from analyzing code to irradiating people..

      Not really...

      http://hackaday.com/2015/10/26...

      --
      In the free world the media isn't government run; the government is media run.
    5. Re:Frist! by zr · · Score: 1

      apples and oranges. here he or she was arguing a "slippery slope" deal.

      your example is perfectly valid but is about something else.

    6. Re:Frist! by cshark · · Score: 1

      Going to be lots of false positives on this one.

      My thoughts exactly.
      Morons.

      --

      This signature has Super Cow Powers

    7. Re:Frist! by Aighearach · · Score: 2

      lol yeah. The blathering about governments is just somebody getting silly and running their mouth about stupid shit. Newsflash, if a person sounds like a conspiracy theorist? They're probably not a good data source.

      This is great technology for figuring out which one of 5 people wrote a particular method/function. And I have no doubt that governments will use this technology to mislead juries into believing it is like a fingerprint, by using the word "fingerprint" nearby the name of their test in sentences, but they'll only be using it to reinforce whatever evidence they used to find the person to accuse in the first place.

      This would be more likely to have real-world impact in the hands of a large corporation's recruiting department. You don't necessarily want to hire away all your competitor's team, just the best few people. With this, you might be able to tease out who wrote which parts of their product; especially if you also have code samples from the 20% that applied with both companies, or from FLOSS code.

    8. Re:Frist! by Joce640k · · Score: 1

      Going to be lots of false positives on this one.

      Doesn't matter.

      It's a bit like blood-group matching. It can't prove you're the guilty person, but a negative match can certainly prove you aren't.

      --
      No sig today...
    9. Re:Frist! by ShanghaiBill · · Score: 4, Interesting

      False positives are not a problem if you deal with them rationally. If a woman is murdered, and the DNA matches one in a million, then in a country of 300 million, there will be 300 matches, and 299 false positives. But if only one lives in the same city, and it happens to be her ex-boyfriend, then the DNA match is useful information.
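
The arithmetic in this comment can be checked with a short sketch. The population and match rate are the comment's own figures; the assumption that exactly one matching person is the true source is added here for illustration.

```python
# Figures from the comment above; `true_sources = 1` is an assumption.
population = 300_000_000
one_in = 1_000_000          # a random person matches with probability 1 / one_in
true_sources = 1

expected_matches = population // one_in
false_positives = expected_matches - true_sources

# Chance that a randomly picked match is the true source, before any
# other evidence (city, relationship to the victim) is taken into account.
prior_per_match = true_sources / expected_matches

print(expected_matches, false_positives, round(prior_per_match, 4))
```

On its own a match is weak evidence (about 1 in 300 here); it only becomes useful once other evidence has narrowed the candidate pool, which is the point the comment makes.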

    10. Re:Frist! by tattood · · Score: 2

      Going to be lots of false positives on this one.

      So I can easily avoid this trap by never hosting any code on Github?

      --
      WTB [sig], PST!!!
    11. Re: Frist! by Anonymous Coward · · Score: 1

      You know it is bullshit because they did not use it to name the Truecrypt devs.

    12. Re: Frist! by Anonymous Coward · · Score: 0

      Unless she was sleeping with her ex and it was her current BF that murdered her! Dun dun da! Tune in next week to see how it ends.

    13. Re:Frist! by dsmatthews9379 · · Score: 1

      Yup, Neil De Neural-network is going to be in so much bother now, and all he was trying to do was help coders optimise their puny human code.

    14. Re: Frist! by Anonymous Coward · · Score: 0

      All right. This government and establishment are really starting to piss me off. This article is obviously a threat. At a time we have an oppressive patent system and now TPP, and constant, pervasive unconstitutional surveillance, it is obvious we have a serious problem. What are we going to do about it?

    15. Re:Frist! by TemporalBeing · · Score: 1

      False positives are not a problem if you deal with them rationally. If a woman is murdered, and the DNA matches one in a million, then in a country of 300 million, there will be 300 matches, and 299 false positives. But if only one lives in the same city, and it happens to be her ex-boyfriend, then the DNA match is useful information.

      Except in this case that does not work since locality doesn't matter for the Internet or software. Also many good groups establish various coding standards so many authors will now become one; some individuality may survive based on logic structure, but that would get mitigated quickly by group reviews and code updates in response.

      --
      Truth is like the sun. You can shut it out for a time, but it ain't goin' away. - Elvis Presley (source: imdb.com)
    16. Re:Frist! by gweihir · · Score: 1

      Actually, it would not. Large false positive probabilities quickly drive a detection method to negative worth, because they waste resources that could have been spent better. Well known to experts, but something non-experts routinely do not comprehend.

      --
      Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
    17. Re:Frist! by gweihir · · Score: 1

      Actually, as Paris showed (and then showed again), they cannot do so having all those capabilities and knowing who these people were beforehand. I think they simply cannot do it in the first place, no matter what outrageous capabilities these cretins will be given next.

      --
      Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
    18. Re:Frist! by gweihir · · Score: 1

      False positives are not a problem if you deal with them rationally. If a woman is murdered, and the DNA matches one in a million, then in a country of 300 million, there will be 300 matches, and 299 false positives. But if only one lives in the same city, and it happens to be her ex-boyfriend, then the DNA match is useful information.

      Actually, it is not. You wasted the resources to do 300 million DNA tests, when simply looking for the ex-boyfriend would have helped you to narrow it down. With those 300 million DNA tests, you would have spent, say, the effort for 10'000 of them for locating and questioning the ex-boyfriend and administering a DNA test just to him. Hence you come out the effort for about 299.99 million DNA tests short and you still have to investigate the ex-boyfriend. That wasted effort is going to have massive negative effects elsewhere.

      That is what trawling large databases with large numbers of false positives in the results does: it wastes incredible amounts of resources for no gain. That is also why mass surveillance makes us all very much less safe. If the French had done old-fashioned police work on the hints they had, the last two terrorist attacks would not have happened. Instead they let themselves be blinded by the masses of data. (That is, if they were not intentionally looking the other way.)
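
The effort comparison in this comment works out as follows. Both figures are the comment's own rough estimates, and "test-equivalents" is only its informal unit of effort, not a measured cost.

```python
# Rough figures from the comment, expressed in "DNA test equivalents".
mass_screening_effort = 300_000_000   # one test per resident
targeted_effort = 10_000              # locating, questioning and testing one suspect

wasted_effort = mass_screening_effort - targeted_effort
print(f"{wasted_effort / 1_000_000:.2f} million test-equivalents wasted")
```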

      --
      Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
    19. Re:Frist! by vlad30 · · Score: 1

      Like coders don't copy and paste code from various sources

      --
      You're all thinking it, I just said it for you
    20. Re:Frist! by RockDoctor · · Score: 1

      ... or by sharing your GitHub log-in details with a half-dozen other people so that each one of you dilutes and poisons the "fingerprint" of the other 5.

      --
      Birds are not dinosaur descendants;birds are dinosaurs, for all useful meanings of "birds", "are" and "dinosaurs"
  2. I doubt this by Anonymous Coward · · Score: 1

    This must have a lot of false positives. I'd be surprised if this works at all, but sure it would sell some product and get a few grants.

    1. Re:I doubt this by Anonymous Coward · · Score: 1

      Yes. We've gone through this a billion times already. Every couple of years someone decides they can tell who a programmer is through the binaries, except they forget how much code is little more than snippets of other code or the like.

    2. Re:I doubt this by Lab+Rat+Jason · · Score: 4, Interesting

      This is why I steal most of my functional code from GitHub in the first place...

      --OR--

      Easy to avoid detection by simply NOT UPLOADING code to GitHub in the first place. The assumption that every dev does this is stupid.

      --
      Which has more power: the hammer, or the anvil?
    3. Re:I doubt this by Anonymous Coward · · Score: 3, Interesting

      As a systems admin I have been called upon at times to automate a few things. Doing so in my situation seemed easiest with vb.net. (Yes I know the actual programmers here are recoiling with horror. Stuff it, the program works, and saves me vast amounts of time)

      In that program somewhere around 5% are lines that I actually coded. Everything else is snippets of code from Microsoft's help files, question/answer sites, and similar opensource programs found online. Unless they are checking for things like the fact that I included no error handling (since I am the only one that uses said program) I fail to see how this would work at all.

      The variable naming conventions are many and varied, there are almost no unique lines of code in the program, and it uses only standard libraries. On top of that I don't have any other programs accredited to me floating around the interwebs, at least in vb.net, so there is nothing to compare it to.

      So how exactly are they going to tell you who wrote that monstrosity? In fact it sounds like their algorithm depends upon the author having other accredited works available out there. So if you don't put anything up on public code sharing sites, you have nothing to worry about.

    4. Re:I doubt this by unrtst · · Score: 2

      In that program somewhere around 5% are lines that I actually coded. Everything else is snippets of code from Microsoft's help files, question/answer sites, and similar opensource programs found online. Unless they are checking for things like the fact that I included no error handling (since I am the only one that uses said program) I fail to see how this would work at all.

      I strongly suspect this is precisely the kind of code that they will most easily be able to associate to individuals.
      Through their deep analysis of public code, I would strongly suspect that they have cached those segments, like any good search engine or data analysis would do. As such, they can diff and cut out any code that has been duplicated from elsewhere (just as they could with raw source code). Anything modified by you would remain.
      Because your coding style is, admittedly, quite different from that in the snippets, it will stand out as if it were glowing.

      That said, if you were a more competent** programmer, then it'd be more difficult to distinguish your code from everything else as it'd all follow best practices.
      ** or some other appropriate word that means you code in similar style to well established design patterns
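
The "diff and cut out duplicated code" idea in this comment could look roughly like the sketch below. It works on source lines for simplicity, whereas the research in the story works on binaries; the function names and sample data are hypothetical.

```python
def normalize(line):
    """Collapse whitespace so formatting differences do not hide a copy."""
    return " ".join(line.split())

def residual_lines(program, known_snippets):
    """Return the lines of `program` that appear in none of the known snippets."""
    known = {
        normalize(line)
        for snippet in known_snippets
        for line in snippet.splitlines()
        if line.strip()
    }
    return [
        line for line in program.splitlines()
        if line.strip() and normalize(line) not in known
    ]

# Hypothetical data: one cached public snippet and a program that reuses it.
public_snippet = "f = open(path)\ndata = f.read()\nf.close()"
program = (
    "path = 'log.txt'\n"
    "f = open(path)\n"
    "data   =   f.read()\n"
    "f.close()\n"
    "print(len(data))"
)

own_lines = residual_lines(program, [public_snippet])
print(own_lines)
```

Whatever survives the filter is the part an analyst would attribute to the person who assembled the program. A pasted snippet that has been refactored would slip through an exact-match filter like this one.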

    5. Re:I doubt this by Gr8Apes · · Score: 2, Interesting

      As such, they can diff and cut out any code that has been duplicated from elsewhere (just as they could with raw source code). Anything modified by you would remain. Because your coding style is, admittedly, quite different from that in the snippets, it will stand out as if it were glowing.

      The funniest thing about this is how wrong that statement is. I can take myself as an example, I've worked in multiple shops, several with different code formatting practices, not to mention potentially different languages. I generally configure my IDE to whatever code formatting requirements there are, so everything I add gets put into the current format. Naming practices are whatever is in the current codebase. So, essentially, from a source and binary perspective, my code will look like whatever the current code base should. Snippets cut from anywhere will always be refactored to fit my needs, and thus may not look at all like what was snipped.

      In short, this is a whole barrel of snake oil for anyone actually working professionally and not that rare lone wolf that only codes their own specific way.

      --
      The cesspool just got a check and balance.
    6. Re:I doubt this by Anonymous Coward · · Score: 0

      Variable naming conventions don't survive compilation (no, not even in MSIL), so that isn't going to even factor into this algorithm.

      The 95% MSDN/Stack*/CodePlex stuff that they already know the author of will be discarded.

      And your 5% remaining code will be your developer fingerprint, in a database, with no name associated to it until you slip up some time in the indefinite future. And then they've... added your name to the database. (Assuming no false positives.) This falls squarely into DILLIGAF category.

      As a professional software developer, I still don't care. And I have binaries that could be retrieved from public websites and reverse engineered and/or analyzed. (I don't bother with obfuscation, since all of my work is work-for-hire LOB software anyway.)

    7. Re:I doubt this by Anonymous Coward · · Score: 0

      AC you are responding to here.

      Hmm interesting thought. Although I have to question if they would be able to name the programmer, or just label it as a noob's program. With no other examples to compare it to that are attributable to me, how would they be able to name me?

      It's like a handwriting analysis with no known samples to compare it to.

    8. Re:I doubt this by jeffmflanagan · · Score: 1

      I'd think the only time this would work is if a programmer was a contributor to open source projects, then went bad and started writing software designed to commit crimes. Anyone starting as a criminal developer would never have uploaded their code to Github.

    9. Re:I doubt this by cshark · · Score: 1

      Yes. We've gone through this a billion times already. Every couple of years someone decides they can tell who a programmer is through the binaries, except they forget how much code is little more than snippets of other code or the like.

      Even so. All of this assumed that the blackhat coder is sharing his code or contributing to the open source community to begin with.
      Github and repositories like it are not a panacea of coding styles from every coder on earth. The total number of people that contribute is actually very small when you consider the size and scope of the overall community. Furthermore, I've intentionally changed my coding style a dozen times.

      I would challenge these Princeton researchers to make heads or tails of me, honestly.
      I don't think they could do it.

      Not with the technique outlined in this whitepaper.

      --

      This signature has Super Cow Powers

    10. Re:I doubt this by Anonymous Coward · · Score: 1

      Because they compile a list of sites you went to, find 94% of the code you used came directly from a collection of 7 or 8 of them, all visited within the last two months.

      Why do people keep thinking their identity is not permanent?

    11. Re: I doubt this by Anonymous Coward · · Score: 0

      They will survive unless you strip the binaries and remove all debug info.

    12. Re:I doubt this by Zardus · · Score: 1

      I'm speculating, but this could probably be applied without the need of a large github corpus. If you have some set of malware that you know was written by a specific person/group, you could check other pieces of malware to see if the same people wrote them. That'd probably be useful to *somebody*.

      --
      You can mod your friends, you can mod your nose, but you can't mod your friend's nose.
    13. Re:I doubt this by Aighearach · · Score: 1

      I'd be surprised if this works at all

      It is just like a polygraph machine. It works, it works well, it works under known conditions, and it produces known results. Of course, the things it does are not the things described by the words people use near it, nor are most of the actions it is used to support actually supported by the function of the machine. And yet, the machine is not malfunctioning.

      This is not an investigative tool for law enforcement. It is a useful tool for certain business researchers, and it may prove useful to historians of computing in the future. There are probably other reasonable uses, too.

    14. Re:I doubt this by Aighearach · · Score: 1

      Right, except, you don't make a case that supports your conclusion. You make a good case that the use the idiot in TFS describes is not a good use. But that doesn't harm the capabilities of the technology at all. Your conclusion that it is snake oil mistakes the location and nature of the mistake.

      From what you said, it sounds like each of your past employers could use it to tell if you were the one who wrote a particular function or method, based on the specific ways that you implemented their stated coding standards. Once you find a better use case, then you can realize that the size of detectable signal is very reasonable for the technology to work. There only needs to be subtle differences between team members to tell which one wrote a particular thing. Whereas, if you don't already have a small sample size then it reaches the problems you describe. But not all use cases reach those problems.

    15. Re:I doubt this by spire3661 · · Score: 1

      If you aren't uploading to GitHub, you are an Alchemist, not a Scientist.

      --
      Good-bye
    16. Re:I doubt this by Anonymous Coward · · Score: 0

      If nothing else it would be interesting to compare known malware with the code from anti-virus vendors to see how big part they take in creating their own market.

    17. Re: I doubt this by Anonymous Coward · · Score: 0

      If he's using a proxy they have jack shit.

    18. Re:I doubt this by 0100010001010011 · · Score: 1

      I don't even write my own code.

      Simulink writes it for me. Are they also able to back out Simulink model styles as well?

    19. Re:I doubt this by KGIII · · Score: 1

      I am not saying that this product is effective but your post reminds me of something I've been meaning to ask...

      Why is it you point to things where the program/application/function will not work? Yeah, err... We know (or they know) that it will not work everywhere and in every point in time. Strangely enough, a good tradesman has a variety of tools in their toolbox.

      What you're saying is that a hammer is not very good for doing window glazing. While that's certainly true, it's hardly salient.

      --
      "So long and thanks for all the fish."
    20. Re:I doubt this by The+Snowman · · Score: 1

      This must have a lot of false positives.

      True, especially for projects where the maintainers care about style and ensure code in pull requests conforms to project guidelines. Note: this is not about formatting, where to put braces, etc. which is information lost during compilation. I am talking about naming (which may be preserved in debugging symbols), code structures, etc. which may be partially or fully preserved.

      I'd be surprised if this works at all, but sure it would sell some product and get a few grants.

      Me too. Mostly because compiled code is likely optimized, rearranged, and information is lost during compilation anyway. Five people could write the same block of code slightly differently, and a compiler could compile it to the same machine-/byte-/whatever-code. How do you tell which of the five wrote it? Most likely, you do not.
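
A small experiment illustrates the point about information lost in compilation. It uses CPython bytecode as a stand-in for native machine code, which is an assumption of this sketch and not what the researchers analyzed. The two "authors" and their function are hypothetical.

```python
# Two hypothetical authors write the same function with different
# spacing, indentation and local variable names.
src_a = """
def area(width, height):
    scaled = width * height
    return scaled + 1
"""

src_b = """
def area( w,h ):
        s=w*h
        return s+1
"""

def compile_function(source):
    """Execute the source and return the code object of its `area` function."""
    namespace = {}
    exec(source, namespace)
    return namespace["area"].__code__

code_a = compile_function(src_a)
code_b = compile_function(src_b)

# The instruction stream is identical: formatting leaves no trace in it.
same_instructions = code_a.co_code == code_b.co_code
# Local names are kept as metadata beside the instructions, much like
# debug symbols in a native binary that has not been stripped.
same_local_names = code_a.co_varnames == code_b.co_varnames

print(same_instructions, same_local_names)
```

Whitespace and brace-style choices vanish, while names survive only as strippable metadata. Structural choices (loops versus recursion, how work is split into functions) are a different matter and can still shape the compiled output.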

      --
      24 beers in a case, 24 hours in a day. Coincidence? I think not!
    21. Re: I doubt this by KGIII · · Score: 1

      You forget everything from traffic shaping determinisms, timing, ownership of the proxy, browser fingerprinting, cookies, XSS may give up information, and powers that can visualize a goodly portion of the entire 'net and use deterministic algorithms to narrow down the search.

      However unrealistic those things are, or realistic - depending on one's opinion, they are simply a few things you neglected to mention so we'll assume you've not taken steps to mitigate those risks. Many of them can be avoided but true anonymity is damned difficult today. I understand that, for instance, you can remain reasonably certain that you're secure if you remain on the Tor network and don't browse into clearnet. I've heard good things about I2P but not tried it and don't know enough to opine.

      Mostly, I just read what the guys/gals here have to say and ask questions and then do some reading on my own. I used to be, shall we say, quite interested in security for a variety of reasons but those reasons have passed by and I am not really concerned as much. Thus, I no longer keep up on the latest and greatest news. It does(n't) help that I've been retired for 8 years now.

      At any rate, if you think the use of a proxy is adequate then you're sorely mistaken. It can be *a* step in the right direction but that sure as hell had better not be your only line of defense. Even if they aren't logging, they have an upstream bandwidth provider...

      --
      "So long and thanks for all the fish."
    22. Re: I doubt this by Anonymous Coward · · Score: 0

      .Net debug info is kept in a separate PDB file (a debug symbol database file) that is generated with each assembly at compile time. The assembly itself contains little-to-no debug info unless you enable TRACE in the compiler settings. And TRACE is off by default in Visual Studio. The only things that are kept in the MSIL are namespaces, class names (regardless of protection level), and public or protected member names. Everything else is scrubbed because it will never need to show up in a stack trace, including private and/or internal member variable names and immediate-use variable names.

      If you don't deploy the PDB alongside the DLL/EXE, no one will ever get useful debug info from the DLL/EXE, much less a decently accurate "fingerprint" for the programmer(s) that wrote it.

    23. Re:I doubt this by Actually,+I+do+RTFA · · Score: 1

      Alternatively, I'm a professional.

      --
      Your ad here. Ask me how!
    24. Re:I doubt this by spire3661 · · Score: 1

      Still an Alchemist, just one with a Patron..

      --
      Good-bye
    25. Re: I doubt this by castionsosa · · Score: 1

      You can also use an IL code obfuscator, one was bundled with VS.NET for a while. Of course, it won't completely stop someone from finding a signature, but it will raise the bar tremendously, so that it would only be used against a very high value target. Another advantage is that sometimes the obfuscator actually does some slight optimizing work as well.

    26. Re:I doubt this by SuricouRaven · · Score: 1

      Most alchemists had patrons. That's how they funded their alchemy.

      The deal usually involved the alchemist agreeing to make gold if the patron provided the workshop, living expenses and money for essential reagents. The alchemist would spend a bit of time doing their alchemy thing, then disappear mysteriously when the patron started to get impatient about the lack of gold production.

    27. Re:I doubt this by TemporalBeing · · Score: 1

      If you aren't uploading to GitHub, you are an Alchemist, not a Scientist.

      I'm neither an Alchemist nor a Scientist in writing my code. I'm more of an Engineer. An Alchemist does something repetitively and happens upon the same results more by chance than anything else.
      A Scientist does better by making it more predictable, but there is little real structure or design to the work, leading to a lot of errors and lots of rework.
      An Engineer designs, architects, and makes reproducible work with low errors with little rework.

      --
      Truth is like the sun. You can shut it out for a time, but it ain't goin' away. - Elvis Presley (source: imdb.com)
    28. Re:I doubt this by Actually,+I+do+RTFA · · Score: 1

      If by that you mean I turn bytes into dollars, I suppose. I don't get why you think developing proprietary software is that different from OSS, except for the guarantees to the customers, deadlines, QA process, support to do it right.

      I don't use GitHub because I don't contribute to OSS.

      --
      Your ad here. Ask me how!
  3. Privacy implications? by Registered+Coward+v2 · · Score: 3, Insightful

    People have been analyzing writing styles for a long time to try to identify authors. Expecting your coding style to be obfuscated by compiling it has proven to be as wrong as thinking your identity is shielded if you publish under a pseudonym. If you make your code publicly available you really shouldn't have any expectation of privacy.

    --
    I'm a consultant - I convert gibberish into cash-flow.
    1. Re:Privacy implications? by Anonymous Coward · · Score: 0

      Hint: you didn't know this before reading the summary, so don't act douchey and superior about it.

    2. Re:Privacy implications? by Anonymous Coward · · Score: 4, Insightful

      I doubt it. Your code once compiled will be very similar to most other similarly skilled programmers in that language, unless you go out of your way to obfuscate things - i.e. a poor coder. Compilers, libraries, APIs, language versions and proprietary extensions are beyond your coding style. This entire premise assumes there will be no false-positives, which will be the vast majority of hits. So basically, they're casting nets, and claiming success when they get one, ignoring the other thousand. Once you're at the binary, coding style has all but gone (assuming you're not doing assembler, which even then, will come down to the same few solutions to a given functional requirement).

    3. Re:Privacy implications? by Anonymous Coward · · Score: 1

      Expecting your coding style to be obfuscated by compiling it has proven to be as wrong as thinking your identity is shielded if you publish under a pseudonym.

      I was going to take the time to format my reply just like an APK post to illustrate a point, but I'm too lazy this morning. Point being, if you simply copy 90% of what you publish, then any fingerprinting is going to most likely end up pointing back to the original author.
      While most people don't do that as a matter of habit when posting comments or writing, the wholesale re-use of code is extremely common, especially in open source projects. In fact, that's kind of the point. So once you compile to binary, this method is far more likely to "finger" the person who originally wrote the code, not the person who made some modifications and compiled it to binary.

    4. Re:Privacy implications? by NotInHere · · Score: 1

      Everybody knows that whitespace is translated to nop's you insensitive clod!

    5. Re:Privacy implications? by malditaenvidia · · Score: 1

      The problem would be proving the code in question was written by someone, due to "coding styles". That sounds legally sketchy as all hell.

    6. Re:Privacy implications? by Registered+Coward+v2 · · Score: 1

      The problem would be proving the code in question was written by someone, due to "coding styles". That sounds legally sketchy as all hell.

      I agree; but there is a difference between what would hold up in court and using the results to identify who may have written the code and using that to narrow the scope of an investigation or even to prove original authorship. Based on the article, it seems to work on very specific snippets of code written to perform a specific task; whether it would be useful to analyze large swaths of code is another question altogether, especially since such code is likely to have a number of contributors as well as standard routines incorporated into the code.

      --
      I'm a consultant - I convert gibberish into cash-flow.
    7. Re:Privacy implications? by Anonymous Coward · · Score: 0

      That sounds legally sketchy as all hell.

      That hasn't stopped any government lately, especially the US government. Stingrays, parallel construction, unleashing exploits on Tor, all of that is "legally sketchy" but they're still happy to convict people that way. The government is a beast that's out of control and answers to no one, welcome to the police state.

    8. Re:Privacy implications? by KGIII · · Score: 1

      Heh... I have a fairly distinct writing style and I'm aware of it. Even when I post AC, it's usually pretty damned obvious who it is. Of course, I usually make it a point to indicate who I am as I'm, ultimately, responsible for what I say - in my opinion. Right or wrong, I wrote it, I own it.

      I imagine that my code, it's not so very good, would stand out like a sore thumb when compared with other samples. For starters, I never knew best practices nor did I take much in the way of programming courses. I learned to program because I had to. I didn't even really much like computers at the time. (I've explained this before, I thought they were a waste of time in their current configuration - but they were good for doing math.)

      It wasn't until I ended up hiring professional programmers (they rewrote the entire code body eventually) that I actually learned from them. I learned some of the best practices, and I learned some more of the theory, and I learned to clean up my code a bit - and a bunch of other things far too long to mention. Unlike most, I was comfortable admitting that they were better than I - that'd be why I hired them. I'd be remiss in my duty to not have taken the chance to learn from them. Without them, I'd be still employed and doing useful things all day instead of retired and bugging you guys with endless novellas and questions.

      So, probably towards the end of it - I'd have coded in a slightly different style but my comments might indicate it was me. I'd have mimicked the styles that I'd learned from them but I'm sure there would be traces of my own bad practices left in there.

      Like a Slashdot post sent as an AC, it's probably not too hard to figure out who I am.

      --
      "So long and thanks for all the fish."
  4. Heck by vikingpower · · Score: 3, Funny

    gotta change my indentation style and public void( String s1 ) whitespace habit, now the guvnmunt automagically can also get these out of binaries built from my code. O gawd, I'm afraid now.

    --
    Religious speak to God. Insane are spoken to by God. When all shut up, one can finally hear Shostakovich in peace
    1. Re:Heck by prefec2 · · Score: 1

      This is only a part of your coding style. In most cases the part where the braces go and the indentation are part of the company or language code style. And it should be identical for everyone in your organisation. They can be enforced by checkstyle (in Java). The style also includes the use of any kind of design pattern. For example, do you implement factories in the same class or in a separate class? Do you use sub interfaces in Java?

    2. Re:Heck by Volanin · · Score: 1

      I feel you. This is my whitespace style as well; it gets so much trashing from my colleagues... And now they come to take away its freedom; that in binary, every whitespace is ignored equally. It's a sad day.

      --
      If I clone myself, can I call it a thread?
      If a girl winks to us, can I call it a race condition?
    3. Re:Heck by Anonymous Coward · · Score: 1

      Those are not the things that matter. One example of things that do matter: loop structure (initialization, stopping condition), e.g. whether you start from 0 and count up or start from X and count down, and whether you have a separate iteration variable.

      Plenty of other higher-order things like function overloading, object inheritance, recursive calls, etc. can be analyzed, and a probability matrix can be created of who is the most likely author from a given list.
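
      As a rough illustration of the loop-structure point, here is a minimal Python sketch. It uses CPython bytecode as a stand-in for a native binary, which is an assumption of this example and not what the paper analyses; the function names are made up. Two loops that compute the same sum, one counting up and one counting down with a separate index variable, can be compared by the instructions the compiler emits for them:

```python
import dis


def sum_up(values):
    """Count up from 0 using a for loop over range()."""
    total = 0
    for i in range(len(values)):
        total += values[i]
    return total


def sum_down(values):
    """Count down from the end using a separate index variable."""
    total = 0
    i = len(values) - 1
    while i >= 0:
        total += values[i]
        i -= 1
    return total


def opcode_names(func):
    """Return the list of opcode names CPython emitted for func."""
    return [ins.opname for ins in dis.get_instructions(func)]


if __name__ == "__main__":
    data = [3, 1, 4, 1, 5]
    # Same observable behaviour...
    print(sum_up(data) == sum_down(data))
    # ...but the compiled instruction sequences can be compared directly.
    print(opcode_names(sum_up) == opcode_names(sum_down))
```

      An optimizing native compiler may well erase more of this difference than CPython does, so treat it only as a toy demonstration of the idea.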

    4. Re:Heck by JaredOfEuropa · · Score: 3, Interesting

      Ideally there is such an enforced coding standard, but I have worked in situations with merged teams or projects where coding styles were rather mixed. From what I could see, cosmetic stuff like braces and indentations caused some annoyance but it didn't really lead to much lost coding time, increased effort in fixing or changing things, or an increase in bugs.

      Anyway, brace placement won't survive compilation so this method is useless for rooting out the K&R traitors.

      --
      If construction was anything like programming, an incorrectly fitted lock would bring down the entire building...
    5. Re:Heck by hattig · · Score: 1

      Yeah, they're doing Design Pattern Analysis (or similar) alongside analysing binary metadata; how the software behaves acts as a fingerprint of the coder who wrote it. Strip everything from those naughty binary executables, people!

      Ultimately, I somehow doubt that in a small application the binary can actually distinguish the author that precisely. There are only so many coding behaviour styles and patterns.

    6. Re:Heck by Anonymous Coward · · Score: 0

      Of course. Was just kidding. Something that I do a lot in Java: making everything final that can be made final (in classes / methods where it matters). Some more things that could be recognized as definitely my style: try-catch-finally inside a catch; no anonymous inner classes, ever; in Java 8, in certain projects: use of functional interfaces combined with lambdas; a certain predilection for static{ } blocks. And yes, design patterns. Every developer has certain patterns he / she likes, in my case: Bridge, Command, Flyweight. I am sure one could "fingerprint" me with this. A developer whose style I like combines the singleton pattern with enums, I'd recognize his code at 1200 metres distance....

    7. Re:Heck by Anonymous Coward · · Score: 0

      Brace placement affects variable scope, which affects initialization.

      Choosing to initialize variables inside of that scope, as opposed to say, everything at the top of the method, can help identify you as a programmer.

    8. Re:Heck by rubycodez · · Score: 1

      I'm sure we'll figure out what "style" even makes it into a binary. Whom do they think they're messing with here?

    9. Re:Heck by Anonymous Coward · · Score: 0

      gotta change my indentation style and public void( String s1 ) whitespace habit, now the guvnmunt automagically can also get these out of binaries built from my code. O gawd, I'm afraid now.

      Gubmint: "Your binaries please..."

    10. Re: Heck by Anonymous Coward · · Score: 0

      In the source code yes, but we are talking about binaries. There is no guarantee that the compiler would differentiate between a global and a local variable in the produced binary.

    11. Re:Heck by Anonymous Coward · · Score: 0

      Mod parent up !

    12. Re:Heck by Anonymous Coward · · Score: 0

      Gubmint: "Your binaries, please...."

      Me: "Your Design Pattern Police badge, please ? Any order from the try-catch-finally Investigation Judge?"

    13. Re:Heck by squiggleslash · · Score: 1

      For example, do you implement factories in the same class or in a separate class. Do you use sub interfaces in Java

      That depends on whether I've just learned about them or not. If so, yes. If it's been a while since I learned, I've weaned myself off them and I'm using the next fad^H^H^H uh cool feature.

      Just like everyone else, amiright? ;-)

      So that probably wouldn't help in terms of fingerprinting developers...

      --
      You are not alone. This is not normal. None of this is normal.
    14. Re:Heck by Anonymous Coward · · Score: 0

      Over time you settle with a specific set. However, the programming is also affected by the design. So I have my doubts about the detection approach.

    15. Re: Heck by toddestan · · Score: 1

      That's my thought. Wouldn't an optimizing compiler basically clobber a lot of what one would use to try to do this?

      Also, it seems one way to try to defeat this (assuming it works) would be tinker around with the compiler options to use a different set of optimizations than you usually compile with.

    16. Re: Heck by Anonymous Coward · · Score: 0

      It had better fucking differentiate between the two. A compiler that doesn't differentiate between global and local variables is not a compiler, but a piece of shit.

    17. Re:Heck by Anonymous Coward · · Score: 0

      For example, do you implement factories in the same class or in a separate class.

      I can see how that might apply for Java, for which there used to be byte-code obfuscation products, so we've known for a long time that significant source-level info was preserved in .class files. But for a language like C++, almost no source-level info is preserved (for most implementations, except optional debug info), optimization will clobber much of the fine code structure, and member functions can potentially be located and ordered in almost any way by the linker (remember, the VFT only holds pointers to virtual function entry points). Leading compilers now support PGO, so potentially even statically linked library code could end up having a different binary pattern within different applications.

      - T

  5. Fuck that. by Anonymous Coward · · Score: 3, Insightful

    Aren't we being tracked enough as it is?
    Why for fucks sake why?

    My New Year's resolution will be to remove all my code from all public repositories.

    1. Re:Fuck that. by Anonymous Coward · · Score: 0

      Why? So they can pwn u, like a slave!

    2. Re:Fuck that. by SQLGuru · · Score: 1, Interesting

      It's versioned.......and cloned.......and forked. Good luck with that.

      I think it's funny (ironic, not ha ha) that many of the people espousing Open Source as being perfect are generally the same ones that have the biggest desire for digital privacy. And because of their push for OSS, they will be some of the first to lose their privacy.

      ** I think OSS has its place, as does closed source. I also have a desire for some privacy but recognize that I have to give up some of that privacy in order to have some level of convenience. But I'm not an extremist in either direction for either spectrum.

    3. Re:Fuck that. by Ferocitus · · Score: 1

      Ok, they've got me.
      Now what? Gloat over having found a lump of gristle in a thin gruel?

      --
      USB, USB, USB!
    4. Re:Fuck that. by Anonymous Coward · · Score: 0

      There is money to be made from tracking. So, your opinion of what is "enough" tracking doesn't matter at all.

      From the perspective of those who benefit from the tracking, there is never enough.

    5. Re:Fuck that. by Anonymous Coward · · Score: 0

      This is what they want. To throw a wet blanket on shared and open source code.

      Why? Because how else did DVD CSS get cracked? Because how else would free software challenge entrenched interests? Because how else would disruptive software (really disruptive software, like good crypto, not obvious shit like Twitter) get out to the masses?

      If you withdraw your code because you can be personally identified by it, then you've just made humanity poorer and the tyrants stronger. Don't do that.

    6. Re:Fuck that. by Anonymous Coward · · Score: 0

      And nothing of value was lost.

    7. Re:Fuck that. by Anonymous Coward · · Score: 0

      Why for fucks sake why?

      Internment camp, my friend, internment camp. And then they take your property, publicly call your wife a whore of the enemy, make you do something useless and inefficient like breaking rocks and shoot you when you're physically worn out.

    8. Re:Fuck that. by Anonymous Coward · · Score: 0

      It's explicitly public data. Nobody forced anyone to upload their projects into a public Github account. Since this is all fair for data analysis, somebody took the time to make it happen.

  6. Accuracy 52% with 600 programmers and 8 samples by El_Muerte_TDS · · Score: 4, Insightful

    Good luck when your programmer pool is a couple of thousand and your samples consist of the obfuscated and underhanded software that malware creators often produce.

    1. Re:Accuracy 52% with 600 programmers and 8 samples by Anonymous Coward · · Score: 0

      Jeez you people are thick. This would only be one tool used in an investigation. Then you could take a closer look at who is left and go from there.

    2. Re:Accuracy 52% with 600 programmers and 8 samples by Anonymous Coward · · Score: 1

      It does not need to be perfect; it does not even need to be good. This will just generate another data point which will be used with many other data points to find, track and control people.
      Certainty is not required, as long as all data combined results in a reasonable likelihood for the intended purpose, as deemed by the overlord wielding this tool.

    3. Re:Accuracy 52% with 600 programmers and 8 samples by Anonymous Coward · · Score: 0

      maybe you're the thick one? good luck with your investigation when your useless algorithm with a 50/50 "success" rate throws you off track with a false negative.

    4. Re:Accuracy 52% with 600 programmers and 8 samples by Anonymous Coward · · Score: 0

      good luck with your investigation when your useless algorithm with a 50/50 "success" rate throws you off track with a false negative.

      Police drug sniffing dogs have an accuracy of around or under 50% and the courts have ruled that's just fine. It's We the People who are going to need the good luck. Cops don't need it, they can just throw 100 bogus charges at the wall and see what sticks, day in and day out. It's not like they get fired for failure.

    5. Re:Accuracy 52% with 600 programmers and 8 samples by AHuxley · · Score: 1

      Nations can spend big on their clandestine campus study efforts and over the years can project any nations style they want.
      Did a nation embrace Basic? teach with Ada? early C? Pascal? Have decades of common business oriented language in academia, assembler language, academics who enjoyed lots of free "big iron" access?
      Like hires like, like learns from like.
      Or a large user group of newer Microsoft consumers that are self taught on PC's with newer programming ideas and lots of code reuse?
      With the efforts of the NSA, GCHQ that "kind" of code can then be found with a "correct", "expected" ip range, code style and time zone stamps with a few common terms, phrases, words for "trusted" security experts to stumble over and tell the tech, media about.
      The message can then be amplified by "independent" researchers and sock puppets ready to shape the origin nation conversations online.

      --
      Domestic spying is now "Benign Information Gathering"
  7. Better title: some compilers not optimizing by Anonymous Coward · · Score: 0

    One of the somewhat longstanding assumptions in software development was that hand-optimization was no longer necessary because compiler optimization was sufficient to identify when to unroll loops, when to double-stack recursion, and all the other little performance boosting tricks.
    I have no plans to read the study, but it sounds like at least one compiler is failing to do that. Since it is only using a few basic optimizations, different ways of coding the same behavior will still be observably different in the binary.

    The real takehome is that we should all keep in mind the nature of the hardware and software we are coding for, and consider writing for ideal precision rather than how you would describe the process for another human.

  8. Even worse by Anonymous Coward · · Score: 0

    They can tell if you will commit a crime. Better to get you sooner than later. Already being done in places like Cook county.

  9. Bullshit by Anonymous Coward · · Score: 1

    So what happens when someone copies and pastes from 10 different authors to make a project?

    1. Re:Bullshit by Anonymous Coward · · Score: 0

      Pointing out a case where a thing won't work doesn't mean that thing is "bullshit."

    2. Re:Bullshit by Anonymous Coward · · Score: 0

      It is, if it can easily be defeated.

  10. True, BULLSHIT by Anonymous Coward · · Score: 0

    though, It only works for people who write in fucking high level languages.
    Optimized code in a low level language, can not, nor will it ever lead back to anyone else but GOD!

    Have fun!

  11. Not too surprising. by jellomizer · · Score: 1

    However, a lot of people have similar enough coding styles, so you may be able to break it down to particular camps of styles. Also many people change their style based on the language they are coding in. Also over time their style may evolve and change.
    In my career I try to keep my mind open: when I see another style of coding, rather than judging it inferior to mine, I try to understand it, and if I like it I will incorporate it into my style.

    But compiling your code will not hide how you coded it. I can usually tell how a program is written, and in what style, just by using the application and not even looking at the binary code. Some tasks take a while to run, while others seem quick. Some features are flexible and others are fixed. Different styles tend to tolerate particular tradeoffs.

    --
    If something is so important that you feel the need to post it on the internet... It probably isn't that important.
  12. Stackoverflow is the culprit! by WarmBoota · · Score: 4, Funny

    Good luck tracking me!! I copy all of my code from Stackoverflow!

    --
    90% of everything is crap. Also, crap is relative.
    1. Re:Stackoverflow is the culprit! by Anonymous Coward · · Score: 1

      StackSort connects to StackOverflow, searches for 'sort a list', and downloads and runs code snippets until the list is sorted.
      https://xkcd.com/1185/

      (captcha: truisms)

  13. Oh really? by Viol8 · · Score: 4, Insightful

    If you RTFA it seems their sample size was 20 programmers. Occasionally they went up to 100 and they're getting something like 60-80% accuracy. BFD.

    Guys - when you've sampled the compiled, optimised binary output (with all debug info stripped) of a million coders all using different compilers on different architectures and are getting at least a 99% accuracy rate, get back to us. In the meantime, I'm sure you'll get some nice marks from your supervisors but I won't be losing any sleep.

    1. Re: Oh really? by corychristison · · Score: 2

      Don't forget different versions of the same compiler. Eg. gcc-1.0 may have a different binary output than gcc-2.0

    2. Re:Oh really? by PPH · · Score: 1

      getting at least a 99% accuracy rate

      If they are hoping to use this as evidence in a trial, maybe. But to reduce the size of a list of candidate suspects for further investigation, 60 to 80% could be OK.

      --
      Have gnu, will travel.
    3. Re:Oh really? by Kjella · · Score: 1

      Guys - when you've sampled the compiled, optimised binary output (with all debug info stripped) of a million coders all using different compilers on different architectures and are getting at least a 99% accuracy rate, get back to us. In the meantime, I'm sure you'll get some nice marks from your supervisors but I won't be losing any sleep.

      I wouldn't call it totally useless, imagine you found an unknown binary running on some internal server and it turns out to be a custom inside hack job deployed with stolen credentials. Maybe you even know the thief must have physically been in the victim's office. You now have a relatively limited set of suspects, a binary and a lot of source to compare with. If we're talking classified information, industrial espionage or some other really high end material this could be one lead in the investigation.

      --
      Live today, because you never know what tomorrow brings
    4. Re:Oh really? by Anonymous Coward · · Score: 0

      Yeah more likely than not it's the left over bits of debug code that give shit away. A lot of programmers have their unique way of doing debugs.

    5. Re:Oh really? by Aighearach · · Score: 1

      If you RTFA it seems their sample size was 20 programmers.

      Right, so that tells you that they're idiots because they can't do what the idiots speculate, or that the idiots speculating misunderstood the purpose of the tool?

      when you've sampled the compiled... output... of a million coders all using different compilers on different architectures and are getting at least a 99% accuracy rate, get back to [me]

      There are multiple problems in your analysis. First, there are not millions of compilers or architectures. If you take the compilers and architectures that make up 95% of what is used, you've only got a few platforms and a few compilers, not different ones for millions of programmers. This may sound pedantic, but the problem you imply with this part is actually one of the sample size being too small. If each programmer had a different platform/compiler combination, then even a large number of programmers would have too small a sample size of comparables. But the problem is the opposite of that. (LMFAO) The problem is that since there are only a few different platforms and compilers, lots of people (most people) are using the same few combinations, and so the sample size is actually too large with millions of programmers.

      And worse, why do you think it would need to be 99% useful to be significant? It seems you grabbed the wrong end of the stick there. If it was 20% accurate with a sample size of 1m, my goodness, you could really narrow things down with that. Combined with other factors like cell phone movements, and machine learning methods so that you're using the actual calculated likelihood and not a binary yes/no, well now it might spit out a vastly reduced list of possibilities.

      And working the other direction, in the state the tech is in now, if you can reduce your million candidates to 100 based on cell phone movements combined with other known information, then you have a 60% chance of picking out the 1 person? So if instead you calibrate it to give you the top 5 instead of the top 1, the accuracy would probably be in the high 90s. Getting down to that small group is hard, but then traditional techniques can be employed successfully once you get there, eg, investigation.
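
      To make the top-1 versus top-5 point concrete, here is a minimal Python sketch of how "top-k accuracy" is usually computed. The function name, the author names and the rankings are all made up for illustration; none of them come from the paper:

```python
def top_k_accuracy(ranked_candidates, true_authors, k):
    """Fraction of samples whose true author is among the first k ranked candidates."""
    if len(ranked_candidates) != len(true_authors):
        raise ValueError("need one ranking per sample")
    hits = sum(
        1
        for ranking, author in zip(ranked_candidates, true_authors)
        if author in ranking[:k]
    )
    return hits / len(true_authors)


# Hypothetical rankings for four binaries, each ordered most to least likely.
rankings = [
    ["alice", "bob", "carol"],
    ["bob", "alice", "carol"],
    ["carol", "bob", "alice"],
    ["bob", "carol", "alice"],
]
truth = ["alice", "alice", "alice", "bob"]

print(top_k_accuracy(rankings, truth, 1))  # 0.5
print(top_k_accuracy(rankings, truth, 2))  # 0.75
print(top_k_accuracy(rankings, truth, 3))  # 1.0
```

      Accuracy can only go up as k grows, which is the whole argument: a tool that is mediocre at naming one person can still be good at producing a short list.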

    6. Re:Oh really? by Anonymous Coward · · Score: 0

      Think ahead just a little bit. Please. Just a little.

  14. Useless by Anonymous Coward · · Score: 0

    Well, thank God you can't change your coding style and thank God there aren't style standards for projects one has to adhere to.
    Given this rather useless insight, how difficult would it be for someone to write a piece of code that randomly alters your style?
    30 minutes?

  15. Easy to defeat by ArcadeNut · · Score: 1

    Just run your app through an obfuscator and it's completely masked. Problem solved.

    --
    Visit the Arcade Restoration Workshop @ http://www.arcaderestoration.com
    1. Re:Easy to defeat by Anonymous Coward · · Score: 0

      Depends on how the obfuscation is done. If all the statements are kept in the same place and only the names are obfuscated, I doubt it will help much at all.

    2. Re:Easy to defeat by Anonymous Coward · · Score: 0

      I would write a more thorough obfuscator, but I don't want it traced back to me.

  16. Machine learning isn't quite that deep by Anonymous Coward · · Score: 0

    If the machine says Eketek encoded it, then Eketek alters the portions which get flagged as his own and builds a new assembly. Repeat until the machine blames someone else for it. Problem solved.

  17. Change over time by MBGMorden · · Score: 1

    Even their test size seemed to have low accuracy, but I wonder how well this even works over time. I know my code from 5-6 years ago looks nothing like code that I write today.

    --
    "People who think they know everything are very annoying to those of us who do."-Mark Twain
  18. Two problems for Slashdot readers to work by BitterKraut · · Score: 1

    Problem 1.) Who wrote this https://de.wikipedia.org/wiki/... ? Problem 2.) In the movie First Blood, Part II (a.k.a. Rambo II), when the camera pans through the interiors of Marshall Murdock's CIA base building, parts of the code listing of some computer program can be seen scrolling through some of the screens there. Who wrote that code? Hint to Problem 2: The person in question is also a Slashdot member.

  19. Next up: coding style obfuscators by Anonymous Coward · · Score: 0

    Seems like the evolution of this will be programs that obfuscate coding styles. But then could the obfuscator itself be tracked back to its author too? Echo, echo, echo...

  20. Old News by 0xG · · Score: 1

    This technology was used to determine how many coders were used for the Stuxnet attack.
    That was back in 2010.
    Using that, they determined that a team of 20 people were used, indicating a state-sponsored attack of remarkable complexity...

    --
    A pox on web designers who feel that window.innerWidth == screen.availWidth
  21. We need google code translator by Anonymous Coward · · Score: 0

    then translate my code from c-> perl -> lisp -> c
    No one will be able to trace me anymore!

    Google translation: It's not a bug, it's a feature!
    Captcha: rectum

  22. Vala by Anonymous Coward · · Score: 0

    I wonder how well this works with languages like Vala, which get translated to another language (C, in Vala's case) before that is then compiled. How much of your coding style is still visible on the other side then?

  23. Coding style vs 'problem solving style' by John+Allsup · · Score: 1

    Any nontrivial programming exercise involves problem solving. Faced with a particular recurring problem, a programmer will learn methods to solve it. There are many choices. Most programmers, after having learned a small collection of 'good enough' solutions to common problems will continue to use them whenever 'good enough is good enough', the time and effort of relearning seeming unproductive.

    This is no different, conceptually, than in sports when certain sportspeople play in a discernible style. Nobody is perfectly uniformly good at all aspects of a discipline. (It would be interesting to see if one could take a list of statistics from tennis matches and use them to identify the players.)

    --
    John_Chalisque
    1. Re:Coding style vs 'problem solving style' by BitterKraut · · Score: 1

      I think that, in the case of tennis players, it will be much easier to identify highly discriminating features of players than in the case of computer programmers. This is so because imagining to actually have what it takes to be a great tennis player is much easier than imagining to have the skills of a great programmer: If you can imagine having the skills of a great programmer, you do in fact have them. So, why not examine the stock example: Identifying chess players by their moves?

    2. Re:Coding style vs 'problem solving style' by yacc143 · · Score: 1

      Yes and no. The better people in the industry continue to learn, every day.

      Actually, my current team lead expects us to learn all the time, and is completely willing to take the hit in longer ticket handling times.
      OTOH, my current boss is an outlier in my experience in this industry.

    3. Re:Coding style vs 'problem solving style' by Anonymous Coward · · Score: 0

      Who says that 'they' will be looking at either the code from a good programmer, or looking for the author among 'good' programmers. I'm sure that there are a lot more poor or average programmers out there than 'good' programmers, and that poor/average programmers are more likely to have distinct styles.

  24. Volkswagen Code? by KatchooNJ · · Score: 1

    Maybe now we can track down the guy that wrote that Volkswagen code? I'll be right there... need to grab my pitchfork.

    --
    "Never give up, for that is just the time and place when the tide will change." -Harriet Beecher Stowe ^_^
    1. Re:Volkswagen Code? by BitterKraut · · Score: 2

      At 32c3 https://www.youtube.com/watch?... , Daniel Lange and Felix Domke presented their analysis of Volkswagen's "Dieselgate" software. It seems that that one doesn't look like ordinary code at all, but rather like code patterns generated from tables that relate sensory data to engine control parameters. Think of one of the earliest motivations for building computing machines in the first place: To create parameter tables for artillery aiming!

    2. Re:Volkswagen Code? by 0100010001010011 · · Score: 1

      It's most likely made in Simulink and compiled to C and then for the target platform. Simulink is used everywhere in automated controls from the automotive up through aircraft.

      That's exactly what this code does, I work with A2L files all the time, it's how we calibrate our engines. It's how everyone calibrates their engines.

  25. Lint Checkers by Anonymous Coward · · Score: 0

    Ah good thing I use perl-critic.

  26. Not looking good for me by MarkH · · Score: 1

    Did a couple of scripts in the quiet time between Xmas and new year. Took the chance to move from perl to python. I use GitHub as it is there.

    Think my rating will be 'dufus head'

    1. Re:Not looking good for me by Anonymous Coward · · Score: 0

      Why would you use github at all?

  27. Not so fast by yacc143 · · Score: 1

    Important part:

            Finally, we do not consider executable binaries that are obfuscated
            to hinder reverse engineering. While simple systems,
            such as packers [2] or encryption stubs that merely restore the
            original executable binary into memory during execution may
            be analyzed by simply recovering the unpacked or decrypted
            executable binary from memory, more complex approaches are
            becoming increasingly commonplace, particularly in malware.

    So, there are numerous issues here:

    1.) getting the samples for training (e.g. the authors already mention this as a problem) => github and friends distribute source code, and it's not necessarily trivial to get the compiler and options right to recreate the correct binary.

    2.) If you were, for example, to profile me online, you'd learn from code repositories that I know python, and you might infer from posts that I know other languages. My Python repositories will not help you identify my binaries built in C.

    3.) And worst, the code where this deanonymization would be most useful, e.g. malware, is very hard to handle, as it's usually obfuscated to the max. Worse, malware has been known to mutate itself on replication to avoid leaving a signature for virus scanners.

    Anyway, nice ML paper. ;)

  28. Really? by idbeholda · · Score: 1

    Generally, any programming language has an upper limit on the number of commands that are recognized, which cannot be said of spoken/written languages. The only thing that will actually be discovered are the differences in algorithms, not the unique number of programmers to a particular dialect.

  29. Catching nulls by drolli · · Score: 1

    The place where (and how) you catch nulls is very programmer-specific in my experience and often evades the style-check.

    1. Re:Catching nulls by ledow · · Score: 1

      But...

      Say that's a "signature". There are only "n" so many places you can put the null-check that will work properly.

      Say you can list "m" such things. Then, at most, you can categorise every programmer into one of m x n groups (and, in fact, you might find that certain m's and n's go hand-in-hand, etc.).

      So, if you have something like github - that has 11 million users and 30m repos at last count. Let's assume that most of those 11 million users, then, are programmers that commit code. You'd need to find over 3000 such signatures, each with 3000 possibilities. Or several HUNDRED THOUSAND such signatures, each with a few hundred options.
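
      A quick Python sketch of the arithmetic, for anyone who wants to check it. The first function follows the m x n counting used above with m equal to n. The second uses a different assumption that is not in the comment above, namely that the traits are independent yes/no choices whose combinations multiply (2**m groups); it is included only for comparison. Function names are made up:

```python
import math

USERS = 11_000_000  # GitHub user count quoted above


def square_side_needed(users):
    """m x n model with m == n: smallest n such that n * n >= users."""
    n = math.isqrt(users)
    return n if n * n >= users else n + 1


def binary_traits_needed(users):
    """Alternative assumption: m independent yes/no traits give 2**m groups."""
    m = 0
    while 2 ** m < users:
        m += 1
    return m


print(square_side_needed(USERS))    # 3317
print(binary_traits_needed(USERS))  # 24
```

      Which model is closer to reality depends on how independent such habits really are, and the point about certain m's and n's going hand-in-hand cuts against full independence.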

      Yes, you don't need to narrow down to an individual, but if you're going to use this to try to do - what? Convict malware authors? Then you need to be pretty certain. "Beyond reasonable doubt".

      And I'd like to point out that I've committed code that I didn't actually write, or that I tweaked from others in minor ways to fit my preferred style. So there's all kinds of complications.

      This is bunkum and bollocks. It's about as scientific as starsigns, as useful as "smelling" the code to see if you can detect a whiff of the programmer's body odour, and as admissible as graphology.

      And, if I thought I was writing malware, I can tell you now that I'd do everything I could to disguise the origins of the code, including coding style. I'd make it as boring and non-personal as possible. And likely anything I do would be optimised to oblivion by any compiler anyway.

    2. Re:Catching nulls by MerlynEmrys67 · · Score: 1
      I don't really believe this. I have only used the pattern
      Allocation
      check for NULL
      error
      continue

      There are a few different things to do in the error: do you goto error handling at the end? Do you start unwinding previous work and exit out? Yes, I have done all of those in different situations, with different coding standards around me, but the basic NULL check is the next thing that happens directly after the allocation. Note that compilers don't care about formatting and whitespace, so whether you do it on one line, 3 lines or with 100 page breaks in the middle makes no difference to the compiler's parser that feeds into the binary. This isn't a check for formatting, but for algorithm styles, and I bet they can't pull a programmer's code out of a team's code around them when all are using the same style guidelines (that would affect the expression of algorithms).

      --
      I have mod points and I am not afraid to use them
    3. Re:Catching nulls by drolli · · Score: 1

      Well i have seen people placing constructs which "propagate" nulls in data structures...... (not that i am a fan of this, i would have liked to make the guy swallow his keyboard). The strategies used to do so can be funny.

  30. Obfuscation plus some other shuffling... by gestalt_n_pepper · · Score: 1

    will make any such strategy useless in short order. Source code translators and syntax standardization tools might be another approach.

    Anyway, it's a big yawn, however, some enterprising con artists will sell this to clueless government bureaucrats for big bucks. Bureaucrat will get his bonus. Con artist company will get their money. Win, win. It won't work, of course, but when has that ever mattered in the government world?

    --
    Please do not read this sig. Thank you.
  31. Blend into the crowd by 14erCleaner · · Score: 1

    Cut-and-paste other programmers' code!

    --
    Have you read my blog lately?
  32. Of limited use, but an interesting comment on CS by Anonymous Coward · · Score: 1

    This study seems to have a high error rate. (70-80% correct, less for big programmer populations)
    It might be useful for de-prioritizing some leads, but seems a bit like a divining rod.

    What is interesting is what it says about what programmers do.
    They continually make choices as to how to implement things.
    The choices are limited by their judgement and bag of tricks.
    What they have seen, what has worked in the past, and what they manage to dream up.

    Perhaps this research is actually creating and comparing inventories of these bags?
    If so, then they are not just measuring the properties of one programmer, but a network of programmers trading tricks.
    It should also produce some hints of which programmers have interacted in the past.
    'Interact' does not mean they are actually aware of each other, but rather that they are aware of each other's code.

  33. What if the style was "idiomatic"? by mark-t · · Score: 3, Interesting

    It seems to me like the easiest way to avoid being identified in this regard would be to write code that follows any published general style guidelines or otherwise very common conventions.

    As a side effect, it will make your source code more readable to others, which is beneficial if you are on a programming team.

  34. possible upside by Gravis+Zero · · Score: 3, Interesting

    while i don't think you'll be able to identify an exact person, i do think this technology could be used to identify code that is prone to error and exploitation or even code that is for exploitation.

    --
    Anons need not reply. Questions end with a question mark.
  35. Stupid, but moral of the story: by netsavior · · Score: 1

    There is no way this could be even close to conclusive, but the moral of the story is - if it is stupid, but a judge will call it probable cause, then it isn't stupid.

    The truth is it doesn't need to be conclusive, it just has to look conclusive to a 60 year old law professional with no programming experience.

  36. It's going to lead to a lot of false accusations by cshark · · Score: 1

    As someone who knows a fair amount about compilers and interpreters, I would be highly skeptical of that underlying statement to begin with. Going down this path is a road that stretches the bounds of credulity.

    But I think I would also dispute the notion that programmers have unique coding styles in the age of widely accepted standards and practices.
    Practices, I would concede, are coding styles.

    Though, even then... it's not like we're talking about something like assembly, where the style you use would really be pronounced.

    We're talking about mostly .net and java applications, that use commodity skillsets, that adhere to certain sets of rules and guidelines. You don't give up your best practices and behaviors when you decide to write something that breaks the law.

    You don't change for github, either. In general, most programmers work with whatever is acceptable for the platform they're using, and the rest falls into place.

    The exception, of course, being amateurs, and possibly some hobbyists... who don't really know enough to do anything all that malicious to begin with.

    This is a witch hunt, plain and simple.

    Unless of course, they're actually analyzing writing styles, and they're trying to bring all of us advanced level programmers out by saying something idiotic, knowing that we'll all comment on it.

    Fuckit. They're probably doing that anyway.

    --

    This signature has Super Cow Powers

  37. 01010111 01101000 01100001 01110100 00111111 by fustakrakich · · Score: 1

    01001001 00100000 01100011 01101111 01100100 01100101 00100000 01101001 01101110 00100000 01100010 01101001 01101110 01100001 01110010 01111001

    --
    “He’s not deformed, he’s just drunk!”
    1. Re:01010111 01101000 01100001 01110100 00111111 by Anonymous Coward · · Score: 0

      fustakrakich, is that you?

    2. Re:01010111 01101000 01100001 01110100 00111111 by Anonymous Coward · · Score: 0

      OH great. Now I have to delete all the CowboyNeal porn that downloaded.

    3. Re:01010111 01101000 01100001 01110100 00111111 by Anonymous Coward · · Score: 0

      Asshole, steal my code again and I'll stick you!

  38. How this will be used by Alain+Williams · · Score: 1

    I suspect that the value is not in answering the question "who the hell wrote this - which programmer in Internet land?" but in identifying a programmer out of a small group of suspects, e.g. "was this written by the known malware team in Boston, Beijing or Kiev?". So: it will further narrow the field out of an already small group of suspects.

    This has an interesting implication for GPL enforcement. Today if Nasty Corp Inc takes a large chunk of code from GitHub and makes it part of a proprietary product (e.g. sells it and does not provide source), then even if you suspect that they have taken your code it is hard to prove it; yes, you may be able to get disclosure by going to court, but that costs a lot of money and is hard if they are in a different jurisdiction. Now you will be able to get a good idea whether the code is yours before spending significant time and money chasing Nasty Corp.

  39. Advantage to languages like Go by Anonymous Coward · · Score: 0

    Perhaps a language that is heavily opinionated in both style and patterns (ie what is considered "idiomatic") can help frustrate this kind of analysis.

  40. From the department of woo. by Anonymous Coward · · Score: 0

    Another bullshit method to lump on the pile alongside handwriting analysis, lie detection and parallel construction.

  41. In other news: Investigators find malware creator by guruevi · · Score: 1

    Linus Torvalds has been indicted for creating numerous pieces of malware. "His coding style is unmistakable" prosecutors said quoting numerable code fixes he made after scolding commentaries on other people's coding style.

    --
    Custom electronics and digital signage for your business: www.evcircuits.com
  42. Uh huh. So what happens when... by thermowax · · Score: 2

    ...you run the object code through a permuter like shikata ga nai?

    I suspect the successful detection rate may be a bit lower.

  43. Re:Of limited use, but an interesting comment on C by deodiaus2 · · Score: 2

    Did the same team that developed that code also run an accuracy assessment? Was there a "prize" (contract payment) associated with meeting certain accuracy? I remember reading about facial recognition systems which worked well in labs, but fail in the field.
    As soon as developers become aware that they might be identified, I think that they might do things (spoof, run beautify and strip comments) to throw such a system off.

  44. This is why I use this thing called Inheritance by WillAffleckUW · · Score: 1

    That way, the smoking code leads back to the person who I stole it from, rather than to me.

    --
    -- Tigger warning: This post may contain tiggers! --
  45. Is Android "software designed to commit crimes"? by tepples · · Score: 1

    I'd think the only time this would work is if a programmer was a contributor to open source projects, then went bad and started writing software designed to commit crimes.

    Say someone contributes to open-source projects and then contributes to the Android project. A U.S. appeals court found Android to infringe Oracle's copyright, pending a forthcoming phase of the trial to determine whether API interoperability is a valid rationale for fair use. Does Android count as "software designed to commit crimes" because copyright infringement is a crime?

  46. Why does GitHub deserve a monopoly? by tepples · · Score: 1

    If you aren't uploading to GitHub, you are an Alchemist, not a Scientist.

    If by "alchemist" you mean "someone practicing obsolete practices worthy of derision", this sounds like you're trying to say GitHub ought to have a monopoly on hosting free software projects, as opposed to SourceForge which shares a parent company with Slashdot. Do you work for GitHub?

  47. Clearly Sub-Optimal Compilers are to blame by Anonymous Coward · · Score: 0

    If we had better compilers that would optimize out this stuff, or compilers that did optimization by default to mask this, this wouldn't be an issue.

    1. Re:Clearly Sub-Optimal Compilers are to blame by PJ6 · · Score: 1

      If we had better compilers that would optimize out this stuff, or compilers that did optimization by default to mask this, this wouldn't be an issue.

      You can't optimize out architecture. And not everyone approaches programming like getting a recipe from a cookbook - masters have their signatures.

  48. Snake oil by Anonymous Coward · · Score: 0

    This one is easy, problem is it's like fingerprints, which aren't actually unique but provide a convenient way to find someone to persecute. (Not a typo that).

    Coding style will be even less unique than fingerprints.

    You could possibly use either to prove 'unlikely to be involved' but if it's used it'll most likely be used to 'prove' the innocent guilty.

  49. coding style changes over time by kwikrick · · Score: 1

    I'm pretty sure my coding style has changed significantly over time, from project to project, due to experience, learning from past mistakes, influence from other programmers, etc. Also, I've worked on projects where 10 different programmers have touched the same code. Good luck trying to identify me from any two pieces of code.

    --
    assignment != equality != identity
  50. Paranoid much? by Anonymous Coward · · Score: 0

    n/t

  51. not really news by Anonymous Coward · · Score: 0

    Many, many years ago, I was reverse engineering software for a variety of microcontroller-based products. We ran PROM dumps through disassemblers, etc. It was easy to tell how many different programmers had a hand in the code, from things like lengths of subroutines, looping techniques, and how data was declared and laid out. Even though the original programmers were working in C, it's easy to tell.

    This is no different than the typical 80-90% accuracy you can get on random samples of text from your emails or other writings. While 80% on any one sample isn't all that impressive, when you combine it with other evidence that is orthogonal, it's pretty easy to unambiguously identify an author, especially if you have a halfway decent sized corpus to work from.

    The Enron case and the resulting discovery motions created a HUGE database of emails and memos to work with.

    Software is no different.
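To make the "combine it with other orthogonal evidence" point concrete, here is a rough sketch of the arithmetic, not anything taken from the study. It assumes a uniform prior over a fixed pool of candidate authors, signals that are truly independent, and a signal that, when wrong, points at each other candidate with equal probability. The class and method names are made up for illustration.

```java
// Hypothetical illustration of combining independent attribution signals.
// Assumptions (not from the study): uniform prior over candidates, signals are
// independent, and a wrong signal is spread evenly over the other candidates.
class AuthorEvidence {
    // Posterior probability that the flagged candidate is the true author,
    // given `agreeing` independent signals that all point at that candidate.
    static double posterior(int candidates, double accuracy, int agreeing) {
        if (candidates < 2 || agreeing < 0 || accuracy < 0.0 || accuracy > 1.0) {
            throw new IllegalArgumentException("invalid parameters");
        }
        // Chance that a single signal points at one specific wrong candidate.
        double wrongRate = (1.0 - accuracy) / (candidates - 1);
        // Likelihood of the observations if the flagged candidate is the author.
        double hit = Math.pow(accuracy, agreeing);
        // Summed likelihood over the (candidates - 1) alternative authors.
        double miss = (candidates - 1) * Math.pow(wrongRate, agreeing);
        // The uniform prior cancels out of Bayes' rule.
        return hit / (hit + miss);
    }

    public static void main(String[] args) {
        // Ten suspects, each signal 80% accurate, zero to three agreeing signals.
        for (int k = 0; k <= 3; k++) {
            System.out.printf("signals=%d posterior=%.4f%n", k, posterior(10, 0.8, k));
        }
    }
}
```

Under these assumptions a single 80% signal leaves a one-in-five chance of error, while two agreeing independent signals among ten suspects already push the posterior above 99%. Real signals are rarely fully independent, so treat this as an upper bound on how fast confidence grows.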

  52. Not gona by Anonymous Coward · · Score: 0

    This is not going to hold up in any court, except Iran.

  53. KAPOOF! This destroys core arg of lazy coders by Anonymous Coward · · Score: 0

    I have lost count of the times younger and lazier coders have told me that their high-level compiled code was every bit as efficient as assembly code. There seems to be an absolute conviction that compilers generate the tightest and most-optimized code. If that is so, however, then those compilers should generate the same code no matter who writes the high-level code. If the compilers truly generate different code with uniquely-identifiable characteristics depending upon the coding styles of the individual coders, then those compilers are NOT generating the tight highly-optimized results their supporters claim. Just a thought for you guys that never bother with low level code...

    As with ANY tool in a toolbox, there's a time and a place for an assembler, a compiler, and an interpreter - NONE are ALWAYS the best. I'm certainly not an advocate for writing most apps in low-level code, but it's also not correct to assume that it's just fine to do everything in a high-level language and "let the compiler do all the hard work".

  54. Re: Coding style by EdmundSS · · Score: 1
    Do you write:

    String foo()
    {
        String result;
        if (...)
            result = a;
        else if (...)
            result = b;
        else
            result = c;
        return result;   // single exit point
    }

    or

    String foo()
    {
        if (...)
            return a;    // early return from each branch
        else if (...)
            return b;
        else
            return c;
    }

    Both are common, but compile to different code. Do you code to 'a procedure should live on a page'? How about 'a procedure should have a purpose'? Return errors or throw exceptions? Return values or modified arguments? Kitchen sink constructors? Getters & setters or fluent builders? There's seven variables without stopping to think, dividing coders 128 ways, and I'm sure you could find another dozen or so, taking it to one-in-a-million level. There's no need to obfuscate...
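The counting argument here can be sketched in a few lines. This assumes every habit is an independent yes/no choice that splits programmers evenly, which is a simplification (real habits are correlated and lopsided). The class and method names are hypothetical.

```java
// Hypothetical illustration: how many distinct "style profiles" a set of
// independent yes/no coding habits can produce.
// Assumption: each habit is binary and independent of the others.
class StyleProfileCount {
    // Number of distinct profiles from n independent binary choices: 2^n.
    static long profiles(int binaryChoices) {
        if (binaryChoices < 0 || binaryChoices > 62) {
            throw new IllegalArgumentException("binaryChoices must be in 0..62");
        }
        return 1L << binaryChoices;
    }

    // Smallest number of binary choices whose profiles cover `population` people.
    static int choicesNeeded(long population) {
        if (population < 1 || population > (1L << 62)) {
            throw new IllegalArgumentException("population out of range");
        }
        int n = 0;
        while (profiles(n) < population) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        System.out.println(profiles(7));               // the seven habits listed above
        System.out.println(choicesNeeded(1_000_000L)); // habits needed for one-in-a-million
    }
}
```

Seven binary habits give 2^7 = 128 profiles, and twenty give 2^20 = 1,048,576, so "another dozen or so" on top of the seven is roughly what a one-in-a-million split needs under these idealized assumptions.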