New Method To Detect and Prove GPL Violations
qwerty writes "A paper to be presented at the upcoming academic conference Automated Software Engineering describes a new method to detect code theft, one that could be used to detect GPL violations in particular. While the so-called birthmarking method is demonstrated for Java, it is general enough to work for other languages as well. The API Birthmark observes the interaction between an application and the (dynamic) libraries that are part of the runtime system. This captures the observable behavior of the program and cannot be easily foiled using code obfuscation techniques, as shown in the paper (PDF). Once such a birthmark is captured, it can be searched for in other programs. By capturing the birthmarks from popular open-source frameworks, GPL-violating applications could be identified."
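To make the idea in the summary concrete, here is a minimal sketch of searching for one program's birthmark inside another. It assumes a birthmark can be approximated as the set of short runs (n-grams) of library calls a program makes; that representation, the function names, and the example traces are all illustrative assumptions, not the paper's actual algorithm.

```python
from typing import Iterable, Set, Tuple


def ngrams(calls: Iterable[str], n: int = 3) -> Set[Tuple[str, ...]]:
    """Return the set of length-n runs of consecutive API calls."""
    seq = list(calls)
    return {tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)}


def containment(birthmark: Set[Tuple[str, ...]],
                suspect: Set[Tuple[str, ...]]) -> float:
    """Fraction of the birthmark's n-grams that also occur in the suspect."""
    if not birthmark:
        return 0.0
    return len(birthmark & suspect) / len(birthmark)


# Hypothetical call traces, invented for illustration only.
library_trace = ["File.open", "Reader.read", "Parser.parse", "Reader.close"]
suspect_trace = ["Logger.init", "File.open", "Reader.read",
                 "Parser.parse", "Reader.close", "Logger.flush"]
unrelated_trace = ["Socket.connect", "Socket.send",
                   "Socket.recv", "Socket.close"]

score_suspect = containment(ngrams(library_trace), ngrams(suspect_trace))
score_unrelated = containment(ngrams(library_trace), ngrams(unrelated_trace))
print(score_suspect, score_unrelated)
```

A high containment score only says the suspect reproduces the library's call pattern; whether that reflects copying is a separate question.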
I used to be a research assistant, and at university we used this technique to see if students copied their assignments. They could rename variables, move pieces of text, and change comments all they liked, but the execution profile stayed the same. We caught a lot of students, and they never figured out how we did it.
Let's just set the code free. Let's not chase it down the street to make sure it stays free; just let it go as it will.
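The comment above does not say which tool was used, so the following is only a sketch of the general idea, assuming the "execution profile" is the sequence of built-in (C-level) calls a submission makes. It uses Python's `sys.setprofile` hook; the three "submissions" are hypothetical.

```python
import sys
from typing import Callable, List


def api_profile(func: Callable, *args) -> List[str]:
    """Record the names of built-in (C-level) functions called while func runs."""
    calls: List[str] = []

    def profiler(frame, event, arg):
        # "c_call" fires when a C function such as sorted() or len() is invoked.
        if event == "c_call":
            calls.append(getattr(arg, "__name__", repr(arg)))

    sys.setprofile(profiler)
    try:
        func(*args)
    finally:
        sys.setprofile(None)
    # Turning the hook off is itself a C call, so drop it from the profile.
    return [name for name in calls if name != "setprofile"]


def student_a(values):
    ordered = sorted(values)
    return sum(ordered) / len(ordered)


def student_b(xs):
    # Same logic as student_a with renamed variables and an extra comment.
    tmp = sorted(xs)
    return sum(tmp) / len(tmp)


def independent(values):
    # A different approach that makes no built-in calls at all.
    total, count = 0, 0
    for v in values:
        total += v
        count += 1
    return total / count


profile_a = api_profile(student_a, [3, 1, 2])
profile_b = api_profile(student_b, [3, 1, 2])
profile_c = api_profile(independent, [3, 1, 2])
print(profile_a == profile_b, profile_a == profile_c)
```

Renaming variables leaves the recorded call sequence untouched, while the independently written version produces a different one.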
What is the false positive rate for this method? What if two programs just happen to do the same thing and the authors happened to choose similar ways to do it? Would this method conclude that one originated with the other? That would not be a copyright violation, because neither is a derivative work of the other.
Also, it occurs to me that this method would probably not be as useful as expected for detecting GPL violations. I would think it would only be effective where you have source code available, or at the very least enough symbol table information to make comparisons, which you are not likely to have if somebody is violating the GPL, because that implies no source code anyway (and almost certainly no symbol table information for the binary).
File under 'M' for 'Manic ranting'
An identical library call signature for a nontrivial part of the execution could be produced by a clean-room analysis or even independent development of an equivalent component. Neither of these is a GPL violation.
This is not to say that the technique wouldn't be useful for hunting down GPL violations. But a positive is not definitive by itself.
Meanwhile code obfuscation (even automatically generated obfuscation) could easily modify at least the timing, if not the order, of such calls.
Nevertheless, this is a powerful tool: a hunk of GPL code that hasn't had its flow obfuscated systematically (even code that HAS been obfuscated, but not systematically) will have large swaths of code that trip the detector. And it doesn't require reverse engineering until after the alarm goes off.
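The trade-off in this comment can be sketched with two toy similarity measures. This assumes, purely for illustration, that an obfuscator swaps two adjacent library calls; the trace and helper names are hypothetical and not taken from the paper.

```python
from typing import List, Set, Tuple


def call_runs(trace: List[str], n: int = 3) -> Set[Tuple[str, ...]]:
    """Order-sensitive view: every run of n consecutive calls."""
    return {tuple(trace[i:i + n]) for i in range(len(trace) - n + 1)}


def overlap(a: set, b: set) -> float:
    """Fraction of a's elements that also appear in b."""
    return len(a & b) / len(a) if a else 0.0


# Hypothetical trace of a GPL component.
original = ["open", "stat", "read", "seek", "read", "parse", "write", "close"]

# The same trace after swapping the calls at positions 2 and 3.
reordered = original[:]
reordered[2], reordered[3] = reordered[3], reordered[2]

# Order-insensitive: which calls occur at all. Reordering cannot change this.
set_score = overlap(set(original), set(reordered))

# Order-sensitive: runs touching the swapped calls break, the rest survive.
run_score = overlap(call_runs(original), call_runs(reordered))

print(set_score, run_score)
```

The order-sensitive score drops but does not reach zero, because the runs the swap never touched still match.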
Good job, guys.
Bantam Dominique roosters crow a four-note song. Once you've heard it as "Happy BIRTHday" you can't NOT hear it that way
GGA! The GNU Genuine Advantage program!
Karma cannot be described by words alone.
Pitchfork? ... Check
Torch? ... Check
Map of Corporate Castle locations? ... Check
FSF Lawyers programmed to be speed dialed in emergencies? ... Check
Desire to burn the non-believers? ... Check
Okay, I'm ready! What IRC Channel are we meeting in?
load "$",8,1
I looked through the paper, and it is cool stuff. But I couldn't see where it supposed the system would work well for other languages, and I wonder if it really would be so good.
Java has a very large standard library that is always dynamically linked, and hence can easily be instrumented as the technique requires. C allows static linking, which would make such hooking much more difficult. Additionally, Java executes in a very standard environment due to the Virtual Machine, whereas other languages may have varying ABIs, type sizes, and other properties that could add significant noise to the birthmark.
That said, system calls are always hookable and reasonably standard, so maybe this technique could be applied successfully there for malware detection or similar?
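As a rough sketch of the system-call variant suggested here, the snippet below reduces an strace-style log to the ordered list of system call names, which could then serve as the raw trace for a birthmark. The sample log is made up for illustration and only mimics the usual `name(args) = result` layout; the function name is hypothetical.

```python
import re
from typing import List

# Matches lines such as: openat(AT_FDCWD, "/etc/hosts", O_RDONLY) = 3
SYSCALL_LINE = re.compile(r"^(\w+)\(")

# Invented sample in the style of strace output.
SAMPLE_LOG = """\
openat(AT_FDCWD, "/etc/hosts", O_RDONLY) = 3
read(3, "127.0.0.1 localhost\\n", 4096) = 20
close(3) = 0
+++ exited with 0 +++
"""


def syscall_names(log: str) -> List[str]:
    """Extract system call names in order, skipping non-syscall lines."""
    names = []
    for line in log.splitlines():
        match = SYSCALL_LINE.match(line.strip())
        if match:
            names.append(match.group(1))
    return names


names = syscall_names(SAMPLE_LOG)
print(names)
```

Arguments and return values are discarded on purpose, since file descriptors and addresses vary between runs and would add noise.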
-- Mike
This is very cool and potentially useful. By itself, it wouldn't be enough to force compliance or win a violation suit, but it could well be enough to meet the threshold for filing a suit and forcing source code analysis in discovery. Really, it is a great tool for ensuring that open source license terms are respected, because it removes the "code anonymity" inherent in a binary.
The Tao of math: The numbers you can count are not the real numbers.
A couple years ago, a manager outsourced some programming work to India. When I reviewed their work, I was impressed, but the code was inconsistent (quality, indent style, variable names, etc). I figured maybe parts were written by a new programmer. A couple days later, I accidentally discovered that a lot of the code (the part that impressed me) had been copied from a GPL program. I alerted my manager, but he didn't care. I alerted the outsource company, they didn't care. I alerted our legal department, and they seemed to care a lot.
Long story short, the manager got fired and I replaced him. We ended up using the original GPL software with some modifications (which were contributed back).
I have released code under the BSD license (as well as GPL, zlib/libpng, Boost, public domain and proprietary, and probably a few others).
The LEAST of my concern in releasing ANY open source is some childish popularity contest.
The only valid reason for me has always been the hope of getting something in return. In the case of BSD, this return is usually "applications that work better". Without the BSD TCP stack, Windows would probably be of worse quality. How would that have benefited anybody except the anti-Microsoft zealot?
Slashdot social media options: AIM, ICQ, Yahoo, Jabber and Mobile Text. Why no MySpace?