ESR to Shred SCO Claims?
webmaven writes "According to this article in eWEEK, ESR has released a utility called comparator for analyzing the similarity of source code trees. The technical details are interesting, in that ESR says he is using an implementation of a refined version of the 'shred' algorithm, with higher performance (on machines with enough RAM) than other versions. ESR won't say whether he intends the comparator to be used to compare older Unix code to Linux so as to be able to refute SCO's claims, but it's obviously well suited for such a purpose. Interestingly, as the shred algorithm can run reports on source trees using only the MD5 signature shreds (once generated), it is possible to use it to compare trees without direct access to the source code itself, leading to a possible use in comparing various proprietary source trees with each other and with Freely available code bases such as Linux and *BSD without requiring actual disclosure of the proprietary source code (a neutral third party could generate the shreds on a company's premises, and leave without taking a copy of the source with them). I'll be interested to see if (or which of) the proprietary vendors allow their source trees to be 'shredded' for such comparisons, and whether this becomes a standard forensic technique in source-code copyright and trade-secret disputes."
I think the question here is not about whether there is common code between SCO and Linux. There is no doubt that there will be common code because of the common origins. The issue here is that SCO does not own that code.
This shouldn't be relied upon in a court of law. Although I acknowledge that SCO likely has no IP claim over Linux, it deserves a fair hearing. A program that rules out code similarities does not rule out code that is based on the SCO code. There are hundreds of ways to do a single thing, and if GNU/Linux took ideas from the SCO kernel, SCO may be as eligible for compensation as if the code had been copied directly from SCO.
And why did you staple the trout to the RAM?
Because lossy compression would be useless. When decompressed the source code wouldn't work anymore.
Source code isn't loss-tolerant (or whatever)
ESR shows us once again why exactly he has so much respect from the community. Well done, that man.
Anyway -- who cares? There's no question there are plenty of common chunks between Linux and SCO-owned source. And that there are ways to find them. The question is what they are (which SCO isn't saying) and what their common origin is and where that origin falls in the murky history of the Unix codebase. It's not as if anyone has been saying, "We're helpless in the face of this computational problem. If only there were a way to compare large bodies of text for common elements!"
Never mind that there are probably people who can compare both codebases in their heads.
Maybe he's made some major algorithmic breakthrough. (I doubt it, but I'll leave that to the experts.) But this story is just him yapping again.
What I'm listening to now on Pandora...
The point is that we don't need SCO to do anything. Presumably any of the many people with legal rights to SCO source code can publish the hash list without divulging any of SCO's (ahem) "IP". Even more interesting is the theoretical possibility of comparing historical releases of SCO trees against GPL-licensed code, thus (perhaps) demonstrating that SCO has illegally violated the IP of OSS developers. Of course, hash comparisons alone would be unlikely to convince a judge or jury of anything. They ought to be sufficient grounds for some embarrassing subpoenas, and maybe some really neat cease-and-desist orders, though.
Upper management in most enterprises has a low level of technical knowledge. To them, the thought of something called shredding coming anywhere near the 'voodoo' of software development would be abhorrent.
Chances are slim to none that a software company would allow its "shredded" source code to be publicly released. What happens if the proprietary source is found to violate the GPL?
Proprietary (closed) source companies have a tremendous advantage over open source software when it comes to violating intellectual property. Who will ever know if they did it? A source code "comparator" eliminates that crucial advantage.
While I fully support ESR and the rest of the open source movement's defense of Linux against SCO, I have a feeling that this tool's results will not immediately be accepted by established media simply because of ESR's bias. A reporter looking into the SCO story who knows little about open source wouldn't trust a tool made by one side of the disagreement.
It seems very important to me that "third parties" and experts who are not an integral part of the open-source movement validate that comparator works as intended and is effective at detecting code similarities. Hopefully we'll see some articles on respected sites in the next week or so with conclusive analyses of comparator. Not to mention a chance for someone to use it on SCO's code!
Oh, and "Yes, I'm being deliberately vague and tantalizing" is quite funny.
Well, I was looking at ESR's description of the code (I haven't read the code yet), and it seems to say that he takes 3 line slices, MD5s them, then compares them for identical points. I'm sure he compensates for funky whitespace and whatnot like diff and patch do...
But if even one bit of the source is different, the MD5 hash will be quite different. So, the code slices have to be IDENTICAL. This is not a very good system because a simple find-replace could defeat it. A variable's name changed by one letter, or even capitalization, will defeat it.
Unless the code reveals much more complex tricks than ESR describes in the help file, this tool wouldn't be much use in the SCO case. Hell, it wouldn't be much use catching college class cheaters even.
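To make the objection above concrete, here is a minimal Python sketch of the idea as described in this thread: hash every run of three consecutive lines with MD5 after dropping whitespace, then intersect the digest sets. This is not ESR's comparator (which is a separate program); the function names `normalize` and `shreds` and the sample snippet are made up for illustration.

```python
import hashlib

def normalize(line):
    """Drop all whitespace so indentation changes do not affect the hash."""
    return "".join(line.split())

def shreds(source, width=3):
    """Return MD5 digests of every run of `width` consecutive non-blank lines."""
    lines = [normalize(l) for l in source.splitlines()]
    lines = [l for l in lines if l]  # ignore blank lines
    digests = set()
    for i in range(len(lines) - width + 1):
        chunk = "\n".join(lines[i:i + width])
        digests.add(hashlib.md5(chunk.encode("utf-8")).hexdigest())
    return digests

original = """
for (bp = mp; bp->m_size; bp++) {
    if (bp->m_size >= size) {
        a = bp->m_addr;
        return(a);
    }
}
"""
reindented = original.replace("    ", "\t")   # cosmetic change only
renamed = original.replace("bp", "p")         # one variable renamed

common_reindented = shreds(original) & shreds(reindented)
common_renamed = shreds(original) & shreds(renamed)
print(len(shreds(original)), len(common_reindented), len(common_renamed))
```

In this toy case re-indenting leaves every shred intact, while renaming a single variable wipes out every shred that touches a renamed line. That is exactly the weakness being complained about.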
Slashdot. It's Not For Common Sense
While you might be able to deal with whitespace, you do still have the problem that you're really only looking at whole-file matches for identity. You can't find one function lifted from some other source. You can't find code that's had even minimal cosmetic surgery on the variable names.
While a high degree of exact matching between two trees would demonstrate related code, lack of a high degree of identical files as determined by this method does not demonstrate that two code trees are unrelated. It's perhaps an interesting metric for comparing two projects that you already know are related, like two forks of a project or two versions of one project. But this technique is nearly useless as an anti-SCO defense.
Comparing the hashes doesn't give you a definitive answer; it does, however, tell you where to look. Or which submitters to ask for clarification on the origins of potentially infringing code. That's more than we have now!
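A rough Python sketch of the "tells you where to look" point: if each digest is stored alongside the file and line where its shred starts, a match between two digest lists points straight at the spots worth a human review. Only the digest-to-location table would need to be exchanged, not the source. The helper names `shred_index` and `common_locations`, the file names, and the three-line window are assumptions for the sake of the example.

```python
import hashlib
from collections import defaultdict

def shred_index(name, source, width=3):
    """Map each shred digest to the (file, first_line_number) places it occurs."""
    index = defaultdict(list)
    numbered = [(n, "".join(text.split()))
                for n, text in enumerate(source.splitlines(), start=1)]
    numbered = [(n, t) for n, t in numbered if t]  # skip blank lines
    for i in range(len(numbered) - width + 1):
        window = numbered[i:i + width]
        joined = "\n".join(t for _, t in window)
        digest = hashlib.md5(joined.encode("utf-8")).hexdigest()
        index[digest].append((name, window[0][0]))
    return index

def common_locations(index_a, index_b):
    """List location pairs whose shreds have identical digests."""
    hits = []
    for digest in index_a.keys() & index_b.keys():
        for loc_a in index_a[digest]:
            for loc_b in index_b[digest]:
                hits.append((loc_a, loc_b))
    return sorted(hits)

tree_a = shred_index("x.c", "int a;\nint b;\nfoo();\nbar();\nbaz();")
tree_b = shred_index("y.c", "void f() {\nfoo();\nbar();\nbaz();\n}")
hits = common_locations(tree_a, tree_b)
print(hits)
```

A hit is only a lead. Someone still has to open both files and work out who copied from whom, or whether both came from a common ancestor.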
"Freedom means freedom for everybody" -- Dick Cheney
It gives software houses a way of publishing commercial code for copyright purposes. If you claim copyright on code, you can publish the MD5 shred sigs for the code. No one can rip you off, but you can enforce your rights in a court.
Even better - no one now has an excuse for not publishing. That means that we can make sure the kernel never comes within spitting distance of anyone else's property again. And if it does - well they should have published.
Now if SCO aren't willing to publish their MD5 shreds, then that can only be because they have no case. In which case - game over!
On the other hand, if they do, the world at large can then go through their published shreds and see exactly whose code SCO have been ripping off. Given the likely origins of those samples they exhibited a while back, I'd say that's likely to be quite a bit.
This looks like the best news for the war against everyone's favourite Stupidly Corrupt Organisation since the whole mess kicked off.
Don't let THEM immanentize the Eschaton!
Think of the chance that any given line of source code in an arbitrary program is repeated somewhere else in a large open source program such as the Linux kernel. This is even more true if some degree of fuzziness is added to handle changes such as adding or removing spaces in insignificant places or removing comments. (There are many other things, like brace style, which affect multiple lines, so you might want to reformat everything to a standard format first.)
Even if only 1% of the lines are found somewhere in the open source code base, I think a vendor who wants to keep their code base secret will have a big problem with someone computing the checksums. In reality, I wouldn't be surprised to see a much higher percentage of lines leaked this way. And this is not the only way leaking can occur (think of the application of simple cryptography).
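The leak worry can be sketched in a few lines of Python. Assuming checksums were published per line (the real tool hashes multi-line shreds, which makes guessing harder but not different in kind), anyone holding a big public corpus can hash it the same way and read off which secret lines also occur in public code. `line_digest` and `recover_lines` are hypothetical names, and the sample lines are invented.

```python
import hashlib

def line_digest(line):
    """MD5 of a line with all whitespace removed."""
    return hashlib.md5("".join(line.split()).encode("utf-8")).hexdigest()

def recover_lines(published_digests, public_corpus):
    """Return the public lines whose digests appear in the published list."""
    lookup = {line_digest(l): l.strip() for l in public_corpus if l.strip()}
    return [lookup[d] for d in published_digests if d in lookup]

# The vendor publishes only digests of its secret source...
secret_source = ["int secret_key = 0x2a;", "for (i = 0; i < n; i++)", "return 0;"]
published = [line_digest(l) for l in secret_source]

# ...but anyone can hash freely available code and look for collisions.
public = ["for (i = 0; i < n; i++)", "return 0;", 'printf("hi");']
recovered = recover_lines(published, public)
print(recovered)
```

Only lines that already exist in public code come back, in their published order. The line unique to the secret source stays hidden, but the position and count of the common lines is itself information.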
I would not want to be the one publishing the checksums of the closed source due to possible legal liability. The checksums are a derived work in any case.
SCO was unaware of the origins of much of the code and this "explains how they could wheel out the old malloc() code and the BPF (Berkeley Packet Filter) code, not realising that both were now under BSD licences - and in fact they hadn't even written the BPF code," Dr Toomey said.
The SCO Group (not old SCO) hasn't written any code in SysV UNIX.
Anyway.. One could hope that when this is all over, the UNIX sources will be bought up from the carcass of SCOX and open-sourced, finally putting it out of its misery..
That is, as long as SysV UNIX doesn't have more stolen code in addition to the BSD code we all know about..
The sooner the zombie of UNIX is put to rest, the better for all the live Unices.
It's a tossup between ESR and Perens as my favorite Open Source advocates. Perens is funny and answers emails and offers to help anybody out. ESR likes guns and writes extremely well. They both have some big brass ones, leading the fight against SCO.
As for Stallman... I think he's still bitching about Debian and their not completely, totally, 100% free packages. I haven't seen him contribute anything in a long time except complaints and rhetoric.
This guy is way out there
While Dark McBride and Chris Sontag shoot their mouths off, the community develops tools to finally make something clearer. :)
-Arnulf
What if this ESR tool runs and finds commonality, and the research shows that, in fact, SCO's rights were breached? Remember, this type of analysis is a two-edged sword. The purpose of this ESR tool is to remove doubt... but remember, doubt could be removed in either direction.
So, given that hypothetical, what would people here think? Would you forgive SCO? Would you concede SCO's point, but think that SCO defended their rights in a very poor manner? (This, btw, is what I would probably do.) Would you stick your fingers in your ears and refuse to accept the outcome, and believe in some vast -wing conspiracy?
Obviously, the Linux movement would carry on. I don't think the death of Linux is even worth discussion. Some recourse would happen, probably monetary damages, and the offending code would be removed.
My real curiosity is how people's attitudes or feelings would change (or not change) if it turns out SCO is right (however unlikely that is).
Sarcasm and hyperbole are the final refuges for weak minds
Did you read the article? Those are some of the most self-aggrandizing quotes I've ever seen in real life. SCO lawyers should "be afraid" of him. He "perfected" the algorithm. His 1500 line program is a complete masterwork; both elegant beyond compare and a paragon of maintainability!
You don't ever see, say, Linus, Larry, or RMS talking themselves up like that.
Posted with Mozilla
Exactly. Any coder worth his salt could put a "shred" program together in a couple of days. ESR's claim to fame is more about being a loudmouthed idiotarian (woops, meant "libertarian", but what's the difference?) than anything else.
Compare C parse trees. That's right, look at the parse trees, use some fancy graph algorithm to compare the calculations and parse tree nodes.
Someone mod this up, I think I'm on to something!
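A toy illustration of the parse-tree idea, with a big caveat: Python's standard library has no C parser, so this sketch compares Python parse trees using the `ast` module as a stand-in. Names are blanked out so that only the shape of the tree is compared. The class `_Anonymize` and the function `shape` are invented for this example, and a real tool for C would need a C front end.

```python
import ast

class _Anonymize(ast.NodeTransformer):
    """Blank out variable, argument, and function names, keeping structure."""

    def visit_Name(self, node):
        node.id = "_"
        return node

    def visit_arg(self, node):
        node.arg = "_"
        return node

    def visit_FunctionDef(self, node):
        self.generic_visit(node)  # anonymize the body and arguments first
        node.name = "_"
        return node

def shape(source):
    """Return a textual dump of the parse tree with identifiers removed."""
    tree = _Anonymize().visit(ast.parse(source))
    return ast.dump(tree)

first = "def f(x, y):\n    return x + y\n"
renamed = "def add(first, second):\n    return first + second\n"
different = "def f(x, y):\n    return x * y\n"

print(shape(first) == shape(renamed))    # same structure, different names
print(shape(first) == shape(different))  # different operator in the tree
```

Comparing whole-tree dumps for equality is the crudest possible "graph algorithm". Matching subtrees, or tolerating reordered statements, is where the hard work would be.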
2 years and no mod points. Join reddit. Because openness is good.
Here is the reason: the people that "stole" SCO's code (if indeed that happened) probably were not acting with ill intent. They probably thought they were doing genuine, valid reuse, in which case, why hide it? Obfuscating runs the risk of introducing new bugs.
OSS programmers, even the ones that cut corners, are not malicious in my experience. Honest mistakes are made because, well, they are lone programmers, not lawyers, or professional managers, or financial experts, or whatever.
However, if code was deliberately obfuscated, that would be very, very bad news for Linux. That shows that it was not an honest mistake, but that the programmer knew something about the origins and that they needed to be hidden. At best, he could argue that he didn't think that it was an IP violation, and that he was just trying to make himself look better by not giving credit. The other side could argue he obviously knew he was breaking the law.
Of course, as I said, I honestly don't think this case will come about. Even if code found its way in, I don't think it was a programmer saying "Hey, I'm going to do this, but it is illegal, so I will cover my tracks."
Right. Because as we all know, people who pay Microsoft the huge bag 'o money that it costs to see their source are primarily interested in the pursuits of OSS to see if Microsoft has copied anything it shouldn't have. And Microsoft's NDA surely gives them the right to do this.
If anyone is able to prove Microsoft is doing something illegal via the shared source initiative, they'll probably have to do it illegally.
Mod me down and I will become more powerful than you can possibly imagine!
While IANAL, I don't suspect you are either. Copyright is not something that applies to ideas - it applies to expressions of ideas. I'll quote the Apple vs Microsoft case note by Joseph Meyers:
In other words, the question on the table is whether portions of the Linux kernel are a derivative work of SCO's code - not whether it uses SCO's ideas.
If you've licensed code from Microsoft, and it turns out to be GPL, the license under which you got the code is invalid, so it wasn't illegal to determine if they improperly took code.
On the other hand, if all their code checks out, testing for that may violate their NDA, but it'd be difficult for them to show you checked their code if you don't mention it.
Need a Catering Connection
Finding obfuscated copied code is a difficult problem to solve. Presumably, SCO has put forth much effort into that, but they refuse to make public their claims.
Straightforward copying of code is much easier to find, and much easier to prove as copying in court. If we look at all the instances of duplicate code and determine whether or not they are license violations, it will be a start toward making SCO go away.
> A single change in the white space would make the
> MD5 keys not match. Any code run through an
> obfuscator would not match. A change in case
> would cause it not to match.
Sorry to flame here, chaps, but the comparator code already strips all white space characters from the lines before checksumming them. Thus trivial changes such as indentation size will be ignored (though switching from 'proper' K&R indenting to some other form will break it).
So, this method of identifying copied code would only work if the code had never been run through an obfuscator.
You've hit the nail on the head, possibly without knowing it. The source code needs to be run through an obfuscator *before* shredding. Actually, I'm thinking a special obfuscator, let me explain.
Let's take a piece of C source, not randomly chosen:
    malloc(mp, size)
    struct map *mp;
    {
        register int a;
        register struct map *bp;

        for (bp = mp; bp->m_size; bp++) {
            if (bp->m_size >= size) {
                a = bp->m_addr;
                bp->m_addr =+ size;
                if ((bp->m_size =- size) == 0)
                    do {
                        bp++;
                        (bp-1)->m_addr = bp->m_addr;
                    } while ((bp-1)->m_size = bp->m_size);
                return(a);
            }
        }
        return(0);
    }

Now, the structure of the code is 99% of what matters. Variable names can change, but few people would change anything beyond that. Let's modify the code in a couple of important ways. First, all variable names are changed to new names, on a per-line basis. Blank lines and unneeded blanks are all removed. Each statement is on its own line, and formatting styles (such as curly bracket placement) are standardized.

    malloc(a, b)
    struct a *b;
    {
    register int a;
    register struct map *b;
    for (a=b;a->c;a++) {
    if (a->b>= c) {
    a=b->c;
    a->b=+c;
    if ((a->b=-c)==0)
    do {
    a++;
    (a-1)->b=a->b;
    } while ((a-1)->b=a->b);
    return(a);
    }
    }
    return(0);
    }

This might not be perfect, but it should do the trick. A programmer can change variable names, spacing, or format, but as long as the code is the same, it'll match. Obviously, changing the code would have an impact, but nearly every line would have to be changed for it to not match, and in a substantial way. That's literally not always possible to even do in a way that would trick this function.
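Taking up the "anyone want to write it?" challenge in the roughest possible way, here is a Python sketch of the per-line renaming step. It is only an approximation of what the post describes: it renames every non-keyword identifier (function names and struct tags included) to `v0`, `v1`, ... in order of first appearance on the line, uses a naive regex tokenizer rather than a real C lexer, and does not attempt the brace-style or one-statement-per-line normalization. `canonical_line` and `canonical_digest` are invented names.

```python
import hashlib
import re

C_KEYWORDS = {
    "auto", "break", "case", "char", "const", "continue", "default", "do",
    "double", "else", "enum", "extern", "float", "for", "goto", "if", "int",
    "long", "register", "return", "short", "signed", "sizeof", "static",
    "struct", "switch", "typedef", "union", "unsigned", "void", "volatile",
    "while",
}

# Identifiers/keywords, numeric literals, or any single other character.
TOKEN = re.compile(r"[A-Za-z_]\w*|\d\w*|\S")

def canonical_line(line):
    """Rename identifiers per line and normalize spacing between tokens."""
    names = {}
    out = []
    for tok in TOKEN.findall(line):
        is_word = tok[0].isalpha() or tok[0] == "_"
        if is_word and tok not in C_KEYWORDS:
            # The first new identifier becomes v0, the next v1, and so on.
            tok = names.setdefault(tok, "v%d" % len(names))
        out.append(tok)
    return " ".join(out)

def canonical_digest(line):
    """MD5 of the canonical form, suitable for feeding into a shredder."""
    return hashlib.md5(canonical_line(line).encode("utf-8")).hexdigest()

print(canonical_line("for (bp = mp; bp->m_size; bp++)"))
print(canonical_digest("a = bp->m_addr;") == canonical_digest("addr=p->m_a;"))
```

The price of per-line renaming is false positives: every `x = y->z;` in any program now hashes the same. Hashing runs of several canonical lines together would probably be needed to keep the noise down.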
Anyone want to write it?
Michael
Do you have ESP?
Copyright is misapplied to source code. Either REVEAL THE SOURCE or you only get protection on that which you "publish" - namely, the binary.
Put up or shut up; no source, no copyright on the source. You won't share it, you don't need it protected.
Actually fetchmail proves that he can code.
This program just proves that MD5 is not the correct hash for doing this kind of comparison. It is TOO GOOD of a one-way hash, and will only return a positive if the lines being compared are 100% equal.
Finkployd
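For contrast with an all-or-nothing hash, here is a small Python sketch of a graded similarity score: the Jaccard overlap between the sets of adjacent token pairs in two lines. This is a swapped-in textbook technique, not anything from comparator, and the names `tokens`, `shingles`, and `jaccard` are made up here.

```python
import re

def tokens(line):
    """Split a line into identifiers, numbers, and single punctuation marks."""
    return re.findall(r"[A-Za-z_]\w*|\d\w*|\S", line)

def shingles(line, k=2):
    """Return the set of all runs of k adjacent tokens."""
    toks = tokens(line)
    return {tuple(toks[i:i + k]) for i in range(len(toks) - k + 1)}

def jaccard(a, b, k=2):
    """Similarity in [0, 1]: shared shingles divided by all shingles."""
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)

print(jaccard("a = bp->m_addr;", "a = bp->m_addr;"))  # identical lines
print(jaccard("a = bp->m_addr;", "a = p->m_addr;"))   # one variable renamed
print(jaccard("a = bp->m_addr;", "return(0);"))       # unrelated lines
```

The catch is that a score like this needs the actual text on both sides. The attraction of MD5 shreds is precisely that the digests can be compared without either party seeing the other's source.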
The more points in SCO's claims you discover and disprove now, the higher-quality, more refined, and more detailed SCO's evidence will be when this setup finally gets to a court in front of a judge.
Having many thousands of bright minds working on our side more than balances the advantage SCO can get by snooping on our discourse, if they can even come close to following it all, that is. We outnumber them; it's stupid not to capitalize on that.
Just think, if the word doesn't go out, there are many people who might not have come out of the woodwork to contribute their valuable input, historical recollection, interesting files, legal insight, whatever. We work in the open, we share information, we cooperate, we are many in number. They work in the dark, they trust nobody, they're afraid to ask for help, they are few. It's open source versus closed source all over again.
Also, we each do our own thinking, we try to come up with the part we can contribute, then we go looking for the best place to contribute it. Multiply by tens of thousands. Compare to a few fevered minds going over and over the same rotten thoughts, then sending out marching orders. Seen two systems like that before? Right, it's a free market economy versus Soviet-style central planning. In the end, the free market won because it is more efficient.
With the help of the open source community, they are slowly changing their weapon of choice from a shotgun to a rifle.
A rifle will not help you much against a herd of 50,000 enraged penguins stampeding towards you at an average speed in excess of 100 miles per hour.
Have you got your LWN subscription yet?
> And Microsoft's NDA surely gives them the right to do this.
A term in any contract, including any NDA, as stipulated by any party, which would obligate the other party to not report a violation of law, either statutory or criminal, is PER SE unlawful and cannot be enforced within any jurisdiction of most first world countries. Any contract bearing such a stipulation would in fact be at significant risk of invalidating the ENTIRE contract, not just the unlawful provisions therein.
C//
Let's apply a powerful legal tool: The silly analogy.
Take a copyrighted work (Harry Potter and The Chamber of Secrets, for example).
Now, rearrange all the letters randomly, and pick (say) every 10th letter. Apply rot13 to the result, and print it.
Is this derivative work? If you think it is, then, yes, copyright holders should be able to control MD5 hashes produced from their work.
I can see this tool becoming helpful for so much more than smashing SCO. Any situation where data comparison is useful, but the data itself must remain secret. All paranoid types (corporate or governmental) will love it. Lawyers could make much use of it.
And, given the dataset it generates, it could be extended to do other useful things such as detect redundant or cut-'n-pasted code, including bugs of the "pasted it in twice" sort.
By removing whitespace you collapsed a number of distinct substrings, i.e. what used to be different substrings of the form A\s+B are now represented as just one substring AB. A smaller set of distinct substrings leads to better compression.
Did your momma have any children that learned to think?
Source code gets no copyright protection: corporations keep their source as a "trade secret" and only get protection on the executable. It is illegal to redistribute (copy) the executable, and the source is entirely within their control (and their responsibility). No real "furtherance of the arts" is accomplished except within the limited scope of usage of the tool itself. If a work is infringed at the source level, therefore, it is (nearly) impossible to prove without revealing "trade secrets" and, therefore, exposing the company to further risk.
Source code gets copyright protection (as constitutionally mandated)
Corporations have to register the source code, and therefore are given full protection on both works. It is just as illegal to redistribute (share) the source beyond the scope allowed by the rights holder, and if a work is infringed there is no risk to the rights holder in defending the work. "Furtherance of the arts" is addressed, as well as the rights of the work's creator.
Corporations are allowed "copyright" on works they do not share.
It becomes nearly impossible for libeled parties to defend themselves, but "rights holders" are free to make claims as they see fit. This gives "rights holders" basically free rein to make accusations which they may never be forced to address in court, and leaves victims nearly defenseless until the (very slow) court gets around to addressing the issue. Neither "furtherance of the arts" nor protection of (libeled) rights holders is served, since the more powerful party remains free to withhold (copyrighted) "evidence" that no one is allowed to see.
How does this system serve rights holders whose works may have been infringed upon, but are forced from the marketplace by another "rights holder" with more money? How does that system serve the public interest? How does it promote progress?
Can you answer any of these questions using sound logic?