ESR to Shred SCO Claims?
webmaven writes "According to this article in eWEEK, ESR has released a utility called comparator for analyzing the similarity of source code trees. The technical details are interesting, in that ESR says he is using an implementation of a refined version of the 'shred' algorithm, with higher performance (on machines with enough RAM) than other versions. ESR won't say whether he intends the comparator to be used to compare older Unix code to Linux so as to be able to refute SCO's claims, but it's obviously well suited for such a purpose. Interestingly, as the shred algorithm can run reports on source trees using only the MD5 signature shreds (once generated), it is possible to use it to compare trees without direct access to the source code itself, leading to a possible use in comparing various proprietary source trees with each other and with Freely available code bases such as Linux and *BSD without requiring actual disclosure of the proprietary source code (a neutral third party could generate the shreds on a company's premises, and leave without taking a copy of the source with them). I'll be interested to see if (or which of) the proprietary vendors allow their source trees to be 'shredded' for such comparisons, and whether this becomes a standard forensic technique in source-code copyright and trade-secret disputes."
microsoft can just shred their source tree and start anew. maybe...
My problem? I was perfectly gruntled, until some numbnuts came by and dissed me.
If there is, why couldn't MD5 shreds be used as a lossy compression scheme for code?
This will only serve as another black eye on the Open Source community. ESR should know better than to shred SCO material prior to a trial.
I think the question here is not about whether there is common code between SCO and Linux. There is no doubt that there will be common code because of the common origins. The issue here is that SCO does not own that code.
This shouldn't be relied upon in a court of law. Although I acknowledge that SCO likely has no IP claim over Linux, it should have a fair case. A program that rules out code similarities does not rule out code that is based on the SCO code. There are hundreds of ways to do a single thing, and if GNU/Linux took ideas from the SCO kernel, SCO may be as eligible for compensation as if the code were directly copied from SCO.
And why did you staple the trout to the RAM?
It might be interesting to see how different families of Linux/Unix compare... maybe generate a veritable "family tree" of relationships.
Of course, that also depends more on how differences are actually calculated. Still, could make an interesting project to relate OSes based on how much shared code they still retain and show it in a graphical tree format, ala "family tree." 8)
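One way the "how differences are actually calculated" part could go, as a rough sketch and not anything comparator itself reports: treat each tree as a set of shred hashes and score every pair by Jaccard similarity (shared shreds divided by total distinct shreds). The tree names and contents below are made up for illustration, and the pairwise table would still need a clustering step to become an actual family tree.

```python
import hashlib
from itertools import combinations

def shred_set(text, shred_size=3):
    """Set of MD5 digests over every window of shred_size consecutive lines."""
    lines = text.splitlines()
    return {
        hashlib.md5("\n".join(lines[i:i + shred_size]).encode("utf-8")).hexdigest()
        for i in range(len(lines) - shred_size + 1)
    }

def jaccard(a, b):
    """Shared shreds divided by all distinct shreds; 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def similarity_table(trees):
    """Map each (name, name) pair, in sorted order, to its Jaccard score."""
    sets = {name: shred_set(text) for name, text in trees.items()}
    return {
        (x, y): jaccard(sets[x], sets[y])
        for x, y in combinations(sorted(sets), 2)
    }

# Hypothetical toy "trees": two share most of their lines, one shares none.
common = ["a();", "b();", "c();", "d();"]
trees = {
    "parent": "\n".join(common + ["e();"]),
    "child": "\n".join(common + ["f();"]),
    "stranger": "\n".join(["x();", "y();", "z();", "w();", "v();"]),
}

for pair, score in similarity_table(trees).items():
    print(pair, score)
```

Feeding the scores into any hierarchical clustering routine would give the tree-shaped picture; the scoring choice (Jaccard here) is an assumption and others would work too.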
Diplomacy is the art of saying, "Nice doggie!" until you can find a rock.
Article text follows:
SCO may not know origin of code, says Australian UNIX historian
By Sam Varghese
September 9, 2003
More doubts have been cast on the heritage of System V Unix code, which the SCO Group claims as its own, by an Australian who runs the Unix Heritage Society.
Dr Warren Toomey, now a computer science lecturer at Bond University, said today: "I'd like to point out that SCO (the present SCO Group) probably doesn't have an idea where they got much of their code. The fact that I had to send SCO (the Santa Cruz Operation or the old SCO) everything up to and including Sys III says an awful lot."
He said that even though SCO owned the copyright on Sys III, a few years ago it did not have a copy of the source code. "I was dealing with one of their people at the time, trying to get some code released under a reasonable licence. I sent them the code as a gesture because I knew they did not have a copy," he said with a chuckle.
Dr Toomey's statements come a few days after Greg Rose, an Australian Unix hacker from the 1970s, raised the possibility that there may be code contributed by people, including himself, which has made its way into System V Unix and is thus being used by companies like the SCO Group.
Dr Toomey said this was one reason why the code samples which the SCO Group had shown at its annual forum had turned out to be widely published code.
SCO was unaware of the origins of much of the code and this "explains how they could wheel out the old malloc() code and the BPF (Berkeley Packet Filter) code, not realising that both were now under BSD licences - and in fact they hadn't even written the BPF code," Dr Toomey said.
He said that there was lots of code which had been developed at the University of New South Wales in the 70s which went to AT&T and was incorporated into UNIX without any copyright notices.
"At that time the development that was going on was similar to open source - the only difference was that the developers all had to have copies of the code licensed from AT&T," he said.
Dr Toomey, who served 12 years with the Australian Defence Force Academy, an offshoot of the University of New South Wales, before joining Bond University, said he had source code for Unices from the 3rd version of UNIX which came out in 1974 to the present day. "I don't have Sys V code but there are people with licences for that code who are members of the Unix Heritage Society. We can compare code samples any time," he said.
He agreed that the codebase of Sys V was a terribly tangled mess. "It is very difficult to trace origins now. There is an awful lot of non-AT&T and non-SCO code in Sys V. There is a lot of BSD code there," he said.
In March, the SCO Group filed a billion-dollar lawsuit against IBM, for "misappropriation of trade secrets, tortious interference, unfair competition and breach of contract."
SCO also claimed that Linux was an unauthorised derivative of Unix and warned commercial Linux users that they could be legally liable for violation of intellectual copyright. SCO later expanded its claims against IBM to US$3 billion in June when it said it was withdrawing IBM's licence for its own Unix, AIX.
IBM has counter-sued SCO while Red Hat Linux has sued SCO to stop it from making "unsubstantiated and untrue public statements attacking Red Hat Linux and the integrity of the Open Source software development process."
-----
Wordforge writing contest now open: deadline 2003-03-28
a world in progress...
Perhaps if you parsed them both, and compared the resulting object code, right before compilation?
That way if your variable is called numOfPorts and mine is called countOfPorts, the parsed code is the same for both, when stuff like that becomes meaningless.
Even if not, SCO seems to be saying that much of the code is copy-n-paste anyways.
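A much cruder stand-in for the "parse both and compare" idea, sketched here with a regular expression rather than a real parser or compiler: rename identifiers to positional placeholders before comparing, so numOfPorts and countOfPorts come out the same. The keyword list and the function name are illustrative assumptions, and this naive version will also rewrite words inside strings and comments.

```python
import re

# Partial list of C keywords that should survive renaming (illustrative, not complete).
C_KEYWORDS = {
    "if", "else", "for", "while", "do", "return", "switch", "case", "break",
    "continue", "default", "goto", "sizeof", "typedef", "struct", "union",
    "enum", "static", "extern", "const", "void", "char", "short", "int",
    "long", "float", "double", "signed", "unsigned",
}

# A word that starts with a letter or underscore, anchored at a word boundary
# so that the tail of a numeric literal such as 0x1F is not treated as a name.
IDENTIFIER = re.compile(r"\b[A-Za-z_]\w*")

def normalize_identifiers(line):
    """Replace each distinct non-keyword identifier with id0, id1, ... in order of first use."""
    names = {}

    def rename(match):
        word = match.group(0)
        if word in C_KEYWORDS:
            return word
        return names.setdefault(word, "id%d" % len(names))

    return IDENTIFIER.sub(rename, line)

mine = "for (i = 0; i < countOfPorts; i++) total += ports[i];"
yours = "for (n = 0; n < numOfPorts; n++) sum += ports[n];"

print(normalize_identifiers(mine))
print(normalize_identifiers(mine) == normalize_identifiers(yours))
```

Numbering the placeholders per line keeps the renaming independent of the rest of the file, which matters if the normalized lines are then hashed in small windows.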
I don't need no instructions to know how to rock!!!!
The more points in SCO's claims you discover and disprove now, the higher quality, more refined, and more detailed SCO's evidence will be when this setup finally gets to a court in front of a judge. If they had gone to court two months ago, or even today, they would have been sent home quickly with basically easy-to-disprove evidence. With the help of the open source community, they are slowly changing their weapon of choice from a shotgun to a rifle.
Bad boys rape our young girls but Violet gives willingly.
The point is that we don't need SCO to do anything. Presumably any of the many people with legal rights to SCO source code can publish the hash list without divulging any of SCO's (ahem) "IP". Even more interesting is the theoretical possibility of comparing historical releases of SCO trees against GPL-licensed code, thus (perhaps) demonstrating that SCO has illegally violated the IP of OSS developers. Of course, hash comparisons alone would be unlikely to convince a judge/jury of anything. They ought to be sufficient grounds for some embarrassing subpoenas, and maybe some really neat cease-and-desist orders, though.
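To make "publish the hash list" concrete, here is a minimal sketch of what such a list might look like and how two lists could be compared with no source in hand. The one-line-per-shred text format ("digest path:line"), the file names, and the function names are all invented for illustration; comparator's real file format is not described in this thread.

```python
import hashlib

def export_hash_list(path_label, text, shred_size=3):
    """Render one 'digest path:line' row per window of shred_size consecutive lines."""
    lines = text.splitlines()
    rows = []
    for i in range(len(lines) - shred_size + 1):
        window = "\n".join(lines[i:i + shred_size])
        digest = hashlib.md5(window.encode("utf-8")).hexdigest()
        rows.append("%s %s:%d" % (digest, path_label, i + 1))
    return "\n".join(rows)

def load_hash_list(listing):
    """Parse the text format back into {digest: [location, ...]}."""
    table = {}
    for row in listing.splitlines():
        digest, location = row.split(" ", 1)
        table.setdefault(digest, []).append(location)
    return table

def overlaps(listing_a, listing_b):
    """Pairs of locations whose digests match, using only the two published lists."""
    a, b = load_hash_list(listing_a), load_hash_list(listing_b)
    return sorted(
        (loc_a, loc_b)
        for digest in a.keys() & b.keys()
        for loc_a in a[digest]
        for loc_b in b[digest]
    )

# Hypothetical inputs: only the exported listings would ever leave the licensee's premises.
closed_listing = export_hash_list("closed.c", "alloc();\ncopy();\nfree();\nexit();")
open_listing = export_hash_list("open.c", "init();\nalloc();\ncopy();\nfree();")

print(overlaps(closed_listing, open_listing))
```

The output names locations, not code, which is about what you would expect to hand to someone who then has to go look at the real files.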
From its manual:
"The -w causes all whitespace in the file (including blank lines) to be ignored for comparison purposes (line numbers in the output report will nevertheless be correct). This is recommended for comparing C code; among other things it means the comparison won't be fooled by differences in indent style."
KDE GUI version should be called Krang since Shredder would obviously be used from the command line (shell). Maybe it should have helper apps called Bebop and Rocksteady. And if the need should arise, the project shouldn't fork...it should splinter.
By computing MD5 hashes of consecutive (overlapping) line triplets, the shred algorithm makes it easy to identify copied code, without ever seeing the actual code. This might be a perfect way for companies to allow a third party to compare code, without giving away any trade secrets in the process.
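A minimal sketch of the idea as described above, in Python rather than C and not taken from comparator's source: hash every window of three consecutive lines, then look for digests that appear in both files. The function names and sample snippets are assumptions for illustration.

```python
import hashlib

def shred_hashes(text, shred_size=3):
    """Return (starting_line_number, md5_hex) for each overlapping window of lines."""
    lines = text.splitlines()
    shreds = []
    for i in range(len(lines) - shred_size + 1):
        window = "\n".join(lines[i:i + shred_size])
        digest = hashlib.md5(window.encode("utf-8")).hexdigest()
        shreds.append((i + 1, digest))
    return shreds

def common_shreds(hashes_a, hashes_b):
    """Pairs of starting line numbers (in a, in b) whose shred digests are identical."""
    index = {}
    for line_no, digest in hashes_b:
        index.setdefault(digest, []).append(line_no)
    return [(la, lb) for la, digest in hashes_a for lb in index.get(digest, [])]

file_a = "int x;\nx = 1;\nfoo(x);\nbar(x);\nreturn x;"
file_b = "/* header */\nx = 1;\nfoo(x);\nbar(x);\nreturn 0;"

# Only the three-line run "x = 1; foo(x); bar(x);" is shared, starting at line 2 in both.
print(common_shreds(shred_hashes(file_a), shred_hashes(file_b)))  # [(2, 2)]
```

Note that the comparison step only ever touches the digests, which is what makes the third-party scenario possible.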
Of course, since MD5 is a very good cryptographic hash function, *any* one-bit change in the source will result in, on average, half of the bits in the result being flipped. So, this method of identifying copied code would only work if the code had never been run through an obfuscator. It would also be defeatable by running the source through a script to have its variable names search-and-replaced with similar names (such as replacing every variable name with a new name consisting of the old name plus "_newname")....
In short, this might be a useful technique for allowing a third party to look for trivial wholesale copying of code, but it would be useless for finding a motivated miscreant, determined to steal code without being caught.
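The rename trick is easy to see for yourself. This sketch counts how many of the 128 MD5 output bits differ between two inputs; the sample line and helper names are made up, and the exact count for the renamed line is not something to predict in advance, only that it will not be zero.

```python
import hashlib

def md5_as_int(data):
    """Interpret the 16-byte MD5 digest as one 128-bit integer."""
    return int.from_bytes(hashlib.md5(data).digest(), "big")

def differing_bits(a, b):
    """Number of bit positions (out of 128) where the two digests disagree."""
    return bin(md5_as_int(a) ^ md5_as_int(b)).count("1")

original = b"count = count + 1;"
renamed = b"count_newname = count_newname + 1;"

print(differing_bits(original, original))  # 0: identical text, identical digest
print(differing_bits(original, renamed))   # some large count: the match is gone
```

So an exact-hash comparison sees the renamed line as completely unrelated, which is the weakness the comment above is pointing at.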
Downside: Uh... it just came out... and it's making some big, big claims involving fuzzy logic. I think it's gonna need some testing first, eh?
Also, anybody else think it only works on larger sections of code than just say 10 lines?
SIG: HUP
Thanks ESR. You've just put a team of mathematicians at SCO who were somehow related to MIT out of their jobs.
Chances are slim to none that a software company would allow its "shredded" source code to be publicly released. What happens if the proprietary source is found to violate the GPL?
Proprietary (closed) source companies have a tremendous advantage over open source software when it comes to violating intellectual property. Who will ever know if they did it? A source code "comparator" eliminates that crucial advantage.
While I fully support ESR and the rest of the open source movement's defense of Linux against SCO, I have a feeling that this tool's results will not immediately be accepted by established media simply because of ESR's bias. A reporter looking into the SCO story who knows little about open source wouldn't trust a tool made by one side of the disagreement.
It seems very important to me that "third parties" and experts who are not an integral part of the open-source movement validate that comparator works as intended and is effective at detecting code similarities. Hopefully we'll see some articles on respected sites in the next week or so with conclusive analyses of comparator. Not to mention a chance for someone to use it on SCO's code!
Oh, and "Yes, I'm being deliberately vague and tantalizing" is quite funny.
This is perhaps a better project and it would be interesting to see this tool run against the source.
History Flow. The following is from their website:
history flow
visualizing dynamic, evolving documents and the interactions of multiple collaborating authors:
Motivation
Most documents are the product of continual evolution. An essay may undergo dozens of revisions; source code for a computer program may undergo thousands. And as online collaboration becomes increasingly common, we see more and more ever-evolving group-authored texts. This site is a preliminary report on a simple visual technique, history flow, that provides a clear view of complex records of contributions and collaboration.
Fantasy remains a human right; we make in our measure and in our derivative mode... -- JRR Tolkien
Comparing the hashes doesn't give you a definitive answer; it does, however, tell you where to look. Or which submitters to ask for clarification on the origins of potentially infringing code. That's more than we have now!
"Freedom means freedom for everybody" -- Dick Cheney
#include <stdio.h>

int main()
{
    printf("These source trees appear to be entirely different!\n");
    return 0;
}
And note that it is not comparing the MD5's of whole files, it is comparing MD5's of three-line "shreds" of files
If we can show that SCO's violating the BSD license, maybe we can convince some BSD copyright holder to sue them first, and demand as part of discovery the MD5 checksums from "shred", showing duplicated BSD code but no duplicated BSD copyright.
What if this ESR tool runs and finds commonality, and the research shows that, in fact, SCO's rights were breached. Remember, this type of analysis is a two-edged sword. The purpose of this ESR is to remove doubt... but remember doubt could be removed either direction.
So, given that hypothetical, what would people here think? Would you forgive SCO? Would you concede SCO's point, but think that SCO defended their rights in a very poor manner? (this, btw, is what I would probably do). Would you stick your fingers in your ears and refuse to accept the outcome, and believe in some vast -wing conspiracy?
Obviously, the Linux movement would carry on. I don't think the death of Linux is even worth discussion. Some recourse would happen, probably monetary damages, and the offending code would be removed.
My real curiosity is how people's attitudes or feelings would change (or not change) if it turns out SCO is right (however unlikely that is).
Sarcasm and hyperbole are the final refuges for weak minds
Presumably any of the many people with legal rights to SCO source code can publish the hash list without divulging any of SCO's (ahem) "IP".
Would these hashes of SCO source code be considered derivative works? That could have copyright implications...
create a C language parser that reduced the C-code down to op codes
like gcc?
In all fairness, SCO's value is not in being purchased so that the source code can be freed...
SCO's value is in acting as a totem against future companies who would try this same stunt....Their value is in their smoking carcass with Daryl's charred head mounted prominently on a high pike...
At this point, there can be no compromise with people who commit fraud to inflate their stock price and to promote FUD.... I believe that Daryl KNOWS that his claims are false...he deserves to fry....
I say, "smoking head on stake" for all the SCO/Canopy group members.... leave all the execs at SCO without a job and discredited like the MCI/ENRON execs....Leave all the investors holding worthless stock certs....Somebody needs to be an example, and SCO volunteered by inflating/changing/hyping/FUDing their claims.
I could have had a little sympathy for them if they had just filed their suit and shut-up until the trial....but at $17/share now, we need to destroy some wallets to remind everyone that it's not over till the gavel falls......
Pardon me, but a lot of you guys are missing the point of this comparator.
1) There are people out there with legit source licenses to SVR5 source trees. And not just Unix OEMs. Various people in large companies with SVR4/5 source licenses etc.
2) Such people cannot release the source code, and may (if paranoid of how they interpret 'derived works') not want to publish hashed MD5 codes of SVR5.
3) However, it should be possible for a legit SVR5 source licensee to publish openly a list of areas of code that are similar across trees, without that list A) being a derived work, B) violating their NDAs (um, do check the fine print first though), or C) costing them tons of their own, presumably expensive, time digging through stuff.
Linux advocates can then sift through the resulting large pieces of code and doublecheck/crosscheck the origins of it. At the very least, we'd have a public list of suspicious areas of Linux and could determine that certain parts are A) BSD-licensed, B) are verified as original by a known Linux coder, and C) don't fall into the above categories and remain 'suspicious'. This presumably is what ESR is referring to by "various persons will apply it in useful ways. Yes, I'm being deliberately vague and tantalizing". Let's say that it's likely the percentages in A and B will be large.
Of course it's true that there could be code that this primitive tool doesn't catch. But SCO probably started their analysis by using tools like this also. Looking through millions of lines of code by hand is no cakewalk, so one will inevitably start with a tool like this in such an investigation. (Unless one is concerned about one specific predetermined critical/sensitive piece of code.)
Oh, and the other thing about this tool that is nice IMHO? It demonstrates a "good faith" effort on the part of Linux advocates and coders to correct the problem -- despite the barriers raised by SCO (no code release except via NDA).
Finally, running this tool across a Linux and a BSD release should turn up some data that is both interesting and relevant for this dispute. I'm almost tempted to try that myself.
--LP
Copyright is misapplied to source code. Either REVEAL THE SOURCE or you only get protection on that which you "publish" - namely, the binary.
Put up or shut up; no source, no copyright on the source. You won't share it, you don't need it protected.
They would be divulging SCO's biggest trade secret, that all their claims are just FUD.
Engineering is the art of compromise.
RTFM:
Name
comparator, filterator -- fast comparisons among large source trees
Synopsis
comparator -c [-d dir] [-o file] [-s shredsize] [-w] [-x] path...
[snip]
The -s option changes the shred size. Smaller shred sizes are more sensitive to small code duplications, but produce correspondingly noisier output. Larger ones will suppress both noise and small similarities.
[snip]
The -w causes all whitespace in the file (including blank lines) to be ignored for comparison purposes (line numbers in the output report will nevertheless be correct). This is recommended for comparing C code; among other things it means the comparison won't be fooled by differences in indent style.
Using the appropriate switches will address two of your points. I'm sure it wouldn't be extraordinarily difficult to modify the code to ignore other things such as string constants, variable names, etc.
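For a feel of what the -w behaviour quoted above buys you, here is a rough sketch that strips all whitespace inside each line and skips blank lines before hashing, while still remembering the original line numbers. It is one guess at the semantics, not comparator's actual code, and the function names and sample snippets are invented.

```python
import hashlib

def significant_lines(text):
    """Yield (original_line_number, line_with_all_whitespace_removed), skipping blank lines."""
    for number, line in enumerate(text.splitlines(), start=1):
        squeezed = "".join(line.split())
        if squeezed:
            yield number, squeezed

def shreds_ignoring_whitespace(text, shred_size=3):
    """Map each shred digest to the original line number where that shred starts."""
    lines = list(significant_lines(text))
    shreds = {}
    for i in range(len(lines) - shred_size + 1):
        window = lines[i:i + shred_size]
        joined = "\n".join(content for _, content in window)
        digest = hashlib.md5(joined.encode("utf-8")).hexdigest()
        shreds.setdefault(digest, window[0][0])
    return shreds

# Same code, different indent style, spacing, and blank lines.
tabbed = "if (x) {\n\tfoo(x);\n\n\tbar(x);\n}"
spaced = "if (x) {\n    foo( x );\n    bar(x);\n}"

a = shreds_ignoring_whitespace(tabbed)
b = shreds_ignoring_whitespace(spaced)
print(set(a) == set(b))    # True: every shred matches despite the formatting
print(sorted(a.values()))  # [1, 2]: reported against the original line numbers
```

This sketch still treats a brace moved onto its own line as a different line, so it covers indent and spacing differences but not every brace-placement style.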
Actually fetchmail proves that he can code.
This program just proves that md5 is not the correct hash for doing this kind of comparison. It is TOO GOOD of a one way hash, and will only return a positive if the lines being compared are 100% equal.
Finkployd
So, you've downloaded Comparator, and run tests, then.
I didn't need to, the following is in the readme:
He's wrong BTW (and he is smart enough to know it, which makes this a deliberate deception). A work is no less subject to copyright if someone does a global search and replace on a variable name.
You may want to check out "The Emperor Has No Clothes", a look at ESR's real code contributions.
http://www.talknerdy.org