ESR to Shred SCO Claims?
webmaven writes "According to this article in eWEEK, ESR has released a utility called comparator for analyzing the similarity of source code trees. The technical details are interesting, in that ESR says he is using an implementation of a refined version of the 'shred' algorithm, with higher performance (on machines with enough RAM) than other versions. ESR won't say whether he intends the comparator to be used to compare older Unix code to Linux so as to be able to refute SCO's claims, but it's obviously well suited for such a purpose. Interestingly, as the shred algorithm can run reports on source trees using only the MD5 signature shreds (once generated), it is possible to use it to compare trees without direct access to the source code itself, leading to a possible use in comparing various proprietary source trees with each other and with Freely available code bases such as Linux and *BSD without requiring actual disclosure of the proprietary source code (a neutral third party could generate the shreds on a company's premises, and leave without taking a copy of the source with them). I'll be interested to see if (or which of) the proprietary vendors allow their source trees to be 'shredded' for such comparisons, and whether this becomes a standard forensic technique in source-code copyright and trade-secret disputes."
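For readers who want a concrete picture of the idea in the summary, here is a minimal sketch of shred-style comparison. It is not ESR's comparator code: the window size of 3 lines, the function name `shreds`, and the two toy "trees" are all assumptions made for illustration. It only shows the general technique of hashing overlapping runs of lines with MD5 and intersecting the resulting digest sets.

```python
import hashlib

def shreds(lines, size=3):
    """Return the set of MD5 digests of every run of `size` consecutive lines.

    `size` is a hypothetical default, not a value taken from comparator.
    """
    digests = set()
    for i in range(len(lines) - size + 1):
        chunk = "\n".join(lines[i:i + size]).encode("utf-8")
        digests.add(hashlib.md5(chunk).hexdigest())
    return digests

# Two toy "source trees" that share one three-line run.
tree_a = ["int i;", "for (i = 0; i < n; i++)", "    sum += a[i];", "return sum;"]
tree_b = ["/* unrelated */", "int i;", "for (i = 0; i < n; i++)", "    sum += a[i];", "x = 1;"]

# Only the digest sets need to be exchanged; neither side sees the other's text.
common = shreds(tree_a) & shreds(tree_b)
print(len(common))  # number of shared three-line shreds
```

Because the comparison works on digest sets alone, a third party could generate the set on site and leave with nothing but hashes, which is the scenario the summary describes.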
Actually, combine this with the "shared source" program from MS and it would be easy to see if MS did (or did not) copy GPL code into Windows as some suggest.
Opus: the Swiss army knife of audio codec
The truth is out there, we will finally get to it without signing a SCO NDA. This should end the case before it begins. SHRED ON!
"Curiosity killed the cat, but for a while I was a suspect."- Steven Wright
The thing is that the hashes could be generated by any organisation that has access to the SysV source code. There are many of them (IBM being one).
Opus: the Swiss army knife of audio codec
If you're comparing two sets of code via their MD5 sums, then won't that miss matching lines that differ by even one character, like, say, a space?
----
Not to be confused with Col.
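On the whitespace question above, a small sketch shows both the problem and the usual workaround. The helper names `md5_line` and `normalize` are made up for this example, and nothing here claims to describe what comparator itself does: a raw MD5 does change when one space is added, but hashing a whitespace-normalized copy of the line makes the two lines match again.

```python
import hashlib

def md5_line(line):
    """MD5 hex digest of a single line of text."""
    return hashlib.md5(line.encode("utf-8")).hexdigest()

def normalize(line):
    """Collapse every run of whitespace to one space and trim both ends."""
    return " ".join(line.split())

a = "x = y + 1;"
b = "x  = y + 1;"   # same statement with one extra space

raw_match = md5_line(a) == md5_line(b)                          # hashes of the raw text
norm_match = md5_line(normalize(a)) == md5_line(normalize(b))   # hashes after normalizing
print(raw_match, norm_match)
```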
It might be interesting to see how different families of Linux/Unix compare... maybe generate a veritable "family tree" of relationships.
Of course, that also depends more on how differences are actually calculated. Still, could make an interesting project to relate OSes based on how much shared code they still retain and show it in a graphical tree format, ala "family tree." 8)
Diplomacy is the art of saying, "Nice doggie!" until you can find a rock.
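One possible way to calculate the differences for the family-tree idea above is the Jaccard index over each system's set of shred digests. This is only a sketch under assumptions: the tree names and the short strings standing in for MD5 digests are invented, and a real family tree would still need a clustering step on top of these pairwise scores.

```python
from itertools import combinations

def jaccard(a, b):
    """Shared digests divided by all digests seen in either tree."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Hypothetical digest sets; the short strings stand in for real MD5 values.
trees = {
    "os_a": {"h1", "h2", "h3", "h4"},
    "os_b": {"h1", "h2", "h3", "h9"},
    "os_c": {"h7", "h8", "h9"},
}

# Score every pair of trees, then pick the most closely related pair.
scores = {
    pair: jaccard(trees[pair[0]], trees[pair[1]])
    for pair in combinations(sorted(trees), 2)
}
closest = max(scores, key=scores.get)
print(closest, scores[closest])
```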
Most people *I* know consider ESR to be a bloated windbag with a penchant for fanatical gunrights. He's regarded as pretty much being on the same level as the late Jon Katz.
The more points you discover and disprove now in SCO's claims, the higher quality, more refined, and more detailed SCO's evidence will be when this finally gets to a court in front of a judge. If they had gone to court two months ago or even today, they would have been sent home quickly with basically easy-to-disprove evidence. With the help of the open source community, they are slowly changing their weapon of choice from a shotgun to a rifle.
Bad boys rape our young girls but Violet gives willingly.
That's true in general. However, SCO has explicitly stated that thousands of lines of code have been illegally copied *verbatim* from System V. This tool could at least prove that they lied (because of the verbatim copy allegation).
Opus: the Swiss army knife of audio codec
You know the sad thing about all this? I can't tell the difference between the auto-generator or your average Slashdotter. Does this mean that the auto-generator passes the Turing Test, or that the average Slashdotter doesn't?
A deep unwavering belief is a sure sign you're missing something...
This is perhaps a better project and it would be interesting to see this tool run against the source.
History Flow. The following is from their website:
history flow: visualizing dynamic, evolving documents and the interactions of multiple collaborating authors
Motivation
Most documents are the product of continual evolution. An essay may undergo dozens of revisions; source code for a computer program may undergo thousands. And as online collaboration becomes increasingly common, we see more and more ever-evolving group-authored texts. This site is a preliminary report on a simple visual technique, history flow, that provides a clear view of complex records of contributions and collaboration.
Fantasy remains a human right; we make in our measure and in our derivative mode... -- JRR Tolkien
Check out this research project coming out of Berkeley: CAP.
Drop in the code you are interested in and it will tell you where it's found in a bunch of open source stuff, including the Linux kernel.
If we can show that SCO's violating the BSD license, maybe we can convince some BSD copyright holder to sue them first, and demand as part of discovery the MD5 checksums from "shred", showing duplicated BSD code but no duplicated BSD copyright.
But IBM already has a copy of SCO's code; they licensed it after all. They can release the output of "shred" without violating their agreements with SCO.
Proprietary (closed) source companies have a tremendous advantage over open source software when it comes to violating intellectual property. Who will ever know if they did it? A source code "comparator" eliminates that crucial advantage
Not really.. Open-source software usually has a nice setup with mailing-lists, CVS, etc. Most of the code is well accounted-for. The same is not as true with a lot of proprietary software.
Remember, it's not enough that two pieces of code match to prove an infringement in court.
In fact, the court will most likely take into consideration the fact that the defending code is open source, and the burden of proving that they originated the code would be increased for the plaintiff.
Also, failing to prove that they originated the code could leave them open to a countersuit in which the tables would be turned against them, since they obviously had access to the open-sourced code.
Presumably any of the many people with legal rights to SCO source code can publish the hash list without divulging any of SCO's (ahem) "IP".
Would these hashes of SCO source code be considered derivative works? That could have copyright implications...
It will just tell someone two trees are similar/identical. The important thing to prove in court is who copied from whom.
THIS is exactly why Open Source works. It's not because of IBM or Red Hat or geeks from Finland. It's because people in the community are willing to step up to any challenge.
Thanks, ESR.
The global economy is a great thing until you feel it locally.
Why isn't this a press release?
If I go to Yahoo, and look at news related to SCOX, this doesn't show up. Here is the open source community trying to help find any misappropriated IP - and no one that doesn't read slashdot/eWeek will know about it!
Isn't there someone who subscribes to a wire service, that can issue a press release? In order to fight FUD, we have to get info out to people that don't read slashdot!!
Mike
I can't dig up the Slashdot post, but here is The Inquirer article from Jun 18th. Someone did this well before ESR did.
It's not new, it's not ESR's idea, it's almost 3 months old!!!
I'm interested in algorithms that could be used to compare code. Moss and CAP from Berkeley are not interesting because the algorithm is secret (AFAIK).
What algorithms other than ESR's comparator are there? (I recall but can't locate a recent comment on Slashdot that said something like "most plagiarism detection programs used by professors use the XXXX algorithm".)
Also, comments can potentially be discarded without affecting the compilation of the program.
Thus, you can take a program: And turn it into: You've saved yourself space here. Now, here's the weird thing: I wouldn't expect this to save any space after gzip'ing or bzip'ing. I mean, after all, you're primarily just removing one character. But it turns out that on a particular file of mine:
-rw-r--r-- 1 dfoesch staff 9184 Sep 9 19:00 navajo.c
-rw-r--r-- 1 dfoesch staff 3213 Sep 9 18:58 navajo.c.bz2
-rw-r--r-- 1 dfoesch staff 1832 Sep 9 18:58 navajo.c.nospaces.bz2
And gzip is the same. This is thus a lossy compression for source code that doesn't actually modify the semantics or syntax of the program. (Of course, this won't work for a language like Python.)
Yes, the result is unreadable, but then you just run indent, with your favorite coding-style setup, and voilà! It's back to "normal", but different. Just like lossy compression is supposed to work.
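The before-and-after listings from the comment above did not survive, so the following is only a guess at the kind of transformation meant, not the poster's code. The function `strip_layout` and the sample C function are made up. The sketch drops indentation and blank lines, which C ignores (assuming no multi-line string literals or macros), and prints sizes before and after bz2 so anyone can try it on their own files.

```python
import bz2

def strip_layout(source):
    """Drop indentation, trailing blanks, and empty lines; tokens are untouched."""
    kept = [line.strip() for line in source.splitlines()]
    return "\n".join(line for line in kept if line)

# A made-up C function standing in for the poster's file.
sample = """int sum(int *a, int n)
{
    int i, total = 0;

    for (i = 0; i < n; i++) {
        total += a[i];
    }
    return total;
}
"""

stripped = strip_layout(sample)

# Raw sizes, then bz2-compressed sizes; results vary from file to file.
print(len(sample), len(stripped))
print(len(bz2.compress(sample.encode())), len(bz2.compress(stripped.encode())))
```

On a file this small the compressed sizes say very little; the interesting comparison is on real source files of several kilobytes, as in the listing above.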
I am unamerican, and proud of it!
Apparently your reading comprehension skills are right on par with the dolt who modded the post down.
In order that the method should not be fooled by simple changes, at least the following is required:
* White space must be ignored
* Comparison must be at the statement level, not the code line level
* Variable names must be replaced by standard placeholders
* Routine names, other than standard library calls, must be replaced by standard placeholders
* (Probably difficult) logic will be needed in the tool to detect and ignore noops: how do you deal with
The trouble is: a high proportion of the code sections thus simplified will fall into a relatively small number of possibilities, vulnerable to dictionary type attacks. Thus, most of the code could be reconstructed, though admittedly as obfuscated source code. IMHO this provides a valid objection to its use.
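The first few requirements in the list above can be sketched roughly as follows. This is an illustration under assumptions, not a real C parser: the keyword set is deliberately partial, the tokenizer is a single regular expression, and the names `normalize_statement`, `TOKEN`, and `C_KEYWORDS` are invented here. It ignores whitespace and replaces every non-keyword identifier with the placeholder `ID`; it does not distinguish library calls from other routines, and it does nothing about no-ops.

```python
import re

# Deliberately partial keyword list, for illustration only.
C_KEYWORDS = {"if", "else", "for", "while", "return", "int", "char", "void"}

# Identifiers, integer literals, or any single non-space character.
TOKEN = re.compile(r"[A-Za-z_]\w*|\d+|\S")

def normalize_statement(stmt):
    """Tokenize one statement and replace non-keyword identifiers with ID."""
    out = []
    for tok in TOKEN.findall(stmt):
        if re.match(r"[A-Za-z_]", tok) and tok not in C_KEYWORDS:
            out.append("ID")
        else:
            out.append(tok)
    return " ".join(out)

a = normalize_statement("total = total + a[i];")
b = normalize_statement("sum=sum+buf [ k ] ;")
print(a)
print(a == b)
```

The same sketch also illustrates the objection raised above: once names are gone, very many statements collapse to the same few normalized forms.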
I think we are all missing a big point here. SCO registered their copyright in SysV. It was hard to do. They had to create a copy of the source code and file it with the Patent and Trademark Office. That puppy is there so that *we* can look at it. This is specifically *fair use*. It is there so that individuals can protect themselves by comparing what they have to what has been registered. No match means no problem. It just doesn't get any more *fair use* than that. Just have somebody nip up to the PTO, copy the registration for comparison purposes only, (really!) then do the comparisons. How hard is that? Yes, IAAL, but this is not legal advice. Hire your own mouthpiece.
So, you've downloaded Comparator, and run tests, then.
I didn't need to, the following is in the readme:
He's wrong BTW (and he is smart enough to know it, which makes this a deliberate deception). A work is no less subject to copyright if someone does a global search and replace on a variable name.
Actually, combine this with the "shared source" program from MS and it would be easy to see if MS did (or did not) copy GPL code into Windows as some suggest.
More importantly, get something like this accepted in a court of law as a legitimate way to do an initial assessment of code yet still preserve a litigant's right to code privacy, and you are going to have not just MS but a number of big companies shaking in their boots. Not necessarily because they did steal anything but because they have to realize it is a possibility that one of their coders did without company knowledge. Doesn't matter, they are still liable.
But, get a method like this accepted in a court of law and you are going to see it used again. I think this has a huge potential to hurt closed software. And perhaps a potential to convince MS to stop funding SCO, perhaps even to apply pressure to get them to start backing down.
Get a clue. Nobody who copyrights a work is under any obligation to widely spread around the work. Copyright is inherent in any written work. I can write a poem intended only for my lover, just give the one copy of the poem to that lover, and it's protected by copyright. Break into my lover's house, steal a copy of the poem, and publish it, and you've broken copyright and I have standing to nail you good for it.
Patents, in order to be patented, need to be fully disclosed. That's inherent in the patent process, you're saying 'this is MY idea, here's the whole deal laid out, I assert that it's mine.' There's no comparable obligation for copyright.
People like you who try to mush it all up are just trying to loot other people's property.
A Good Intro to NetBS
A lot of comments focus on the problems that a global search and replace will pose to the technique. I think we can improve the algorithm by doing the following:
What we are looking for here are pieces of code with the same structure: the same for loops, while loops, variable assignment, function names, and so on. The idea would be to substitute all literals by a standard placeholder, and then generate the md5 checksums on the block level (as somebody has previously suggested).
To be able to cheat this technique, a modification in the structure of the code is required. And in the case that exactly that has been done, it is arguable whether that can be considered copyright infringement.
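A rough sketch of the literal-placeholder proposal above, with the caveat that the regular expression, the placeholder `LIT`, and the function name `block_digest` are all assumptions made for the example. It replaces string, character, and numeric literals with one placeholder, strips all whitespace, and takes the MD5 of the whole block, so two blocks with the same structure but different constants get the same digest.

```python
import hashlib
import re

# String literals, character literals, or plain decimal numbers.
LITERAL = re.compile(r'"(?:\\.|[^"\\])*"|\'(?:\\.|[^\'\\])*\'|\b\d+(?:\.\d+)?\b')

def block_digest(block):
    """MD5 of a code block after replacing literals and removing whitespace."""
    text = LITERAL.sub("LIT", block)
    text = "".join(text.split())
    return hashlib.md5(text.encode("utf-8")).hexdigest()

one = 'for (i = 0; i < 10; i++) {\n    puts("hello");\n}'
two = 'for (i = 0; i < 64; i++) { puts("goodbye, world"); }'
three = 'while (i < 10) { puts("hello"); }'

print(block_digest(one) == block_digest(two))    # same structure, different literals
print(block_digest(one) == block_digest(three))  # different loop structure
```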
The best use of this technology would be to test the SCO LKP for stolen Linux code.
Confirming that SCO had incorporated Open Source code that they had access to under the GPL would destroy their credibility and open them up to countersuits. The process would only have to reveal enough similarities to have subpoenas ordered for the actual code involved. Then we could prove the theft with SCO's own source code.
I suspect that those who know that Linux code was used to create LKP would come forward once the code has been discovered and posted for all to see.