ESR to Shred SCO Claims?

Posted by michael on Tuesday September 9, 2003 @09:52AM from the woodchipper dept.

webmaven writes "According to this article in eWEEK, ESR has released a utility called comparator for analyzing the similarity of source code trees. The technical details are interesting, in that ESR says he is using an implementation of a refined version of the 'shred' algorithm, with higher performance (on machines with enough RAM) than other versions. ESR won't say whether he intends the comparator to be used to compare older Unix code to Linux so as to be able to refute SCO's claims, but it's obviously well suited for such a purpose. Interestingly, as the shred algorithm can run reports on source trees using only the MD5 signature shreds (once generated), it is possible to use it to compare trees without direct access to the source code itself, leading to a possible use in comparing various proprietary source trees with each other and with Freely available code bases such as Linux and *BSD without requiring actual disclosure of the proprietary source code (a neutral third party could generate the shreds on a company's premises, and leave without taking a copy of the source with them). I'll be interested to see if (or which of) the proprietary vendors allow their source trees to be 'shredded' for such comparisons, and whether this becomes a standard forensic technique in source-code copyright and trade-secret disputes."
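To make the idea in the summary concrete, here is a minimal sketch of shred-style comparison. It is not ESR's comparator code: the window size of 3 lines, the whitespace stripping, and the names `shreds`, `tree_a` and `tree_b` are all assumptions chosen for illustration. It hashes every run of consecutive lines with MD5 and then compares only the digests, which is why the comparison itself never needs the source text.

```python
import hashlib

def shreds(lines, size=3):
    """Return the set of MD5 digests of every window of `size` consecutive lines.

    Blank lines are dropped and surrounding whitespace is stripped first;
    that normalization is an assumption of this sketch.
    """
    cleaned = [ln.strip() for ln in lines if ln.strip()]
    digests = set()
    for i in range(len(cleaned) - size + 1):
        chunk = "\n".join(cleaned[i:i + size])
        digests.add(hashlib.md5(chunk.encode("utf-8")).hexdigest())
    return digests

# Two hypothetical source fragments that share a copied run of lines.
tree_a = ["int i;", "for (i = 0; i < n; i++)", "    sum += a[i];", "return sum;"]
tree_b = ["int total = 0;", "int i;", "for (i = 0; i < n; i++)",
          "sum += a[i];", "return sum;"]

sig_a = shreds(tree_a)   # could be generated on-site by a neutral third party
sig_b = shreds(tree_b)   # generated from the freely available tree

# Only the digest sets are compared; neither source text is needed here.
common = sig_a & sig_b
print(len(common), "shared shreds")
```

A larger window makes accidental matches less likely but also makes the comparison easier to defeat with small edits, so the window size is a trade-off rather than a fixed constant.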

Re:maybe... by jmv · 2003-09-09 09:58 · Score: 5, Interesting

Actually, combine this with the "shared source" program from MS and it would be easy to see if MS did (or did not) copy GPL code into Windows as some suggest.

--
Opus: the Swiss army knife of audio codec
The truth is out there by Teahouse · 2003-09-09 09:58 · Score: 2, Interesting

The truth is out there, we will finally get to it without signing a SCO NDA. This should end the case before it begins. SHRED ON!

--
"Curiosity killed the cat, but for a while I was a suspect."- Steven Wright
Re:SCO! by jmv · 2003-09-09 10:00 · Score: 2, Interesting

The thing is that the hashes could be generated by any organisation that has access to the SysV source code. There are many of them (IBM being one).

--
Opus: the Swiss army knife of audio codec
Can Someone Explain? by Klync · 2003-09-09 10:00 · Score: 2, Interesting

If you're comparing two sets of code via their MD5 sums, then won't that miss matching lines that differ by even one character - like, say, a space?
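The concern can be checked directly. The sketch below assumes nothing about how comparator itself treats whitespace; `md5_hex` and `normalize` are hypothetical helpers. It shows that one extra space does change the digest, and that collapsing whitespace before hashing removes that sensitivity.

```python
import hashlib

def md5_hex(text):
    """MD5 digest of a string, as hex."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()

def normalize(line):
    """Collapse every run of whitespace to one space and trim the ends."""
    return " ".join(line.split())

a = "x = y + 1;"
b = "x = y  + 1;"   # same statement with one extra space

raw_match = md5_hex(a) == md5_hex(b)
normalized_match = md5_hex(normalize(a)) == md5_hex(normalize(b))

print("raw digests match:       ", raw_match)
print("normalized digests match:", normalized_match)
```

So the answer depends entirely on what normalization is applied before hashing, not on MD5 itself.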

--

----
Not to be confused with Col.
1. Re:Can Someone Explain? by stratjakt · 2003-09-09 10:03 · Score: 4, Interesting
  
  Perhaps if you parsed them both, and compared the resulting object code, right before compilation?
  
  That way if your variable is called numOfPorts and mine is called countOfPorts, the parsed code is the same for both, when stuff like that becomes meaningless.
  
  Even if not, SCO seems to be saying that much of the code is copy-n-paste anyways.
  
  --
  I don't need no instructions to know how to rock!!!!
Other uses? by Not_Wiggins · 2003-09-09 10:00 · Score: 4, Interesting

It might be interesting to see how different families of Linux/Unix compare... maybe generate a veritable "family tree" of relationships.

Of course, that also depends more on how differences are actually calculated. Still, could make an interesting project to relate OSes based on how much shared code they still retain and show it in a graphical tree format, ala "family tree." 8)
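One hedged way to put a number on "how much shared code they still retain" is the Jaccard index over each tree's set of shred digests. The sketch below uses made-up placeholder strings in place of real digests, and the tree names are hypothetical; it only ranks pairs by similarity and does not build the tree itself.

```python
from itertools import combinations

def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

# Placeholder digest sets; real ones would come from hashing each source tree.
signatures = {
    "tree_a": {"h1", "h2", "h3", "h4"},
    "tree_b": {"h1", "h2", "h3", "h5"},
    "tree_c": {"h1", "h6", "h7", "h8"},
}

# Score every pair of trees, most similar first.
pairs = sorted(
    ((jaccard(signatures[x], signatures[y]), x, y)
     for x, y in combinations(sorted(signatures), 2)),
    reverse=True,
)

for score, x, y in pairs:
    print(f"{x} ~ {y}: {score:.2f}")
```

Feeding such pairwise scores into a hierarchical clustering step would be one way to draw the "family tree", though shared code alone does not say which tree is the ancestor.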

--
Diplomacy is the art of saying, "Nice doggie!" until you can find a rock.
What respect? by Anonymous Coward · 2003-09-09 10:03 · Score: 3, Interesting

Most people *I* know consider ESR to be a bloated windbag with a penchant for fanatical gunrights. He's regarded as pretty much being on the same level as the late Jon Katz.
Be careful... by nolife · 2003-09-09 10:04 · Score: 4, Interesting

The more points in SCO's claims you discover and disprove now, the higher-quality, more refined, and more detailed SCO's evidence will be when this finally gets to court in front of a judge. If they had gone to court two months ago, or even today, they would have been sent home quickly with basically easy-to-disprove evidence. With the help of the open source community, they are slowly changing their weapon of choice from a shotgun to a rifle.

--
Bad boys rape our young girls but Violet gives willingly.
Re:Nah... by jmv · 2003-09-09 10:05 · Score: 3, Interesting

That's true in general. However, SCO has explicitly stated that thousands of lines of code have been illegally copied *verbatim* from System V. This tool could at least prove that they lied (because of the verbatim copy allegation).

--
Opus: the Swiss army knife of audio codec
Re:fire the "laser" by be-fan · 2003-09-09 10:09 · Score: 3, Interesting

You know the sad thing about all this? I can't tell the difference between the auto-generator and your average Slashdotter. Does this mean that the auto-generator passes the Turing Test, or that the average Slashdotter doesn't?

--
A deep unwavering belief is a sure sign you're missing something...
IBM has a project called History Flow by TedTschopp · 2003-09-09 10:12 · Score: 5, Interesting

This is perhaps a better project and it would be interesting to see this tool run against the source.
History Flow. The following is from their website:

history flow
visualizing dynamic, evolving documents and the interactions of multiple collaborating authors:

Motivation
Most documents are the product of continual evolution. An essay may undergo dozens of revisions; source code for a computer program may undergo thousands. And as online collaboration becomes increasingly common, we see more and more ever-evolving group-authored texts. This site is a preliminary report on a simple visual technique, history flow, that provides a clear view of complex records of contributions and collaboration.

--
Fantasy remains a human right; we make in our measure and in our derivative mode... -- JRR Tolkien
It's been around for years by Anonymous Coward · 2003-09-09 10:15 · Score: 3, Interesting

Check out this research project coming out of Berkeley: CAP

Drop in the code you are interested in and it will tell you where it's found in a bunch of open source stuff, including the Linux kernel.
Who says SCO gets to court first? by JoeBuck · 2003-09-09 10:33 · Score: 4, Interesting

If we can show that SCO's violating the BSD license, maybe we can convince some BSD copyright holder to sue them first, and demand as part of discovery the MD5 checksums from "shred", showing duplicated BSD code but no duplicated BSD copyright.
Re:Slim to None by JoeBuck · 2003-09-09 10:36 · Score: 4, Interesting

But IBM already has a copy of SCO's code; they licensed it after all. They can release the output of "shred" without violating their agreements with SCO.
Re:Slim to None by k98sven · 2003-09-09 10:37 · Score: 2, Interesting

Proprietary (closed) source companies have a tremendous advantage over open source software when it comes to violating intellectual property. Who will ever know if they did it? A source code "comparator" eliminates that crucial advantage

Not really.. Open-source software usually has a nice setup with mailing-lists, CVS, etc. Most of the code is well accounted-for. The same is not as true with a lot of proprietary software.

Remember, it's not enough that two pieces of code match to prove an infringement in court.
In fact, the court will most likely take into consideration the fact that the defending code is open source, and the burden of proving that they originated the code would be increased for the plaintiff.

Also, failing to prove that they originated the code could leave them open to a countersuit in which the tables would be turned against them, since they obviously had access to the open-sourced code.
derivative work? by donutz · 2003-09-09 10:53 · Score: 4, Interesting

Presumably any of the many people with legal rights to SCO source code can publish the hash list without divulging any of SCO's (ahem) "IP".

Would these hashes of SCO source code be considered derivative works? That could have copyright implications...
1. Re:derivative work? by ls+-lR · 2003-09-09 21:18 · Score: 2, Interesting
  
  I think we all agree that the obvious "duh" answer is that "of course they wouldn't be derivative works." But SCO has proven that it has a knack for just making stuff up or interpreting things funny. However, even based on the letter of the law I don't think this would qualify as a "transformation." That would seem to apply to a case where you shift the representation of the data to a different format but retain its essence, such as copying a DVD to a VHS tape. However, creating MD5 sums does not seem like it would be a transformation in that sense, in that the new work has none of the qualities of the original -- it's not code, it won't compile, it cannot be used to divine any algorithms, methods, etc. In short, it's completely useless, other than for comparing to other source code fragments.
Not as useful in court by klui · 2003-09-09 10:59 · Score: 2, Interesting

It will just tell someone two trees are similar/identical. The important thing to prove in court is who copied from whom.
Open Source by digidave · 2003-09-09 11:46 · Score: 3, Interesting

THIS is exactly why Open Source works. It's not because of IBM or Red Hat or geeks from Finland. It's because people in the community are willing to step up to any challenge.

Thanks, ESR.

--
The global economy is a great thing until you feel it locally.
Press release! by mflaster · 2003-09-09 11:49 · Score: 2, Interesting

Why isn't this a press release?

If I go to Yahoo, and look at news related to SCOX, this doesn't show up. Here is the open source community trying to help find any misappropriated IP - and no one that doesn't read slashdot/eWeek will know about it!

Isn't there someone who subscribes to a wire service, that can issue a press release? In order to fight FUD, we have to get info out to people that don't read slashdot!!

Mike
Someone Did this in June. by Popsikle · 2003-09-09 12:47 · Score: 2, Interesting

I can't dig up the Slashdot post, but here is The Inquirer article from June 18th. Someone did this well before ESR did.

It's not new, it's not ESR's idea, it's almost 3 months old!!!
Comparison algorithms? by Ivan+the+Terrible · 2003-09-09 12:58 · Score: 2, Interesting

I'm interested in algorithms that could be used to compare code. Moss and CAP from Berkeley are not interesting because the algorithm is secret (AFAIK).
What algorithms other than ESR's comparator are there? (I recall but can't locate a recent comment on Slashdot that said something like "most plagiarism detection programs used by professors use the XXXX algorithm".)
Re:Is there really that much data there? by Krach42 · 2003-09-09 13:11 · Score: 2, Interesting

Actually, some source code is loss-tolerant. Take C for example. In C the only significant whitespace is between any two elements of the set { identifiers, numbers }, and any that occurs in quotes, or character constants.

Also, comments can potentially be discarded without affecting the compilation of the program.

Thus, you can take a program:
int main(void) { printf("Hello World!\n"); return 0; }
And turn it into:
int main(void){printf("Hello World!\n");return 0;}
You've saved yourself space here. Now, here's the weird thing: I wouldn't expect this to save any space after gzip'ing or bzip'ing. I mean, after all, you're primarily just removing one character. But it turns out that on a particular file of mine:

-rw-r--r-- 1 dfoesch staff 9184 Sep 9 19:00 navajo.c
-rw-r--r-- 1 dfoesch staff 3213 Sep 9 18:58 navajo.c.bz2
-rw-r--r-- 1 dfoesch staff 1832 Sep 9 18:58 navajo.c.nospaces.bz2

And gzip is the same. This is thus a lossy compression for source code that doesn't actually modify the semantics or syntax of the program. (Of course, this won't work for a language like Python.)

Yes, the result is unreadable, but then you just run indent, with your favorite coding-style setup, and voilà! It's back to "normal", but different. Just like lossy compression is supposed to work.
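The whitespace rule described above can be sketched in a few lines. This is an illustration, not a real C minifier: `TOKEN` and `strip_spaces` are hypothetical names, comments are not removed, preprocessor lines (which need their newline) are not handled, and sequences such as `a + +b` would be merged incorrectly. It keeps a space only where two word-like tokens would otherwise run together, and leaves string and character literals untouched.

```python
import bz2
import re

# Order matters: literals are matched first so their spaces are preserved.
TOKEN = re.compile(r'''
    "(?:\\.|[^"\\])*"     # string literal, with escapes
  | '(?:\\.|[^'\\])*'     # character constant, with escapes
  | \w+                   # identifier, keyword, or number
  | \S                    # any other single non-space character
''', re.VERBOSE)

def strip_spaces(source):
    """Drop whitespace except where it separates two word-like tokens."""
    out = []
    prev_word = False
    for tok in TOKEN.findall(source):
        is_word = tok[0].isalnum() or tok[0] == "_"
        if is_word and prev_word:
            out.append(" ")   # e.g. keep the space in "return 0"
        out.append(tok)
        prev_word = is_word
    return "".join(out)

program = r'int main(void) { printf("Hello World!\n"); return 0; }'
packed = strip_spaces(program)
print(packed)

# Compare sizes before and after bzip2; on an input this small the
# compressor's fixed overhead dominates, so try it on a real file.
for label, text in (("original", program), ("stripped", packed)):
    raw = text.encode("utf-8")
    print(label, len(raw), "bytes raw,", len(bz2.compress(raw)), "bytes bz2")
```

Running the stripped text back through a formatter such as indent restores a readable layout, though not the original one.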

--

I am unamerican, and proud of it!
Re:No source = no copyright by poptones · 2003-09-09 13:30 · Score: 3, Interesting

Apparently your reading comprehension skills are right on par with the dolt who modded the post down.
Nobody has mentioned this yet ... by Mostly+a+lurker · 2003-09-09 14:26 · Score: 3, Interesting

As currently designed, Shred would obviously not defeat deliberate source misappropriation. If (big if) the method could be adapted such that it could not be easily fooled by a determined violator (and without revealing how the code works) then I believe registration of the results should be required by law. BUT ...
In order that the method should not be fooled by simple changes, at least the following is required:
* White space must be ignored
* Comparison must be at the statement level, not the code line level
* Variable names must be replaced by standard placeholders
* Routine names, other than standard library calls, must be replaced by standard placeholders
* (Probably difficult) logic will be needed in the tool to detect and ignore noops: how do you deal with
i++;
%include noop.i;
a[i]=b[i];

The trouble is: a high proportion of the code sections thus simplified will fall into a relatively small number of possibilities, vulnerable to dictionary type attacks. Thus, most of the code could be reconstructed, though admittedly as obfuscated source code. IMHO this provides a valid objection to its use.
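Both halves of this argument can be sketched together. The code below is a toy under loud assumptions: the `KEYWORDS` and `LIBRARY` sets are tiny hypothetical whitelists, statements are split naively on `;` (which breaks on `for` headers and string literals), every identifier maps to the same `ID` placeholder, and no-op detection is not attempted. It first shows that renaming and respacing no longer change the hashes, then shows the dictionary attack: hashing a list of guessed normalized statements recovers the (obfuscated) code from the hashes alone.

```python
import hashlib
import re

KEYWORDS = {"if", "else", "for", "while", "return", "int", "char", "void"}
LIBRARY = {"printf", "malloc", "free", "strlen"}   # hypothetical whitelist
KEEP = KEYWORDS | LIBRARY

def normalize(statement):
    """Tokenize, drop whitespace, and replace other identifiers with ID."""
    tokens = re.findall(r"[A-Za-z_]\w*|\d+|\S", statement)
    out = []
    for tok in tokens:
        if re.fullmatch(r"[A-Za-z_]\w*", tok) and tok not in KEEP:
            out.append("ID")
        else:
            out.append(tok)
    return " ".join(out)

def md5_hex(text):
    return hashlib.md5(text.encode("utf-8")).hexdigest()

def statement_hashes(source):
    """One MD5 digest per normalized statement (naive split on ';')."""
    return [md5_hex(normalize(s)) for s in source.split(";") if s.strip()]

original = "i++; a[i] = b[i];"
renamed = "idx ++ ;\n dst[idx]=src[idx];"
same = statement_hashes(original) == statement_hashes(renamed)
print("renamed copy still matches:", same)

# Dictionary attack: hash guessed statements and look the digests up.
guesses = [normalize(g) for g in ("x++", "x[y] = z[w]", "return x")]
lookup = {md5_hex(g): g for g in guesses}
recovered = [lookup.get(h, "?") for h in statement_hashes(original)]
print(recovered)
```

The more aggressive the normalization, the smaller the space of possible statements, which is exactly what makes the published hashes guessable.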
No Trade Secrets in Registered Copyrights by Iparadox · 2003-09-09 14:44 · Score: 2, Interesting

I think we are all missing a big point here. SCO registered their copyright in SysV. It was hard to do. They had to create a copy of the source code and file it with the Patent and Trademark Office. That puppy is there so that *we* can look at it. This is specifically *fair use*. It is there so that individuals can protect themselves by comparing what they have to what has been registered. No match means no problem. It just doesn't get any more *fair use* than that. Just have somebody nip up to the PTO, copy the registration for comparison purposes only, (really!) then do the comparisons. How hard is that? Yes, IAAL, but this is not legal advice. Hire your own mouthpiece.
Re:MD5 easily fooled by dmiller · 2003-09-09 15:21 · Score: 4, Interesting

So, you've downloaded Comparator, and run tests, then.

I didn't need to, the following is in the readme:

comparator does not attempt to do semantic analysis and catch relatively trivial changes like renaming of variables, etc. This is because comparator is designed not as a tool to detect plagiarism of ideas (the subject of patent law), but as a tool to detect copying of the expression of ideas (the subject of copyright law).

He's wrong BTW (and he is smart enough to know it, which makes this a deliberate deception). A work is no less subject to copyright if someone does a global search and replace on a variable name.
Better yet, a reason to get MS to stop funding SCO by isn't+my+name · 2003-09-09 15:49 · Score: 2, Interesting

Actually, combine this with the "shared source" program from MS and it would be easy to see if MS did (or did not) copy GPL code into Windows as some suggest.

More importantly, get something like this accepted in a court of law as a legitimate way to do an initial assessment of code yet still preserve a litigant's right to code privacy, and you are going to have not just MS but a number of big companies shaking in their boots. Not necessarily because they did steal anything but because they have to realize it is a possibility that one of their coders did without company knowledge. Doesn't matter, they are still liable.

But, get a method like this accepted in a court of law and you are going to see it used again. I think this has a huge potential to hurt closed software. And perhaps a potential to convince MS to stop funding SCO, perhaps even to apply pressure to get them to start backing down.
Re:No source = no copyright by IM6100 · 2003-09-09 16:10 · Score: 3, Interesting

Get a clue. Nobody who copyrights a work is under any obligation to widely spread around the work. Copyright is inherent in any written work. I can write a poem intended only for my lover, just give the one copy of the poem to that lover, and it's protected by copyright. Break into my lover's house, steal a copy of the poem, and publish it, and you've broken copyright and I have standing to nail you good for it.

Patents, in order to be patented, need to be fully disclosed. That's inherent in the patent process, you're saying 'this is MY idea, here's the whole deal laid out, I assert that it's mine.' There's no comparable obligation for copyright.

People like you who try to mush it all up are just trying to loot other people's property.

--
A Good Intro to NetBS
Possible improvements by gonvaled · 2003-09-09 19:37 · Score: 2, Interesting

A lot of comments focus on the problems that a global search and replace will pose to the technique. I think we can improve the algorithm by doing the following:

What we are looking for here are pieces of code with the same structure: the same for loops, while loops, variable assignment, function names, and so on. The idea would be to substitute all literals by a standard placeholder, and then generate the md5 checksums on the block level (as somebody has previously suggested).

To be able to cheat this technique, a modification in the structure of the code is required. And in the case that exactly that has been done, it is arguable whether that can be considered copyright infringement.
Test SCO Linux Kernel Personality by LightSail · 2003-09-10 01:22 · Score: 2, Interesting

The best use of this technology would be to test the SCO LKP for stolen Linux code.
Confirming that SCO had incorporated Open Source code that they had access to under the GPL would destroy their credibility and open them up to countersuits. The process would only have to reveal enough similarities to have subpoenas ordered for the actual code involved. Then we could prove the theft with SCO's own source code.

I suspect that those who know that Linux code was used to create LKP would come forward once the code has been discovered and posted for all to see.