Slashdot Mirror


ESR to Shred SCO Claims?

webmaven writes "According to this article in eWEEK, ESR has released a utility called comparator for analyzing the similarity of source code trees. The technical details are interesting, in that ESR says he is using an implementation of a refined version of the 'shred' algorithm, with higher performance (on machines with enough RAM) than other versions. ESR won't say whether he intends the comparator to be used to compare older Unix code to Linux so as to be able to refute SCO's claims, but it's obviously well suited for such a purpose. Interestingly, as the shred algorithm can run reports on source trees using only the MD5 signature shreds (once generated), it is possible to use it to compare trees without direct access to the source code itself, leading to a possible use in comparing various proprietary source trees with each other and with Freely available code bases such as Linux and *BSD without requiring actual disclosure of the proprietary source code (a neutral third party could generate the shreds on a company's premises, and leave without taking a copy of the source with them). I'll be interested to see if (or which of) the proprietary vendors allow their source trees to be 'shredded' for such comparisons, and whether this becomes a standard forensic technique in source-code copyright and trade-secret disputes."

16 of 554 comments

  1. ESR ADMITS TO ENRON PRACTICES by Anonymous Coward · · Score: 5, Funny

    This will only serve as another black eye on the Open Source community. ESR should know better than to shred SCO material prior to a trial.

  2. Doubt it will help by Brahmastra · · Score: 5, Insightful

    I think the question here is not about whether there is common code between SCO and Linux. There is no doubt that there will be common code because of the common origins. The issue here is that SCO does not own that code.

    1. Re:Doubt it will help by Azog · · Score: 5, Insightful

      Well, this would still help determine what the common code is.

      If ESR is given the big list of MD5 sums of SCO's kernel by someone who has legitimate access to it, and he runs his shred tool to compare it to the Linux kernel, and a bunch of stuff turns up matching (as expected) he can still see WHAT was matching because he has the Linux sources.
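
      A minimal Python sketch of that lookup, assuming the shreds are MD5 digests of three-line runs (as described elsewhere in this thread) and that the hash list arrives as a plain set of hex strings. The function names and sample lines are made up for illustration; this is not comparator's actual code or file format.

```python
import hashlib

def shreds_with_locations(lines, k=3):
    """Map each k-line MD5 digest to the 1-based line numbers where that run starts."""
    index = {}
    for i in range(len(lines) - k + 1):
        h = hashlib.md5("\n".join(lines[i:i + k]).encode("utf-8")).hexdigest()
        index.setdefault(h, []).append(i + 1)
    return index

def locate_matches(foreign_digests, local_files):
    """Return sorted (filename, start_line) pairs for local shreds found in foreign_digests."""
    hits = []
    for name, lines in local_files.items():
        for h, starts in shreds_with_locations(lines).items():
            if h in foreign_digests:
                hits.extend((name, s) for s in starts)
    return sorted(hits)

# Hypothetical data: only the digests of the closed tree leave the premises.
closed_lines = ["x = 1;", "y = 2;", "z = x + y;", "return z;"]
foreign = set(shreds_with_locations(closed_lines))

# The open tree is available in full, so matches can be traced to file and line.
local = {
    "driver.c": ["int f() {", "x = 1;", "y = 2;", "z = x + y;", "return z;", "}"],
    "other.c": ["a();", "b();", "c();"],
}
matches = locate_matches(foreign, local)
print(matches)
```

      The point of the design is that the side holding the open tree can report file names and line numbers for every hit without ever seeing the closed source.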

      So then he can look at that and say, "hmmm, it looks like part of this ethernet driver is the same, and this NAT implementation, and bits and pieces of the VFAT filesystem code..." and then, find out how those got to be the way they are in Linux.

      If it can be proved that the matching code is totally legit in Linux, (which is what I would expect) then it follows that either (a) SCO actually stole stuff out of Linux, rather than the reverse, or (b) Linux and SCO both took the code from a third source, like BSD.

      Otherwise, option (c) is that Linux actually contains code from SCO which it should not. But this is still an improvement on the current situation, because it would allow the Linux development team to FIX THE PROBLEM.

      Either way, (sooner or later, depending on if Linux fixes are required) it will shoot SCO's claims so full of holes that any reputable journalist reporting on SCO's latest insane claims will have to mention that "... but the source code has been analyzed and all code in Linux similar to SCO's software has been shown to be completely legitimate...", or "... but all code in Linux which SCO might have had a valid issue about has been removed..."

      SCO's big stick right now is FUD. Fear, Uncertainty, and Doubt. The shred tool can remove the uncertainty and doubt. Only SCO will still have the Fear. :-)

      --
      Torrey Hoffman (Azog)
      "HTML needs a rant tag" - Alan Cox
  3. Re:maybe... by jmv · · Score: 5, Interesting

    Actually, combine this with the "shared source" program from MS and it would be easy to see if MS did (or did not) copy GPL code into Windows as some suggest.

  4. SCO may not know origin of code by Malfourmed · · Score: 5, Informative
    The Sydney Morning Herald continues its mainstream coverage of the SCO vs IBM roadshow by posting an article where Dr Warren Toomey, a Unix historian, says that SCO may not know the origin of their own code.

    Article text follows:

    SCO may not know origin of code, says Australian UNIX historian

    By Sam Varghese
    September 9, 2003

    More doubts have been cast on the heritage of System V Unix code, which the SCO Group claims as its own, by an Australian who runs the Unix Heritage Society.

    Dr Warren Toomey, now a computer science lecturer at Bond University, said today: "I'd like to point out that SCO (the present SCO Group) probably doesn't have an idea where they got much of their code. The fact that I had to send SCO (the Santa Cruz Organisation or the old SCO) everything up to and including Sys III says an awful lot."

    He said that even though SCO owned the copyright on Sys III, a few years ago it did not have a copy of the source code. "I was dealing with one of their people at the time, trying to get some code released under a reasonable licence. I sent them the code as a gesture because I knew they did not have a copy," he said with a chuckle.

    Dr Toomey's statements come a few days after Greg Rose, an Australian Unix hacker from the 1970s, raised the possibility that there may be code contributed by people, including himself, which has made its way into System V Unix and is thus being used by companies like the SCO Group.

    Dr Toomey said this was one reason why the code samples which the SCO Group had shown at its annual forum had turned out to be widely published code.

    SCO was unaware of the origins of much of the code and this "explains how they could wheel out the old malloc() code and the BPF (Berkeley Packet Filter) code, not realising that both were now under BSD licences - and in fact they hadn't even written the BPF code," Dr Toomey said.

    He said that there was lots of code which had been developed at the University of New South Wales in the 70s which went to AT&T and was incorporated into UNIX without any copyright notices.

    "At that time the development that was going on was similar to open source - the only difference was that the developers all had to have copies of the code licensed from AT&T," he said.

    Dr Toomey, who served 12 years with the Australian Defence Force Academy, an offshoot of the University of New South Wales, before joining Bond University, said he had source code for Unices from the 3rd version of UNIX which came out in 1974 to the present day. "I don't have Sys V code but there are people with licences for that code who are members of the Unix Heritage Society. We can compare code samples any time," he said.

    He agreed that the codebase of Sys V was a terribly tangled mess. "It is very difficult to trace origins now. There is an awful lot of non-AT&T and non-SCO code in Sys V. There is a lot of BSD code there," he said.

    In March, the SCO Group filed a billion-dollar lawsuit against IBM, for "misappropriation of trade secrets, tortious interference, unfair competition and breach of contract."

    SCO also claimed that Linux was an unauthorised derivative of Unix and warned commercial Linux users that they could be legally liable for violation of intellectual copyright. SCO later expanded its claims against IBM to US$3 billion in June when it said it was withdrawing IBM's licence for its own Unix, AIX.

    IBM has counter-sued SCO while Red Hat Linux has sued SCO to stop it from making "unsubstantiated and untrue public statements attacking Red Hat Linux and the integrity of the Open Source software development process."

    -----

    Wordforge writing contest now open: deadline 2003-03-28

  5. Re:SCO! by mik · · Score: 5, Insightful

    The point is that we don't need SCO to do anything. Presumably any of the many people with legal rights to SCO source code can publish the hash list without divulging any of SCO's (ahem) "IP". Even more interesting is the theoretical possibility of comparing historical releases of SCO trees against GPL-licensed code, thus (perhaps) demonstrating that SCO has illegally violated the IP of OSS developers. Of course, hash comparisons alone would be unlikely to convince a judge/jury of anything. They ought to be sufficient grounds for some embarrassing subpoenas, and maybe some really neat cease-and-desist orders, though.

  6. Re:Can Someone Explain? by Sterling Christensen · · Score: 5, Informative

    From its manual:
    "The -w causes all whitespace in the file (including blank lines) to be ignored for comparison purposes (line numbers in the output report will nevertheless be correct). This is recommended for comparing C code; among other things it means the comparison won't be fooled by differences in indent style."

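    To illustrate what ignoring whitespace buys you, here is a small Python sketch. It is an assumption about the general idea only, not comparator's real implementation: `normalize` and `digest` are hypothetical helpers, and this naive version also strips whitespace inside string literals.

```python
import hashlib

def normalize(lines):
    """Remove all whitespace inside each line and skip lines that end up empty."""
    stripped = ("".join(line.split()) for line in lines)
    return [s for s in stripped if s]

def digest(lines):
    """MD5 of the lines joined by newlines."""
    return hashlib.md5("\n".join(lines).encode("utf-8")).hexdigest()

# Same code, different indent style, spacing, and a stray blank line.
a = ["if (x > 0) {", "    y = x;", "}"]
b = ["if (x>0) {", "", "\ty  =  x;", "}"]

raw_equal = digest(a) == digest(b)
normalized_equal = digest(normalize(a)) == digest(normalize(b))
print(raw_equal, normalized_equal)
```

    Hashing the raw lines treats the two fragments as unrelated; hashing the normalized lines treats them as identical.
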
  7. This is actually a darn good idea by RocketRick · · Score: 5, Informative

    By computing MD5 hashes of consecutive (overlapping) line triplets, the shred algorithm makes it easy to identify copied code, without ever seeing the actual code. This might be a perfect way for companies to allow a third party to compare code, without giving away any trade secrets in the process.

    Of course, since MD5 is a very good cryptographic hash function, *any* one-bit change in the source will result in, on average, half of the bits in the result being flipped. So, this method of identifying copied code would only work if the code had never been run through an obfuscator. It would also be defeatable by running the source through a script to have its variable names search-and-replaced with similar names (such as replacing every variable name with a new name consisting of the old name plus "_newname")....

    In short, this might be a useful technique for allowing a third party to look for trivial wholesale copying of code, but it would be useless for finding a motivated miscreant, determined to steal code without being caught.
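
    A minimal Python sketch of the triplet hashing described above, and of the rename trick that defeats it. The `shred` helper and the sample lines (including the `compact` call) are made up for illustration, and this is an assumption about the general approach rather than ESR's actual code.

```python
import hashlib

def shred(lines, k=3):
    """Return the set of MD5 digests of every run of k consecutive lines."""
    return {
        hashlib.md5("\n".join(lines[i:i + k]).encode("utf-8")).hexdigest()
        for i in range(len(lines) - k + 1)
    }

original = [
    "a = bp->m_addr;",
    "bp->m_addr += size;",
    "if ((bp->m_size -= size) == 0)",
    "    compact(bp);",
]

# Verbatim copy with one extra line on top: the overlapping triplets still line up.
copied = ["/* new header */"] + original

# Search-and-replace on a variable name: every line changes, so every digest changes.
renamed = [line.replace("bp", "bp_newname") for line in original]

common_copy = shred(original) & shred(copied)
common_renamed = shred(original) & shred(renamed)
print(len(common_copy), len(common_renamed))
```

    The verbatim copy shares both of the original's shreds; the renamed copy shares none, even though the structure of the code is untouched.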

    1. Re:This is actually a darn good idea by Trailer Trash · · Score: 5, Insightful

      So, this method of identifying copied code would only work if the code had never been run through an obfuscator.

      You've hit the nail on the head, possibly without knowing it. The source code needs to be run through an obfuscator *before* shredding. Actually, I'm thinking a special obfuscator, let me explain.

      Let's take a piece of C source, not randomly chosen:

      malloc(mp, size)
      struct map *mp;
      {
          register int a;
          register struct map *bp;
          for (bp = mp; bp->m_size; bp++) {
              if (bp->m_size >= size) {
                  a = bp->m_addr;
                  bp->m_addr =+ size;
                  if ((bp->m_size =- size) == 0)
                      do {
                          bp++;
                          (bp-1)->m_addr = bp->m_addr;
                      } while ((bp-1)->m_size = bp->m_size);
                  return(a);
              }
          }
          return(0);
      }

      Now, the structure of the code is 99% of what matters. Variable names can change, but few people would change anything beyond that. Let's modify the code in a couple of important ways. First, all variable names are changed to new names, on a per-line basis. Blank lines and unneeded blanks are all removed. Each statement is on its own line, and formatting styles (such as curly bracket placement) are standardized.

      malloc(a, b)
      struct a *b;
      {
      register int a;
      register struct map *b;
      for (a=b;a->c;a++) {
      if (a->b>= c) {
      a=b->c;
      a->b=+c;
      if ((a->b=-c)==0)
      do {
      a++;
      (a-1)->b=a->b;
      } while ((a-1)->b=a->b);
      return(a);
      }
      }
      return(0);
      }

      This might not be perfect, but it should do the trick. A programmer can change variable names, spacing, or format, but as long as the code is the same, it'll match. Obviously, changing the code would have an impact, but nearly every line would have to be changed for it to not match, and in a substantial way. That's literally not always possible to even do in a way that would trick this function.
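
      A very rough Python sketch of the per-line renaming step, just to show the idea is tractable. Everything here is an assumption for illustration: the `canonical` helper, the tiny keyword list, and the regex tokenizer stand in for a real C lexer, and comments, string literals, and statement splitting are not handled.

```python
import re

# Deliberately incomplete keyword list; a real tool would need a proper C lexer.
KEYWORDS = {"if", "else", "for", "while", "do", "return",
            "register", "int", "char", "struct"}

# Identifiers, runs of digits, or any single non-space character.
TOKEN = re.compile(r"[A-Za-z_]\w*|\d+|\S")

def canonical(line):
    """Rename identifiers to v0, v1, ... in order of first use on this line,
    and join tokens with single spaces so the original spacing is irrelevant."""
    names = {}
    out = []
    for tok in TOKEN.findall(line):
        if (tok[0].isalpha() or tok[0] == "_") and tok not in KEYWORDS:
            tok = names.setdefault(tok, "v%d" % len(names))
        out.append(tok)
    return " ".join(out)

same_1 = canonical("bp->m_addr = bp->m_addr + size;")
same_2 = canonical("p->addr=p->addr+n;")
different = canonical("bp->m_addr = size + bp->m_addr;")
print(same_1)
print(same_1 == same_2, same_1 == different)
```

      Lines that differ only in names and spacing collapse to the same string, so they would hash identically; a line whose operands are reordered does not. Feeding these canonical lines to the shredder instead of the raw lines is the obvious next step.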

      Anyone want to write it?

      Michael

  8. Slim to None by tomRakewell · · Score: 5, Insightful

    Chances are slim to none that a software company would allow its "shredded" source code to be publicly released. What happens if the proprietary source is found to violate the GPL?

    Proprietary (closed) source companies have a tremendous advantage over open source software when it comes to violating intellectual property. Who will ever know if they did it? A source code "comparator" eliminates that crucial advantage.

  9. Results Will Appear "Tainted" by zapf · · Score: 5, Insightful

    While I fully support ESR and the rest of the open source movement's defense of Linux against SCO, I have a feeling that this tool's results will not immediately be accepted by established media simply because of ESR's bias. A reporter looking into the SCO story who knows little about open source wouldn't trust a tool made by one side of the disagreement.

    It seems very important to me that "third parties" and experts who are not an integral part of the open-source movement validate that comparator works as intended and is effective at detecting code similarities. Hopefully we'll see some articles on respected sites in the next week or so with conclusive analyses of comparator. Not to mention a chance for someone to use it on SCO's code!

    Oh, and "Yes, I'm being deliberately vague and tantalizing" is quite funny.

    1. Re:Results Will Appear "Tainted" by Brandybuck · · Score: 5, Insightful

      A reporter looking into the SCO story who knows little about open source wouldn't trust a tool made by one side of the disagreement.

      Then why would a reporter trust the press releases that SCO puts out on a daily basis?

      The unfortunate reality is that they DO trust them. We may all think this is a joke here in our insular community, but the great majority of reporters report the press releases "as is". Then the analysts come along and refine those press releases into easily digestible chunks. Then the pundits come along with preconceptions based on those chunks. Ever wonder why the SCO stock keeps going up and up and up? It's because the only thing the general public knows about this issue has come from SCO.

      Anything that can help get the truth before the public eye is a Good Thing(tm). A tool that can mathematically "prove" that SCO is lying is valuable, even if most reporters suspect a bias.

      --
      Don't blame me, I didn't vote for either of them!
  10. IBM has a project called History Flow by TedTschopp · · Score: 5, Interesting

    This is perhaps a better project and it would be interesting to see this tool run against the source.

    History Flow. The following is from their website:

    history flow
    visualizing dynamic, evolving documents and the interactions of multiple collaborating authors:

    Motivation
    Most documents are the product of continual evolution. An essay may undergo dozens of revisions; source code for a computer program may undergo thousands. And as online collaboration becomes increasingly common, we see more and more ever-evolving group-authored texts. This site is a preliminary report on a simple visual technique, history flow, that provides a clear view of complex records of contributions and collaboration.

    --
    Fantasy remains a human right; we make in our measure and in our derivative mode... -- JRR Tolkien
  11. I can write such a utility also! by pclminion · · Score: 5, Funny

    int main()
    {
    printf("These source trees appear to be entirely different!\n");
    return 0;
    }

  12. SCO's trade secrets --- it's all FUD by EmbeddedJanitor · · Score: 5, Funny

    They would be divulging SCO's biggest trade secret, that all their claims are just FUD.

    --
    Engineering is the art of compromise.
  13. Re:maybe... by Courageous · · Score: 5, Insightful

    And Microsoft's NDA surely gives them the right to do this.

    A term in any contract, including any NDA, as stipulated by any party, which would obligate the other party to not report a violation of law, either statute or criminal, is PER SE unlawful and cannot be enforced within any jurisdiction of most first world countries. Any contract bearing such a stipulation would in fact be at significant risk of invalidating the ENTIRE contract, not just the unlawful provisions therein.

    C//