Slashdot Mirror


ESR to Shred SCO Claims?

webmaven writes "According to this article in eWEEK, ESR has released a utility called comparator for analyzing the similarity of source code trees. The technical details are interesting, in that ESR says he is using an implementation of a refined version of the 'shred' algorithm, with higher performance (on machines with enough RAM) than other versions. ESR won't say whether he intends the comparator to be used to compare older Unix code to Linux so as to be able to refute SCO's claims, but it's obviously well suited for such a purpose. Interestingly, as the shred algorithm can run reports on source trees using only the MD5 signature shreds (once generated), it is possible to use it to compare trees without direct access to the source code itself, leading to a possible use in comparing various proprietary source trees with each other and with Freely available code bases such as Linux and *BSD without requiring actual disclosure of the proprietary source code (a neutral third party could generate the shreds on a company's premises, and leave without taking a copy of the source with them). I'll be interested to see if (or which of) the proprietary vendors allow their source trees to be 'shredded' for such comparisons, and whether this becomes a standard forensic technique in source-code copyright and trade-secret disputes."
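
A minimal sketch of the shred idea as the summary describes it, not ESR's actual implementation: hash every run of a few consecutive lines with MD5, then compare two trees using only the resulting digests. The function names, the window size of 3, and the sample lines are assumptions made for illustration.

```python
import hashlib

def shreds(lines, size=3):
    """Return the set of MD5 digests of every run of `size` consecutive lines."""
    digests = set()
    for i in range(len(lines) - size + 1):
        chunk = "\n".join(lines[i:i + size])
        digests.add(hashlib.md5(chunk.encode("utf-8")).hexdigest())
    return digests

def common_shreds(lines_a, lines_b, size=3):
    """Digests present in both inputs; only the digest sets are compared."""
    return shreds(lines_a, size) & shreds(lines_b, size)

# Hypothetical stand-ins for files from two source trees.
tree_a = ["int i;", "for (i = 0; i < n; i++)", "    total += a[i];", "return total;"]
tree_b = ["int total = 0;", "int i;", "for (i = 0; i < n; i++)", "    total += a[i];", "return 0;"]

overlap = common_shreds(tree_a, tree_b)
print(len(overlap))  # one three-line run is shared
```

Because the comparison needs only the digest sets, whoever holds one tree can hand over the output of `shreds` without handing over the lines themselves.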

40 of 554 comments

  1. Re:maybe... by jmv · · Score: 5, Interesting

    Actually, combine this with the "shared source" program from MS and it would be easy to see if MS did (or did not) copy GPL code into Windows as some suggest.

  2. The truth is out there by Teahouse · · Score: 2, Interesting

    The truth is out there, we will finally get to it without signing a SCO NDA. This should end the case before it begins. SHRED ON!

    --
    "Curiosity killed the cat, but for a while I was a suspect."- Steven Wright
  3. Re:SCO! by jmv · · Score: 2, Interesting

    The thing is that the hashes could be generated by any organisation that has access to the SysV source code. There are many of them (IBM being one).

  4. Can Someone Explain? by Klync · · Score: 2, Interesting

    If you're comparing two sets of code vis. their MD5 sums, then won't that miss matching lines that differ by even one character - like, say, a space?

    --

    ----
    Not to be confused with Col.
    1. Re:Can Someone Explain? by stratjakt · · Score: 4, Interesting

      Perhaps if you parsed them both, and compared the resulting object code, right before compilation?

      That way if your variable is called numOfPorts and mine is called countOfPorts, the parsed code is the same for both, when stuff like that becomes meaningless.

      Even if not, SCO seems to be saying that much of the code is copy-n-paste anyways.

      --
      I don't need no instructions to know how to rock!!!!
  5. Other uses? by Not_Wiggins · · Score: 4, Interesting

    It might be interesting to see how different families of Linux/Unix compare... maybe generate a veritable "family tree" of relationships.

    Of course, that also depends more on how differences are actually calculated. Still, could make an interesting project to relate OSes based on how much shared code they still retain and show it in a graphical tree format, ala "family tree." 8)
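
One possible way to put a number on "how much shared code they still retain", sketched under assumptions: treat each tree as a set of shred digests and score each pair by Jaccard similarity (shared digests divided by all digests). The tree names and digest values below are made up; nothing here comes from real source trees.

```python
from itertools import combinations

def jaccard(a, b):
    """Shared digests divided by all digests seen in either set."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Hypothetical shred-digest sets for three source trees.
trees = {
    "A": {"h1", "h2", "h3", "h4"},
    "B": {"h1", "h2", "h3", "h9"},
    "C": {"h1", "h7", "h8", "h9"},
}

scores = {
    (x, y): jaccard(trees[x], trees[y])
    for x, y in combinations(sorted(trees), 2)
}
closest = max(scores, key=scores.get)
print(closest, scores[closest])
```

Pairwise scores like these could then be fed to any clustering method to draw the tree; which similarity measure is appropriate is exactly the open question raised above.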

    --
    Diplomacy is the art of saying, "Nice doggie!" until you can find a rock.
  6. What respect? by Anonymous Coward · · Score: 3, Interesting

    Most people *I* know consider ESR to be a bloated windbag with a penchant for fanatical gunrights. He's regarded as pretty much being on the same level as the late Jon Katz.

  7. Be careful... by nolife · · Score: 4, Interesting

    The more points you discover and disprove now with SCO's claims, the higher quality, more refined, and more detailed SCO's evidence will be when this setup finally gets to a court in front of a judge. If they had gone to court two months ago, or even today, they would have been sent home quickly with basically easy-to-disprove evidence. With the help of the open source community, they are slowly changing their weapon of choice from a shotgun to a rifle.

    --
    Bad boys rape our young girls but Violet gives willingly.
  8. Re:Nah... by jmv · · Score: 3, Interesting

    That's true in general. However, SCO has explicitly stated that thousands of lines of code have been illegally copied *verbatim* from System V. This tool could at least prove that they lied (because of the verbatim copy allegation).

  9. Re:fire the "laser" by be-fan · · Score: 3, Interesting

    You know the sad thing about all this? I can't tell the difference between the auto-generator or your average Slashdotter. Does this mean that the auto-generator passes the Turing Test, or that the average Slashdotter doesn't?

    --
    A deep unwavering belief is a sure sign you're missing something...
  10. Nonsensical idea by YU+Nicks+NE+Way · · Score: 1, Interesting

    Great. So cool. And so stupid.

    First, IBM, Sequent, SGI and Linux wouldn't be off the hook if the provenance of each line of code were proven to have come from other sources. There are a number of trade secret issues that still could crop up.

    But let's assume that Raymond's work was actually run on the SCO source and on Linux. Would the results be meaningful?

    No.

    Suppose I have a routine that comes originally from source B. I work for a company which has the right to copy B, but which redistributes the results of its work under a closed license. Call that new source S. It so happens that the code my company got from B had a nasty bug in it, and I spent a month finding a fix for that bug. Suppose also that the fix is quite small relative to the original code, as is usually the case. A shredder is going to find significant similarities between that routine as implemented in source B and in source S. Now, suppose source L comes along. The authors of L had the right to copy from B, but not from S. They have a very similar routine, originally derived from B. After shredding, the routines in B, S, and L will all look similar -- but whether there's an infringement between S and L will depend solely on a tiny fragment of the code. Without disclosing that fragment, there is no way to determine if there's an infringement or not.

  11. IBM has a project called History Flow by TedTschopp · · Score: 5, Interesting

    This is perhaps a better project and it would be interesting to see this tool run against the source.

    History Flow The following is from their website:

    history flow
    visualizing dynamic, evolving documents and the interactions of multiple collaborating authors:

    Motivation
    Most documents are the product of continual evolution. An essay may undergo dozens of revisions; source code for a computer program may undergo thousands. And as online collaboration becomes increasingly common, we see more and more ever-evolving group-authored texts. This site is a preliminary report on a simple visual technique, history flow, that provides a clear view of complex records of contributions and collaboration.

    --
    Fantasy remains a human right; we make in our measure and in our derivative mode... -- JRR Tolkien
  12. Its been around for years by Anonymous Coward · · Score: 3, Interesting

    check out this research project coming out of berkeley CAP

    Drop in the code you are interested in and it will tell you where it's found in a bunch of open source stuff, including the linux kernel.

  13. Re:The real question is: by Anonymous Coward · · Score: 1, Interesting

    That's a legitimate question. If the CS students that code Linux have learned anything it's how to obfuscate someone else's code to avoid getting caught cheating. Hell, I do it all the time when I don't want to write a stupid program for class. Obfuscation is an artform.

  14. Who says SCO gets to court first? by JoeBuck · · Score: 4, Interesting

    If we can show that SCO's violating the BSD license, maybe we can convince some BSD copyright holder to sue them first, and demand as part of discovery the MD5 checksums from "shred", showing duplicated BSD code but no duplicated BSD copyright.

    1. Re:Who says SCO gets to court first? by Anonymous Coward · · Score: 1, Interesting

      The various BSD flavours are next in line to feel SCO's wrath after they're done with Linux. SCO has already said they have issues with BSD.

  15. Obfuscation Observations by Anonymous Coward · · Score: 1, Interesting
    So, this method of identifying copied code would only work if the code had never been run through an obfuscator. It would also be defeatable by running the source through a script to have its variable names search-and-replaced with similar names (such as replacing every variable name with a new name consisting of the old name plus "_newname")
    But the sources for Linux are already available, so SCO knows the OS community isn't cheating - it can independently run the shred algorithm and prove the same results come out. SCO, OTOH, is refusing to let anyone see the code that isn't under NDA. This lets one or more persons who are willing to forego ever writing kernel code walk into SCO's office, run shred on the code, and walk out without even a copy of the code. And while SCO would be able to obfuscate, they have no incentive to do so - on the contrary, they have the strongest incentive to keep everything exactly the same, if not to fudge the other direction by renaming variables to match the ones in Linux.

    Then we'll have the ability to identify the lines of code in the Linux tree that are the same as lines that SCO says is in their codebase. And show exactly where that code came from. Why? Because the process is OPEN! There are LKML archives with all-out flamewars over some of that code. There are companies whose legal departments have vetted the code they've contributed, and have files that document the process in excruciating detail.

    We will undoubtedly find some of the 'copied' code is in fact BSD code, and the shred algorithm will show that the code differs exactly where the California Regents' copyright notice has been taken out, which will prove that SCO violated not only the GPL, but the BSDL as well. And just like AT&T before them, they'll lose big.

    Posting as AC from work, but you know who I am...
    SVM, ERGO MONSTRO

  16. Re:Slim to None by JoeBuck · · Score: 4, Interesting

    But IBM already has a copy of SCO's code; they licensed it after all. They can release the output of "shred" without violating their agreements with SCO.

  17. Re:Slim to None by k98sven · · Score: 2, Interesting

    Proprietary (closed) source companies have a tremendous advantage over open source software when it comes to violating intellectual property. Who will ever know if they did it? A source code "comparator" eliminates that crucial advantage

    Not really.. Open-source software usually has a nice setup with mailing-lists, CVS, etc. Most of the code is well accounted-for. The same is not as true with a lot of proprietary software.

    Remember, it's not enough that two pieces of code match to prove an infringement in court.
    In fact, the court will most likely take into consideration the fact that the defending code is open-source, and the burden of proving that they originated the code would be increased for the plaintiff.

    Also, failing to prove that they originated the code could leave them open to a countersuit in which the tables would be turned against them, since they obviously had access to the open-sourced code.

  18. derivative work? by donutz · · Score: 4, Interesting

    Presumably any of the many people with legal rights to SCO source code can publish the hash list without divulging any of SCO's (ahem) "IP".

    Would these hashes of SCO source code be considered derivative works? That could have copyright implications...

    1. Re:derivative work? by ls+-lR · · Score: 2, Interesting

      I think we all agree that the obvious "duh" answer is that "of course they wouldn't be derivative works." But SCO has proven that it has a knack for just making stuff up or interpreting things funny. However, even based on the letter of the law I don't think this would qualify as a "transformation." That would seem to apply to a case where you shift the representation of the data to a different format but retain its essence, such as copying a DVD to a VHS tape. However, creating MD5 sums does not seem like it would be a transformation in that sense, in that the new work has none of the qualities of the original -- it's not code, it won't compile, it cannot be used to divine any algorithms, methods, etc. In short, it's completely useless, other than for comparing to other source code fragments.

  19. Not as useful in court by klui · · Score: 2, Interesting

    It will just tell someone two trees are similar/identical. The important thing to prove in court is who copied from whom.

  20. Re:maybe... by Anonymous Coward · · Score: 1, Interesting

    Would the output of comparator, MD5sums of overlapping lines of code, be considered a derivative work for copyright purposes?

    Possibly more relevant, would the terms of MS's shared source license allow it to be used or distributed?

    I anticipate this tool will be useless more often than not, simply because the slightest systemic change would result in zero matches. Replacing tabs with spaces, two spaces with three, or even line-feeds with carriage-returns would yield 100% false negatives if you use this to identify copyright violations.
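
A small sketch of the concern, and of one common workaround, under assumptions: MD5 digests of two lines that differ only in leading whitespace do not match, but they do match once each line has its whitespace runs collapsed before hashing. Whether comparator itself normalizes anything is not stated in this thread; `normalize` and `md5_line` are hypothetical helpers.

```python
import hashlib

def md5_line(line):
    """MD5 hex digest of a single line of text."""
    return hashlib.md5(line.encode("utf-8")).hexdigest()

def normalize(line):
    """Collapse every run of whitespace to one space and trim both ends."""
    return " ".join(line.split())

tabbed = "\tcount = count + 1;"
spaced = "    count = count + 1;"

raw_match = md5_line(tabbed) == md5_line(spaced)
normalized_match = md5_line(normalize(tabbed)) == md5_line(normalize(spaced))
print(raw_match, normalized_match)  # False True
```

Normalizing this way only handles whitespace; a carriage-return difference at the end of a line is also absorbed by `split()`, but renamed variables are not.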

  21. Re:The real question is: by Anonymous Coward · · Score: 1, Interesting

    OSS programmers are usually highly-connected programmers, not lone ones. I haven't met one in years that doesn't have an IRC window open in the background for chitchat while their brain is idling. All the proprietary programmers I've met (mainly shareware authors) have been sad, lonely, bitter, microsoft-haters, while the OSS crowd just treat MS with mild derision.

    Lawyers, professional managers and financial experts all make huge, glaring mistakes, sometimes even making the news.

  22. Don't these guys have anything better to do? by Anonymous Coward · · Score: 1, Interesting

    Seriously. Perens and ESR are fueling SCO's flames by giving them poorly-thought-out statements to cull choice quotes from to support SCO's case. And SCO's words are the only ones seeing mainstream attention (check whose stories are linked from any pages about the SCOX stock prices, and you'll see the public is only getting SCO's side of the story).

    In my most reasonable, humble opinion, anyone who is not an IBM lawyer really needs to STFU concerning this matter. The wise man waits his turn to speak.

  23. Open Source by digidave · · Score: 3, Interesting

    THIS is exactly why Open Source works. It's not because of IBM or Red Hat or geeks from Finland. It's because people in the community are willing to step up to any challenge.

    Thanks, ESR.

    --
    The global economy is a great thing until you feel it locally.
  24. Press release! by mflaster · · Score: 2, Interesting

    Why isn't this a press release?

    If I go to Yahoo, and look at news related to SCOX, this doesn't show up. Here is the open source community trying to help find any misappropriated IP - and no one that doesn't read slashdot/eWeek will know about it!

    Isn't there someone who subscribes to a wire service, that can issue a press release? In order to fight FUD, we have to get info out to people that don't read slashdot!!

    Mike

  25. Someone Did this in June. by Popsikle · · Score: 2, Interesting

    I can't dig up the slashdot post, but here is The Inquirer article from Jun 18th. Someone did this well before esr did.

    It's not new, it's not esr's idea, it's almost 3 months old!!!

  26. Comparison algorithms? by Ivan+the+Terrible · · Score: 2, Interesting

    I'm interested in algorithms that could be used to compare code. Moss and CAP from Berkeley are not interesting because the algorithm is secret (AFAIK).

    What algorithms other than ESR's comparator are there? (I recall but can't locate a recent comment on Slashdot that said something like "most plagiarism detection programs used by professors use the XXXX algorithm".)

  27. Re:Is there really that much data there? by Krach42 · · Score: 2, Interesting
    Actually, some source code is loss-tolerant. Take C for example. In C the only significant whitespace is between any two elements of the set { identifiers, numbers }, and any that occurs in quotes, or character constants.

    Also, comments can potentially be discarded without affecting the compilation of the program.

    Thus, you can take a program:
    int main(void)
    {
    printf("Hello World!\n");

    return 0;
    }
    And turn it into:
    int main(void){printf("Hello World!\n");return 0;}
    You've saved yourself space here. Now, here's the weird thing: I wouldn't expect this to save any space after gzip'ing or bzip'ing. I mean, after all, you're primarily just removing one character. But it turns out that on a particular file of mine:

    -rw-r--r-- 1 dfoesch staff 9184 Sep 9 19:00 navajo.c
    -rw-r--r-- 1 dfoesch staff 3213 Sep 9 18:58 navajo.c.bz2
    -rw-r--r-- 1 dfoesch staff 1832 Sep 9 18:58 navajo.c.nospaces.bz2

    And gzip is the same. This is thus a lossy compression for source code that doesn't actually modify the semantics or syntax of the program. (Of course, this won't work for a language like Python.)

    Yes, the result is unreadable, but then you just run indent, with your favorite coding-style setup, and voilà! It's back to "normal", but different. Just like lossy compression is supposed to work.
    --

    I am unamerican, and proud of it!
  28. Re:No source = no copyright by poptones · · Score: 3, Interesting

    Apparently your reading comprehension skills are right on par with the dolt who modded the post down.

  29. Nobody has mentioned this yet ... by Mostly+a+lurker · · Score: 3, Interesting
    As currently designed, Shred would obviously not defeat deliberate source misappropriation. If (big if) the method could be adapted such that it could not be easily fooled by a determined violator (and without revealing how the code works) then I believe registration of the results should be required by law. BUT ...

    In order that the method should not be fooled by simple changes, at least the following is required

    * White space must be ignored

    * Comparison must be at the statement level, not the code line level

    * Variable names must be replaced by standard placeholders

    * Routine names, other than standard library calls, must be replaced by standard placeholders

    * (Probably difficult) logic will be needed in the tool to detect and ignore noops: how do you deal with

    i++;
    %include noop.i;
    a[i]=b[i];

    The trouble is: a high proportion of the code sections thus simplified will fall into a relatively small number of possibilities, vulnerable to dictionary type attacks. Thus, most of the code could be reconstructed, though admittedly as obfuscated source code. IMHO this provides a valid objection to its use.
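
A sketch of the dictionary-style attack described above, under assumptions: if statements are normalized until only a small number of shapes remain, an attacker can hash a list of likely shapes and look the published digests up in it. The placeholder scheme (`V0`, `V1`, ...) and both lists below are hypothetical.

```python
import hashlib

def digest(statement):
    """MD5 hex digest of one normalized statement."""
    return hashlib.md5(statement.encode("utf-8")).hexdigest()

# Hypothetical digests published for a closed tree after normalization.
published = {digest("V0++;"), digest("V0[V1]=V2[V1];"), digest("return V0;")}

# Hypothetical attacker's dictionary of common normalized statement shapes.
dictionary = ["V0++;", "V0--;", "return V0;", "V0=V1;", "V0[V1]=V2[V1];", "if(V0==V1)"]

# Precompute digest -> statement, then look up every published digest.
lookup = {digest(s): s for s in dictionary}
recovered = sorted(lookup[h] for h in published if h in lookup)
print(recovered)
```

What comes back is the normalized shape, not the original names or comments, which matches the "obfuscated source code" caveat above.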

  30. No Trade Secrets in Registered Copyrights by Iparadox · · Score: 2, Interesting

    I think we are all missing a big point here. SCO registered their copyright in SysV. It was hard to do. They had to create a copy of the source code and file it with the Patent and Trademark Office. That puppy is there so that *we* can look at it. This is specifically *fair use*. It is there so that individuals can protect themselves by comparing what they have to what has been registered. No match means no problem. It just doesn't get any more *fair use* than that. Just have somebody nip up to the PTO, copy the registration for comparison purposes only, (really!) then do the comparisons. How hard is that? Yes, IAAL, but this is not legal advice. Hire your own mouthpiece.

  31. Re:SCO! by Anonymous Coward · · Score: 1, Interesting

    Nobody has to publish the SCO hashes. If they're able to run the tool to create them, they are surely able to run the tool to do the comparison with Linux.

    Now...

    If the comparison shows "no code matches" you can say so. A SCO licensee saying "We've looked into the problem ourselves, and feel Linux is unique." tells nothing about SCO, or their secrets.

    If the comparison suggests SCO adopted Linux code, one would be obligated to report same to proper authorities. NDA, or not. As a licensee failure to report, indeed failure to do the due diligence of this test, may rope you into commission of willful infringement and/or conspiracy to do so.

    If the comparison suggests infringements, and you cannot determine the source, you may be obligated to determine same under rules of due diligence. This would include filing appropriate reports with authorities.

    If the comparison suggests Linux adopted SCO code, then you, as one SCO licensee to another, can likely exchange that information. Further, you or your guilty peers could (and probably must) publish corrections for your error.

    Now, what the comparator won't catch is code that's been "infected" by SCO's newly envisioned concept of the world's first, *truly* deadly, viral license. IBM claims to hold valid copyrights as independent works for code they also contributed to the Unix V codebase. SCO claims to "control" any and all such code, and all that came in contact with it, however remotely. (Yes, I assume SCO feels they now exercise license control over nearly all of IBM's code base assets. Mainframe to wrist watch. I can't imagine how their theory can hold otherwise, actually.)

  32. Re:MD5 easily fooled by dmiller · · Score: 4, Interesting

    So, you've downloaded Comparator, and run tests, then.

    I didn't need to, the following is in the readme:

    comparator does not attempt to do semantic analysis and catch relatively trivial changes like renaming of variables, etc. This is because comparator is designed not as a tool to detect plagiarism of ideas (the subject of patent law), but as a tool to detect copying of the expression of ideas (the subject of copyright law).

    He's wrong BTW (and he is smart enough to know it, which makes this a deliberate deception). A work is no less subject to copyright if someone does a global search and replace on a variable name.

  33. Better yet, a reason to get MS to stop funding SCO by isn't+my+name · · Score: 2, Interesting

    Actually, combine this with the "shared source" program from MS and it would be easy to see if MS did (or did not) copy GPL code into Windows as some suggest.

    More importantly, get something like this accepted in a court of law as a legitimate way to do an initial assessment of code yet still preserve a litigant's right to code privacy, and you are going to have not just MS but a number of big companies shaking in their boots. Not necessarily because they did steal anything but because they have to realize it is a possibility that one of their coders did without company knowledge. Doesn't matter, they are still liable.

    But, get a method like this accepted in a court of law and you are going to see it used again. I think this has a huge potential to hurt closed software. And perhaps a potential to convince MS to stop funding SCO, perhaps even to apply pressure to get them to start backing down.

  34. Re:No source = no copyright by IM6100 · · Score: 3, Interesting

    Get a clue. Nobody who copyrights a work is under any obligation to widely spread around the work. Copyright is inherent in any written work. I can write a poem intended only for my lover, just give the one copy of the poem to that lover, and it's protected by copyright. Break into my lover's house, steal a copy of the poem, and publish it, and you've broken copyright and I have standing to nail you good for it.

    Patents, in order to be patented, need to be fully disclosed. That's inherent in the patent process, you're saying 'this is MY idea, here's the whole deal laid out, I assert that it's mine.' There's no comparable obligation for copyright.

    People like you who try to mush it all up are just trying to loot other people's property.

    --
    A Good Intro to NetBS
  35. Possible improvements by gonvaled · · Score: 2, Interesting

    A lot of comments focus on the problems that a global search and replace will pose to the technique. I think we can improve the algorithm by doing the following:

    What we are looking for here are pieces of code with the same structure: the same for loops, while loops, variable assignment, function names, and so on. The idea would be to substitute all literals by a standard placeholder, and then generate the md5 checksums on the block level (as somebody has previously suggested).

    To be able to cheat this technique, a modification in the structure of the code is required. And in the case that exactly that has been done, it is arguable whether that can be considered copyright infringement.
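
A rough sketch of that proposal, under assumptions: a regex tokenizer (not a real C parser) replaces every non-keyword identifier with `ID` and every integer literal with `NUM`, drops whitespace differences, and hashes the result per block. The keyword list is deliberately tiny and the names are hypothetical. Function names are replaced too, which may be more aggressive than intended.

```python
import hashlib
import re

# A deliberately small keyword list; a real tool would need the full set.
KEYWORDS = {"int", "for", "while", "if", "else", "return", "char", "void"}
TOKEN = re.compile(r"[A-Za-z_]\w*|\d+|\S")

def normalize(code):
    """Replace identifiers with ID and integer literals with NUM."""
    out = []
    for tok in TOKEN.findall(code):
        if tok in KEYWORDS:
            out.append(tok)
        elif tok[0].isalpha() or tok[0] == "_":
            out.append("ID")
        elif tok[0].isdigit():
            out.append("NUM")
        else:
            out.append(tok)
    return " ".join(out)

def block_digest(code):
    """MD5 hex digest of the normalized block."""
    return hashlib.md5(normalize(code).encode("utf-8")).hexdigest()

mine = "for (i = 0; i < numOfPorts; i++) total += ports[i];"
yours = "for (k = 0; k < countOfPorts; k++)\n    sum += p[k];"

same = block_digest(mine) == block_digest(yours)
different = block_digest(mine) == block_digest("while (i < numOfPorts) i++;")
print(same, different)  # True False
```

Mapping every identifier to the same placeholder also discards which names were equal to which, so unrelated blocks with the same shape would collide; numbering placeholders by first appearance would be one way to reduce that.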

  36. Test SCO Linux Kernel Personality by LightSail · · Score: 2, Interesting

    The best use of this technology would be to test the SCO LKP for stolen Linux code.
    Confirming that SCO had incorporated Open Source code that they had access to under the GPL would destroy their credibility and open them up to countersuits. The process would only have to reveal enough similarities to have subpoenas ordered for the actual code involved. Then we could prove the theft with SCO's own source code.

    I suspect that those who know that Linux code was used to create LKP would come forward once the code has been discovered and posted for all to see.

  37. Check and Mate by Anonymous Coward · · Score: 1, Interesting

    This is the end for SCO. There are only three possibilities here:

    (1) That there is no infringing code that belongs exclusively to SCO. If that is the case, then it's game over for SCO; perhaps followed by jail time for Darl and friends at some federal prison where they would discover a new meaning for the phrase "pump and dump".

    (2) That there seems to be code that was duplicated directly and exclusively from the Sys V source tree and that doesn't originate from any other public source. If it's a trivial amount of code, simply replace it and move on. The unlikely perpetrator alone becomes responsible for any damages to SCO.

    (3) If it's non-trivial you can simply remove it from the kernel as long as it doesn't impact anyone seriously. Make it part of the final 2.6.0 or a 2.4.x interim kernel release. For example, let's say the IBM journaling file system is exactly the same. Simply remove it from the kernel until such time as IBM settles its lawsuit and resolves those matters. As long as people don't really need JFS, why encumber the kernel? I've never used it, preferring ReiserFS or ext3. Same goes for other supposed code expressions such as RCU or NUMA, although I suppose the copyright issues on those would be easy to solve since the amount of code in question is on the order of 5000 lines. If all you have to do is change the way it's expressed, that should be trivial. In any case, derivative works laws shield any code that is specifically tied to hardware implementation from being considered before the court.

    Bottom line. SCO is really screwed now. Their only recourse is hope beyond hope that they can get someone to agree with their derived works claim on some non hardware/software patent code. At that point the only thing they can do is get compensated by the infringing party. No way they will be able to shake down linux users since they will already have been paid.

    Oh, I forgot the fourth possibility -- that no one that has access to the Sys V sources will be willing to run 'comparator' on it and generate shreds and that SCO will also refuse. This of course would be sufficient to dismiss their case. A declaratory judgement could be handed down by the federal judge supported by expert testimony that 'comparator' is a valid comparison. Any reasonable expert Software Engineer/Cryptographer would do. Perhaps Bruce Schneier could be the expert witness.

    End of story. Thank you for playing SCO, please drive through.