Slashdot Mirror


ESR to Shred SCO Claims?

webmaven writes "According to this article in eWEEK, ESR has released a utility called comparator for analyzing the similarity of source code trees. The technical details are interesting, in that ESR says he is using an implementation of a refined version of the 'shred' algorithm, with higher performance (on machines with enough RAM) than other versions. ESR won't say whether he intends the comparator to be used to compare older Unix code to Linux so as to be able to refute SCO's claims, but it's obviously well suited for such a purpose. Interestingly, as the shred algorithm can run reports on source trees using only the MD5 signature shreds (once generated), it is possible to use it to compare trees without direct access to the source code itself, leading to a possible use in comparing various proprietary source trees with each other and with Freely available code bases such as Linux and *BSD without requiring actual disclosure of the proprietary source code (a neutral third party could generate the shreds on a company's premises, and leave without taking a copy of the source with them). I'll be interested to see if (or which of) the proprietary vendors allow their source trees to be 'shredded' for such comparisons, and whether this becomes a standard forensic technique in source-code copyright and trade-secret disputes."
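
The source-free comparison described in the summary comes down to intersecting two lists of digests. A minimal sketch in Python, assuming each party has already hashed its own chunks privately; the helper names and sample chunks are hypothetical and this is not comparator's actual report format:

```python
import hashlib

def digest(chunk):
    """MD5 hex digest of one chunk of source text."""
    return hashlib.md5(chunk.encode("utf-8")).hexdigest()

def common_digests(published_a, published_b):
    """Return the digests present in both published lists, sorted."""
    return sorted(set(published_a) & set(published_b))

# Hypothetical data: each side hashes its own chunks and publishes
# only the digests, never the text itself.
vendor = [digest(c) for c in ["a = b;\n", "free(p);\n", "return 0;\n"]]
public = [digest(c) for c in ["x = y;\n", "free(p);\n"]]
shared = common_digests(vendor, public)
```

Only the digest lists cross the boundary; a match says that some chunk is identical, not what it contains.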

554 comments

  1. maybe... by b17bmbr · · Score: 4, Funny

    microsoft can just shred their source tree and start anew. maybe...

    --
    My problem? I was perfectly gruntled, until some numbnuts came by and dissed me.
    1. Re:maybe... by jmv · · Score: 5, Interesting

      Actually, combine this with the "shared source" program from MS and it would be easy to see if MS did (or did not) copy GPL code into Windows as some suggest.

    2. Re:maybe... by Anonymous Coward · · Score: 3, Informative

      Look at how Microsoft is directly trying to bias the case with one-sided news. Check out, from today, the msn.com article from supposed tech analyst Jonathan Cohen,

      then read:
      More on Jonathan Cohen

      Microsoft MSN a biased propaganda machine. Only shows one side of the facts (the lies).

    3. Re:maybe... by Anonymous Coward · · Score: 0

      Well, first they'll be more than happy to publish this crap.

      I'm really annoyed that CNBC would even relay this info, considering everything that's going on. If it wasn't for Maria Bartiromo, I wouldn't watch at all...

    4. Re:maybe... by fireboy1919 · · Score: 4, Insightful

      Right. Because as we all know, people who pay Microsoft the huge bag 'o money that it costs to see their source are primarily interested in the pursuits of OSS to see if Microsoft has copied anything it shouldn't have. And Microsoft's NDA surely gives them the right to do this.

      If anyone is able to prove Microsoft is doing something illegal via the shared source initiative, they'll probably have to do it illegally.

      --
      Mod me down and I will become more powerful than you can possibly imagine!
    5. Re:maybe... by toast0 · · Score: 3, Insightful

      If you've licensed code from Microsoft, and it turns out to be GPL, the license under which you got the code is invalid, so it wasn't illegal to determine if they improperly took code.

      On the other hand, if all their code checks out, testing for that may violate their NDA, but it'd be difficult for them to show you checked their code if you don't mention it.

    6. Re:maybe... by Anonymous Coward · · Score: 1, Interesting

      Would the output of comparator, MD5sums of overlapping lines of code, be considered a derivative work for copyright purposes?

      Possibly more relevant, would the terms of MS's shared source license allow it to be used or distributed?

      I anticipate this tool will be useless more often than not, simply because the slightest systemic change would result in zero matches. Replacing tabs with spaces, two spaces with three, or even line-feeds with carriage-returns would yield 100% false negatives if you use this to identify copyright violations.

    7. Re:maybe... by spectecjr · · Score: 1

      Would the output of comparator, MD5sums of overlapping lines of code, be considered a derivative work for copyright purposes?

      Shouldn't be... after all, it's just a relatively unique identifier for the information - it doesn't actually contain any information.

      Think of an MD5 sum as exceptionally lossy compression.

      --
      Coming soon - pyrogyra
    8. Re:maybe... by Anonymous Coward · · Score: 0

      good question... an integer number should not be copyrightable, or is it? regarding the trivial changes, it is always possible to use a C beautifier or a simple indenter before calculating the md5. however, if the sources analyzed are in the hands of more than one person it is possible to detect such things by comparing more copies of the same sources. for example sco is not the only one with access to systemV code.

    9. Re:maybe... by mobets · · Score: 1

      ... slightest systemic change would result in zero matches.

      I thought of that too, but then I thought, they could just strip out all white space. Then do it with and without the comments. Although, it would be possible to beat this if you change all of the variable names, or constant strings so that there is at least one change every other line. Maybe the shredder ignores the variable names too?

      --

      It was me, I did it, I moved your cheese
    10. Re:maybe... by DugzDC · · Score: 1

      Lazy, and have only scanned the above, but how can this really be used to compare source trees?
      My understanding of hashes is that a decent alg. will flip around 50% of bits in the ciphertext for any single bit of change in the plaintext. So unless source files are identical, hashes generally won't match. I doubt source files could be planted straight in - some kind of integration would be required (i.e. source code mods). So how could this be used to find 'dodgy' code? (Although this is a waste of processor time anyway. Everyone knows this is balls.)
      But, as I said above, this is /. and I ain't read the above, just getting in some bedtime reading. Enlighten me. But quick, my eyelids are drooping...
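
The avalanche behaviour mentioned here is easy to observe directly. A small sketch in Python using the standard hashlib module; the function name is hypothetical. It counts how many of the 128 MD5 output bits differ between two inputs, which is why whole-file hashes are useless for near-matches and why hashing small windows of lines is the more plausible approach:

```python
import hashlib

def bit_difference(a: bytes, b: bytes) -> int:
    """Count the differing bits between the MD5 digests of a and b."""
    da = int.from_bytes(hashlib.md5(a).digest(), "big")
    db = int.from_bytes(hashlib.md5(b).digest(), "big")
    return bin(da ^ db).count("1")

# Identical inputs give identical digests; a one-character change
# typically flips roughly half of the 128 output bits.
same = bit_difference(b"int x = 5;", b"int x = 5;")
changed = bit_difference(b"int x = 5;", b"int x = 6;")
```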

    11. Re:maybe... by Webmonger · · Score: 1

      But lossily compressed things are derived works-- Otherwise the RIAA wouldn't be suing 12-year-old P2P users. I'd draw the distinction that with lossy compression, the expression isn't lost. Whereas with MD5 (and other hashes), the expression is irretrievably lost.

    12. Re:maybe... by Parsec · · Score: 1

      What if you compiled the code? If debugging information is stripped, shouldn't "x=5" be the same as "myVariable=5; /* blah blah blah */"?

    13. Re:maybe... by Courageous · · Score: 5, Insightful

      And Microsoft's NDA surely gives them the right to do this.

      A term in any contract, including any NDA, as stipulated by any party, which would obligate the other party to not report a violation of law, either statute or criminal, is PER SE unlawful and cannot be enforced within any jurisdiction of most first world countries. Any contract bearing such a stipulation would in fact be at significant risk of invalidating the ENTIRE contract, not just the unlawful provisions therein.

      C//

    14. Re:maybe... by Anonymous Coward · · Score: 0
      You are a fucking genius! Microsoft is biased toward Microsoft! Stop the presses!

      Maybe for your next post you could tell us that the sun is going to rise tomorrow morning.

    15. Re:maybe... by spectecjr · · Score: 1

      But lossily compressed things are derived works-- Otherwise the RIAA wouldn't be suing 12-year-old P2P users. I'd draw the distinction that with lossy compression, the expression isn't lost. Whereas with MD5 (and other hashes), the expression is irretrievably lost.

      Only because the MD5 hash is nearly completely lossy. Once you can't retrieve any useful information, it's no longer a derived work.

      For example, if you compressed an MP3 file down to 128 bits, I'm pretty certain that no-one would sue you for copyright infringement if you then spread it over a p2p network - mainly because at that point, you wouldn't have anything recognizable left. It'd be reasonably unique noise, but noise is all it would be.

      --
      Coming soon - pyrogyra
    16. Re:maybe... by Webmonger · · Score: 2, Informative

      Hashes and lossy compression are different things. They're designed for completely different purposes and implemented for the purpose they serve. That's why LAME won't compress an mp3 to less than 8kbps, much less 128 bits. It's why md5sum doesn't have a --reproduce-original switch.

      For a given input and parameters, any two (independently-developed) MP3 encoders will almost certainly produce different outputs. For a given input and parameters, different md5 implementations will produce the same result.

    17. Re:maybe... by spectecjr · · Score: 1

      Hashes and lossy compression are different things. They're designed for completely different purposes and implemented for the purpose they serve. That's why LAME won't compress an mp3 to less than 8kbps, much less 128 bits. It's why md5sum doesn't have a --reproduce-original switch.

      For a given input and parameters, any two (independently-developed) MP3 encoders will almost certainly produce different outputs. For a given input and parameters, different md5 implementations will produce the same result.


      That's because MP3 encoders are not specified by the MPEG specification. Only the decoders are. Therefore, two different independently-developed encoders will generate two different results -- because they use two different algorithms to compress the data.

      On the other hand, MD5 hashing is a very different kettle of fish -- it's a defined algorithm for performing the encoding.

      Sure, when it comes down to the nitty gritty, they perform two different functions. Conceptually, however, (and in terms of information theory) they're very very similar operations.

      --
      Coming soon - pyrogyra
    18. Re:maybe... by inode_buddha · · Score: 2, Funny

      Cool! This looks like the *perfect* tool to sort and find dupes in my pr0n collection...

      (AFAIK, nobody ever said the input to md5sum had to be human-readable)

      --
      C|N>K
    19. Re:maybe... by arivanov · · Score: 1
      strip out all white space.

      Changing file names and directory layouts should have similar effect. Unless you start comparing every file with every file which will make this a classic n-square problem and that is a well known way to climb a staircase going down.
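
The n-square worry goes away if chunks are looked up by hash rather than compared file against file. A rough sketch in Python, assuming each tree has already been reduced to a mapping from file name to a list of chunk digests; the function names and the sample trees are hypothetical:

```python
from collections import defaultdict

def index_chunks(tree):
    """Map each chunk digest to the files in `tree` that contain it."""
    index = defaultdict(list)
    for filename, chunks in tree.items():
        for chunk in chunks:
            index[chunk].append(filename)
    return index

def shared_chunks(tree_a, tree_b):
    """Return {digest: (files_in_a, files_in_b)} for digests found in both trees."""
    index_a = index_chunks(tree_a)
    matches = {}
    for filename, chunks in tree_b.items():
        for chunk in chunks:
            if chunk in index_a:  # one dictionary lookup per chunk
                matches.setdefault(chunk, (index_a[chunk], []))[1].append(filename)
    return matches

# File names and directory layout differ, but the shared digest is still found.
tree_a = {"kern/malloc.c": ["h1", "h2"], "kern/sched.c": ["h3"]}
tree_b = {"mm/alloc.c": ["h2", "h9"]}
result = shared_chunks(tree_a, tree_b)
```

Each chunk is touched once per tree, so the cost grows with the total number of chunks, not with the number of file pairs.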

      --
      Baker's Law: Misery no longer loves company. Nowadays it insists on it
      http://www.sigsegv.cx/
    20. Re:maybe... by Webmonger · · Score: 1

      If "MD5 hashing is a very different kettle of fish", why should we think of it as a kind of lossy compression?

      Conceptually, hashing is a way of producing a fairly unique identifier for something, while compression is about reducing redundancy in the storage format, and lossy compression is about treating non-redundant things as redundant in order to achieve higher compression ratios.

      You can say they're the same, but you also say they're not. I say they're not.

    21. Re:maybe... by Eunuchswear · · Score: 1

      an integer number should not be copyrightable, or is it?

      A CD is just an integer number. As is a DVD, or indeed any digitised information.
      --
      Watch this Heartland Institute video
    22. Re:maybe... by joostje · · Score: 1

      A term in any contract, including any NDA, as stipulated by any party, which would obligate the other party to not report a violation of law, either statute or criminal, is PER SE unlawful

      So, sure, you're allowed to report it -- but were you allowed to find out whether they violated the law in the first place?

    23. Re:maybe... by Glock27 · · Score: 1
      Right. Because as we all know, people who pay Microsoft the huge bag 'o money that it costs to see their source are primarily interested in the pursuits of OSS to see if Microsoft has copied anything it shouldn't have. And Microsoft's NDA surely gives them the right to do this.

      I'm pretty sure the NDA will allow the generation of MD5 checksums from Microsoft source. There might be a modification in the next version. ;-)

      Those MD5 'shreds' are no longer Microsoft property.

      --
      Galileo: "The Earth revolves around the Sun!"
      Score: -1 100% Flamebait
    24. Re:maybe... by Stephan+Schulz · · Score: 3, Informative
      I anticipate this tool will be useless more often than not, simply because the slightest systemic change would result in zero matches. Replacing tabs with spaces, two spaces with three, or even line-feeds with carriage-returns would yield 100% false negatives if you use this to identify copyright violations.
      I've read the man page that comes with the program, and such things are taken care of. There is an option that will ignore horizontal and vertical white space for comparison purposes, and another one that ignores curly braces (possibly as bad a source of false negatives as formatting).

      All in all, it seems to be quite a nice little tool.
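
The kind of normalization described above can be sketched in a few lines of Python. This is only an illustration of the idea: the function names and keyword options are hypothetical and are not comparator's actual flags or implementation.

```python
import hashlib
import re

def normalize(line, ignore_whitespace=True, ignore_braces=True):
    """Strip the formatting details that should not affect matching."""
    if ignore_braces:
        line = line.replace("{", "").replace("}", "")
    if ignore_whitespace:
        # Removing all whitespace is fine for matching, not for compiling.
        line = re.sub(r"\s+", "", line)
    return line

def normalized_digest(lines, **options):
    """MD5 of the normalized lines, skipping lines that became empty."""
    kept = [normalize(line, **options) for line in lines]
    kept = [line for line in kept if line]
    return hashlib.md5("\n".join(kept).encode("utf-8")).hexdigest()

# The same statement in two brace and indentation styles.
k_and_r = ["if (x) {", "\ty = 1;", "}"]
allman = ["if (x)", "{", "    y = 1;", "}"]
```

With both options on, the two styles hash identically; with both off, they do not.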

      --

      Stephan

    25. Re:maybe... by pmz · · Score: 1

      moneycentral.msn.com

      Well, at least their priorities are clear.

    26. Re:maybe... by SillySlashdotName · · Score: 1

      I am not the original poster, and (s)he is already responding to this thread, but I think you are both arguing the same thing.

      With a MD5 sum/hash, I should be able, given enough motivation, time, and resources, to recreate the original exactly, so it is a non-lossy compression.

      I do understand that theoretically you could find two different sequences that hash to the same values, so it may be somewhat lossy.

      I think the original poster is saying that there is little difference between a compression technique that removes so much information that you can not get back to the original, and MD5 sums where you COULD get back to the original except that it would take more time, computational power and resources than are available - that in effect you also can't get back to the original from the MD5 sum.

      You are right, they are NOT the same thing. The original poster is right, they are effectively the same thing.

      --
      Acts of massive stupidity are almost never covered by warranty. --me.
    27. Re:maybe... by Courageous · · Score: 1

      I do not understand your question. The contract will not be able to lawfully sustain a provision that declares something like the following: "you have our source code, do with it what you please, but do not compare it to other source code to see if we are breaking the law." Such a provision would be unlawful.

      C//

    28. Re:maybe... by Webmonger · · Score: 1

      With a SHA-1 hash, you should not be able to recreate the original. This is an important property of hashes.

      http://www.itl.nist.gov/fipspubs/fip180-1.htm

      "The SHA-1 is called secure because it is computationally infeasible to find a message which corresponds to a given message digest, or to find two different messages which produce the same message digest."

      Actually, they used to say the same thing about MD5, but nowadays, there's some concern that MD5 may be breakable. As far as I know, it hasn't actually been broken, but given enough time, motivation, resources. . .

      For secure uses, SHA-1 is the recommended algorithm nowadays, because it's considered impossible to reconstruct the original from a hash.

    29. Re:maybe... by Trepalium · · Score: 1

      I believe this tool treats the code as one big file. Unfortunately, though, no matter how you do it, it's a computationally intensive process. However, I believe his program is broken. Unless, of course, there is no duplicated code between 4.4BSD-Lite2 and Linux, or Linux 2.6.0-test4 and Linux 2.6.0-test4 (yes, comparing the source tree to itself).

      --
      I used up all my sick days, so I'm calling in dead.
    30. Re:maybe... by SillySlashdotName · · Score: 1

      "The SHA-1 is called secure because it is computationally infeasible to find a message which corresponds to a given message digest, or to find two different messages which produce the same message digest." (emphasis added)

      "With a SHA-1 hash, you should not be able to recreate the original. " agreed, but only because it is computationally infeasible, not because it is computationally _impossible_.

      But again, that is EFFECTIVELY the same thing!

      --
      Acts of massive stupidity are almost never covered by warranty. --me.
    31. Re:maybe... by Webmonger · · Score: 1

      That's like saying "should not be able to" and "cannot" are EFFECTIVELY the same thing!

    32. Re:maybe... by SillySlashdotName · · Score: 1

      In this case they are.

      You should not be able to recreate the original given a MD5 sum because of the limitations of computational power, time, cost, energy usage, etc.

      is equivalent to

      You cannot recreate the original given a MD5 sum.

      "Should not be able to" in the sense that "we don't think that it is possible, now or in the foreseeable future", not in the sense that you are not allowed.

      Not allowed and not possible are, of course, not effectively - or even remotely - the same at all and is not what I said or meant.

      --
      Acts of massive stupidity are almost never covered by warranty. --me.
    33. Re:maybe... by Nucleon500 · · Score: 1
      for i in *.jpg; do mv -f "$i" `md5sum "$i" | cut -b 1-32`.jpg; done
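
The one-liner above renames each file to its MD5 digest, so byte-identical files end up under one name. A non-destructive variant that only reports the duplicates could look like this in Python; the function name is hypothetical and the sketch reads whole files into memory, which is an assumption that suits small files only:

```python
import hashlib
from collections import defaultdict
from pathlib import Path

def find_duplicates(directory):
    """Group the regular files in `directory` by the MD5 of their contents.

    Returns a list of groups (lists of file names) that share a digest.
    """
    groups = defaultdict(list)
    for path in sorted(Path(directory).iterdir()):
        if path.is_file():
            digest = hashlib.md5(path.read_bytes()).hexdigest()
            groups[digest].append(path.name)
    return [names for names in groups.values() if len(names) > 1]
```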
  2. SCO! by scovetta · · Score: 3, Funny

    Of course, we can just trust SCO to show the right hashes. Why would they lie?

    --
    Wer mit Ungeheuern kämpft, mag zusehn, dass er nicht dabei zum Ungeheuer wird. --Nietzsche
    1. Re:SCO! by jmv · · Score: 2, Interesting

      The thing is that the hashes could be generated by any organisation that has access to the SysV source code. There are many of them (IBM being one).

    2. Re:SCO! by mik · · Score: 5, Insightful

      The point is that we don't need SCO to do anything. Presumably any of the many people with legal rights to SCO source code can publish the hash list without divulging any of SCO's (ahem) "IP". Even more interesting is the theoretical possibility of comparing historical releases of SCO trees against GPL-licensed code, thus (perhaps) demonstrating that SCO has illegally violated the IP of OSS developers. Of course, hash comparisons alone would be unlikely to convince a judge/jury of anything. They ought to be sufficient grounds for some embarrassing subpoenas, and maybe some really neat cease-and-desist orders, though.

    3. Re:SCO! by Anonymous Coward · · Score: 1, Interesting

      Nobody has to publish the SCO hashes. If they're able to run the tool to create them, they are surely able to run the tool to do the comparison with Linux.

      Now...

      If the comparison shows "no code matches" you can say so. A SCO licensee saying "We've looked into the problem ourselves, and feel Linux is unique." tells nothing about SCO, or their secrets.

      If the comparison suggests SCO adopted Linux code, one would be obligated to report same to proper authorities. NDA, or not. As a licensee failure to report, indeed failure to do the due diligence of this test, may rope you into commission of willful infringement and/or conspiracy to do so.

      If the comparison suggests infringements, and you cannot determine the source, you may be obligated to determine same under rules of due diligence. This would include filing appropriate reports with authorities.

      If the comparison suggests Linux adopted SCO code, then you, as one SCO licensee to another, can likely exchange that information. Further, you or your guilty peers could (and probably must) publish corrections for your error.

      Now, what the comparator won't catch is code that's been "infected" by SCO's newly envisioned concept of the world's first, *truly* deadly, viral license. IBM claims to hold valid copyrights as independent works for code they also contributed to the Unix V codebase. SCO claims to "control" any and all such code, and all that came in contact with it, however remotely. (Yes, I assume SCO feels they now exercise license control over nearly all of IBM's code base assets. Mainframe to wrist watch. I can't imagine how their theory can hold otherwise, actually.)

    4. Re:SCO! by sketerpot · · Score: 1

      If so (and it certainly seems probable) then why haven't people been looking for dupes before this, rather than just demanding that SCO cough up their alleged code or shut up? I mean, it's obvious that SCO isn't going to listen to reason, so why not take things into your own hands?

    5. Re:SCO! by jmv · · Score: 1

      I think the answer is legal considerations. It's easy to run into legal trouble when doing that. Even ESR won't say if he wrote the program to refute SCO's arguments.

  3. Is there really that much data there? by More+Karma+Than+God · · Score: 4, Funny

    If there is, why couldn't MD5 shreds be used as a lossy compression scheme for code?

    --
    Go here to create your own Slashdot dis
    1. Re:Is there really that much data there? by Sterling+Christensen · · Score: 2, Insightful

      Because lossy compression would be useless. When decompressed the source code wouldn't work anymore.

      Source code isn't loss-tolerant (or whatever)

    2. Re:Is there really that much data there? by Paradox · · Score: 3, Informative

      No. Hashes are one way functions. So it'd be kinda pointless. Further, comparing two hashes for anything but equality is meaningless with most good hashing schemes (unless you're a cryptographer).

      --
      Slashdot. It's Not For Common Sense
    3. Re:Is there really that much data there? by Anonymous Coward · · Score: 2, Funny

      Umm... why would you want lossy compression for code? Perhaps if it only lost the bugs?

    4. Re:Is there really that much data there? by Anonymous Coward · · Score: 0

      What good is lossy compression for code? What will you do when you uncompress your project and find it full of bugs?

    5. Re:Is there really that much data there? by Anonymous Coward · · Score: 0

      It's always funny reading the comments of the humor-challenged.

    6. Re:Is there really that much data there? by (startx) · · Score: 1

      the key is in your own sentence. lossy. That implies you are losing information. It would be bad to lose any piece of anything you want to keep. image and sound compression can be lossy because they toss out parts we don't see or hear anyway, but do you really want to lose pieces of code?

    7. Re:Is there really that much data there? by RexHowland · · Score: 0, Redundant

      But what would the hash be of? Would each line of code be a separate hash, or would lines be combined?

      Wouldn't altering one letter in the code completely change the hash? If so, all you would need to do to avoid detection would be to make a few changes to minor things, and you would appear to have different hashes, even if the source code were essentially the same.

    8. Re:Is there really that much data there? by rmull · · Score: 3, Funny

      Depends. Who wrote it?

      --
      See you, space cowboy...
    9. Re:Is there really that much data there? by mz001b · · Score: 1
      If there is, why couldn't MD5 shreds be used as a lossy compression scheme for code?

      No, you are looking for this

    10. Re:Is there really that much data there? by Paradox · · Score: 1

      Yeah. See my other post in this thread, where I ask just that. I don't think that this tool is meant for comparing source trees. ESR is smarter than that.

      --
      Slashdot. It's Not For Common Sense
    11. Re:Is there really that much data there? by EverDense · · Score: 1

      No. Hashes are one way functions.

      Hence, the use of the word "Lossy". ;)

      --
      http://jesus.everdense.com/
    12. Re:Is there really that much data there? by karmavore · · Score: 1

      If it was full of bugs before then you will have something else to blame.

      --
      Speech: Free
      Beer: $699.00
    13. Re:Is there really that much data there? by B'Trey · · Score: 4, Informative

      RTFA. The code is split into overlapping "shreds" of three lines. For example, 7 lines of code would generate five hashes, consisting of the following lines:

      1,2,3
      2,3,4
      3,4,5
      4,5,6
      5,6,7

      Two source trees are shredded, then unique hashes are discarded. Anywhere there are three lines of code that are the same ANYWHERE in the source tree, it'll be spotted.

      Now, it's trivial to defeat this if you're specifically aiming to do so. However, for existing source trees (such as nearly countless variations of *nix) that already exist and are duplicated in numerous places, it works nicely. It's impossible to go back and modify the tree because too many copies exist.
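
The shredding described above can be sketched in Python as follows. This is an illustration of the overlapping-window idea only; the function names are hypothetical, and the real tool may differ in normalization and other details.

```python
import hashlib

def shred(lines, size=3):
    """Return (first_line_number, md5_hex) for each overlapping window of `size` lines."""
    return [
        (i + 1, hashlib.md5("\n".join(lines[i:i + size]).encode("utf-8")).hexdigest())
        for i in range(len(lines) - size + 1)
    ]

def common_shreds(lines_a, lines_b, size=3):
    """Shreds of lines_a whose hash also occurs anywhere in lines_b."""
    hashes_b = {h for _, h in shred(lines_b, size)}
    return [(n, h) for n, h in shred(lines_a, size) if h in hashes_b]

# Seven lines yield five shreds, starting at lines 1 through 5.
seven = ["line {}".format(i) for i in range(1, 8)]
# A second file that contains lines 3, 4 and 5 of the first, in order.
other = ["foo", "line 3", "line 4", "line 5", "bar"]
matches = common_shreds(seven, other)
```

Here the only common shred is the one starting at line 3 of the first file, since that is the only window of three lines both files share.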

      --

      "The legitimate powers of government extend only to such acts as are injurious to others." Thomas Jefferson.

    14. Re:Is there really that much data there? by snake_dad · · Score: 1
      It's always funny reading the comments of the humor-challenged.

      Oh please mod up ac.. he's the only one thus far that gets it :/

      --
      karma capped .sig seeking available Slashdot poster for long-term relationship.
    15. Re:Is there really that much data there? by Anonymous Coward · · Score: 0

      but once u decompress it doesnt the contents become useless?

    16. Re:Is there really that much data there? by LiquidCoooled · · Score: 1

      But if the initial check were on the generated assembler tokens, then changing the variable name wouldn't matter would it? - you would just get lines of equal operation - at this point, not necessarily considered copied code, but certainly marked for follow up investigation.

      --
      liqbase :: faster than paper
    17. Re:Is there really that much data there? by B'Trey · · Score: 1

      Well, there are a number of issues to consider with that approach.

      First, you'd be talking about a SEVERE speed hit. You'd have to compile both trees, since compiler options, compiler versions and a great many other things would affect the assembler output.

      Second, part of the utility of this program is that you can distribute the hashes of a source tree without revealing anything about the source itself. While the standard NDA forbids revealing source code, I'd wager that very few of them forbid revealing hashes of the source. This means that IBM could probably release the hash of SCO's source without being in violation of their agreement. It might be possible to generate hashes of the assembler tokens which would have similar properties, but that leads us to the third point.

      I can't say for sure without some testing, but I'd guess that the chances of false positives would be much higher with assembler tokens than with source code. There are a great many utility functions that are relatively simple and straightforward. A modern optimizing compiler may very well come up with very similar assembler code for independently written functions that aren't particularly similar in source code.

      A fourth issue is that assembler code is also much more succinct than source code. Any assembler code is going to contain a great many PUSH AX; POP CX; CMP AL,0; MOV BH, 0; etc. that could easily generate false positives.

      Fifth, opposed to the third and fourth point, there's also the possibility of false negatives. An optimizing compiler may very well generate different assembler code for the same source if the surrounding code is different. A function liberated from one tree and inserted into a different one may be very different at the assembler level.

      Finally, using the assembler tokens would completely ignore comments and some if not all variable names. SCO is correct in that these are key indicators of a common source for two different pieces of code. (What they miss is that they may indicate a common ancestor for the two, not that one is necessarily a copy of the other.)

      This isn't to say that the idea doesn't have merit. But it introduces a great deal more complexity into the situation. I'd say such a program would be more of a complementary tool for ESR's shred program than an alternative or a replacement.

      --

      "The legitimate powers of government extend only to such acts as are injurious to others." Thomas Jefferson.

    18. Re:Is there really that much data there? by Anonymous Coward · · Score: 0

      After looking at the page on lzip I think I will use
      rm instead. That way it even compresses the directory entry.

    19. Re:Is there really that much data there? by Krach42 · · Score: 2, Interesting
      Actually, some source code is loss-tolerant. Take C for example. In C the only significant whitespace is between any two elements of the set { identifiers, numbers }, and any that occurs in quotes, or character constants.

      Also, comments can potentially be discarded without affecting the compilation of the program.

      Thus, you can take a program:
      int main(void)
      {
      printf("Hello World!\n");

      return 0;
      }
      And turn it into:
      int main(void){printf("Hello World!\n");return 0;}
      You've saved yourself space here. Now, here's the weird thing, I wouldn't expect this to save any space after gzip'ing, or bzip'ing. I mean, after all, you're primarily just removing one character. But it turns out that on a particular file of mine:

      -rw-r--r-- 1 dfoesch staff 9184 Sep 9 19:00 navajo.c
      -rw-r--r-- 1 dfoesch staff 3213 Sep 9 18:58 navajo.c.bz2
      -rw-r--r-- 1 dfoesch staff 1832 Sep 9 18:58 navajo.c.nospaces.bz2

      And gzip is the same. This is thus a lossy compression for source code that doesn't actually modify the semantics or syntax of the program. (Of course, this won't work for languages like Python.)

      Yes, the result is unreadable, but then you just run indent, with your favorite coding-style setup, and voilà! It's back to "normal", but different. Just like lossy compression is supposed to work.
      --

      I am unamerican, and proud of it!
    20. Re:Is there really that much data there? by jc42 · · Score: 1

      Actually, there's a really good compression scheme for code that isn't even lossy: Remove all comments and unnecessary white space.

      Of course, some code doesn't compress very much this way ...

      --
      Those who do study history are doomed to stand helplessly by while everyone else repeats it.
    21. Re:Is there really that much data there? by Anonymous Coward · · Score: 0

      A good lossy algorithm compresses by removing the bugs... ;-)

    22. Re:Is there really that much data there? by BubbleNOP · · Score: 2, Insightful

      By removing whitespace you collapsed a number of distinct substrings, i.e. what used to be different substrings of the form A\s+B are now represented as just one substring AB. A smaller set of distinct substrings leads to better compression.
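
The effect is easy to reproduce. A small Python sketch using the standard zlib module; the synthetic sample and the function names are hypothetical, and zlib stands in for gzip here (bzip2 is a different algorithm, so its numbers would differ):

```python
import re
import zlib

def collapse_whitespace(text):
    """Replace every run of spaces and tabs with a single space."""
    return re.sub(r"[ \t]+", " ", text)

def compressed_size(text):
    """Size in bytes of the zlib-compressed text."""
    return len(zlib.compress(text.encode("utf-8"), 9))

# Synthetic sample: the same kind of statement written with varying
# amounts of spacing around the "=" sign.
sample = "\n".join(
    "int v{}{}={}{};".format(
        i % 7, " " * (1 + (i * 7919) % 5), " " * (1 + (i * 104729) % 4), i % 13
    )
    for i in range(300)
)

before = compressed_size(sample)
after = compressed_size(collapse_whitespace(sample))
```

Once the spacing is uniform, lines that differed only in whitespace become identical, so the compressor finds longer repeats.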

    23. Re:Is there really that much data there? by NicenessHimself · · Score: 1

      And isn't a hash a 'derived work'??

    24. Re:Is there really that much data there? by Krach42 · · Score: 1

      Makes sense... but still the point remains, that there is such a thing as lossy compression for C code.

      --

      I am unamerican, and proud of it!
  4. ESR ADMITS TO ENRON PRACTICES by Anonymous Coward · · Score: 5, Funny

    This will only serve as another black eye on the Open Source community. ESR should know better than to shred SCO material prior to a trial.

    1. Re:ESR ADMITS TO ENRON PRACTICES by Anonymous Coward · · Score: 0

      I know that SCO practices Texas style accounting, but I never heard of Enron practices? Is that when you kill the person who knows what happened?

    2. Re:ESR ADMITS TO ENRON PRACTICES by dipipanone · · Score: 1

      No, that would be L. Ron practices.

  5. But the Important Question is... by BlackBolt · · Score: 2, Funny

    Did he write it in Python? And did he complete it in under 6 hours?

    1. Re:But the Important Question is... by TMB · · Score: 2, Informative

      From the README...

      Besides the production C code, the distribution also includes working Python versions. These were used to prototype the concept.

      No word on the latter... but it's ESR... so of course! ;-)

      [TMB]

    2. Re:But the Important Question is... by webmaven · · Score: 1

      If you take a look at the README file, you'll see that he prototyped comparator in Python, but the production code is in C.

      --
      The real Webmaven is user ID 27463. I don't rate an imposter, because my ID is such a lame-ass high number.
  6. I'm more interested in time saved by programmers by 2TecTom · · Score: 1

    ... than the number of lawyers it'll retire.

    --
    Words to men, as air to birds.
  7. Doubt it will help by Brahmastra · · Score: 5, Insightful

    I think the question here is not about whether there is common code between SCO and Linux. There is no doubt that there will be common code because of the common origins. The issue here is that SCO does not own that code.

    1. Re:Doubt it will help by djh101010 · · Score: 4, Insightful

      If there's going to be a line-by-line comparison, this is the tool to do it. Once those lines are identified, *then* it's simply a matter of finding out the origins of them; that's where we can roll it back to a textbook published in 1973 or whatever.

      Until the lines that are common are identified, it's impossible to defend against the accusations. Because of that, I bet Darling Darl won't allow it to be used. The question is, how to turn the inevitable refusal into something that shuts him (up|down).

    2. Re:Doubt it will help by jmv · · Score: 1

      Well, when you find a comment segment in the code, you can always look at the Linux code and see where it comes from. This works the same way as the malloc code from the SCO presentation that was eventually traced to BSD code (or at least something which SCO does not own).

    3. Re:Doubt it will help by afidel · · Score: 1

      Actually it was traced back to code written by Ritchie for a very early version of UNIX and which was placed in the public domain by SCO themselves several years ago. The fact is that SCO has no or almost no legitimate claims of copyright infringement for Linux. They may have a case against IBM for breach of contract but that is VERY unlikely. They are stretching the definition of derivative work to the extreme. Ultimately it is up to a judge.

      --
      There are 4 boxes to use in the defense of liberty: soap, ballot, jury, ammo. Use in that order. Starting now.
    4. Re:Doubt it will help by Azog · · Score: 5, Insightful

      Well, this would still help determine what the common code is.

      If ESR is given the big list of MD5 sums of SCO's kernel by someone who has legitimate access to it, and he runs his shred tool to compare it to the Linux kernel, and a bunch of stuff turns up matching (as expected) he can still see WHAT was matching because he has the Linux sources.

      So then he can look at that and say, "hmmm, it looks like part of this ethernet driver is the same, and this NAT implementation, and bits and pieces of the VFAT filesystem code..." and then, find out how those got to be the way they are in Linux.
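
      The idea of matching against digests alone and then reading the hits off your own tree can be sketched roughly like this. It is an illustration under assumptions, not how comparator itself is written: the function names, the toy file contents, and the plain three-line window without any whitespace handling are all made up here.

```python
import hashlib

def shred_index(files, size=3):
    """Map md5 hex digest -> [(filename, start_line)] for each window of `size` lines."""
    index = {}
    for name, text in files.items():
        lines = text.splitlines()
        for i in range(len(lines) - size + 1):
            chunk = "\n".join(lines[i:i + size])
            digest = hashlib.md5(chunk.encode("utf-8")).hexdigest()
            index.setdefault(digest, []).append((name, i + 1))
    return index

def locate_matches(foreign_digests, local_index):
    """Given only digests from the other tree, list where they occur locally."""
    hits = []
    for digest in foreign_digests:
        hits.extend(local_index.get(digest, []))
    return sorted(set(hits))

# Hypothetical toy trees; the "closed" one never leaves its owner's premises.
closed_tree = {"secret.c": "a();\nb();\nc();\nd();\n"}
open_tree = {"drivers/net.c": "x();\na();\nb();\nc();\ny();\n"}

digests_only = set(shred_index(closed_tree))  # all that gets handed over
print(locate_matches(digests_only, shred_index(open_tree)))
# [('drivers/net.c', 2)]
```

      The side holding only the digests learns which of its own lines matched, and nothing about the non-matching parts of the other tree.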

      If it can be proved that the matching code is totally legit in Linux, (which is what I would expect) then it follows that either (a) SCO actually stole stuff out of Linux, rather than the reverse, or (b) Linux and SCO both took the code from a third source, like BSD.

      Otherwise, option (c) is that Linux actually contains code from SCO which it should not. But this is still an improvement on the current situation, because it would allow the Linux development team to FIX THE PROBLEM.

      Either way, (sooner or later, depending on if Linux fixes are required) it will shoot SCO's claims so full of holes that any reputable journalist reporting on SCO's latest insane claims will have to mention that "... but the source code has been analyzed and all code in Linux similar to SCO's software has been shown to be completely legitimate...", or "... but all code in Linux which SCO might have had a valid issue about has been removed..."

      SCO's big stick right now is FUD. Fear, Uncertainty, and Doubt. The shred tool can remove the uncertainty and doubt. Only SCO will still have the Fear. :-)

      --
      Torrey Hoffman (Azog)
      "HTML needs a rant tag" - Alan Cox
    5. Re:Doubt it will help by Anonymous Coward · · Score: 0

      That is probably deliberate. The whole thing is probably a show trial set up by IBM to set out the exact limits of the "derived works" clause of the GPL. IBM are happy to let it get to court, let the IBM lawyers tell the court what is the official IBM^H^H^HAmerican interpretation of the GPL, and get back to being the most powerful computer company in the world (some people erroneously think that's Microsoft, but it's IBM...).

    6. Re:Doubt it will help by entartete · · Score: 1

      you are making sense and spreading factual information about this matter instead of flailing about wildly. I assume the lawsuit against you from SCO and your paycheck from your masters at IBM are already on the way.

    7. Re:Doubt it will help by YoJ · · Score: 1
      This idea is similar to some ideas used in web searching (a la AltaVista, Google). The problem in websearching is eliminating duplicate documents. You can't just compare hashes of the documents, because duplications often have slightly different tags or headers. Microsoft has a nice algorithm for doing this, I think they call it Shingling. It also involves multiple overlapping hashes from a sliding window, but also some other stuff to speed up comparisons between billions of pages.

      So, not only did Microsoft do this first, they did it better!
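
      For reference, the basic resemblance measure behind shingling can be sketched as below. This is only the textbook set-overlap idea on word windows; the function names and the window size k=4 are arbitrary choices for illustration, and the "other stuff" that makes it scale to billions of pages is not shown.

```python
def shingles(text, k=4):
    """Set of every run of k consecutive words (one short shingle if the text is shorter)."""
    words = text.split()
    return {" ".join(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}

def resemblance(a, b, k=4):
    """Jaccard similarity of the two shingle sets: |A & B| / |A | B|."""
    sa, sb = shingles(a, k), shingles(b, k)
    return len(sa & sb) / len(sa | sb)

doc1 = "the quick brown fox jumps over the lazy dog"
doc2 = "a quick brown fox jumps over the lazy dog"
print(round(resemblance(doc1, doc2), 3))  # 5 shared shingles out of 7 distinct: 0.714
```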

    8. Re:Doubt it will help by MrLint · · Score: 1

      This could be very instructive. Not only to find what the real genetic lineage of any unix variant is, but I also have to wonder if this can be used on data dumps of DNA sequences to get a 'quick' estimate of genetic lineage and drift.

    9. Re:Doubt it will help by RabidChipmunk · · Score: 1

      IBM has to have the code. They can do it as soon as their lawyers give the okay.

      --
      This is not a political statement. This is not legal advice. It's a frick'n Slashdot post. However: I'm Running For
    10. Re:Doubt it will help by Anonymous Coward · · Score: 0
      If it can be proved that the matching code is totally legit in Linux, (which is what I would expect) then it follows that either (a) SCO actually stole stuff out of Linux, rather than the reverse, or (b) Linux and SCO both took the code from a third source, like BSD

      It is, of course, option B that SCO alleges. Their claim is that IBM wrote code, separate from but depending on SCO's Unix, and that under the IBM license terms SCO has the sole right to distribute that code. This means they claim that both Linux and SCO have some lines of code, written by IBM, that IBM should not have given to Linux.

      It's obviously a tough claim to back up. It's equally tough to decisively disprove, because any code that comes out of IBM that would run on a Unix-like system may originally have been written with the intent to use in AIX, which (according to SCO's allegation) makes SCO the owner of right-to-distribute.

    11. Re:Doubt it will help by ComputerSlicer23 · · Score: 1
      No, it wasn't in the public domain. It had a license, and they retained the copyright. To be in the public domain, you can't have any copyright over it.

      It was put under an original BSD style license (BSD w/ advertising), which is incompatible with the GPL. Thus no, no code from there could be used in Linux. However, it was shown that the code was pulled in from *BSD under the current BSD license.

      Kirby

    12. Re:Doubt it will help by Pharmboy · · Score: 1

      Until the lines that are common are identified, it's impossible to defend against the accusations.

      Eventually, SCO is going to have to identify these lines, or they have no case. This program, while "nifty", is not that relevant to the lawsuit. It doesn't matter (legally) what lines this program says are similar, but it matters which lines SCO says are relevant. Disproving that THOSE lines belong to SCO is worthwhile, which this program will not do.

      The burden of proof is upon SCO, not the open source community. It is in all of our interest to aggressively prove the origins of "similar" code (as claimed by SCO, not the program) and remove any infringing code, if any.

      It is NOT our responsibility to find out what lines are similar; it is SCO's, since they are the plaintiff. Eventually, they will have to disclose which lines in Linux they claim are infringing, even if they only disclose their own source code in sealed documents. Again, we really don't care about their code, or what this program thinks is similar. We only need to prove the source claimed by SCO is not their property, or prove that the similar code was cleanroom or similar because it was based upon published standards, and that code was the only logical way to implement it. Or prove they actually stripped away copyright notices on BSD and other software and are now claiming it as their own property. They have to prove their code really IS theirs, as well, and not PD or misappropriated BSD.

      --
      Tequila: It's not just for breakfast anymore!
    13. Re:Doubt it will help by Anonymous Coward · · Score: 0

      There's a very good reason he would not allow it to be used:

      This tool will ONLY find IDENTICAL source lines. If anyone changes even a single character, a var name or even whitespace, from tabs to spaces say, or even line endings, it will fail spectacularly to find code that is otherwise completely identical.

      There are other programs that do a FAR better job at detecting plagiarism. It's just plain stupid to even consider this one.

    14. Re:Doubt it will help by Simon+Brooke · · Score: 1
      If there's going to be a line-by-line comparison, this is the tool to do it. Once those lines are identified, *then* it's simply a matter of finding out the origins of them; that's where we can roll it back to a textbook published in 1973 or whatever.
      Until the lines that are common are identified, it's impossible to defend against the accusations. Because of that, I bet Darling Darl won't allow it to be used. The question is, how to turn the inevitable refusal into something that shuts him (up|down).

      This is an interesting program and a useful move in the game, but it misses the point. (Most of) the code SCO claims is theirs and is in the kernel really is in the kernel. The point is that SCO claim to own anything written by any UN*X licensee ever that's been contributed to Linux. Thus, for example, they claim to own NUMA and RCU code which

      • they didn't write
      • they didn't specify, contribute, or pay for
      • was never a part of any of their products
      • and which they've never had (apart from in the Linux codebase) the source code of

      They're arguing this because the standard UN*X license said something roughly equivalent to 'all your extensions are belong to us'. SCO's claim, essentially, is that anything remotely UN*Xy ever written by IBM or SCO or HP or SGI or... belongs to them, and therefore cannot be contributed to Linux. In the case of the SGI journalled file system I think they may have a case. However, Sequent's work on multiprocessor memory access was not originally written for UN*X, it was written for a proprietary operating system and later ported to UN*X, so I don't believe that SCO can sustain a case to own that; IBM's journalling file system was originally written for OS/2, so I don't believe SCO can sustain a case to own that; and for the rest IBM's side-letter is sufficient protection for everything IBM has contributed.

      --
      I'm old enough to remember when discussions on Slashdot were well informed.
    15. Re:Doubt it will help by ostrich2 · · Score: 1

      SCO's big stick right now is FUD. Fear, Uncertainty, and Doubt. The shred tool can remove the uncertainty and doubt. Only SCO will still have the Fear. :-)

      I say we give the uncertainty back to them. Then they'd have a big FU from all of us.

  8. Nah... by SargeZT · · Score: 4, Insightful

    This shouldn't be relied upon in a court of law. Although I acknowledge that SCO likely has no IP claim over Linux, it should have a fair case. A program that would rule out code similarities does not rule out code that is based on the SCO code. There are hundreds of ways to do a single thing, and if the GNU/Linux took ideas from the SCO kernel, SCO may be as eligible for compensation as if it were directly copied from SCO.

    --
    And why did you staple the trout to the RAM?
    1. Re:Nah... by jedidiah · · Score: 4, Informative

      Don't call it the "SCO kernel".

      It is the SysV kernel.

      --
      A Pirate and a Puritan look the same on a balance sheet.
    2. Re:Nah... by jmv · · Score: 3, Interesting

      That's true in general. However, SCO has explicitly stated that thousands of lines of code have been illegally copied *verbatim* from System V. This tool could at least prove that they lied (because of the verbatim copy allegation).

    3. Re:Nah... by jonabbey · · Score: 4, Informative

      if the GNU/Linux took ideas from the SCO kernel, SCO may be as eligible for compensation as if it were directly copied from SCO.

      IANAL, but I don't believe this is so in the general case. Copyright protects only specific expression of ideas, not the ideas themselves.

      If SCO had valid patents on some of this stuff, they'd have a point of legal leverage, but they don't from all reports.

    4. Re:Nah... by Anonymous Coward · · Score: 1, Insightful

      Copyright covers expressions, not ideas. You can't "take" ideas under American law; ideas are just there waiting to be reimplemented.

    5. Re:Nah... by nate1138 · · Score: 1

      Umm, no. How would this prevent a fair case? It wouldn't. It is simply a tool to compare files. If SCO is being truthful, they should allow the comparison. If they refuse, it is probably because it would expose them for the liars they are. Additionally, the patents on that sysV code expired a long time ago, so reimplementing something that they do is not a violation of anything. Copying the code directly would be, however.

      --
      Where's my lobbyist? Right here.
    6. Re:Nah... by teece · · Score: 1

      I don't think your analysis is correct. SCO would own a copyright on their code. That would not mean they own the ideas -- that is exactly what copyright does not grant you. You get the right to make copies, but in exchange the ideas are given to the public (or at least that is how it was supposed to work, when the country was founded).

      That is why SCO, MS, et al. keep their code a secret: if released to the public, the algorithms are fair game for all to use. Which is why SCO really has a hard time with their case, even if they don't know it -- since so much of the Unix kernel used to be wide open (if not free), the ideas contained therein are almost public domain. Copying the code verbatim would be wrong, but copying the algorithms and design would not.

      --
      -- Hello_World.c: 17 Errors, 31 Warnings
    7. Re:Nah... by Crispy+Critters · · Score: 1
      "There are hundreds of ways to do a single thing, and if the GNU/Linux took ideas from the SCO kernel, SCO may be as eligible for compensation as if it were directly copied from SCO."

      This is true. But proving direct verbatim copying is straightforward. It is much harder to prove copyright infringement when the copying is not verbatim, because the plaintiff has to justify that the copied material is copyrightable.

      If I look at your copyrighted code and retype it exactly into my code, then all that has to be shown is that I copied it (assuming the code is not GPLed or whatever). If I look at your copyrighted code and write a very similar routine for myself, you have to prove that I was looking at your code and you have to prove that whatever I took is copyrightable.

      There is no hard line between what is copying and what is legal use of the ideas embodied in the code. It is for the courts to decide on the merits of each case.

    8. Re:Nah... by dipipanone · · Score: 3, Funny

      Don't call it the "SCO kernel".

      OK then, the GNU/SCO kernel.

    9. Re:Nah... by Anonymous Coward · · Score: 0

      First off, Copyright only protects the expression, not the idea. If your assertion was true, then Royal Bee would rule the roost for Operating Systems with their 1950 implementation of JOS, and Donald Knuth would have most of the rest of the bits.

      Second, shredding for comparison is more accurate and answers the plagiarism question much better than Data Crystal's "pull it out of their ass" spectrum analysis that relies on statistics. You want positive matches, not guesses.

      Third, SCO does not have a kernel - wouldn't know how to write a kernel and couldn't recognize one if it walked up and bit them in the ass. The Sys V kernel (and a bad one at that) is what they use. You would think that after all that time and BSD code they would at least have stolen something from Berkeley to make Unixware work properly.

      Fourth, SCO can speechify all they want about IP. The fact remains: they are suing IBM for CONTRACT violations, not intellectual property (such as it is). Sys V is old enough that most, if not all, patented work has expired under the old rules. SCO owns a copyright to a copy of the Sys V codebase - that's it. IBM owns the patents and copyrights to the disputed sections, and SGI owns the IP to XFS - not SCO. The control rights are what is at issue, and according to most legal experts, McBride's claims to that effect are bullshit.

    10. Re:Nah... by Anonymous Coward · · Score: 0

      Dumbass...

    11. Re:Nah... by the_mad_poster · · Score: 1

      They're actually claiming millions of lines now.

      Basically, just take whatever you think it is, and multiply it by a thousand to get the latest claim.

      --
      Alito: A vote for Alito is a punch in the eye to put that bitch back in her place!
    12. Re:Nah... by shaitand · · Score: 1

      Ideas are covered by patents; copyrights don't cover ideas. I can read your post and rewrite it in my own words... and publish it with a commentary explaining that this is exactly what I did and I'd be legally untouchable. Copyright lets me steal your ideas all day long, just not your implementation of that idea. Patents are supposed to work the same way with tangible things but alas THEY have been broadened to the point where they cover the idea itself now.

      By all records Novell still owns the Unix patents... any patents for IBM's coding that was contributed would belong to IBM regardless of agreements concerning the code itself. SCO MIGHT have received the copyrights for the SysV code in their agreement with Novell... this is not even 100% proven.

    13. Re:Nah... by Anonymous Coward · · Score: 0

      If my hunch is right, it just may be called that after all is said and done. :o)

  9. The truth is out there by Teahouse · · Score: 2, Interesting

    The truth is out there, we will finally get to it without signing a SCO NDA. This should end the case before it begins. SHRED ON!

    --
    "Curiosity killed the cat, but for a while I was a suspect."- Steven Wright
    1. Re:The truth is out there by Usquebaugh · · Score: 1

      Clueless.

      The question is not if there is common code, there is, but rather where the code came from?

    2. Re:The truth is out there by StenD · · Score: 1

      But the first step to answering the question of the origins of the code is determining which code needs to be investigated. That's why identifying the duplicated code is critical.

    3. Re:The truth is out there by Teahouse · · Score: 1

      Clueless

      Where did I say otherwise? If you require me to spell it out, how bout this...FINDING COMMON CODE THAT IS SOLELY SYSTEM V IP.

      You can always correct my spelling next if you want, I am sure you could find hours of anal retentive pleasure there too.

      --
      "Curiosity killed the cat, but for a while I was a suspect."- Steven Wright
    4. Re:The truth is out there by DaveAtFraud · · Score: 1

      You seem to be under the mistaken impression that SCO is interested in establishing the truth.

      --
      They that can give up essential liberty to obtain a little temporary safety deserve neither safety nor liberty.
      Ben
    5. Re:The truth is out there by Usquebaugh · · Score: 1

      I'll bite :-)

      Why is SCO going to court, what could dissuade them from going to court? The SCO share price and revenue is up, why would they stop now?

      This is going to drag on for years no matter what anybody else does; SCO needs it to continue for as long as possible. As soon as they go to court they're finished and they know it.

      So I refer the honorable gentleman to my first response: you, sir, are clueless.

    6. Re:The truth is out there by Teahouse · · Score: 1

      "The SCO share price and revenue is up, why would they stop now?"

      This will play out on Wall Street before it plays out in the court. SCO's entire case depends on FUD and secrecy. "We have the proof, and you need to pay us". That's why they aren't releasing those "million lines of code". As soon as it's all linked to BSD or other non-SCO/derivative IP, this all comes crashing down. You can't extend a court case that gets thrown out. You can't artificially buoy your stock if it's obvious to Wall Street you have no real hope of winning your claim. This little program (and the results it finds) could strip away the secrecy and FUD permanently. If it does, this whole process could be (mercifully) shortened.

      Your assumption that SCO (with an annual income of barely 60 million) can maintain this level of litigation without selling Linux licences and winning some settlement against IBM is foolish at best. If their licence has no credibility, it will generate no income, and the lawsuit will die from lack of money. The Lawyers will ALWAYS get paid.

      --
      "Curiosity killed the cat, but for a while I was a suspect."- Steven Wright
    7. Re:The truth is out there by Usquebaugh · · Score: 1

      Your points are dependent on SCO running the program; think they're going to? They have no need to release the source code until court. Do you recall when the court date is?

      Who owns SCO, how much money do they have? M$ paid a fair bit, think they might chip in again? Hell, I would in their position, just to keep the Linux IP/GPL legality issue in the press.

  10. Breaking News! by TexVex · · Score: 3, Funny

    This just in. SCO to sue ESR for patent infringement over "comparator", a software package that performs comparison between different sets of source code to determine if any code is copied between them.

    --
    Fun with Anagrams! LADS HOST, SHALT DOS. HAS DOLTS. AD SLOTHS, HATS SOLD. ASS HO, LTD.
    1. Re:Breaking News! by Anonymous Coward · · Score: 0

      This is, ummm, like the third piece of actual code that ESR has contributed to the Free Software Community, isn't it?!?

  11. Damn... by Audiovore · · Score: 1

    And I just got rid of my paper shredder...

    --
    Without music, life would be a mistake. --- Nietzsche
  12. Answered My Own Question.. by BlackBolt · · Score: 3, Funny
    From the article:

    "...has two advantages: one, it's amazingly fast..."

    Guess not. ;-)

    1. Re:Answered My Own Question.. by segmond · · Score: 1

      It's not fast because of the language but, according to his description of the program, because of the ability to load all the hashes into memory and compare them. Even that would be very fast in BASIC. :D

      --
      ------ Curiosity killed the cat. {satisfaction brought it back | it didn't die ignorant | lack of it is killing mankind
    2. Re:Answered My Own Question.. by Ninja+Programmer · · Score: 1

      Python contains a built-in module for computing MD5. Assuming it uses a C backend, then from my own testing we know that the program will become disk-limited. There will be no significant difference between C and Python on disk-limited activities.

  13. Can Someone Explain? by Klync · · Score: 2, Interesting

    If you're comparing two sets of code via their MD5 sums, then won't that miss matching lines that differ by even one character - like, say, a space?

    --

    ----
    Not to be confused with Col.
    1. Re:Can Someone Explain? by stratjakt · · Score: 4, Interesting

      Perhaps if you parsed them both, and compared the resulting object code, right before compilation?

      That way if your variable is called numOfPorts and mine is called countOfPorts, the parsed code is the same for both, when stuff like that becomes meaningless.

      Even if not, SCO seems to be saying that much of the code is copy-n-paste anyways.

      --
      I don't need no instructions to know how to rock!!!!
    2. Re:Can Someone Explain? by Sterling+Christensen · · Score: 5, Informative

      From its manual:
      "The -w causes all whitespace in the file (including blank lines) to be ignored for comparison purposes (line numbers in the output report will nevertheless be correct). This is recommended for comparing C code; among other things it means the comparison won't be fooled by differences in indent style."

    3. Re:Can Someone Explain? by jonabbey · · Score: 1

      You'd have to transform the segments, basically boiling them down to a canonical form before generating the MD5 hash.

      So you might turn all contiguous whitespace (tabs, spaces, etc.) into a single space char before generating the hashes, for instance.
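
      A minimal sketch of that canonicalize-then-hash step, assuming the simple rule just described (every run of whitespace becomes one space, ends trimmed). The names canonicalize and line_digest are invented for this example and are not taken from comparator.

```python
import hashlib
import re

def canonicalize(line: str) -> str:
    """Collapse each run of whitespace to one space and trim both ends."""
    return re.sub(r"\s+", " ", line).strip()

def line_digest(line: str) -> str:
    """MD5 hex digest of the canonical form of a line."""
    return hashlib.md5(canonicalize(line).encode("utf-8")).hexdigest()

a = "if (x  ==\t0)   return;"
b = "  if (x == 0) return;  "
c = "if (y == 0) return;"

print(line_digest(a) == line_digest(b))  # True: only the whitespace differs
print(line_digest(a) == line_digest(c))  # False: a renamed variable changes the hash
```

      Note that this particular rule still treats "if(x==0)" and "if (x == 0)" as different, which is one reason to drop whitespace entirely rather than squeeze it, as the -w option quoted above is described as doing.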

    4. Re:Can Someone Explain? by Anonymous Coward · · Score: 2, Insightful

      While you might be able to deal with whitespace, you do still have the problem that you're really only looking at whole-file matches for identity. You can't find one function lifted from some other source. You can't find code that's had even minimal cosmetic surgery on the variable names.

      While a high degree of exact matching between two trees would demonstrate related code, lack of a high degree of identical files as determined by this method does not demonstrate that two code trees are unrelated. It's perhaps an interesting metric for comparing two projects that you already know are related, like two forks of a project or two versions of one project. But this technique is nearly useless as an anti-SCO defense.

    5. Re:Can Someone Explain? by afidel · · Score: 1

      GNU entab is your friend. Feed it a source file and a rule file and it will give you standardized output which you can easily compare. My brother's tabbing style is not even internally consistent so in order to help him debug programs I have to fist run them through my own GNU entab rules so I can read the code =)

      --
      There are 4 boxes to use in the defense of liberty: soap, ballot, jury, ammo. Use in that order. Starting now.
    6. Re:Can Someone Explain? by Bob+the+Hamster · · Score: 4, Informative

      And note that it is not comparing the MD5's of whole files, it is comparing MD5's of three-line "shreds" of files

    7. Re:Can Someone Explain? by dmaxwell · · Score: 1

      By itself, the shred comparison of two codebases doesn't mean much. With the help of a party in possession of the SysV codebase, the good guys could build a concordance of matches between the two codebases and establish their provenance. Properly indexed and annotated, that concordance could be used to quickly counteract anything SCO raises in court or in public.

    8. Re:Can Someone Explain? by Anonymous Coward · · Score: 0

      I have to fist run them through my own GNU entab rules

      Doesn't that hurt your fist?

    9. Re:Can Someone Explain? by bedessen · · Score: 1

      Try reading for comprehension some time. This method does not compare whole files, it compares small hunks of code, 3 lines at a time. Presumably, this is implemented with some sort of windowing (like the rsync protocol does) so that you can actually compare down to the individual line and not be dependent on where the 3-line boundaries fall. But in any case, it's not just a simple "find . -type f|xargs md5sum" type of deal where you get file-level granularity.
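
      The windowing presumed here might look roughly like the sketch below: every line starts its own overlapping three-line shred, so a match does not depend on where fixed boundaries fall. This is a guess at the general approach, not the tool's actual code; the function names and the toy input are hypothetical.

```python
import hashlib

def shreds(lines, size=3):
    """Yield (start_line, md5_hex) for every overlapping window of `size` lines."""
    for i in range(len(lines) - size + 1):
        chunk = "\n".join(lines[i:i + size])
        yield i + 1, hashlib.md5(chunk.encode("utf-8")).hexdigest()

def common_shreds(lines_a, lines_b, size=3):
    """Return sorted (line_in_a, line_in_b) pairs whose windows hash identically."""
    index = {}
    for start, digest in shreds(lines_b, size):
        index.setdefault(digest, []).append(start)
    return sorted(
        (start_a, start_b)
        for start_a, digest in shreds(lines_a, size)
        for start_b in index.get(digest, [])
    )

tree_a = ["int f(void)", "{", "    int n = 0;", "    n++;", "    return n;", "}"]
tree_b = ["/* extra header line */"] + tree_a  # same code, shifted down one line

print(common_shreds(tree_a, tree_b))
# [(1, 2), (2, 3), (3, 4), (4, 5)]
```

      With fixed, non-overlapping three-line blocks the one-line shift in tree_b would have hidden every match; with a sliding window all four shreds of tree_a are found.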

  14. what?! by DumbWhiteGuy777 · · Score: 0

    Does this mean I bought a SCO license in vain?!

    1. Re:what?! by hayden · · Score: 1
      Does this mean I bought a SCO license in vain?!
      No. You are now SCO's bitch. Enjoy!
      --
      Nerd: Derogatory term typically directed at anybody with a lower Slashdot ID than you.
  15. Other uses? by Not_Wiggins · · Score: 4, Interesting

    It might be interesting to see how different families of Linux/Unix compare... maybe generate a veritable "family tree" of relationships.

    Of course, that also depends more on how differences are actually calculated. Still, could make an interesting project to relate OSes based on how much shared code they still retain and show it in a graphical tree format, ala "family tree." 8)

    --
    Diplomacy is the art of saying, "Nice doggie!" until you can find a rock.
    1. Re:Other uses? by JeffTL · · Score: 2, Funny

      Yeah, that'd be great. In my anthropology class we've been studying that sort of stuff, but with DNA...there are some tree diagrams of primates, so why not Unices?

    2. Re:Other uses? by danimal · · Score: 1
      you mean like this?

      granted, it's not based on source code comparisons, but rather history.

    3. Re:Other uses? by nateb · · Score: 1

      Sounds like those reports you hear where scientists compare % of related DNA. 98% human-great ape, 34% human-banana, etc.

      --
      -- Nate
  16. Genius by seldolivaw · · Score: 2, Insightful

    ESR shows us once again why exactly he has so much respect from the community. Well done, that man.

    1. Re:Genius by Anonymous Coward · · Score: 3, Insightful

      Respect? Not from me. He couldn't code his way out of a paper bag. His biggest claim to fame is failing to get kernel modules accepted and then whinging about it. And making up his own Jargon file entries. And writing about how great he is - see his 'I am considerably richer than you' essay. And writing boring long-winded pieces about guns. And claiming to speak for people he shouldn't. Etc.

    2. Re:Genius by bernywork · · Score: 1

      Well, AC, announce who you really are, and what you are doing for the Open Source movement.

      You point all you are doing is as much as SCO, spreading FUD. Until you are doing better, don't whinge about somebody else.

      --
      Curiosity was framed; ignorance killed the cat. -- Author unknown
    3. Re:Genius by andrewgreen · · Score: 1

      I just tried it out, and immediately blew up with a floating point exception. The function eligible uses fgets to read a line from the beginning of a file. If that line starts with a zero byte (like the invisible .DS_Store on MacOS X) then fgets returns success, and a little later on we divide by strlen(buf), which is of course zero. Oops.

    4. Re:Genius by Anonymous Coward · · Score: 0

      Well, AC, announce who you really are, and what you are doing for the Open Source movement.

      You're missing the point. This is all about egos, self-aggrandisement, and trumpeting of claims without basis. esr's continuing attempts to remind us of his 'achievements' are what piss people off.

      Incidentally, I write free software but I don't try to overstate my fairly minor contribution.

      You point all you are doing is as much as SCO, spreading FUD.

      I'm sure my modest attempts to inject a dose of reality will do little in the face of esr's propaganda juggernaut. He's trying to write himself into the pantheon of uber-hackers, and the worrying thing is he seems to be almost succeeding.

      Come on, if you are a coder, do you think this algorithm is anything special? I could knock this out over breakfast. Why for example does esr need to boast about the speed of his algorithm when it's a one-time application?

    5. Re:Genius by Anonymous Coward · · Score: 0

      Couldn't code his way out of a paper bag? The guy wrote GCC and Emacs, for God's sake! You have no idea what you're talking about. Good day.

    6. Re:Genius by Raffaello · · Score: 1

      You're thinking of Richard Stallman, (RMS), who did write gcc and emacs.

      Eric Raymond, (ESR), did not write gcc, nor did he write emacs.

  17. fire the "laser" by La+Temperanza · · Score: 0, Offtopic

    I can't decide which is funnier - the point about IBM orchestrating all the outrage, or the point that SCO is somehow more "relevant" to the tech community because they've filed a bunch of press releases! SCO has committed the most vile of sin. Ok, I'll stop now.

    This Comment was generated with the Comment-O-Matic for SCO Stories.

    --
    est modus in rebus
    1. Re:fire the "laser" by be-fan · · Score: 3, Interesting

      You know the sad thing about all this? I can't tell the difference between the auto-generator or your average Slashdotter. Does this mean that the auto-generator passes the Turing Test, or that the average Slashdotter doesn't?

      --
      A deep unwavering belief is a sure sign you're missing something...
    2. Re:fire the "laser" by Otter · · Score: 1
      I liked this one:
      It just goes to show that whether it's object-oriented programming or contract law, multiple inheritance is likely to be hard to understand. After all ... there can't be more than one person that actually comments their code, can there? Next McBribe will be showing off a server stats chart to stock holders as proof of sco's growing relevance in the high tech world.

      If you get in early on an SCO story with this one, it's a guaranteed +4, at least.

    3. Re:fire the "laser" by Knuckles · · Score: 1

      Disclaimer: This generator generates a random number of sentences from old slashdot comments randomly thrown together with random topics

      Read it aloud if it helps :) The auto-generator takes old (supposedly human-written) /. comments on SCO and combines them with other stuff. E.g., "It just goes to show that whether it's object-oriented programming or contract law, multiple inheritance is likely to be hard to understand" is taken verbatim from an old posting (I know, I read it). Seems as if copyright issues come up here (considering how /. posters [sadly, IMO] reacted when the Hellmouth book was discussed, I think it's strange nobody takes offence with this).

      --
      "When I first heard Daydream Nation it quite frankly scared the living shit out of me." -- Matthew Stearns
    4. Re:fire the "laser" by Otter · · Score: 1
      Yes, I had realized that. ;-)

      Believe me, if I thought it had generated those lines from a dictionary I'd be impressed, not simply amused.

    5. Re:fire the "laser" by Knuckles · · Score: 1

      I thought so :) I probably should have replied to the parent of your post, who is still +4 Interesting. Way to go, Moderators. Not.

      --
      "When I first heard Daydream Nation it quite frankly scared the living shit out of me." -- Matthew Stearns
  18. Who cares? by Otter · · Score: 3, Insightful
    OK, I admit that a) the guy annoys the hell out of me, b) his yapping about "one of us" DOS'ing SCO is yet another case of him embarrassing Linux while aggrandizing himself and c) just the quotes in this article alone make me want to slap him. So if someone else had been involved with this, I probably wouldn't bother to care.

    Anyway -- who cares? There's no question there are plenty of common chunks between Linux and SCO-owned source. And that there are ways to find them. The question is what they are (which SCO isn't saying) and what their common origin is and where that origin falls in the murky history of the Unix codebase. It's not as if anyone has been saying, "We're helpless in the face of this computational problem. If only there were a way to compare large bodies of text for common elements!"

    Never mind that there are probably people who can compare both codebases in their heads.

    Maybe he's made some major algorithmic breakthrough. (I doubt it, but I'll leave that to the experts.) But this story is just him yapping again.

    1. Re:Who cares? by jmv · · Score: 2, Informative

      I think the difference is that a 3rd party that has access to the SysV source can compute the hashes and make them public without violating copyright. That way anyone can look for common lines with Linux and see where they came from (legal or not).
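
      A rough sketch of that hash-only exchange, assuming three-line windows hashed with MD5 as described elsewhere in this thread. The function names and the choice to join lines with a newline before hashing are illustrative assumptions, not comparator's actual file format or normalization.

```python
import hashlib


def shreds(lines, size=3):
    """Yield (start_line, md5_hex) for every run of `size` consecutive lines."""
    for i in range(len(lines) - size + 1):
        chunk = "\n".join(lines[i:i + size]).encode("utf-8")
        yield i + 1, hashlib.md5(chunk).hexdigest()


def publish(lines):
    """What the third party would release: digests only, no source text."""
    return {digest for _, digest in shreds(lines)}


def locate(public_digests, open_lines):
    """1-based start lines in the open tree whose shreds appear in the published set."""
    return [start for start, digest in shreds(open_lines) if digest in public_digests]


# Toy stand-ins for a licensed file and a Linux file.
proprietary = ["a();", "b();", "c();", "d();"]
linux_file = ["x();", "b();", "c();", "d();", "y();"]
print(locate(publish(proprietary), linux_file))  # prints [2]
```

      Only the open side needs its text on hand to turn a matching digest back into a file and line number, which is what lets anyone trace where a common chunk came from.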

  19. Finally ESR stops yapping and does some hacking by Anonymous Coward · · Score: 2, Funny

    ESR is ok you know, but lately he has just been doing lots of ranting and soapboxing and no hacking.

    Finally he comes out with some hack action. About time man, I was beginning to view him as just some big windbag who hacked a little back in the day. Well I still sorta do, but this is at least pretty cool, you know.

    1. Re:Finally ESR stops yapping and does some hacking by mdxi · · Score: 3, Insightful
      ESR is ok you know, but lately he has just been doing lots of ranting and soapboxing...I was beginning to view him as just some big windbag

      Did you read the article? Those are some of the most self-aggrandizing quotes I've ever seen in real life. SCO lawyers should "be afraid" of him. He "perfected" the algorithm. His 1500 line program is a complete masterwork; both elegant beyond compare and a paragon of maintainability!

      You don't ever see, say, Linus, Larry, or RMS talking themselves up like that.

      --
      Posted with Mozilla
    2. Re:Finally ESR stops yapping and does some hacking by the+hopthrisC · · Score: 1

      He did bogofilter right after Paul Graham released that white paper. -- It is fast and it is effective.

      He maintains fetchmail -- one of the most reliable and stable tools i can think of (ever lost an email with fetchmail and it wasn't _your_ fault? i doubt it!)

      And there are quite a few more...

    3. Re:Finally ESR stops yapping and does some hacking by Anonymous Coward · · Score: 0

      Well the funny thing here, is that the reporter of that article, most likely was one of the first people ESR spoke to of his program.

      I don't know about you, but the next 10 or so people I talk to after I finish working on any form of code WILL hear about my code and will hear my gloating.

      I think that's normal for programmers.

    4. Re:Finally ESR stops yapping and does some hacking by orangesquid · · Score: 1

      I don't know what eWeek reporters are like, but most reporters are looking for "BIG NEWS," and they will press and press you until you say that something is "revolutionary." Personally, my suspicion is that ESR may actually be making fun of those recent TV commercials that suggest such-and-such product is the culmination and perfection of the history of the human race. ESR seems to be one of those guys that has a very dry sense of humor that comes off as egocentric to people who aren't used to it.

      But, that could just be me...

      I know this tool is not anything to write home about, but given the fact that it is another free software tool which does something similar to $100,000 commercial packages, it may be another tool in the open source arsenal. Not sure how big of a role it will actually play against SCO, though, since IBM et al are likely well able to spend $$$ proving the annoying dog of SCO wrong.

      --
      --TheOrangeSquid Is it any wonder things seem so awry? We swim in a sea of confusion and don't have to think to survive
    5. Re:Finally ESR stops yapping and does some hacking by Anonymous Coward · · Score: 0

      He started bogofilter but hasn't maintained it in ages.

    6. Re:Finally ESR stops yapping and does some hacking by ValentineMSmith · · Score: 1

      It seems that you've never had experience being on the receiving end of the press. I've had occasion to listen on interviews with family members and see and hear what was actually said to the journalist in question. After "shredding" what was said in the actual interview and comparing the MD5 sums with the published interview, there was absolutely no, none, zip, zilch commonality. Although, I'm sure that if ESR's comments were taken COMPLETELY out of context, we'll be hearing from him soon enough.

      --
      Karma: Chameleon - mostly influenced by bad '80s New Wave music
    7. Re:Finally ESR stops yapping and does some hacking by Russ+Nelson · · Score: 1

      Have you ever gotten a press release printed? If not, then don't criticize Eric (BTW, "ESR" is a slashdot invention -- he never calls himself ESR) for doing what's necessary to get press attention to the fact that SCO is lying.
      -russ

      --
      Don't piss off The Angry Economist
    8. Re:Finally ESR stops yapping and does some hacking by darien · · Score: 1

      BTW, "ESR" is a slashdot invention -- he never calls himself ESR

      What, never?

    9. Re:Finally ESR stops yapping and does some hacking by Russ+Nelson · · Score: 1

      Well, yes, that page makes my point. He doesn't *call* himself ESR, but instead makes reference to the fact that other people do so. RMS has expressed a preference that he be called "RMS" during one period in his life or another.
      -russ (aka RNN :)

      --
      Don't piss off The Angry Economist
  20. SCO may not know origin of code by Malfourmed · · Score: 5, Informative
    The Sydney Morning Herald continues its mainstream coverage of the SCO vs IBM roadshow by posting an article where Dr Warren Toomey, a Unix historian, says that SCO may not know the origin of their own code.

    Article text follows:

    SCO may not know origin of code, says Australian UNIX historian

    By Sam Varghese
    September 9, 2003

    More doubts have been cast on the heritage of System V Unix code, which the SCO Group claims as its own, by an Australian who runs the Unix Heritage Society.

    Dr Warren Toomey, now a computer science lecturer at Bond University, said today: "I'd like to point out that SCO (the present SCO Group) probably doesn't have an idea where they got much of their code. The fact that I had to send SCO (the Santa Cruz Organisation or the old SCO) everything up to and including Sys III says an awful lot."

    He said that even though SCO owned the copyright on Sys III, a few years ago it did not have a copy of the source code. "I was dealing with one of their people at the time, trying to get some code released under a reasonable licence. I sent them the code as a gesture because I knew they did not have a copy," he said with a chuckle.

    Dr Toomey's statements come a few days after Greg Rose, an Australian Unix hacker from the 1970s, raised the possibility that there may be code contributed by people, including himself, which has made its way into System V Unix and is thus being used by companies like the SCO Group.

    Dr Toomey said this was one reason why the code samples which the SCO Group had shown at its annual forum had turned out to be widely published code.

    SCO was unaware of the origins of much of the code and this "explains how they could wheel out the old malloc() code and the BPF (Berkeley Packet Filter) code, not realising that both were now under BSD licences - and in fact they hadn't even written the BPF code," Dr Toomey said.

    He said that there was lots of code which had been developed at the University of New South Wales in the 70s which went to AT&T and was incorporated into UNIX without any copyright notices.

    "At that time the development that was going on was similar to open source - the only difference was that the developers all had to have copies of the code licensed from AT&T," he said.

    Dr Toomey, who served 12 years with the Australian Defence Force Academy, an offshoot of the University of New South Wales, before joining Bond University, said he had source code for Unices from the 3rd version of UNIX which came out in 1974 to the present day. "I don't have Sys V code but there are people with licences for that code who are members of the Unix Heritage Society. We can compare code samples any time," he said.

    He agreed that the codebase of Sys V was a terribly tangled mess. "It is very difficult to trace origins now. There is an awful lot of non-AT&T and non-SCO code in Sys V. There is a lot of BSD code there," he said.

    In March, the SCO Group filed a billion-dollar lawsuit against IBM, for "misappropriation of trade secrets, tortious interference, unfair competition and breach of contract."

    SCO also claimed that Linux was an unauthorised derivative of Unix and warned commercial Linux users that they could be legally liable for violation of intellectual copyright. SCO later expanded its claims against IBM to US$3 billion in June when it said it was withdrawing IBM's licence for its own Unix, AIX.

    IBM has counter-sued SCO while Red Hat Linux has sued SCO to stop it from making "unsubstantiated and untrue public statements attacking Red Hat Linux and the integrity of the Open Source software development process."

    -----

    Wordforge writing contest now open: deadline 2003-03-28

    1. Re:SCO may not know origin of code by Anonymous Coward · · Score: 0

      Well, shouldn't they have researched this before:
      1. filing a $billion lawsuit
      2. making claims about their IP being included in Linux and claiming a $700 per CPU royalty
      3. dumping all of their SCO stock (they might still have a winner they can't buy back!)

      I think you assume more ignorance than they deserve (shit! did I really say that?)

    2. Re:SCO may not know origin of code by Anonymous Coward · · Score: 0

      Well, shouldn't they have researched this before: ...

      Well, in fairness, they originally planned on getting bought-up by IBM. So they didn't really care how accurate or reasonable their claims were.

    3. Re:SCO may not know origin of code by twrake · · Score: 1

      In the late 70's a college classmate had mentioned to me that he had written a few lines of the Unix operating system. He went to high school in the area of Bell's Murray Hill facility and I believed this claim was plausible. Unix has a checkered IP history.

      McBride :
      "At a minimum, IP sources should be checked to assure that copyright contributors have the authority to transfer copyrights in the code contributed to Open Source."

      Are SCO sources up to McBride's standards?

  21. What respect? by Anonymous Coward · · Score: 3, Interesting

    Most people *I* know consider ESR to be a bloated windbag with a penchant for fanatical gunrights. He's regarded as pretty much being on the same level as the late Jon Katz.

    1. Re:What respect? by Anonymous Coward · · Score: 1, Insightful

      Exactly. Any coder worth his salt could put a "shred" program together in a couple of days. ESR's claim to fame is more about being a loudmouthed idiotarian (whoops, meant "libertarian", but what's the difference?) than anything else.

    2. Re:What respect? by landley · · Score: 1

      > Most people *I* know consider ESR to be a bloated
      > windbag with a penchant for fanatical gunrights

      You need to get out more.

      Rob

    3. Re:What respect? by Russ+Nelson · · Score: 1

      Jon Katz isn't dead. The rest of your facts are equally factual.
      -russ

      --
      Don't piss off The Angry Economist
  22. Re:I'm more interest in time saved by programmers by Anonymous Coward · · Score: 0

    It's "than" not "then".

    They're different words you know.

  23. Be careful... by nolife · · Score: 4, Interesting

    The more points you discover and disprove now with SCO's claims.. the higher quality, more refined, and detailed SCO's evidence will be when this setup finally gets to a court in front of a judge. If they went to court two months ago or even today, they would have been sent home quickly with basically easy to disprove evidence. With the help of the open source community, they are slowly changing their weapon of choice from a shotgun to a rifle.

    --
    Bad boys rape our young girls but Violet gives willingly.
    1. Re:Be careful... by Anonymous Coward · · Score: 0

      It will most likely be reduced to a spitwad.

    2. Re:Be careful... by NickFortune · · Score: 1
      This is the security through obscurity vs. full disclosure argument applied to the legal arena. It does not necessarily follow, I'll grant, that what works in software works in law, but in this case I can't see why not.

      If SCO had gone to court straight away they might have stolen the prize while the open source community was off-guard. I think our chances get stronger every day.

      --
      Don't let THEM immanentize the Eschaton!
    3. Re:Be careful... by inc_x · · Score: 1

      From what I have seen it has changed from a shotgun to a wet newspaper.

    4. Re:Be careful... by Anonymous Coward · · Score: 0

      Be careful or you'll be labeled a heretic and burned at the OS stake. God forbid you acknowledge logic and law don't always jibe. Not here.

    5. Re:Be careful... by Anonymous Coward · · Score: 0

      News flash. The Linux community as such isn't a concern of SCO one way or the other. No matter what we find, claim, or research they are convinced they own Linux. They have picked that fixed point of reference, and from that point of reference we are noisy and lying malcontents. Just like in the old days religion said the moon was perfect and Earth was the center of the universe. By definition, any disagreement was Wrong. Facts didn't have a thing to do with it, and should be punished for being Wrong.

      This is a FUD war. If they show any real code, whoosh it's gone from the Linux kernel. The real fight isn't the evidence, it's something much more dangerous.

    6. Re:Be careful... by Anonymous Coward · · Score: 1, Funny

      At the same time though, the OSS community are rather more rapidly changing their clothing of choice from a light cotton shirt to Kevlar body armour.

    7. Re:Be careful... by peacefinder · · Score: 1

      The rhetoric from the free/open software folks has always been: "We don't think we're infringing, but if SCO is right we'll change the code immediately." Well, here's a chance for the community to make good on that claim.

      If I read between the lines correctly, ESR's method will only find wholly-identical code+comments in three-line chunks. Keep in mind that this will not find duplications of single lines, or even triples with any non-whitespace change at all. However, it's probably going to find most of the really significant areas of dispute. These identical shreds can be examined for origins, and changed if the origin is at all murky.
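
      A small sketch of that behaviour, under the assumption (taken from the description above, not from comparator's source) that whitespace is discarded before each three-line window is hashed. The `normalize` rule and the sample lines are made up for illustration.

```python
import hashlib
import re


def normalize(line):
    """Drop all whitespace so that reindenting or respacing does not change the hash."""
    return re.sub(r"\s+", "", line)


def shred_set(lines, size=3):
    """MD5 digests of every window of `size` consecutive normalized lines."""
    norm = [normalize(line) for line in lines]
    return {
        hashlib.md5("\n".join(norm[i:i + size]).encode("utf-8")).hexdigest()
        for i in range(len(norm) - size + 1)
    }


original = ["if (p == NULL)", "    return -1;", "n = p->len;", "q = p->next;", "free(p);"]
respaced = ["if (p==NULL)", "\treturn -1;", "n  =  p->len;", "q = p->next;", "free(p);"]
renamed = ["if (p == NULL)", "    return -1;", "m = p->len;", "q = p->next;", "free(p);"]

print(len(shred_set(original) & shred_set(respaced)))  # prints 3: whitespace-only edits still match
print(len(shred_set(original) & shred_set(renamed)))   # prints 0: the changed line sits in every window
```

      In a longer file a one-token change would knock out at most the three windows that contain the changed line, leaving the surrounding shreds to match.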

      Combined with the presumption that the best legal remedy is to stop infringement on the copyright, the ability to identify code that certainly needs to be changed, despite SCO's NDA, is pretty sweet.

      But you're right that this will not be the end of the story. SCO has staked its life on this, and ESR's method is not enough to find everything that a desperate SCO might claim as an infringement in court. However, by the time this makes it to court, any gross violations that might exist will have been found and probably corrected. While the remaining snippets may be more focused and specific, they will also be of lesser quality as evidence of infringement.

      --
      With reasonable men I will reason; with humane men I will plead; but to tyrants I will give no quarter. -- William Lloyd
    8. Re:Be careful... by Rock+Ridge · · Score: 1

      Just DUMP IBM's contributions. That is the simplest way out.

    9. Re:Be careful... by Overly+Critical+Guy · · Score: 1



      How do you know? Did you have access to their files or something?

      With the help of the open source community, they are slowly changing their weapon of choice from a shotgun to a rifle.

      If SCO is right and somebody copied their code illegally, it shouldn't be about avoiding addressing their claims just to skirt by and have them sent home regardless of whether they were right or wrong. What kind of image does that give the Linux community when you say things like that? It should be about what's right. We have yet to see.

      --
      "Sufferin' succotash."
    10. Re:Be careful... by Anonymous Coward · · Score: 0

      Yeah, that would have worked so well in finding the SGI contributed code SCO said was in violation.

      How 'bout just DUMP ALL contributions not from Torvalds himself. That'll work.

    11. Re:Be careful... by shaitand · · Score: 1

      The problem with that theory is this... they have made their legal stance clear, in their filings, in their public statements, all of which are documented. Each time they change their story to cover some new hole pointed out in it they lose credibility and generate evidence which can be presented in court.

    12. Re:Be careful... by Daniel+Phillips · · Score: 3, Insightful

      The more points you discover and disprove now with SCO's claims.. the higher quality, more refined, and detailed SCO's evidence will be when this setup finally gets to a court in front of a judge.

      Having many thousands of bright minds working on our side more than balances the advantage SCO can get by snooping on our discourse, if they can even come close to following it all, that is. We outnumber them, it's stupid not to capitalize on that.

      Just think, if the word doesn't go out, there are many people who might not have come out of the woodwork to contribute their valuable input, historical recollection, interesting files, legal insight, whatever. We work in the open, we share information, we cooperate, we are many in number. They work in the dark, they trust nobody, they're afraid to ask for help, they are few. It's open source versus closed source all over again.

      Also, we each do our own thinking, we try to come up with the part we can contribute, then we go looking for the best place to contribute it. Multiply by 10's of thousands. Compare to a few fevered minds going over and over the same rotten thoughts then sending out marching orders. Seen two systems like that before? Right, it's a free market economy versus Soviet-style central planning. In the end, the free market won because it is more efficient.

      With the help of the open source community, they are slowly changing their weapon of choice from a shotgun to a rifle.

      A rifle will not help you much against a herd of 50,000 enraged penguins stampeding towards you at an average speed in excess of 100 miles per hour.

      --
      Have you got your LWN subscription yet?
    13. Re:Be careful... by Anonymous Coward · · Score: 0

      Yeah, but any credibility they lose in the press will have absolutely zero effect when and if the case goes to trial. By pointing out the holes in SCO's claims, the community shows SCO what lines of code not to introduce in court.

      IANAL, but I'd imagine that the research being done is a good thing. However, I don't think that it's necessarily a good thing to publish it. Why not just turn the info over to IBM's lawyers? They're the ones who know how to use the ammo. If it's best saved to rebut SCO's allegations in the courtroom, then it's harmful to tip off SCO by going public.

      I realize that this approach is anathema to most slashdotters, but the legal system is a horse of a different color.

    14. Re:Be careful... by shaitand · · Score: 1

      Most of this information is discovered through open collaboration and discussion. And what on earth makes you think that SCO's public statements are somehow hidden from the courtroom? They can be admitted as evidence, particularly in Red Hat's case where a big chunk of the case is ABOUT their public statements. Once SCO makes a statement, they can't take it back, things like the letter from McBride, which are filled with outright lies... the kind which Red Hat's lawyers can easily prove he knew were outright lies when publicly issuing it, will give him a nice ram up the arse.

      What it really boils down to though, is that things which are determined through open collaboration, can't exactly be kept quiet... that's kind of why they call it OPEN.

  24. But SCO's no ordinary rabbit! by Anonymous Coward · · Score: 3, Funny

    Bruce Perens:
    Three. Three. And we'd better not risk another frontal assault. Their legal team is dynamite.
    Linus:
    Would it help to confuse it if we run away more?
    Bruce Perens:
    Oh, shut up and go change your firewall!

    Alan Cox:
    Let us taunt it! Darl may become so cross that he will make a mistake.
    Bruce Perens:
    Like what?
    Alan Cox:
    Well... ooh.
    ESR:
    Have we got bows?
    Bruce Perens:
    No.
    ESR:
    We have the Holy Hand Grenade.
    Bruce Perens:
    Yes, of course! The Holy Hand Grenade of Antioch! 'Tis one of the sacred relics Brother Richard carries with him.
    Brother Richard! Bring up the Holy Hand Grenade!
    MONKS: [chanting]
    Pie Iesu domine, dona eis requiem.
    Pie Iesu domine, dona eis requiem. Pie Iesu domine, dona eis requiem. Pie Iesu domine, dona eis requiem.

    Bruce Perens: How does it, um-- how does it work?
    ESR:
    I know not, my liege.
    Bruce Perens:
    Consult the Book of Armaments!
    RMS:
    Armaments, chapter two, verses nine to twenty-one.
    OPEN SOURCE ZEALOT:
    And Saint Attila raised the hand grenade up on high, saying, 'O Lord, bless this Thy hand grenade that, with it, Thou mayest blow Thine enemies to tiny bits in Thy mercy.'
    And the Lord did grin, and the people did feast upon the lambs and sloths and carp and anchovies and orangutans and breakfast cereals and fruit bats and large chu--
    RMS:
    Skip a bit, Brother.
    OPEN SOURCE ZEALOT:
    And the Lord spake, saying, 'First shalt thou take out the Holy Pin. Then, shalt thou count to three. No more. No less. Three shalt be the number thou shalt count, and the number of the counting shall be three. Four shalt thou not count, nor either count thou two, excepting that thou then proceed to three. Five is right out. Once the number three, being the third number, be reached, then, lobbest thou thy Holy Hand Grenade of Antioch towards thy foe, who, being naughty in My sight, shall snuff it.'
    Richard:
    Amen.
    KNIGHTS:
    Amen.
    Bruce Perens:
    Right!

    One!... Two!... Five!
    Alan Cox:
    Three, sir!
    Bruce Perens:
    Three!
    [sco dies]

  25. Unfortunate name by tordon · · Score: 2, Insightful

    Upper management in most enterprises has a low level of technical knowledge. To them the thought of something called shredding coming anywhere near the 'voodoo' of software development would be abhorrent.

  26. Is this really as useful as it seems? by typobox43 · · Score: 1

    What I see from the article is that it can only compare whether two code snippets are exactly alike (which makes sense from the standpoint of MD5 - they're really only useful for equality checks) - and from the claims that are being thrown around about obfuscating the supposedly legal code, that isn't going to help much of anything.

    1. Re:Is this really as useful as it seems? by toast0 · · Score: 3, Insightful

      Finding obfuscated copied code is a difficult problem to solve. Presumably, SCO has put forth much effort into that, but they refuse to make public their claims.

      Straight forward copying of code is much easier to find, and much easier to show is copying in a court. If we look at all the instances of duplicate code, and determine if they are license violations or not, it will be a start to making SCO go away.

    2. Re:Is this really as useful as it seems? by deinol · · Score: 1

      Finding obfuscated copied code is a difficult problem to solve. Presumably, SCO has put forth much effort into that, but they refuse to make public their claims.

      What are you talking about? Yes, SCO could obfuscate their code. The point is, there are thousands if not more people who have legitimate copies of the code. They have had their copies much longer than SCO has been working on their lawsuit. It's not like SCO has rereleased the Unix source code in an obfuscated version. Nobody with a licensed copy of the source code would be able to use it for anything.

      The infringing code is out there. SCO isn't the only group with access to it. They can't go and hide it now, it's been around for a long time. Someone will be able to compare the code and point to the areas in the Linux code that needs to be examined.

      --
      Got Apathy?
    3. Re:Is this really as useful as it seems? by toast0 · · Score: 1

      I'm saying that it's difficult to take two large source trees and look for code that's the same but obfuscated.

      I'm presuming that SCO has put forth a large amount of effort into finding duplicated but obfuscated code. Not that they're obfuscating their code.

    4. Re:Is this really as useful as it seems? by typobox43 · · Score: 1

      The point is that there are claims by SCO that Linux developers changed little elements of the code that they allegedly stole from SCO (similar to how you would plagiarize a report in middle school and change a few words so it wasn't a verbatim copy ;). It would be much harder to discover using a computer program that such code was the same.

  27. Ridiculous by ckimyt · · Score: 1
    Interestingly, as the shred algorithm can run reports on source trees using only the MD5 signature shreds (once generated), it is possible to use it to compare trees without direct access to the source code itself, leading to a possible use in comparing various proprietary source trees with each other and with Freely available code bases such as Linux and *BSD without requiring actual disclosure of the proprietary source code
    That's just plain dumb:

    [evilscobox] /usr/codebase> find . \( -iname '*.c*' -o -iname '*.h*' \) -exec sh -c 'printf "%s\n" "$2" >> "$1"' sh {} "/* I'm defeating MD5 */" \;

    Duh!
    --

    Putting the sig back into +1, Insightful since 1995!
    1. Re:Ridiculous by jonabbey · · Score: 1

      Nice job on not reading the story. Shred calculates a panoply of md5 hashes for each text file.. essentially it creates an md5 hash of each 3 line segment of code, starting at every line offset.

    2. Re:Ridiculous by Another+MacHack · · Score: 1

      Shred works by taking checksums of groups of three lines. Adding a line at the end won't keep it from matching every other group of three lines.

      Furthermore, the linux code in question is a matter of historical record. The only code which could be usefully changed would be the code which is being released only via shred-indexes. In that case, though, the person trying to modify the code would only decrease the match-likelihood rating by adding random crap.
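
      To make the first point concrete, here is a toy check, assuming plain three-line windows hashed with MD5 and no other normalization. The ten-line "source file" is invented for the example.

```python
import hashlib


def shred_list(lines, size=3):
    """MD5 digest of each window of `size` consecutive lines, one per starting offset."""
    return [
        hashlib.md5("\n".join(lines[i:i + size]).encode("utf-8")).hexdigest()
        for i in range(len(lines) - size + 1)
    ]


source = [f"statement_{n}();" for n in range(10)]
padded = source + ["/* I'm defeating MD5 */"]

before = shred_list(source)
after = shred_list(padded)

# Every original shred is still present; the appended line only adds one new window.
print(len(before), len(after), len(set(before) & set(after)))  # prints 8 9 8
```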

    3. Re:Ridiculous by ckimyt · · Score: 1
      jonabbey wrote:
      Nice job on not reading the story. Shred calculates a panoply of md5 hashes for each text file.. essentially it creates an md5 hash of each 3 line segment of code, starting at every line offset.
      Touché. However, just change to:

      [EVILSCOBOX] /usr/codebase> find . \( -iname '*.c*' -o -iname '*.h*' \) -exec perl -pi -e 'chomp; $_ .= " /* still defeating MD5 */\n"' {} \;

      and you're golden.

      Another MacHack wrote:
      Furthermore, the linux code in question is a matter of historical record
      Notice what I quoted...specifically that you can compare proprietary source trees this way. If they don't have to release the source code, they can just munge their own like the above to change the hashes.

      Of course, I guess SCO's aim would be to show the same hashes, so this technique doesn't really apply here. It would, however, apply to Microsoft or other companies incorporating GPL code into their proprietary products.
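
A rough sketch of why the per-line trick works and of one possible countermeasure. It assumes three-line MD5 shreds as described in the thread; the `strip_comments` step is hypothetical and is not claimed to be something comparator does:

```python
import hashlib
import re

# Matches a C block comment that opens and closes on the same line.
COMMENT = re.compile(r"/\*.*?\*/")

def strip_comments(line):
    """Hypothetical preprocessing step: drop single-line /* ... */ comments."""
    return COMMENT.sub("", line).rstrip()

def shreds(lines, size=3):
    """Set of MD5 hex digests over every run of `size` consecutive lines."""
    return {
        hashlib.md5("\n".join(lines[i:i + size]).encode()).hexdigest()
        for i in range(len(lines) - size + 1)
    }

original = ["int a;", "a = 1;", "a++;", "return a;"]
munged = [line + " /* still defeating MD5 */" for line in original]

# Hashing the munged text directly finds nothing in common.
raw_overlap = shreds(original) & shreds(munged)

# Stripping comments before hashing recovers every original shred.
clean_overlap = shreds(original) & shreds([strip_comments(l) for l in munged])

print(len(raw_overlap), len(clean_overlap))
```

This only helps if whoever generates the shreds runs the same preprocessing on both trees, which brings back the question of who is trusted to generate them.
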
      --

      Putting the sig back into +1, Insightful since 1995!
    4. Re:Ridiculous by Anonymous Coward · · Score: 0
      Of course, I guess SCO's aim would be to show the same hashes, so this technique doesn't really apply here.


      Interesting choice of topic title then...
  28. Dibs on naming the KDE GUI by Eberlin · · Score: 4, Funny

    KDE GUI version should be called Krang since Shredder would obviously be used from the command line (shell). Maybe it should have helper apps called Bebop and Rocksteady. And if the need should arise, the project shouldn't fork...it should splinter.

    1. Re:Dibs on naming the KDE GUI by Anonymous Coward · · Score: 0

      This hardly seems worth the effort, but
      BOO.

  29. This is actually a darn good idea by RocketRick · · Score: 5, Informative

    By computing MD5 hashes of consecutive (overlapping) line triplets, the shred algorithm makes it easy to identify copied code, without ever seeing the actual code. This might be a perfect way for companies to allow a third party to compare code, without giving away any trade secrets in the process.

    Of course, since MD5 is a very good cryptographic hash function, *any* one-bit change in the source will result in, on average, half of the bits in the result being flipped. So, this method of identifying copied code would only work if the code had never been run through an obfuscator. It would also be defeatable by running the source through a script to have its variable names search-and-replaced with similar names (such as replacing every variable name with a new name consisting of the old name plus "_newname")....

    In short, this might be a useful technique for allowing a third party to look for trivial wholesale copying of code, but it would be useless for finding a motivated miscreant, determined to steal code without being caught.
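
The idea in this comment can be sketched as follows. This is an illustration under stated assumptions (overlapping three-line windows, lines joined with newlines, MD5 via Python's `hashlib`), not ESR's implementation; `shred_index` and `common_shreds` are names made up for the example:

```python
import hashlib

def shred_index(lines, size=3):
    """Map each window's MD5 digest to the 1-based line numbers where it starts."""
    index = {}
    for i in range(len(lines) - size + 1):
        digest = hashlib.md5("\n".join(lines[i:i + size]).encode()).hexdigest()
        index.setdefault(digest, []).append(i + 1)
    return index

def common_shreds(index_a, index_b):
    """Digests present in both indexes, with the start lines on each side."""
    return {d: (index_a[d], index_b[d]) for d in index_a.keys() & index_b.keys()}

tree_a = ["x = 0;", "x++;", "y = x;", "return y;"]
shifted = ["/* header */"] + tree_a                      # same code, moved down a line
renamed = [l.replace("x", "x_newname") for l in tree_a]  # the "_newname" trick

copied = common_shreds(shred_index(tree_a), shred_index(shifted))
hidden = common_shreds(shred_index(tree_a), shred_index(renamed))

print(len(copied), "matches after shifting,", len(hidden), "after renaming")
```

Only the indexes need to change hands, which is what makes the third-party scenario possible. The renamed copy shares no digest with the original, which is the weakness the comment describes.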

    1. Re:This is actually a darn good idea by Anonymous Coward · · Score: 0

      It's useful to prove equivalence, not disprove it. Duh. You could just add a space on the end of every line if you wanted to throw off the checksums. You could just generate random "checksums" and present those as your actual results if you weren't going to show the code anyway. Since either party could "cheat" by faking the results, it's only useful when two mutually distrusting parties are *genuinely* interested in the answer of how similar their documents are, but don't want to disclose.

    2. Re:This is actually a darn good idea by rossz · · Score: 1

      So add a preprocess that standardizes the variable names. I'd also have the shredder ignore whitespace and line breaks in its computation.

      --
      -- Will program for bandwidth
    3. Re:This is actually a darn good idea by karmavore · · Score: 1

      It would also be defeatable by running the source through a script to have its variable names search-and-replaced with similar names (such as replacing every variable name with a new name consisting of the old name plus "_newname")....

      When I steal code I use names from the phone book for the variables.

      For the numeric constants I use phone numbers.

      --
      Speech: Free
      Beer: $699.00
    4. Re:This is actually a darn good idea by Trailer+Trash · · Score: 5, Insightful

      So, this method of identifying copied code would only work if the code had never been run through an obfuscator.

      You've hit the nail on the head, possibly without knowing it. The source code needs to be run through an obfuscator *before* shredding. Actually, I'm thinking a special obfuscator, let me explain.

      Let's take a piece of C source, not randomly chosen:

      malloc(mp, size) struct map *mp; { register int a; register struct map *bp; for (bp = mp; bp->m_size; bp++) { if (bp->m_size >= size) { a = bp->m_addr; bp->m_addr =+ size; if ((bp->m_size =- size) == 0) do { bp++; (bp-1)->m_addr = bp->m_addr; } while ((bp-1)->m_size = bp->m_size); return(a); } } return(0); } Now, the structure of the code is 99% of what matters. Variable names can change, but few people would change anything beyond that. Let's modify the code in a couple of important ways. First, all variable names are changed to new names, on a per-line basis. Blank lines and unneeded blanks are all removed. Each statement is on its own line, and formatting styles (such as curly bracket placement) are standardized. malloc(a, b) struct a *b; { register int a; register struct map *b; for (a=b;a->c;a++) { if (a->b>= c) { a=b->c; a->b=+c; if ((a->b=-c)==0) do { a++; (a-1)->b=a->b; } while ((a-1)->b=a->b); return(a); } } return(0); }

      This might not be perfect, but it should do the trick. A programmer can change variable names, spacing, or format, but as long as the code is the same, it'll match. Obviously, changing the code would have an impact, but nearly every line would have to be changed for it to not match, and in a substantial way. That's literally not always possible to even do in a way that would trick this function.

      Anyone want to write it?

      Michael

    5. Re:This is actually a darn good idea by Trailer+Trash · · Score: 1

      Okay, always preview and find that "pre" isn't one of your options. Let's try it again.

      So, this method of identifying copied code would only work if the code had never been run through an obfuscator.

      You've hit the nail on the head, possibly without knowing it. The source code needs to be run through an obfuscator *before* shredding. Actually, I'm thinking a special obfuscator, let me explain.

      Let's take a piece of C source, not randomly chosen:

      No idea how to format this in slashdot:

      malloc(mp, size)
      struct map *mp;
      {
          register int a;
          register struct map *bp;

          for (bp = mp; bp->m_size; bp++) {
              if (bp->m_size >= size) {
                  a = bp->m_addr;
                  bp->m_addr =+ size;
                  if ((bp->m_size =- size) == 0)
                      do {
                          bp++;
                          (bp-1)->m_addr = bp->m_addr;
                      } while ((bp-1)->m_size = bp->m_size);
                  return(a);
              }
          }
          return(0);
      }

      Now, the structure of the code is 99% of what matters. Variable names can change, but few people would change anything beyond that. Let's modify the code in a couple of important ways. First, all variable names are changed to new names, on a per-line basis. Blank lines and unneeded blanks are all removed. Each statement is on its own line, and formatting styles (such as curly bracket placement) are standardized.

      malloc(a,b)
      struct a *b;
      {
      register int a;
      register struct map *b;
      for (a=b;a->c;a++) {
      if (a->b>= c) {
      a=b->c;
      a->b=+c;
      if ((a->b=-c)==0)
      do {
      a++;
      (a-1)->b=a->b;
      } while ((a-1)->b=a->b);
      return(a);
      }
      }
      return(0);
      }

      This might not be perfect, but it should do the trick. A programmer can change variable names, spacing, or format, but as long as the code is the same, it'll match. Obviously, changing the code would have an impact, but nearly every line would have to be changed for it to not match, and in a substantial way. That's literally not always possible to even do in a way that would trick this function.

      Anyone want to write it?

      Michael
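
One way the per-statement renaming described above might be written, as a rough sketch. It follows the description (names restart at `a` on every line, whitespace is dropped) rather than the hand-worked example, and it uses a crude regex tokenizer instead of a real C lexer, so things like hex literals or string constants are not handled. `normalize_statement` and the keyword list are assumptions made for illustration:

```python
import re
import string

# Words that must survive renaming (a partial list of C keywords).
C_KEYWORDS = {
    "auto", "break", "case", "char", "continue", "default", "do", "double",
    "else", "extern", "float", "for", "goto", "if", "int", "long",
    "register", "return", "short", "sizeof", "static", "struct", "switch",
    "typedef", "union", "unsigned", "void", "while",
}
# Identifiers/keywords, integer literals, or any single non-space character.
TOKEN = re.compile(r"[A-Za-z_]\w*|\d+|\S")

def placeholder(n):
    """a, b, c, ... then v26, v27, ... for very long statements."""
    return string.ascii_lowercase[n] if n < 26 else f"v{n}"

def is_wordlike(ch):
    return ch.isalnum() or ch == "_"

def normalize_statement(line):
    """Rename identifiers in order of first appearance and drop spare blanks."""
    names = {}  # fresh for every statement, so numbering restarts each line
    out = []
    for tok in TOKEN.findall(line):
        if (tok[0].isalpha() or tok[0] == "_") and tok not in C_KEYWORDS:
            tok = names.setdefault(tok, placeholder(len(names)))
        # Keep one space only where two word-like tokens would otherwise fuse.
        if out and is_wordlike(out[-1][-1]) and is_wordlike(tok[0]):
            out.append(" ")
        out.append(tok)
    return "".join(out)

print(normalize_statement("bp->m_addr =+ size;"))
print(normalize_statement("  p -> addr=+ n ;"))
```

Both calls produce the same normalized text, so the two statements would hash identically once shredded. The normalized lines would then be fed to the shredder in place of the raw source.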

    6. Re:This is actually a darn good idea by fdicostanzo · · Score: 1

      given that there are only so many useful ways to structure a C statement, couldn't one take the MD5 hashes and generate a table of common 3 lines of C code hashes and reconstruct most of the original code this way? There will be billions of combinations (more even) but that wouldn't take all that long to work out. A variant of crack could do it. The key is that there are only so many useful combinations.

      This would work if the code was first de-obfuscated as you mention. It may not find all the lines but it would find lots and the holes could be worked out by the context. Remember that only one of 3 hashes would need to be worked out in order to figure out a particular line.
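
A toy version of that table-lookup attack, assuming three-line MD5 shreds and a small vocabulary of candidate lines that the attacker has guessed. The vocabulary, the `recover` helper and the "secret" file are all invented for the example:

```python
import hashlib
from itertools import product

def digest(window):
    """MD5 hex digest of a window of lines joined by newlines."""
    return hashlib.md5("\n".join(window).encode()).hexdigest()

def recover(published, vocabulary, size=3):
    """Return {digest: lines} for each published digest the vocabulary can rebuild."""
    found = {}
    for candidate in product(vocabulary, repeat=size):
        d = digest(candidate)
        if d in published:
            found[d] = candidate
    return found

# The "secret" source is known only through its published shreds.
secret = ["i = 0;", "}", "}", "}", "return(0);"]
published = {digest(secret[i:i + 3]) for i in range(len(secret) - 2)}

# Lines the attacker guesses are common in normalized C code.
vocabulary = ["}", "{", "i = 0;", "return(0);", "i++;"]

recovered = recover(published, vocabulary)
print(len(recovered), "of", len(published), "shreds reconstructed")
```

The search tries `len(vocabulary) ** 3` windows, so it is only practical when the set of plausible lines is small, which is more likely after the kind of normalization discussed above.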

      --
      Synergies are basically awesome, and they're even better when you leverage them. -PA
    7. Re:This is actually a darn good idea by Anonymous Coward · · Score: 0

      As a preprocessing step it should strip whitespace and convert all variables and constants to the same name.

      A slightly modified source beautifier would be the best bet for the preprocessor.

    8. Re:This is actually a darn good idea by Anonymous Coward · · Score: 0

      In a preprocessing step, all variable names should be set to the same name or removed. All constant names should also be set to the same or removed.

      To see why this is, think of a small code snippet from your first example above incorporated into two much larger but unrelated code samples. In the preprocessing step you suggested, the variable names of the identical code segments will most likely get unrelated variable names since the code above them will most likely have different number of variables. Thus their MD5 hashes will not be the same. Same holds true for constant names.

      Removing or setting all variable and constant names to the same name fixes this problem.
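
A minimal sketch of this suggestion: every identifier becomes the same token, every numeric constant becomes another, and all whitespace is dropped. The placeholder names `ID` and `N`, the `collapse` helper and the partial keyword list are assumptions for illustration, and the regexes are not a real C lexer:

```python
import re

# Partial list of C keywords that should be left alone.
C_KEYWORDS = {
    "break", "case", "char", "continue", "do", "else", "for", "if", "int",
    "long", "register", "return", "sizeof", "static", "struct", "switch",
    "unsigned", "void", "while",
}
WORD = re.compile(r"[A-Za-z_]\w*")
NUMBER = re.compile(r"\b\d+\b")

def collapse(line):
    """Replace identifiers with ID, integer constants with N, and drop whitespace."""
    line = WORD.sub(lambda m: m.group(0) if m.group(0) in C_KEYWORDS else "ID", line)
    line = NUMBER.sub("N", line)
    return re.sub(r"\s+", "", line)

# The same statement written with different names in two unrelated programs.
print(collapse("bp->m_addr =+ size;"))
print(collapse("q->base =+ len;"))
```

The result is a fingerprint rather than compilable code. Because no numbering is carried from one line to the next, the surrounding program cannot change what a given statement collapses to, at the cost of more accidental matches.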

    9. Re:This is actually a darn good idea by Kurt+Gray · · Score: 1

      You bring up a good point about simply switching variable and function names would throw off the comparator and I'm assuming ESR thought of that already. In that if it's not already a feature the obvious first feature to add would be for comparator to have the option of ignoring variable names, function names, and comments.

    10. Re:This is actually a darn good idea by Tokerat · · Score: 1

      In that if it's not already a feature the obvious first feature to add would be for comparator to have the option of ignoring variable names, function names, and comments.
      Consider then things like that malloc example SCO gave: suppose an academic implementation of a certain code snippet is used by two programmers in two different projects. Each has different variables, but the lines of code are syntactically exact in some places. This would throw off the ESR checker as well.

      That being said, it's still a good idea to have this feature in there. This will be more of a lie detector than anything: Lie detectors need people to operate them and interpret their readings. There won't ever be a "guilty" light.
      --
      CAn'T CompreHend SARcaSm?
    11. Re:This is actually a darn good idea by Trailer+Trash · · Score: 1

      You didn't read closely enough. Variable renaming starts at the beginning again on each new line. You're right; otherwise simply adding a new variable to the program would throw the whole thing off. But you're also correct that it might not even matter to know the different variables, just getting the high-level structure may well be enough to differentiate it.

      All stuff to consider.

    12. Re:This is actually a darn good idea by LilJC · · Score: 1
      I'm not sure I agree with this idea. Because nobody seems to trust SCO's claims to begin with, I think they could play dirty pool with this too. All they have to do is put a hush-hush stop on checkins/outs to their developers, stick the linux kernel code in their CVS, and let any third party who wants to compare the supposed SCO code with the linux kernel code.

      After they have "proven" their code is identical using a third party, they stick their own code back in the CVS, and are up and running with the code that was never stolen in the first place. Since the third party doesn't see the code, nobody's the wiser. At least until a court date - but then the linux community is face-deep in egg with injunctions out the wazoo until some court date in 2005. Even there, we can only hope that the real code is examined and their claims are falsified in another 18 months. Until then, we'd be begging for the days of our troubles being limited to removal of RPC and SoBig worms off every friend/family member's computer we have.

      While I wish things were as simple as this, I don't know of any way for their claims to be verified other than a peer review by qualified *nix programmers, who as of now have to sign their career away with an NDA.

      --

      The only thing more dangerous than a file named -rf is renaming it -rf\ /
    13. Re:This is actually a darn good idea by Anonymous Coward · · Score: 0

      I'm amazed that nobody has mentioned MOSS yet. It is mainly used for checking CS students' assignments for plagiarism, and does exactly what we want.

      It preprocesses code first (actually it works on different languages, and even things other than code, by using different preprocessor plug-ins) then takes hashes and compares. It manages to detect almost all the things students do to try to cover up their copying.

    14. Re:This is actually a darn good idea by Thing+1 · · Score: 1
      Great idea! However there's a bug in your example. The first two lines of the example are (btw, the courier-formatting is done with the construct):

      malloc(mp, size)
      struct map *mp;
      The first two lines of your processed version are:
      malloc(a,b)
      struct a *b;
      However, there are three separate "entities" which you've reduced to two. It should read:
      malloc(a,b)
      struct c *b;
      You also reuse "a" since the original code had "register int a;" which you didn't change to "register int d;" and other errors cascading like this.

      I like what one response said, which is that all variables and constants should be named the exact same thing. This would of course no longer compile, but it would eliminate variable-name issues. Of course, it also might find more false positives than another method, which wouldn't necessarily be bad as it might point to areas in which code was copied then obfuscated (but it would probably just create more work for the human(s) doing the double-checking).

      I think this is a neat type of problem to solve. Years ago I wrote a utility to produce HTML-output from a Perl script; it first listed all subroutines in the order they appear (each sub is clickable, and goes to the place in the script that the sub is defined); then in alphabetical order; then it lists the entire script. It was very useful for helping to figure out someone else's code. I would imagine it could easily be modified to work with C, Java, or other languages.

      --
      I feel fantastic, and I'm still alive.
    15. Re:This is actually a darn good idea by Trailer+Trash · · Score: 1

      Great idea! However there's a bug in your example. The first two lines of the example are (btw, the courier-formatting is done with the construct):

      This is the second time that I've repeated this. Read closely, we are not renaming variables consistently across the function. If we did, the addition of a variable would completely change the function.

      Instead, we are naming variables with a consistent pattern per-statement. That pattern starts over on every statement. We're just looking at the general structure of the program, and this brings out the structure.

  30. I can decipher it! by normal_guy · · Score: 0

    What do we get? It's like SCO is holding a handgrenade and people are slowly moving away from the madman. Shhh! You are breaking my concentration! I'm trying to shed a bitter tear for them. You mean this whole lawsuit thing is for real?

    This Comment was generated with the Comment-O-Matic for SCO Stories.

    --

    Linux: Free if your time is worthless.
  31. SCOs IP violations will be obvious... by Anonymous Coward · · Score: 0

    This will also find all the places SCO has
    violated the terms of other organizations' IP...

  32. Ups and downs by autocracy · · Score: 4, Informative
    Upside: we can maybe help catch more stolen code.
    Downside: Uh... it just came out... and it's making some big, big claims involving fuzzy logic. I think it's gonna need some testing first, eh?

    Also, anybody else think it only works on larger sections of code than just say 10 lines?

    --
    SIG: HUP
    1. Re:Ups and downs by ckimyt · · Score: 1
      anybody else think it only works on larger sections of code than just say 10 lines?
      Why not make a parse tree of the source and compare it by structure, instead of just lexically?
      --

      Putting the sig back into +1, Insightful since 1995!
    2. Re:Ups and downs by lspd · · Score: 1

      Downside: Uh... it just came out... and it's making some big, big claims involving fuzzy logic. I think it's gonna need some testing first, eh?

      This is the same technique that was used to find the similarity between malloc.c in ancient Unix and ate_utils.c in Linux 2.4.21.

  33. Bad for Students by chicagoan · · Score: 2, Funny

    I'm just glad that I finished college before they had this technology otherwise I might have been caught for cheating. Although I was really good at renaming variables.

    1. Re:Bad for Students by i_am_nitrogen · · Score: 1

      Remind me not to hire anybody from Chicago...

      I hope for the world's sake you're joking. If you're not joking, I hope for your sake you don't get hired by anyone making critical systems.

  34. Eh, what? by Spamalamadingdong · · Score: 1
    Although I acknowledge that SCO likely has no IP claim over Linux, it should have a fair case.
    A fair case of what? Proving use of trade secrets (which had been widely distributed and taught in universities for decades)? SCO has made no claims of copyright or patent infringement.

    SCO is going to get its corporate head handed to it on a platter, and I hope that the courts allow the corporate veil to be pierced so that McBride and company have to bear the cost of their misdeeds personally (and not just the duped stockholders).

  35. Automating people's careers away by YetAnotherName · · Score: 4, Funny

    Thanks ESR. You've just put a team of mathematicians at SCO who were somehow related to MIT out of their jobs.

    1. Re:Automating people's careers away by Lodragandraoidh · · Score: 1

      Sign seen at Utah bars:

      Earn $$$ verifying Unix code! Don't know what Unix is, or where the restroom is for that matter? No problem! We will pay for your honorary 'MIT' mathematics degree, to lend you that air of authenticity! Call 555-1111...

      --

      Lodragan Draoidh
      The more you explain it, the more I don't understand it. - Mark Twain
    2. Re:Automating people's careers away by eric76 · · Score: 1

      They still have jobs.

      Now they are preparing a list of all possible MD5 hash codes.

      SCO is going to prove once and for all that every single line of Linux was directly copied from UNIX.

    3. Re:Automating people's careers away by Anonymous Coward · · Score: 0


      MIT doesn't give out *any* honorary degrees, dammit!

    4. Re:Automating people's careers away by davejenkins · · Score: 1

      Paul Hatch, a SCO spokesman, wrote in a statement to The Tech, "To clarify, the individuals reviewing the code had been involved with MIT labs in the past, but are not currently at MIT. Unfortunately, due to contractual obligations, we cannot specifically name the individuals."

      Yeah, they delivered pizza in the greater Cambridge area, including a couple of visits to the math labs at MIT.

    5. Re:Automating people's careers away by Lodragandraoidh · · Score: 1

      Milton's Institute of Tricology does...

      --

      Lodragan Draoidh
      The more you explain it, the more I don't understand it. - Mark Twain
  36. Interesting implementation, but flawed by mTor · · Score: 1

    A simple re-indentation or a variable change would fool comparator. What someone needs to do is to implement a parse tree comparison tool which would be able to compare files on a semantic level.

    1. Re:Interesting implementation, but flawed by Anonymous Coward · · Score: 0
    2. Re:Interesting implementation, but flawed by El · · Score: 1

      It would be trivial to front-end this with a C beautifier or some other method of converting the white space to canonical form. Renaming entities would be more difficult; in that case, you probably need to parse the C and compare the parse trees -- isn't that what they use to catch plagiarizers in college? All my CS lab students were at least smart enough to rename the variables...

      --

      "Freedom means freedom for everybody" -- Dick Cheney

    3. Re:Interesting implementation, but flawed by johny_qst · · Score: 1

      I think you are both right on the point about parse tree comparison being the way to go, but is there a way to truly stop the motivated code stealer? Though this would allow simple variable name changes to be caught, couldn't any amount of code become obfuscated just by a simple alteration of semantics? Though with all that work the creative CS student would probably know a lot more about how the code works.

      --
      Fnord.sig
    4. Re:Interesting implementation, but flawed by tamyrlin · · Score: 1

      The tool has an option to remove whitespace.

      But using a parse tree comparison would be interesting anyway. Especially if it could be compared with object code for which the source code is not available.

  37. from the Bouncing legs dept. by bobdotorg · · Score: 0

    When I saw the headline 'shred SCO', 'from the woodchipper department' a vision of a corporate purge 'Fargo Style' popped into my head.

    --
    __ Someday, but not this morning, I'll finally learn to use the preview button.
  38. Nonsensical idea by YU+Nicks+NE+Way · · Score: 1, Interesting

    Great. So cool. And so stupid.

    First, IBM, Sequent, SGI and Linux wouldn't be off the hook if the provenance of each line of code were proven to have come from other sources. There are a number of trade secret issues that still could crop up.

    But let's assume that Raymond's work was actually run on the SCO source and on Linux. Would the results be meaningful?

    No.

    Suppose I have a routine that comes originally from source B. I work for a company which has the right to copy B, but which redistributes the results of its work under a closed license. Call that new source S. It so happens that the code my company got from B had a nasty bug in it, and I spent a month finding a fix for that bug. Suppose also that the fix is quite small relative to the original code, as is usually the case. A shredder is going to find significant similarities between that routine as implemented in source B and in source S. Now, suppose source L comes along. The authors of L had the right to copy from B, but not from S. They have a very similar routine, originally derived from B. After shredding, the routines in B, S, and L will all look similar -- but whether there's an infringement between S and L will depend solely on a tiny fragment of the code. Without disclosing that fragment, there is no way to determine if there's an infringement or not.

    1. Re:Nonsensical idea by El · · Score: 4, Insightful

      Comparing the hashes doesn't give you a definitive answer; it does, however, tell you where to look. Or which submitters to ask for clarification on the origins of potentially infringing code. That's more than we have now!

      --

      "Freedom means freedom for everybody" -- Dick Cheney

    2. Re:Nonsensical idea by Ninja+Programmer · · Score: 1
      Comparing the hashes doesn't give you a definitive answer; it does, however, tell you where to look.
      No. You do not understand. MD5 has a 128 bit output. 128 bits is a *LOT*. If you were to take every 3-line sequence of text written by every person ever in the history of man and compute the MD5 of each of them, the probability that *ANY* two have the same MD5 sum, but which are not identical in the first place, is vanishingly small.

      If you found any such collision, you could probably write a paper on it, and it would very readily be published in any crypto publication.
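
The arithmetic behind that claim can be sketched with the standard birthday (union) bound: for n distinct inputs and a hash that behaves like a random 128-bit function, the chance that any two collide is at most n(n-1)/2 divided by 2^128. The figure of 10^15 three-line sequences is an arbitrary assumption chosen only to make the point, and the bound covers accidental collisions, not inputs constructed on purpose to collide:

```python
from fractions import Fraction

def collision_bound(n, bits=128):
    """Upper bound on P(some pair among n distinct inputs shares a hash value)."""
    pairs = n * (n - 1) // 2          # number of distinct pairs
    return Fraction(pairs, 2 ** bits)

# Hypothetical corpus size: a million billion distinct three-line windows.
bound = collision_bound(10 ** 15)
print(float(bound))
```

Even at that scale the bound is on the order of one in a billion, so a matching shred between two honest trees is far more likely to mean identical text than a coincidence of the hash.
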
  39. Slim to None by tomRakewell · · Score: 5, Insightful

    Chances are slim to none that a software company would allow its "shredded" source code to be publicly released. What happens if the proprietary source is found to violate the GPL?

    Proprietary (closed) source companies have a tremendous advantage over open source software when it comes to violating intellectual property. Who will ever know if they did it? A source code "comparator" eliminates that crucial advantage.

    1. Re:Slim to None by Anonymous Coward · · Score: 0

      But, and I don't know the answer, does the company have to allow people to "shred" its product's binary code?

      Why can't anyone do it?

    2. Re:Slim to None by Dr_LHA · · Score: 1

      Simple - we go to Congress and lobby for powers for our "GNU Software Alliance". Then get the Federal marshals along and bust down the doors of proprietary software vendors demanding a "GPL Compliance Audit" or a hefty fine.

    3. Re:Slim to None by JoeBuck · · Score: 4, Interesting

      But IBM already has a copy of SCO's code; they licensed it after all. They can release the output of "shred" without violating their agreements with SCO.

    4. Re:Slim to None by k98sven · · Score: 2, Interesting

      Proprietary (closed) source companies have a tremendous advantage over open source software when it comes to violating intellectual property. Who will ever know if they did it? A source code "comparator" eliminates that crucial advantage

      Not really.. Open-source software usually has a nice setup with mailing-lists, CVS, etc. Most of the code is well accounted-for. The same is not as true with a lot of proprietary software.

      Remember, it's not enough that two pieces of code match to prove an infringement in court.
      In fact, the court will most likely take into consideration the fact that the defending code is open-source, and the burden of proving that they originated the code would be increased for the plaintiff.

      Also, failing to prove that they originated the code could leave them open to a countersuit in which the tables would be turned against them, since they obviously had access to the open-sourced code.

    5. Re:Slim to None by Anonymous Coward · · Score: 0

      What argument can they use to justify NOT releasing the shredded data? Since the shredded data doesn't give away any of their IP, they cannot claim it is to protect that. Basically, a company that doesn't release the data when asked is obviously trying to hide something so is ripe for IP infringement lawsuits.

    6. Re:Slim to None by bladernr · · Score: 1

      WHAT?!?!?!

      I'm sorry, but this just burns me. This is like the RIAA telling me "If you don't have any songs to hide, then show me your hard drive. What, you want a warrant and your rights? Anyone demanding their rights must have something to hide"

      That is completely flawed. Failure to provide something does not prove guilt. I do not want to live in a society where I am assumed guilty (or my business is) by not submitting to searches, or, basically, by refusing to do anything.

      What about my right not to do anything?

      Sorry, it's a bit of a rant, but this just smacked of the Patriot Act and the idea that none of us should mind being monitored, because only criminals mind that. It's called a police state, and I don't like anything that smells like it.

      --
      Sarcasm and hyperbole are the final refuges for weak minds
    7. Re:Slim to None by Valar · · Score: 1

      not to mention, you could actually reverse engineer the code from the md5s assuming a) the lines are short enough, b)you have enough memory, c) you have enough processing power. Especially if you have examples from other bits of their code that you can use to make intelligent guesses...

    8. Re:Slim to None by fruitbatUK · · Score: 1
      not to mention, you could actually reverse engineer the code from the md5s assuming a) the lines are short enough, b)you have enough memory, c) you have enough processing power. Especially if you have examples from other bits of their code that you can use to make intelligent guesses...

      No, you couldn't. MD5 is a checksum. There is a many-to-one mapping, so even if you had the computing resources to try all possibilities, you couldn't tell which one was actually correct.

    9. Re:Slim to None by IM6100 · · Score: 1

      That sounds like an excellent way to get companies warmed up to allowing any GPL'd code at all in their organization.

      (please note I was using sarcasm)

      --
      A Good Intro to NetBS
    10. Re:Slim to None by shaitand · · Score: 1

      yes but in SCO's case there isn't much choice, IBM and a few thousand other individuals, institutions, and corporations all have the SysV code.

    11. Re:Slim to None by WNight · · Score: 1

      Now that an easy tool is available, look for whistle-blowers. All they have to say is that file X, lines Y-Z, of Windows XP, build N, are taken from linux and here are the MD5s. At this point it's relatively easy for someone to get a court order allowing them to view the code, under NDA of course. Without a tool like this people would have to actually read source and be familiar with potential sources in order to catch dupes. Not something someone with casual access could do.

      Hell, this could even store all the MD5 sums of open source projects on a website, allowing people to compare their code to large open source projects without even having to download the source.
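
A sketch of what such a published index might look like, assuming three-line MD5 shreds. The JSON layout, the project name and the helper names are invented for illustration and are not comparator's file format:

```python
import hashlib
import json

def shred_digests(lines, size=3):
    """MD5 hex digests of every run of `size` consecutive lines, in file order."""
    return [
        hashlib.md5("\n".join(lines[i:i + size]).encode()).hexdigest()
        for i in range(len(lines) - size + 1)
    ]

def export_index(project, lines):
    """Serialize a project's shreds so they can be published without the source."""
    return json.dumps({"project": project, "shreds": shred_digests(lines)})

def overlap(published_json, local_lines):
    """Count how many local shreds appear in a downloaded index."""
    published = set(json.loads(published_json)["shreds"])
    local = shred_digests(local_lines)
    hits = [d for d in local if d in published]
    return len(hits), len(local)

# Hypothetical open source file, and a local file that borrows three lines of it.
open_source = ["a();", "b();", "c();", "d();"]
local = ["x();", "a();", "b();", "c();", "y();"]

index = export_index("some-project", open_source)
print(overlap(index, local))
```

The local side never needs the open source text, only the downloaded digests, and the same holds in the other direction.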

      It would be interesting if "they" rewrote this to compare with or without comments, and with or without original variable names. I'd imagine in a few cases you'd want the code but you'd do a mini re-write to bring the variable names into line with your existing code. I wonder if comparing the parse tree, as suggested elsewhere in this over-thread, would provide good results?

    12. Re:Slim to None by shaitand · · Score: 1

      It would be wrong for there to be laws assuming you're guilty because you do not provide the hashes... that doesn't mean it wouldn't be FUCKING STUPID for the rest of the world not to know the implication.

    13. Re:Slim to None by WNight · · Score: 1

      Not at all. In this case it's SCO who claims their code is being infringed upon. It's their responsibility to prove this. Same as you having to show receipts for something if you claim I stole it. If you make a claim, prepare to back it up.

    14. Re:Slim to None by Anonymous Coward · · Score: 0

      Out of these possible reverse mappings to a MD5 hash:

      [1] ASoi349ds##92 YTsa!osfd
      [2] for(x=0;x=255;x++){
      [3] asoji3@#$@vjvopm#sdofm##%UI)SDddsfjop3#
      [4] RQWo9sdo3T#Roigeo9

      MY bet would be on number 2.

      But you are right in that reversing an MD5 hash is much too computationally expensive.

    15. Re:Slim to None by Anonymous Coward · · Score: 0

      I bet you could recognise the MD5 for these three lines fairly quickly... but much more would take a little longer:

      }
      }
      }
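
      A minimal sketch of the point above, assuming the published shreds are plain MD5 digests of three newline-joined lines (that encoding, the `COMMON_SHREDS` list and the function names are made up for illustration, not taken from comparator). Nobody reverses MD5 here; you just hash the boring candidates and look them up:

```python
import hashlib

def shred_digest(three_lines):
    """MD5 of three source lines joined by newlines (assumed shred format)."""
    return hashlib.md5("\n".join(three_lines).encode("utf-8")).hexdigest()

# Hypothetical dictionary of very common three-line runs in C code.
COMMON_SHREDS = [
    ["}", "}", "}"],
    ["break;", "}", "}"],
    ["return 0;", "}", ""],
]

# Precompute digest -> text once; lookups are then trivial.
LOOKUP = {shred_digest(s): s for s in COMMON_SHREDS}

def recognise(published_digests):
    """Return the published digests whose source text we can guess."""
    return {d: LOOKUP[d] for d in published_digests if d in LOOKUP}

published = [shred_digest(["}", "}", "}"]), "0" * 32]
guessed = recognise(published)
```

      Anything distinctive enough not to be in such a dictionary stays opaque, which is the "much more would take a little longer" part.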

    16. Re:Slim to None by Pharmboy · · Score: 1

      But IBM already has a copy of SCO's code; they licensed it after all. They can release the output of "shred" without violating their agreements with SCO.

      Unless SCO claims that the hash is "derivitive"...

      --
      Tequila: It's not just for breakfast anymore!
    17. Re:Slim to None by Tony-A · · Score: 1

      I wonder if comparing the parse tree, as suggested elsewhere in this over-thread, would provide good results?

      The odds of two independent, very well executed codings of a function matching are quite high. Actually there is a reasonable chance of the code matching right down to variable names and comments.

      What's more interesting is that with a large sample you should be able to find two "different" functions with identical parse trees.

    18. Re:Slim to None by Anonymous Coward · · Score: 0

      Sorry to be pedantic but... 2 is an infinite loop, though maybe you're just further trying to implicate the notion of computational expenses.

    19. Re:Slim to None by Anonymous Coward · · Score: 0

      Actually, in 2 the loop is never executed since the condition is never met. I had originally meant to use less-than-or-equal-to rather than equal-to.

    20. Re:Slim to None by Valar · · Score: 1

      thus, lines short enough. That's the reason that many authentication modules use md5, and limit the number of characters you can use.

    21. Re:Slim to None by Anonymous Coward · · Score: 0

      Unless SCO claims that the hash is "derivitive"...

      Hey Phagboy, it's spelled "derivative"

    22. Re:Slim to None by WNight · · Score: 1

      This is where you'd want to be careful in accusing someone of copying. If the function they "copied" is a string copy, or a string length, chances are that you both used the simplest C idiom for it. If it's something more complex, but still only six lines, you probably just reinvented the wheel. If it's sixty lines, or ten different functions in the same functional area of code, that might indicate copying.

      Actually, for a baseline, before scanning System V Unix against Linux 2.6, scan SysV against the largest non-unix codebase you can. Against Quake2 source for example. Then discount that level of matching when comparing SysV to Linux. You'd get more, because OSes do OS-ish type things, but it'd give you an idea of how many lines of completely unrelated things would match.
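
      A rough sketch of that baseline idea, assuming each tree has already been reduced to a set of shred digests (the toy strings below stand in for real MD5 values, and both function names are hypothetical):

```python
def match_rate(reference, other):
    """Fraction of the reference tree's shreds that also occur in `other`."""
    return len(reference & other) / len(reference) if reference else 0.0

def suspicious_shreds(reference, candidate, baseline):
    """Shreds shared with the candidate but absent from the unrelated baseline."""
    return (reference & candidate) - baseline

# Toy digests standing in for real MD5 shreds (hypothetical data).
sysv = {"strcpy-loop", "brace-run", "sched-core", "vfs-lookup"}
linux = {"strcpy-loop", "brace-run", "sched-core", "ext2-alloc"}
quake2 = {"strcpy-loop", "brace-run", "bsp-render"}

raw = match_rate(sysv, linux)      # 3 of 4 SysV shreds also appear in Linux
noise = match_rate(sysv, quake2)   # 2 of 4 appear even in unrelated code
left = suspicious_shreds(sysv, linux, quake2)
```

      Only what is left after discounting the unrelated codebase deserves a human look, and even that is a hint rather than proof.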

      Or, if you're SCO, sue id Software. They obviously copied your string handling code!

  40. Results Will Appear "Tainted" by zapf · · Score: 5, Insightful

    While I fully support ESR and the rest of the open source movement's defense of Linux against SCO, I have a feeling that this tool's results will not immediately be accepted by established media simply because of ESR's bias. A reporter looking into the SCO story who knows little about open source wouldn't trust a tool made by one side of the disagreement.

    It seems very important to me that "third parties" and experts who are not an integral part of the open-source movement validate that comparator works as intended and is effective at detecting code similarities. Hopefully we'll see some articles on respected sites in the next week or so with conclusive analyses of comparator. Not to mention a chance for someone to use it on SCO's code!

    Oh, and "Yes, I'm being deliberately vague and tantalizing" is quite funny.

    1. Re:Results Will Appear "Tainted" by nate1138 · · Score: 1

      It seems very important to me that "third parties" and experts who are not an integral part of the open-source movement validate that comparator works as intended

      I know! We could get Microsoft to do it!

      --
      Where's my lobbyist? Right here.
    2. Re:Results Will Appear "Tainted" by dmaxwell · · Score: 1

      When this whole debacle got rolling, the first thing that occurred to me was IBM has its own copy of the SysV sources. The shred idea is clever but I don't think it's so clever that it didn't immediately occur to the IBM engineers that have been sicced on this issue. Blowing SCO's so-called evidence out of the water is too obvious for IBM to miss. They have to be doing something like this. It's also another good reason for IBM to drag this out (in addition to making them burn through cash). Establishing the exact provenance of the matches to courtroom perfection and then indexing them will take time. Every time SCO introduces some of their tripe into evidence, you want to find the counter evidence immediately.

      On the other hand, RedHat does not have this handy clueclub sitting around to beat Darl senseless with. They'll need something like this.

      As for doing SCO's work for them, they should already know their "evidence" is shit. All "turning the shotgun into a rifle" (as one poster put it) will do is make it that much easier to remove any legitimate residue that may have found its way into Linux. Actually, that would help the good guys out. The immediate removal of anything that infringes demonstrates the good faith that SCO is demonstrably devoid of.

      The most damning outcome SCO can hope for is some small infringements that will be removed immediately. MS, Sun, and SCO aren't going to get the apocalyptic outcome they're lusting after.

    3. Re:Results Will Appear "Tainted" by fireboy1919 · · Score: 1

      Yeah, that'll work. I can see five possible answers to the question, "Do you have a bias?" which eliminates all experts.

      1) "I use Windows" - unsuitable in favor of SCO.
      2) "I use Linux boxen" - unsuitable in favor of IBM.
      3) "I'm a Mac person" - unsuitable in favor of IBM (thanks to OS X)
      4) "I use Unix boxen" - unsuitable in favor of IBM, with the exception of SCO users, who would favor SCO.
      5) "I don't use computers at all/I don't know what I use" - too ignorant for expert status, or lying.

      So I think the best we can do is have both sides validate it.

      --
      Mod me down and I will become more powerful than you can possibly imagine!
    4. Re:Results Will Appear "Tainted" by A+Big+Jerk · · Score: 1

      >A reporter looking into the SCO story who knows little about open source wouldn't trust a tool made by one side of the disagreement.

      Yeah, I can't imagine the reverse of that bias being true...

      --
      >> Buy yourself some extremely long bed sheets. You'll be making an escape rope out of them very soon.
    5. Re:Results Will Appear "Tainted" by Brandybuck · · Score: 5, Insightful

      A reporter looking into the SCO story who knows little about open source wouldn't trust a tool made by one side of the disagreement.

      Then why would a reporter trust the press releases that SCO puts out on a daily basis?

      The unfortunate reality is that they DO trust them. We may all think this is a joke here in our insular community, but the great majority of reporters report the press releases "as is". Then the analysts come along and refine those press releases into easily digestible chunks. Then the pundits come along with preconceptions based on those chunks. Ever wonder why the SCO stock keeps going up and up and up? It's because the only thing the general public knows about this issue has come from SCO.

      Anything that can help get the truth before the public eye is a Good Thing(tm). A tool that can mathematically "prove" that SCO is lying is valuable, even if most reporters suspect a bias.

      --
      Don't blame me, I didn't vote for either of them!
    6. Re:Results Will Appear "Tainted" by Anonymous Coward · · Score: 0

      Get some OpenVMS users - they hate IBM and SCO and anyone else having to do with UNIX equally.

    7. Re:Results Will Appear "Tainted" by Reteo+Varala · · Score: 1
      That's an interesting point, and while the likelihood might be good, two flaws exist in that reasoning:

      1: Whether or not the results of such a tool could be admissible would be the decision of an independent expert in the court. Such an expert can use the tool to compare several test documents, some with others, some with their copies, and some with modified copies. Once the testing is done, the experts can proclaim shred's evidence as truthful, and admissible as evidence in court.
      2: Too many people claim to be neutral, but they have a bias; look at the media these days! "Fair, balanced news," my ass.

    8. Re:Results Will Appear "Tainted" by platypus · · Score: 1

      When this whole debacle got rolling, the first thing that occurred to me was IBM has its own copy of the SysV sources. The shred idea is clever but I don't think it's so clever that it didn't immediately occur to the IBM engineers that have been sicced on this issue. Blowing SCO's so-called evidence out of the water is too obvious for IBM to miss. They have to be doing something like this. It's also another good reason for IBM to drag this out (in addition to making them burn through cash). Establishing the exact provenance of the matches to courtroom perfection and then indexing them will take time. Every time SCO introduces some of their tripe into evidence, you want to find the counter evidence immediately.

      Maybe, maybe not. I find this idea original enough that it might be that IBM hadn't thought to use this brute force approach.

      As we saw yesterday with IBM's subpoena (see groklaw) against Canopy (not SCO!), they seem to _really_ play this game the hard way, and they have a lot more weapons on their hands than comparing SCO's code vs. linux.

      As I interpret it, they wanted to show the Canopy guys that this is going to be tough, also for Canopy, not only for SCO.

      I have no idea what surprises the Canopy guys are in for next, but if I were IBM, at some point in the future I would suggest to the Canopy guys how many of the companies they have a stake in are probably violating some of my patents.

      This tool OTOH makes linux developers sort of independent from IBM, in that they now are able to hurt SCO themselves, for example by showing evidence that SCO has stolen OSS code.

    9. Re:Results Will Appear "Tainted" by Sven+Tuerpe · · Score: 1
      While I fully support ESR and the rest of the open source movement's defense of Linux against SCO, I have a feeling that this tool's results will not immediately be accepted by established media simply because of ESR's bias.

      Note that the tool itself is biased regardless of its author. Much like testing, which can show the presence of errors but not prove their absence, this tool can show the presence of copied lines but not prove the absence of copyright violations. So the bias actually seems to work for SCO because they could use it to substantiate their claims but the Linux community cannot use it to disprove SCO's (unsubstantiated) claims. Well, seems. The hidden message is: We really want to get rid of code illegally copied into Linux if there is any, and we don't need SCO's fscking support to find it.

      --
      http://erichsieht.wordpress.com/category/english/
    10. Re:Results Will Appear "Tainted" by taff^2 · · Score: 1

      The difference between ESR's comparator and any similar tool that SCO may produce is that ESR's version would be infinitely more trustworthy, as it is open source and could be scrutinized by anybody.

      Would SCO be so willing to divulge their methods for ascertaining the similarities between codebases?

      --
      Karma: Bad. (As in Good?)
  41. IBM has a project called History Flow by TedTschopp · · Score: 5, Interesting

    This is perhaps a better project and it would be interesting to see this tool run against the source.

    History Flow. The following is from their website:

    history flow
    visualizing dynamic, evolving documents and the interactions of multiple collaborating authors:

    Motivation
    Most documents are the product of continual evolution. An essay may undergo dozens of revisions; source code for a computer program may undergo thousands. And as online collaboration becomes increasingly common, we see more and more ever-evolving group-authored texts. This site is a preliminary report on a simple visual technique, history flow, that provides a clear view of complex records of contributions and collaboration.

    --
    Fantasy remains a human right; we make in our measure and in our derivative mode... -- JRR Tolkien
    1. Re:IBM has a project called History Flow by GigsVT · · Score: 1

      Cvsweb with colored diffs also does this pretty well. Of course that won't help with comparing existing documents, unless you merge each one into a CVS tree in turn.

      --
      I've had enough abrasive sigs. Kittens are cute and fuzzy.
  42. Re:The real question is: by Anonymous Coward · · Score: 0

    Shut up Darl

  43. Would this really work? by Paradox · · Score: 3, Insightful

    Well, I was looking at ESR's description of the code (I haven't read the code yet), and it seems to say that he takes 3 line slices, MD5s them, then compares them for identical points. I'm sure he compensates for funky whitespace and whatnot like diff and patch do...

    But if even one bit of the source is different, the MD5 hash will be quite different. So, the code slices have to be IDENTICAL. This is not a very good system because a simple find-replace could defeat it. A variable's name changed by one letter, or even capitalization, will defeat it.

    Unless the code reveals much more complex tricks than ESR describes in the help file, this tool wouldn't be much use in the SCO case. Hell, it wouldn't be much use catching college class cheaters even.
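
    To make the objection concrete, here is a minimal sketch of the scheme as I read the description: hash every overlapping 3-line window and intersect the digest sets. This is my own toy, not ESR's code; the per-line `strip()` normalization and the function names are assumptions.

```python
import hashlib

def shred(lines, window=3):
    """Map MD5 digest -> list of 1-based start lines for each window."""
    norm = [ln.strip() for ln in lines]  # assumed whitespace normalization
    shreds = {}
    for i in range(len(norm) - window + 1):
        chunk = "\n".join(norm[i:i + window])
        digest = hashlib.md5(chunk.encode("utf-8")).hexdigest()
        shreds.setdefault(digest, []).append(i + 1)
    return shreds

def common_shreds(a, b):
    """Digests present in both trees, with the line numbers from each."""
    return {d: (a[d], b[d]) for d in a.keys() & b.keys()}

original = ["int i;", "for (i = 0; i < n; i++) {",
            "    total += v[i];", "}", "return total;"]
renamed = ["int i;", "for (i = 0; i < n; i++) {",
           "    sum += v[i];", "}", "return sum;"]

same = common_shreds(shred(original), shred(original))
after_rename = common_shreds(shred(original), shred(renamed))
```

    In this five-line toy every window touches the renamed variable, so `after_rename` comes back empty even though the code is obviously the same.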

    --
    Slashdot. It's Not For Common Sense
    1. Re:Would this really work? by net_bh · · Score: 3, Insightful
      But SCO claims that a lot of code has been copied *as-is* along with the comments.

      The tool ought to be able to highlight all those flagrant cases (if any) and the report generator would then generate something that would be perused by a human.

      --
      There is no patch for stupidity

      Visit my blog

    2. Re:Would this really work? by Paradox · · Score: 1

      Would it? What if it's been reformatted into the GNU coding style? Spacing matters.

      Let's say it was copied more or less verbatim, but the number of lines is slightly different by a comment (which should be ignored). Unless these situations are corrected for, the windows could be misaligned. Suddenly identical regions that are shifted slightly would appear different, and the docs say that it throws out differences.

      Even if this tool works much more intelligently than is explained, it wouldn't matter to the SCO case. They'll just claim it's obfuscated, and even the most minuscule of changes would ruin this tool's ability to give accurate results, even with a window of 1.

      Having helped work on my University's source code analysis tool that they use to prevent cheating, I can say this kind of approach would even have problems keeping honest people honest, let alone catching dishonest people.

      --
      Slashdot. It's Not For Common Sense
    3. Re:Would this really work? by mugnyte · · Score: 1


      Comparisons should do an exact match, with preprocessors for generalization beforehand. Functions and variables can be auto-named, for example. Even conditionals can be ripped apart and rebuilt in a standard way to check for simple hacks to these. Ignoring whitespace and using C preprocessor output may be good defaults.
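
      Something like this is what I mean by auto-naming, as a rough sketch only: the tokenizer is a crude regex, the keyword list is partial, comments are not stripped, and none of it is taken from comparator.

```python
import hashlib
import re

# Partial keyword list, enough for the toy examples below.
C_KEYWORDS = {"int", "char", "void", "for", "while", "if", "else", "return"}
TOKEN = re.compile(r"[A-Za-z_]\w*|\d+|\S")

def normalize(chunk):
    """Replace each identifier with id0, id1, ... in order of first use."""
    names = {}
    out = []
    for tok in TOKEN.findall(chunk):
        if re.match(r"[A-Za-z_]", tok) and tok not in C_KEYWORDS:
            tok = names.setdefault(tok, f"id{len(names)}")
        out.append(tok)
    return " ".join(out)

def fingerprint(chunk):
    """Exact-match hash taken after normalization."""
    return hashlib.md5(normalize(chunk).encode("utf-8")).hexdigest()

a = "for (i = 0; i < n; i++) total += v[i];"
b = "for (k = 0;k < count; k++)  sum += data[k];"
c = "for (i = 0; i < n; i++) total -= v[i];"
```

      Here `a` and `b` hash the same despite the renames and spacing, while `c` (a real logic change) does not. The price is more false positives, since lots of unrelated loops have this shape.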

    4. Re:Would this really work? by net_bh · · Score: 1
      Whitespaces can be ignored by the tool.

      Plus, since it shreds the code into 3-line slices, if a decent-sized function has been copied, don't you think there is a high probability that there might be at least one (or more) 3-line chunks that will match perfectly?
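
      A quick toy to back that up, assuming overlapping 3-line windows hashed with MD5 (my own sketch, not comparator itself; the ten made-up lines are all distinct so every window gets its own digest):

```python
import hashlib

def digests(lines, window=3):
    """Set of MD5 digests, one per `window`-line slice of the input."""
    return {
        hashlib.md5("\n".join(lines[i:i + window]).encode("utf-8")).hexdigest()
        for i in range(len(lines) - window + 1)
    }

# A hypothetical ten-line function body.
original = [f"step_{k}();" for k in range(10)]
# The "copier" slips one extra comment line into the middle.
copied = original[:5] + ["/* sneaky extra comment */"] + original[5:]

before = digests(original)   # 8 windows
after = digests(copied)      # 9 windows
survivors = before & after   # windows that do not span the inserted line
```

      Only the two windows that straddle the inserted line are lost; the other six still match, so one added line does not hide a copied function of any size.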

      --
      There is no patch for stupidity

      Visit my blog

    5. Re:Would this really work? by Paradox · · Score: 1

      Not if there is even one different line. Add a comment, say. Or anything. Just one character will ruin that slice. Just one line will completely remove any chance of finding it.

      It's still a useful tool. Just not for the SCO case, or any case where the honesty of the authors is in question.

      --
      Slashdot. It's Not For Common Sense
    6. Re:Would this really work? by shaitand · · Score: 2

      except that according to SCO millions of lines were copied VERBATIM into linux.

      Verbatim would give a matching md5 sum, sysv code isn't tough to get your hands on (especially since IBM has it, as well as their own code they supposedly contributed). Making the md5 hashes will be a breeze.

    7. Re:Would this really work? by be-fan · · Score: 1

      How does that work? The core of the kernel is well under a million lines of code. How could millions of lines get copied in? Why the fuck would we steal SCO's drivers?

      --
      A deep unwavering belief is a sure sign you're missing something...
    8. Re:Would this really work? by Piquan · · Score: 1

      It's still a useful tool. Just not for the SCO case,

      Why? Who do we think is going to cheat? Who would have incentive to force a mismatch?

      Scenario one: Linux hackers change code to force a mismatch. But we have the kernel sources going a long time back, so an impartial judge simply has to pick sources from before SCO's initial announcement. So this is ineffective.

      Scenario two: SCO changes code to force a mismatch. But they are trying to prove a match, so this is contrary to their best interests.

      Scenario three: Linux hackers change code to force a match. Besides being vulnerable to the issue listed in #1, it is contrary to their best interests, and also few Linux hackers have access to SCO's (current) source.

      Scenario four: SCO changes code to force a match. In other words, they copy code from the Linux sources into their kernel before running shred. This is possible, but is no different than if they made the same copy without running shred.

    9. Re:Would this really work? by Anonymous Coward · · Score: 0

      I wrote a report on code recognition a while ago. Probably not brilliant, but informative anyway.

      TieFighter.et.tudelft.nl/~lucifer/metrics/m.pdf

    10. Re:Would this really work? by Paradox · · Score: 1

      Well, SCO can easily, trivially, and even somewhat believably claim Scenario One (which they will). Remember, they're already twisting what everyone says, and outright lying.

      In order for any tool to really debunk stuff and be useful, it has to be extremely comprehensive, thorough, and correct. Otherwise, it's only so much chaff that doesn't help much.

      What we need are comparisons so thorough that no one can deny them and look sane.

      Maybe SCO isn't the group to do that anyways :)

      --
      Slashdot. It's Not For Common Sense
    11. Re:Would this really work? by shaitand · · Score: 1

      That's one of many things which makes the SCO claims hilarious of course.

  44. Next Step? by Bilbo · · Score: 1

    Do I smell a Court Order in the works? If you really can do this without divulging the original code, then someone could conceivably convince a judge to issue an order to have a "neutral third party" create the MD5 sums on SCO's codebase, giving us a chance to look for pirated GPL code hidden inside of SCO proprietary products, without having to look directly at the SCO code.

    --
    Your Servant, B. Baggins
  45. It's been around for years by Anonymous Coward · · Score: 3, Interesting

    check out this research project coming out of berkeley CAP

    Drop in the code you are interested in and it will tell you where its found in a bunch of open source stuff, including the linux kernel.

  46. Warranting SCO code? by Anonymous Coward · · Score: 0

    SCO claims that Opensource is weak because nobody will warrant that the code does not violate somebody's IP rights.

    However, what about SCO's products and other proprietary code? Does SCO warrant that UNIX does not violate somebody else's IP rights?

    If not, then any current SCO customer should be very afraid of the outcome of this lawsuit, because during the trial we'll almost certainly find that SCO accidentally or purposely copied GPL code into UNIX. By SCO's own logic, this means that all customers must immediately stop using the product.

    For this reason, SCO should be very reluctant to let anyone use ESR's new tool, or even plain old eyeballs to examine the code.

  47. He's a clever so and so, isn't he? by NickFortune · · Score: 2, Insightful
    Firstly this gives us a way around the SCO NDA bullshit. So far the only way to disprove their case has been to look at the code, after which the NDA stops you from telling anyone. This lovely piece of work sidesteps that nicely. Furthermore, it opens a whole load of possibilities.

    It gives software houses a way of publishing commercial code for copyright purposes. If you claim copyright on code, you can publish the MD5 shred sigs for the code. No one can rip you off, but you can enforce your rights in a court.

    Even better - no one now has an excuse for not publishing. That means that we can make sure the kernel never comes within spitting distance of anyone else's property again. And if it does - well they should have published.

    Now if SCO aren't willing to publish their MD5 shreds, then that can only be because they have no case. In which case - game over!

    On the other hand, if they do, the world at large can then go through their published shreds and see exactly whose code SCO have been ripping off. Given the likely origins of those samples they exhibited a while back, I'd say that's likely to be quite a bit.

    This looks like the best news for the war against everyone's favourite Stupidly Corrupt Organisation since the whole mess kicked off.

    --
    Don't let THEM immanentize the Eschaton!
    1. Re:He's a clever so and so, isn't he? by peacefinder · · Score: 1

      Well, kind of. The problem here is that MD5 is too good.

      ESR's comparator will find wholly-identical code, yes, and that's good in this instance. All matches ESR's method finds will be significant... but ESR's method will not find all significant matches.

      As many have pointed out, a trivial change in source will yield a dramatic change in MD5... that's the whole point of MD5. Now that the method is known, a malicious code-copier can easily make trivial changes that defeat MD5 shredding as a comparison. It only works in this case because of the long history of the codebase where no one has systematically tried to defeat this method of comparison.

      What we need is a fast "hash" that is designed to measure similarity on a scale, rather than yield a straight identical-different binary result. With such a hash, any code shred over some threshold of similarity would be worth a closer look.
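
      One cheap way to get a score instead of a yes/no, sketched under my own assumptions: Jaccard similarity over token trigrams. This is a swapped-in textbook technique, not anything comparator does, and the tokenizer is deliberately crude.

```python
import re

def trigrams(code):
    """Set of consecutive token triples from a crude regex tokenizer."""
    toks = re.findall(r"\w+|\S", code)
    return {tuple(toks[i:i + 3]) for i in range(len(toks) - 2)}

def similarity(a, b):
    """Jaccard similarity of the two trigram sets, from 0.0 to 1.0."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)

a = "for (i = 0; i < n; i++) total += v[i];"
b = "for (i = 0; i < n; i++) sum += v[i];"   # one variable renamed
c = 'printf("hello world\\n"); return 0;'    # unrelated code

exact = similarity(a, a)
near = similarity(a, b)
far = similarity(a, c)
```

      The renamed copy scores well above the unrelated snippet but below an exact match, so anything over some threshold gets a closer look. Note that this needs the tokens themselves, so unlike bare MD5 shreds it cannot be run without access to the source.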

      Alternately, as another poster suggested, comparing the parsed-but-not-compiled code might be a better choice. It would still have problems, in that it might yield many false positive matches.

      --
      With reasonable men I will reason; with humane men I will plead; but to tyrants I will give no quarter. -- William Lloyd
    2. Re:He's a clever so and so, isn't he? by miniver · · Score: 1

      Download & read the source. Or just read the documentation.

      Comparator has the capability (-w) to ignore whitespace while generating the hash, while at the same time tracking the actual line numbers for purposes of merging and reporting. In my experience, most code-copiers are dumb and/or lazy -- to get past ESR's tool, the code-copier would have to (a) realize that they're violating a license, (b) not care, (c) be smart enough to realize that a pure cut-and-paste might get caught, and (d) be energetic enough to munge up the code logic and variables. While I'm sure there are people like that, I would argue that most of them wouldn't be interested in contributing the result to the community, and the code wouldn't get past Linus if they did. The more logical case is a person or company who believes that they have a legitimate right to copy code from one kernel to another (BSD -> Linux / Linux -> SysV / SysV -> Linux) and thus doesn't feel the need to cover things up. Either of the SCO User Group examples would fit this category.
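
      To illustrate the idea only (this is not how comparator implements -w, just a sketch under my own assumptions): strip all whitespace from each line, skip lines that end up empty, but keep the original line numbers for the report.

```python
import hashlib

def shreds_ignoring_whitespace(lines, window=3):
    """Map digest -> list of (first_line, last_line) in original numbering."""
    # Pair each line with its 1-based number, then remove all whitespace.
    kept = [(n, "".join(ln.split())) for n, ln in enumerate(lines, 1)]
    kept = [(n, text) for n, text in kept if text]  # drop blank lines
    result = {}
    for i in range(len(kept) - window + 1):
        block = kept[i:i + window]
        joined = "\n".join(text for _, text in block)
        digest = hashlib.md5(joined.encode("utf-8")).hexdigest()
        result.setdefault(digest, []).append((block[0][0], block[-1][0]))
    return result

tidy = ["if (x) {", "    y = 1;", "}"]
messy = ["if ( x ) {", "", "\ty = 1 ;", "}"]

tidy_shreds = shreds_ignoring_whitespace(tidy)
messy_shreds = shreds_ignoring_whitespace(messy)
```

      Both versions produce the same digest, reported as lines 1-3 in one file and 1-4 in the other. The obvious trade-off is that removing every space also makes "int x" and "intx" look alike.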

      Keep in mind, this tool will not divine intent, or direction -- it's only going to give broad hints of where to look for possibly problematic code.

      --
      We call it art because we have names for the things we understand.
  48. It seems impossible to compare without leaking by expro · · Score: 3, Insightful

    Think of the chance that any given line of source code in an arbitrary program is repeated somewhere else in a large open source program such as the Linux Kernel. This is even more true if some degree of fuzziness is added to handle changes such as adding or removing spaces in insignificant places, removing comments (and there are many other things like brace style which affect multiple lines, so you might want to physically reformat between lines to a standard format).

    If even only 1% of the lines are found somewhere in the open source code base, I think a source who wants to keep their code base secret will have a big problem with someone computing the checksums. In reality, I wouldn't be surprised to see a much higher percentage of lines leaked this way. And this is not the only way leaking can occur (think of application of simple cryptography).

    I would not want to be the one publishing the checksums of the closed source due to possible legal liability. The checksums are a derived work in any case.

    1. Re:It seems impossible to compare without leaking by shaitand · · Score: 1

      checksums cannot be a derivative of the source for an app, since checksums are NOT source for a modified app.

    2. Re:It seems impossible to compare without leaking by expro · · Score: 1

      You could say the same thing about a compiler since you cannot fully reconstitute a program in many languages, but it is a derived work anyway.

    3. Re:It seems impossible to compare without leaking by shaitand · · Score: 1

      I'm taking a wild shot in the dark here and assuming you are talking about decompiled code? Nope it's a two way street, the decompiled output could theoretically be recompiled back into the original application.

      You cannot turn an md5 sum back into the original program... an md5 sum is simply a number, a simple number is not copyrightable no matter how it's derived. I can write a number which also happens to be the md5 sum for ms office on the walls all day long and they can do nothing more than a tire company could if I should choose to write pi on the walls all day... because their copyright is on office, numbers themselves are in the public domain.

      Most importantly however, an md5 sum isn't truly unique AFAIK, it's possible (though not likely) for two pieces of different code to generate the same sum. Kind of like DNA for identification, although the odds are slim, statistically there should be 3 or 4 people walking the earth who will match with the best dna criteria we can currently manage.

      Of course decompiling is just a virtually automated method of reverse engineering the functionality of a program... the output isn't TRULY a derivative at all (in fact it generally is NOT compilable to the original program), it's so easy in fact that lots of closed source programmers were able to successfully lobby because it was easy to see how their program was constructed. What's next, saying one can't view the source of an html document or a perl script because that makes it too simple to see how it functions?

  49. Read the article by Bananenrepublik · · Score: 1, Redundant

    From ESR's README:
    Besides the production C code, the distribution also includes working Python versions. These were used to prototype the concept.
    So the answer to your question is yes and no.

  50. In all fairness.. by k98sven · · Score: 3, Insightful

    SCO was unaware of the origins of much of the code and this "explains how they could wheel out the old malloc() code and the BPF (Berkeley Packet Filter) code, not realising that both were now under BSD licences - and in fact they hadn't even written the BPF code," Dr Toomey said.

    The SCO Group (not old SCO) hasn't written any code in SysV UNIX.

    Anyway.. One could hope that when this is all over, the UNIX sources will be bought up from the carcass of SCOX and open-sourced, finally putting it out of its misery..

    That is, as long as SysV UNIX doesn't have more stolen code in addition to the BSD code we all know about..

    The sooner the zombie of UNIX is put to rest, the better for all the live Unices.

    1. Re:In all fairness.. by Dr_Marvin_Monroe · · Score: 4, Insightful

      In all fairness, SCO's value is not in being purchased so that the source code can be freed...

      SCO's value is in acting as a totem against future companies who would try this same stunt....Their value is in their smoking carcass with Darl's charred head mounted prominently on a high pike...

      At this point, there can be no compromise with people who commit fraud to inflate their stock price and to promote FUD.... I believe that Darl KNOWS that his claims are false...he deserves to fry....

      I say, "smoking head on stake" for all the SCO/Canopy group members.... leave all the execs at SCO without a job and discredited like the MCI/ENRON execs....Leave all the investors holding worthless stock certs....Somebody needs to be an example, and SCO volunteered by inflating/changing/hyping/FUDing their claims.

      I could have had a little sympathy for them if they had just filed their suit and shut-up until the trial....but at $17/share now, we need to destroy some wallets to remind everyone that it's not over till the gavel falls......

    2. Re:In all fairness.. by Morky · · Score: 1
      ..Leave all the investors holding worthless stock certs...
      The investors can and should fry with them.
    3. Re:In all fairness.. by SarekOfVulcan · · Score: 1
      SCO's value is in acting as a totem against future companies who would try this same stunt....Their value is in their smoking carcass with Darl's charred head mounted prominently on a high pike...

      I'd look up at your lifeless eyes and wave like this. Can you and your associates arrange that for me, Mr. McBride?
  51. Re:The real question is: by Anonymous Coward · · Score: 1, Interesting

    That's a legitimate question. If the CS students that code Linux have learned anything it's how to obfuscate someone else's code to avoid getting caught cheating. Hell, I do it all the time when I don't want to write a stupid program for class. Obfuscation is an artform.

  52. Shred? by Anonymous Coward · · Score: 0

    Isn't that the command you have to use to delete kiddie porn when the fbi comes knockin?

  53. It's close by LittleLebowskiUrbanA · · Score: 1, Insightful

    It's a tossup between ESR and Perens as my favorite Open Source advocates. Perens is funny and answers emails and offers to help anybody out. ESR likes guns and writes extremely well. They both have some big brass ones, leading the fight against SCO.
    As for Stallman... I think he's still bitching about Debian and their not completely, totally, 100% free packages. I haven't seen him contribute anything in a long time except complaints and rhetoric.

    1. Re:It's close by be-fan · · Score: 1

      I like Perens myself. I disagree with ESR harshly on several points. I like guns myself, and am rather anti-government, but don't share ESR's intense disrespect for non-Western cultures.

      --
      A deep unwavering belief is a sure sign you're missing something...
    2. Re:It's close by LittleLebowskiUrbanA · · Score: 1

      Huh? He's into some decidedly non-western things like martial arts. I haven't heard anything about what you're saying about ESR. Explain.

    3. Re:It's close by be-fan · · Score: 1

      Read his blog.
      Specifically, the second article on the linked-to page. It's more of an anti-Islamic viewpoint, but the connotation of Western superiority is all over the place. He seems to be so afraid of terrorism that he's compromised his own beliefs about anarchy.

      --
      A deep unwavering belief is a sure sign you're missing something...
    4. Re:It's close by LittleLebowskiUrbanA · · Score: 1

      I didn't get the impression ESR is against all Arabs. Rather, he's admitting his vision of a society like in Heinlein's novel Beyond This Horizon wouldn't be able to confront terrorism like we are now. I don't agree w/ his point about occupying other countries either.
      As for the article, I think it's a bit too extreme in saying all Arabs and members of the Islamic religion are our enemies. However, there is something fundamentally wrong w/ a society or religion that follows such evil men blindly and whose main form of warfare is attacking innocents.

    5. Re:It's close by johny_qst · · Score: 1

      I'll get in on this IRV rating of OSS advocate figureheads.
      Perens is #1! He's got an awesome sense of humor. His views on politics and religion don't tend to color his speaking about OSS.
      Stallman is #2! He has the most crisply defined view on what "Free as in Speech" really should mean that meshes most with mine.
      ESR is #3! Sometimes he can be as funny as Perens, though in a totally different way. Sometimes his attitude is infectious, but sometimes it seems nasty.
      They are all intelligent guys who like technology so they are all pretty damn cool in my book.
      Will someone please force SCO into court so we can get this crap over with! On to better topics for these 3 to talk about!

      --
      Fnord.sig
    6. Re:It's close by be-fan · · Score: 1

      didn't get the impression ESR is against all Arabs.
      >>>>>>>>
      I don't know. The statements about wiping Islam off the face of the earth are kind of hard to misinterpret.

      However, there is something fundamentally wrong w/ a society or religion that follows such evil men blindly and it's main form of warfare is attacking innocents.
      >>>>>>>>
      It's largely a matter of perspective. First, remember that only 25% of the world's Muslims live in the Middle East. Only a small percentage of those people support people like bin Laden. The rest of the Muslims, including its international scholars, do *not* support terrorists. Many (most?) of them have the same goals, like a free Palestine, decreased influence of American culture, etc, but do not condone blowing up innocent people to achieve those goals.

      The number of people who directly support militant Islam is relatively low. It's probably higher than the percentage of people who support militant Christianity (Ireland anyone?), but remember that the Muslim population tends to be poorer than the Christian population, and live in less developed countries. People in these conditions tend to be more militant than people in developed countries, regardless of faith. As for anything in the Koran, I would remind people that the Bible also says a lot of things that level-headed Christians do not interpret strictly. The Bible (and conservative Christians, I might add) is anti-women, but that hasn't stopped the mainstream of Christians from (sorta) adopting a liberal attitude towards women.

      Lastly, to address the question of attacking innocents, I must say that men in glass houses shouldn't throw stones. I believe that there is no reason, short of self defense, to kill someone, and don't condone terrorists killing innocent people any more than I condone the US killing innocent people (either directly or indirectly) through their "regime changes" and twisted foreign policy.

      --
      A deep unwavering belief is a sure sign you're missing something...
    7. Re:It's close by Anonymous Coward · · Score: 0

      Aw, what the heck

      #1 Perens for speaking in terms that will actually convince people who aren't already on our side

      #2 Torvalds for being a really good, sound-bitey speaker. We are lucky that such a great engineer and project manager also turned out to have a good sense of advocacy.

      #3 PJ and her merry band of commentators for the best facts and analysis.

      #4 RMS because he thought up the whole deal. This is RMS's world, we just live in it. But his public statements are laughable though. Honest to God, I read an article by him where he was complaining that the root problem was that "Linux" was a vague word and that if people called the system "Gnu/Linux" it would be easier to defend against SCO's attacks. And hey -- README.SCO is not enough, I want to see the next versions of gdb and gcc with NO SCO SUPPORT. But on the plus side: unlike Torvalds, RMS got some real legal advice on how to protect FSF contributions against dirty contributions, and you might notice that *no one* is trying to throw mud on the pedigree of FSF software. RMS made all the advocates' jobs easier with his "you must sign a copyright assignment form before I will read your code" conservatism.

      #5 ESR for the stupid "the DDOS'ers friend called me" article. That was worse than laughable, that was harmful. But ESR gets credit for publishing the Halloween Memos and getting press attention for them (which many people seem to have forgotten).

    8. Re:It's close by Russ+Nelson · · Score: 1

      Interesting. Eric models his pacifism after the Buddhist tradition. Doesn't seem like intense disrespect to me.
      -russ

      --
      Don't piss off The Angry Economist
    9. Re:It's close by be-fan · · Score: 1

      I'd really like to understand how he reconciles this statement (referring to Islamic culture):

      "I think he is also right to say that our long-term objective must be to break, crush and eventually destroy this culture"

      with his professed pacifism.

      --
      A deep unwavering belief is a sure sign you're missing something...
    10. Re:It's close by Russ+Nelson · · Score: 1

      Go look to see if he said that BEFORE members of that Arab culture attacked our country. Note that he's not referring to all members of Islam. He's referring specifically to a small subset of Arab culture. Islam is just as capable of being read as a peaceful religion as is Christianity, "Onward Christian Soldiers" and all.
      -russ

      --
      Don't piss off The Angry Economist
    11. Re:It's close by be-fan · · Score: 1

      a clash of civilizations driven by the failure of Islamic/Arab culture ... he is also right to say that our long-term objective must be to break, crush and eventually destroy this culture...
      we could corrupt Islamism out of existence, make it fat and lazy with cheap consumer goods and seduce it with porn.
      >>>>>>>>>>>>>>
      Quoted from his blog. I'm really having a hard time seeing how he's referring specifically to a small subset of Arab culture. In particular, his comments about the cheap consumer goods and porn indicate that he's referring to the wider Islamic community.

      --
      A deep unwavering belief is a sure sign you're missing something...
  54. SCO partners as potential clients by DavidNWelton · · Score: 1
    Here's my own take on one small way to go about doing something about SCO, at least for those of us who work as consultants:

    http://www.advogato.org/article/702.html


    By picking a fight with IBM and the free software community, SCO has backed itself into a corner. The company faces a very real possibility of not existing in several years if they don't win their case convincingly. Where does this leave all their partners who have based businesses on their products? Up a creek. Here's where you, the Linux consultant, come in: why not reach out to these people and help them transition away from SCO's crappy products?


  55. Not So Ridiculous by chundo · · Score: 1

    You're going on the assumption that they're referring to MD5 hashes of entire files. Not so. Typically such comparisons generate many hashes per file, by going through each file line-by-line, and for each line generating a hash of it and the next 5 lines or so. When all is said and done, this gives you many hashes of small blocks of code within a file, which can then be compared to the hashes from a different codebase. Any time 5 lines are the same between the codebases, you would get a match.

    -j
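
    The line-window hashing described above can be sketched roughly as follows. This is only an illustration, not ESR's actual comparator code: the function names `shred` and `common_shreds`, the five-line window, and the whitespace normalisation are all assumptions made for the example.

```python
import hashlib


def shred(lines, window=5):
    """Map the MD5 digest of every run of `window` lines to its first line index."""
    # Assumption: collapse whitespace so re-indentation alone does not hide a match.
    norm = [" ".join(line.split()) for line in lines]
    digests = {}
    for i in range(len(norm) - window + 1):
        chunk = "\n".join(norm[i:i + window]).encode("utf-8")
        # Keep the first position at which each digest was seen.
        digests.setdefault(hashlib.md5(chunk).hexdigest(), i)
    return digests


def common_shreds(shreds_a, shreds_b):
    """Digests present in both trees; each one marks an identical run of lines."""
    return sorted(set(shreds_a) & set(shreds_b))
```

    Calling `common_shreds(shred(a), shred(b))` on two lists of source lines gives the digests of every five-line run they share; looking a digest up in either dictionary gives the line where that run starts.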

    1. Re:Not So Ridiculous by Rock+Ridge · · Score: 1

      But five lines may be too much. The standard in some copyright/plagiarism matters has been a sequence of thirty identical characters.
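
      A character-based check along those lines might look like the sketch below. The thirty-character figure comes from the comment above; the helper names `char_shingles` and `shares_run`, and the choice to collapse whitespace first, are assumptions for illustration.

```python
def char_shingles(text, k=30):
    """Every run of k consecutive characters, after collapsing whitespace."""
    collapsed = " ".join(text.split())
    return {collapsed[i:i + k] for i in range(len(collapsed) - k + 1)}


def shares_run(text_a, text_b, k=30):
    """True if the two texts have at least one identical run of k characters."""
    return not char_shingles(text_a, k).isdisjoint(char_shingles(text_b, k))
```

      Storing the raw substrings keeps the example short; a real tool would presumably hash each run, as the shred approach does, so that the digests can be shared without the text.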

  56. Here is the real message from Allah Oh' Brother by Teahouse · · Score: 0, Offtopic

    Terrorists Surprised to Find Themselves In Hell
    From: The Onion . com

    JAHANNEM, OUTER DARKNESS--The hijackers who carried out the Sept. 11 attacks on the World Trade Center and Pentagon expressed confusion and surprise Monday to find themselves in the lowest plane of Na'ar, Islam's Hell.

    Above: Mohammed Atta (top) and Ahmed al-Haznawi.
    "I was promised I would spend eternity in Paradise, being fed honeyed cakes by 67 virgins in a tree-lined garden, if only I would fly the airplane into one of the Twin Towers," said Mohammed Atta, one of the hijackers of American Airlines Flight 11, between attempts to vomit up the wasps, hornets, and live coals infesting his stomach. "But instead, I am fed the boiling feces of traitors by malicious, laughing Ifrit. Is this to be my reward for destroying the enemies of my faith?"

    The rest of Atta's words turned to raw-throated shrieks, as a tusked, asp-tongued demon burst his eyeballs and drank the fluid that ran down his face.

    According to Hell sources, the 19 eternally damned terrorists have struggled to understand why they have been subjected to soul-withering, infernal torture ever since their Sept. 11 arrival.

    "There was a tumultuous conflagration of burning steel and fuel at our gates, and from it stepped forth these hijackers, the blessed name of the Lord already turning to molten brass on their accursed lips," said Iblis The Thrice-Damned, the cacodemon charged with conscripting new arrivals into the ranks of the forgotten. "Indeed, I do not know what they were expecting, but they certainly didn't seem prepared to be skewered from eye socket to bunghole and then placed on a spit so that their flesh could be roasted by the searing gale of flatus which issues forth from the haunches of Asmoday."

    "Which is strange when you consider the evil with which they ended their lives and those of so many others," added Iblis, absentmindedly twisting the limbs of hijacker Abdul Aziz Alomari into unspeakably obscene shapes.

    "I was told that these Americans were enemies of the one true religion, and that Heaven would be my reward for my noble sacrifice," said Alomari, moments before his jaw was sheared away by faceless homunculi. "But now I am forced to suckle from the 16 poisoned leathern teats of Gophahmet, Whore of Betrayal, until I burst from an unwholesome engorgement of curdled bile. This must be some sort of terrible mistake."

    Exacerbating the terrorists' tortures, which include being hollowed out and used as prophylactics by thorn-cocked Gulbuth The Rampant, is the fact that they will be forced to endure such suffering in sight of the Paradise they were expecting.

    "It might actually be the most painful thing we can do, to show these murderers the untold pleasures that would have awaited them in Paradise, if only they had lived pious lives," said Praxitas, Duke of Those Willingly Led Astray. "I mean, it's tough enough being forced through a wire screen by the callused palms of Halcorym and then having your entrails wound onto a stick and fed to the toothless, foul-breathed swine of Gehenna. But to endure that while watching the righteous drink from a river of wine? That can't be fun."

    Underworld officials said they have not yet decided on a permanent punishment for the terrorists.

    "Eventually, we'll settle on an eternal and unending task for them," said Lord Androalphus, High Praetor of Excruciations. "But for now, everyone down here wants a crack at them. The legions of fang-wombed hags will take their pleasure on their shattered carcasses for most of this afternoon. Tomorrow, their flesh will be melted from their bones like wax in the burning embrace of the Mother of Cowards. The day after that, they'll be sodomized by the Fallen and their bowels shredded by a demonic ejaculate of burning sand. Then, on Sunday, Satan gets them all day. I can't even imagine what he's got cooked up for them."

    --
    "Curiosity killed the cat, but for a while I was a suspect."- Steven Wright
  57. SCO claims, Open Source acts by Arnulf · · Score: 2, Insightful
    I'm impressed.

    While Dark McBride and Chris Sontag shoot their mouths off, the community develops tools to finally make something clearer. :)

    -Arnulf

  58. Still have to trust SCO by scaldef · · Score: 1

    Isn't there a problem with this, that you have to trust that the shreds that SCO provides are really the shreds to their source code? After all, they could just grab big chunks of Linux source code and stick it in their tree and then shred it. This technique only seems reliable for comparing two secret codebases without revealing either one. The asymmetry in this case makes the test unfair.

    1. Re:Still have to trust SCO by gnutechguy · · Score: 1

      I see your point, but I visualize it this way. Darl McBride has stated that millions of lines have been copied verbatim. Well, we just run the Compare-O-matic and see what happens.

      Two likely outcomes are

      1) Significantly less than a "million" lines of code will match

      2) We can examine the parts that do match and show that they are not of System V origin.

      For SCO to prove their point, the above points cannot happen.

      I think this is a good way to "Smoke out" the Smoking Crack Organization.

      --

      ... and beyond them a far green country under a swift sunrise
  59. Who says ESR can't code? by Jay+Maynard · · Score: 1

    I've seen a lot of bitching that ESR can't code. One look at this proves otherwise.

    --
    Disinfect the GNU General Public Virus!
    1. Re:Who says ESR can't code? by Anonymous Coward · · Score: 0

      I guess this proves someone can't read!

      Get it? That's you! You can't read, Maynard! Or is that Jaynard? I'll just call you Nard, how about that!

    2. Re:Who says ESR can't code? by finkployd · · Score: 4, Insightful

      Actually fetchmail proves that he can code.

      This program just proves that md5 is not the correct hash for doing this kind of comparison. It is TOO GOOD of a one way hash, and will only return a positive if the lines being compared are 100% equal.

      Finkployd
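
      The exact-match behaviour is easy to demonstrate, and one common mitigation is to normalise text before hashing. The sketch below assumes a hypothetical `normalise` step that drops single-line `/* ... */` comments and collapses whitespace; whether comparator does anything similar is not stated in this thread.

```python
import hashlib
import re


def digest(line):
    """MD5 hex digest of a single line of text."""
    return hashlib.md5(line.encode("utf-8")).hexdigest()


def normalise(line):
    """Assumed preprocessing: strip /* ... */ comments, collapse whitespace."""
    line = re.sub(r"/\*.*?\*/", "", line)
    return " ".join(line.split())


original = "count = count + 1;  /* bump */"
reformatted = "count   =  count + 1;"
renamed = "total = total + 1;"

# Raw digests differ on any byte-level change, even a cosmetic one.
raw_match = digest(original) == digest(reformatted)
# After normalisation the cosmetic difference disappears...
cosmetic_match = digest(normalise(original)) == digest(normalise(reformatted))
# ...but a renamed variable still produces a different digest.
renamed_match = digest(normalise(original)) == digest(normalise(renamed))
```

      Normalisation only absorbs the differences it was written for, so the objection above still holds for real edits such as renamed variables.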

    3. Re:Who says ESR can't code? by Jay+Maynard · · Score: 1

      Hm. While you're indeed correct, this then raises the question of just how much alike program segments have to be before they're just copies of one another.

      The tool is good enough to answer SCO's claims, since they're claiming that vast stretches of code were lifted with only the copyright statements removed, but as a more general code comparison, it might be lacking. I'm not at all sure that that problem is answerable short of an AI-complete solution, though.

      --
      Disinfect the GNU General Public Virus!
    4. Re:Who says ESR can't code? by TrancePhreak · · Score: 1

      A modified windiff would probably suffice. One that checks for similarities across multiple files.

      --

      -]Phreak Out[-
    5. Re:Who says ESR can't code? by joeytsai · · Score: 4, Informative

      Actually fetchmail proves that he can code.


      You may want to check out "The Emperor Has No Clothes", a look at ESR's real code contributions.
      --
      http://www.talknerdy.org
    6. Re:Who says ESR can't code? by Anonymous Coward · · Score: 0

      duploc does a better job

    7. Re:Who says ESR can't code? by Anonymous Coward · · Score: 0

      People should also read this, well, and this.

      All I'm trying to say is, all your heroes have some skills, but they are also full of hot air and say and do stupid things. They are only human after all.

      I have great respect for Linus, because his good sides outweigh the bad. I can't say the same thing about ESR, RMS and Bruce Perens.

    8. Re:Who says ESR can't code? by burnetd · · Score: 1

      Warning, dudes: the ESR link above tries to install Gator... Well done joeytsai, you become the first person on my enemies list.

  60. Why Proprietary? by Anonymous Coward · · Score: 0

    Could someone explain to me again why this supposedly proprietary code is still a secret when it has been released under the GPL?

    1. Re:Why Proprietary? by shaitand · · Score: 1

      The SysV code is what is being called proprietary, not the code that was supposedly copied into the Linux kernel. Despite SCO's claims, even if copying had occurred, I'm pretty sure not ALL of SCO Unix was donated to the Linux kernel.

  61. Who is ESR? by Anonymous Coward · · Score: 0

    First rule of journalism: don't assume your audience knows anything about the subject.

    1. Re:Who is ESR? by Anonymous Coward · · Score: 0

      First rule of journalism: don't assume your audience knows anything about the subject.

      First rule of posting: don't post as a noob, first lurk until you understand the community and its topics of discussion.

      You certainly can't expect people to spell everything out one syllable at a time just because the forum gets visited by lazy noobs that haven't done their homework.

  62. I can write such a utility also! by pclminion · · Score: 5, Funny

    #include <stdio.h>

    int main(void)
    {
        printf("These source trees appear to be entirely different!\n");
        return 0;
    }

    1. Re:I can write such a utility also! by demachina · · Score: 1

      I'm pretty sure 80% of the lines in your program are stolen from SCO's source. Man you are so screwed now....Darl.....Darl.....come look at this.

      --
      @de_machina
  63. Useless by Anonymous Coward · · Score: 0

    Something like this would be useless because any good (wink: evil) programmer who changed the source even a little would make the compiled code different. To be legitimate, you would need the source trees from BOTH and compare line by line (really, function by function).

    This doesn't negate the fact that SCO is claiming blatant code copying where the source wasn't changed at all...

    Can't someone just burn down SCO's buildings and end this whole idiotic process? SCO is an obvious MS puppet now.
    (some lawyer at MS --> "All your source is now belong to us")

  64. Who says SCO gets to court first? by JoeBuck · · Score: 4, Interesting

    If we can show that SCO's violating the BSD license, maybe we can convince some BSD copyright holder to sue them first, and demand as part of discovery the MD5 checksums from "shred", showing duplicated BSD code but no duplicated BSD copyright.
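
    A rough sketch of how digests alone could be exchanged and checked: one side exports only its MD5 shreds, and the other side looks them up against its own tree to find where the matches fall. The function names, the JSON exchange format, and the three-line window are assumptions chosen to keep the example small.

```python
import hashlib
import json


def shred_file(name, text, window=3):
    """Map the digest of each run of `window` lines to (file name, 1-based line)."""
    lines = [" ".join(line.split()) for line in text.splitlines()]
    out = {}
    for i in range(len(lines) - window + 1):
        chunk = "\n".join(lines[i:i + window]).encode("utf-8")
        out.setdefault(hashlib.md5(chunk).hexdigest(), (name, i + 1))
    return out


def export_digests(shreds):
    """What the closed-source side hands over: digests only, no text, no locations."""
    return json.dumps(sorted(shreds))


def locate_matches(exported, own_shreds):
    """Places in our own tree whose digests also appear in the exported list."""
    theirs = set(json.loads(exported))
    return sorted(loc for h, loc in own_shreds.items() if h in theirs)
```

    The party holding the open tree learns where its own code matches, while the exported file contains nothing but hex digests.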

    1. Re:Who says SCO gets to court first? by Anonymous Coward · · Score: 0

      I hear Bill Joy's got some free time nowadays.

    2. Re:Who says SCO gets to court first? by Rock+Ridge · · Score: 1

      You borgs are counting on help from the BSD community?

    3. Re:Who says SCO gets to court first? by Anonymous Coward · · Score: 1, Interesting

      The various BSD flavours are next in line to feel SCO's wrath after they're done with Linux. SCO has already said they have issues with BSD.

  65. What if...? by bladernr · · Score: 4, Insightful

    What if this ESR tool runs and finds commonality, and the research shows that, in fact, SCO's rights were breached? Remember, this type of analysis is a two-edged sword. The purpose of this ESR tool is to remove doubt... but remember that doubt could be removed in either direction.

    So, given that hypothetical, what would people here think? Would you forgive SCO? Would you concede SCO's point, but think that SCO defended their rights in a very poor manner? (this, btw, is what I would probably do). Would you stick your fingers in your ears and refuse to accept the outcome, and believe in some vast -wing conspiracy?

    Obviously, the Linux movement would carry on. I don't think the death of Linux is even worth discussion. Some recourse would happen, probably monetary damages, and the offending code would be removed.

    My real curiosity is how people's attitudes or feelings would change (or not change) if it turns out SCO is right (however unlikely that is).

    --
    Sarcasm and hyperbole are the final refuges for weak minds
    1. Re:What if...? by good-n-nappy · · Score: 2, Insightful

      It still would not prove SCO's point. They have to answer the question of why they distributed the Linux source even after finding out it contained their valuable "IP."

      Most of us are relying on common sense and don't really care whether a few lines of archaic code were copied. Given SCO's

      1) previous sales of Linux
      2) misinformation about owning Unix
      3) waffling on what IP is violated
      4) refusal to show copied code
      5) frequent, inconsistent press releases
      6) heavy insider trading
      7) ridiculous licensing terms
      8) collusion with MS
      9) discredited evidence of code being copied
      10) ?
      11) profiting

      Common sense says that SCO does not have a legitimate claim. There is no rabbit left for SCO to pull. So no, most people would not "forgive" them although they might concede the point that a few lines of ancient code were copied.

      --
      Never underestimate the power of fiber.
    2. Re:What if...? by yeremein · · Score: 1
      What if this ESR tool runs and finds commonality, and the research shows that, in fact, SCO's rights were breached. ...

      So, given that hypothetical, what would people here think? Would you forgive SCO?

      No. There is no reasonable possibility that infringements within several orders of magnitude of the "millions of lines" SCO has claimed actually exist.

      As sanctimonious and hypocritical as they've been about this whole thing, I expect SCO to continue the FUD and gain another few dollars on their stock price while the media reports ESR's findings support SCO (despite the fact that 99% of commonalities will be explained without infringement--a fact the media will ignore).

      It doesn't really matter until it gets to court. And even if SCO succeeds in proving some infringement, do you think the judge will allow 3 billion dollars from IBM and $1400/CPU from all Linux users? Especially since SCO has unclean hands, and the GPL issue, and the fact that they probably steal GPL code themselves, etc. etc.

    3. Re:What if...? by Anonymous Coward · · Score: 2, Informative

      It's not that people around here think SCO is evil for saying their IP has been stolen. Some people around here think SCO is 'evil' for how they've handled the situation.

      If I remember correctly, the open source community has made several offers to remove the tainted code if SCO would just say what code is in violation.

    4. Re:What if...? by screenrc · · Score: 1
      SCO's claims are based on the definition of
      derivative work. Their main argument (so far)
      has been that Linux is derivative of System V
      Unix because... "we believe" it is.


      I don't see the point of using a comparator (assuming
      it works) when SCO's arguments are based on some
      arbitrary definition of derivative works
      which changes from month to month.

    5. Re:What if...? by Anonymous Coward · · Score: 0

      This is the most pathetic rationalization given hypothetical guilt that I've ever seen. Half bogus and half irrelevant.

    6. Re:What if...? by The+Cydonian · · Score: 2, Insightful

      A two-cent observation regarding most media "debates":- folks usually attack other people, not positions. It's usually very rare to find anyone who's changed his position based on new evidence or inference. Not impossible, just rare. (Which, in a way, is why not getting emotional about anything is always a good idea; that way, you can base your support on rational thought, not objects)

    7. Re:What if...? by Anonymous Coward · · Score: 0

      What a lucid and articulate response! 93% nonspecific and 99% wrong.

    8. Re:What if...? by spitzak · · Score: 1

      No.

      This would identify what portions of Linux are infringing, something SCO has refused to do. You could then argue over whether SCO copied it from Linux, Linux copied from SCO, or both copied from a third party, but at worst only those sections would have to be replaced. If it really is an infringement of SCO's copyrights then it will also identify the person responsible, and SCO can go and sue them, like they should have done if they were really trying to protect their copyright.

    9. Re:What if...? by Anonymous Coward · · Score: 0

      ...I would thank ESR & SCO* for bringing the matter to our collective attention so that ALL traces of SCO could once and forever be wiped from our (hypothetically) unclean slate of Creation.

      * Ok, I wouldn't actually thank SCO too much, but they deserve a little credit for their own demise.

    10. Re:What if...? by ls+-lR · · Score: 2, Insightful

      Realistically speaking, if indeed there were infringing parts, we never would have heard of any of this. The whole tone of this article, and the quotes from Raymond, smack of "I've already done the comparison and nothing's there." I think if by some small chance there was something illicit in the Linux tree, Raymond would have notified the maintainer and/or Torvalds and put out a patch to remove it ASAP. Or at least, that's what they've constantly stated they would do were this the case.

      In other words, I believe them when they say that "If by some chance there were infringing code we'd do everything we could to remove it very quickly." SCO's lawyermongering only really applies to IBM anyway, so that point is rather moot.

    11. Re:What if...? by rew · · Score: 1
      .... and the research shows that, in fact, SCO's rights were breached.

      Ehmm. If I had just a little more time on my hands, then I'd personally go to SCO and say: Guys, I want to stop infringing on your IP. Please tell me where it is, and I'll remove it and run Linux without your code ASAP. I'm pretty sure that I've shown enough "good will" that way. If they refuse to tell me what parts to remove, then that's their problem.

      So, whenever parts of the Linux kernel turn up "identical" to parts of SCO, we'll
      • show that it's code legal to use in Linux.
      • or remove it and reprogram without the offending code.

      The SCO strategy is to prevent the Open Source community from removing the offending code. As long as they can maintain this situation, they can continue to extort money from Linux users. As soon as they show the code, we'll remove it.

  66. Obfuscation Observations by Anonymous Coward · · Score: 1, Interesting
    So, this method of identifying copied code would only work if the code had never been run through an obfuscator. It would also be defeatable by running the source through a script to have its variable names search-and-replaced with similar names (such as replacing every variable name with a new name consisting of the old name plus "_newname")
    But the sources for Linux are already available, so SCO knows the OS community isn't cheating - it can independently run the shred algorithm and prove the same results come out. SCO, OTOH, is refusing to let anyone who isn't under NDA see the code. This lets one or more persons who are willing to forgo ever writing kernel code walk into SCO's office, run shred on the code, and walk out without even a copy of the code. And while SCO would be able to obfuscate, they have no incentive to do so - on the contrary, they have the strongest incentive to keep everything exactly the same, if not to fudge the other direction by renaming variables to match the ones in Linux.
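
    The renaming trick, and one possible counter to it, can be sketched as follows. This is an assumption-laden illustration, not part of comparator: the `canonical` helper, the tiny keyword list, and the regex tokenizer are all made up for the example.

```python
import hashlib
import re

# A deliberately short keyword list; a real tool would need the full C set.
C_KEYWORDS = {"int", "char", "long", "void", "return", "if", "else", "for", "while"}
# Identifiers, then integer literals, then any other single non-space character.
TOKEN = re.compile(r"[A-Za-z_]\w*|\d+|\S")


def canonical(line):
    """Replace every non-keyword identifier with the placeholder ID."""
    tokens = []
    for tok in TOKEN.findall(line):
        if (tok[0].isalpha() or tok[0] == "_") and tok not in C_KEYWORDS:
            tok = "ID"
        tokens.append(tok)
    return " ".join(tokens)


def digest(text):
    """MD5 hex digest of a string."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()
```

    With this, `int count = count + 1;` and `int count_newname = count_newname + 1;` both canonicalise to `int ID = ID + 1 ;` and hash identically. The trade-off is more false positives, since unrelated lines with the same shape now collide too.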

    Then we'll have the ability to identify the lines of code in the Linux tree that are the same as lines that SCO says are in their codebase. And show exactly where that code came from. Why? Because the process is OPEN! There are LKML archives with all-out flamewars over some of that code. There are companies whose legal departments have vetted the code they've contributed, and have files that document the process in excruciating detail.

    We will undoubtedly find some of the 'copied' code is in fact BSD code, and the shred algorithm will show that the code differs exactly where the California Regents' copyright notice has been taken out, which will prove that SCO violated not only the GPL, but the BSDL as well. And just like AT&T before them, they'll lose big.

    Posting as AC from work, but you know who I am...
    SVM, ERGO MONSTRO

  67. Flip This Around by Bilbo · · Score: 1

    I can't read ESR's mind, but I strongly suspect that this is intended to provide ammunition for a counter-suit claiming that SCO has pirated GPL code and illegally embedded it into their software. The comparison isn't a proof of copied code, but it could indicate "hot spots" and provide sufficient basis for a sort of "search warrant" to force SCO to show its code. If it turns out, as it did in the famous AT&T vs. BSD case, that SCO has been wholesale ripping off other people's code, then things might turn really uncomfortable for McB and his cronies...

    --
    Your Servant, B. Baggins
    1. Re:Flip This Around by bernywork · · Score: 1

      Hear hear. I was thinking some kind of suit...

      I think though that they could be doing this if nothing else but to shut them up. What happens if they turn around and come up with zero? (After sorting through the stuff that was already released as public)

      Other reason for all this would be to let us get back to what we do best (coding and whinging :P) and to stop worrying about the current problem.

      I knew that something was coming when ESR turned around and said:

      "Take that offer while you still can, Mr. McBride. So far your so-called 'evidence' is crap; you'd better climb down off your high horse before we shoot that sucker entirely out from under you. How you finish the contract fight you picked with IBM is your problem. As the president of OSI, defending the community of open-source hackers against predators and carpetbaggers is mine -- and if you don't stop trying to destroy Linux and everything else we've worked for I guarantee you won't like what our alliance is cooking up next."

      (Quoted whole from: http://www.armedndangerous.blogspot.com/2003_08_17_armedndangerous_archive.html)

      --
      Curiosity was framed; ignorance killed the cat. -- Author unknown
    2. Re:Flip This Around by Anonymous Coward · · Score: 0

      "I guarantee you won't like what our alliance is cooking up next."

      STOP, or I shall write another manifesto!!

  68. I love geek stereotypes as much as the next guy... by Anonymous Coward · · Score: 0

    ...but can't we at least move on to ANOTHER Monty Python movie? I hate "The Holy Grail".

  69. Silly question... Why not a spell checker? by zakezuke · · Score: 1

    Right, I remember one of the early comments from SCO's CEO... that one of the major similarities was the consistent and repeated spelling errors, as shown in the most recent example they released.

    While spelling errors are sometimes consistent among different programmers, wouldn't it be wise to:

    1. Strip out all the comments into a different file
    2. Perform a spell check, isolate misspelled words
    3. Scan the Linux kernel for these misspellings
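
    Those three steps could be sketched roughly as follows. This is only an illustration, not anything SCO or ESR is known to use: the tiny KNOWN_WORDS set is a hypothetical stand-in for a real dictionary, the function names are made up, and the regular expression only handles /* ... */ comments (it ignores // comments and comment markers inside string literals).

```python
import re

# Hypothetical stand-in for a real dictionary word list.
KNOWN_WORDS = {"the", "buffer", "is", "full", "return", "when", "check", "length", "of"}

COMMENT_RE = re.compile(r"/\*(.*?)\*/", re.DOTALL)
WORD_RE = re.compile(r"[A-Za-z]+")


def extract_comments(source):
    """Step 1: pull the text of every /* ... */ comment out of C source."""
    return COMMENT_RE.findall(source)


def misspelled_words(comments, dictionary=KNOWN_WORDS):
    """Step 2: collect lower-cased comment words that the dictionary lacks."""
    found = set()
    for comment in comments:
        for word in WORD_RE.findall(comment):
            if word.lower() not in dictionary:
                found.add(word.lower())
    return found


def shared_misspellings(source_a, source_b, dictionary=KNOWN_WORDS):
    """Step 3: misspellings from source_a that also occur in source_b's comments."""
    typos = misspelled_words(extract_comments(source_a), dictionary)
    words_b = {w.lower() for c in extract_comments(source_b) for w in WORD_RE.findall(c)}
    return typos & words_b


if __name__ == "__main__":
    tree_a = "/* check the lenght of the buffer */\nint n = 0;"
    tree_b = "/* return when the buffer is full */ /* lenght check */"
    print(shared_misspellings(tree_a, tree_b))
```

    A shared misspelling is at best a hint about where to look; it says nothing about which side copied from which.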

    --
    There is no sanctuary. There is no sanctuary. SHUT UP! There is no shut up. There is no shut up.
  70. I just ran this software on Slashcode... by Anonvmous+Coward · · Score: 1

    ... and it kept noticing "we welcome our new ??? overlords."

    Something's wrong with this bloody thing.

  71. how about the Bible Code Algorithm? by The+Lynxpro · · Score: 3, Funny

    Have any of those techno-Rabbis run a comparison search with their "Bible Code" program on SCO? Did it come up with the phrases "bankrupt in 2004," "full of camel dung," and "Serpent of Utah"? How about running the "Bible Code" on Unix System V. code? Considering SCO's fondness for converting code over to Greek symbols for their presentations, converting to sanskrit, Hebrew or Aramaic shouldn't be a problem...

    --
    "Right now, somewhere in this world, Scott Baio is plowing a woman he doesn't love," - Peter Griffin, *Family Guy*
    1. Re:how about the Bible Code Algorithm? by Kalak · · Score: 1

      Now I finally know why I borrowed that book from the library - so I can understand the prophet Darl's true message. Where are my SCO press releases?

      --
      I am, and always will be, an idiot. Karma: Coma (mostly effected by .hack)
  72. Here's how you defeat obfuscators by Serveert · · Score: 3, Insightful

    Compare C parse trees. That's right, look at the parse trees, use some fancy graph algorithm to compare the calculations and parse tree nodes.

    Someone mod this up I think I'm on to something!

    --
    2 years and no mod points. Join reddit. Because openness is good.
  73. Re:The real question is: by bladernr · · Score: 3, Insightful
    Hmmmm.... that is an interesting point, but I'm not sure it applies.

    Here is the reason: the people that "stole" SCO's code (if indeed that happened) probably were not acting with ill intent. They probably thought they were doing genuine, valid reuse, in which case, why hide it? Obfuscating runs the risk of introducing new bugs.

    OSS programmers, even the ones that cut corners, are not malicious in my experience. There are honest mistakes made, because, well, they are lone programmers, not lawyers, or professional managers, or financial experts, or whatever.

    However, if code was deliberately obfuscated, that would be very, very bad news for Linux. That shows that it was not an honest mistake, but the programmer knew something about the origins and they needed to be hidden. At the best, he could argue that he didn't think that it was an IP violation, he was just trying to make himself look better by not giving credit. The other side could argue he obviously knew he was breaking the law.

    Of course, as I said, I honestly don't think this case will come about. Even if code found its way in, I don't think it was a programmer saying "Hey, I'm going to do this, but it is illegal, so I will cover my tracks."

    --
    Sarcasm and hyperbole are the final refuges for weak minds
  74. Refactoring Tool by rimu+guy · · Score: 1

    This sounds like a great tool. The copied code I'm concerned about, is the code myself and other developers on the same project have copied from one file to another file.

    If I can use this tool to find that code and refactor it into subroutines/classes then that would be superb.

    And if someone could write a plugin for Eclipse to help automate/assist with the refactoring...

    -P
    RimuHosting - Linux VPS Hosting

  75. Re:Could Linux users BE any gayer? by Anonymous Coward · · Score: 0

    You're obviously a little misinformed, check out the stats on that snazzy IIS server the International Lesbian and Gay Association are using. Don't worry, most Windows users are "in the closet" just like you.

  76. This explains the stock price increase today... by weave · · Score: 1
    A running joke on the SCOX yahoo investment board is that whenever some bad news comes out about SCO, SCOX stock goes up. It's uncanny. IBM announces counter-suit, stock goes up that day. RHAT announces suit, stock goes up. Code leaks and is debunked, stock goes up. Now this -- and today SCOX goes up almost 10%.

    1. Re:This explains the stock price increase today... by Anonymous Coward · · Score: 0
      Code leaks and is debunked, stock goes up. Now this -- and today SCOX goes up almost 10%.

      Maybe it's because the investment community understands that Linux's days are numbered and SCO is telling the truth. Once you fuckers have to pay SCOSource $699 per processor you're going to be crying home to mommy that Linux sucks and BSD rules. FreeBSD is dying, my ass.

    2. Re:This explains the stock price increase today... by Russ+Nelson · · Score: 1

      Sigh. I saw the lawsuit, said "This is bullshit. SCO is going down. Better sell now that the stock is at 2.5".
      -russ

      --
      Don't piss off The Angry Economist
  77. derivative work? by donutz · · Score: 4, Interesting

    Presumably any of the many people with legal rights to SCO source code can publish the hash list without divulging any of SCO's (ahem) "IP".

    Would these hashes of SCO source code be considered derivative works? That could have copyright implications...

    1. Re:derivative work? by Anonymous Coward · · Score: 0

      can an integer number be copyrighted?

    2. Re:derivative work? by aastanna · · Score: 1

      A derivative work is work that has been "recast, transformed, or adapted. A work consisting of editorial revisions, annotations, elaborations, or other modifications which, as a whole, represents an original work of authorship, is a 'derivative work.'"

      Seems to me that one could argue both sides. The word transformed seems broad enough to cause some problems, but calling the hashes derivative would seem to violate the spirit of the law.

    3. Re:derivative work? by Wumpus · · Score: 4, Insightful

      Let's apply a powerful legal tool: The silly analogy.

      Take a copyrighted work (Harry Potter and The Chamber of Secrets, for example).

      Now, rearrange all the letters randomly, and pick (say) every 10th letter. Apply rot13 to the result, and print it.

      Is this derivative work? If you think it is, then, yes, copyright holders should be able to control MD5 hashes produced from their work.

    4. Re:derivative work? by Ninja+Programmer · · Score: 2, Insightful
      Would these hashes of SCO source code be considered derivative works?
      It's derivative, but is it work? I think this could easily be explained to the court. There is no expressive usable content in MD5 hashes and it's not realistically reversible -- that's the whole point of it.
    5. Re:derivative work? by ls+-lR · · Score: 2, Interesting

      I think we all agree that the obvious "duh" answer is that "of course they wouldn't be derivative works." But SCO has proven that it has a knack for just making stuff up or interpreting things funny. However, even based on the letter of the law I don't think this would qualify as a "transformation." That would seem to apply to a case where you shift the representation of the data to a different format but retain its essence, such as copying a DVD to a VHS tape. However, creating MD5 sums does not seem like it would be a transformation in that sense, in that the new work has none of the qualities of the original -- it's not code, it won't compile, it cannot be used to divine any algorithms, methods, etc. In short it's completely useless, other than for comparing to other source code fragments.

    6. Re:derivative work? by Idarubicin · · Score: 1
      Would these hashes of SCO source code be considered derivative works? That could have copyright implications...

      Is a library call number a derivative work?

      --
      ~Idarubicin
    7. Re:derivative work? by mik · · Score: 1

      The point of MD5 is that it is a unidirectional hash: possession of the hashes is insufficient to do anything but compare with code you've already got. If the arrangement information (line numbers and filenames) of the SCO data set is discarded, then there is no way to recover the original. It would be an uphill battle to convince a court that a hash set could be considered derivative if the only possible use of it is to determine the veracity of the parties' claims.
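
      A minimal sketch of that hash-only comparison, under some loud assumptions: it hashes every run of three consecutive non-blank lines after deleting whitespace, which is only a guess at the general shape of the "shred" idea and not comparator's actual code, and the function names and sample sources are invented for illustration.

```python
import hashlib


def shred_hashes(source, size=3):
    """Return the set of MD5 digests of every run of `size` consecutive
    non-blank lines, with all whitespace removed from each line first."""
    lines = ["".join(line.split()) for line in source.splitlines()]
    lines = [line for line in lines if line]
    digests = set()
    for start in range(len(lines) - size + 1):
        chunk = "\n".join(lines[start:start + size])
        digests.add(hashlib.md5(chunk.encode("utf-8")).hexdigest())
    return digests


def common_shreds(hashes_a, hashes_b):
    """Compare two trees using nothing but their digest sets."""
    return hashes_a & hashes_b


if __name__ == "__main__":
    proprietary = "int f(int a)\n{\n    a += 1;\n    return a;\n}\n"
    public = "int g(void)\n{\n  return 0;\n}\nint f(int a)\n{\n\ta += 1;\n\treturn a;\n}\n"
    # Only the digests would need to leave the proprietary site.
    published = shred_hashes(proprietary)
    print(len(common_shreds(published, shred_hashes(public))), "shreds in common")
```

      Because the digest set carries no filenames or line numbers, a match only tells the holder of the other tree which of their own lines to look at.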

    8. Re:derivative work? by Suidae · · Score: 1

      In that case, any work which happens to contain only characters in HPaTCoS could be shown to be a 'derivative work', simply by coming up with an algo that would produce the work.

      Heck, it would be trivial to produce an Xth order function with roots that completely describe any number of short poems in terms of HPaTCoS.

  78. Re: Copyright by daigu · · Score: 2, Insightful

    While IANAL, I don't suspect you are either. Copyright is not something that applies to ideas - it applies to expressions of ideas. I'll quote the Apple vs Microsoft case note by Joseph Meyers:

    Typically, copyright protection is awarded to literary work, and Congress has included the code that makes up a computer program in this category.[7] Additionally, some non-literal expression is protected. For example, not only are the actual words in an author's copyrighted novel protected, but the structure and plot may be protected as well.[8] The debate[9] in the area of computer science is whether, by analogy, this means that the result, output, organization or display (the "look and feel") of a computer program might be protected as well, even if the source code is different.[10] This is to be distinguished from the idea underlying the program, which is not subject to copyright protection.[11]
    In other words, the question on the table is whether portions of the Linux kernel are a derivative work of SCO's code - not whether it uses SCO's ideas.
  79. He's lost the plot by Pentagram · · Score: 1

    What's particularly bizarre is that he wrote an apology for Franco. Not exactly the side of the Spanish civil war you'd expect an 'anarchist' to take!

  80. Ancient Unix Code by twistedcubic · · Score: 1

    So if the "Ancient Unix Code" is still available and its license permits, we can do the comparisons today.

  81. Excellent, but can you wait... by harryk · · Score: 1

    OK, I'm a bastard and all, but with the most recent SCO (SCOX) stock jumps can you wait to leave them in the dirt for a few more days. I'll be able to buy my 8 160GB drives in just another day or two if you'll let the stock continue to climb.

    But seriously. This is a good thing. SCO needs to be slapped down like a bad habit. Darl has pushed this long enough, and with a tool like this, we can finally push back.

    Great work Eric!

    harryk

    --
    think before you write, it'll save me moderator points.
  82. Not as useful in court by klui · · Score: 2, Interesting

    It will just tell someone two trees are similar/identical. The important thing to prove in court is who copied from whom.

  83. Just in case by TCM · · Score: 1

    the question hasn't been asked:

    But does it run on Linux?

    --
    Of course it runs NetBSD. BTC: 1NT7QvbetmANwaMzhpVL6
  84. Better way to compare code by Brikus · · Score: 2, Informative

    Speaking of BSD, a better way of doing this comes from Berkeley too. It's a program called Moss that is used by many universities to detect plagiarism in CS classes. I know from firsthand experience that this is a very powerful program. Unlike the shredding technique, things like changing variable names won't affect the comparison value Moss returns. It even does a pretty good job of noticing changes like replacing for loops with while loops.

    One disadvantage it does have though is that it won't work with the MD5 checksums, although I'm a bit skeptical of how well that would work anyway.

    1. Re:Better way to compare code by Reteo+Varala · · Score: 1

      Maybe, but Linux is quite huge... I ran a "tar jtvf | wc" on a Linux kernel bzip-tarball, and found that the file list alone had eleven thousand files; and I'm laying odds those aren't one-line files, either.

      Now, I'll admit that about 2-3 dozen of those files are documentation, and another 4 dozen are likely the assembler code for the low-level. But is a program like Moss, designed to compare CS code in classes, going to take the kernel without hiccupping? Or is someone just going to feed the code in, one file at a time, and hope that it's all going to go well?

  85. MD5 easily fooled by dmiller · · Score: 1

    While the concept sounds nice, any line by line comparison could easily be fooled. A run through indent, a comment change or a common search & replace on a variable will change the MD5 sum. A (rather more difficult) enhancement would be to compare code at the semantic level (perhaps using gcc's intermediate RTL or TenDRA's ANDF).

    1. Re:MD5 easily fooled by Jeremy+Erwin · · Score: 1

      While the concept sounds nice, any line by line comparison could easily be fooled. A run through indent, a comment change or a common search & replace on a variable will change the MD5 sum.


      So, you've downloaded Comparator, and run tests, then. (I haven't, yet--I have to port it to macosx, first.)

    2. Re:MD5 easily fooled by Jeremy+Erwin · · Score: 1

      It's going to be a while, too. Need to translate the ftw calls to their fts equivalents. Perhaps I should read the man pages.

    3. Re:MD5 easily fooled by dmiller · · Score: 4, Interesting

      So, you've downloaded Comparator, and run tests, then.

      I didn't need to, the following is in the readme:

      comparator does not attempt to do semantic analysis and catch relatively trivial changes like renaming of variables, etc. This is because comparator is designed not as a tool to detect plagiarism of ideas (the subject of patent law), but as a tool to detect copying of the expression of ideas (the subject of copyright law).

      He's wrong BTW (and he is smart enough to know it, which makes this a deliberate deception). A work is no less subject to copyright if someone does a global search and replace on a variable name.

    4. Re:MD5 easily fooled by ESR · · Score: 2, Informative

      But for changes of that kind, the burden of proof
      is heavy and on the party alleging infringement.
      The comparator tool isn't designed to try to catch
      such deliberate obfuscation, because that would get
      into murky territory near the boundary of expression and idea. Did you really think I failed to study the legal questions before I wrote this?

      --
      >>esr>>
    5. Re:MD5 easily fooled by dmiller · · Score: 1

      Your readme demonstrates none of the ambiguity that you have just expressed. Also what you describe as "deliberate obfuscation" (re-indenting or variable renaming) occurs as a matter of course when software is appropriated (legally or otherwise).

  86. Re:The real question is: by Anonymous Coward · · Score: 1, Interesting

    OSS programmers are usually highly-connected programmers, not lone ones. I haven't met one in years that doesn't have an IRC window open in the background for chitchat while their brain is idling. All the proprietary programmers I've met (mainly shareware authors) have been sad, lonely, bitter, microsoft-haters, while the OSS crowd just treat MS with mild derision.

    Lawyers, professional managers and financial experts all make huge, glaring mistakes, sometimes even making the news.

  87. SCO ethics anyone? by bicho · · Score: 1

    Even if that program is to be used to compare SCO's source tree to Linux, given what we know of SCO's ethics, how can we be sure they wouldn't mess with their own code so as to get their "expected" results?

    At this point, SCO should have 0.00000... credibility, and nothing short of clear exposure of the allegedly copied code should be accepted.

    --

    errera hunamum ets
  88. For MSN Money: "Tech analyst likes SCO Group" by quisquil · · Score: 1
    According to this story:
    "This is not the time to go rushing in" to buy tech stocks just because they're hitting 52-week highs, Cohen said on CNBC's "Wake Up Call" on Tuesday.

    "It's a great moment to step back and look not just at its income statement but at a company's competitive advantage, its position in the industry and the quality of its balance sheet," he said.

    Cohen recommended three stocks, each of which he owns in funds managed on behalf of clients. "These are companies with very good balance sheets and the ability to generate lots of cash flow," he said.

    And among those the marvelous SCOX... cannot copy here SCO group description and stock analysis from this "unbalanced" financial article, it is quite disgusting to me.... go and read if you want.

    Just another example when real and reality follow different paths.

  89. ..... jesus by Anonymous Coward · · Score: 0

    wtf

  90. Now what? by stormcoder · · Score: 0

    It should be interesting to see what bizarro statement SCO comes up with to counter this.

    --
    Sorry my bullshit sensor overloaded.
  91. Re:The real question is: by bladernr · · Score: 1
    I didn't mean lone in the sense of "alone", I meant in the sense that their associates are other open-source programmers. That was worded very, very poorly by me, so I'm sorry.

    The sentiment is basically that, if I am in a software company, I have access to legal opinion (Can I use this code?), financial resources (if not, can we license this code), and other stuff. OSS doesn't always have easy access to the same resources.

    That is why I think honest mistakes could be made. In the absence of a legal team, financial team, etc, sometimes you just have to do what you think is right, or best, and it may not always be.

    --
    Sarcasm and hyperbole are the final refuges for weak minds
  92. What makes you think.... by Anonymous Coward · · Score: 0

    that SCO is even interested in having their source examined by such tools? Do you really think that Microsoft is going to submit their code base to any such examination?

    This points up a very basic fallacy in SCO's IP arguments: they keep claiming that people cannot be protected in an OSS environment. What they really mean is: you don't have to worry about it in a proprietary environment because we won't let anyone see the code! We can steal whatever we want with impunity from lawsuits! We can generate whatever lawsuits we want and no one can prove us wrong because we can see their source and they can't see ours!

    Now, don't get me wrong: I do not think that Microsoft will avoid any of this kind of comparison due to theft. Rather, I think they will avoid it because of liability issues. If their code was ever examined against the most common causes of buffer overrun vulnerabilities, it would most certainly fail. Given that the tools exist to examine source code for exactly these common design problems and that Microsoft code continues to suffer from these problems, why wouldn't they be liable for damages due to exploits of these things? They must have been either incompetent or stupid. Either looks bad in a claims court!

  93. Miss the point by mic256 · · Score: 1

    I think you miss the point. AFAIK SCO didn't sue IBM on the basis of copyright infringement, but rather breaching trade secrets. They admit the code in question is owned and copyrighted by IBM. They talk a lot about Linux code infringing their copyrights, but they haven't so far sued anybody because of it, nor did they say what the problem is. They are only talking, making press releases, etc., no legal action. They are probably not even offering those licenses for real - I haven't heard about anybody who bought them - did you? It could be a basis for a lawsuit against them after all. It seems to me they are playing safe on the legal side - one lawsuit, nothing more.
    The real issue here is the media attention they get. Microsoft has lots of lawsuits - the Timeline case (users of SQL Server might be held responsible), the browser case (they might be forced to remove ActiveX support from IE and pay a 0.5 billion fine), the case brought against them by Intertrust (DRM infringing their patents, might affect almost all of their products, Microsoft is losing so far). These are real cases, yet few notice them. But one lawsuit by a little known company, with little chance of success, gets so much attention, and moreover everybody seems to get it wrong - I don't recall reading a note, commentary or an article in a newspaper or a magazine that would get it right. This is (correct me if I'm wrong) a case about a breach of contract by IBM, because it contributed its own code, which has not a single line of SCO's in it, to Linux, which was later sold by SCO. They claim now that they didn't know what they were selling. They didn't know Linux contained (at that moment) JFS support, NUMA, SMP, contributed by IBM, SGI, etc. After all - how could they? Did they have access to the source code?
    Of course they did. Get it finally. The goal of this company is not to make profit. It's not to find justice. It is to make everybody believe that Linux is illegal. This is why ESR's moves are a waste of time.
    People treat their claims seriously, because the media do. The question is - why the media behave the way they do?

    1. Re:Miss the point by glynor · · Score: 1
      Actually, I think you miss the point, at least where it comes to the point of the case. The actual complaint (amended) reads:
      4. The UNIX software distribution vendors, such as IBM, are contractually and legally prohibited from giving away or disclosing proprietary UNIX source code and methods for external business purposes, such as contributions to the Linux community or otherwise using UNIX for the benefit of others. This prohibition extends to derivative work products that are modifications of, or based on, UNIX System V source code or technology. IBM and certain other UNIX software distributors are violating this prohibition, en masse, as though no prohibition or proprietary restrictions exist at all with respect to the UNIX technology. As a result of IBM's wholesale disregard of its contractual and legal obligations to SCO, Linux 2.4.x and the development Linux kernel, 2.5.x, are filled with UNIX source code, derivative works and methods. As such, Linux 2.4.x and Linux 2.5.x are unauthorized derivatives of UNIX System V.

      5. As set forth in more detail below, IBM has breached its obligations to SCO, induced and encouraged others to breach their obligations to SCO, interfered with SCO's business, and engaged in unfair competition with SCO, including by:

      a) misusing UNIX software licensed by SCO to IBM and Sequent;

      b) inducing, encouraging, and enabling others to misuse and misappropriate SCO's proprietary software; and

      c) incorporating (and inducing, encouraging, and enabling others to incorporate) SCO's proprietary software into Linux open-source software offerings.

      Which I take to mean that SCO believes that IBM both copied verbatim lines of code from System V UNIX (which SCO at least appears to have some sort of a copyright claim to), and inappropriately used intimate knowledge of SCO's System V UNIX source to create its own "derivative work products that are modifications of, or based on, UNIX System V source" ... by adding the code to linux. However, as is pointed out in It Ain't Necessarily SCO by Rob Landley and Eric Raymond there are many problems with this complaint. One of which is that derivative works are expressly permitted (and are in fact deemed to belong to the creator, in this case IBM) in the contract that SCO is disputing, which BTW was between ATT and IBM, not SCO and IBM.

      As far as your point about the media. I think you were at least partly on there, though I think that this had much more to do with SCO's motivation. Also, notice where that link came from. I think they may have much more motivation to "make everybody believe, that Linux is illegal."

      --
      -glynor

      Some cultures are defined by their relationship to cheese.

  94. Don't these guys have anything better to do? by Anonymous Coward · · Score: 1, Interesting

    Seriously. Perens and ESR are fueling SCO's flames by giving them poorly-thought-out statements to cull choice quotes from to support SCO's case. And SCO's words are the only ones seeing mainstream attention (check whose stories are linked from any pages about the SCOX stock prices, and you'll see the public is only getting SCO's side of the story).

    In my most reasonable, humble opinion, anyone who is not an IBM lawyer really needs to STFU concerning this matter. The wise man waits his turn to speak.

  95. Re:IT WILL NOT WORK! Here's technical reason why by Q2Serpent · · Score: 4, Funny

    create a C language parser that reduced the C-code down to op codes

    like gcc?

  96. Re:Linux is Gay by Anonymous Coward · · Score: 0

    I smell a troll so fiendish, so rotten, so utterly insane; LADIES and GENTLEMEN: May I introduce our pugilist/trollmeister of the evening...in the right corner, weighing in at 400 lbs of bullshit, heavyweight champion of the obscure trolling world, the best, the worst, the uncrowned king of sociopaths:

    Kadaitscha Man! - "insert favorite troll"

    (Trumpets)

    So, "insert favorite troll", how's that BS in Scatology coming along?
    Ah.
    I see.
    Yeah, I believe you...no,no, eating shit is probably good for you, in fact, err, I think I read somewhere that a certain african tribe use this practise to entice...
    Say what?
    Eh, well, I guess chimp-turds are kind of unconventional for a person in your office, but hey, what...
    Hello?
    Where'd you go, "insert favorite troll"?
    Hellooooo!

    (Fade out)

    (and yes, permission granted for inclusion in "the peanut gallery")

  97. Oh, wow. Super by Anonymous Coward · · Score: 0

    I didn't know the old extremist fart could still actually write code.

  98. Re:The real question is: by yeremein · · Score: 1
    However, if code was deliberately obfuscated, that would be very, very bad news for Linux. That shows that it was not an honest mistake, but the programmer knew something about the origins and they needed to be hidden.

    It would be hard to prove that the differences between Unix and Linux amount to deliberate obfuscation.

    SCO thought that Linux's BPF implementation was an obfuscated copy of the original, when in fact it's a clean room reimplementation based on published specifications.

    In addition, natural evolution of code could be easily mistaken as "obfuscation" by someone who is not a kernel hacker.

  99. his tool is only good for this case by segmond · · Score: 1

    next time peeps rip off codes, they litter every other line with dummy lines, n = n;
    n = n * 1
    n = n + 0
    and bam his tool fails.

    --
    ------ Curiosity killed the cat. {satisfaction brought it back | it didn't die ignorant | lack of it is killing mankind
  100. ok, I have a question by EricTheFish · · Score: 1

    If SCO doesn't want people to find out which lines are "copied into linux", so they can be removed, then why would they allow their code to be run through this program?

    The open source community would then know which lines SCO has any business talking about, those lines could be investigated as to their source/license, changed if neccesary, and everyone could move on.

    But wait! SCO wouldn't have a case. (Not that they have one anyway, but that's beside the point. This point, at least). This is exactly the same situation as SCO just telling everyone which lines they are talking about. How would it be different?

    I mean if you compare two sets of data, and you know which lines are identical....

    Is the difference that maybe someone wouldn't be breaking their NDA with SCO if they ran the code through this and provided MD5 hashes (rather than the raw source code)? (i.e. so that this can actually be done, whereas SCO will not show their code themselves...)

    --
    -ETF EOM
  101. A tool like this by earthforce_1 · · Score: 1

    would be very useful, as it can cut both ways. I wonder how many Linux or BSD device drivers found their way into SCO source code? Once similarities have been detected, it would be a simple matter to go through the Linux kernel archives and trace the source of the device driver code. Could SCO do the same? Would they dare?

    Of course, the hashes would have to be invariant against camouflage by variable name search and replace, whitespace and blank line insertion, etc. And tests would have to be made to ensure nobody is pulling a fast one by swapping code - it would have to be shown to compile correctly to the executable.

    --
    My rights don't need management.
  102. Re:IT WILL NOT WORK! Here's technical reason why by Anonymous Coward · · Score: 1, Insightful

    > A single change in the white space would make the
    > MD5keys not match. any code run through an
    > obfuscator would not match. a change in case
    > would cause it not to match.

    Sorry to flame here chaps but the comparator code already strips all white space characters from the lines before checksumming them. Thus trivial changes such as indentation size will be ignored (though switching from 'proper' K&R indenting to some other form will break it).
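
    A small sketch of what that whitespace stripping does and does not survive. It assumes a simplified per-line scheme (delete all whitespace, then MD5 the line); the function name and sample snippets are invented for illustration and this is not comparator's own code.

```python
import hashlib


def line_digests(source):
    """MD5 of each non-blank line after deleting all whitespace in it."""
    digests = []
    for line in source.splitlines():
        squeezed = "".join(line.split())
        if squeezed:
            digests.append(hashlib.md5(squeezed.encode("utf-8")).hexdigest())
    return digests


ORIGINAL = "if (count > 0) {\n    total += count;\n}\n"
REINDENTED = "if (count > 0) {\n\t\ttotal  +=  count;\n}\n"
REBRACED = "if (count > 0)\n{\n    total += count;\n}\n"
RENAMED = "if (n > 0) {\n    total += n;\n}\n"

if __name__ == "__main__":
    base = line_digests(ORIGINAL)
    print("re-indented matches:", line_digests(REINDENTED) == base)
    print("brace moved matches:", line_digests(REBRACED) == base)
    print("renamed matches:    ", line_digests(RENAMED) == base)
```

    Under these assumptions only the re-indented variant still matches; moving the brace to its own line changes where the lines break, and renaming changes the text itself.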

  103. Bah! FSS developers will never learn... by greppling · · Score: 3, Funny

    ...how to write good user interfaces. With coders like you we will never achieve complete world domination. The correct program is, of course, s.th. like this:

    int main()
    {
        int i;
        printf("Comparing source trees...\n");
        sleep(2);
        printf("Check started.\n");
        for (i = 1000; i--;) {
            printf(".");
            sleep(1);
            if (i % 100 == 0)
                printf("\n%d0 percent remaining\n", i / 100);
        }
        printf("\n\nThese source trees appear to be entirely different!\n");
        return 0;
    }

    1. Re:Bah! FSS developers will never learn... by Anonymous Coward · · Score: 0
      Line 7: for (i = 1000; i--;) {

      Where's the limit test? Or did you mean:
      for (i = 1000; ;i--) {
    2. Re:Bah! FSS developers will never learn... by joe_bruin · · Score: 2, Informative
      Line 7: for (i = 1000; i--;) {

      Where's the limit test? Or did you mean:

      for (i = 1000; ;i--) {


      what the original poster had works correctly. i-- is a post-decrement: it evaluates to the value of i before the decrement, so the loop condition becomes false, and the loop ends, when i is zero.
    3. Re:Bah! FSS developers will never learn... by Anonymous Coward · · Score: 0

      My mistake.

  104. Re:The real question is: by bladernr · · Score: 1
    Legally, you are probably right. Morally? I don't think any one of us would get a good feeling if we knew that OSS was in the wrong, but we won on a technicality (in this case, we were wrong, but SCO was not able to get past the reasonable-man test).

    To keep up the /. tradition of analogy, imagine if I stole something from the store, but, in court, I claimed that I already had the item, and won. Does it make the theft right? No. In fact, it may make it worse, because I added lying on top of stealing.

    Oh yeah, remember, in a civil suit, SCO does not have to get beyond reasonable doubt. So if a reasonable person would think that it is an obfuscation of the original, SCO wins. To use your example, a jury will not be stacked with kernel hackers. That is something that goes both ways, but it worries me. Hell, if they can find OJ innocent, maybe they can find IBM (Linux) guilty?

    --
    Sarcasm and hyperbole are the final refuges for weak minds
  105. Improved crap by j_w_d · · Score: 1

    You mean that by digesting SCO crap, they may produce improved crap?

    --
    ------ The only greater hazard to your liberty than n politicians is n+1 politicians.
  106. Re:IT WILL NOT WORK! Here's technical reason why by larry+bagina · · Score: 1

    lcc would be a better solution. It can spit out text dags. I wasn't aware gcc could do anything other than asm.

    --
    Do you even lift?

    These aren't the 'roids you're looking for.

  107. MOD PARENT DOWN by Anonymous Coward · · Score: 0

    -1, karma whore.

    1. Re:MOD PARENT DOWN by Malfourmed · · Score: 1
      -1, karma whore.

      Guilty your honour!

      ----

      Wordforge writing contest now open: deadline 2003-09-28

  108. a better algorithm is ... by segmond · · Score: 1

    his algorithm is really cheap...

    comparing object code will not work if even one line of code is added, as the checksum will be different.
    a lot of code doesn't port directly, so when it is stolen perhaps a few lines are added or deleted. if i were the author of this tool (note, i have no idea how it works besides checksums) i would look at the logic of the code.

    for example,
    an if statement will have the value 1
    variables have 2
    logical operators will have their own numbers
    say, && = 4
    comparison operators will have theirs, etc...
    == = 6
    ~ = 7
    ( = 8
    numbers = 9
    & = 10

    this code
    if (a && b) & ~0xf000 will yield
    1,8,2,4,8,10,9

    now we can pass those numbers to f(); our result will be Z, where Z defines the logic for that code
    the next logic we can call Z1, Z2 ... Zn

    now if the offending code has
    if (foobar && feh) & ~0xffff it will likewise yield
    1,8,2,4,8,10,9
    we get Y, and the next line Y1, Y2 ... Yn
    We can now then compare logics and compare how often they match.
    So if Z0 matches Y3, we check
    if Z1 matches Y4
    and if Z2 matches Y5

    by using this technique we can find logics that match, even if someone inserts 5 lines within our code; we do not have to worry about the checksums not matching or about matching object code.

    basically we are matching by logic, and if we notice 20 consecutive matching logics, or whatever threshold we set, we can yield a positive result.
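
    A minimal Python sketch of the idea described above. It is an illustration under assumptions, not code from the post or from comparator: the token classes (`VAR`, `NUM`), the keyword list, and the names `fingerprint` and `longest_common_run` are all hypothetical. Each line is reduced to a tuple of token classes, so renamed variables and changed constants give the same fingerprint, and the longest run of consecutive matching lines is compared against a threshold.

```python
import re

# Hypothetical token grammar: numbers, words, then operators/punctuation.
TOKEN_RE = re.compile(r"""
    (?P<num>0[xX][0-9a-fA-F]+|\d+)
  | (?P<word>[A-Za-z_]\w*)
  | (?P<op>&&|\|\||==|!=|<=|>=|[-+*/%&|^~!<>=(){}\[\];,.])
""", re.VERBOSE)

# Assumed keyword list; anything else that looks like a word counts as a variable.
KEYWORDS = {"if", "else", "while", "for", "do", "return", "switch", "case", "break"}


def fingerprint(line):
    """Reduce one line of C-like code to a tuple of token classes."""
    classes = []
    for match in TOKEN_RE.finditer(line):
        if match.lastgroup == "num":
            classes.append("NUM")            # every literal number looks alike
        elif match.lastgroup == "word":
            word = match.group()
            classes.append(word if word in KEYWORDS else "VAR")
        else:
            classes.append(match.group())    # operators keep their own identity
    return tuple(classes)


def longest_common_run(a_lines, b_lines):
    """Length of the longest run of consecutive lines with equal fingerprints."""
    fa = [fingerprint(line) for line in a_lines]
    fb = [fingerprint(line) for line in b_lines]
    best = 0
    prev = [0] * (len(fb) + 1)
    for x in fa:
        cur = [0] * (len(fb) + 1)
        for j, y in enumerate(fb, start=1):
            if x and x == y:                 # ignore lines with no tokens
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best


original = ["if (a && b) & ~0xf000", "x = y + 1;", "return x;"]
suspect = ["int unrelated;", "if (foobar && feh) & ~0xffff",
           "total = count + 7;", "return total;"]
print(longest_common_run(original, suspect))
```

    A real tool would need a proper lexer (strings, comments, preprocessor lines) and a threshold high enough that common idioms such as `return VAR;` do not count as matches on their own.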

    I just made this algorithm up on the spot; perhaps someone has already done it, perhaps not. :D the beauty of this is that we can even use this algorithm to compare DIFFERENT languages, for example, C code that has been ported to Java or Python.

    Wooops, it's now left for someone to implement and attempt to patent it, then profit! :D

    --
    ------ Curiosity killed the cat. {satisfaction brought it back | it didn't die ignorant | lack of it is killing mankind
    1. Re:a better algorithm is ... by Anonymous Coward · · Score: 0

      This was a great post, man, it sucks no moderators noticed it :( You're dead-on, though.

  109. Re:This Is The Same Idiotic Grandstander wrote thi by larry+bagina · · Score: 1

    does anyone have a copy of his essay from when VA Linux stock hit $0.50 a share? ($75,000)

    --
    Do you even lift?

    These aren't the 'roids you're looking for.

  110. Equal code solutions.. by euxneks · · Score: 1

    Wouldn't SCO then be able to just copy from linux and then the comparisons would also be equal?

    --
    in girum imus nocte et consumimur igni
  111. More possibilities... by Anonymous Coward · · Score: 0

    If you're comparing Linux to the legacy Unix source released by Caldera, wouldn't you want to throw away the common signatures? The unique signatures are what "might" contain any leeched code which is uniquely System V.

    Theoretically, you can repeat the comparisons iteratively with other code bases (e.g., NetBSD/FreeBSD/OpenBSD), or compare against other signature sets (results of other comparisons). A SVR licensee could compare their branded version of Unix against SCO's, isolating the code which they own/created.
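
    The iterative filtering described above is plain set arithmetic once each tree has been reduced to a set of shred signatures. A small Python sketch, assuming the signatures are already available as sets of hex strings; the function name `remaining_suspects` and the sample signatures are made up for illustration:

```python
def remaining_suspects(linux, sysv, cleared_trees):
    """Signatures shared by linux and sysv that no cleared tree also contains.

    `cleared_trees` is any iterable of signature sets whose code is known to be
    freely usable (for example a legacy Unix release or a BSD tree).
    """
    common = linux & sysv              # everything the two trees share
    for tree in cleared_trees:
        common -= tree                 # drop what has a legitimate origin
    return common


# Made-up signatures standing in for MD5 shreds.
linux = {"aa11", "bb22", "cc33", "dd44"}
sysv = {"bb22", "cc33", "dd44", "ee55"}
legacy_unix = {"bb22"}
netbsd = {"cc33", "ff66"}

print(remaining_suspects(linux, sysv, [legacy_unix, netbsd]))
```

    Whatever survives the subtraction is only a list of places to look, not evidence of copying in either direction.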

    Thanks ESR!

  112. Open Source by digidave · · Score: 3, Interesting

    THIS is exactly why Open Source works. It's not because of IBM or Red Hat or geeks from Finland. It's because people in the community are willing to step up to any challenge.

    Thanks, ESR.

    --
    The global economy is a great thing until you feel it locally.
    1. Re:Open Source by Anonymous Coward · · Score: 0
      THIS is exactly why Open Source works. It's not because of IBM or Red Hat or geeks from Finland. It's because people in the community are willing to step up to any challenge.


      And then I read your sig, and realize...this is a JOKE!

      Or you're a fucking moron.
  113. You guys are missing the point. by LinuxParanoid · · Score: 4, Informative

    Pardon me, but a lot of you guys are missing the point of this comparator.

    1) There are people out there with legit source licenses to SVR5 source trees. And not just Unix OEMs. Various people in large companies with SVR4/5 source licenses etc.

    2) Such people cannot release the source code, and may (if paranoid of how they interpret 'derived works') not want to publish hashed MD5 codes of SVR5.

    3) However, it should be possible for a legit SVR5 source licensee to publish openly a list of areas of code that are similar across trees, without that list being either A) a derived work, B) violating their NDAs (um, do check the fine print first though) and C) spending tons of their own, presumably expensive time, digging through stuff.

    Linux advocates can then sift through the resulting large pieces of code and doublecheck/crosscheck the origins of it. At the very least, we'd have a public list of suspicious areas of Linux and could sort each part into one of three categories: A) BSD-licensed, B) verified as original by a known Linux coder, or C) neither of the above, and so still 'suspicious'. This presumably is what ESR is referring to by "various persons will apply it in useful ways. Yes, I'm being deliberately vague and tantalizing". Let's say that it's likely the percentages in A and B will be large.

    Of course it's true that there could be code that this primitive tool doesn't catch. But SCO probably started their analysis by using tools like this also. Looking through millions of lines of code by hand is no cakewalk, so one will inevitably start with a tool like this in such an investigation. (Unless one is concerned about one specific predetermined critical/sensitive piece of code.)

    Oh, and the other thing about this tool that is nice IMHO? It demonstrates a "good faith" effort on the part of Linux advocates and coders to correct the problem -- despite the barriers raised by SCO (no code release except via NDA).

    Finally, running this tool across a Linux and a BSD release should turn up some data that is both interesting and relevant for this dispute. I'm almost tempted to try that myself.

    --LP

  114. Press release! by mflaster · · Score: 2, Interesting

    Why isn't this a press release?

    If I go to Yahoo, and look at news related to SCOX, this doesn't show up. Here is the open source community trying to help find any misappropriated IP - and no one that doesn't read slashdot/eWeek will know about it!

    Isn't there someone who subscribes to a wire service, that can issue a press release? In order to fight FUD, we have to get info out to people that don't read slashdot!!

    Mike

  115. Assuming someone gets a SCO shred-tree... by Stormbringer · · Score: 1

    It seems to me that the first order of business should be to compare it against Caldera's Linux tree to see if it lights up at 100%. How do you know it was the UNIX tree that was shredded? "Oops... that wasn't a shred-tree we intend to use in court."

  116. SCO? by Overly+Critical+Guy · · Score: 1, Funny

    Look how Slashdot turned this into another SCO article.

    The news was a simple source tree comparison tool. Why is the headline "ESR to Shred SCO Claims?" He didn't mention anything about SCO whatsoever.

    Just noticing. Now we'll have yet another few hundred SCO bitch posts. The Darl McBride troll will post and get modded up, people will try to act like intellectual property experts, and we'll all go about our day as usual. There is nothing new here but another attempt for more page hits on the part of corporate-owned Slashdot...

    --
    "Sufferin' succotash."
    1. Re:SCO? by Anonymous Coward · · Score: 0

      Wake me up when you stop being boring.

    2. Re:SCO? by TomV · · Score: 1
      He didn't mention anything about SCO whatsoever

      Well, ESR did specifically refuse to state whether he'd built this tool specifically for the SCO case, but he did say:
      "I am grinning a grin that should frighten the thieves and liars at SCO out of a week's sleep."
      which suggests that he may, just, subtly have hinted at a possible SCO connection... ;-)

      TomV
  117. Will...not...work... by Anonymous Coward · · Score: 0

    MD5 sums are only useful for determining if two chunks of data are EXACTLY the same. If you hash each file, then only copying a whole file would be detected. What if you hash each line? Well, then you still can't detect single character changes (such as re-indentation or added whitespace), and you risk having the line:

    void main() {

    occur in multiple places, resulting in collisions.

    What is really needed is a functional checker that compares the code logically, and a syntax checker that compares unique word fragments.

  118. No source = no copyright by poptones · · Score: 4, Insightful
    This entire argument is happening for ONE reason: various governments of the world (specifically, in this case, the US) have afforded COPYRIGHT protection to works that contribute nothing to "furtherance of the state of the art" and nothing to "the progress of science." If I build a power saw, I can patent unique aspects of its design but have to reveal those aspects.

    Copyright is misapplied to source code. Either REVEAL THE SOURCE or you only get protection on that which you "publish" - namely, the binary.

    Put up or shut up; no source, no copyright on the source. You won't share it, you don't need it protected.

    1. Re:No source = no copyright by IM6100 · · Score: 1, Flamebait

      Are you seriously using the terms 'patent' and 'copyright' interchangeably, and yet trying to be part of an adult discussion on the topic??

      --
      A Good Intro to NetBS
    2. Re:No source = no copyright by Anonymous Coward · · Score: 0

      Get your knuckles off the floor. Close your mouth and breathe through your nose. Take your ADHD meds. And then read what he wrote again.

    3. Re:No source = no copyright by poptones · · Score: 3, Interesting

      Apparently your reading comprehension skills are right on par with the dolt who modded the post down.

    4. Re:No source = no copyright by jmv · · Score: 1

      Careful what you wish for. If you had to publish the source and get patent-like protection on the whole source, it would basically be catastrophic. Every program published that way would be equivalent to thousands of patents, never mind the fact that you couldn't use the source anyway. I still prefer it that way. Copyright isn't that bad (at least compared to patents), and if the program is proprietary, I prefer not being exposed to the source anyway (legal protection).

    5. Re:No source = no copyright by IM6100 · · Score: 3, Interesting

      Get a clue. Nobody who copyrights a work is under any obligation to widely spread around the work. Copyright is inherent in any written work. I can write a poem intended only for my lover, just give the one copy of the poem to that lover, and it's protected by copyright. Break into my lover's house, steal a copy of the poem, and publish it, and you've broken copyright and I have standing to nail you good for it.

      Inventions, in order to be patented, need to be fully disclosed. That's inherent in the patent process: you're saying 'this is MY idea, here's the whole deal laid out, I assert that it's mine.' There's no comparable obligation for copyright.

      People like you who try to mush it all up are just trying to loot other people's property.

      --
      A Good Intro to NetBS
    6. Re:No source = no copyright by poptones · · Score: 1
      No. You get copyright protection on material you publish. This is tied to the executable in that it's part of the work, but it's still different than the executable.

      You prefer "not being exposed to the source" for LEGAL protection? Let's think about that: basically, Sid and Nancy would never have been made, for want of all those scenes that evoked other classics like The Gay Divorcee and An American in Paris. Most of Stephen King's work would never have been published at all (which you may argue would be a good thing) but then neither would Reservoir Dogs or even Rodney Dangerfield's classic scene in Natural Born Killers.

      So how is software uniquely beyond this scope of usage and protection?

    7. Re:No source = no copyright by nagora · · Score: 1
      No. You get copyright protection on material you publish.

      I don't think US law actually works that way anymore. I think the publication requirement was dropped at the same time that anonymous works became protected. But I'm not sure.

      I don't have anything other than doubt to add to this argument, I'm afraid.

      TWW

      --
      "Encyclopedia" is to "Wikipedia" what "Library" is to "Some people at a bus stop"
    8. Re:No source = no copyright by Anonymous Coward · · Score: 0

      If your argument was right competitors could steal my work before I publish it, publish it themselves and I couldn't touch them.

      Publishers could take the work in progress I send them under NDA and steal it because the NDA means it's not a published work.

      Obviously there's something very wrong with your claim. However cracked up the US legal system is, I get copyright on my work the moment it's written here in Europe, regardless of what happens to it afterwards. I simply don't believe the US is as different as you claim.

    9. Re:No source = no copyright by Raffaello · · Score: 2, Informative

      You are missing the context to which the OP refers, which is Article I, Section 8 of the United States Constitution. This Section gives Congress the power:

      "To promote the progress of science and useful arts, by securing for limited times to authors and inventors the exclusive right to their respective writings and discoveries; "
      The "requirement" that the grantee of either a copyright or a patent publish the work in question follows from the first clause of the sentence, i.e. "To promote the progress of science and useful arts."

      If grantees were allowed to keep their works secret, the grant would not be promoting the "progress of science and useful arts," since no other scientist or author would have access to their work.

      The whole idea of patents and copyrights in the U.S. constitution is that the grantee goes public with the invention/work, thus letting others advance "science and the useful arts" by using the grantee's work. In exchange for disclosing this information (remember, the word "patent" means "public."), the grantee is given a legal monopoly on the right to profit from the invention/work for a limited period of time.

      As the constitution sees it there are only two alternatives. Either the potential grantee keeps the work/invention a trade secret, never publishing it, but thereby giving up legal rights to time limited exclusive profitability, or, the potential grantee "promotes the sciences and useful arts" by publishing the work/invention, and thereby gains a time limited exclusive right to profit from it.

    10. Re:No source = no copyright by IM6100 · · Score: 1

      The "requirement" that the grantee of either a copyright or a patent publish the work in question follows from the first clause of the sentence, i.e. "To promote the progress of science and useful arts."

      It says right in the text you quoted that the aim is to 'secure .... exclusive rights.'

      One exclusive right is the right of free association, and the right to control how your work is promulgated.

      There's a clear rationale for disclosure of patented ideas. The whole purpose in patent law is for interests to not have to hide behind 'trade secret' policies. Nothing of the kind is true for copyright.

      It's not been enforced the way that you proposed. Thank goodness firebrands like yourself aren't allowed to force your interpretation on the rest of us.

      --
      A Good Intro to NetBS
  119. Jon Katz died? by Anonymous Coward · · Score: 0

    So how did he meet his fate?

  120. Buying UNIX.. by k98sven · · Score: 1

    In all fairness, SCO's value is not in being purchased so that the source code can be freed...

    Well.. that's not what I had in mind either.
    Rather, I was thinking that the source and rights to UNIX will be up for sale with the rest of SCO's assets after they file for bankruptcy. (And THAT, I assure you, is just a matter of time..)

    Interestingly, Red Hat's lawsuit and IBM's countersuit could very well leave the future smoldering wreck of SCO owing them money..
    (I think we can all agree that their cases definitely have more merit than SCO's)

    In fact.. if that happens UNIX could be included in a financial settlement between either of the above parties and the trustee of SCO's bankruptcy.

    Now.. -that- would be sweet.

  121. Md5!!! by Laconian · · Score: 1

    Is there a page that details the uses of the MD5 algorithm? I've just been using it for password security, but it seems like every other /. article references MD5 hashes. I'd like to know more about it.

  122. what if... by Anonymous Coward · · Score: 0

    what if SCO deliberately releases a list of shreds MD5fied from the linux kernel, as if they were from their own sources, without ever disclosing any code to public observation?

  123. SCO's trade secrets --- it's all FUD by EmbeddedJanitor · · Score: 5, Funny

    They would be divulging SCO's biggest trade secret, that all their claims are just FUD.

    --
    Engineering is the art of compromise.
  124. Re:IT WILL NOT WORK! Here's technical reason why by Megaslow · · Score: 4, Informative

    RTFM:
    Name

    comparator, filterator -- fast comparisons among large source trees
    Synopsis

    comparator -c [-d dir] [-o file] [-s shredsize] [-w] [-x] path...

    [snip]

    The -s option changes the shred size. Smaller shred sizes are more sensitive to small code duplications, but produce correspondingly noisier output. Larger ones will suppress both noise and small similarities.

    [snip]

    The -w causes all whitespace in the file (including blank lines) to be ignored for comparison purposes (line numbers in the output report will nevertheless be correct). This is recommended for comparing C code; among other things it means the comparison won't be fooled by differences in indent style.

    Using the appropriate switches will address two of your points. I'm sure it wouldn't be extraordinarily difficult to modify the code to ignore other things such as string constants, variable names, etc.
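
    To illustrate what the quoted -s and -w options imply, here is a short Python sketch. It is not comparator's code and makes assumptions that the quoted documentation does not confirm: whitespace is stripped within each line, blank lines are skipped, and each shred is reported at the original line number where it starts. The function name `shreds` and its parameters are hypothetical.

```python
import hashlib


def shreds(text, size=3, ignore_whitespace=True):
    """Map MD5 digest -> list of original 1-based line numbers where a shred starts."""
    kept = []                                    # (original line number, normalised text)
    for number, line in enumerate(text.splitlines(), start=1):
        if ignore_whitespace:
            line = "".join(line.split())         # drop spaces and tabs inside the line
            if not line:
                continue                         # blank lines vanish, numbering is kept
        kept.append((number, line))

    table = {}
    for i in range(len(kept) - size + 1):
        window = kept[i:i + size]
        joined = "\n".join(content for _, content in window)
        digest = hashlib.md5(joined.encode("utf-8")).hexdigest()
        table.setdefault(digest, []).append(window[0][0])
    return table


a = "if (x)\n    y = 1;\nelse\n    y = 2;\n"
b = "if (x)\n\ty = 1;\n\nelse\n\t\ty  =  2;\n"      # re-indented, one blank line added

print(len(set(shreds(a)) & set(shreds(b))))
print(len(set(shreds(a, ignore_whitespace=False)) &
          set(shreds(b, ignore_whitespace=False))))
```

    Raising `size` plays the role the documentation gives to a larger shred size: fewer accidental matches, but short copied fragments are missed.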

  125. It won't isolate non-copyrighted code by Kazoo+the+Clown · · Score: 1

    Problem is, at least in the case of SCO and probably in the case of many other code comparisons, it'll match on code that was commonly duplicated from open source or various sorts of PD or free sources. Consequently, the degree of similarity between the two trees will not be an indication of the extent of any copyright infringement.

    Circumvent copy protection-- refuse to buy media that use it.

  126. Someone Did this in June. by Popsikle · · Score: 2, Interesting

    I can't dig up the slashdot post, but here is The Inquirer article from Jun 18th. Someone did this well before esr did.

    It's not new, it's not esr's idea, it's almost 3 months old!!!

    1. Re:Someone Did this in June. by Glenn+R-P · · Score: 1

      I can't dig up the slashdot post, but here is The Inquirer article from Jun 18th. Someone did this well before esr did.

      It's not new, it's not esr's idea, it's almost 3 months old!!!


      Not new, 3 months old, yes. But how can you conclude that the anonymous
      correspondent to the Inquirer is not ESR? The correspondent describes
      ESR's algorithm in detail. The only significant difference is in June he was using 5-line shreds instead of 3-line shreds.

  127. Help! I'm an SCO news addict! by Anonymous Coward · · Score: 0

    I nearly blew Mountain Dew through my nose on that one. That's just crazy! SCO has committed the most vile of sin.

    This Comment was generated with the Comment-O-Matic for SCO Stories.

  128. better utility by mcbridematt · · Score: 1

    I thought we already had one:

    $ diff -u

  129. Comparison algorithms? by Ivan+the+Terrible · · Score: 2, Interesting

    I'm interested in algorithms that could be used to compare code. Moss and CAP from Berkeley are not interesting because the algorithm is secret (AFAIK).

    What algorithms other than ESR's comparator are there? (I recall but can't locate a recent comment on Slashdot that said something like "most plagiarism detection programs used by professors use the XXXX algorithm".)

  130. Missing the point? by Robb · · Score: 1
    I suspect the amount of code in Linux where a person knowingly plagiarized sources to which they did not have a legal right and tried to cover this theft up is vanishingly small.

    It is far more likely that people were not thinking, made a mistake or actually thought they had the right to include the code in which case they would not have tried to cover it up. I believe that ESR's goal is to identify these cases so they can be looked into.

  131. sounds mickey mouse by Anonymous Coward · · Score: 0

    e s r s c o m o u s e

    who cares dude linix is fscking unstoppabull. it offers a caring community tripping all over itself to cater to the end user. oh, wait.. i think i just decribed the windows community.

  132. Re:IT WILL NOT WORK! Here's technical reason why by miniver · · Score: 3, Informative

    Download & read the source. Or just read the documentation.

    Comparator has the capability (-w) to ignore whitespace while generating the hash, while at the same time tracking the actual line numbers for purposes of merging and reporting. In my experience, most code-copiers are dumb and/or lazy -- to get past ESR's tool, the code-copier would have to (a) realize that they're violating a license, (b) not care, (c) be smart enough to realize that a pure cut-and-paste might get caught, and (d) energetic enough to munge up the code logic and variables. While I'm sure there are people like that, I would argue that most of them wouldn't be interested in contributing the result to the community, and the code wouldn't get past Linus if they did. The more logical case is some one/company who believe that they have a legitimate right to copy code from one kernel to another (BSD -> Linux / Linux -> SysV / SysV -> Linux) and thus not feeling the need to cover things up. Either of the SCO User Group examples would fit this category.

    --
    We call it art because we have names for the things we understand.
  133. Doesn't work by TekPolitik · · Score: 1

    If it encounters a file with an unknown extension that has a blank first line, it gets a divide by zero (line 64 of shredtree.c).

  134. Algorithmic comparison by Anonymous Coward · · Score: 0

    I think it would be a lot more interesting with a program that could make algorithmic comparisons. Say, a program specialized in analysing e.g. C-code, prescinding naming, commenting, spacing and such from the actual algorithm.

    That could allow finding even obfuscated similarities. (?)

  135. Re:The real question is: by jd142 · · Score: 1

    So if a reasonable person would think that it is an obfuscation of the original

    Not quite. It isn't that a reasonable person has to think it is an obfuscation. I believe the phrase you are looking for is "preponderance of the evidence", but I am not a lawyer. What they'll have to do is show that the preponderance of the evidence indicates that the code in Linux cannot legally be there. The judge will tell the jury what the law is and the jury will determine what the facts are.

  136. Dont rely on this... by BlackSabbath · · Score: 1

    Even if SCO allowed an "independent" party to hash their source with this tool, they could still present an "impure" source tree that has been deliberately peppered with code taken out of linux, specifically for the purposes of getting a match.

    The bottom line is they need to front-up the code.
    End of story.

  137. Not possible by mbessey · · Score: 1

    It's just not possible to make a (small) set of MD5 hashes that represent all the "useful ways to structure a C statement". Even with extensive pre-processing, there's just way too many different ways to express the same algorithm.

    Because of the nature of MD5 as a cryptographic hash, the value of the hash gives you almost no useful information at all about the structure of the code.

    For example, given 3 variables (a,b, and c) and just the basic arithmetic operations of +, -, * and /, you could generate (literally) an infinite number of arithmetic expressions. Not all of those would be "useful", but there's no way to even enumerate the possible "reasonable" arithmetical expressions, much less calculate an MD5 sum for any combination of three of them.

    -Mark

  138. What a weird tool by p3d0 · · Score: 1
    By the time you MD5-hash the line triplets, and then compare the hashes, why don't you just compare the lines in the first place? Seems like it's cheaper and simpler.

    If all you want to do is keep the source secret, then a utility to spit out MD5 hashes of each line triple would be sufficient. Then pipe that into "sort | uniq -d" to find duplicate lines. You can even use uniq's "-w" switch to allow you to append line number information to the hashes. Voila, a 1-line shell script that duplicates most of ESR's tool:

    find -name '*.[ch]' -exec codehasher {} \; | sort | uniq -d -w32

    Why is ESR's super tool better than this?

    --
    Patrick Doyle
    I mod down every jackass who puts his moderation policy in his sig. Oh, wait a sec....
    1. Re:What a weird tool by dazk · · Score: 2, Informative

      Eric's tool lets you compare larger and smaller chunks. Single lines will easily match very often, so single-line matches are not the problem. The problem always lies in a sequence of lines. That's why you need overlapping sequences.

  139. Re: here's a workaround by ubiquitin · · Score: 1

    Only do the shredding on documents which have been run through a script which converts all characters to uppercase and removes whitespace in a systematic fashion: no tabs, anything beyond three spaces becomes one space etc. Wouldn't that resolve most of the objections you raise? Fuzzy hashes, mentioned elsewhere in threads on this slashdot posting, could also be useful.

    --
    http://tinyurl.com/4ny52
  140. Slightly less lazy by jtheory · · Score: 1

    It hashes the code in 3 line chunks, and the unique hashes from the shreds of both source trees are thrown out. Simple. And yes, whitespace can be ignored.

    I still have a question about how the 3-line thing works, though (I read the article, but not the source or docs): if the source files are exactly the same, but the Linux version has an extra single line comment on top, won't all of the 3-line chunks come out as unique because they're out of phase?

    Maybe it has some logic to restart the 3-line pattern after any double linebreak, or some pattern detection (i.e., restart at comment start/end) to ameliorate this.

    Anyway, I do have to say: this is by far the MOST EXCITING news I've heard yet in this whole mess. What reason can SCO possibly give to refuse providing the shreds of their code? Either they provide the shred results and their lying is exposed, or they refuse, and their lying is exposed.

    --
    There are only 10 types of people: those who understand decimal, those who don't, and, uh, 8 other types I forget.
    1. Re:Slightly less lazy by Chmarr · · Score: 1

      Once the code spits out the md5sums, all matching md5sums are compared, regardless of which line they're on (or started on, since it's 3-line chunks).

      So, no, an extra line, 10, or 1000 won't throw it off.

    2. Re:Slightly less lazy by dspeyer · · Score: 2, Funny
      The hashes would reveal the code in question, because they could be compared to the Linux hashes (appropriate version) and then the relevant lines read out of the Linux code. Those lines will now be revealed to have passed SCO's elaborate QA procedures [trying to keep straight face] and this valuable trade secret will be breached. After all, no one would want to use grubby hobbyist-written code [keeping straight face is getting harder] but if they knew it was good enough for SCO, they'd all jump on it, and SCO's unique virtues would be spread to all their competitors!

      It makes as much sense as anything SCO's said.

      Hey, I just realized -- I'm typing this. I don't have to keep a straight face!

    3. Re:Slightly less lazy by Anomylous+Howard · · Score: 1

      This is why the three-line chunks are overlapped.
      It eliminates phase errors.
      Shred1 = md5hash(lines(1,2,3))
      Shred2 = md5hash(lines(2,3,4))
      Shred3 = md5hash(lines(3,4,5))

      Of the three shreds that contain line 3, you would throw out the two out-of phase chunks.
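
      The phase argument above can be checked with a few lines of Python. This is a sketch rather than comparator's implementation; `overlapping` and `disjoint` are hypothetical helper names, and the "source file" is nine made-up statements with one comment line inserted at the top of the second copy.

```python
import hashlib


def digest(lines):
    """MD5 of a group of lines joined with newlines."""
    return hashlib.md5("\n".join(lines).encode("utf-8")).hexdigest()


def overlapping(lines, size=3):
    """Shreds starting at every line: (1,2,3), (2,3,4), (3,4,5), ..."""
    return {digest(lines[i:i + size]) for i in range(len(lines) - size + 1)}


def disjoint(lines, size=3):
    """Shreds starting at every size-th line: (1,2,3), (4,5,6), ..."""
    return {digest(lines[i:i + size]) for i in range(0, len(lines) - size + 1, size)}


original = [f"stmt_{n};" for n in range(9)]
shifted = ["/* one extra comment line */"] + original

# Overlapping shreds survive the one-line shift; disjoint chunks fall out of phase.
print(len(overlapping(original) & overlapping(shifted)))
print(len(disjoint(original) & disjoint(shifted)))
```

      With overlapping shreds every three-line window of the original still appears somewhere in the shifted copy, which is why the position of a match does not matter.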

  141. Cheers Eric by ralphclark · · Score: 1

    Heh. This puts the ball squarely back in SCO's court. Darl McBride's whinging "open letter" demanded that the Open Source community had to fix their development process in order to monitor for, and prevent, leaks of "intellectual property". Well, this tool seems to do just that...

    It's obvious that Darl was no doubt expecting his challenge about "fixing the process" to throw the Open Source community into a panic of bickering about what to do next which would last for weeks, triggering discussions about potential liability (which the press would no doubt interpret as an admission of guilt), generally make us look bad and thereby strengthen his case in the sadly all-too-relevant court of public opinion.

    So clearly he wasn't expecting Eric to solve his problem later THE VERY SAME DAY!

    Jeez, you couldn't make this stuff up. Ha ha bloody ha ... in your face, McBride!

  142. I've seen this before from ESR... by Dr.+Smeegee · · Score: 2, Informative

    He developed a Callcenter Training Utility for our company in the early 80's. It used genetic algorithms to generate simulated customer complaints that were _very_ realistic, even to the point of using sample voices to "whine". Of course, the helpdesk trainees hated it...

    But hey, the mewling was featureful.

  143. Re: Genius -- umm try using it by fw3 · · Score: 1
    Compiled with gcc 3.2.3 it throws an FPE on actual source trees.

    Compiled with gcc 3.3+ it will actually run the analysis but it won't generate the output files if used against something larger than the 20 source files he includes in the tarball.

    I wrote this in perl + shell commands a few months ago, haven't optimized it and it's in the same range of fast as 'comparator' and actually works. -- of course software that just works isn't anywhere near as exciting, or press-release worthy.

    ho hum

    --
    Linux is Linux, if One need clarify their dist: <Dist>/GNU Linux
    bsds are of course just BSD
  144. Multi-purpose Tool by Uzziel · · Score: 1
    Bravo, ESR. comparator is a perfect tool to cut SCO's legs out from under it. But it will also be very useful to people like:
    • Professors
    • Publishers
    • Newspaper Editors
    • Librarians
    • Everyday Coders
    • Statisticians
    • Lawmakers
    Anyone who's got a vested interest in knowing whether or not they are looking at an original work can benefit from this tool, or derivative works of it. With a bit of front-end processing, this can help professors and editors spot plagiarism, librarians spot duplication in their collections, and coders spot areas of redundancy. Thanks, Mr. Raymond. I'll be compiling this tonight...
  145. Re:I love geek stereotypes as much as the next guy by Anonymous Coward · · Score: 0

    Your request has been reviewed, and has been denied. We are truly sorry. The rest of you may now resume.

  146. Nobody has mentioned this yet ... by Mostly+a+lurker · · Score: 3, Interesting
    As currently designed, Shred would obviously not defeat deliberate source misappropriation. If (big if) the method could be adapted such that it could not be easily fooled by a determined violator (and without revealing how the code works) then I believe registration of the results should be required by law. BUT ...

    In order that the method should not be fooled by simple changes, at least the following is required

    * White space must be ignored

    * Comparison must be at the statement level, not the code line level

    * Variable names must be replaced by standard placeholders

    * Routine names, other than standard library calls, must be replaced by standard placeholders

    * (Probably difficult) logic will be needed in the tool to detect and ignore noops: how do you deal with

    i++;
    %include noop.i;
    a[i]=b[i];

    The trouble is: a high proportion of the code sections thus simplified will fall into a relatively small number of possibilities, vulnerable to dictionary type attacks. Thus, most of the code could be reconstructed, though admittedly as obfuscated source code. IMHO this provides a valid objection to its use.
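    The first four requirements above can be sketched roughly as follows. This is a minimal illustration, not how comparator works: the keyword list, the `ID` placeholder, and the function names are all made up for the example, and no attempt is made at noop detection or at sparing standard library calls.

```python
import hashlib
import re

# Hypothetical, deliberately incomplete keyword list; a real tool would
# need the full keyword set of the language being compared.
C_KEYWORDS = {"if", "else", "for", "while", "return", "int", "char", "void"}

def normalize(statement: str) -> str:
    """Replace every non-keyword identifier with ID, then drop all whitespace."""
    def swap(match):
        word = match.group(0)
        return word if word in C_KEYWORDS else "ID"
    replaced = re.sub(r"\b[A-Za-z_]\w*", swap, statement)
    return re.sub(r"\s+", "", replaced)

def statement_hash(statement: str) -> str:
    """MD5 of the normalized statement (one statement, not one source line)."""
    return hashlib.md5(normalize(statement).encode("utf-8")).hexdigest()

# Two differently named, differently spaced copies hash identically.
print(normalize("a[i] = b[i];"))
print(statement_hash("a[i] = b[i];") == statement_hash("dst[ n ]=src[n];"))
```

    Note that `i++;` and `j++;` also collapse to the same normalized text, which is exactly the small-number-of-possibilities concern raised above.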

    1. Re:Nobody has mentioned this yet ... by B.D.Mills · · Score: 1

      In order that the method should not be fooled by simple changes

      How about doing the comparison on the binaries? That would ignore whitespace, would compare at the statement level, and would replace variable names and routine names by standard placeholders.

      --

      The only thing necessary for the triumph of evil is for good men to do nothing. - Edmund Burke
    2. Re:Nobody has mentioned this yet ... by Mostly+a+lurker · · Score: 1
      How about doing the comparison on the binaries?

      The consensus seems to be that this will not work for most languages (including C) because of how modern compilers work. You cannot work back from object code to source code.

  147. Shredding will reveal SCO IP by vt0asta · · Score: 1
    Chances are slim to none that a software company would allow its "shredded" source code to be publicly released.


    You are right that SCO will not allow its shredded source code to be publicly released. However, the reason would be that the MD5 hashes that matched could be traced back to specific lines in the kernel. I don't know why people don't get it...

    SCO doesn't want the common code published or known until the court date.

    I know it's irrational, I know it's silly. However, everyone expecting a "rational" response from an irrational company is foolish. I guess hope springs eternal, on both sides. SCO has bet the farm on this strategy, and they are not about to let the cat out of the bag. ESR seems to think they are just going to submit to this harebrained scheme, and produce a bunch of MD5 checksums.

    If you take their point of view, this does nothing to protect their IP, it's just a thinly veiled way of tricking them into revealing the code they believe is in question.

    --
    No.
  148. OMG! by csoto · · Score: 0
    I just ran this utility and nearly 100% of the code on my system is copied! They all start with
    main ();
    !!!!!!
    --
    There exists no way of exchanging information without making judgments. --Bene Gesserit Axiom
  149. No Trade Secrets in Registered Copyrights by Iparadox · · Score: 2, Interesting

    I think we are all missing a big point here. SCO registered their copyright in SysV. It was hard to do. They had to create a copy of the source code and file it with the Patent and Trademark Office. That puppy is there so that *we* can look at it. This is specifically *fair use*. It is there so that individuals can protect themselves by comparing what they have to what has been registered. No match means no problem. It just doesn't get any more *fair use* than that. Just have somebody nip up to the PTO, copy the registration for comparison purposes only, (really!) then do the comparisons. How hard is that? Yes, IAAL, but this is not legal advice. Hire your own mouthpiece.

    1. Re:No Trade Secrets in Registered Copyrights by Error27 · · Score: 1

      Actually SCO didn't register their copyright in SysV. That was a mischaracterization on the part of SCO executives.

      If I recall correctly they actually registered around 18 pages of changes they had made more recently.

  150. Doesn't even compile by tvm662 · · Score: 3, Informative

    Has anyone else tried to compile Eric's code?

    >gcc --version
    2.95.3

    >make
    /usr/bin/gcc -c -g main.c
    main.c: In function `report_time':
    main.c:311: parse error before `int'
    main.c:312: parse error before `int'
    main.c:316: `buf' undeclared (first use in this function)
    main.c:316: (Each undeclared identifier is reported only once
    main.c:316: for each function it appears in.)
    main.c:317: `minutes' undeclared (first use in this function)
    main.c:317: `seconds' undeclared (first use in this function)
    make: *** [main.o] Error 1

    Looks like Eric has been coding too much C++ or something. I'm not a C coder myself, so I might be wrong, but don't you have to declare all the variables at the top of a block of C code before using them? In report_time, he doesn't seem to have followed that rule. Maybe he should check his code on a number of compilers before declaring he has "perfected it".

    Eric here's my patch:

    --- main.c 2003-09-10 00:28:37.000000000 -0300
    +++ main.c.fixed 2003-09-10 00:29:55.000000000 -0300
    @@ -306,12 +306,17 @@

    if (mark_time)
    {
    - int elapsed = endtime - mark_time;
    - int hours = elapsed/3600; elapsed %= 3600;
    - int minutes = elapsed/60; elapsed %= 60;
    - int seconds = elapsed;
    + int elapsed;
    + int hours;
    + int minutes;
    + int seconds;
    char buf[BUFSIZ];

    + elapsed = endtime - mark_time;
    + hours = elapsed/3600; elapsed %= 3600;
    + minutes = elapsed/60; elapsed %= 60;
    + seconds = elapsed;
    +
    va_start(ap, legend);
    vsprintf(buf, legend, ap);
    fprintf(stderr, "%% %s: %dh %dm %ds\n", buf, hours, minutes, seconds);

    1. Re:Doesn't even compile by o'reor · · Score: 1

      Damn, I'm sure Eric didn't post the source himself. He must have entrusted this to the same guy who produced SCO's slideshow. He should be more picky about his collaborators.

      --
      In Soviet Russia, our new overlords are belong to all your base.
  151. Useful tool by Julian+Morrison · · Score: 2, Insightful

    I can see this tool becoming helpful for so much more than smashing SCO. Any situation where data comparison is useful, but the data itself must remain secret. All paranoid types (corporate or governmental) will love it. Lawyers could make much use of it.

    And, given the dataset it generates, it could be extended to do other useful things such as detect redundant or cut-'n-pasted code, including bugs of the "pasted it in twice" sort.

  152. There wouldn't be... by leonbrooks · · Score: 1

    ...a shred of evidence that you'd done the comparison - - unless MS were proven guilty and the Shared Source licensee spoke up about it, in which case the doctrine of dirty hands would protect the licensee.

    Microsoft couldn't successfully prosecute the licensee because they broke the law themselves with the item in question. OTOH, while that might stand in the way of the licensee themselves prosecuting Microsoft, others could then proceed themselves, sure in the knowledge that when discovery time came they'd be laughing. It'd even eclipse TSG's circus for a few days, maybe TSG's stock would tank because of that. (-:

    There'd also be nothing to stop the licensee protecting their Shared Source access and avoiding offending Microsoft by shredding the source themselves and anonymously publishing it. Then anyone could do the comparison and point the finger. Any takers?

    --
    Got time? Spend some of it coding or testing
  153. Jon Katz, dead at 46 by Anonymous Coward · · Score: 0

    I don't know how many of you have heard this already, but Jon Katz, Internet journalist and bon vivant known and loved by millions of Slashdotters, was found dead this morning by sanitation engineers. Speculation as to the means of his demise hinges on the broomstick found near the body.

    He will be sorely missed. Truly an American icon.

  154. It gets even better by isn't+my+name · · Score: 1

    Now, you will remember that ESR last week all but threatened SCO with something it wouldn't like if it didn't start playing fair.

    What if someone with legitimate access to SCO source code, shredded it and gave ESR the MD5's? Since there is no way to determine what the original lines are from the MD5's that would likely not be violating any NDA. Then, what if ESR compared that to Linux and other GPL'd code sources and only published those instances where it appears that SCO has stolen code.

    That would probably result in some new discovery processes, even if IBM isn't already there.

    Could it perhaps result in a method for other big software vendors to have their source code examined for illegal takings without them having to reveal any source code? And if so, is it possible that some of SCO's big backers might be more reluctant to let SCO keep this up? I can't imagine that MS would want to create a court approved method to compare its code to those that it might have stolen from in a way that doesn't give MS the cover of not wanting to reveal its code in public.
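    A digest-only comparison of this kind can be sketched as below. It is an illustration under assumptions, not comparator's actual report format: the function names and the toy "trees" are invented, and each shred is simply three consecutive lines joined with newlines.

```python
import hashlib

def shred_digests(lines, size=3):
    """Map each MD5 digest of `size` consecutive lines to its 1-based start lines."""
    digests = {}
    for start in range(len(lines) - size + 1):
        chunk = "\n".join(lines[start:start + size]).encode("utf-8")
        digests.setdefault(hashlib.md5(chunk).hexdigest(), []).append(start + 1)
    return digests

def matches_against_published(published, own_lines, size=3):
    """Start lines in our own file whose shred digest appears in the published set."""
    own = shred_digests(own_lines, size)
    return sorted(line for digest, starts in own.items()
                  if digest in published for line in starts)

# Party A shreds its private tree and hands over only the digests.
secret_tree = ["int x;", "x = 1;", "foo(x);", "bar();"]
published = set(shred_digests(secret_tree))

# Party B compares its own readable source against those digests.
open_tree = ["/* open */", "x = 1;", "foo(x);", "bar();", "baz();"]
print(matches_against_published(published, open_tree))
```

    The party holding the readable tree learns which of its own lines matched, but never receives the other side's text.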

    1. Re:It gets even better by stanwirth · · Score: 1

      I can't imagine that MS would want to create a court approved method to compare its code to those that it might have stolen from in a way that doesn't give MS the cover of not wanting to reveal its code in public

      OOOH, now that's interesting. I wonder how much of MSDOS was ripped-off AT&T XENIX and IBM's QDOS. I wonder how much of Win3.1, Win95 was inherited from ripped-off AT&T XENIX and IBM QDOS. I wonder how much of the ftp, tracert and telnet apps on MSDOS/Windows is line-for-line copied from 4.2 BSD. I wonder how much of the Windows GUI was "borrowed" from X.

      Shred hard. Shred fast. Shred MICROSOFT.

    2. Re:It gets even better by platypus · · Score: 1

      There was a message yesterday about the new subpoena from IBM (see groklaw if you haven't read it yet).

      24 ugly points concerning documents which IBM wants to be handed over from Canopy.

      Can you imagine another subpoena from IBM with 1340243 points like

      1. All documents concerning any statements, declaration, affidavit, analysis, assessment, or opinion relating to plaintiff's rights to lines 23-25 of /src/drivers/net/somedriver.c

      2. All documents concerning any statements, declaration, affidavit, analysis, assessment, or opinion relating to plaintiff's rights to lines 47-49 of /src/drivers/net/somedriver.c

      3. All documents concerning any statements, declaration, affidavit, analysis, assessment, or opinion relating to plaintiff's rights to lines 113-115 of /src/drivers/ide/anotherdriver.c

      [...]

    3. Re:It gets even better by maharg · · Score: 1

      see my sig

      --

      $ strings FTP.EXE | grep Copyright
      @(#) Copyright (c) 1983 The Regents of the University of California.
    4. Re:It gets even better by isn't+my+name · · Score: 1

      Actually, wasn't MS-DOS based on the purchased QDOS, so seems to me that they can use anything they want from there. Also, MS can legally grab anything released under a BSD license. I think the more interesting possibility is that some of their embracing and extending may have started with some borrowing. Perhaps not, but I still can't see MS being happy at a court approved way to examine and compare their source code.

    5. Re:It gets even better by Anonymous Coward · · Score: 0

      MS-DOS is a red herring.

      IBM helped write MS-DOS for the PC. They had to make DOS work on their brand new PC, after all. For a long time, MS-DOS and IBM's version, PC-DOS (which IBM still updates and sells, BTW), were nearly identical.

      In fact, DOS 2.0 was developed almost entirely by IBM. IBM did this to support hard drives for the PC XT. DOS 2.0 added support for partitions, directories, and CONFIG.SYS and some other important stuff that I don't remember off the top of my head. In other words, most of the things that most people think of as being "DOS".

    6. Re:It gets even better by palfreman · · Score: 1
      As long as they retain the copyright message etc. they are allowed to use BSD and X source code - in fact, they are encouraged to. They had a licence from AT&T for Xenix, and they bought the source code outright for QDOS. So the only significant thing is whether Microsoft have incorporated GPLed code into proprietary products. If they have, they would have to remove the code and/or pay compensation, or licence it separately & non-GPL from the copyright owners.

      While Microsoft were (at least up until now) happy to see SCO try and trash Linux, I personally doubt they are really at risk of significant breaches of the GPL. One reason for saying this is that there are past examples of Microsoft releasing GPLed code from their experimental labs, in situations where what they wrote was based on GPLed code - there is GPL download stuff on the MS Services for Unix page too.

      Not that I've ever used it - no point when better real Unixes are free :-). But the point is, they are making sure they are GPL compliant here, and I've no reason to doubt they do that everywhere, sticking to BSD, X and Apache/MIT licenced code for their proprietary products.

    7. Re:It gets even better by stanwirth · · Score: 1

      As long as they retain the copyright message etc. they are allowed to use BSD and X source code - in fact, they are encouraged to. They had a licence from AT&T for Xenix, and they bought the source code outright for QDOS. So the only significant thing is whether Microsoft have incorporated GPLed code into proprietary products

      Is the legality of Microsoft's merely having ripped-off and marketed code developed by IBM, AT&T and UCB really the point?

      Or is the point that it might be extremely interesting to know what percentage of MSDOS was their work, and what percentage was simply cut-and-pasted verbatim from orphaned products they got at bargain-basement prices when they couldn't just download it and redistribute it (adding a single line of (c) in the binary )? A significant portion of a "cut-paste-hack it until it sorta works-release" OS development would certainly explain a great deal about the resulting security holes, memory bugs, and just plain badness of Microsoft's releases. The suspicion is, of course, that Microsoft keeps their source code under wraps, not because it's so good that others might steal it, but because it's so badly hacked, and so obviously a cut-and-paste job that it will completely discredit the Microsoft developers.

      We know that the Linux kernel, by contrast, was developed quite explicitly and demonstrably by a combination of "design-implement-test-modify-test" and "study-understand-reimplement from scratch" development methodology, not "cut-paste-hack". We know this because we can read the LKML and see the level of discussion, and see how this follows the succession of changes to each part of the code, and see the alacrity and level of intelligence with which bug reports are attended in Linux. In the beginning, Linux was originally a complete re-write of Minix, not a badly hacked port. Likewise, X was a complete re-implementation of the windowing systems developed at Xerox and Sun Microsystems, not a bad cut-and-paste job. But how much of the Win3.11, Win95 and MFC code was a bad cut-and-paste job from X? How much of MSDOS was a bad cut-and-paste job from QDOS, XENIX and BSD?

      Inquiring minds may want to know how extensive Microsoft's rip-off has been, whether or not the rip-off was "perfectly legal."

      Furthermore, it would be extremely interesting to see what percentage of Microsoft's OS's are derived by bad cut-paste-hack, versus the percentage of code Linux legitimately has in common with BSD and early AT&T Unices. Why is this interesting? Because perhaps the most destructive aspect of SCO's claims against Linux is that it creates the impression that it's all "borrowed code." Whereas, I would wager that the percentage of *borrowed* code in MSDOS, Win3.11, Win95 and WinXP is far higher.

      And not just the percentage-of-lines of code would tell you something, but also would the size of clusters of common code. For example, large blocks of common code would generally indicate cut-and-paste, whereas a function here and a function there, each with a line here and a line there different -- would indicate that it had been partly borrowed, but extensively modified. Identical function names, but with completely different code inside them indicates a complete re-implementation of an API, probably for backward compatibility's sake. And so on and so forth.
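      Measuring cluster size could look something like the sketch below. It assumes, hypothetically, that an earlier step has already produced the start lines of matching 3-line shreds in one file; the function name and sample numbers are invented for illustration.

```python
def cluster_matches(starts, size=3):
    """Merge overlapping or adjacent matching shreds into (first_line, last_line) ranges."""
    ranges = []
    for start in sorted(starts):
        end = start + size - 1
        if ranges and start <= ranges[-1][1] + 1:
            # This shred touches the previous range, so extend that range.
            ranges[-1][1] = max(ranges[-1][1], end)
        else:
            ranges.append([start, end])
    return [tuple(r) for r in ranges]

# Made-up start lines of matching shreds within a single file.
matching_starts = [10, 11, 12, 13, 40, 90, 91]
for first, last in cluster_matches(matching_starts):
    print(f"lines {first}-{last}: {last - first + 1} lines in common")
```

      A few long ranges would point toward wholesale copying, while many isolated three-line ranges would look more like coincidence or heavy modification.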

      A very public shred of MSDOS/Xenix/QDOS/BSD compared to a shred of Linux/BSD/AT&T Unix would not only be extremely enlightening, but, from a Public Relations point of view (as well as an intellectual integrity point of view) essentially highlight the fact that MS code was never a shiny new car, but just an old rustbucket gotten out of the junkyard, and given a new coat of paint. Whereas Linux really is a shiny new car, with many parts modelled after, and improved over the best classic designs.

      Of course, you'd expect people to be able to figure

    8. Re:It gets even better by palfreman · · Score: 1
      Is the legality of Microsoft's merely having ripped-off and marketed code developed by IBM, AT&T and UCB really the point?

      Microsoft didn't "rip off" BSD/X/Apache/MIT code, or for that matter QDOS, and probably not AT&T. They used them in compliance with the terms they were licenced under. For BSD style code, the reason it is licenced in that way is to allow other companies to make commercial products out of it. That is the whole point. How else is it that so many OSes managed to get TCP/IP support, and get it working well? Because it was possible to freely use the BSD code for it. (Linux is unusual in having its own implementation, a good thing for tcp/ip biodiversity.) Using the term "ripped off" here is quite wrong and totally misleading.

      The suspicion is, of course, that Microsoft keeps their source code under wraps, not because it's so good that others might steal it, but because it's so badly hacked, and so obviously a cut-and-paste job that it will completely discredit the Microsoft developers.

      I don't know the state of their source code (but rumour has it that it's crap, true), but I do know the overwhelming reason for keeping it under wraps is because it is fully commercial code that they plan to make money on, and they don't want it in the public domain. If you support copyright (I don't, BTW), as GPL enthusiasts explicitly do for the GPL to work, then I think logically you have to respect Microsoft's right to copyright too, within proper fair use for their customers.

    9. Re:It gets even better by stanwirth · · Score: 1

      Microsoft didn't "rip off" BSD/X/Apache/MIT code, or for that matter QDOS, and probably not AT&T. They used them in compliance with the terms they were licenced under. For BSD style code, the reason it is licenced in that way is to allow other companies to make commercial products out of it. That is the whole point. How else is it that so many OSes managed to get TCP/IP support, and get it working well? Because it was possible to freely use the BSD code for it. (Linux is unusual in having its own implementation, a good thing for tcp/ip biodiversity.) Using the term "ripped off" here is quite wrong and totally misleading.

      Certainly Microsoft may have had a legalistic right to commercialise code that others had written. But the Microsoft customer who thinks that what they're paying for is "trusted" Microsoft code has a right to know that this is not a shiny new car -- it is an old rustbucket with a flash new paint job, and Bill Gates is no better than a used car salesman.

      Furthermore, the customer has a right to know that the reason that, for example, the "ping of death" bug was carried over from old BSD code to the first releases of Microsoft's TCP/IP stack was that they simply got it from somewhere else, and didn't bother to read, study and understand the code before cutting-and-pasting it (I won't even dignify what they did with the word "reimplement" here).

      Microsoft's basic misunderstanding of the importance of using consistent data types (in this case it was a bad mix of signed ints and unsigned ints -- the same reason the memory limitations on MS servers are so screwy) is the kind of thing that wouldn't be tolerated for five minutes in a homework problem turned in by a first-year CS student. Because Microsoft ripped-off this code rather than study, understand, reimplement and test this code -- and furthermore, didn't even keep up, intellectually, with their colleagues at UCB by contributing their own observations based on their own study of their understanding and reimplementation of the BSD code base, the ping-of-death bug persisted in Microsoft products for years after it had been fixed in BSD.

      By contrast, other implementations, which were done from scratch, on their own dime, not big DARPA and IBM grants -- Linux, for example -- did not have this bug. Why? Because they openly studied, discussed, understood, and then re-implemented a whole new TCP/IP stack. What you allude to in passing is actually the main point.

      This gives us a score of:
      Open Source Development :2 -- Microsoft Proprietary Development :0

      Despite having been technically out-performed by the open source community while trying to commercialise a bad cut-and-paste job (i.e. rip-off) of the Open Source Community's earlier BSD releases (the only ones Microsoft could get its hands on legalistically after RMS' brilliant GPL -- and the widespread adoption of it), Bill Gates, rather than thanking the open source community for providing him with the IP that has made him bazillions of dollars, has the nerve to turn around and characterize the very people whose code he is using as a bunch of spotty teenagers operating out of mom's basement. When we engage in the very activity that MS ripped off in the first place, he accuses us of being the rip-off artists. His rationale is, apparently, that we don't have to pay Microsoft for the code we've developed and shared amongst ourselves -- and Microsoft can't commercialise it, either. ( heh heh heh...thank YOU, FSF for the GPL!)

      I don't know the state of their source code (but rumour has it that it's crap, true), but I do know the overwhelming reason for keeping it under wraps is because it is fully commercial code that they plan to make money on, and they don't want it in t

  155. Better yet, a reason to get MS to stop funding SCO by isn't+my+name · · Score: 2, Interesting

    Actually, combine this with the "shared source" program from MS and it would be easy to see if MS did (or did not) copy GPL code into Windows as some suggest.

    More importantly, get something like this accepted in a court of law as a legitimate way to do an initial assessment of code yet still preserve a litigant's right to code privacy, and you are going to have not just MS but a number of big companies shaking in their boots. Not necessarily because they did steal anything but because they have to realize it is a possibility that one of their coders did without company knowledge. Doesn't matter, they are still liable.

    But, get a method like this accepted in a court of law and you are going to see it used again. I think this has a huge potential to hurt closed software. And perhaps a potential to convince MS to stop funding SCO, perhaps even to apply pressure to get them to start backing down.

  156. 3-line problem by jtheory · · Score: 1

    No, I'm worried that the 3-line chunks could be out of phase.

    An MD5 sum of these 3 lines:
    one
    two
    three
    ...won't match with the MD5 sum of these 3 lines:
    two
    three
    four
    ...even though they share two lines. Now imagine the same thing for a few thousand more lines, identical except that one started on "one" and the other on "two". Every single MD5 will be unique, because you'll always have only two lines in common.
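    The worry can be reproduced with a small sketch. It assumes, purely for the sake of argument, that chunks are consecutive and non-overlapping; `fixed_chunks` and the sample lines are invented for the example.

```python
import hashlib

def fixed_chunks(lines, size=3):
    """MD5 digests of consecutive, NON-overlapping groups of `size` lines."""
    return {
        hashlib.md5("\n".join(lines[i:i + size]).encode("utf-8")).hexdigest()
        for i in range(0, len(lines) - size + 1, size)
    }

body = [f"line {n}" for n in range(1, 13)]   # twelve shared lines
tree_a = body
tree_b = ["extra header"] + body             # same code, shifted down by one line

common = fixed_chunks(tree_a) & fixed_chunks(tree_b)
print(len(common))  # no chunk lines up, so nothing matches
```

    With this chunking, a single inserted line at the top is enough to hide twelve identical lines.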

    --
    There are only 10 types of people: those who understand decimal, those who don't, and, uh, 8 other types I forget.
  157. I found out myself by jtheory · · Score: 2, Informative

    Okay, here it is (from the man page):

    comparator works by first chopping the specified trees into overlapping shreds (by default 3 lines long) and computing the MD5 hash of each shred.

    (Emphasis added)

    --
    There are only 10 types of people: those who understand decimal, those who don't, and, uh, 8 other types I forget.
    1. Re:I found out myself by Chmarr · · Score: 1

      To expand... on one file, md5sums for the following groups of lines are calculated:

      1,2,3
      2,3,4
      3,4,5
      4,5,6
      ... and so on.
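      The overlapping windows can be sketched like this. It follows the man page wording quoted above (3-line shreds, MD5 of each) but is only an illustration: the function name is made up, and joining lines with a newline is an assumption about how a shred is turned into bytes.

```python
import hashlib

def overlapping_shreds(lines, size=3):
    """Yield (start_line, md5_hex) for every window of `size` consecutive lines."""
    for i in range(len(lines) - size + 1):
        window = "\n".join(lines[i:i + size]).encode("utf-8")
        yield i + 1, hashlib.md5(window).hexdigest()

source = ["one", "two", "three", "four", "five", "six"]
for start, digest in overlapping_shreds(source):
    print(f"lines {start}-{start + 2}: {digest}")

# The same text starting one line later still shares every window it contains.
shifted = source[1:]
shared = ({d for _, d in overlapping_shreds(source)}
          & {d for _, d in overlapping_shreds(shifted)})
print(len(shared))
```

      Because a window starts at every line, an offset of one line between two files does not prevent the common windows from matching.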

  158. This should already have been done by pweitz · · Score: 1

    Since IBM has both versions of the code, and since much time has passed since this issue arose, it would seem reasonable that IBM has already performed an analysis like this.

    So where is the result?

    If it was to their benefit, couldn't they leak out the results, if nothing else?

    So maybe the result isn't favorable...

  159. Riiight... by Svartalf · · Score: 1

    IBM's contributions are IBM's to give in the first place. SCO's claiming some pretty twisted contractual rights are being violated by the act in question- namely IBM giving pieces of their IP to the Linux community under GPL. A thorough reading of the evidence that SCO provides on their own website invalidates that claim- i.e. that SCO, through its purchase of this and that, has a control right over whether IBM may or may not give away its IP.

    The simplest way out is to not listen to SCO in the first place and wait and see what comes of ALL of this- the Red Hat filing and the IBM one.

    On the 15th, SCO HAS to respond, come up with a fairly compelling reason for the court to allow another delay, or face a summary judgement. If they don't come up with something to counter Red Hat properly, they face a summary judgement.

    Later in the month, they have to answer IBM under a similar set of circumstances.

    Combine this with what we're all discussing. If ESR's little program works the way it appears to, then even though ESR is grandstanding, it would very easily hurt their position in the Red Hat filing.

    --
    I am not merely a "consumer" or a "taxpayer". I am a Citizen of the State of Texas
  160. Looks like "fair use" to me by Anonymous Coward · · Score: 2, Informative

    I don't know if the MD5 sums are a derivative work of the original source or not, but I would be inclined to think that they are.

    Let's look at what the law says about fair use

    Fair Use

    The four factors are: (1) the purpose and character of the use, including whether such use is of commercial nature or is for nonprofit educational use; (2) the nature of the copyrighted work; (3) amount and substantiality of the portion used in relation to the copyrighted work as a whole; and (4) the effect of the use upon the potential market for or value of the copyrighted work.

    It looks to me that under part (1), the MD5sums are a form of commentary or news reporting about the original work, not a replacement for the work. I don't know about (2). Under (3), the "amount" is definitely small, and the "substantiality" is low. And under (4), almost nobody who would buy the original work is going to substitute the MD5sum's instead, so the MD5sum's would have nil effect on the market for the original work.

    So in my AC-IANAL opinion, distribution of the MD5sum's would be protected under American copyright law as a "fair use".

  161. Poisoning the data base? by Anonymous Coward · · Score: 0

    What if SCO offers their shredded code, but "poisons" the MD5 sums with sums taken from actual Linux code? How could anyone verify they didn't?

    Even if the shredding is done by a 3rd party they could contaminate their "source" with Linux source before letting the 3rd party get to it. Same result.

    They would then claim this as proof of their case.

  162. Noisy! by Anonymous Coward · · Score: 0

    Interesting. I just ran this on some source trees on my disk. These were two related projects, but some of the matches were not the result of copying. Here's one three-line example (no, this isn't my coding style):

    }

    {

    One file was Java, the other C.

    I know you can vary the shred size to make the output less noisy but it shouldn't be difficult to find scenarios with 4, 5, or 6 lines; beyond that you risk missing small but valid similarities.
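    One possible way to cut this kind of noise without enlarging the shred size is to discard shreds that carry almost no content. The sketch below is a guess at such a filter, not a comparator option: the function name and the `min_chars` threshold are hypothetical.

```python
import re

def is_significant(shred_lines, min_chars=10):
    """Heuristic: keep a shred only if it contains enough letters or digits."""
    text = "".join(shred_lines)
    return len(re.findall(r"[A-Za-z0-9]", text)) >= min_chars

# The brace-only shred from above versus a shred with real code in it.
print(is_significant(["}", "", "{"]))
print(is_significant(["for (i = 0; i < n; i++)", "{", "total += a[i];"]))
```

    The threshold trades noise against sensitivity in much the same way the shred size does, so it would need tuning on real trees.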

  163. RTF(A|M|S) by CyberDruid · · Score: 1

    Article, Manual, Source

    --

    Opinions stated are mine and do not reflect those of the Illuminati

  164. Copyright, no patents by gonvaled · · Score: 1

    As you know, SCO is claiming copyright infringement, not patent infringement. They claim that code has been copied verbatim into Linux. This tool is very useful for deciding on these claims.

  165. Another little question that's bugging me... by Reteo+Varala · · Score: 1

    I hear all this debate about whether or not the code itself would md5 similarly, but here's one for y'all...

    Tell me, would the file trees lend themselves to comparison? Or do we plan on catting every single file into a monolithic codeball to do the comparison on? Not counting the asm files, no doubt.

    Would the System V file tree be identical to Linux's? Somehow, this I would doubt, extremely.

  166. Possible improvements by gonvaled · · Score: 2, Interesting

    A lot of comments focus on the problems that a global search and replace will pose to the technique. I think we can improve the algorithm by doing the following:

    What we are looking for here are pieces of code with the same structure: the same for loops, while loops, variable assignment, function names, and so on. The idea would be to substitute all literals by a standard placeholder, and then generate the md5 checksums on the block level (as somebody has previously suggested).

    To cheat this technique, a modification in the structure of the code is required. And in the case where exactly that has been done, it is arguable whether that can be considered copyright infringement.

  167. show the source SCO by SLOGEN · · Score: 1

    If it's already in the linux kernel, it's publicly available. The only thing you cannot see publicly is which parts are actually SCO's.

    For me to even consider whether SCO source is in the kernel, SCO should:

    1. point to which source they claim is infringing
    2. give reasonable evidence that they actually have the rights to that source.

    Should a judge decide that the source is SCO's, they will have good protection for that source in the future.

    --
    SLOGEN [ http://ungdomshus.nu : Sebastian cover music]
  168. You stole my watch! by SLOGEN · · Score: 1

    A: "You stole my watch"
    B: "No, I bought that"
    A: "No, you stole mine!"
    B: "Okay, show me something that identifies your watch"
    A: "Can't do that... it's a secret watch!"
    B: "uuuhhmmm, if you really think it's your watch, I already GOT the secret!"

    --
    SLOGEN [ http://ungdomshus.nu : Sebastian cover music]
  169. First match already found by hackerm · · Score: 1

    In other news, SCO reports that it has successfully used the Linux IP pirates' own code-comparing tools against them. A perfect match has been found between SCO code and Linux code. The offending code reads:


    #include <stdio.h>

  170. shred algorithm effective ? by dh5fbr · · Score: 1

    I am wondering how effective the described method is at finding any kind of plagiarism.

    Take some source code, copy it, change the variable names (as is done in some student projects), and run it through the shred algorithm. To my understanding, the MD5 hashes of two bit strings that differ in only one bit are not related at all (i.e. they cannot be shown to be closer than the hashes of two bit strings that differ in many bits). Hence the algorithm would not find the renamed source code if it compares three lines in one go.

    Wouldn't code snippets passed into new projects always require some sort of name adaptation?
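
    The point about MD5 is easy to check. The sketch below hashes a made-up three-line shred, then the same shred with a single input bit flipped, then the same shred with one variable renamed. The sample code and helper names are invented for illustration and have nothing to do with comparator itself.

```python
import hashlib


def md5_int(data):
    """Return the 128-bit MD5 digest of `data` as an integer."""
    return int.from_bytes(hashlib.md5(data).digest(), "big")


def hamming(a, b):
    """Count the bit positions in which two integers differ."""
    return bin(a ^ b).count("1")


shred = b"for (i = 0; i < n; i++)\n    sum += a[i];\nreturn sum;\n"

# Flip the lowest bit of the first byte: exactly one bit of input changes.
flipped = bytes([shred[0] ^ 0x01]) + shred[1:]

# Rename one variable, as a student copying the code might.
renamed = shred.replace(b"sum", b"tot")

input_bits_changed = hamming(int.from_bytes(shred, "big"),
                             int.from_bytes(flipped, "big"))
digest_bits_changed = hamming(md5_int(shred), md5_int(flipped))

print("input bits changed: ", input_bits_changed)
print("digest bits changed:", digest_bits_changed, "of 128")
print("renamed shred still matches:", md5_int(shred) == md5_int(renamed))
```

    With a well-mixing hash, roughly half of the 128 digest bits should change for any input change, large or small, so the digests say nothing about how close the inputs were. Any tolerance for renaming has to be applied to the text before hashing.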

  171. So, where's the data by CAPSLOCK2000 · · Score: 1

    This would have been a lot more interesting if he had actually used his tool to compare Linux to e.g. FreeBSD. This would test his software and prove its usefulness. It would give us a much better picture of what BSD code is in Linux, and be a great help in determining if SCO has any rights to any of it.

  172. You still don't get it .. by AftanGustur · · Score: 1


    If you've licensed code from Microsoft, and it turns out to be GPL, the license under which you got the code is invalid, so it wasn't illegal to determine if they improperly took code.

    You forgot the most important thing: MS has billions of dollars to burn to "prove" they are legally right. And you can bet your kids' future that they will spend such sums to protect their source code.

    How much do you have ?

    --
    echo '[q]sa[ln0=aln80~Psnlbx]16isb572CCB9AE9DB03273snlbxq' |dc
  173. Re:The real question is: by Eunuchswear · · Score: 1

    The sentiment is basically that, if I am in a software company, I have access to legal opinion (Can I use this code?)...

    And so how is it that the only credible example of stolen code in Linux came from someone working for SGI?
    --
    Watch this Heartland Institute video
  174. My spooky prediction of Darl's response: by Rogerborg · · Score: 1

    This is just more evidence of the Open Sores community's intention to launder* the code before Darl can prove how evil they are in a court of law. Buy SCOX before it goes through the roof!

    * "launder" is Darl's actual choice of term.

    --
    If you were blocking sigs, you wouldn't have to read this.
  175. Test SCO Linux Kernel Personality by LightSail · · Score: 2, Interesting

    The best use of this technology would be to test the SCO LKP for stolen Linux code.
    Confirming that SCO had incorporated Open Source code that they had access to under the GPL would destroy their credibility and open them up to countersuits. The process would only have to reveal enough similarities to have subpoenas ordered for the actual code involved. Then we could prove the theft with SCO's own source code.

    I suspect that those who know that Linux code was used to create LKP would come forward once the code has been discovered and posted for all to see.

  176. Scox sky-rockets on this news. by walterbyrd · · Score: 1

    Scox shares continue to sky-rocket in deference to news like this. Scox up about 11% just yesterday. Now scox is over $18 a share.

    Not bad considering that scox's core business is now *much* worse than a few months ago - when scox was under $2 a share.

    All of scox's recent profits come from msft fud money. Until msft started throwing gobs of money at scox (supposedly for a partial unix license) scox had never had a profitable quarter. In fact this company with a book value around $10 million was losing as much as $125 million in a year.

    Respected Wall-Street Analyst Jonathan Cohen has just predicted that scox will earn an astonishing $3 a share in 2004. I guess Cohen thinks that Bill Gates is feeling awfully generous.

    US justice system asleep at the switch. Scox insiders laughing all the way to the bank.

  177. Ain't this JUST like the Open Source Movement... by zenofjazz · · Score: 2, Funny

    Even a potential Lawsuit is just another reason to write grooooovy software.. *evil grin*

    GO ESR!!

    --
    -- All That's Evil in the Geek Space ... Allthatsevil.wordpress.com
  178. Stupid ESR by Anonymous Coward · · Score: 0

    ESR is still programming like a pig.
    His code includes Linux-only headers, and doesn't compile cleanly with -Wall: it produces a large number of fairly stupid warnings.

    It might be that corporations will be happy to give him source code shreds, if they can manage to compile and run his code...

  179. Idiocy... by poptones · · Score: 2, Insightful
    People like you who try to mush it all up are just trying to loot other people's property.

    Did your momma have any children that learned to think?

    Source code gets no copyright protection: corporations keep their source as a "trade secret" and only get protection on the executable. It is illegal to redistribute (copy) the executable, and the source is entirely within their control (and their responsibility). No real "furtherance of the arts" is accomplished except within the limited scope of usage of the tool itself. If a work is infringed at the source level, therefore, it is (nearly) impossible to prove without revealing "trade secrets" and, therefore, exposing the company to further risk.

    Source code gets copyright protection (as constitutionally mandated)

    Corporations have to register the source code, and therefore are given full protection on both works. It is just as illegal to redistribute (share) the source beyond the scope allowed by the rights holder, and if a work is infringed there is no risk to the rights holder in defending the work. "Furtherance of the arts" is addressed, as well as the rights of the work's creator.

    Corporations are allowed "copyright" on works they do not share.

    It becomes nearly impossible for libeled parties to defend themselves, but "rights holders" are free to make claims as they see fit. This gives "rights holders" basically free rein to make accusations which they may never be forced to address in court, and leaves victims nearly defenseless until the (very slow) court gets around to addressing the issue. Neither "furtherance of the arts" nor protection of (libeled) rights holders is served, since the more powerful party remains free to withhold (copyrighted) "evidence" that no one is allowed to see.

    How does this system serve rights holders whose works may have been infringed upon, but are forced from the marketplace by another "rights holder" with more money? How does that system serve the public interest? How does it promote progress?

    Can you answer any of these questions using sound logic?

    1. Re:Idiocy... by IM6100 · · Score: 1

      Companies can keep their source code undisclosed. It is still material that they have a copyright on.

      I can write a bad poem on a sheet of paper and I automatically have copyright on it.

      I, and companies that produce source code, can keep that source code, or my poem, secret, disclosing it to nobody. It is then a 'trade secret' and still is copyrighted. Automatically. By virtue of me having written the bad poem and by virtue of the company having produced the source code.

      If you knew how to write clearly, your counter-arguments wouldn't be a confusing mish-mash and you'd make more sense. As it stands, it looks like you're proposing a 'wide sweeping change' (always popular here on Slashdot) but you muddle around with your rationale.

      --
      A Good Intro to NetBS
  180. Check and Mate by Anonymous Coward · · Score: 1, Interesting

    This is the end for SCO. There are only three possibilities here:

    (1) That there is no infringing code that belongs exclusively to SCO. If that is the case, then it's game over for SCO; perhaps followed by jail time for Darl and friends at some federal prison where they would discover a new meaning for the phrase "pump and dump".

    (2) That there seems to be code that was duplicated directly and exclusively from the Sys V source tree and that doesn't originate from any other public source. If it's a trivial amount of code, simply replace it and move on. The unlikely perpetrator alone becomes responsible for any damages to SCO.

    (3) If it's non-trivial, you can simply remove it from the kernel as long as it doesn't impact anyone seriously. Make it part of the final 2.6.0 or a 2.4.x interim kernel release. For example, let's say the IBM journaling file system is exactly the same. Simply remove it from the kernel until such time as IBM settles its lawsuit and resolves those matters. As long as people don't really need JFS, why encumber the kernel? I've never used it, preferring ReiserFS or ext3. The same goes for other supposed code expressions such as RCU or NUMA, although I suppose the copyright issues on those would be easy to solve since the amount of code in question is on the order of 5000 lines. If all you have to do is change the way it's expressed, that should be trivial. In any case, derivative works laws shield any code that is specifically tied to hardware implementation from being considered before the court.

    Bottom line. SCO is really screwed now. Their only recourse is hope beyond hope that they can get someone to agree with their derived works claim on some non hardware/software patent code. At that point the only thing they can do is get compensated by the infringing party. No way they will be able to shake down linux users since they will already have been paid.

    Oh, I forgot the fourth possibility: that no one who has access to the Sys V sources will be willing to run 'comparator' on them and generate shreds, and that SCO will also refuse. This of course would be sufficient to dismiss their case. A declaratory judgement could be handed down by the federal judge, supported by expert testimony that 'comparator' is a valid comparison. Any reasonable expert Software Engineer/Cryptographer would do. Perhaps Bruce Schneier could be the expert witness.

    End of story. Thank you for playing SCO, please drive through.

  181. Infinite hit points and maximum karma! by Thud457 · · Score: 1
    "A few hours ago, I learned that I am now (at least in theory) absurdly rich."

    The key word there being absurdly.

    --

    the preceding comment is my own and in no way reflects the opinion of the Joint Chiefs of Staff

  182. normfuscation by dwhite20899 · · Score: 1
    Sweet - a normalizing obfuscator. You should patent that.

    I can hear the wailing of comp sci students worldwide...