Slashdot Mirror


ESR to Shred SCO Claims?

webmaven writes "According to this article in eWEEK, ESR has released a utility called comparator for analyzing the similarity of source code trees. The technical details are interesting, in that ESR says he is using an implementation of a refined version of the 'shred' algorithm, with higher performance (on machines with enough RAM) than other versions. ESR won't say whether he intends the comparator to be used to compare older Unix code to Linux so as to be able to refute SCO's claims, but it's obviously well suited for such a purpose. Interestingly, as the shred algorithm can run reports on source trees using only the MD5 signature shreds (once generated), it is possible to use it to compare trees without direct access to the source code itself, leading to a possible use in comparing various proprietary source trees with each other and with Freely available code bases such as Linux and *BSD without requiring actual disclosure of the proprietary source code (a neutral third party could generate the shreds on a company's premises, and leave without taking a copy of the source with them). I'll be interested to see if (or which of) the proprietary vendors allow their source trees to be 'shredded' for such comparisons, and whether this becomes a standard forensic technique in source-code copyright and trade-secret disputes."

127 of 554 comments

  1. maybe... by b17bmbr · · Score: 4, Funny

    microsoft can just shred their source tree and start anew. maybe...

    --
    My problem? I was perfectly gruntled, until some numbnuts came by and dissed me.
    1. Re:maybe... by jmv · · Score: 5, Interesting

      Actually, combine this with the "shared source" program from MS and it would be easy to see if MS did (or did not) copy GPL code into Windows as some suggest.

    2. Re:maybe... by Anonymous Coward · · Score: 3, Informative

      Look how Microsoft is directly trying to bias the case with one-sided news. Check out (from today) the msn.com article from supposed tech analyst Jonathan Cohen

      then read:
      More on Jonathan Cohen

      Microsoft MSN a biased propaganda machine. Only shows one side of the facts (the lies).

    3. Re:maybe... by fireboy1919 · · Score: 4, Insightful

      Right. Because as we all know, people who pay Microsoft the huge bag o' money that it costs to see their source are primarily interested in the pursuits of OSS to see if Microsoft has copied anything it shouldn't have. And Microsoft's NDA surely gives them the right to do this.

      If anyone is able to prove Microsoft is doing something illegal via the shared source initiative, they'll probably have to do it illegally.

      --
      Mod me down and I will become more powerful than you can possibly imagine!
    4. Re:maybe... by toast0 · · Score: 3, Insightful

      If you've licensed code from Microsoft, and it turns out to be GPL, the license under which you got the code is invalid, so it wasn't illegal to determine if they improperly took code.

      On the other hand, if all their code checks out, testing for that may violate their NDA, but it'd be difficult for them to show you checked their code if you don't mention it.

    5. Re:maybe... by Courageous · · Score: 5, Insightful

      And Microsoft's NDA surely gives them the right to do this.

      A term in any contract, including any NDA, as stipulated by any party, which would obligate the other party to not report a violation of law, either statute or criminal, is PER SE unlawful and cannot be enforced within any jurisdiction of most first world countries. Any contract bearing such a stipulation would in fact be at significant risk of invalidating the ENTIRE contract, not just the unlawful provisions therein.

      C//

    6. Re:maybe... by Webmonger · · Score: 2, Informative

      Hashes and lossy compression are different things. They're designed for completely different purposes and implemented for the purpose they serve. That's why LAME won't compress an mp3 to less than 8kbps, much less 128 bits. It's why md5sum doesn't have a --reproduce-original switch.

      For a given input and parameters, any two (independently-developed) MP3 encoders will almost certainly produce different outputs. For a given input and parameters, different md5 implementations will produce the same result.

    7. Re:maybe... by inode_buddha · · Score: 2, Funny

      Cool! This looks like the *perfect* tool to sort and find dupes in my pr0n collection...

      (AFAIK, nobody ever said the input to md5sum had to be human-readable)

      --
      C|N>K
    8. Re:maybe... by Stephan+Schulz · · Score: 3, Informative

      I anticipate this tool will be useless more often than not, simply because the slightest systemic change would result in zero matches. Replacing tabs with spaces, two spaces with three, or even line-feeds with carriage-returns would yield 100% false negatives if you use this to identify copyright violations.

      I've read the man page that comes with the program, and such things are taken care of. There is an option that will ignore horizontal and vertical white space for comparison purposes, and another one that ignores curly braces (possibly as bad a source of false negatives as formatting).
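
A rough sketch of how a whitespace-ignoring comparison like the one described above could work. This is not taken from comparator's source. The function names, the three-line window and the choice to delete all whitespace (rather than collapse it) are assumptions made for illustration.

```python
import hashlib


def normalize(lines):
    """Delete all whitespace from each line and skip lines that end up empty.

    Returns (original_line_number, text) pairs so that reports can still
    point at the right place in the untouched file.
    """
    kept = []
    for number, line in enumerate(lines, start=1):
        squeezed = "".join(line.split())
        if squeezed:
            kept.append((number, squeezed))
    return kept


def shred(lines, size=3):
    """Map the MD5 of each overlapping run of `size` significant lines
    to the (first, last) original line numbers it covers."""
    significant = normalize(lines)
    table = {}
    for i in range(len(significant) - size + 1):
        window = significant[i:i + size]
        text = "\n".join(t for _, t in window)
        digest = hashlib.md5(text.encode("utf-8")).hexdigest()
        table[digest] = (window[0][0], window[-1][0])
    return table


# Same code, once with tabs and a blank line, once with spaces.
tabs = ["if (x) {", "\ty = 1;", "", "\treturn y;", "}"]
spaces = ["if (x)  {", "    y = 1;", "    return y;", "}"]

print(set(shred(tabs)) == set(shred(spaces)))
```

Deleting every whitespace character is crude (it would also treat `return y` and `returny` as the same text), but it shows why tabs versus spaces or a changed indent style need not produce false negatives.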

      All in all, it seems to be quite a nice little tool.

      --

      Stephan

  2. SCO! by scovetta · · Score: 3, Funny

    Of course, we can just trust SCO to show the right hashes. Why would they lie?

    --
    Wer mit Ungeheuern kämpft, mag zusehn, dass er nicht dabei zum Ungeheuer wird. --Nietzsche
    1. Re:SCO! by jmv · · Score: 2, Interesting

      The thing is that the hashes could be generated by any organisation that has access to the SysV source code. There are many of them (IBM being one).

    2. Re:SCO! by mik · · Score: 5, Insightful

      The point is that we don't need SCO to do anything. Presumably any of the many people with legal rights to SCO source code can publish the hash list without divulging any of SCO's (ahem) "IP". Even more interesting is the theoretical possibility of comparing historical releases of SCO trees against GPL-licensed code, thus (perhaps) demonstrating that SCO has illegally violated the IP of OSS developers. Of course, hash comparisons alone would be unlikely to convince a judge/jury of anything. They ought to be sufficient grounds for some embarrassing subpoenas, and maybe some really neat cease-and-desist orders, though.

  3. Is there really that much data there? by More+Karma+Than+God · · Score: 4, Funny

    If there is, why couldn't MD5 shreds be used as a lossy compression scheme for code?

    --
    Go here to create your own Slashdot dis
    1. Re:Is there really that much data there? by Sterling+Christensen · · Score: 2, Insightful

      Because lossy compression would be useless. When decompressed the source code wouldn't work anymore.

      Source code isn't loss-tolerant (or whatever)

    2. Re:Is there really that much data there? by Paradox · · Score: 3, Informative

      No. Hashes are one way functions. So it'd be kinda pointless. Further, comparing two hashes for anything but equality is meaningless with most good hashing schemes (unless you're a cryptographer).

      --
      Slashdot. It's Not For Common Sense
    3. Re:Is there really that much data there? by Anonymous Coward · · Score: 2, Funny

      Umm... why would you want lossy compression for code? Perhaps if it only lost the bugs?

    4. Re:Is there really that much data there? by rmull · · Score: 3, Funny

      Depends. Who wrote it?

      --
      See you, space cowboy...
    5. Re:Is there really that much data there? by B'Trey · · Score: 4, Informative

      RTFA. The code is split into overlapping "shreds" of three lines. For example, 7 lines of code would generate five hashes, consisting of the following lines:

      1,2,3
      2,3,4
      3,4,5
      4,5,6
      5,6,7

      Two source trees are shredded, then unique hashes are discarded. Anywhere there are three lines of code that are the same ANYWHERE in the source tree, it'll be spotted.
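
The shredding step described above can be sketched in a few lines of Python. This is only an illustration of the idea as this comment describes it, not ESR's code. The names `shred` and `common_shreds` and the toy input files are made up.

```python
import hashlib

SHRED_SIZE = 3  # lines per shred, as in the description above


def shred(lines, size=SHRED_SIZE):
    """Return a dict mapping MD5 hex digest -> list of 1-based start lines."""
    shreds = {}
    for start in range(len(lines) - size + 1):
        chunk = "\n".join(lines[start:start + size])
        digest = hashlib.md5(chunk.encode("utf-8")).hexdigest()
        shreds.setdefault(digest, []).append(start + 1)
    return shreds


def common_shreds(table_a, table_b):
    """Discard digests unique to one side; keep where each shared one starts."""
    shared = table_a.keys() & table_b.keys()
    return {d: (table_a[d], table_b[d]) for d in shared}


file_a = ["a = 1", "b = 2", "c = 3", "d = 4", "e = 5", "f = 6", "g = 7"]
file_b = ["x = 0", "c = 3", "d = 4", "e = 5", "y = 9"]

shreds_a = shred(file_a)  # 7 lines give 5 overlapping shreds
shreds_b = shred(file_b)
matches = common_shreds(shreds_a, shreds_b)

for digest, (starts_a, starts_b) in matches.items():
    print(digest, starts_a, starts_b)
```

Here the only shared shred is the run `c = 3`, `d = 4`, `e = 5`, which starts at line 3 of the first file and line 2 of the second.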

      Now, it's trivial to defeat this if you're specifically aiming to do so. However, for existing source trees (such as nearly countless variations of *nix) that already exist and are duplicated in numerous places, it works nicely. It's impossible to go back and modify the tree because too many copies exist.

      --

      "The legitimate powers of government extend only to such acts as are injurious to others." Thomas Jefferson.

    6. Re:Is there really that much data there? by Krach42 · · Score: 2, Interesting

      Actually, some source code is loss-tolerant. Take C for example. In C the only significant whitespace is between any two elements of the set { identifiers, numbers }, and any that occurs in quotes, or character constants.

      Also, comments can potentially be discarded without affecting the compilation of the program.

      Thus, you can take a program:
      int main(void)
      {
      printf("Hello World!\n");

      return 0;
      }
      And turn it into:
      int main(void){printf("Hello World!\n");return 0;}
      You've saved yourself space here. Now, here's the weird thing: I wouldn't expect this to save any space after gzip'ing or bzip'ing. I mean, after all, you're primarily just removing one character. But it turns out that on a particular file of mine:

      -rw-r--r-- 1 dfoesch staff 9184 Sep 9 19:00 navajo.c
      -rw-r--r-- 1 dfoesch staff 3213 Sep 9 18:58 navajo.c.bz2
      -rw-r--r-- 1 dfoesch staff 1832 Sep 9 18:58 navajo.c.nospaces.bz2

      And gzip is the same. This is thus a lossy compression for source code that doesn't actually modify the semantics or syntax of the program. (Of course, this won't work for languages like Python.)

      Yes, the result is unreadable, but then you just run indent, with your favorite coding-style setup, and voilà! It's back to "normal", but different. Just like lossy compression is supposed to work.

      --

      I am unamerican, and proud of it!
    7. Re:Is there really that much data there? by BubbleNOP · · Score: 2, Insightful

      By removing whitespace you collapsed a number of distinct substrings, i.e. what used to be different substrings of the form A\s+B are now represented as just one substring AB. A smaller set of distinct substrings leads to better compression.
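
For anyone who wants to try the effect discussed in this subthread, here is a small experiment, offered as a sketch only. It collapses each run of whitespace to a single space (a weaker transformation than the one in the parent post, and one that would also alter string literals), then prints raw and zlib-compressed sizes for comparison. The sample program and function names are invented, and results will vary from file to file.

```python
import re
import zlib


def collapse_whitespace(source):
    """Replace every run of whitespace with one space (crude on purpose)."""
    return re.sub(r"\s+", " ", source).strip()


def deflated_size(text):
    """Size in bytes after zlib compression at the highest level."""
    return len(zlib.compress(text.encode("utf-8"), 9))


sample = """
int main(void)
{
    int total = 0;

    for (int i = 0; i < 10; i++) {
        if (i % 2 == 0) {
            total += i;
        }
        else {
                total -= i;
        }
    }

    return total;
}
"""

collapsed = collapse_whitespace(sample)
print("original :", len(sample), "bytes raw,", deflated_size(sample), "compressed")
print("collapsed:", len(collapsed), "bytes raw,", deflated_size(collapsed), "compressed")
```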

  4. ESR ADMITS TO ENRON PRACTICES by Anonymous Coward · · Score: 5, Funny

    This will only serve as another black eye on the Open Source community. ESR should know better than to shred SCO material prior to a trial.

  5. But the Important Question is... by BlackBolt · · Score: 2, Funny

    Did he write it in Python? And did he complete it in under 6 hours?

    1. Re:But the Important Question is... by TMB · · Score: 2, Informative

      From the README...

      Besides the production C code, the distribution also includes working Python versions. These were used to prototype the concept.

      No word on the latter... but it's ESR... so of course! ;-)

      [TMB]

  6. Doubt it will help by Brahmastra · · Score: 5, Insightful

    I think the question here is not about whether there is common code between SCO and Linux. There is no doubt that there will be common code because of the common origins. The issue here is that SCO does not own that code.

    1. Re:Doubt it will help by djh101010 · · Score: 4, Insightful

      If there's going to be a line-by-line comparison, this is the tool to do it. Once those lines are identified, *then* it's simply a matter of finding out the origins of them; that's where we can roll it back to a textbook published in 1973 or whatever.

      Until the lines that are common are identified, it's impossible to defend against the accusations. Because of that, I bet Darling Darl won't allow it to be used. The question is, how to turn the inevitable refusal into something that shuts him (up|down).

    2. Re:Doubt it will help by Azog · · Score: 5, Insightful

      Well, this would still help determine what the common code is.

      If ESR is given the big list of MD5 sums of SCO's kernel by someone who has legitimate access to it, and he runs his shred tool to compare it to the Linux kernel, and a bunch of stuff turns up matching (as expected) he can still see WHAT was matching because he has the Linux sources.
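
The scenario in the paragraph above (one side holds only a list of digests, the other holds readable source) might look roughly like this. It is a hypothetical sketch: `shred_digests`, `locate_matches` and the sample lines are invented, and a real report would cover whole trees rather than one file.

```python
import hashlib


def shred_digests(lines, size=3):
    """Yield (start_line, md5_hex) for every overlapping run of `size` lines."""
    for i in range(len(lines) - size + 1):
        chunk = "\n".join(lines[i:i + size]).encode("utf-8")
        yield i + 1, hashlib.md5(chunk).hexdigest()


def locate_matches(local_lines, published, size=3):
    """Return (first_line, last_line) ranges of the local file whose
    shred digest also appears in the published digest set."""
    return [(start, start + size - 1)
            for start, digest in shred_digests(local_lines, size)
            if digest in published]


# Generated on the licensee's premises; only the digests leave the building.
secret_lines = ["int n;", "n = read_port();", "return n;", "/* private */"]
published = {digest for _, digest in shred_digests(secret_lines)}

# The side with the open tree can see exactly which of its own lines matched.
local_lines = ["/* public tree */", "int n;", "n = read_port();", "return n;", "}"]
hits = locate_matches(local_lines, published)
print(hits)
```

The digests reveal nothing readable about the closed tree, but every hit points at concrete lines in the open one, which is what makes tracing their history possible.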

      So then he can look at that and say, "hmmm, it looks like part of this ethernet driver is the same, and this NAT implementation, and bits and pieces of the VFAT filesystem code..." and then, find out how those got to be the way they are in Linux.

      If it can be proved that the matching code is totally legit in Linux, (which is what I would expect) then it follows that either (a) SCO actually stole stuff out of Linux, rather than the reverse, or (b) Linux and SCO both took the code from a third source, like BSD.

      Otherwise, option (c) is that Linux actually contains code from SCO which it should not. But this is still an improvement on the current situation, because it would allow the Linux development team to FIX THE PROBLEM.

      Either way, (sooner or later, depending on if Linux fixes are required) it will shoot SCO's claims so full of holes that any reputable journalist reporting on SCO's latest insane claims will have to mention that "... but the source code has been analyzed and all code in Linux similar to SCO's software has been shown to be completely legitimate...", or "... but all code in Linux which SCO might have had a valid issue about has been removed..."

      SCO's big stick right now is FUD. Fear, Uncertainty, and Doubt. The shred tool can remove the uncertainty and doubt. Only SCO will still have the Fear. :-)

      --
      Torrey Hoffman (Azog)
      "HTML needs a rant tag" - Alan Cox
  7. Nah... by SargeZT · · Score: 4, Insightful

    This shouldn't be relied upon in a court of law. Although I acknowledge that SCO likely has no IP claim over Linux, it should have a fair case. A program that would rule out code similarities does not rule out code that is based on the SCO code. There are hundreds of ways to do a single thing, and if the GNU/Linux took ideas from the SCO kernel, SCO may be as eligible for compensation as if it were directly copied from SCO.

    --
    And why did you staple the trout to the RAM?
    1. Re:Nah... by jedidiah · · Score: 4, Informative

      Don't call it the "SCO kernel".

      It is the SysV kernel.

      --
      A Pirate and a Puritan look the same on a balance sheet.
    2. Re:Nah... by jmv · · Score: 3, Interesting

      That's true in general. However, SCO has explicitly stated that thousands of lines of code have been illegally copied *verbatim* from System V. This tool could at least prove that they lied (because of the verbatim copy allegation).

    3. Re:Nah... by jonabbey · · Score: 4, Informative

      if the GNU/Linux took ideas from the SCO kernel, SCO may be as eligible for compensation as if it were directly copied from SCO.

      IANAL, but I don't believe this is so in the general case. Copyright protects only specific expression of ideas, not the ideas themselves.

      If SCO had valid patents on some of this stuff, they'd have a point of legal leverage, but they don't from all reports.

    4. Re:Nah... by dipipanone · · Score: 3, Funny

      Don't call it the "SCO kernel".

      OK then, the GNU/SCO kernel.

  8. The truth is out there by Teahouse · · Score: 2, Interesting

    The truth is out there, we will finally get to it without signing a SCO NDA. This should end the case before it begins. SHRED ON!

    --
    "Curiosity killed the cat, but for a while I was a suspect."- Steven Wright
  9. Breaking News! by TexVex · · Score: 3, Funny

    This just in. SCO to sue ESR for patent infringement over "comparator", a software package that performs comparison between different sets of source code to determine if any code is copied between them.

    --
    Fun with Anagarams! LADS HOST, SHALT DOS. HAS DOLTS. AD SLOTHS, HATS SOLD. ASS HO, LTD.
  10. Answered My Own Question.. by BlackBolt · · Score: 3, Funny

    From the article:

    "...has two advantages: one, it's amazingly fast..."

    Guess not. ;-)

  11. Can Someone Explain? by Klync · · Score: 2, Interesting

    If you're comparing two sets of code via their MD5 sums, then won't that miss matching lines that differ by even one character - like, say, a space?

    --

    ----
    Not to be confused with Col.
    1. Re:Can Someone Explain? by stratjakt · · Score: 4, Interesting

      Perhaps if you parsed them both, and compared the resulting object code, right before compilation?

      That way if your variable is called numOfPorts and mine is called countOfPorts, the parsed code is the same for both, when stuff like that becomes meaningless.

      Even if not, SCO seems to be saying that much of the code is copy-n-paste anyways.

      --
      I don't need no instructions to know how to rock!!!!
    2. Re:Can Someone Explain? by Sterling+Christensen · · Score: 5, Informative

      From its manual:
      "The -w causes all whitespace in the file (including blank lines) to be ignored for comparison purposes (line numbers in the output report will nevertheless be correct). This is recommended for comparing C code; among other things it means the comparison won't be fooled by differences in indent style."

    3. Re:Can Someone Explain? by Anonymous Coward · · Score: 2, Insightful

      While you might be able to deal with whitespace, you do still have the problem that you're really only looking at whole-file matches for identity. You can't find one function lifted from some other source. You can't find code that's had even minimal cosmetic surgery on the variable names.

      While a high degree of exact matching between two trees would demonstrate related code, lack of a high degree of identical files as determined by this method does not demonstrate that two code trees are unrelated. It's perhaps an interesting metric for comparing two projects that you already know are related, like two forks of a project or two versions of one project. But this technique is nearly useless as an anti-SCO defense.

    4. Re:Can Someone Explain? by Bob+the+Hamster · · Score: 4, Informative

      And note that it is not comparing the MD5's of whole files, it is comparing MD5's of three-line "shreds" of files

  12. Other uses? by Not_Wiggins · · Score: 4, Interesting

    It might be interesting to see how different families of Linux/Unix compare... maybe generate a veritable "family tree" of relationships.

    Of course, that also depends more on how differences are actually calculated. Still, could make an interesting project to relate OSes based on how much shared code they still retain and show it in a graphical tree format, ala "family tree." 8)

    --
    Diplomacy is the art of saying, "Nice doggie!" until you can find a rock.
    1. Re:Other uses? by JeffTL · · Score: 2, Funny

      Yeah, that'd be great. In my anthropology class we've been studying that sort of stuff, but with DNA...there are some tree diagrams of primates, so why not Unices?

  13. Genius by seldolivaw · · Score: 2, Insightful

    ESR shows us once again why exactly he has so much respect from the community. Well done, that man.

    1. Re:Genius by Anonymous Coward · · Score: 3, Insightful

      Respect? Not from me. He couldn't code his way out of a paper bag. His biggest claim to fame is failing to get kernel modules accepted and then whinging about it. And making up his own Jargon file entries. And writing about how great he is - see his 'I am considerably richer than you' essay. And writing boring long-winded pieces about guns. And claiming to speak for people he shouldn't. Etc.

  14. Who cares? by Otter · · Score: 3, Insightful

    OK, I admit that a) the guy annoys the hell out of me, b) his yapping about "one of us" DOS'ing SCO is yet another case of him embarrassing Linux while aggrandizing himself and c) just the quotes in this article alone make me want to slap him. So if someone else had been involved with this, I probably wouldn't bother to care.

    Anyway -- who cares? There's no question there are plenty of common chunks between Linux and SCO-owned source. And that there are ways to find them. The question is what they are (which SCO isn't saying) and what their common origin is and where that origin falls in the murky history of the Unix codebase. It's not as if anyone has been saying, "We're helpless in the face of this computational problem. If only there were a way to compare large bodies of text for common elements!"

    Never mind that there are probably people who can compare both codebases in their heads.

    Maybe he's made some major algorithmic breakthrough. (I doubt it, but I'll leave that to the experts.) But this story is just him yapping again.

    1. Re:Who cares? by jmv · · Score: 2, Informative

      I think the difference is that a 3rd party that has access to the SysV source can compute the hashes and make them public without violating copyright. That way anyone can look for common lines with Linux and see where they came from (legal or not).

  15. Finally ESR stops yapping and does some hacking by Anonymous Coward · · Score: 2, Funny

    ESR is ok you know, but lately he has just been doing lots of ranting and soapboxing and no hacking.

    Finally he comes out with some hack action. About time man, I was beginning to view him as just some big windbag who hacked a little back in the day. Well I still sorta do, but this is at least pretty cool, you know.

    1. Re:Finally ESR stops yapping and does some hacking by mdxi · · Score: 3, Insightful

      ESR is ok you know, but lately he has just been doing lots of ranting and soapboxing...I was beginning to view him as just some big windbag

      Did you read the article? Those are some of the most self-aggrandizing quotes I've ever seen in real life. SCO lawyers should "be afraid" of him. He "perfected" the algorithm. His 1500 line program is a complete masterwork; both elegant beyond compare and a paragon of maintainability!

      You don't ever see, say, Linus, Larry, or RMS talking themselves up like that.

      --
      Posted with Mozilla
  16. SCO may not know origin of code by Malfourmed · · Score: 5, Informative

    The Sydney Morning Herald continues its mainstream coverage of the SCO vs IBM roadshow by posting an article where Dr Warren Toomey, a Unix historian, says that SCO may not know the origin of their own code.

    Article text follows:

    SCO may not know origin of code, says Australian UNIX historian

    By Sam Varghese
    September 9, 2003

    More doubts have been cast on the heritage of System V Unix code, which the SCO Group claims as its own, by an Australian who runs the Unix Heritage Society.

    Dr Warren Toomey, now a computer science lecturer at Bond University, said today: "I'd like to point out that SCO (the present SCO Group) probably doesn't have an idea where they got much of their code. The fact that I had to send SCO (the Santa Cruz Organisation or the old SCO) everything up to and including Sys III says an awful lot."

    He said that even though SCO owned the copyright on Sys III, a few years ago it did not have a copy of the source code. "I was dealing with one of their people at the time, trying to get some code released under a reasonable licence. I sent them the code as a gesture because I knew they did not have a copy," he said with a chuckle.

    Dr Toomey's statements come a few days after Greg Rose, an Australian Unix hacker from the 1970s, raised the possibility that there may be code contributed by people, including himself, which has made its way into System V Unix and is thus being used by companies like the SCO Group.

    Dr Toomey said this was one reason why the code samples which the SCO Group had shown at its annual forum had turned out to be widely published code.

    SCO was unaware of the origins of much of the code and this "explains how they could wheel out the old malloc() code and the BPF (Berkeley Packet Filter) code, not realising that both were now under BSD licences - and in fact they hadn't even written the BPF code," Dr Toomey said.

    He said that there was lots of code which had been developed at the University of New South Wales in the 70s which went to AT&T and was incorporated into UNIX without any copyright notices.

    "At that time the development that was going on was similar to open source - the only difference was that the developers all had to have copies of the code licensed from AT&T," he said.

    Dr Toomey, who served 12 years with the Australian Defence Force Academy, an offshoot of the University of New South Wales, before joining Bond University, said he had source code for Unices from the 3rd version of UNIX which came out in 1974 to the present day. "I don't have Sys V code but there are people with licences for that code who are members of the Unix Heritage Society. We can compare code samples any time," he said.

    He agreed that the codebase of Sys V was a terribly tangled mess. "It is very difficult to trace origins now. There is an awful lot of non-AT&T and non-SCO code in Sys V. There is a lot of BSD code there," he said.

    In March, the SCO Group filed a billion-dollar lawsuit against IBM, for "misappropriation of trade secrets, tortious interference, unfair competition and breach of contract."

    SCO also claimed that Linux was an unauthorised derivative of Unix and warned commercial Linux users that they could be legally liable for violation of intellectual copyright. SCO later expanded its claims against IBM to US$3 billion in June when it said it was withdrawing IBM's licence for its own Unix, AIX.

    IBM has counter-sued SCO while Red Hat Linux has sued SCO to stop it from making "unsubstantiated and untrue public statements attacking Red Hat Linux and the integrity of the Open Source software development process."

    -----

    Wordforge writing contest now open: deadline 2003-03-28

  17. What respect? by Anonymous Coward · · Score: 3, Interesting

    Most people *I* know consider ESR to be a bloated windbag with a penchant for fanatical gunrights. He's regarded as pretty much being on the same level as the late Jon Katz.

  18. Be careful... by nolife · · Score: 4, Interesting

    The more points you discover and disprove now with SCO's claims.. the higher quality, more refined, and detailed SCO's evidence will be when this setup finally gets to a court in front of a judge. If they went to court two months ago or even today, they would have been sent home quickly with evidence that is basically easy to disprove. With the help of the open source community, they are slowly changing their weapon of choice from a shotgun to a rifle.

    --
    Bad boys rape our young girls but Violet gives willingly.
    1. Re:Be careful... by Daniel+Phillips · · Score: 3, Insightful

      The more points you discover and disprove now with SCO's claims.. the higher quality, more refined, and detailed SCO's evidence will be when this setup finally gets to a court in front of a judge.

      Having many thousands of bright minds working on our side more than balances the advantage SCO can get by snooping on our discourse, if they can even come close to following it all, that is. We outnumber them; it's stupid not to capitalize on that.

      Just think, if the word doesn't go out, there are many people who might not have come out of the woodwork to contribute their valuable input, historical recollection, interesting files, legal insight, whatever. We work in the open, we share information, we cooperate, we are many in number. They work in the dark, they trust nobody, they're afraid to ask for help, they are few. It's open source versus closed source all over again.

      Also, we each do our own thinking, we try to come up with the part we can contribute, then we go looking for the best place to contribute it. Multiply by tens of thousands. Compare to a few fevered minds going over and over the same rotten thoughts then sending out marching orders. Seen two systems like that before? Right, it's a free market economy versus Soviet-style central planning. In the end, the free market won because it is more efficient.

      With the help of the open source community, they are slowly changing their weapon of choice from a shotgun to a rifle.

      A rifle will not help you much against a herd of 50,000 enraged penguins stampeding towards you at an average speed in excess of 100 miles per hour.

      --
      Have you got your LWN subscription yet?
  19. But SCO's no ordinary rabbit! by Anonymous Coward · · Score: 3, Funny

    Bruce Perens:
    Three. Three. And we'd better not risk another frontal assault. Their legal team is dynamite.
    Linus:
    Would it help to confuse it if we run away more?
    Bruce Perens:
    Oh, shut up and go change your firewall!

    Alan Cox:
    Let us taunt it! Darl may become so cross that he will make a mistake.
    Bruce Perens:
    Like what?
    Alan Cox:
    Well... ooh.
    ESR:
    Have we got bows?
    Bruce Perens:
    No.
    ESR:
    We have the Holy Hand Grenade.
    Bruce Perens:
    Yes, of course! The Holy Hand Grenade of Antioch! 'Tis one of the sacred relics Brother Richard carries with him.
    Brother Richard! Bring up the Holy Hand Grenade!
    MONKS: [chanting]
    Pie Iesu domine, dona eis requiem.
    Pie Iesu domine, dona eis requiem. Pie Iesu domine, dona eis requiem. Pie Iesu domine, dona eis requiem.

    Bruce Perens:
    How does it, um-- how does it work?
    ESR:
    I know not, my liege.
    Bruce Perens:
    Consult the Book of Armaments!
    RMS:
    Armaments, chapter two, verses nine to twenty-one.
    OPEN SOURCE ZEALOT:
    And Saint Attila raised the hand grenade up on high, saying, 'O Lord, bless this Thy hand grenade that, with it, Thou mayest blow Thine enemies to tiny bits in Thy mercy.'
    And the Lord did grin, and the people did feast upon the lambs and sloths and carp and anchovies and orangutans and breakfast cereals and fruit bats and large chu--
    RMS:
    Skip a bit, Brother.
    OPEN SOURCE ZEALOT:
    And the Lord spake, saying, 'First shalt thou take out the Holy Pin. Then, shalt thou count to three. No more. No less. Three shalt be the number thou shalt count, and the number of the counting shall be three. Four shalt thou not count, nor either count thou two, excepting that thou then proceed to three. Five is right out. Once the number three, being the third number, be reached, then, lobbest thou thy Holy Hand Grenade of Antioch towards thy foe, who, being naughty in My sight, shall snuff it.'
    Richard:
    Amen.
    KNIGHTS:
    Amen.
    Bruce Perens:
    Right!

    One!... Two!... Five!
    Alan Cox:
    Three, sir!
    Bruce Perens:
    Three!
    [sco dies]

  20. Unfortunate name by tordon · · Score: 2, Insightful

    Upper management in most enterprises has a low level of technical knowledge. To them the thought of something called shredding coming anywhere near the 'voodoo' of software development would be abhorrent.

  21. Dibs on naming the KDE GUI by Eberlin · · Score: 4, Funny

    KDE GUI version should be called Krang since Shredder would obviously be used from the command line (shell). Maybe it should have helper apps called Bebop and Rocksteady. And if the need should arise, the project shouldn't fork...it should splinter.

  22. This is actually a darn good idea by RocketRick · · Score: 5, Informative

    By computing MD5 hashes of consecutive (overlapping) line triplets, the shred algorithm makes it easy to identify copied code, without ever seeing the actual code. This might be a perfect way for companies to allow a third party to compare code, without giving away any trade secrets in the process.

    Of course, since MD5 is a very good cryptographic hash function, *any* one-bit change in the source will result in, on average, half of the bits in the result being flipped. So, this method of identifying copied code would only work if the code had never been run through an obfuscator. It would also be defeatable by running the source through a script to have its variable names search-and-replaced with similar names (such as replacing every variable name with a new name consisting of the old name plus "_newname")....

    In short, this might be a useful technique for allowing a third party to look for trivial wholesale copying of code, but it would be useless for finding a motivated miscreant, determined to steal code without being caught.
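
    A minimal sketch of the idea described above, not comparator's actual code: hash every overlapping window of three lines with MD5 and compare the digest sets. The function name `shred`, the sample lines, and the choice to strip leading and trailing whitespace are assumptions made for illustration.

```python
import hashlib


def shred(lines, width=3):
    """Return the MD5 hex digest of every overlapping window of `width` lines."""
    # Assumption of this sketch: ignore blank lines and surrounding whitespace,
    # so that re-indenting alone does not change a digest.
    cleaned = [ln.strip() for ln in lines if ln.strip()]
    digests = []
    for i in range(len(cleaned) - width + 1):
        chunk = "\n".join(cleaned[i:i + width])
        digests.append(hashlib.md5(chunk.encode("utf-8")).hexdigest())
    return digests


original = [
    "for (bp = mp; bp->m_size; bp++) {",
    "if (bp->m_size >= size) {",
    "a = bp->m_addr;",
    "bp->m_addr += size;",
    "}",
]
# The same code with a single variable renamed (a -> a_newname).
renamed = [ln.replace("a = ", "a_newname = ") for ln in original]

# Verbatim copies share every shred; the renamed copy shares none, because
# the changed line falls inside all three windows of this short sample.
common_verbatim = set(shred(original)) & set(shred(list(original)))
common_renamed = set(shred(original)) & set(shred(renamed))
print(len(common_verbatim), len(common_renamed))
```

    Only the digests need to change hands, which is the point made above about third-party comparison. The renamed copy shows the weakness: one changed line spoils every window that contains it.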

    1. Re:This is actually a darn good idea by Trailer+Trash · · Score: 5, Insightful

      So, this method of identifying copied code would only work if the code had never been run through an obfuscator.

      You've hit the nail on the head, possibly without knowing it. The source code needs to be run through an obfuscator *before* shredding. Actually, I'm thinking a special obfuscator, let me explain.

      Let's take a piece of C source, not randomly chosen:

      malloc(mp, size)
      struct map *mp;
      {
          register int a;
          register struct map *bp;
          for (bp = mp; bp->m_size; bp++) {
              if (bp->m_size >= size) {
                  a = bp->m_addr;
                  bp->m_addr =+ size;
                  if ((bp->m_size =- size) == 0)
                      do {
                          bp++;
                          (bp-1)->m_addr = bp->m_addr;
                      } while ((bp-1)->m_size = bp->m_size);
                  return(a);
              }
          }
          return(0);
      }

      Now, the structure of the code is 99% of what matters. Variable names can change, but few people would change anything beyond that. Let's modify the code in a couple of important ways. First, all variable names are changed to new names, on a per-line basis. Blank lines and unneeded blanks are all removed. Each statement is on its own line, and formatting styles (such as curly bracket placement) are standardized.

      malloc(a, b)
      struct a *b;
      {
      register int a;
      register struct map *b;
      for (a=b;a->c;a++) {
      if (a->b>= c) {
      a=b->c;
      a->b=+c;
      if ((a->b=-c)==0)
      do {
      a++;
      (a-1)->b=a->b;
      } while ((a-1)->b=a->b);
      return(a);
      }
      }
      return(0);
      }

      This might not be perfect, but it should do the trick. A programmer can change variable names, spacing, or format, but as long as the code is the same, it'll match. Obviously, changing the code would have an impact, but nearly every line would have to be changed for it to not match, and in a substantial way. That's literally not always possible to even do in a way that would trick this function.

      Anyone want to write it?

      Michael
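
      One way the normalizer proposed above could look, as a rough sketch rather than a finished tool: split each line into tokens, rename every non-keyword identifier in order of first appearance on that line, and join the tokens with single spaces before shredding. The names `normalize_line` and `normalized_shreds` and the placeholder scheme `v0`, `v1`, ... are hypothetical; comments, string literals and the preprocessor are not handled.

```python
import hashlib
import re

# The C89 keywords are left alone; every other identifier is renamed.
C_KEYWORDS = {
    "auto", "break", "case", "char", "const", "continue", "default", "do",
    "double", "else", "enum", "extern", "float", "for", "goto", "if", "int",
    "long", "register", "return", "short", "signed", "sizeof", "static",
    "struct", "switch", "typedef", "union", "unsigned", "void", "volatile",
    "while",
}

# An identifier, a run of digits, or any single non-blank character.
TOKEN = re.compile(r"[A-Za-z_]\w*|\d+|\S")


def normalize_line(line):
    """Rename identifiers per line (v0, v1, ...) and standardize spacing."""
    names = {}
    out = []
    for tok in TOKEN.findall(line):
        is_identifier = tok[0].isalpha() or tok[0] == "_"
        if is_identifier and tok not in C_KEYWORDS:
            tok = names.setdefault(tok, "v%d" % len(names))
        out.append(tok)
    return " ".join(out)


def normalized_shreds(lines, width=3):
    """MD5 digests of overlapping windows of normalized, non-blank lines."""
    norm = [normalize_line(ln) for ln in lines if ln.strip()]
    return [
        hashlib.md5("\n".join(norm[i:i + width]).encode("utf-8")).hexdigest()
        for i in range(len(norm) - width + 1)
    ]


theirs = ["a = bp->m_addr;", "bp->m_addr += size;", "return(a);"]
ours = ["result=entry -> m_addr ;", "", "entry->m_addr+=len;", "return( result );"]
print(normalized_shreds(theirs) == normalized_shreds(ours))
```

      Renaming per line, as suggested above, keeps the scheme simple, but it also means two different lines with the same token pattern hash alike, so false positives go up as evasion gets harder.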

  23. Ups and downs by autocracy · · Score: 4, Informative
    Upside: we can maybe help catch more stolen code.
    Downside: Uh... it just came out... and it's making some big, big claims involving fuzzy logic. I think it's gonna need some testing first, eh?

    Also, anybody else think it only works on larger sections of code than just say 10 lines?

    --
    SIG: HUP
  24. Bad for Students by chicagoan · · Score: 2, Funny

    I'm just glad that I finished college before they had this technology otherwise I might have been caught for cheating. Although I was really good at renaming variables.

  25. Automating people's careers away by YetAnotherName · · Score: 4, Funny

    Thanks ESR. You've just put a team of mathematicians at SCO who were somehow related to MIT out of their jobs.

  26. Re:fire the "laser" by be-fan · · Score: 3, Interesting

    You know the sad thing about all this? I can't tell the difference between the auto-generator or your average Slashdotter. Does this mean that the auto-generator passes the Turing Test, or that the average Slashdotter doesn't?

    --
    A deep unwavering belief is a sure sign you're missing something...
  27. Slim to None by tomRakewell · · Score: 5, Insightful

    Chances are slim to none that a software company would allow its "shredded" source code to be publicly released. What happens if the proprietary source is found to violate the GPL?

    Proprietary (closed) source companies have a tremendous advantage over open source software when it comes to violating intellectual property. Who will ever know if they did it? A source code "comparator" eliminates that crucial advantage.

    1. Re:Slim to None by JoeBuck · · Score: 4, Interesting

      But IBM already has a copy of SCO's code; they licensed it after all. They can release the output of "shred" without violating their agreements with SCO.

    2. Re:Slim to None by k98sven · · Score: 2, Interesting

      Proprietary (closed) source companies have a tremendous advantage over open source software when it comes to violating intellectual property. Who will ever know if they did it? A source code "comparator" eliminates that crucial advantage

      Not really.. Open-source software usually has a nice setup with mailing-lists, CVS, etc. Most of the code is well accounted-for. The same is not as true with a lot of proprietary software.

      Remember, it's not enough that two pieces of code match to prove an infringement in court.
      In fact, the court will most likely take into consideration the fact that the defending code is open-source, and the burden of proving that they originated the code would be increased for the plaintiff.

      Also, failing to prove that they originated the code could leave them open to a countersuit in which the tables would be turned against them, since they obviously had access to the open-sourced code.

  28. Results Will Appear "Tainted" by zapf · · Score: 5, Insightful

    While I fully support ESR and the rest of the open source movement's defense of Linux against SCO, I have a feeling that this tool's results will not immediately be accepted by established media simply because of ESR's bias. A reporter looking into the SCO story who knows little about open source wouldn't trust a tool made by one side of the disagreement.

    It seems very important to me that "third parties" and experts who are not an integral part of the open-source movement validate that comparator works as intended and is effective at detecting code similarities. Hopefully we'll see some articles on respected sites in the next week or so with conclusive analyses of comparator. Not to mention a chance for someone to use it on SCO's code!

    Oh, and "Yes, I'm being deliberately vague and tantalizing" is quite funny.

    1. Re:Results Will Appear "Tainted" by Brandybuck · · Score: 5, Insightful

      A reporter looking into the SCO story who knows little about open source wouldn't trust a tool made by one side of the disagreement.

      Then why would a reporter trust the press releases that SCO puts out on a daily basis?

      The unfortunate reality is that they DO trust them. We may all think this is a joke here in our insular community, but the great majority of reporters report the press releases "as is". Then the analysts come along and refine those press releases into easily digestible chunks. Then the pundits come along with preconceptions based on those chunks. Ever wonder why the SCO stock keeps going up and up and up? It's because the only thing the general public knows about this issue has come from SCO.

      Anything that can help get the truth before the public eye is a Good Thing(tm). A tool that can mathematically "prove" that SCO is lying is valuable, even if most reporters suspect a bias.

      --
      Don't blame me, I didn't vote for either of them!
  29. IBM has a project called History Flow by TedTschopp · · Score: 5, Interesting

    This is perhaps a better project and it would be interesting to see this tool run against the source.

    History Flow. The following is from their website:

    history flow
    visualizing dynamic, evolving documents and the interactions of multiple collaborating authors:

    Motivation
    Most documents are the product of continual evolution. An essay may undergo dozens of revisions; source code for a computer program may undergo thousands. And as online collaboration becomes increasingly common, we see more and more ever-evolving group-authored texts. This site is a preliminary report on a simple visual technique, history flow, that provides a clear view of complex records of contributions and collaboration.

    --
    Fantasy remains a human right; we make in our measure and in our derivative mode... -- JRR Tolkien
  30. Would this really work? by Paradox · · Score: 3, Insightful

    Well, I was looking at ESR's description of the code (I haven't read the code yet), and it seems to say that he takes 3 line slices, MD5s them, then compares them for identical points. I'm sure he compensates for funky whitespace and whatnot like diff and patch do...

    But if even one bit of the source is different, the MD5 hash will be quite different. So, the code slices have to be IDENTICAL. This is not a very good system because a simple find-replace could defeat it. A variable's name changed by one letter, or even capitalization, will defeat it.

    Unless the code reveals much more complex tricks than ESR describes in the help file, this tool wouldn't be much use in the SCO case. Hell, it wouldn't be much use catching college class cheaters even.

    --
    Slashdot. It's Not For Common Sense
    1. Re:Would this really work? by net_bh · · Score: 3, Insightful
      But SCO claims that a lot of code has been copied *as-is* along with the comments.

      The tool ought to be able to highlight all those flagrant cases (if any) and the report generator would then generate something that would be perused by a human.

      --
      There is no patch for stupidity

      Visit my blog

    2. Re:Would this really work? by shaitand · · Score: 2

      except that according to SCO millions of lines were copied VERBATIM into linux.

      Verbatim would give a matching md5 sum, sysv code isn't tough to get your hands on (especially since IBM has it, as well as their own code they supposedly contributed). Making the md5 hashes will be a breeze.

  31. Re:Nonsensical idea by El · · Score: 4, Insightful

    Comparing the hashes doesn't give you a definitive answer; it does, however, tell you where to look. Or which submitters to ask for clarification on the origins of potentially infringing code. That's more than we have now!

    --

    "Freedom means freedom for everybody" -- Dick Cheney

  32. Its been around for years by Anonymous Coward · · Score: 3, Interesting

    check out this research project coming out of berkeley CAP

    Drop in the code you are interested in and it will tell you where it's found in a bunch of open source stuff, including the Linux kernel.

  33. He's a clever so and so, isn't he? by NickFortune · · Score: 2, Insightful
    Firstly this gives us a way around the SCO NDA bullshit. So far the only way to disprove their case has been to look at the code, after which the NDA stops you from telling anyone. This lovely piece of work sidesteps that nicely. Furthermore, it opens a whole load of possibilities.

    It gives software houses a way of publishing commercial code for copyright purposes. If you claim copyright on code, you can publish the MD5 shred sigs for the code. No one can rip you off, but you can enforce your rights in a court.

    Even better - no one now has an excuse for not publishing. That means that we can make sure the kernel never comes within spitting distance of anyone else's property again. And if it does - well they should have published.

    Now if SCO aren't willing to publish their MD5 shreds, then that can only be because they have no case. In which case - game over!

    On the other hand, if they do, the world at large can then go through their published shreds and see exactly whose code SCO have been ripping off. Given the likely origins of those samples they exhibited a while back, I'd say that's likely to be quite a bit.

    This looks like the best news for the war against everyone's favourite Stupidly Corrupt Organisation since the whole mess kicked off.

    --
    Don't let THEM immanentize the Eschaton!
  34. It seems impossible to compare without leaking by expro · · Score: 3, Insightful

    Think of the chance that any given line of source code in an arbitrary program is repeated somewhere else in a large open source program such as the Linux kernel. This is even more true if some degree of fuzziness is added to handle changes such as adding or removing spaces in insignificant places or removing comments (and there are many other things, like brace style, which affect multiple lines, so you might want to physically reformat between lines to a standard format).

    If even only 1% of the lines are found somewhere in the open source code base, I think a source who wants to keep their code base secret will have a big problem with someone computing the checksums. In reality, I wouldn't be surprised to see a much higher percentage of lines leaked this way. And this is not the only way leaking can occur (think of application of simple cryptography).

    I would not want to be the one publishing the checksums of the closed source due to possible legal liability. The checksums are a derived work in any case.
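
    The leak described above can be sketched as a guess-and-confirm attack: anyone holding the published digests can hash windows of publicly available code and learn exactly which of them also occur in the secret tree. The helper names and sample lines below are hypothetical, and the sketch assumes the same three-line window on both sides.

```python
import hashlib


def digest(lines):
    """MD5 hex digest of a window of lines."""
    return hashlib.md5("\n".join(lines).encode("utf-8")).hexdigest()


def shreds(lines, width=3):
    """Set of digests for every overlapping window of `width` lines."""
    return {digest(lines[i:i + width]) for i in range(len(lines) - width + 1)}


def recover_known_chunks(published, public_lines, width=3):
    """Return each window of public code whose digest was published."""
    found = []
    for i in range(len(public_lines) - width + 1):
        window = public_lines[i:i + width]
        if digest(window) in published:
            found.append(window)
    return found


# The secret tree never leaves the vendor; only its digests are published.
secret = ["int secret_key = 42;", "i++;", "}", "return 0;", "}"]
published = shreds(secret)

# The attacker only has public code and the published digests.
public = ["foo();", "i++;", "}", "return 0;", "bar();"]
print(recover_known_chunks(published, public))
```

    With a window of one line instead of three, every common line would be confirmed individually, which is the worst case for the concern raised above.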

  35. In all fairness.. by k98sven · · Score: 3, Insightful

    SCO was unaware of the origins of much of the code and this "explains how they could wheel out the old malloc() code and the BPF (Berkeley Packet Filter) code, not realising that both were now under BSD licences - and in fact they hadn't even written the BPF code," Dr Toomey said.

    The SCO Group (not old SCO) hasn't written any code in SysV UNIX.

    Anyway.. One could hope that when this is all over, the UNIX sources will be bought up from the carcass of SCOX and open-sourced, finally putting it out of its misery..

    That is, as long as SysV UNIX doesn't have more stolen code in addition to the BSD code we all know about..

    The sooner the zombie of UNIX is put to rest, the better for all the live Unices.

    1. Re:In all fairness.. by Dr_Marvin_Monroe · · Score: 4, Insightful

      In all fairness, SCO's value is not in being purchased so that the source code can be freed...

      SCO's value is in acting as a totem against future companies who would try this same stunt....Their value is in their smoking carcass with Darl's charred head mounted prominently on a high pike...

      At this point, there can be no compromise with people who commit fraud to inflate their stock price and to promote FUD.... I believe that Darl KNOWS that his claims are false...he deserves to fry....

      I say, "smoking head on stake" for all the SCO/Canopy group members.... leave all the execs at SCO without a job and discredited like the MCI/ENRON execs....Leave all the investors holding worthless stock certs....Somebody needs to be an example, and SCO volunteered by inflating/changing/hyping/FUDing their claims.

      I could have had a little sympathy for them if they had just filed their suit and shut-up until the trial....but at $17/share now, we need to destroy some wallets to remind everyone that it's not over till the gavel falls......

  36. SCO claims, Open Source acts by Arnulf · · Score: 2, Insightful
    I'm impressed.

    While Dark McBride and Chris Sontag shoot their mouths off, the community develops tools to finally make something clearer. :)

    -Arnulf

  37. I can write such a utility also! by pclminion · · Score: 5, Funny

    #include <stdio.h>

    int main()
    {
        printf("These source trees appear to be entirely different!\n");
        return 0;
    }

  38. Who says SCO gets to court first? by JoeBuck · · Score: 4, Interesting

    If we can show that SCO's violating the BSD license, maybe we can convince some BSD copyright holder to sue them first, and demand as part of discovery the MD5 checksums from "shred", showing duplicated BSD code but no duplicated BSD copyright.

  39. What if...? by bladernr · · Score: 4, Insightful

    What if this ESR tool runs and finds commonality, and the research shows that, in fact, SCO's rights were breached. Remember, this type of analysis is a two-edged sword. The purpose of this ESR is to remove doubt... but remember doubt could be removed either direction.

    So, given that hypothetical, what would people here think? Would you forgive SCO? Would you concede SCO's point, but think that SCO defended their rights in a very poor manner? (this, btw, is what I would probably do). Would you stick your fingers in your ears and refuse to accept the outcome, and believe in some vast -wing conspiracy?

    Obviously, the Linux movement would carry on. I don't think the death of Linux is even worth discussion. Some recourse would happen, probably monetary damages, and the offending code would be removed.

    My real curiosity is how people's attitudes or feelings would change (or not change) if it turns out SCO is right (however unlikely that is).

    --
    Sarcasm and hyperbole are the final refuges for weak minds
    1. Re:What if...? by good-n-nappy · · Score: 2, Insightful

      It still would not prove SCO's point. They have to answer the question of why they distributed the Linux source even after finding out it contained their valuable "IP."

      Most of us are relying on common sense and don't really care whether a few lines of archaic code were copied. Given SCO's

      1) previous sales of Linux
      2) misinformation about owning Unix
      3) waffling on what IP is violated
      4) refusal to show copied code
      5) frequent, inconsistent press releases
      6) heavy insider trading
      7) ridiculous licensing terms
      8) collusion with MS
      9) discredited evidence of code being copied
      10) ?
      11) profiting

      Common sense says that SCO does not have a legitimate claim. There is no rabbit left for SCO to pull. So no, most people would not "forgive" them although they might concede the point that a few lines of ancient code were copied.

      --
      Never underestimate the power of fiber.
    2. Re:What if...? by Anonymous Coward · · Score: 2, Informative

      It's not that people around here think SCO is evil for saying their IP has been stolen. Some people around here think SCO is 'evil' for how they've handled the situation.

      If I remember correctly, the open source community has made several offers to remove the tainted code if SCO would just say what code is in violation.

    3. Re:What if...? by The+Cydonian · · Score: 2, Insightful

      A two-cent observation regarding most media "debates":- folks usually attack other people, not positions. It's usually very rare to find anyone who's changed his position based on new evidence or inference. Not impossible, just rare. (Which, in a way, is why not getting emotional about anything is always a good idea; that way, you can base your support on rational thought, not objects)

    4. Re:What if...? by ls+-lR · · Score: 2, Insightful

      Realistically speaking, if indeed there were infringing parts, we never would have heard of any of this. The whole tone of this article, and the quotes from Raymond, smack of "I've already done the comparison and nothing's there." I think if by some small chance there was something illicit in the Linux tree, Raymond would have notified the maintainer and/or Torvalds and put out a patch to remove it ASAP. Or at least, that's what they've constantly stated they would do were this the case.

      In other words, I believe them when they say that "If by some chance there were infringing code we'd do everything we could to remove it very quickly." SCO's lawyermongering only really applies to IBM anyway, so that point is rather moot.

  40. how about the Bible Code Algorithm? by The+Lynxpro · · Score: 3, Funny

    Have any of those techno-Rabbis run a comparison search with their "Bible Code" program on SCO? Did it come up with the phrases "bankrupt in 2004," "full of camel dung," and "Serpent of Utah"? How about running the "Bible Code" on Unix System V. code? Considering SCO's fondness for converting code over to Greek symbols for their presentations, converting to sanskrit, Hebrew or Aramaic shouldn't be a problem...

    --
    "Right now, somewhere in this world, Scott Baio is plowing a woman he doesn't love," - Peter Griffin, *Family Guy*
  41. Here's how you defeat obfuscators by Serveert · · Score: 3, Insightful

    Compare C parse trees. That's right, look at the parse trees, use some fancy graph algorithm to compare the calculations and parse tree nodes.

    Someone mod this up I think I'm on to something!
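
    A rough illustration of the parse-tree idea above. Python's standard library has no C parser, so this sketch uses Python's own `ast` module on Python snippets as a stand-in for C parse trees; that substitution, the `shape` helper and the sample functions are all assumptions for illustration. Every name is blanked out and the remaining tree shapes are compared.

```python
import ast


class _Anonymize(ast.NodeTransformer):
    """Replace every variable, argument and function name with '_'."""

    def visit_Name(self, node):
        return ast.copy_location(ast.Name(id="_", ctx=node.ctx), node)

    def visit_arg(self, node):
        node.arg = "_"
        return node

    def visit_FunctionDef(self, node):
        node.name = "_"
        self.generic_visit(node)  # also anonymize arguments and body
        return node


def shape(source):
    """Return a string describing the parse tree with all names removed."""
    tree = _Anonymize().visit(ast.parse(source))
    # ast.dump omits line and column numbers by default, so layout,
    # blank lines and comments do not affect the result.
    return ast.dump(tree)


src1 = "def total(xs):\n    s = 0\n    for x in xs:\n        s += x\n    return s\n"
src2 = "def sum_all(items):\n\n    acc = 0\n    for it in items:\n        acc += it   # add\n    return acc\n"
src3 = "def total(xs):\n    s = 0\n    for x in xs:\n        s -= x\n    return s\n"

print(shape(src1) == shape(src2), shape(src1) == shape(src3))
```

    Blanking all names is deliberately coarse: it cannot tell `a + b` from `a + a`, so a real tool would need a smarter renaming scheme and the "fancy graph algorithm" mentioned above to score partial matches.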

    --
    2 years and no mod points. Join reddit. Because openness is good.
  42. Re:The real question is: by bladernr · · Score: 3, Insightful
    Hmmmm.... that is an interesting point, but I'm not sure it applies.

    Here is the reason: the people that "stole" SCO's code (if indeed that happened) probably were not acting with ill intent. They probably thought they were doing genuine, valid reuse, in which case, why hide it? Obfuscating runs the risk of introducing new bugs.

    OSS programmers, even the ones that cut corners, are not malicious in my experience. There are honest mistakes made, because, well, they are lone programmers, not lawyers, or professional managers, or financial experts, or whatever.

    However, if code was deliberately obfuscated, that would be very, very bad news for Linux. That shows that it was not an honest mistake, but the programmer knew something about the origins and they needed to be hidden. At the best, he could argue that he didn't think that it was an IP violation, he was just trying to make himself look better by not giving credit. The other side could argue he obviously knew he was breaking the law.

    Of course, as I said, I honestly don't think this case will come about. Even if code found its way in, I don't think it was a programmer saying "Hey, I'm going to do this, but it is illegal, so I will cover my tracks."

    --
    Sarcasm and hyperbole are the final refuges for weak minds
  43. derivative work? by donutz · · Score: 4, Interesting

    Presumably any of the many people with legal rights to SCO source code can publish the hash list without divulging any of SCO's (ahem) "IP".

    Would these hashes of SCO source code be considered derivative works? That could have copyright implications...

    1. Re:derivative work? by Wumpus · · Score: 4, Insightful

      Let's apply a powerful legal tool: The silly analogy.

      Take a copyrighted work (Harry Potter and The Chamber of Secrets, for example).

      Now, rearrange all the letters randomly, and pick (say) every 10th letter. Apply rot13 to the result, and print it.

      Is this derivative work? If you think it is, then, yes, copyright holders should be able to control MD5 hashes produced from their work.

    2. Re:derivative work? by Ninja+Programmer · · Score: 2, Insightful
      Would these hashes of SCO source code be considered derivative works?
      It's derivative, but is it work? I think this could easily be explained to the court. There is no expressive usable content in MD5 hashes and it's not realistically reversible -- that's the whole point of it.
    3. Re:derivative work? by ls+-lR · · Score: 2, Interesting

      I think we all agree that the obvious "duh" answer is that "of course they wouldn't be derivative works." But SCO has proven that it has a knack for just making stuff up or interpreting things funny. However, even based on the letter of the law I don't think this would qualify as a "transformation." That would seem to apply to a case where you shift the representation of the data to a different format but retain its essence, such as copying a DVD to a VHS tape. However, creating MD5 sums does not seem like it would be a transformation in that sense, in that the new work has none of the qualities of the original -- it's not code, it won't compile, it cannot be used to divine any algorithms, methods, etc. In short, it's completely useless, other than for comparing to other source code fragments.

  44. Re: Copyright by daigu · · Score: 2, Insightful

    While IANAL, I don't suspect you are either. Copyright is not something that applies to ideas - it applies to expressions of ideas. I'll quote the Apple vs Microsoft case note by Joseph Meyers:

    Typically, copyright protection is awarded to literary work, and Congress has included the code that makes up a computer program in this category.[7] Additionally, some non-literal expression is protected. For example, not only are the actual words in an author's copyrighted novel protected, but the structure and plot may be protected as well.[8] The debate[9] in the area of computer science is whether, by analogy, this means that the result, output, organization or display (the "look and feel") of a computer program might be protected as well, even if the source code is different.[10] This is to be distinguished from the idea underlying the program, which is not subject to copyright protection.[11]
    In other words, the question on the table is whether portions of the Linux kernel are a derivative work of SCO's code - not whether it uses SCO's ideas.
  45. Not as useful in court by klui · · Score: 2, Interesting

    It will just tell someone two trees are similar/identical. The important thing to prove in court is who copied from whom.

  46. Better way to compare code by Brikus · · Score: 2, Informative

    Speaking of BSD, a better way of doing this comes from Berkeley too. It's a program called Moss that is used by many universities to detect plagiarism in CS classes. I know from firsthand experience that this is a very powerful program. Unlike the shredding technique, things like changing variable names won't affect the comparison value Moss returns. It even does a pretty good job of noticing changes like replacing for loops with while loops.

    One disadvantage it does have though is that it won't work with the MD5 checksums, although I'm a bit skeptical of how well that would work anyway.

  47. Re:IT WILL NOT WORK! Here's technical reason why by Q2Serpent · · Score: 4, Funny

    create a C language parser that reduced the C-code down to op codes

    like gcc?

  48. Re:Is this really as useful as it seems? by toast0 · · Score: 3, Insightful

    Finding obfuscated copied code is a difficult problem to solve. Presumably, SCO has put forth much effort into that, but they refuse to make public their claims.

    Straight forward copying of code is much easier to find, and much easier to show is copying in a court. If we look at all the instances of duplicate code, and determine if they are license violations or not, it will be a start to making SCO go away.

  49. Bah! FSS developers will never learn... by greppling · · Score: 3, Funny

    ...how to write good user interfaces. With coders like you we will never achieve complete world domination. The correct program is, of course, s.th. like this:

    int main()
    {
    int i;
    printf("Comparing source trees...\n");
    sleep(2);
    printf("Check started.\n");
    for (i = 1000; i--;) {
    printf(".");
    sleep(1);
    if (i % 100 == 0)
    printf("\n%d0 percent remaining\n", i / 100);
    }
    printf("\n\nThese source trees appear to be entirely different!\n");
    return 0;
    }

    1. Re:Bah! FSS developers will never learn... by joe_bruin · · Score: 2, Informative
      Line 7: for (i = 1000; i--;) {

      Where's the limit test? Or did you mean:

      for (i = 1000; ;i--) {


      What the original poster had works correctly. i-- is a post-decrement: it yields the value of i before decrementing it, so the condition becomes false, and the loop ends, once i has reached zero.
  50. Open Source by digidave · · Score: 3, Interesting

    THIS is exactly why Open Source works. It's not because of IBM or Red Hat or geeks from Finland. It's because people in the community are willing to step up to any challenge.

    Thanks, ESR.

    --
    The global economy is a great thing until you feel it locally.
  51. You guys are missing the point. by LinuxParanoid · · Score: 4, Informative

    Pardon me, but a lot of you guys are missing the point of this comparator.

    1) There are people out there with legit source licenses to SVR5 source trees. And not just Unix OEMs. Various people in large companies with SVR4/5 source licenses etc.

    2) Such people cannot release the source code, and may (if paranoid of how they interpret 'derived works') not want to publish hashed MD5 codes of SVR5.

    3) However, it should be possible for a legit SVR5 source licensee to publish openly a list of areas of code that are similar across trees, without that list being either A) a derived work, B) violating their NDAs (um, do check the fine print first though) and C) spending tons of their own, presumably expensive time, digging through stuff.

    Then Linux advocates can sift through the resulting large pieces of code and doublecheck/crosscheck the origins of it. At the very least, we'd have a public list of suspicious areas of Linux and could determine that certain parts are A) BSD-licensed, B) are verified as original by a known Linux coder, and C) don't fall into the above categories and remain 'suspicious'. This presumably is what ESR is referring to by "various persons will apply it in useful ways. Yes, I'm being deliberately vague and tantalizing". Let's say that it's likely the percentages in A and B will be large.
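
    The kind of list described in point 3 could be produced along these lines. This is a sketch under stated assumptions, not comparator's report format: trees are modeled as plain dicts mapping a file name to its lines, and `shred_with_positions` and `common_regions` are hypothetical names. The output holds only file names and starting line numbers, never source text.

```python
import hashlib


def shred_with_positions(lines, width=3):
    """Yield (digest, 1-based starting line) for each window of `width` lines."""
    out = []
    for i in range(len(lines) - width + 1):
        chunk = "\n".join(ln.strip() for ln in lines[i:i + width])
        out.append((hashlib.md5(chunk.encode("utf-8")).hexdigest(), i + 1))
    return out


def common_regions(tree_a, tree_b, width=3):
    """List (file_a, line_a, file_b, line_b) for every shared window."""
    index = {}
    for name, lines in tree_a.items():
        for digest, lineno in shred_with_positions(lines, width):
            index.setdefault(digest, []).append((name, lineno))
    report = []
    for name, lines in tree_b.items():
        for digest, lineno in shred_with_positions(lines, width):
            for a_name, a_line in index.get(digest, []):
                report.append((a_name, a_line, name, lineno))
    return sorted(report)


tree_a = {"a.c": ["x = 1;", "y = 2;", "z = 3;", "w = 4;"]}
tree_b = {"b.c": ["int q;", "x = 1;", "y = 2;", "z = 3;"]}
for file_a, line_a, file_b, line_b in common_regions(tree_a, tree_b):
    print("%s:%d matches %s:%d" % (file_a, line_a, file_b, line_b))
```

    Each reported location still has to be checked by a human for origin and license, which is the sifting step described above.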

    Of course it's true that there could be code that this primitive tool doesn't catch. But SCO probably started their analysis by using tools like this also. Looking through millions of lines of code by hand is no cakewalk, so one will inevitably start with code like this in such an investigation. (Unless one is concerned about one specific predetermined critical/sensitive piece of code.)

    Oh, and the other thing about this tool that is nice IMHO? It demonstrates a "good faith" effort on the part of Linux advocates and coders to correct the problem -- despite the barriers raised by SCO (no code release except via NDA).

    Finally, running this tool across a Linux and a BSD release should turn up some data that is both interesting and relevant for this dispute. I'm almost tempted to try that myself.

    --LP

  52. Press release! by mflaster · · Score: 2, Interesting

    Why isn't this a press release?

    If I go to Yahoo, and look at news related to SCOX, this doesn't show up. Here is the open source community trying to help find any misappropriated IP - and no one that doesn't read slashdot/eWeek will know about it!

    Isn't there someone who subscribes to a wire service, that can issue a press release? In order to fight FUD, we have to get info out to people that don't read slashdot!!

    Mike

  53. No source = no copyright by poptones · · Score: 4, Insightful
    This entire argument is happening for ONE reason: various governments of the world (specifically, in this case, the US) have afforded COPYRIGHT protection to works that contribute nothing to "furtherance of the state of the art" and nothing to "the progress of science." If I build a power saw, I can patent unique aspects of its design but have to reveal those aspects.

    Copyright is misapplied to source code. Either REVEAL THE SOURCE or you only get protection on that which you "publish" - namely, the binary.

    Put up or shut up; no source, no copyright on the source. You won't share it, you don't need it protected.

    1. Re:No source = no copyright by poptones · · Score: 3, Interesting

      Apparently your reading comprehension skills are right on par with the dolt who modded the post down.

    2. Re:No source = no copyright by IM6100 · · Score: 3, Interesting

      Get a clue. Nobody who copyrights a work is under any obligation to widely spread around the work. Copyright is inherent in any written work. I can write a poem intended only for my lover, just give the one copy of the poem to that lover, and it's protected by copyright. Break into my lover's house, steal a copy of the poem, and publish it, and you've broken copyright and I have standing to nail you good for it.

      Patents, in order to be patented, need to be fully disclosed. That's inherent in the patent process, you're saying 'this is MY idea, here's the whole deal laid out, I assert that it's mine.' There's no comparable obligation for copyright.

      People like you who try to mush it all up are just trying to loot other people's property.

      --
      A Good Intro to NetBS
    3. Re:No source = no copyright by Raffaello · · Score: 2, Informative

      You are missing the context to which the OP refers, which is Article I, Section 8 of the United States Constitution. This Section gives Congress the power:

      "To promote the progress of science and useful arts, by securing for limited times to authors and inventors the exclusive right to their respective writings and discoveries; "
      The "requirement" that the grantee of either a copyright or a patent publish the work in question follows from the first clause of the sentence, i.e. "To promote the progress of science and useful arts."

      If grantees were allowed to keep their works secret, the grant would not be promoting the "progress of science and useful arts," since no other scientist or author would have access to their work.

      The whole idea of patents and copyrights in the U.S. constitution is that the grantee goes public with the invention/work, thus letting others advance "science and the useful arts" by using the grantee's work. In exchange for disclosing this information (remember, the word "patent" means "public."), the grantee is given a legal monopoly on the right to profit from the invention/work for a limited period of time.

      As the constitution sees it there are only two alternatives. Either the potential grantee keeps the work/invention a trade secret, never publishing it, but thereby giving up legal rights to time limited exclusive profitability, or, the potential grantee "promotes the sciences and useful arts" by publishing the work/invention, and thereby gains a time limited exclusive right to profit from it.

  54. SCO's trade secrets --- it's all FUD by EmbeddedJanitor · · Score: 5, Funny

    They would be divulging SCO's biggest trade secret, that all their claims are just FUD.

    --
    Engineering is the art of compromise.
  55. Re:IT WILL NOT WORK! Here's technical reason why by Megaslow · · Score: 4, Informative

    RTFM:
    Name

    comparator, filterator -- fast comparisons among large source trees
    Synopsis

    comparator -c [-d dir] [-o file] [-s shredsize] [-w] [-x] path...

    [snip]

    The -s option changes the shred size. Smaller shred sizes are more sensitive to small code duplications, but produce correspondingly noisier output. Larger ones will suppress both noise and small similarities.

    [snip]

    The -w causes all whitespace in the file (including blank lines) to be ignored for comparison purposes (line numbers in the output report will nevertheless be correct). This is recommended for comparing C code; among other things it means the comparison won't be fooled by differences in indent style.

    Using the appropriate switches will address two of your points. I'm sure it wouldn't be extraordinarily difficult to modify the code to ignore other things such as string constants, variable names, etc.
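
    As a rough illustration of what the -w switch described above buys you, here is a minimal Python sketch. It is not comparator's actual code: the function name shred_digest and its ignore_whitespace flag are made up for this example, and it assumes that line boundaries still separate the lines of a shred after whitespace is stripped.

```python
import hashlib


def shred_digest(lines, ignore_whitespace=False):
    """Return the MD5 hex digest of a shred (a short run of source lines).

    Hypothetical helper for illustration; not taken from comparator.
    """
    if ignore_whitespace:
        # Remove every whitespace character, then drop lines that became empty.
        lines = ["".join(line.split()) for line in lines]
        lines = [line for line in lines if line]
    return hashlib.md5("\n".join(lines).encode("utf-8")).hexdigest()


# The same code in two indent styles, one with a stray blank line.
spaced_style = ["if (x) {", "    y = 1;", "}"]
tabbed_style = ["if(x){", "\ty=1;", "", "}"]

# Raw text differs, so the digests differ.
print(shred_digest(spaced_style) == shred_digest(tabbed_style))

# With whitespace ignored, both shreds reduce to the same text.
print(shred_digest(spaced_style, True) == shred_digest(tabbed_style, True))
```

    Note that a brace moved onto its own line would still change the result in this sketch, because the lines themselves would be split differently.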

  56. Re:Who says ESR can't code? by finkployd · · Score: 4, Insightful

    Actually fetchmail proves that he can code.

    This program just proves that md5 is not the correct hash for doing this kind of comparison. It is TOO GOOD of a one way hash, and will only return a positive if the lines being compared are 100% equal.
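
    To make the point concrete, here is a small Python sketch contrasting an exact MD5 comparison with a fuzzy similarity score. The two code lines are invented for the example, and difflib is only one possible similarity measure, picked because it ships with Python; nothing here comes from comparator itself.

```python
import difflib
import hashlib

# Two made-up lines that differ only in the name of the loop variable.
original = "for (i = 0; i < n; i++) total += buf[i];"
renamed = "for (j = 0; j < n; j++) total += buf[j];"


def md5_hex(text):
    """Return the MD5 hex digest of a string."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


# A cryptographic hash only answers "identical or not".
exact_match = md5_hex(original) == md5_hex(renamed)

# A sequence matcher gives a score between 0.0 and 1.0 instead.
similarity = difflib.SequenceMatcher(None, original, renamed).ratio()

print("MD5 digests equal:", exact_match)
print("difflib similarity: %.2f" % similarity)
```

    The trade-off is that a similarity score needs the text of both sides, so it cannot be computed from digests alone.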

    Finkployd

  57. Someone Did this in June. by Popsikle · · Score: 2, Interesting

    I can't dig up the slashdot post, but here is The Inquirer article from Jun 18th. Someone did this well before esr did.

    It's not new, it's not esr's idea, it's almost 3 months old!!!

  58. Comparison algorithms? by Ivan+the+Terrible · · Score: 2, Interesting

    I'm interested in algorithms that could be used to compare code. Moss and CAP from Berkeley are not interesting because the algorithm is secret (AFAIK).

    What algorithms other than ESR's comparator are there? (I recall but can't locate a recent comment on Slashdot that said something like "most plagiarism detection programs used by professors use the XXXX algorithm".)

  59. Re:IT WILL NOT WORK! Here's technical reason why by miniver · · Score: 3, Informative

    Download & read the source. Or just read the documentation.

    Comparator has the capability (-w) to ignore whitespace while generating the hash, while at the same time tracking the actual line numbers for purposes of merging and reporting. In my experience, most code-copiers are dumb and/or lazy -- to get past ESR's tool, the code-copier would have to (a) realize that they're violating a license, (b) not care, (c) be smart enough to realize that a pure cut-and-paste might get caught, and (d) energetic enough to munge up the code logic and variables. While I'm sure there are people like that, I would argue that most of them wouldn't be interested in contributing the result to the community, and the code wouldn't get past Linus if they did. The more logical case is some one/company who believe that they have a legitimate right to copy code from one kernel to another (BSD -> Linux / Linux -> SysV / SysV -> Linux) and thus not feeling the need to cover things up. Either of the SCO User Group examples would fit this category.

    --
    We call it art because we have names for the things we understand.
  60. I've seen this before from ESR... by Dr.+Smeegee · · Score: 2, Informative

    He developed a Callcenter Training Utility for our company in the early 80's. It used genetic algorithms to generate simulated customer complaints that were _very_ realistic, even to the point of using sample voices to "whine". Of course, the helpdesk trainees hated it...

    But hey, the mewling was featureful.

  61. Nobody has mentioned this yet ... by Mostly+a+lurker · · Score: 3, Interesting
    As currently designed, Shred would obviously not defeat deliberate source misappropriation. If (big if) the method could be adapted such that it could not be easily fooled by a determined violator (and without revealing how the code works) then I believe registration of the results should be required by law. BUT ...

    In order that the method should not be fooled by simple changes, at least the following is required

    * White space must be ignored

    * Comparison must be at the statement level, not the code line level

    * Variable names must be replaced by standard placeholders

    * Routine names, other than standard library calls, must be replaced by standard placeholders

    * (Probably difficult) logic will be needed in the tool to detect and ignore noops: how do you deal with

    i++;
    %include noop.i;
    a[i]=b[i];

    The trouble is: a high proportion of the code sections thus simplified will fall into a relatively small number of possibilities, vulnerable to dictionary type attacks. Thus, most of the code could be reconstructed, though admittedly as obfuscated source code. IMHO this provides a valid objection to its use.
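
    The dictionary attack mentioned above can be sketched in a few lines of Python. Everything in it is an assumption made for illustration: the six normalized statement templates (with V standing for any variable), the two-statement shred size, and the helper names. A real attack would need a far larger template list.

```python
import hashlib
from itertools import product


def md5_hex(text):
    """Return the MD5 hex digest of a string."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


# Hypothetical normalized statements; V is the placeholder for a variable.
TEMPLATES = ["V++;", "V--;", "V=V;", "V[V]=V[V];", "return V;", "V+=V;"]


def build_dictionary(shred_size):
    """Map the digest of every possible template sequence back to its text."""
    table = {}
    for combo in product(TEMPLATES, repeat=shred_size):
        text = "\n".join(combo)
        table[md5_hex(text)] = text
    return table


# Pretend this digest was published for a "secret" normalized shred.
published = md5_hex("V++;\nV[V]=V[V];")

# The attacker never sees the source, only the digest, yet recovers the text.
recovered = build_dictionary(shred_size=2).get(published)
print(recovered)
```

    With six templates and two-statement shreds the table has only 36 entries. The table grows exponentially with shred size, which is why the attack depends on the normalized vocabulary being small.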

  62. No Trade Secrets in Registered Copyrights by Iparadox · · Score: 2, Interesting

    I think we are all missing a big point here. SCO registered their copyright in SysV. It was hard to do. They had to create a copy of the source code and file it with the Patent and Trademark Office. That puppy is there so that *we* can look at it. This is specifically *fair use*. It is there so that individuals can protect themselves by comparing what they have to what has been registered. No match means no problem. It just doesn't get any more *fair use* than that. Just have somebody nip up to the PTO, copy the registration for comparison purposes only, (really!) then do the comparisons. How hard is that? Yes, IAAL, but this is not legal advice. Hire your own mouthpiece.

  63. Doesn't even compile by tvm662 · · Score: 3, Informative

    Has anyone else tried to compile Eric's code?

    >gcc --version
    2.95.3

    >make
    /usr/bin/gcc -c -g main.c
    main.c: In function `report_time':
    main.c:311: parse error before `int'
    main.c:312: parse error before `int'
    main.c:316: `buf' undeclared (first use in this function)
    main.c:316: (Each undeclared identifier is reported only once
    main.c:316: for each function it appears in.)
    main.c:317: `minutes' undeclared (first use in this function)
    main.c:317: `seconds' undeclared (first use in this function)
    make: *** [main.o] Error 1

    Looks like Eric has been coding too much C++ or something. I'm not a C coder myself, so I might be wrong, but don't you have to declare all the variables at the top of a block of C code, before any statements? In report_time, he doesn't seem to have followed that rule. Maybe he might check his code on a number of compilers before declaring he has "perfected it".

    Eric here's my patch:

    --- main.c 2003-09-10 00:28:37.000000000 -0300
    +++ main.c.fixed 2003-09-10 00:29:55.000000000 -0300
    @@ -306,12 +306,17 @@

    if (mark_time)
    {
    - int elapsed = endtime - mark_time;
    - int hours = elapsed/3600; elapsed %= 3600;
    - int minutes = elapsed/60; elapsed %= 60;
    - int seconds = elapsed;
    + int elapsed;
    + int hours;
    + int minutes;
    + int seconds;
    char buf[BUFSIZ];

    + elapsed = endtime - mark_time;
    + hours = elapsed/3600; elapsed %= 3600;
    + minutes = elapsed/60; elapsed %= 60;
    + seconds = elapsed;
    +
    va_start(ap, legend);
    vsprintf(buf, legend, ap);
    fprintf(stderr, "%% %s: %dh %dm %ds\n", buf, hours, minutes, seconds);

  64. Useful tool by Julian+Morrison · · Score: 2, Insightful

    I can see this tool becoming helpful for so much more than smashing SCO. Any situation where data comparison is useful, but the data itself must remain secret. All paranoid types (corporate or governmental) will love it. Lawyers could make much use of it.

    And, given the dataset it generates, it could be extended to do other useful things such as detect redundant or cut-'n-pasted code, including bugs of the "pasted it in twice" sort.

  65. Re:MD5 easily fooled by dmiller · · Score: 4, Interesting

    So, you've downloaded Comparator, and run tests, then.

    I didn't need to, the following is in the readme:

    comparator does not attempt to do semantic analysis and catch relatively trivial changes like renaming of variables, etc. This is because comparator is designed not as a tool to detect plagiarism of ideas (the subject of patent law), but as a tool to detect copying of the expression of ideas (the subject of copyright law).

    He's wrong BTW (and he is smart enough to know it, which makes this a deliberate deception). A work is no less subject to copyright if someone does a global search and replace on a variable name.

  66. Re:Who says ESR can't code? by joeytsai · · Score: 4, Informative

    Actually fetchmail proves that he can code.


    You may want to check out "The Emperor Has No Clothes", a look at ESR's real code contributions.
    --
    http://www.talknerdy.org
  67. Better yet, a reason to get MS to stop funding SCO by isn't+my+name · · Score: 2, Interesting

    Actually, combine this with the "shared source" program from MS and it would be easy to see if MS did (or did not) copy GPL code into Windows as some suggest.

    More importantly, get something like this accepted in a court of law as a legitimate way to do an initial assessment of code yet still preserve a litigants right to code privacy, and you are going to have not just MS but a number of big companies shaking in their boots. Not necessarily because they did steal anything but because they have to realize it is a possibility that one of their coders did without company knowledge. Doesn't matter, they are still liable.

    But, get a method like this accepted in a court of law and you are going to see it used again. I think this has a huge potential to hurt closed software. And perhaps a potential to convince MS to stop funding SCO, perhaps even to apply pressure to get them to start backing down.

  68. I found out myself by jtheory · · Score: 2, Informative

    Okay, here it is (from the man page):

    comparator works by first chopping the specified trees into overlapping shreds (by default 3 lines long) and computing the MD5 hash of each shred.

    (Emphasis added)
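
    For anyone who wants to see the idea in code, here is a minimal Python sketch of overlapping shreds plus a comparison step. It follows the man page sentence quoted above but is not comparator's implementation: the function names, the toy input and the use of a plain dictionary as the index are all assumptions.

```python
import hashlib


def shred(lines, shred_size=3):
    """Yield (first_line_number, md5_hex) for each overlapping shred."""
    for start in range(len(lines) - shred_size + 1):
        chunk = "\n".join(lines[start:start + shred_size])
        yield start + 1, hashlib.md5(chunk.encode("utf-8")).hexdigest()


def common_shreds(lines_a, lines_b, shred_size=3):
    """Return sorted (line_in_a, line_in_b) pairs whose shreds are identical."""
    # Index the first tree by digest; only digests are needed from here on.
    index = {}
    for lineno, digest in shred(lines_a, shred_size):
        index.setdefault(digest, []).append(lineno)
    matches = []
    for lineno_b, digest in shred(lines_b, shred_size):
        for lineno_a in index.get(digest, []):
            matches.append((lineno_a, lineno_b))
    return sorted(matches)


# Two toy "source trees", each a single five-line file.
tree_a = ["int i;", "i = 0;", "while (i < n)", "    i++;", "return i;"]
tree_b = ["/* copied */", "i = 0;", "while (i < n)", "    i++;", "return 0;"]

# Lines 2 to 4 are the same in both, so one shred matches.
print(common_shreds(tree_a, tree_b))  # prints [(2, 2)]
```

    Because the shreds overlap by all but one line, a copied run longer than the shred size shows up as several consecutive matches, which a report can merge into one range.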

    --
    There are only 10 types of people: those who understand decimal, those who don't, and, uh, 8 other types I forget.
  69. Re:Slightly less lazy by dspeyer · · Score: 2, Funny
    The hashes would reveal the code in question, because they could be compared to the Linux hashes (appropriate version) and then the relevant lines read out of the Linux code. Those lines will now be revealed to have passed SCO's elaborate QA procedures [trying to keep straight face] and this valuable trade secret will be breached. After all, no one would want to use grubby hobbyist-written code [keeping straight face is getting harder] but if they knew it was good enough for SCO, they'd all jump on it, and SCO's unique virtues would be spread to all their competitors!

    It makes as much sense as anything SCO's said.

    Hey, I just realized -- I'm typing this. I don't have to keep a straight face!

  70. Re:What a weird tool by dazk · · Score: 2, Informative

    Eric's tool lets you compare larger and smaller chunks. Single lines will easily match very often. Single lines are not the problem. The problem always lies in a sequence of lines. That's why you need overlapping sequences.

  71. Looks like "fair use" to me by Anonymous Coward · · Score: 2, Informative

    I don't know if the MD5 sums are a derivative work of the original source or not, but I would be inclined to think that they are.

    Let's look at what the law says about fair use

    Fair Use

    The four factors are: (1) the purpose and character of the use, including whether such use is of commercial nature or is for nonprofit educational use; (2) the nature of the copyrighted work; (3) amount and substantiality of the portion used in relation to the copyrighted work as a whole; and (4) the effect of the use upon the potential market for or value of the copyrighted work.

    It looks to me that under part (1), the MD5sums are a form of commentary or news reporting about the original work, not a replacement for the work. I don't know about (2). Under (3), the "amount" is definitely small, and the "substantiality" is low. And under (4), almost nobody who would buy the original work is going to substitute the MD5sums instead, so the MD5sums would have nil effect on the market for the original work.

    So in my AC-IANAL opinion, distribution of the MD5sums would be protected under American copyright law as a "fair use".

  72. Possible improvements by gonvaled · · Score: 2, Interesting

    A lot of comments focus on the problems that a global search and replace will pose to the technique. I think we can improve the algorithm by doing the following:

    What we are looking for here are pieces of code with the same structure: the same for loops, while loops, variable assignment, function names, and so on. The idea would be to substitute all literals by a standard placeholder, and then generate the md5 checksums on the block level (as somebody has previously suggested).

    To be able to cheat this technique, a modification in the structure of the code is required. And in the case that exactly that has been done, it is arguable whether that can be considered copyright infringement.
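
    A rough Python sketch of the placeholder idea follows. The tokenizer regex, the abbreviated keyword list and the placeholder names ID, NUM and STR are all assumptions for illustration. It handles only simple C-like lines and makes no attempt to keep standard library names distinct from other identifiers.

```python
import re

# Abbreviated list of C keywords that must survive normalization.
C_KEYWORDS = {
    "for", "while", "do", "if", "else", "switch", "case", "break",
    "continue", "return", "int", "char", "long", "void", "struct",
}

# String literal, identifier, integer, or any other single character.
TOKEN = re.compile(r'"(?:\\.|[^"\\])*"|[A-Za-z_]\w*|\d+|\S')


def normalize(line):
    """Replace identifiers, numbers and string literals with placeholders."""
    out = []
    for tok in TOKEN.findall(line):
        if tok.startswith('"'):
            out.append("STR")
        elif tok[0].isdigit():
            out.append("NUM")
        elif (tok[0].isalpha() or tok[0] == "_") and tok not in C_KEYWORDS:
            out.append("ID")
        else:
            out.append(tok)  # keywords, operators and punctuation stay as-is
    return " ".join(out)


# Same structure, every name changed by search and replace.
a = "for (i = 0; i < count; i++) sum += data[i];"
b = "for (k = 0; k < len; k++) acc += src[k];"

print(normalize(a))
print(normalize(a) == normalize(b))
```

    Hashing the normalized text instead of the raw text means a rename no longer hides the match, at the cost of more coincidental matches between unrelated code that happens to share a shape.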

  73. Re:MD5 easily fooled by ESR · · Score: 2, Informative

    But for changes of that kind, the burden of proof
    is heavy and on the party alleging infringement.
    The comparator tool isn't designed to try to catch
    such deliberate obfuscation, because that would get
    into murky territory near the boundary of expression and idea. Did you really think I failed to study the legal questions before I wrote this?

    --
    >>esr>>
  74. Test SCO Linux Kernel Personality by LightSail · · Score: 2, Interesting

    The best use of this technology would be to test the SCO LKP for stolen Linux code.
    Confirming that SCO had incorporated Open Source code that they had access to under the GPL would destroy their credibility and open them up to countersuits. The process would only have to reveal enough similarities to have subpoenas ordered for the actual code involved. Then we could prove the theft with SCO's own source code.

    I suspect that those who know that Linux code was used to create LKP would come forward once the code has been discovered and posted for all to see.

  75. Ain't this JUST like the Open Source Movement... by zenofjazz · · Score: 2, Funny

    Even a potential Lawsuit is just another reason to write grooooovy software.. *evil grin*

    GO ESR!!

    --
    -- All That's Evil in the Geek Space ... Allthatsevil.wordpress.com
  76. Idiocy... by poptones · · Score: 2, Insightful
    People like you who try to mush it all up are just trying to loot other people's property.

    Did your momma have any children that learned to think?

    Source code gets no copyright protection: corporations keep their source as a "trade secret" and only get protection on the executable. It is illegal to redistribute (copy) the executable, and the source is entirely within their control (and their responsibility). No real "furtherance of the arts" is accomplished except within the limited scope of usage of the tool itself. If a work is infringed at the source level, therefore, it is (nearly) impossible to prove without revealing "trade secrets" and, therefore, exposing the company to further risk.

    Source code gets copyright protection (as constitutionally mandated)

    Corporations have to register the source code, and therefore are given full protection on both works. It is just as illegal to redistribute (share) the source beyond the scope allowed by the rights holder, and if a work is infringed there is no risk to the rights holder in defending the work. "Furtherance of the arts" is addressed, as well as the rights of the work's creator.

    Corporations are allowed "copyright" on works they do not share.

    It becomes nearly impossible for libeled parties to defend themselves, but "rights holders" are free to make claims as they see fit. Which gives "rights holders" basically free rein to make accusations which they may never be forced to address in court, and leaves victims nearly defenseless until the (very slow) court gets around to addressing the issue. Neither "furtherance of the arts" nor protection of (libeled) rights holders is served, since the more powerful party remains free to withhold (copyrighted) "evidence" that no one is allowed to see.

    How does this system serve rights holders whose works may have been infringed upon, but are forced from the marketplace by another "rights holder" with more money? How does that system serve the public interest? How does it promote progress?

    Can you answer any of these questions using sound logic?