ESR to Shred SCO Claims?
webmaven writes "According to this article in eWEEK, ESR has released a utility called comparator for analyzing the similarity of source code trees. The technical details are interesting, in that ESR says he is using an implementation of a refined version of the 'shred' algorithm, with higher performance (on machines with enough RAM) than other versions. ESR won't say whether he intends the comparator to be used to compare older Unix code to Linux so as to be able to refute SCO's claims, but it's obviously well suited for such a purpose. Interestingly, as the shred algorithm can run reports on source trees using only the MD5 signature shreds (once generated), it is possible to use it to compare trees without direct access to the source code itself, leading to a possible use in comparing various proprietary source trees with each other and with Freely available code bases such as Linux and *BSD without requiring actual disclosure of the proprietary source code (a neutral third party could generate the shreds on a company's premises, and leave without taking a copy of the source with them). I'll be interested to see if (or which of) the proprietary vendors allow their source trees to be 'shredded' for such comparisons, and whether this becomes a standard forensic technique in source-code copyright and trade-secret disputes."
microsoft can just shred their source tree and start anew. maybe...
My problem? I was perfectly gruntled, until some numbnuts came by and dissed me.
Of course, we can just trust SCO to show the right hashes. Why would they lie?
Whoever fights monsters should see to it that he does not become a monster in the process. --Nietzsche
If there is, why couldn't MD5 shreds be used as a lossy compression scheme for code?
Go here to create your own Slashdot dis
This will only serve as another black eye on the Open Source community. ESR should know better than to shred SCO material prior to a trial.
Did he write it in Python? And did he complete it in under 6 hours?
... then the number of lawyers it'll retire.
Words to men, as air to birds.
I think the question here is not about whether there is common code between SCO and Linux. There is no doubt that there will be common code because of the common origins. The issue here is that SCO does not own that code.
This shouldn't be relied upon in the court of law. Although I acknowledge that SCO likely has no IP claim over Linux, it should have a fair case. A program that would rule out code similarities does not rule out code that is based on the SCO code. There are hundreds of ways to do a single thing, and if the GNU/Linux took ideas from the SCO kernel, SCO may be as eligible for compensation as if it were directly copied from SCO.
And why did you staple the trout to the RAM?
The truth is out there, we will finally get to it without signing a SCO NDA. This should end the case before it begins. SHRED ON!
"Curiosity killed the cat, but for a while I was a suspect."- Steven Wright
This just in. SCO to sue ESR for patent infringement over "comparator", a software package that performs comparison between different sets of source code to determine if any code is copied between them.
Fun with Anagrams! LADS HOST, SHALT DOS. HAS DOLTS. AD SLOTHS, HATS SOLD. ASS HO, LTD.
And I just got rid of my paper shredder...
Without music, life would be a mistake. --- Nietzsche
"...has two advantages: one, it's amazingly fast..."
Guess not. ;-)
If you're comparing two sets of code via their MD5 sums, then won't that miss matching lines that differ by even one character - like, say, a space?
----
Not to be confused with Col.
Does this mean I bought a SCO license in vain?!
It might be interesting to see how different families of Linux/Unix compare... maybe generate a veritable "family tree" of relationships.
Of course, that also depends more on how differences are actually calculated. Still, could make an interesting project to relate OSes based on how much shared code they still retain and show it in a graphical tree format, ala "family tree." 8)
Diplomacy is the art of saying, "Nice doggie!" until you can find a rock.
ESR shows us once again why exactly he has so much respect from the community. Well done, that man.
I can't decide which is funnier - the point about IBM orchestrating all the outrage, or the point that SCO is somehow more "relevant" to the tech community because they've filed a bunch of press releases! SCO has committed the most vile of sin. Ok, I'll stop now.
This Comment was generated with the Comment-O-Matic for SCO Stories.
--
est modus in rebus
Anyway -- who cares? There's no question there are plenty of common chunks between Linux and SCO-owned source. And that there are ways to find them. The question is what they are (which SCO isn't saying) and what their common origin is and where that origin falls in the murky history of the Unix codebase. It's not as if anyone has been saying, "We're helpless in the face of this computational problem. If only there were a way to compare large bodies of text for common elements!"
Never mind that there are probably people who can compare both codebases in their heads.
Maybe he's made some major algorithmic breakthrough. (I doubt it, but I'll leave that to the experts.) But this story is just him yapping again.
What I'm listening to now on Pandora...
ESR is ok you know, but lately he has just been doing lots of ranting and soapboxing and no hacking.
Finally he comes out with some hack action. About time man, I was beginning to view him as just some big windbag who hacked a little back in the day. Well I still sorta do, but this is at least pretty cool, you know.
Article text follows:
SCO may not know origin of code, says Australian UNIX historian
By Sam Varghese
September 9, 2003
More doubts have been cast on the heritage of System V Unix code, which the SCO Group claims as its own, by an Australian who runs the Unix Heritage Society.
Dr Warren Toomey, now a computer science lecturer at Bond University, said today: "I'd like to point out that SCO (the present SCO Group) probably doesn't have an idea where they got much of their code. The fact that I had to send SCO (the Santa Cruz Operation or the old SCO) everything up to and including Sys III says an awful lot."
He said that even though SCO owned the copyright on Sys III, a few years ago it did not have a copy of the source code. "I was dealing with one of their people at the time, trying to get some code released under a reasonable licence. I sent them the code as a gesture because I knew they did not have a copy," he said with a chuckle.
Dr Toomey's statements come a few days after Greg Rose, an Australian Unix hacker from the 1970s, raised the possibility that there may be code contributed by people, including himself, which has made its way into System V Unix and is thus being used by companies like the SCO Group.
Dr Toomey said this was one reason why the code samples which the SCO Group had shown at its annual forum had turned out to be widely published code.
SCO was unaware of the origins of much of the code and this "explains how they could wheel out the old malloc() code and the BPF (Berkeley Packet Filter) code, not realising that both were now under BSD licences - and in fact they hadn't even written the BPF code," Dr Toomey said.
He said that there was lots of code which had been developed at the University of New South Wales in the 70s which went to AT&T and was incorporated into UNIX without any copyright notices.
"At that time the development that was going on was similar to open source - the only difference was that the developers all had to have copies of the code licensed from AT&T," he said.
Dr Toomey, who served 12 years with the Australian Defence Force Academy, an offshoot of the University of New South Wales, before joining Bond University, said he had source code for Unices from the 3rd version of UNIX which came out in 1974 to the present day. "I don't have Sys V code but there are people with licences for that code who are members of the Unix Heritage Society. We can compare code samples any time," he said.
He agreed that the codebase of Sys V was a terribly tangled mess. "It is very difficult to trace origins now. There is an awful lot of non-AT&T and non-SCO code in Sys V. There is a lot of BSD code there," he said.
In March, the SCO Group filed a billion-dollar lawsuit against IBM, for "misappropriation of trade secrets, tortious interference, unfair competition and breach of contract."
SCO also claimed that Linux was an unauthorised derivative of Unix and warned commercial Linux users that they could be legally liable for violation of intellectual copyright. SCO later expanded its claims against IBM to US$3 billion in June when it said it was withdrawing IBM's licence for its own Unix, AIX.
IBM has counter-sued SCO while Red Hat Linux has sued SCO to stop it from making "unsubstantiated and untrue public statements attacking Red Hat Linux and the integrity of the Open Source software development process."
-----
Wordforge writing contest now open: deadline 2003-03-28
a world in progress...
Most people *I* know consider ESR to be a bloated windbag with a penchant for fanatical gunrights. He's regarded as pretty much being on the same level as the late Jon Katz.
It's "than" not "then".
They're different words you know.
The more points you discover and disprove now in SCO's claims, the higher quality, more refined, and more detailed SCO's evidence will be when this setup finally gets to a court in front of a judge. If they had gone to court two months ago or even today, they would have been sent home quickly with basically easy-to-disprove evidence. With the help of the open source community, they are slowly changing their weapon of choice from a shotgun to a rifle.
Bad boys rape our young girls but Violet gives willingly.
Bruce Perens:
Three. Three. And we'd better not risk another frontal assault. Their legal team is dynamite.
Linus:
Would it help to confuse it if we run away more?
Bruce Perens:
Oh, shut up and go change your firewall!
Alan Cox:
Let us taunt it! Darl may become so cross that he will make a mistake.
Bruce Perens:
Like what?
Alan Cox:
Well... ooh.
ESR:
Have we got bows?
Bruce Perens:
No.
ESR:
We have the Holy Hand Grenade.
Bruce Perens:
Yes, of course! The Holy Hand Grenade of Antioch! 'Tis one of the sacred relics Brother Richard carries with him.
Brother Richard! Bring up the Holy Hand Grenade!
MONKS: [chanting]
Pie Iesu domine, dona eis requiem.
Pie Iesu domine, dona eis requiem. Pie Iesu domine, dona eis requiem. Pie Iesu domine, dona eis requiem.
Bruce Perens: How does it, um-- how does it work?
ESR:
I know not, my liege.
Bruce Perens:
Consult the Book of Armaments!
RMS:
Armaments, chapter two, verses nine to twenty-one.
OPEN SOURCE ZEALOT:
And Saint Attila raised the hand grenade up on high, saying, 'O Lord, bless this Thy hand grenade that, with it, Thou mayest blow Thine enemies to tiny bits in Thy mercy.'
And the Lord did grin, and the people did feast upon the lambs and sloths and carp and anchovies and orangutans and breakfast cereals and fruit bats and large chu--
RMS:
Skip a bit, Brother.
OPEN SOURCE ZEALOT:
And the Lord spake, saying, 'First shalt thou take out the Holy Pin. Then, shalt thou count to three. No more. No less. Three shalt be the number thou shalt count, and the number of the counting shall be three. Four shalt thou not count, nor either count thou two, excepting that thou then proceed to three. Five is right out. Once the number three, being the third number, be reached, then, lobbest thou thy Holy Hand Grenade of Antioch towards thy foe, who, being naughty in My sight, shall snuff it.'
Richard:
Amen.
KNIGHTS:
Amen.
Bruce Perens:
Right!
One!... Two!... Five!
Alan Cox:
Three, sir!
Bruce Perens:
Three!
[sco dies]
Upper management in most enterprises has a low level of technical knowledge. To them the thought of something called shredding coming anywhere near the 'voodoo' of software development would be abhorrent.
What I see from the article is that it can only compare whether two code snippets are exactly alike (which makes sense from the standpoint of MD5 - they're really only useful for equality checks) - and from the claims that are being thrown around about obfuscating the supposedly legal code, that isn't going to help much of anything.
[evilscobox]
Duh!
Putting the sig back into +1, Insightful since 1995!
KDE GUI version should be called Krang since Shredder would obviously be used from the command line (shell). Maybe it should have helper apps called Bebop and Rocksteady. And if the need should arise, the project shouldn't fork...it should splinter.
By computing MD5 hashes of consecutive (overlapping) line triplets, the shred algorithm makes it easy to identify copied code, without ever seeing the actual code. This might be a perfect way for companies to allow a third party to compare code, without giving away any trade secrets in the process.
Of course, since MD5 is a very good cryptographic hash function, *any* one-bit change in the source will result in, on average, half of the bits in the result being flipped. So, this method of identifying copied code would only work if the code had never been run through an obfuscator. It would also be defeatable by running the source through a script to have its variable names search-and-replaced with similar names (such as replacing every variable name with a new name consisting of the old name plus "_newname")....
In short, this might be a useful technique for allowing a third party to look for trivial wholesale copying of code, but it would be useless for finding a motivated miscreant, determined to steal code without being caught.
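For anyone who wants to see the idea in a few lines, here is a minimal Python sketch of the triplet hashing described above. It is an illustration only, not ESR's code: the function names (shreds, common_shreds), the sample lines, and the choice to join lines with a newline before hashing are all my own assumptions.

```python
import hashlib

def shreds(lines, window=3):
    """Map the MD5 digest of each overlapping window of lines to its first line number."""
    out = {}
    for i in range(len(lines) - window + 1):
        chunk = "\n".join(lines[i:i + window])
        digest = hashlib.md5(chunk.encode("utf-8")).hexdigest()
        out.setdefault(digest, i + 1)  # keep the first occurrence only
    return out

def common_shreds(a, b):
    """Return (line_in_a, line_in_b) pairs for digests present in both shred maps."""
    return sorted((a[d], b[d]) for d in a.keys() & b.keys())

# Hypothetical sample data: two tiny "trees" sharing one three-line run.
tree_a = ["int x;", "x = read();", "x += 1;", "write(x);", "return x;"]
tree_b = ["setup();", "x = read();", "x += 1;", "write(x);", "done();"]

matches = common_shreds(shreds(tree_a), shreds(tree_b))
print(matches)  # [(2, 2)]
```

Note that common_shreds only ever touches the digests, so the party doing the comparison needs the shred maps and not the source itself.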
What do we get? It's like SCO is holding a handgrenade and people are slowly moving away from the madman. Shhh! You are breaking my concentration! I'm trying to shed a bitter tear for them. You mean this whole lawsuit thing is for real?
This Comment was generated with the Comment-O-Matic for SCO Stories.
Linux: Free if your time is worthless.
This will also find all the places SCO has violated the terms of other organizations' IP...
Downside: Uh... it just came out... and it's making some big, big claims involving fuzzy logic. I think it's gonna need some testing first, eh?
Also, anybody else think it only works on larger sections of code than just say 10 lines?
SIG: HUP
I'm just glad that I finished college before they had this technology otherwise I might have been caught for cheating. Although I was really good at renaming variables.
SCO is going to get its corporate head handed to it on a platter, and I hope that the courts allow the corporate veil to be pierced so that McBride and company have to bear the cost of their misdeeds personally (and not just the duped stockholders).
Scientists restrict study to entire physical universe; creationist
Thanks ESR. You've just put a team of mathematicians at SCO who were somehow related to MIT out of their jobs.
A simple re-indentation or a variable change would fool comparator. What someone needs to do is to implement a parse tree comparison tool which would be able to compare files on a semantic level.
When I saw the headline 'shred SCO', 'from the woodchipper department' a vision of a corporate purge 'Fargo Style' popped into my head.
__ Someday, but not this morning, I'll finally learn to use the preview button.
Great. So cool. And so stupid.
First, IBM, Sequent, SGI and Linux wouldn't be off the hook if the provenance of each line of code were proven to have come from other sources. There are a number of trade secret issues that still could crop up.
But let's assume that Raymond's work was actually run on the SCO source and on Linux. Would the results be meaningful?
No.
Suppose I have a routine that comes originally from source B. I work for a company which has the right to copy B, but which redistributes the results of its work under a closed license. Call that new source S. It so happens that the code my company got from B had a nasty bug in it, and I spent a month finding a fix for that bug. Suppose also that the fix is quite small relative to the original code, as is usually the case. A shredder is going to find significant similarities between that routine as implemented in source B and in source S. Now, suppose source L comes along. The authors of L had the right to copy from B, but not from S. They have a very similar routine, originally derived from B. After shredding, the routines in B, S, and L will all look similar -- but whether there's an infringement between S and L will depend solely on a tiny fragment of the code. Without disclosing that fragment, there is no way to determine if there's an infringement or not.
Chances are slim to none that a software company would allow its "shredded" source code to be publicly released. What happens if the proprietary source is found to violate the GPL?
Proprietary (closed) source companies have a tremendous advantage over open source software when it comes to violating intellectual property. Who will ever know if they did it? A source code "comparator" eliminates that crucial advantage.
While I fully support ESR and the rest of the open source movement's defense of Linux against SCO, I have a feeling that this tool's results will not immediately be accepted by established media simply because of ESR's bias. A reporter looking into the SCO story who knows little about open source wouldn't trust a tool made by one side of the disagreement.
It seems very important to me that "third parties" and experts who are not an integral part of the open-source movement validate that comparator works as intended and is effective at detecting code similarities. Hopefully we'll see some articles on respected sites in the next week or so with conclusive analyses of comparator. Not to mention a chance for someone to use it on SCO's code!
Oh, and "Yes, I'm being deliberately vague and tantalizing" is quite funny.
This is perhaps a better project and it would be interesting to see this tool run against the source.
History Flow. The following is from their website:
history flow
visualizing dynamic, evolving documents and the interactions of multiple collaborating authors:
Motivation
Most documents are the product of continual evolution. An essay may undergo dozens of revisions; source code for a computer program may undergo thousands. And as online collaboration becomes increasingly common, we see more and more ever-evolving group-authored texts. This site is a preliminary report on a simple visual technique, history flow, that provides a clear view of complex records of contributions and collaboration.
Fantasy remains a human right; we make in our measure and in our derivative mode... -- JRR Tolkien
Shut up Darl
Well, I was looking at ESR's description of the code (I haven't read the code yet), and it seems to say that he takes 3 line slices, MD5s them, then compares them for identical points. I'm sure he compensates for funky whitespace and whatnot like diff and patch do...
But if even one bit of the source is different, the MD5 hash will be quite different. So, the code slices have to be IDENTICAL. This is not a very good system because a simple find-replace could defeat it. A variable's name changed by one letter, or even capitalization, will defeat it.
Unless the code reveals much more complex tricks than ESR describes in the help file, this tool wouldn't be much use in the SCO case. Hell, it wouldn't be much use catching college class cheaters even.
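To make the whitespace versus renaming point concrete, here is a small Python sketch. It assumes a simple normalization step (collapse whitespace, drop blank lines) before hashing three-line windows; whether comparator actually does this is not something I have verified, and the names normalize and shred_set are made up for the example.

```python
import hashlib
import re

def normalize(line):
    """Collapse runs of whitespace to one space and trim both ends."""
    return re.sub(r"\s+", " ", line).strip()

def shred_set(lines, window=3):
    """Return the set of MD5 digests over windows of normalized, non-blank lines."""
    cleaned = [normalize(l) for l in lines]
    cleaned = [l for l in cleaned if l]  # blank lines carry no information
    return {
        hashlib.md5("\n".join(cleaned[i:i + window]).encode("utf-8")).hexdigest()
        for i in range(len(cleaned) - window + 1)
    }

# Hypothetical samples: same loop, re-indented, then with one variable renamed.
original = ["for (i = 0; i < n; i++) {", "    total += buf[i];", "}"]
reindented = ["for (i = 0;  i < n; i++) {", "\t\ttotal += buf[i];", "", "}"]
renamed = ["for (j = 0; j < n; j++) {", "    total += buf[j];", "}"]

survives_reindent = bool(shred_set(original) & shred_set(reindented))
survives_rename = bool(shred_set(original) & shred_set(renamed))
print(survives_reindent, survives_rename)  # True False
```

So cheap normalization handles re-indentation, but a one-letter rename still changes every digest that covers the renamed line, which is exactly the weakness described above.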
Slashdot. It's Not For Common Sense
Do I smell a Court Order in the works? If you really can do this without divulging the original code, then someone could conceivably convince a judge to issue an order to have a "neutral third party" create the MD5 sums on SCO's codebase, giving us a chance to look for pirated GPL code hidden inside of SCO proprietary products, without having to look directly at the SCO code.
Your Servant, B. Baggins
check out this research project coming out of berkeley CAP
Drop in the code you are interested in and it will tell you where its found in a bunch of open source stuff, including the linux kernel.
SCO claims that Opensource is weak because nobody will warrant that the code does not violate somebody's IP rights.
However, what about SCO's products and other proprietary code? Does SCO warrant that UNIX does not violate somebody else's IP rights?
If not, then any current SCO customer should be very afraid of the outcome of this lawsuit, because during the trial we'll almost certainly find that SCO accidentally or purposely copied GPL code into UNIX. By SCO's own logic, this means that all customers must immediately stop using the product.
For this reason, SCO should be very reluctant to let anyone use ESR's new tool, or even plain old eyeballs to examine the code.
It gives software houses a way of publishing commercial code for copyright purposes. If you claim copyright on code, you can publish the MD5 shred sigs for the code. No one can rip you off, but you can enforce your rights in a court.
Even better - no one now has an excuse for not publishing. That means that we can make sure the kernel never comes within spitting distance of anyone else's property again. And if it does - well they should have published.
Now if SCO aren't willing to publish their MD5 shreds, then that can only be because they have no case. In which case - game over!
On the other hand, if they do, the world at large can then go through their published shreds and see exactly whose code SCO have been ripping off. Given the likely origins of those samples they exhibited a while back, I'd say that's likely to be quite a bit.
This looks like the best news for the war against everyone's favourite Stupidly Corrupt Organisation since the whole mess kicked off.
Don't let THEM immanentize the Eschaton!
Think of the chance that any given line of source code in an arbitrary program is repeated somewhere else in a large open source program such as the Linux kernel. This is even more true if some degree of fuzziness is added to handle changes such as adding or removing spaces in insignificant places or removing comments (and there are many other things, like brace style, which affect multiple lines, so you might want to physically reformat to a standard format).
If even only 1% of the lines are found somewhere in the open source code base, I think a source who wants to keep their code base secret will have a big problem with someone computing the checksums. In reality, I wouldn't be surprised to see a much higher percentage of lines leaked this way. And this is not the only way leaking can occur (think of application of simple cryptography).
I would not want to be the one publishing the checksums of the closed source due to possible legal liability. The checksums are a derived work in any case.
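Here is a rough Python sketch of the leak I mean. It assumes, purely for illustration, that checksums are published per line; the same lookup works for multi-line windows. The names md5_line and recover_lines and all the sample data are hypothetical.

```python
import hashlib

def md5_line(line):
    """MD5 digest of a single line with surrounding whitespace stripped."""
    return hashlib.md5(line.strip().encode("utf-8")).hexdigest()

def recover_lines(published_hashes, public_corpus):
    """Look each published hash up in a table built from known public lines."""
    lookup = {md5_line(l): l.strip() for l in public_corpus}
    return [lookup.get(h) for h in published_hashes]  # None means not recovered

# Hypothetical closed source; only its checksums are published.
secret_source = ["#include <stdio.h>", "static int secret_key = 0x5eed;", "return 0;"]
published = [md5_line(l) for l in secret_source]

# Hypothetical public code base that an outsider can hash freely.
public_corpus = ["#include <stdio.h>", "int main(void)", "return 0;", "}"]

recovered = recover_lines(published, public_corpus)
print(recovered)  # ['#include <stdio.h>', None, 'return 0;']
```

Every line that also appears in public code is revealed verbatim, together with its position in the secret file; only the genuinely unique lines stay hidden.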
From ESR's README:
Besides the production C code, the distribution also includes working Python versions. These were used to prototype the concept.
So the answer to your question is yes and no.
SCO was unaware of the origins of much of the code and this "explains how they could wheel out the old malloc() code and the BPF (Berkeley Packet Filter) code, not realising that both were now under BSD licences - and in fact they hadn't even written the BPF code," Dr Toomey said.
The SCO Group (not old SCO) hasn't written any code in SysV UNIX.
Anyway.. One could hope that when this is all over, the UNIX sources will be bought up from the carcass of SCOX and open-sourced, finally putting it out of its misery..
That is, as long as SysV UNIX doesn't have more stolen code in addition to the BSD code we all know about..
The sooner the zombie of UNIX is put to rest, the better for all the live Unices.
That's a legitimate question. If the CS students that code Linux have learned anything it's how to obfuscate someone else's code to avoid getting caught cheating. Hell, I do it all the time when I don't want to write a stupid program for class. Obfuscation is an artform.
Isn't that the command you have to use to delete kiddie porn when the fbi comes knockin?
It's a tossup between ESR and Perens as my favorite Open Source advocates. Perens is funny and answers emails and offers to help anybody out. ESR likes guns and writes extremely well. They both have some big brass ones, leading the fight against SCO.
As for Stallman... I think he's still bitching about Debian and their not completely, totally, 100% free packages. I haven't seen him contribute anything in a long time except complaints and rhetoric.
This guy is way out there
http://www.advogato.org/article/702.html
http://www.welton.it/davidw/
You're going on the assumption that they're referring to MD5 hashes of entire files. Not so. Typically such comparisons generate many hashes per file, by going through each file line-by-line, and for each line generating a hash of it and the next 5 lines or so. When all is said and done, this gives you many hashes of small blocks of code within a file, which can then be compared to the hashes from a different codebase. Any time 5 lines are the same between the codebases, you would get a match.
-j
Terrorists Surprised to Find Themselves In Hell
From: The Onion . com
JAHANNEM, OUTER DARKNESS--The hijackers who carried out the Sept. 11 attacks on the World Trade Center and Pentagon expressed confusion and surprise Monday to find themselves in the lowest plane of Na'ar, Islam's Hell.
"I was promised I would spend eternity in Paradise, being fed honeyed cakes by 67 virgins in a tree-lined garden, if only I would fly the airplane into one of the Twin Towers," said Mohammed Atta, one of the hijackers of American Airlines Flight 11, between attempts to vomit up the wasps, hornets, and live coals infesting his stomach. "But instead, I am fed the boiling feces of traitors by malicious, laughing Ifrit. Is this to be my reward for destroying the enemies of my faith?"
The rest of Atta's words turned to raw-throated shrieks, as a tusked, asp-tongued demon burst his eyeballs and drank the fluid that ran down his face.
According to Hell sources, the 19 eternally damned terrorists have struggled to understand why they have been subjected to soul-withering, infernal torture ever since their Sept. 11 arrival.
"There was a tumultuous conflagration of burning steel and fuel at our gates, and from it stepped forth these hijackers, the blessed name of the Lord already turning to molten brass on their accursed lips," said Iblis The Thrice-Damned, the cacodemon charged with conscripting new arrivals into the ranks of the forgotten. "Indeed, I do not know what they were expecting, but they certainly didn't seem prepared to be skewered from eye socket to bunghole and then placed on a spit so that their flesh could be roasted by the searing gale of flatus which issues forth from the haunches of Asmoday."
"Which is strange when you consider the evil with which they ended their lives and those of so many others," added Iblis, absentmindedly twisting the limbs of hijacker Abdul Aziz Alomari into unspeakably obscene shapes.
"I was told that these Americans were enemies of the one true religion, and that Heaven would be my reward for my noble sacrifice," said Alomari, moments before his jaw was sheared away by faceless homunculi. "But now I am forced to suckle from the 16 poisoned leathern teats of Gophahmet, Whore of Betrayal, until I burst from an unwholesome engorgement of curdled bile. This must be some sort of terrible mistake."
Exacerbating the terrorists' tortures, which include being hollowed out and used as prophylactics by thorn-cocked Gulbuth The Rampant, is the fact that they will be forced to endure such suffering in sight of the Paradise they were expecting.
"It might actually be the most painful thing we can do, to show these murderers the untold pleasures that would have awaited them in Paradise, if only they had lived pious lives," said Praxitas, Duke of Those Willingly Led Astray. "I mean, it's tough enough being forced through a wire screen by the callused palms of Halcorym and then having your entrails wound onto a stick and fed to the toothless, foul-breathed swine of Gehenna. But to endure that while watching the righteous drink from a river of wine? That can't be fun."
Underworld officials said they have not yet decided on a permanent punishment for the terrorists.
"Eventually, we'll settle on an eternal and unending task for them," said Lord Androalphus, High Praetor of Excruciations. "But for now, everyone down here wants a crack at them. The legions of fang-wombed hags will take their pleasure on their shattered carcasses for most of this afternoon. Tomorrow, their flesh will be melted from their bones like wax in the burning embrace of the Mother of Cowards. The day after that, they'll be sodomized by the Fallen and their bowels shredded by a demonic ejaculate of burning sand. Then, on Sunday, Satan gets them all day. I can't even imagine what he's got cooked up for them."
"Curiosity killed the cat, but for a while I was a suspect."- Steven Wright
While Dark McBride and Chris Sontag shoot their mouths off, the community develops tools to finally make something clearer. :)
-Arnulf
Isn't there a problem with this, that you have to trust that the shreds that SCO provides are really the shreds to their source code? After all, they could just grab big chunks of Linux source code and stick it in their tree and then shred it. This technique only seems reliable for comparing two secret codebases without revealing either one. The asymmetry in this case makes the test unfair.
I've seen a lot of bitching that ESR can't code. One look at this proves otherwise.
Disinfect the GNU General Public Virus!
Could someone explain to me again why this supposedly proprietary code is still a secret when it has been released under the GPL?
First rule of journalism: don't assume your audience knows anything about the subject.
#include <stdio.h>

int main()
{
    printf("These source trees appear to be entirely different!\n");
    return 0;
}
Something like this would be useless because any good (wink: evil) programmer that changed the source even a little would render the compiled code to be different. To be legitimate, you would need source trees from BOTH and compare line by line (really, function by function).
This doesn't negate the fact that SCO is claiming blatant code copying where the source wasn't changed at all...
Can't someone just burn down SCO's buildings and end this whole idiotic process? SCO is an obvious MS puppet now.
(some lawyer at MS --> "All your source is now belong to us")
If we can show that SCO's violating the BSD license, maybe we can convince some BSD copyright holder to sue them first, and demand as part of discovery the MD5 checksums from "shred", showing duplicated BSD code but no duplicated BSD copyright.
What if this ESR tool runs and finds commonality, and the research shows that, in fact, SCO's rights were breached. Remember, this type of analysis is a two-edged sword. The purpose of this ESR is to remove doubt... but remember doubt could be removed either direction.
So, given that hypothetical, what would people here think? Would you forgive SCO? Would you concede SCO's point, but think that SCO defended their rights in a very poor manner? (this, btw, is what I would probably do). Would you stick your fingers in your ears and refuse to accept the outcome, and believe in some vast -wing conspiracy?
Obviously, the Linux movement would carry on. I don't think the death of Linux is even worth discussion. Some recourse would happen, probably monetary damages, and the offending code would be removed.
My real curiosity is how people's attitudes or feelings would change (or not change) if it turns out SCO is right (however unlikely that is).
Sarcasm and hyperbole are the final refuges for weak minds
Then we'll have the ability to identify the lines of code in the Linux tree that are the same as lines that SCO says are in their codebase. And show exactly where that code came from. Why? Because the process is OPEN! There are LKML archives with all-out flamewars over some of that code. There are companies whose legal departments have vetted the code they've contributed, and have files that document the process in excruciating detail.
We will undoubtedly find some of the 'copied' code is in fact BSD code, and the shred algorithm will show that the code differs exactly where the California Regents' copyright notice has been taken out, which will prove that SCO violated not only the GPL, but the BSDL as well. And just like AT&T before them, they'll lose big.
Posting as AC from work, but you know who I am...
SVM, ERGO MONSTRO
I can't read ESR's mind, but I strongly suspect that this is intended to provide ammunition for a counter-suit claiming that SCO has pirated GPL code and illegally embedded it into their software. The comparison isn't a proof of copied code, but it could indicate "hot spots" and provide sufficient basis for a sort of "search warrant" to force SCO to show its code. If it turns out, as it did in the famous AT&T vs. BSD case, that SCO has been whole-sale ripping off other people's code, then things might turn really uncomfortable for McB and his cronies...
Your Servant, B. Baggins
...but can't we at least move on to ANOTHER Monty Python movie? I hate "The Holy Grail".
Right, I remember one of the early comments from SCO's CEO... that one of the major similarities was the consistent and repeated spelling errors, as shown in the most recent example they released.
While spelling errors are sometimes consistent among different programmers, wouldn't it be wise to:
1. Strip out all the comments into a different file
2. Perform a spell check and isolate the misspelled words
3. Scan the Linux kernel for these misspellings
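The three steps above could be sketched roughly as follows. This is only a minimal Python illustration, not anything SCO or ESR is known to use: the comment regular expression, the tiny `KNOWN_WORDS` set (standing in for a real spell-check dictionary) and all function names are assumptions made for the example, and the regex does not cope with comment markers inside string literals.

```python
import re

# Hypothetical stand-in for a real dictionary; a real run would load a word list.
KNOWN_WORDS = {"the", "buffer", "full", "return", "error", "allocate", "new", "free"}

COMMENT_RE = re.compile(r"/\*.*?\*/|//[^\n]*", re.DOTALL)
WORD_RE = re.compile(r"[A-Za-z]{3,}")  # ignore very short words


def extract_comments(source):
    """Step 1: pull the comment text out of C source."""
    return COMMENT_RE.findall(source)


def comment_words(source):
    """All lower-cased words that appear inside comments."""
    return {w.lower() for c in extract_comments(source) for w in WORD_RE.findall(c)}


def misspellings(source, known=KNOWN_WORDS):
    """Step 2: comment words that the dictionary does not know."""
    return comment_words(source) - known


def shared_misspellings(tree_a, tree_b):
    """Step 3: misspellings from tree_a that also occur in tree_b's comments.

    Both trees are dicts mapping a file path to that file's text.
    """
    wanted = set()
    for src in tree_a.values():
        wanted |= misspellings(src)
    hits = {}
    for path, src in tree_b.items():
        common = wanted & comment_words(src)
        if common:
            hits[path] = sorted(common)
    return hits
```

A shared misspelling is only a hint about where to look; identifiers mentioned in comments will also show up as "unknown words" unless they are filtered out.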
There is no sanctuary. There is no sanctuary. SHUT UP! There is no shut up. There is no shut up.
... and it kept noticing "we welcome our new ??? overlords."
Something's wrong with this bloody thing.
Have any of those techno-Rabbis run a comparison search with their "Bible Code" program on SCO? Did it come up with the phrases "bankrupt in 2004," "full of camel dung," and "Serpent of Utah"? How about running the "Bible Code" on Unix System V. code? Considering SCO's fondness for converting code over to Greek symbols for their presentations, converting to sanskrit, Hebrew or Aramaic shouldn't be a problem...
"Right now, somewhere in this world, Scott Baio is plowing a woman he doesn't love," - Peter Griffin, *Family Guy*
Compare C parse trees. That's right, look at the parse trees, use some fancy graph algorithm to compare the calculations and parse tree nodes.
Someone mod this up I think I'm on to something!
2 years and no mod points. Join reddit. Because openness is good.
Here is the reason: the people that "stole" SCO's code (if indeed that happened) probably were not acting with ill intent. They probably thought they were doing genuine, valid reuse, in which case, why hide it? Obfuscating runs the risk of introducing new bugs.
OSS programmers, even the ones that cut corners, are not malicious in my experience. There are honest mistakes made, because, well, they are lone programmers, not lawyers, or professional managers, or financial experts, or whatever.
However, if code was deliberately obfuscated, that would be very, very bad news for Linux. That shows that it was not an honest mistake, but that the programmer knew something about the origins and they needed to be hidden. At best, he could argue that he didn't think that it was an IP violation, he was just trying to make himself look better by not giving credit. The other side could argue he obviously knew he was breaking the law.
Of course, as I said, I honestly don't think this case will come about. Even if code found its way in, I don't think it was a programmer saying "Hey, I'm going to do this, but it is illegal, so I will cover my tracks."
This sounds like a great tool. The copied code I'm concerned about is the code that other developers on the same project and I have copied from one file to another.
If I can use this tool to find that code and refactor it into subroutines/classes then that would be superb.
And if someone could write a plugin for Eclipse to help automate/assist with the refactoring...
-P
RimuHosting - Linux VPS Hosting
You're obviously a little misinformed, check out the stats on that snazzy IIS server the International Lesbian and Gay Association are using. Don't worry, most Windows users are "in the closet" just like you.
Presumably any of the many people with legal rights to SCO source code can publish the hash list without divulging any of SCO's (ahem) "IP".
Would these hashes of SCO source code be considered derivative works? That could have copyright implications...
While IANAL, I don't suspect you are either. Copyright is not something that applies to ideas - it applies to expressions of ideas. I'll quote the Apple vs Microsoft case note by Joseph Meyers:
In other words, the question on the table is whether portions of the Linux kernel are a derivative work of SCO's code - not whether it uses SCO's ideas.
What's particularly bizarre is that he wrote an apology for Franco. Not exactly the side of the Spanish civil war you'd expect an 'anarchist' to take!
So if the "Ancient Unix Code" is still available and its license permits, we can do the comparisons today.
OK, I'm a bastard and all, but with the most recent SCO (SCOX) stock jumps, can you wait a few more days to leave them in the dirt? I'll be able to buy my 8 160GB drives in just another day or two if you'll let the stock continue to climb.
But seriously. This is a good thing. SCO needs to be slapped down like a bad habit. Darl has pushed this long enough, and with a tool like this, we can finally push back.
Great work Eric!
harryk
think before you write, it'll save me moderator points.
It will just tell someone two trees are similar/identical. The important thing to prove in court is who copied from whom.
the question hasn't been asked:
But does it run on Linux?
Of course it runs NetBSD. BTC: 1NT7QvbetmANwaMzhpVL6
Speaking of BSD, a better way of doing this comes from Berkeley too. It's a program called Moss that is used by many universities to detect plagiarism in CS classes. I know from firsthand experience that this is a very powerful program. Unlike the shredding technique, things like changing variable names won't affect the comparison value Moss returns. It even does a pretty good job of noticing changes like replacing for loops with while loops.
One disadvantage it does have though is that it won't work with the MD5 checksums, although I'm a bit skeptical of how well that would work anyway.
While the concept sounds nice, any line by line comparison could easily be fooled. A run through indent, a comment change or a common search & replace on a variable will change the MD5 sum. A (rather more difficult) enhancement would be to compare code at the semantic level (perhaps using gcc's intermediate RTL or TenDRA's ANDF).
OSS programmers are usually highly-connected programmers, not lone ones. I haven't met one in years that doesn't have an IRC window open in the background for chitchat while their brain is idling. All the proprietary programmers I've met (mainly shareware authors) have been sad, lonely, bitter, microsoft-haters, while the OSS crowd just treat MS with mild derision.
Lawyers, professional managers and financial experts all make huge, glaring mistakes, sometimes even making the news.
Even if that program is to be used to compare SCO's source tree to Linux, given what we know of SCO's ethics, how can we be sure they wouldn't mess with their own code so as to get their "expected" results?
At this point, SCO should have 0.00000... credibility, and nothing short of clear exposure of the allegedly copied code should be accepted.
errera hunamum ets
And among those the marvelous SCOX... cannot copy here SCO group description and stock analysis from this "unbalanced" financial article, it is quite disgusting to me.... go and read if you want.
Just another example when real and reality follow different paths.
wtf
It should be interesting to see what bizarro statement SCO comes up with to counter this.
Sorry my bullshit sensor overloaded.
The sentiment is basically that, if I am in a software company, I have access to legal opinion (Can I use this code?), financial resources (if not, can we license this code?), and other stuff. OSS doesn't always have easy access to the same resources.
That is why I think honest mistakes could be made. In the absence of a legal team, financial team, etc, sometimes you just have to do what you think is right, or best, and it may not always be.
that SCO is even interested in having their source examined by such tools? Do you really think that Microsoft is going to submit their code base to any such examination?
This points up a very basic fallacy in SCO's IP arguments: they keep claiming that people cannot be protected in an OSS environment. What they really mean is: you don't have to worry about it in a proprietary environment because we won't let anyone see the code! We can steal whatever we want with impunity from lawsuits! We can generate whatever lawsuits we want and no one can prove us wrong because we can see their source and they can't see ours!
Now, don't get me wrong: I do not think that Microsoft will avoid any of this kind of comparison due to theft. Rather, I think they will avoid it because of liability issues. If their code was ever examined against the most common causes of buffer overrun vulnerabilities, it would most certainly fail. Given that the tools exist to examine source code for exactly these common design problems and that Microsoft code continues to suffer from these problems, why wouldn't they be liable for damages due to exploits of these things? They must have been either incompetent or stupid. Either looks bad in a claims court!
I think you miss the point. AFAIK SCO didn't sue IBM on the basis of copyright infringement, but rather for breaching trade secrets. They admit the code in question is owned and copyrighted by IBM. They talk a lot about Linux code infringing their copyrights, but they haven't so far sued anybody because of it, nor did they say what the problem is. They are only talking, making press releases, etc., with no legal action. They are probably not even offering those licenses for real - I haven't heard about anybody who bought them - did you? It could be a basis for a lawsuit against them after all. It seems to me they are playing safe on the legal side - one lawsuit, nothing more.
The real issue here is the media attention they get. Microsoft has lots of lawsuits - the Timeline case (users of SQL Server might be held responsible), the browser case (they might be forced to remove ActiveX support from IE and pay a 0.5 billion fine), the case brought against them by Intertrust (DRM infringing their patents, might affect almost all of their products, Microsoft is losing so far). These are real cases, yet few notice them. But one lawsuit by a little known company, with little chance of success, gets so much attention, and moreover everybody seems to get it wrong - I don't recall reading a note, commentary or an article in a newspaper or a magazine that would get it right. This is (correct me if I'm wrong) a case about a breach of contract by IBM, because it contributed its own code, which has not a single line of SCO's in it, to Linux, which was later sold by SCO. They claim now that they didn't know what they were selling. They didn't know Linux contained (at that moment) JFS support, NUMA, SMP, contributed by IBM, SGI, etc. After all - how could they? Did they have access to the source code?
Of course they did. Get it finally. The goal of this company is not to make a profit. It's not to find justice. It is to make everybody believe that Linux is illegal. This is why ESR's moves are a waste of time.
People treat their claims seriously, because the media do. The question is - why do the media behave the way they do?
Seriously. Perens and ESR are fueling SCO's flames by giving them poorly-thought-out statements to cull choice quotes from to support SCO's case. And SCO's words are the only ones seeing mainstream attention (check whose stories are linked from any pages about the SCOX stock prices, and you'll see the public is only getting SCO's side of the story).
In my most reasonable, humble opinion, anyone who is not an IBM lawyer really needs to STFU concerning this matter. The wise man waits his turn to speak.
create a C language parser that reduced the C-code down to op codes
like gcc?
I smell a troll so fiendish, so rotten, so utterly insane; LADIES and GENTLEMEN: May I introduce our pugilist/trollmeister of the evening...in the right corner, weighing in at 400 lbs of bullshit, heavyweight champion of the obscure trolling world, the best, the worst, the uncrowned king of sociopaths:
Kadaitscha Man! - "insert favorite troll"
(Trumpets)
So, "insert favorite troll", how's that BS in Scatology coming along?
Ah.
I see.
Yeah, I believe you...no,no, eating shit is probably good for you, in fact, err, I think I read somewhere that a certain african tribe use this practise to entice...
Say what?
Eh, well, I guess chimp-turds are kind of unconventional for a person in your office, but hey, what...
Hello?
Where'd you go, "insert favorite troll"?
Hellooooo!
(Fade out)
(and yes, permission granted for inclusion in "the peanut gallery")
I didn't know the old extremist fart could still actually write code.
It would be hard to prove that the differences between Unix and Linux amount to deliberate obfuscation.
SCO thought that Linux's BPF implementation was an obfuscated copy of the original, when in fact it's a clean room reimplementation based on published specifications.
In addition, natural evolution of code could be easily mistaken as "obfuscation" by someone who is not a kernel hacker.
next time peeps rip off codes, they litter every other line with dummy lines, n = n;
n = n * 1
n = n + 0
and bam his tool fails.
------ Curiosity killed the cat. {satisfaction brought it back | it didn't die ignorant | lack of it is killing mankind
If SCO doesn't want people to find out which lines are "copied into linux", so they can be removed, then why would they allow their code to be run through this program?
The open source community would then know which lines SCO has any business talking about, those lines could be investigated as to their source/license, changed if necessary, and everyone could move on.
But wait! SCO wouldn't have a case. (Not that they have one anyway, but that's beside the point. This point, at least). This is exactly the same situation as SCO just telling everyone which lines they are talking about. How would it be different?
I mean if you compare two sets of data, and you know which lines are identical....
Is the difference that maybe someone wouldn't be breaking their NDA with SCO if they ran the code through this and provided MD5 hashes (rather than the raw source code)? (i.e. so that this can actually be done, whereas SCO will not show their code themselves...)
-ETF EOM
would be very useful, as it can cut both ways. I wonder how many Linux or BSD device drivers found their way into SCO source code? Once similarities have been detected, it would be a simple matter to go through the Linux kernel archives and trace the source of the device driver code. Could SCO do the same? Would they dare?
Of course, the hashes would have to be invariant against camouflage by variable name search and replace, whitespace and blank line insertion, etc. And tests would have to be made to ensure nobody is pulling a fast one by swapping code - it would have to be shown to compile correctly to the executable.
My rights don't need management.
> A single change in the white space would make the
> MD5keys not match. any code run through an
> obfuscator would not match. a change in case
> would cause it not to match.
Sorry to flame here chaps, but the comparator code already strips all whitespace characters from the lines before checksumming them. Thus a trivial change such as indentation size will be ignored (though switching from 'proper' K&R indenting to some other form will break it).
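To make that concrete, here is a small Python sketch of the idea as described in the comment above. It is not comparator's actual code (comparator is a separate program whose source is not reproduced here); `line_checksum` and `checksums` are hypothetical names, and hashing one line at a time is a simplification chosen for the example.

```python
import hashlib


def line_checksum(line):
    """MD5 of a line after removing every whitespace character."""
    stripped = "".join(line.split())
    return hashlib.md5(stripped.encode("utf-8")).hexdigest()


def checksums(source):
    """Checksums of the non-blank lines of a source text, in order."""
    return [line_checksum(l) for l in source.splitlines() if l.strip()]


two_space = "if (x) {\n  y = 1;\n}\n"
tab_indent = "if (x) {\n\ty  =  1;\n}\n"
brace_on_own_line = "if (x)\n{\n  y = 1;\n}\n"

# Indentation and spacing changes vanish once whitespace is stripped...
print(checksums(two_space) == checksums(tab_indent))         # True
# ...but moving the brace changes which text ends up on which line.
print(checksums(two_space) == checksums(brace_on_own_line))  # False
```

The second comparison fails because the brace style changes the line boundaries, which is exactly the K&R caveat mentioned above.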
...how to write good user interfaces. With coders like you we will never achieve complete world domination. The correct program is, of course, something like this:
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int i;

    printf("Comparing source trees...\n");
    sleep(2);
    printf("Check started.\n");
    for (i = 1000; i--;) {
        printf(".");
        fflush(stdout);   /* make each dot appear as it is printed */
        sleep(1);
        if (i % 100 == 0)
            printf("\n%d0 percent remaining\n", i / 100);
    }
    printf("\n\nThese source trees appear to be entirely different!\n");
    return 0;
}
To keep up the /. tradition of analogy, imagine if I stole something from the store, but, in court, I claimed that I already had the item, and won. Does it make the theft right? No. In fact, it may make it worse, because I added lying on top of stealing.
Oh yeah, remember, in a civil suit, SCO does not have to get beyond reasonable doubt. So if a reasonable person would think that it is an obfuscation of the original, SCO wins. To use your example, a jury will not be stacked with kernel hackers. That is something that goes both ways, but it worries me. Hell, if they can find OJ innocent, maybe they can find IBM (Linux) guilty?
You mean that by digesting SCO crap, they may produce improved crap?
------ The only greater hazard to your liberty than n politicians is n+1 politicians.
lcc would be a better solution. It can spit out text dags. I wasn't aware gcc could do anything other than asm.
Do you even lift?
These aren't the 'roids you're looking for.
-1, karma whore.
his algorithm is really cheap...
... Zn
... Yn
:D the beauty of this is that we can even use this algorithm to compare DIFFERENT languages, for example, C code that has been ported to Java or Python.
:D
comparing object code will not work if one line of code is there as the check sum will be different.
a lot of codes don't port directly, so when stolen perhaps a few lines are added or deleted. if i was the author of this tool (note, i have no idea how it works besides checksums) i will look at the logic of code.
for example,
if statements will have the value 1
variables have 2
logical operators will have their own numbers
say, && = 4
comparison operators will have their own, etc...
== = 6
~ = 7
( = 8
numbers = 9
& = 10
this code
if (a && b) & ~0xf000 will yield
1,8,2,4,8,10,9
now we can pass those numbers to f(); our result will be Z, and Z defines the logic for that code
the next logic we can call Z1, Z2
now if the offending code has
if (foobar && feh) & ~0xffff it will likewise yield
1,8,2,4,8,10,9
we get Y, and the next line Y1, Y2
We can now then compare logics and compare how often they match.
So if Z0 matches Y3, we check
if Z1 matches Y4
and if Z2 matches Y5
by using this technique we can find logics that match, even if someone inserts 5 lines without our code, we do not have to worry about the checksums not matching or matching object codes.
basically we are matching by logic, and if we notice 20 consecutive matching logics, or whatever threshold we set, we can yield a positive result.
I just made this algorithm up on the spot; perhaps someone has already done it, perhaps not.
Wooops, it's now left for someone to implement and attempt to patent it, then profit!
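For what it's worth, the scheme described above can be sketched in a few lines of Python. This is a loose interpretation, not a tested plagiarism detector: the tokenizer regex, the class labels ("V" for any identifier, "N" for any number) and the function names are all assumptions made for the example, and a tuple of token classes plays the role of the poster's f().

```python
import re

# Rough C-ish tokenizer: hex numbers, decimal numbers, identifiers,
# a few two-character operators, then any other single character.
TOKEN_RE = re.compile(r"0[xX][0-9a-fA-F]+|\d+|[A-Za-z_]\w*|&&|\|\||==|!=|<=|>=|\S")
KEYWORDS = {"if", "else", "while", "for", "return"}


def token_class(tok):
    """Map a token to its class so that names and constants stop mattering."""
    if tok in KEYWORDS:
        return "K:" + tok
    if tok[0].isdigit():
        return "N"
    if tok[0].isalpha() or tok[0] == "_":
        return "V"
    return tok  # operators and punctuation stand for themselves


def line_signature(line):
    """The 'logic' of one line: its sequence of token classes."""
    return tuple(token_class(t) for t in TOKEN_RE.findall(line))


def longest_common_run(lines_a, lines_b):
    """Longest run of consecutive lines whose signatures match pairwise."""
    sig_a = [line_signature(l) for l in lines_a if l.strip()]
    sig_b = [line_signature(l) for l in lines_b if l.strip()]
    best = 0
    prev = [0] * (len(sig_b) + 1)
    for sa in sig_a:
        cur = [0] * (len(sig_b) + 1)
        for j, sb in enumerate(sig_b, 1):
            if sa == sb:
                cur[j] = prev[j - 1] + 1  # extend the run ending one line earlier
                best = max(best, cur[j])
        prev = cur
    return best
```

Comparing `longest_common_run(...)` against a threshold gives the "20 consecutive matching logics" test. Short lines such as a lone closing brace match everywhere, so a realistic threshold would have to be fairly high.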
does anyone have a copy of his essay from when VA Linux stock hit $0.50 a share? ($75,000)
Wouldn't SCO then be able to just copy from linux and then the comparisons would also be equal?
in girum imus nocte et consumimur igni
If you're comparing Linux to the legacy Unix source released by Caldera, wouldn't you want to throw away the common signatures? The unique signatures are what "might" contain any leeched code which is uniquely System V.
Theoretically, you can repeat the comparisons iteratively with other code bases (e.g., NetBSD/FreeBSD/OpenBSD), or compare against other signature sets (results of other comparisons). A SVR licensee could compare their branded version of Unix against SCO's, isolating the code which they own/created.
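The iterative filtering described above is plain set arithmetic once each tree has been reduced to a set of signatures. A minimal Python sketch, assuming the signature sets already exist (the short strings below are placeholders for real MD5 shreds, and `unexplained_overlap` is a hypothetical name):

```python
def unexplained_overlap(sysv, linux, *public_trees):
    """Signatures shared by sysv and linux that no public tree accounts for.

    Every argument is a set of shred signatures. Anything that also appears
    in a freely available tree (ancient Unix, the BSDs, ...) is discarded,
    leaving only overlap that still needs a human to look at it.
    """
    remaining = sysv & linux
    for tree in public_trees:
        remaining = remaining - tree
    return remaining


# Toy signature sets standing in for real MD5 shreds.
sysv = {"s1", "s2", "s3", "s4"}
linux = {"s2", "s3", "s9"}
ancient_unix = {"s2"}
bsd = {"s7"}

print(unexplained_overlap(sysv, linux, ancient_unix, bsd))  # {'s3'}
```

The same function works for a licensee isolating their own code: pass their tree and SCO's, then subtract whatever is known to be public.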
Thanks ESR!
THIS is exactly why Open Source works. It's not because of IBM or Red Hat or geeks from Finland. It's because people in the community are willing to step up to any challenge.
Thanks, ESR.
The global economy is a great thing until you feel it locally.
Pardon me, but a lot of you guys are missing the point of this comparator.
1) There are people out there with legit source licenses to SVR5 source trees. And not just Unix OEMs. Various people in large companies with SVR4/5 source licenses etc.
2) Such people cannot release the source code, and may (if paranoid of how they interpret 'derived works') not want to publish hashed MD5 codes of SVR5.
3) However, it should be possible for a legit SVR5 source licensee to publish openly a list of areas of code that are similar across trees, without that list being either A) a derived work, B) violating their NDAs (um, do check the fine print first though) and C) spending tons of their own, presumably expensive time, digging through stuff.
Linux advocates can then sift through the resulting large pieces of code and doublecheck/crosscheck the origins of it. At the very least, we'd have a public list of suspicious areas of Linux and could determine that certain parts are A) BSD-licensed, B) are verified as original by a known Linux coder, and C) don't fall into the above categories and remain 'suspicious'. This presumably is what ESR is referring to by "various persons will apply it in useful ways. Yes, I'm being deliberately vague and tantalizing". Let's say that it's likely the percentages in A and B will be large.
Of course it's true that there could be code that this primitive tool doesn't catch. But SCO probably started their analysis by using tools like this also. Looking through millions of lines of code by hand is no cakewalk, so one will inevitably start with code like this in such an investigation. (Unless one is concerned about one specific predetermined critical/sensitive piece of code.)
Oh, and the other thing about this tool that is nice IMHO? It demonstrates a "good faith" effort on the part of Linux advocates and coders to correct the problem -- despite the barriers raised by SCO (no code release except via NDA).
Finally, running this tool across a Linux and a BSD release should turn up some data that is both interesting and relevant for this dispute. I'm almost tempted to try that myself.
--LP
Why isn't this a press release?
If I go to Yahoo, and look at news related to SCOX, this doesn't show up. Here is the open source community trying to help find any misappropriated IP - and no one that doesn't read slashdot/eWeek will know about it!
Isn't there someone who subscribes to a wire service, that can issue a press release? In order to fight FUD, we have to get info out to people that don't read slashdot!!
Mike
It seems to me that the first order of business should be to compare it against Caldera's Linux tree to see if it lights up at 100%. How do you know it was the UNIX tree that was shredded? "Oops... that wasn't a shred-tree we intend to use in court."
Look how Slashdot turned this into another SCO article.
The news was a simple source tree comparison tool. Why is the headline "ESR to Shred SCO Claims?" He didn't mention anything about SCO whatsoever.
Just noticing. Now we'll have yet another few hundred SCO bitch posts. The Darl McBride troll will post and get modded up, people will try to act like intellectual property experts, and we'll all go about our day as usual. There is nothing new here but another attempt for more page hits on the part of corporate-owned Slashdot...
"Sufferin' succotash."
MD5 sums are only useful for determining if two chunks of data are EXACTLY the same. If you hash each file, then only copying a whole file would be detected. What if you hash each line? Well, then you still can't detect single character changes (such as re-indentation or added whitespace), and you risk having the line:
void main() {
occur in multiple places, resulting in collisions.
What is really needed is a functional checker that compares the code logically, and a syntax checker that compares unique word fragments.
Copyright is misapplied to source code. Either REVEAL THE SOURCE or you only get protection on that which you "publish" - namely, the binary.
Put up or shut up; no source, no copyright on the source. You won't share it, you don't need it protected.
So how did he meet his fate?
In all fairness, SCO's value is not in being purchased so that the source code can be freed...
Well.. that's not what I had in mind either.
Rather, I was thinking that the source and rights to UNIX will be up for sale with the rest of SCO's assets after they file for bankruptcy. (And THAT, I assure you, is just a matter of time..)
Interestingly, Red Hat's lawsuit and IBM's countersuit could very well leave the future smoldering wreck of SCO owing them money..
(I think we can all agree that their cases definitely have more merit than SCO's)
In fact.. if that happens UNIX could be included in a financial settlement between either of the above parties and the trustee of SCO's bankruptcy.
Now.. -that- would be sweet.
Is there a page that details the uses of the MD5 algorithm? I've just been using it for password security, but it seems like every other /. article references MD5 hashes. I'd like to know more about it.
what if SCO deliberately releases a list of shreds MD5fied from the linux kernel, as if they were from their own sources, without ever disclosing any code to public observation?
They would be divulging SCO's biggest trade secret, that all their claims are just FUD.
Engineering is the art of compromise.
RTFM:
Name
comparator, filterator -- fast comparisons among large source trees
Synopsis
comparator -c [-d dir] [-o file] [-s shredsize] [-w] [-x] path...
[snip]
The -s option changes the shred size. Smaller shred sizes are more sensitive to small code duplications, but produce correspondingly noisier output. Larger ones will suppress both noise and small similarities.
[snip]
The -w causes all whitespace in the file (including blank lines) to be ignored for comparison purposes (line numbers in the output report will nevertheless be correct). This is recommended for comparing C code; among other things it means the comparison won't be fooled by differences in indent style.
Using the appropriate switches will address two of your points. I'm sure it wouldn't be extraordinarily difficult to modify the code to ignore other things such as string constants, variable names, etc.
Problem is, at least in the case of SCO and probably in the case of many other code comparisons, it'll match on code that was commonly duplicated from open source or various sorts of PD or free sources. Consequently, the degree of similarity between the two trees will not be an indication of the extent of any copyright infringement.
Circumvent copy protection-- refuse to buy media that use it.
I can't dig up the Slashdot post, but here is The Inquirer article from Jun 18th. Someone did this well before ESR did.
It's not new, it's not ESR's idea, it's almost 3 months old!!!
I nearly blew Mountain Dew through my nose on that one. That's just crazy! SCO has committed the most vile of sin.
This Comment was generated with the Comment-O-Matic for SCO Stories.
I thought we already had one:
$ diff -u
I'm interested in algorithms that could be used to compare code. Moss and CAP from Berkeley are not interesting because the algorithm is secret (AFAIK).
What algorithms other than ESR's comparator are there? (I recall but can't locate a recent comment on Slashdot that said something like "most plagiarism detection programs used by professors use the XXXX algorithm".)
It is far more likely that people were not thinking, made a mistake or actually thought they had the right to include the code in which case they would not have tried to cover it up. I believe that ESR's goal is to identify these cases so they can be looked into.
e s r s c o m o u s e
who cares dude linix is fscking unstoppabull. it offers a caring community tripping all over itself to cater to the end user. oh, wait.. i think i just decribed the windows community.
Download & read the source. Or just read the documentation.
Comparator has the capability (-w) to ignore whitespace while generating the hash, while at the same time tracking the actual line numbers for purposes of merging and reporting. In my experience, most code-copiers are dumb and/or lazy -- to get past ESR's tool, the code-copier would have to (a) realize that they're violating a license, (b) not care, (c) be smart enough to realize that a pure cut-and-paste might get caught, and (d) energetic enough to munge up the code logic and variables. While I'm sure there are people like that, I would argue that most of them wouldn't be interested in contributing the result to the community, and the code wouldn't get past Linus if they did. The more logical case is some one/company who believe that they have a legitimate right to copy code from one kernel to another (BSD -> Linux / Linux -> SysV / SysV -> Linux) and thus not feeling the need to cover things up. Either of the SCO User Group examples would fit this category.
We call it art because we have names for the things we understand.
If it encounters a file with an unknown extension that has a blank first line, it gets a divide by zero (line 64 of shredtree.c).
I think it would be a lot more interesting with a program that could make algorithmic comparisons. Say, a program specialized in analysing e.g. C-code, prescinding naming, commenting, spacing and such from the actual algorithm.
That could allow finding even obfuscated similarities. (?)
So if a reasonable person would think that it is an obfuscation of the original
Not quite. It isn't that a reasonable person has to think it is an obfuscation. I believe the phrase you are looking for is "preponderance of the evidence", but I am not a lawyer. What they'll have to do is show that the preponderance of the evidence indicates that the code in Linux cannot legally be there. The judge will tell the jury what the law is and the jury will determine what the facts are.
Even if SCO allowed an "independent" party to hash their source with this tool, they could still present an "impure" source tree that has been deliberately peppered with code taken out of linux, specifically for the purposes of getting a match.
The bottom line is they need to front-up the code.
End of story.
It's just not possible to make a (small) set of MD5 hashes that represent all the "useful ways to structure a C statement". Even with extensive pre-processing, there's just way too many different ways to express the same algorithm.
Because of the nature of MD5 as a cryptographic hash, the value of the hash gives you almost no useful information at all about the structure of the code.
For example, given 3 variables (a, b, and c) and just the basic arithmetic operations of +, -, * and /, you could generate (literally) an infinite number of arithmetic expressions. Not all of those would be "useful", but there's no way to even enumerate the possible "reasonable" arithmetical expressions, much less calculate an MD5 sum for any combination of three of them.
-Mark
If all you want to do is keep the source secret, then a utility to spit out MD5 hashes of each line triple would be sufficient. Then pipe that into "sort | uniq -d" to find duplicate lines. You can even use uniq's "-w" switch to allow you to append line number information to the hashes. Voila, a 1-line shell script that duplicates most of ESR's tool:
find -name '*.[ch]' -exec codehasher {} \; | sort | uniq -d -w32
Why is ESR's super tool better than this?
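For anyone wondering what the `codehasher` in that pipeline would have to do, here is a minimal sketch in Python. It is only a guess at such a utility, not ESR's code: the function names and the "hash, then file:line" output format are assumptions, chosen so that the first 32 characters of each output line are the MD5 hex digest that `uniq -d -w32` compares.

```python
import hashlib


def shred_hashes(lines, size=3):
    """Yield (md5_hex, start_line) for every overlapping window of `size` lines."""
    for i in range(len(lines) - size + 1):
        chunk = "\n".join(lines[i:i + size])
        yield hashlib.md5(chunk.encode("utf-8")).hexdigest(), i + 1


def hash_file(path):
    """Return one 'digest path:line' string per line triple in the file."""
    with open(path, errors="replace") as fh:
        lines = [line.rstrip("\n") for line in fh]
    # Digest first, so sorting groups identical triples from different files.
    return [f"{digest} {path}:{lineno}" for digest, lineno in shred_hashes(lines)]


def main(paths):
    """Print the hashes for each path; call as main(sys.argv[1:]) to use it as a filter."""
    for path in paths:
        for row in hash_file(path):
            print(row)
```

Piped through sort and uniq, that only tells you which hashes occur more than once. It does no whitespace normalization and does not merge adjacent matching triples into ranges.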
Patrick Doyle
I mod down every jackass who puts his moderation policy in his sig. Oh, wait a sec....
Only do the shredding on documents which have been run through a script which converts all characters to uppercase and removes whitespace in a systematic fashion: no tabs, anything beyond three spaces becomes one space etc. Wouldn't that resolve most of the objections you raise? Fuzzy hashes, mentioned elsewhere in threads on this slashdot posting, could also be useful.
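A rough sketch of that kind of pre-processing, in Python. The exact rule here (squeeze every run of spaces and tabs to a single space, trim the ends, then uppercase) is a simplification of the scheme described above, and the function names are made up for illustration.

```python
import hashlib
import re


def normalize(line):
    """Uppercase the line and squeeze every run of spaces/tabs to one space."""
    return re.sub(r"[ \t]+", " ", line).strip().upper()


def shred_digest(lines):
    """MD5 of a shred after normalizing each of its lines."""
    text = "\n".join(normalize(l) for l in lines)
    return hashlib.md5(text.encode("utf-8")).hexdigest()


a = ["for (i = 0; i < n; i++)", "{", "\tsum += x[i];"]
b = ["for (i = 0;   i < n; i++)  ", "{", "    Sum += x[i];"]
print(shred_digest(a) == shred_digest(b))  # True: only spacing and case differ
```

Note that this still treats `i=0` and `i = 0` as different, because it squeezes whitespace that is already there but never inserts or removes it around operators.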
http://tinyurl.com/4ny52
It hashes the code in 3 line chunks, and the unique hashes from the shreds of both source trees are thrown out. Simple. And yes, whitespace can be ignored.
I still have a question about how the 3-line thing works, though (I read the article, but not the source or docs): if the source files are exactly the same, but the Linux version has an extra single line comment on top, won't all of the 3-line chunks come out as unique because they're out of phase?
Maybe it has some logic to restart the 3-line pattern after any double linebreak, or some pattern detection (i.e., restart at comment start/end) to ameliorate this.
Anyway, I do have to say: this is by far the MOST EXCITING news I've heard yet in this whole mess. What reason can SCO possibly give to refuse providing the shreds of their code? Either they provide the shred results and their lying is exposed, or they refuse, and their lying is exposed.
There are only 10 types of people: those who understand decimal, those who don't, and, uh, 8 other types I forget.
Heh. This puts the ball squarely back in SCO's court. Darl McBride's whinging "open letter" demanded that the Open Source community had to fix their development process in order to monitor for, and prevent, leaks of "intellectual property". Well, this tool seems to do just that...
... in your face, McBride!
It's obvious that Darl was no doubt expecting his challenge about "fixing the process" to throw the Open Source community into a panic of bickering about what to do next which would last for weeks, triggering discussions about potential liability (which the press would no doubt interpret as an admission of guilt), generally make us look bad and thereby strengthen his case in the sadly all-too-relevant court of public opinion.
So clearly he wasn't expecting Eric to solve his problem later THE VERY SAME DAY!
Jeez, you couldn't make this stuff up. Ha ha bloody ha
He developed a Callcenter Training Utility for our company in the early 80's. It used genetic algorithms to generate simulated customer complaints that were _very_ realistic, even to the point of using sample voices to "whine". Of course, the helpdesk trainees hated it...
But hey, the mewling was featureful.
Compiled with gcc 3.3+ it will actually run the analysis but it won't generate the output files if used against something larger than the 20 source files he includes in the tarball.
I wrote this in perl + shell commands a few months ago, haven't optimized it and it's in the same range of fast as 'comparator' and actually works. -- of course software that just works isn't anywhere near as exciting, or press-release worthy.
ho hum
Linux is Linux, if One need clarify their dist: <Dist>/GNU Linux
bsds are of course just BSD
- Professors
- Publishers
- Newspaper Editors
- Librarians
- Everyday Coders
- Statisticians
- Lawmakers
Anyone who's got a vested interest in knowing whether or not they are looking at an original work can benefit from this tool, or derivative works of it. With a bit of front-end processing, this can help professors and editors spot plagiarism, librarians spot duplication in their collections, and coders spot areas of redundancy. Thanks, Mr. Raymond. I'll be compiling this tonight...
Your request has been reviewed, and has been denied. We are truly sorry. The rest of you may now resume.
In order that the method should not be fooled by simple changes, at least the following is required
* White space must be ignored
* Comparison must be at the statement level, not the code line level
* Variable names must be replaced by standard placeholders
* Routine names, other than standard library calls, must be replaced by standard placeholders
* (Probably difficult) logic will be needed in the tool to detect and ignore noops: how do you deal with
The trouble is: a high proportion of the code sections thus simplified will fall into a relatively small number of possibilities, vulnerable to dictionary type attacks. Thus, most of the code could be reconstructed, though admittedly as obfuscated source code. IMHO this provides a valid objection to its use.
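To make the dictionary attack concrete, here is a small Python sketch. It assumes, for simplicity, that each published hash covers a single normalized statement; the candidate list and the `ID` placeholder are invented for the example. With 3-statement shreds the attacker would have to hash combinations of candidates instead, which is more work but the same idea.

```python
import hashlib


def md5_hex(text):
    return hashlib.md5(text.encode("utf-8")).hexdigest()


# Hypothetical dictionary of common normalized statements (ID = placeholder).
CANDIDATES = ["ID = 0;", "ID++;", "return ID;", "if (ID == NULL)", "break;"]


def recover(published_hashes, candidates=CANDIDATES):
    """Map each published hash back to a candidate statement, when one matches."""
    table = {md5_hex(c): c for c in candidates}
    return {h: table[h] for h in published_hashes if h in table}


published = [md5_hex("return ID;"), md5_hex("ID = ID * 31 + ID;")]
for h, stmt in recover(published).items():
    print(h, "->", stmt)  # only the common statement is recovered
```

The unusual statement stays hidden because it is not in the dictionary. The more aggressively the code is normalized, the more of it lands in a guessable dictionary.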
You are right that SCO will not allow its shredded source code to be publicly released. However, the reason would be because the MD5 hashes that matched could be traced back to specific lines in the kernel. I don't know why people don't get it...
SCO doesn't want the common code published or known until the court date.
I know it's irrational, I know it's silly. However, everyone expecting a "rational" response from an irrational company is foolish. I guess hope springs eternal, on both sides. SCO has bet the farm on this strategy, and they are not about to let the cat out of the bag. ESR seems to think they are just going to submit to this harebrained scheme, and produce a bunch of MD5 checksums.
If you take their point of view, this does nothing to protect their IP, it's just a thinly veiled way of tricking them into revealing the code they believe is in question.
No.
There exists no way of exchanging information without making judgments. --Bene Gesserit Axiom
I think we are all missing a big point here. SCO registered their copyright in SysV. It was hard to do. They had to create a copy of the source code and file it with the Copyright Office. That puppy is there so that *we* can look at it. This is specifically *fair use*. It is there so that individuals can protect themselves by comparing what they have to what has been registered. No match means no problem. It just doesn't get any more *fair use* than that. Just have somebody nip up to the Copyright Office, copy the registration for comparison purposes only, (really!) then do the comparisons. How hard is that? Yes, IAAL, but this is not legal advice. Hire your own mouthpiece.
Has anyone else tried to compile Eric's code?
>gcc --version
2.95.3
>make
/usr/bin/gcc -c -g main.c
main.c: In function `report_time':
main.c:311: parse error before `int'
main.c:312: parse error before `int'
main.c:316: `buf' undeclared (first use in this function)
main.c:316: (Each undeclared identifier is reported only once
main.c:316: for each function it appears in.)
main.c:317: `minutes' undeclared (first use in this function)
main.c:317: `seconds' undeclared (first use in this function)
make: *** [main.o] Error 1
Looks like Eric has been coding too much C++ or something. I'm not a C coder myself, so I might be wrong, but don't you have to declare all the variables at the top of a block of C code, before any other statements? In report_time, he doesn't seem to have followed that rule. Maybe he should check his code on a number of compilers before declaring he has "perfected it".
Eric here's my patch:
--- main.c 2003-09-10 00:28:37.000000000 -0300
+++ main.c.fixed 2003-09-10 00:29:55.000000000 -0300
@@ -306,12 +306,17 @@
if (mark_time)
{
- int elapsed = endtime - mark_time;
- int hours = elapsed/3600; elapsed %= 3600;
- int minutes = elapsed/60; elapsed %= 60;
- int seconds = elapsed;
+ int elapsed;
+ int hours;
+ int minutes;
+ int seconds;
char buf[BUFSIZ];
+ elapsed = endtime - mark_time;
+ hours = elapsed/3600; elapsed %= 3600;
+ minutes = elapsed/60; elapsed %= 60;
+ seconds = elapsed;
+
va_start(ap, legend);
vsprintf(buf, legend, ap);
fprintf(stderr, "%% %s: %dh %dm %ds\n", buf, hours, minutes, seconds);
I can see this tool becoming helpful for so much more than smashing SCO. Any situation where data comparison is useful, but the data itself must remain secret. All paranoid types (corporate or governmental) will love it. Lawyers could make much use of it.
And, given the dataset it generates, it could be extended to do other useful things such as detect redundant or cut-'n-pasted code, including bugs of the "pasted it in twice" sort.
...a shred of evidence that you'd done the comparison - - unless MS were proven guilty and the Shared Source licensee spoke up about it, in which case the doctrine of dirty hands would protect the licensee.
Microsoft couldn't successfully prosecute the licensee because they broke the law themselves with the item in question. OTOH, while that might stand in the way of the licensee themselves prosecuting Microsoft, others could then proceed themselves, sure in the knowledge that when discovery time came they'd be laughing. It'd even eclipse TSG's circus for a few days, maybe TSG's stock would tank because of that. (-:
There'd also be nothing to stop the licensee protecting their Shared Source access and avoiding offending Microsoft by shredding the source themselves and anonymously publishing it. Then anyone could do the comparison and point the finger. Any takers?
Got time? Spend some of it coding or testing
I don't know how many of you have heard this already, but Jon Katz, Internet journalist and bon vivant known and loved by millions of Slashdotters, was found dead this morning by sanitation engineers. Speculation as to the means of his demise hinges on the broomstick found near the body.
He will be sorely missed. Truly an American icon.
Now, you will remember that ESR last week all but threatened SCO with something it wouldn't like if it didn't start playing fair.
What if someone with legitimate access to SCO source code, shredded it and gave ESR the MD5's? Since there is no way to determine what the original lines are from the MD5's that would likely not be violating any NDA. Then, what if ESR compared that to Linux and other GPL'd code sources and only published those instances where it appears that SCO has stolen code.
That would probably result in some new discovery processes, even if IBM isn't already there.
Could it perhaps result in a method for other big software vendors to have their source code examined for illegal takings without them having to reveal any source code? And if so, is it possible that some of SCO's big backers might be more reluctant to let SCO keep this up? I can't imagine that MS would want to create a court approved method to compare its code to those that it might have stolen from in a way that doesn't give MS the cover of not wanting to reveal its code in public.
Actually, combine this with the "shared source" program from MS and it would be easy to see if MS did (or did not) copy GPL code into Windows as some suggest.
More importantly, get something like this accepted in a court of law as a legitimate way to do an initial assessment of code yet still preserve a litigants right to code privacy, and you are going to have not just MS but a number of big companies shaking in their boots. Not necessarily because they did steal anything but because they have to realize it is a possibility that one of their coders did without company knowledge. Doesn't matter, they are still liable.
But, get a method like this accepted in a court of law and you are going to see it used again. I think this has a huge potential to hurt closed software. And perhaps a potential to convince MS to stop funding SCO, perhaps even to apply pressure to get them to start backing down.
No, I'm worried that the 3-line chunks could be out of phase.
An MD5 sum of these 3 lines:
one
two
three
...won't match with the MD5 sum of these 3 lines:
two
three
four
...even though they share two lines. Now imagine the same thing for a few thousand more lines, identical except that one started on "one" and the other on "two". Every single MD5 will be unique, because you'll always have only two lines in common.
There are only 10 types of people: those who understand decimal, those who don't, and, uh, 8 other types I forget.
Okay, here it is (from the man page):
comparator works by first chopping the specified trees into overlapping shreds (by default 3 lines long) and computing the MD5 hash of each shred.
(Emphasis added)
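Here is a quick Python sketch of why "overlapping" is the word that matters. It is not comparator's actual code, just the two chunking strategies side by side, with made-up function names and a made-up seven-line "file".

```python
import hashlib


def md5_hex(lines):
    return hashlib.md5("\n".join(lines).encode("utf-8")).hexdigest()


def overlapping(lines, size=3):
    """Hash every window: lines 1-3, 2-4, 3-5, ..."""
    return {md5_hex(lines[i:i + size]) for i in range(len(lines) - size + 1)}


def disjoint(lines, size=3):
    """Hash fixed chunks: lines 1-3, 4-6, ... (a short tail chunk is dropped)."""
    return {md5_hex(lines[i:i + size])
            for i in range(0, len(lines) - size + 1, size)}


original = ["one", "two", "three", "four", "five", "six", "seven"]
shifted = ["/* extra comment */"] + original

print(len(disjoint(original) & disjoint(shifted)))        # 0
print(len(overlapping(original) & overlapping(shifted)))  # 5
```

With fixed chunks the extra comment line throws everything out of phase and nothing matches. With overlapping windows, all five shreds of the original still turn up in the shifted copy.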
There are only 10 types of people: those who understand decimal, those who don't, and, uh, 8 other types I forget.
Since IBM has both versions of the code, and since much time has passed since this issue arose, it would seem reasonable that IBM has already performed an analysis like this.
So where is the result?
If it was to their benefit, couldn't they leak out the results, if nothing else?
So maybe the result isn't favorable...
IBM's contributions are IBM's to give in the first place. SCO's claiming some pretty twisted contractual rights are being violated by the act in question- namely IBM giving pieces of their IP to the Linux community under GPL. A thorough reading of the evidence that SCO provides on their own website invalidates that claim- i.e. that SCO, through its purchase of this and that, has a control right over whether IBM may or may not give away its IP.
The simplest way out is to not listen to SCO in the first place and wait and see what comes of ALL of this- the Red Hat filing and the IBM one.
On the 15th, SCO HAS to respond, come up with a fairly compelling reason for the court to allow another delay, or face a summary judgement. If they don't come up with something to counter Red Hat properly, they face a summary judgement.
Later in the month, they have to answer IBM under a similar set of circumstances.
Combine this with what we're all discussing, if ESR's little program works like it appears that it does- while ESR's grandstanding, it would very easily hurt their position with the Red Hat filing.
I am not merely a "consumer" or a "taxpayer". I am a Citizen of the State of Texas
I don't know if the MD5 sums are a derivative work of the original source or not, but I would be inclined to think that they are.
Let's look at what the law says about fair use
Fair Use
The four factors are: (1) the purpose and character of the use, including whether such use is of commercial nature or is for nonprofit educational use; (2) the nature of the copyrighted work; (3) amount and substantiality of the portion used in relation to the copyrighted work as a whole; and (4) the effect of the use upon the potential market for or value of the copyrighted work.
It looks to me that under part (1), the MD5sums are a form of commentary or news reporting about the original work, not a replacement for the work. I don't know about (2). Under (3), the "amount" is definitely small, and the "substantiality" is low. And under (4), almost nobody who would buy the original work is going to substitute the MD5sum's instead, so the MD5sum's would have nil effect on the market for the original work.
So in my AC-IANAL opinion, distribution of the MD5sum's would be protected under American copyright law as a "fair use".
What if SCO offers their shredded code, but "poisons" the MD5 sums with sums taken from actual Linux code? How could anyone verify they didn't?
Even if the shredding is done by a 3rd party they could contaminate their "source" with Linux source before letting the 3rd party get to it. Same result.
They would then claim this as proof of their case.
Interesting. I just ran this on some source trees on my disk. These were two related projects, but some of the matches were not the result of copying. Here's one three-line example (no, this isn't my coding style):
}
{
One file was Java, the other C.
I know you can vary the shred size to make the output less noisy but it shouldn't be difficult to find scenarios with 4, 5, or 6 lines; beyond that you risk missing small but valid similarities.
Article, Manual, Source
Opinions stated are mine and do not reflect those of the Illuminati
As you know, SCO are claiming copyright infringement, not patent infringement. They claim that code has been copied verbatim into Linux. This tool is very useful for deciding on these claims.
I hear all this debate about whether or not the code itself would md5 similarly, but here's one for y'all...
Tell me, would the file trees lend themselves to comparison? Or do we plan on catting every single file into a monolithic codeball to do the comparison on? Not counting the asm files, no doubt.
Would the System V file tree be identical to Linux's? Somehow, this I would doubt, extremely.
The Penguin Producer
A lot of comments focus on the problems that a global search and replace will pose to the technique. I think we can improve the algorithm by doing the following:
What we are looking for here are pieces of code with the same structure: the same for loops, while loops, variable assignment, function names, and so on. The idea would be to substitute all literals by a standard placeholder, and then generate the md5 checksums on the block level (as somebody has previously suggested).
To be able to cheat this technique, a modification in the structure of the code is required. And in the case that exactly that has been done, it is arguable whether that can be considered copyright infringement.
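A crude Python sketch of the placeholder idea follows. It goes a little further than literals and also replaces identifiers, since that is what a global search and replace would change. The tokenizer is a deliberately naive regex, the keyword list is incomplete, and the names `skeleton` and `structural_digest` are invented for the example, so treat it as an illustration rather than a real C front end.

```python
import hashlib
import re

# Incomplete keyword list, enough for the example below.
C_KEYWORDS = {"if", "else", "for", "while", "do", "return", "int", "char",
              "long", "void", "struct", "break", "continue", "switch", "case"}

# Identifier, run of digits, or any single non-space character.
TOKEN = re.compile(r"[A-Za-z_]\w*|\d+|\S")


def skeleton(line):
    """Replace identifiers with ID and integer literals with NUM; drop spacing."""
    out = []
    for tok in TOKEN.findall(line):
        if tok[0].isdigit():
            out.append("NUM")
        elif (tok[0].isalpha() or tok[0] == "_") and tok not in C_KEYWORDS:
            out.append("ID")
        else:
            out.append(tok)
    return " ".join(out)


def structural_digest(lines):
    """MD5 of a block after reducing every line to its skeleton."""
    text = "\n".join(skeleton(l) for l in lines)
    return hashlib.md5(text.encode("utf-8")).hexdigest()


a = ["for (i = 0; i < count; i++)", "    total += buf[i];"]
b = ["for (k=0; k<len; k++)", "  sum += data[k];"]
print(skeleton(a[0]))                                # for ( ID = NUM ; ID < ID ; ID + + )
print(structural_digest(a) == structural_digest(b))  # True
```

Renaming and respacing no longer change the hash, but a lot of unrelated loops will now collide too, so the noise goes up along with the recall.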
If it's already in the linux kernel, it's publicly available. The only thing you cannot see publicly is which parts are actually SCO's.
For me to even consider whether SCO source is in the kernel, SCO should:
1. point to which source they mean is infringing
2. give reasonable evidence, that they actually have the rights to that source.
Should a judge decide that the source is SCO's, they will have good protection for that source in the future.
SLOGEN [ http://ungdomshus.nu : Sebastian cover music]
A: "You stole my watch"
B: "No, i bought that"
A: "No, you stole mine!"
B: "Okay, show me something that identifies your watch"
A: "Can't do that... it's a secret watch!"
B: "uuuhhmmm, if you really think it's your watch, I already GOT the secret!"
SLOGEN [ http://ungdomshus.nu : Sebastian cover music]
In other news, SCO reports that it has successfully used the Linux IP pirates' own code-comparing tools against them. A perfect match has been found between SCO code and Linux code. The offending code reads:
#include <stdio.h>
I am wondering how effective the described method is at finding any kind of plagiarism.
Take source code, copy it and change the variable names (as is done in some student projects), and now run it through the shred algorithm. To my understanding, the MD5 hashes of two bit strings that differ in only one bit are still not related at all (i.e., they cannot be shown to be closer than the hashes of two bit strings that differ in many bits). Hence the algorithm would not find the source code with changed variable names if it compares three lines in one go.
Wouldn't code snippets passed into new projects always require some sort of name adaptation?
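This is easy to check for yourself. The Python sketch below counts how many of the 128 digest bits differ between a line, the same line with one variable renamed, and a completely unrelated line. The sample lines and the helper names are made up for the illustration.

```python
import hashlib


def md5_int(text):
    """The 128-bit MD5 digest of the text, as an integer."""
    return int(hashlib.md5(text.encode("utf-8")).hexdigest(), 16)


def differing_bits(a, b):
    """Number of bit positions (out of 128) where the two MD5 digests differ."""
    return bin(md5_int(a) ^ md5_int(b)).count("1")


original = "count = count + 1;"
renamed = "total = total + 1;"
unrelated = "return NULL;"

print(differing_bits(original, renamed))    # renamed variable
print(differing_bits(original, unrelated))  # completely different line
```

For a cryptographic hash you would expect both numbers to land somewhere around half of the 128 bits, so the digest of the renamed line looks no "closer" to the original than the unrelated one does. Any tolerance for renaming has to come from normalizing the text before hashing, not from the hash.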
This would have been a lot more interesting if he had actually used his tool to compare Linux to e.g. FreeBSD. This would test his software and prove its usefulness. It would give us a much better picture of what BSD code is in Linux, and be a great help in determining if SCO has any rights to any of it.
If you've licensed code from Microsoft, and it turns out to be GPL, the license under which you got the code is invalid, so it wasn't illegal to determine if they improperly took code.
You forgot the most important thing: MS has billions of dollars to burn to "prove" they are legally right. And you can bet your kids' future that they will spend such sums to protect their source code.
How much do you have ?
echo '[q]sa[ln0=aln80~Psnlbx]16isb572CCB9AE9DB03273snlbxq' |dc
And so how is it that the only credible example of stolen code in Linux came from someone working for SGI?
Watch this Heartland Institute video
This is just more evidence of the Open Sores community's intention to launder* the code before Darl can prove how evil they are in a court of law. Buy SCOX before it goes through the roof!
* "launder" is Darl's actual choice of term.
If you were blocking sigs, you wouldn't have to read this.
The best use of this technology would be to test the SCO LKP for stolen Linux code.
Confirming that SCO had incorporated Open Source code that they had access to under the GPL would destroy their credibility and open them up to countersuits. The process would only have to reveal enough similarities to have subpoenas ordered for the actual code involved. Then we could prove the theft with SCO's own source code.
I suspect that those who know that Linux code was used to create LKP would come forward once the code has been discovered and posted for all to see.
Scox shares continue to sky-rocket in deference to news like this. Scox up about 11% just yesterday. Now scox is over $18 a share.
Not bad considering that scox's core business is now *much* worse than a few months ago - when scox was under $2 a share.
All of scox's recent profits come from msft fud money. Until msft started throwing gobs of money at scox (supposedly for a partial unix linux) scox had never had a profitable quarter. In fact this company with a book value around $10 million was losing as much as $125 million in a year.
Respected Wall-Street Analyst Jonathan Cohen has just predicted that scox will earn an astonishing $3 a share in 2004. I guess Cohen thinks that Bill Gates is feeling awfully generous.
US justice system asleep at the switch. Scox insiders laughing all the way to the bank.
Even a potential Lawsuit is just another reason to write grooooovy software.. *evil grin*
GO ESR!!
-- All That's Evil in the Geek Space
ESR is still programming like a pig.
His code includes linux-only headers, and doesn't pass through -Wall by a large number of fairly stupid warnings.
It might be that corporations will be happy to give him source code shreds, if they can manage to compile and run his code...
Did your momma have any children that learned to think?
Source code gets no copyright protection: corporations keep their source as a "trade secret" and only get protection on the executable. It is illegal to redistribute (copy) the executable, and the source is entirely within their control (and their responsibility). No real "furtherance of the arts" is accomplished except within the limited scope of usage of the tool itself. If a work is infringed at the source level, therefore, it is (nearly) impossible to prove without revealing "trade secrets" and, therefore, exposing the company to further risk.
Source code gets copyright protection (as constitutionally mandated)
Corporations have to register the source code, and therefore are given full protection on both works. It is just as illegal to redistribute (share) the source beyond the scope allowed by the rights holder, and if a work is infringed there is no risk to the rights holder in defending the work. "Furtherance of the arts" is addressed, as well as the rights of the work's creator.
Corporations are allowed "copyright" on works they do not share.
It becomes nearly impossible for libeled parties to defend themselves, but "rights holders" are free to make claims as they see fit. Which gives "rights holders" basically free rein to make accusations which they may never be forced to address in court, and leaves victims nearly defenseless until the (very slow) court gets around to addressing the issue. Neither "furtherance of the arts" nor protection of (libeled) rights holders is served, since the more powerful party remains free to withhold (copyrighted) "evidence" that no one is allowed to see.
How does this system serve rights holders whose works may have been infringed upon, but are forced from the marketplace by another "rights holder" with more money? How does that system serve the public interest? How does it promote progress?
Can you answer any of these questions using sound logic?
This is the end for SCO. There are only three possibilities here:
(1) That there is no infringing code that belongs exclusively to SCO. If that is the case, then it's game over for SCO; perhaps followed by jail time for Darl and friends at some federal prison where they would discover a new meaning for the phrase "pump and dump".
(2) That there seems to be code that was duplicated directly and exclusively from the Sys V source tree and that doesn't originate from any other public source. If it's a trivial amount of code, simply replace it and move on. The unlikely perpetrator alone becomes responsible for any damages to SCO.
(3) If it's non-trivial you can simply remove it from the kernel as long as it doesn't impact anyone seriously. Make it part of the final 2.6.0 or a 2.4.x interim kernel release. For example, let's say the IBM journaling file system is exactly the same. Simply remove it from the kernel until such time as IBM settles its lawsuit and resolves those matters. As long as people don't really need JFS, why encumber the kernel? I've never used it, preferring ReiserFS or ext3. Same goes for other supposed code expressions such as RCU or NUMA, although I suppose the copyright issues on those would be easy to solve since the amount of code in question is on the order of 5000 lines. If all you have to do is change the way it's expressed, that should be trivial. In any case, derivative works laws shield any code that is specifically tied to hardware implementation from being considered before the court.
Bottom line. SCO is really screwed now. Their only recourse is hope beyond hope that they can get someone to agree with their derived works claim on some non hardware/software patent code. At that point the only thing they can do is get compensated by the infringing party. No way they will be able to shake down linux users since they will already have been paid.
Oh, I forgot the fourth possibility -- that no one who has access to the Sys V sources will be willing to run 'comparator' on it and generate shreds, and that SCO will also refuse. This of course would be sufficient to dismiss their case. A declaratory judgement could be handed down by the federal judge supported by expert testimony that 'comparator' is a valid comparison. Any reasonable expert Software Engineer/Cryptographer would do. Perhaps Bruce Schneier could be the expert witness.
End of story. Thank you for playing SCO, please drive through.
The key word there being absurdly.
the preceding comment is my own and in no way reflects the opinion of the Joint Chiefs of Staff
I can hear the wailing of comp sci students worldwide...