Usenix President - Linux Needs Better Paper Trail
Anonymous Coward writes "Usenix Association president Marshall Kirk McKusick is a veteran of BSD's intellectual property scuffle with AT&T in the 1990s, and he's got some thoughts and advice for the keepers of the Linux kernel going forward, commenting: 'There isn't a well-documented ownership trail with Linux. So, they have opened themselves up to a swamp of 'he said-she said' about where code came from'."
Dating back to when Linux (the kernel) didn't even have a version number, code was always attributed to where it came from. I'm sure everyone is familiar with at least the changelog and its attributions. And of course actual comments with names and email addresses are all over the source code itself.
Now, Mr. McKusick might have a partial point. It's entirely possible that some gremlin over at Caldera took a bunch of SCO's 'Intellectual' Property and threw it into the main kernel under the GPL. In which case, once the lines of code are actually identified, I suspect we will know who contributed them in under 20 minutes (10 minutes of which will be the article sitting on /. in The Mysterious Future!). In the unlikely event of SCO ever saying which lines are theirs, we may end up with the interesting situation where a Caldera/SCO employee put them there - and get to slap SCO for abusing the legal system.
In any event, I'd be willing to put money on Linux's documentation of where its source code came from beating out SCO's any day of the week.
Karma: SELECT `karma` FROM `users` WHERE `userid`=138474;
http://news.com.com/Linux+contributors+face+new+rules/2100-7344_3-5218724.html?tag=nefd.top
I wrote it, the whole thing. Linus was my roommate at the time, he took credit for all of it. I was the one that worked with Santa Claus and the tooth fairy to develop it, not Linus! Problem solved.
I boycott signatures
About a thousand geeks just simultaneously
wondered what the hell paper is.
Oddly enough, all of those thousand geeks could tell you what a scroll is.
There's nothing Intelligent about Intelligent Design.
It's not like this is some surprising new insight, see another article posted today: here.
Is intended to allow the developers of Linux, as well as the various UNI*es, to register and tell what they know of their own roles, as well as the development of each feature of each version of UNIX flavored operating system. Stay tuned to Groklaw for the official announcement...we're working on getting the site up within the next couple of days.
Site's a little slow already (darn subscribers), so here's a Mirror.
Note: This doesn't mean I agree with this crap. As a coder, I can certainly understand their wanting to write code more than document everything. Really, shouldn't CVS logs be as much "proof" you wrote it as you need? It's far more work to try to fake writing it by changing others' code than it is to just do the work yourself.
Ehh. Linux /always/ had a version number. Since day one, with v0.01, back in 1991.
Uh, this must be a typo. Linux developers arguing over the source for changes would always be: "he said-he said - then they got into a hissy fit hair-pulling fight"
Glad I could clear that up.
I guess, in the spirit of the GNU GPL, they'll have to come up with something, call it the FDA - a "Full Disclosure Agreement" that you *must* sign before contributing code, stating that you WILL tell everybody about the project and publish your code contribution, sort of a bizarro-world NDA.
try { do() || do_not(); } catch (JediException err) { yoda(err); }
From what I've seen ownership never becomes a problem until large amounts of money become involved or until one group attempts to sandbag another group based upon their ownership. Since this is the open source community, most commonly under the GPL license, there is no worry about this sandbagging unless someone attempts to take a fork and make it commercial.
Is this where the need for a paper trail comes in? Suppose someone takes the kernel and starts their own independent development on it. Hypothetically, in five years, they could rewrite or replace over 50% of the kernel with their own code. From what I understand the GPL license requires that any code that it becomes part of must also be GPL. If the total code package is several million lines, however, who is going to pay to subpoena the source code for a commercial product to prove that it was indeed started from a GPL/open source project? Who will pay to have the code audited, and what prevents a potentially unscrupulous commercial entity from playing mix and match with subroutines so carefully that the resulting audit would take more time to arrange properly than to actually audit the lines one by one?
I suppose, in this case, the paper trail wouldn't make a darn bit of difference. The paper trail isn't going to make it any easier to subpoena source code from a commercial entity if they're stonewalling.
Enter my tin-foil argument that Windows9x/2x is nothing more than badly mangled Linux and a customized window manager built with a cryptically designed compiler--but no one ever gets to see the source code so they'll never be able to prove it.
+++ATHZ 99:5:80
Ok, so if I hypothetically had this idea to include a few lines into the kernel...I managed to slip a couple of lines of code into a "thank you" postcard to Linus eons ago. After reading it, he thought it was utter rubbish and tossed it away.
Actually, he was so pissed off about the whole stupidity that it motivated him to stay up an extra two hours hacking away. So technically, some of his code should be attributed to me, right?
Much like how some of that code should be credited to Pizza Hut, Starbucks, and a few different candy bars. So where's the documentation on that?
All the changelogs, the comments, and any other bits of documentation aren't enough. Where's the credit to the pizza delivery guy? He helped develop some of that code! Ingrates, I tell ya.
Slashdot: Process Improvements Wasn't Linus just talking about authors signing kernel submissions?
No. It's that no one can take away your right to fork the software, your right to use the software as you see fit, your right (or your proxy's right) to examine and change the software if you desire, and your right to redistribute the software, as long as you allow other people the same rights.
how to invest, a novice's guide
Also, I would imagine that pretty much every kernel code submission is traceable to its submitter. As far as I know, every specific line of code that has been brought up by SCO has been tracked down and attributed to its submitter. Beyond that, there's really no way for BSD, Linux or anybody else to _know_ that the person submitting a patch really owns the copyright to it, or is acting as an authorized agent of their employer who owns the copyright to it. At some point, there is good faith. Yes, a well-documented paper trail would be nice, but requiring patch submitters to submit signed documentation with their patches would place an immense administrative burden on somebody, and it wouldn't prove that no copyright infringement has occurred; it would just blame-shift to whoever submitted the patch. I don't think that would legally remove the possibility that an unscrupulous company could go fishing for damages, a la SCO. It would also effectively remove the bazaar-like openness that Linux has, in contrast with more closed, insular projects with their rigid committer lists and uberpolitical machinations (XFree86 anyone?).
But I guess from a PR perspective this guy has a good point. Having some big pile of papers to point to and say "look, this documents that all contributors have copyright to their patches, and every line of code is accounted for" - this might help dissuade outrageous claims like SCO's and allay the fears of the business community, which likes to know that there are reams of bureaucratic documentation proving that the code is clean.
And maybe *you* should RTFA (the McKusick interview)? He says explicitly that the paper trail was lacking "until recently" (by which he means the switch to BitKeeper). Maybe you should also learn some respect for people like McKusick who've been hacking free Unix since back when Linus was a kid. Among other things, this guy pretty much invented the modern Unix "fast file system", from which ext2 takes a lot of ideas. More recently, he's been responsible for softupdates in UFS (gaining the speed benefits of async mounts without compromising filesystem integrity in case of crashes).
How many closed source companies copy code from various places? I would say open source is the least likely place people will do this considering how easy it is to get caught.
Anyone out there have personal knowledge?
You know, I suspect they've already discovered copied code...by a Caldera employee. Possibly even with written permission on file.
But the point of their lawsuit is to prove that at least some of the code in Linux came from SCO's IP through IBM. They're damned sure not to point out any pasting they did. It would point to an apparent flaw in their logic.
(And whether that flaw is really a flaw, and not a dynamic company changing its policies, is a subject for another debate. But I won't represent them.)
tasks(723) drafts(105) languages(484) examples(29106)
The changelog is insufficient documentation. It contains vague attributions that something changed somewhere in the code. It isn't specific as to what lines of code changed. Later, when you go back and try to find where a set of lines came from, a changelog doesn't help much.
With a source code control system, you know that so and so added on such and such a date. You can then go to that person and ask them where they got it from if there's ever any question.
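To make the difference from a changelog concrete, here is a rough sketch of what an annotate/blame pass does: walk the revisions oldest to newest and remember, for every surviving line, which revision introduced it. The `Revision` type, the `annotate` function and the sample names and dates are made up for illustration; real tools such as cvs annotate work from the repository's own history format, not from in-memory lists like this.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass
class Revision:
    """One committed version of a file (hypothetical, for illustration)."""
    author: str
    date: str
    lines: list


def annotate(history):
    """Return [(author, date, line), ...] for the newest revision.

    `history` must be ordered oldest first. Lines that survive unchanged
    keep the attribution of the revision that first introduced them.
    """
    annotated = []   # attribution for each line of the previous revision
    previous = []    # the previous revision's lines
    for rev in history:
        matcher = SequenceMatcher(a=previous, b=rev.lines, autojunk=False)
        updated = []
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag == "equal":
                # Unchanged lines carry their old attribution forward.
                updated.extend(annotated[i1:i2])
            elif tag in ("replace", "insert"):
                # New or rewritten lines belong to this revision.
                updated.extend(
                    (rev.author, rev.date, line) for line in rev.lines[j1:j2]
                )
            # "delete": the lines are gone, nothing to carry forward.
        annotated = updated
        previous = rev.lines
    return annotated


if __name__ == "__main__":
    history = [
        Revision("alice", "1992-01-05", ["int a;", "int b;"]),
        Revision("bob", "1994-03-10", ["int a;", "int c;", "int b;"]),
    ]
    for author, date, line in annotate(history):
        print(f"{author:8} {date}  {line}")
```

With per-line attribution like this you know whom to ask about a disputed line; a changelog entry only tells you that something changed somewhere.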
In the BSD world (I do a lot with FreeBSD), this has come in very handy when code disputes come up. Being able to talk to the actual people that inserted the code into FreeBSD has helped to clear up what otherwise might have been viewed as something improper.
I've tried to do similar things with versions of Linux in the past, only to discover that I could, at best, find what version they came into the tree at, and who collected the patch and sent it to Linus. I wasn't able to track it further without searching public mailing lists for the information (with mixed results).
While you might believe that it will take 20 minutes to identify the code in question, my guess is that's overly optimistic, unless the code in question was contributed since bk. It usually takes me at least 5 minutes to find out where code comes from in FreeBSD when there's a question, and cvs annotate makes the process *MUCH* faster.
I'm not sure I'd disagree with your comments about SCO being able to come up with where the code came from relative to Linux.
No, the purpose of the GPL is to provide everyone with access to the code and allow them to use it in their own GPL programs.
All contributors to Linux still own the sections that they contributed. Some projects are run differently; for instance, the FSF owns the code to all of the official GNU projects, because they ask contributors to assign copyright to them.
The ownership is important if you later want to change the license, for example by granting somebody permission to do something that isn't usually allowed by the GPL (e.g. distribute a modified version that isn't under the GPL).
If ownership of the code is restricted to a few well-known people this can be done. In the case of the Linux kernel it couldn't, because if any contributor couldn't be contacted or refused (there'll be quite a few, I suspect), then their code would have to be removed. If it were important it would then have to be replaced.
Paper? What's paper?
The man who trades freedom for security does not deserve nor will he ever receive either. - Benjamin Franklin
We all know Linux hasn't been in any sort of a version control system since version 2.2, after which the issues started allegedly creeping up.
Marxist evolution is just N generations away!
In the end? I like to think there's no time like the present.
Meanwhile, SCOX is down to 4.74 today. Volume is about a third of the 3-month average; they're falling off the investment radar. IBM's latest set of legal moves put SCO in worse shape than they've been since the litigation started. SCO has an earnings call and webcast on June 2. Tune in and hear Darl try to talk his way out of this one.
'There isn't a well-documented ownership trail with Linux. So, they have opened themselves up to a swamp of 'he said-she said' about where code came from'
So what? There is a basic flaw in this argument! In the USA anyway, you are presumed innocent until proven guilty. Anybody alleging that source was stolen and placed into Linux must prove that the source code:
a. existed somewhere prior to being placed into Linux
b. was stolen, not just happens to resemble code that might have been developed independently by someone else
In short, there should be no burden of proof on Linux's part to prove that the source was not stolen; the burden of proof must be on the accuser to prove that the source was stolen!
Knowing who submitted exactly which piece of code to Linux will not drain the swamp of 'he said-she said' about where code came from. In fact, it will make it a lot worse. Consider: company A claims that some portion of Linux source, submitted by person B, was stolen. Person B had business dealings with company A prior to or during the time that the source was submitted. Company A will say that this proves the source was stolen from them since person B obviously had opportunity! They will claim this even if person B had dealings totally unrelated to software within company A.
It's very easy to document where code did come from. But it's virtually impossible (if not 100% impossible!) to document that code did not come from any commercial source. By definition, to "prove" that any given piece of code didn't come from a commercial source, you'd have to take every single piece of commercial source code written up to and including the day of the disputed source's release, and grep it.
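The asymmetry is easy to see in a toy version of that grep. This is only a sketch: `normalize`, `find_matches`, and the file names and contents in `corpus` are invented for illustration, and whitespace-insensitive substring search is a much cruder test than anything a real code comparison would use.

```python
def normalize(text):
    """Collapse all runs of whitespace so formatting changes don't hide a match."""
    return " ".join(text.split())


def find_matches(snippet, corpus):
    """Return the names of corpus files that contain `snippet`.

    `corpus` maps a file name to its source text. A non-empty result shows
    where the snippet appears; an empty result only says it is absent from
    the files you happened to have, not from every source ever written.
    """
    needle = normalize(snippet)
    return [name for name, source in corpus.items()
            if needle in normalize(source)]


if __name__ == "__main__":
    # Hypothetical corpus, standing in for "every commercial source tree".
    corpus = {
        "vendor_a/ctype.c": "int  isdigit(int c)\n{ return c >= '0' && c <= '9'; }",
        "vendor_b/util.c": "void noop(void) {}",
    }
    print(find_matches("int isdigit(int c) { return", corpus))
    print(find_matches("static int frob(void)", corpus))
```

A hit is evidence you can point at. A miss is only as good as the corpus, and nobody outside the vendors has the full corpus.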
Honey, I shrunk the Cygwin
Was this not one of the reasons the GNU project wanted copyright assigned to it?
In the commercial world, you have change numbers which link to a documentation trail which shows who implemented something and why and who approved it. Linus is trying at least to improve the code provenance by looking at a certification chain between the patch generator, the maintainer and eventually Linus as release manager. Unfortunately, it still looks like a hunt through LKML for the documentation as you suggest.
See my journal, I write things there
This is why the old 4-clause BSD license enforced the notion of not being able to remove the copyright notice itself, and always giving credit for authorship of the code, plus the normal lack of warranty bits. RMS has quotes on the internet and his fsf.org site about this, and to summarize he says that it is too much of a burden to mark the names of each and every contributor to the code. This is just the way the GPL assimilates code, and makes it stink. Marshall is probably right about this since he was at the CSRG when BSD came under the gun about AT&T code infringement.
It isn't a lie if you believe it.
Linus has already acted.
Date: Sun, 23 May 2004 06:48:09 GMT
From: Linus Torvalds <torvalds@osdl.org>
To: Kernel Mailing List <linux-kernel@vger.kernel.org>
Subject: [RFD] Explicitly documenting patch submission
Hola!
This is a request for discussion..
Some of you may have heard of this crazy company called SCO (aka "Smoking
Crack Organization") who seem to have a hard time believing that open
source works better than their five engineers do. They've apparently made
a couple of outlandish claims about where our source code comes from,
including claiming to own code that was clearly written by me over a
decade ago.
People have been pretty good (understatement of the year) at debunking
those claims, but the fact is that part of that debunking involved
searching kernel mailing list archives from 1992 etc. Not much fun.
For example, in the case of "ctype.h", what made it so clear that it was
original work was the horrible bugs it contained originally, and since we
obviously don't do bugs any more (right?), we should probably plan on
having other ways to document the origin of the code.
So, to avoid these kinds of issues ten years from now, I'm suggesting that
we put in more of a process to explicitly document not only where a patch
comes from (which we do actually already document pretty well in the
changelogs), but the path it came through.
Why the full path, and not just originator?
These days, most of the patches in the kernel don't actually get sent
directly to me. That not just wouldn't scale, but the fact is, there's a
lot of subsystems I have no clue about, and thus no way of judging how
good the patch is. So I end up seeing mostly the maintainers of the
subsystem, and when a bug happens, what I want to see is the maintainer
name, not a random developer who I don't even know if he is active any
more. So at least for me, the _chain_ is actually mostly more important
than the actual originator.
There is also another issue, namely the fact than when I (or anybody else,
for that matter) get an emailed patch, the only thing I can see directly
is the sender information, and that's the part I trust. When Andrew sends
me a patch, I trust it because it comes from him - even if the original
author may be somebody I don't know. So the _path_ the patch came in
through actually documents that chain of trust - we all tend to know the
"next hop", but we do _not_ necessarily have direct knowledge of the full
chain.
So what I'm suggesting is that we start "signing off" on patches, to show
the path it has come through, and to document that chain of trust. It
also allows middle parties to edit the patch without somehow "losing"
their names - quite often the patch that reaches the final kernel is not
exactly the same as the original one, as it has gone through a few layers
of people.
The plan is to make this very light-weight, and to fit in with how we
already pass patches around - just add the sign-off to the end of the
explanation part of the patch. That sign-off would be just a single line
at the end (possibly after _other_ peoples sign-offs), saying:
Signed-off-by: Random J Developer <random@developer.org>
To keep the rules as simple as possible, and yet making it clear what it
means to sign off on the patch, I've been discussing a "Developer's
Certificate of Origin" with a random collection of other kernel
developers (mainly subsystem maintainers). This would basically be what
a developer (or a maintainer that passes through a patch) signs up for
when he signs off, so that the downstream (upstream?) developers know
that it's all ok:
Developer's Certificate of Origin 1.0
By making a contribution to this project, I certify that:
(a) The contribution was created in whole or in part by me and I
have the
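(The quoted mail is cut off above.) As a rough illustration of how light-weight the proposed sign-off is, here is a sketch that pulls the chain of `Signed-off-by:` lines out of a patch description. Only the `Signed-off-by: Name <email>` line format comes from the mail; the `signoff_chain` function, the regular expression and the sample message (including the second address) are assumptions made up for the example.

```python
import re

# Matches lines of the form "Signed-off-by: Some Name <address>".
SIGNOFF = re.compile(
    r"^Signed-off-by:\s*(?P<name>.+?)\s*<(?P<email>[^<>\s]+)>\s*$"
)


def signoff_chain(message):
    """Return [(name, email), ...] in the order the sign-offs appear.

    By the convention described in the mail, each person who passes the
    patch along appends a line, so the first entry is closest to the
    originator and the last is closest to the final tree.
    """
    chain = []
    for line in message.splitlines():
        match = SIGNOFF.match(line.strip())
        if match:
            chain.append((match.group("name"), match.group("email")))
    return chain


if __name__ == "__main__":
    # Hypothetical patch description.
    message = (
        "Fix off-by-one in example driver\n"
        "\n"
        "Longer explanation of the change goes here.\n"
        "\n"
        "Signed-off-by: Random J Developer <random@developer.org>\n"
        "Signed-off-by: Subsystem Maintainer <maintainer@example.org>\n"
    )
    for name, email in signoff_chain(message):
        print(f"{name} <{email}>")
```

The point is that the path is recorded in the patch itself, so nobody has to dig through list archives from a decade ago to reconstruct it.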
It's out now!
GrokLine
Groklaw story on Grokline
No matter how convoluted the system you propose to "track" this stuff, it will *always* come down to whether you believe or can trust "the first order contributor".
If we knew where every last keystroke came from, there would still be the "bob is lying, that keystroke didn't come from him, he stole it from his boss/friend/company/disassembly-of-windows or whatever." Or worse, he typed in the code but he got the idea from watching the wonderful-world-of-Disney while reading Cryptonomicon, so Eisner and Stephenson are the inventors and deserve X in consideration.
Many jobs worth doing are only worth doing to a certain standard of completeness. The problem with the poorly-named Intellectual Property domain is that, regardless of whether ideas want to be free or $40 a barrel, the boundary and origin of all ideas is undocumented bastardry at best.
All works of any creative mind are, at least in part, stolen from the fertile field of experience.
There is no fixing that, and the supposition that all the progenitors of what came before do *NOT*, a priori, deserve recognition and a stake.
Turning the provenance of each line of code into a perverse kind of Oscar(tm) acceptance speech *still* won't ensure that someone isn't slighted somewhere.
"I'd like to thank the academy, and my third grade comp-sci teacher for this for-loop, without them I would have never understood that pre-increment saves a temporary. And of course a shout-out to the CPU manufacturer, without whom I'd have never had a chance at direct increment of non-register memory. And of course my Mom, who never let me leave the table without eating all my peas; if it weren't for her I'd have never learned the value of bounds-checking in the completion of a problem domain. I know I'm forgetting someone, but you have all been so wonderful..." -- Rob White, Linux Kernel 6.2 Changelog for kernel.c line 722.
NOTE: The Above attribution is Under Dispute from the GCC board of optimizers for failure to credit the optimizing community's efforts in envisioning the need for loop unrolling and the value of peep-hole allocation of registers...
Really, how bad does "intellectual property" have to get before people get it into their heads that the Founding Fathers *DID* understand that you cannot own an idea? The absence of computer science from their acumen doesn't mean that these topics are sacrosanct, wholly new, and innumerable to that prior understanding.
Clue please people...
Innocent people shouldn't be forced to pay for inferior software development.
--"Code Complete" Microsoft Press
His point is that you need to be able to document that no one else owned the code before it was merged into the kernel. If someone did own it, you need to document that they legally passed rights to the code to your project.
What the GPL says is not pertinent to that issue. Put the SCO hysteria aside momentarily. This guy is speaking from his own experience in a very similar environment: When someone gets a lawyer and says they owned some of the code in your project, you'd better come up with documentation that proves them wrong. If you can't, it is your word against theirs.
-- Slashdot: When Public Access TV Says "No"
cvs annotate is an excellent first start to see where code came into the tree. Other tools allow one to see where the code really came from in the face of formatting changes and the like.
Like I've said in prior posts, having this information is invaluable. It also allows one to more easily back out changes that might be tainted, regardless of where they come from, since you know all the parts to that change, which is impossible with the changelog data. In this respect, bk is better than cvs since bk's change mechanism links multiple files that have changed, while CVS does not.
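To show what "linking multiple files" buys you, here is a sketch of the kind of guesswork needed when history is recorded per file: group file revisions that share an author and log message and fall close together in time, and treat each group as one change. The `FileRevision` type, `group_into_changesets` and the five-minute window are assumptions for illustration only; this is a heuristic, and it is not how bk or any particular tool stores its changesets.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class FileRevision:
    """One per-file history record (hypothetical, for illustration)."""
    path: str
    author: str
    message: str
    when: datetime


def group_into_changesets(revisions, window=timedelta(minutes=5)):
    """Heuristically group per-file revisions into multi-file changes.

    Revisions with the same author and log message join the same group as
    long as each one follows the previous one within `window`.
    """
    changesets = []
    latest = {}  # (author, message) -> most recent group with that key
    for rev in sorted(revisions, key=lambda r: r.when):
        key = (rev.author, rev.message)
        current = latest.get(key)
        if current is not None and rev.when - current[-1].when <= window:
            current.append(rev)
        else:
            current = [rev]
            latest[key] = current
            changesets.append(current)
    return changesets


if __name__ == "__main__":
    start = datetime(2004, 5, 23, 6, 48)
    revisions = [
        FileRevision("drivers/foo.c", "alice", "Fix locking", start),
        FileRevision("include/foo.h", "alice", "Fix locking",
                     start + timedelta(seconds=10)),
        FileRevision("README", "bob", "Typo", start + timedelta(hours=1)),
    ]
    for changeset in group_into_changesets(revisions):
        print([rev.path for rev in changeset])
```

When the tool records the grouping itself, backing out a tainted change means reverting one known set of files instead of trusting a guess like this.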
You *MAY* have this, or you may not. There are many shops that don't have this level of bureaucracy. However, I've never worked for any place that has had this independent of an underlying source code control system (and many places that didn't have source code control systems, let alone change numbers).
The issue can be further complicated if there's been a cross fertilization between projects for things like device drivers. Project A figures out how to do feature Z and project B integrates it. B then figures out Y and project A integrates that. Project C takes code from a data sheet and includes that under license X and Project A then takes it and includes it under license Y and then Project B wants to bring it in, but is unsure if they can because they see substantially similar code under both X and Y licenses, not being aware of the common datasheet code example being present, and gets confused. In situations like this, a clear SCM trail can help sort out who to talk to and how to resolve what might appear to be something bad.
I've seen many organic patches/drivers grow up over the years in Linux for which it is literally impossible to track down who wrote what originally. Some have email addresses, some do not, some have had them removed, some email addresses are stale, etc. In such a chaotic environment, it can be difficult to know where code came from. There are many strengths to this model, but code history isn't one of them.
Warner