Statistics On Free Software projects
GenericBoy writes: "The first edition of The Orbiten Free Software Survey is out online. Some of the stats are number of authors and projects, the top 10 contributing authors, how many MB are in all of the free software projects put together (!) and a bunch more. " Now, as they themselves point out in their Scope and Method, the methodology is crude, and I don't think Orbiten could quite submit it to Nature yet or anything, but it's an interesting bunch of stats.
Yep. I came to the same conclusion. The authors of the survey do a brute force analysis and count whatever name shows up.
So if your name shows up in some file that gets included in a lot of projects, like the C/C++ libraries, you will score very high. That is what put Ulrich Drepper at number 8.
By contrast, I was not able to spot a lot of hard-working folks from the BSD crowd, so the authors of the survey apparently did not scan through a FreeBSD, OpenBSD or NetBSD tree. Even giants like Donald E. Knuth (DEK) did not show up, so TeX was not included either.
What to think of it?
The basic idea is nice: the equivalent of an Open Source top ten. It could appeal to the same people who try to score high on distributed.net or Seti. (But these projects in particular had people show up who increased their scores by illegal methods.)
I do, however, like the idea of being able to look up, a few years from now, what stuff I worked on. But I guess this will need a much improved system.
My conclusion is that these guys had the right idea: the existing body of free code screams to be analyzed. So let's forget that they did it poorly, and let's try to improve things.
First, they should extend their input. An easy way is to scan the contents of the former Walnut Creek FTP server, as it covers a lot of free software. However, one would need to add a lot of other servers too: the major free systems, commercial stuff like Mozilla, and projects from science (there is a lot of free Fortran out there too!).
If anyone is interested in setting up a better attempt, please contact me.
Yep. The author credited is usually the person who wrote the first version of a particular file. This neglects the maintainer and the many people who might advance the state with their patches. All of them, plus web masters, documenters, release and source code repository engineers (maybe I forget a couple of important folks too) deserve credit!
If done properly, patch submitters should be noted in the CVS logs. Some projects (like FreeBSD) record those credits in the commit logs too.
Ergo: scan the cvs trees and not the release packages.
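To make the "scan the CVS trees" suggestion concrete, here is a minimal Python sketch that counts revisions per committer from the text that `cvs log` prints. It assumes the usual per-revision line of the form `date: ...;  author: ...;  state: ...;`. The function name `commits_per_author` and the sample log are made up for illustration, and this is not what the survey itself did. Note that it only sees the committer; a patch submitter named in the log message would still need separate handling.

```python
import re
from collections import Counter

# Matches the per-revision line that `cvs log` prints, for example:
#   date: 2000/03/14 09:26:53;  author: alice;  state: Exp;  lines: +3 -1
REVISION_LINE = re.compile(r"^date: [^;]+;\s+author: ([^;]+);")


def commits_per_author(log_text):
    """Count revisions per committer in the text output of `cvs log`."""
    counts = Counter()
    for line in log_text.splitlines():
        match = REVISION_LINE.match(line)
        if match:
            counts[match.group(1).strip()] += 1
    return counts


# Hypothetical sample, shaped like real `cvs log` output.
SAMPLE_LOG = """\
revision 1.3
date: 2000/03/14 09:26:53;  author: alice;  state: Exp;  lines: +3 -1
Fix off-by-one; patch submitted by bob@example.org
----------------------------
revision 1.2
date: 2000/02/01 17:02:10;  author: carol;  state: Exp;  lines: +10 -2
Add option parsing
----------------------------
revision 1.1
date: 2000/01/05 08:00:00;  author: alice;  state: Exp;
Initial revision
"""

if __name__ == "__main__":
    for author, revisions in commits_per_author(SAMPLE_LOG).most_common():
        print(author, revisions)
```

In the sample, bob's patch is credited only in the free-text log message, which is exactly the kind of contribution a header grep over release tarballs would miss.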
- Oh, yeah? You have the source. Write it yourself, you moron!
- QT/GTK is for idiots.
- Apple is so stupid. If they open-sourced everything we'd fix it for them.
- M$ code is terrible.
- Why isn't Company X open-sourcing their product? Proprietary software is evil!
- Free software project X sucks.
or such things, be expected to link to this site showing exactly how much they've contributed. Although, given that the study has managed to overlook my insignificant but non-zero contributions, maybe I shouldn't propose that.
Yeah, I'm on that list! Right at position 771 AND 772!
:)
:-/
:(
What!? They counted me TWICE? Once as tord.jansson@swipnet and then later as tord.jansson... hm... 248447 bytes for each of them... Hm, seems like they somehow counted me twice but with the SAME value or maybe they somehow split it in half.
Let's click on my name and see what projects they have mentioned me participating in, should be just BladeEnc... What!? makeMP3.codd!!! What the heck is THAT program!? Hm, I see... got to be some kind of frontend that has included the BladeEnc code...
Feels a bit odd getting credited for a program I don't know anything about, but still kind of okay...
On the other hand, I wonder how they came up with 248447 bytes; the BladeEnc code is about 1.5 meg.
But then again, it wouldn't be fair to credit me for more anyway since BladeEnc is so heavily based on the original ISO code and the other BladeEnc contributors haven't gotten any credits since they're just mentioned on the homepage.
Guess this shows how far from precise this study is. A good attempt to measure something quite immeasurable, though. Kudos to all the people who must have put an awful lot of work into this; I hope you can get something useful out of the big picture, although the small details are terribly wrong.
Tord Jansson
BladeEnc Creator
I think this would look quite different if they did the survey on something like Debian.
--
Secondly, most of this community, by its very nature, is distributed, decentralized, and hard to account for. That's not a coincidence - many of us like remaining anonymous.. the man behind the scenes. As anecdotal evidence, look at the .sig blocks on slashdot - how many famous people note their OSS accomplishments in their sig? Very few. And as Linus himself said.. it's not like girls are throwing their underwear at him. Many people don't *want* to be counted.. an anonymous patch here and there is sufficient.. "I just want it to work".
So before people start using this report as a metric of people's contributions, remember two things: Even small contributions count, and this is an inclusive rather than exclusive community - you are welcome here whether you contribute source or not. People who write documentation, help the newbies, and convince management to put their company printers on Linux (3Com anyone?) ought to be commended too. There's a lot more here than code!
In general the handling of large packages such as KDE seems fairly poor. For example, KDE apparently has no authors according to the by-project listing. I think this is a great idea, but it needs a cleaner source of data. For example, Coolo has been able to give some very interesting and detailed figures by running scripts on the KDE CVS repository. Perhaps this is the sort of thing they need to be using as the initial data set from which they make their analysis.
Rich.
on the other hand, the collection of the data -- if it can be arranged in some meaningful manner and then processed in a reasonable way that will yield thoughtful conclusions -- is no small task and rishab and his associates should be applauded for the hard work they did on that portion of the project. i, for one, would be glad to work with them to try to pull out some meaningful reports from their well-meaning but, i think, misfiring project.
Paul Jones
Certified Black Helicopter Pilot *** Unwitting Dupe of One World Gov'ment
Losing key staff is no longer the exclusive realm of corporations. It sort of surprises me to see this argument brought up in the context of open software! :-)
Absolutely! What is more, losing "key staff" in an open-source project is generally much less devastating than it is in a closed-source context, as open-source by its very nature tends to distribute expertise on a given project much more widely.
For example, early in the Linux Years (pre 1.0) the guy (I forget his name) who did a lot of the early networking work abandoned Linux to its own devices, largely due to being flamed for not having written the perfect, most elegant implementation in his first iteration. Another took over that aspect, the kernel lived on, development moved forward, and Linux is now a raging success. The loss of a very key developer caused hardly a hiccup in development (though an awful lot of discussion, flamage, and doomsday saying).
kNFS was abandoned for almost a year, which caused me and others a number of headaches in dealing with Linux NFS (and is probably the reason why Linux NFS lags behind the BSDs and commercial UNIXen in performance). That having been said, it was picked up and is being actively developed, with NFSv3 support in the 2.4-pre kernels. This is probably the best "worst case" or at least "very bad case" example of an open source project being abandoned one can find, at least in the Linux area of endeavor.
Abandonment of a project can lead to some delay (as with NFS), but as often as not the delay is minimal (gimp, Linux networking) as another active developer takes over. I would submit that delays in closed-source commercial applications are much more common and typically much more lengthy.
Finally, with open source the project will always be picked up and continued by someone, as long as there is any interest. Contrast this to many closed-source products which are orphaned, leaving developers and users in a serious bind which they can do nothing about, other than remapping their entire engineering or corporate strategy to a completely new, competing product, at great cost in time and money. In the worst case open-source scenario, such a customer would have to finance and perform ongoing development and maintenance themselves, which would often be a less expensive solution than the alternatives. Having said that, I do not know of a single open-source project where anyone was compelled to do this. I do know of a number of orphaned, closed-source products which left consumers in a terrible bind, from bitter, personal experience.
Our solution, which has to date saved us tens of thousands of dollars and hundreds of developer hours in cost, was to move to an open source platform (Linux and FreeBSD) and require open source libraries to be used wherever possible, limiting our exposure to the orphaning of closed-source products.
The Future of Human Evolution: Autonomy
The discussion points out some interesting facts about why some individuals are listed as big contributors (such as the author of libtool. Duh.) and why some aren't listed at all. They even have some comments from the developers of the survey.
And I just love the comment of Havoc Pennington:
Good to see something like this. However, I have to admit, it's a little bit of a letdown. I've got 10MB worth of gear in Red Hat 6.1, but my name didn't show up anywhere. Yes, yes, I know, it's not code, Bowie..Heh
Bowie J. Poag
Wonderful point - and I hope folks that are in the less than 1% crowd don't quit either! Even finding and fixing one line of code is a blessing.
Heck, as I sit here now I have found three lines of code I need to put in this program I am writing where I did not clean up my linked list. Argh! No wonder the original app has had a tendency to crash over the past 3 years.
The small stuff is as big as the big stuff.
Well, maybe not quite that much attention. We don't need kiddies who wouldn't know C++ from Excel macros checking in millions of lines of garbage into any open CVS.
As for number of projects, potato has 4376 packages, not all of those are separate projects (some are from multi-binary source, some are task packages), but I'm rather sure more than 3149 of them are :)
He succeeded in writing the exact same size of code in numerous projects:
Interesting results, and certainly the numbers involving lines of code per project are probably accurate.
However, glancing through a project that I'm the primary author on shows me as the 24th on the list of developers for it, having written 585 bytes. I suspect I've written a few more than that.
The top of the list was dominated by a mailing list address that isn't even correct. The second name on the list was the UC Regents, who own the copyright (but certainly their lawyers didn't write the code).
And judging by the other comments, I suspect that the majority of their data is similarly way off. I wonder if they even tested the tool they developed on a few randomly selected projects to see how accurate the results were. They didn't even perform the most obvious data collection method I can think of: "cvs annotate".
I like the study, but I'd sure like to see it done better.
The next site to slashdot will be ready soon, but subscribers can beat the rush and start slashdotting it early!
I looked at the algorithm used to determine how they collected the names of contributors. They grepped e-mail addresses, rcs ids, and copyright info from various files. I don't think that's the best way to draw any useful conclusions in regards to Open Source software. The only real conclusion found here is that Open Source projects include a lot of code written by other people. That's trivial. This study fails to make a distinction between an active contributor and someone whose code was simply borrowed. This is an important distinction to make! For instance, what if I were to take 1000 physics homework assignments and search for "F=ma" in them. I can't assume that the appearance of "F=ma" on your paper means that Newton helped you with your homework. I can only assume that you used Newton's second law of motion to help you solve the problem.
Similarly, if you wanted to determine who the most prolific scientific researcher is in a field, would you gather data by simply grepping for names in the texts of papers? No, you'll skew the data by counting the names who appear in the paper's "References" when you should just be counting the actual investigators who are listed as the authors of the paper!
I would like to see this study repeated but making the distinction between an active contributor to a project and someone whose code was simply included. Only then would a top-heavy distribution suggest anything meaningful in regards to OSS authorship.
If anyone has looked at the CODD algorithms/code and can show me if they used a more sophisticated method to filter out authors with no active involvement in a project, please post. It's a difficult problem to infer who actively and who passively contributed to a project with just a Perl script.
I noticed on the PostgreSQL Hackers list that Thomas Lane said this was very bogus because it appears to re-include his libjpeg as many times as it is used by something else.
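One cheap way to stop a bundled library from being credited once per project that ships it is to hash file contents and count each distinct file only once. The Python sketch below is only an illustration of that idea: the function `credit_bytes` and the `(project, author, content)` input shape are assumptions made here, not anything taken from the CODD code, and identical bytes are treated as "the same borrowed file", so a locally patched copy would still be counted again.

```python
import hashlib
from collections import defaultdict


def credit_bytes(files):
    """Credit each distinct file's bytes to its author exactly once.

    `files` is an iterable of (project, author, content) tuples, where
    `content` is the raw bytes of one source file (hypothetical input shape).
    """
    seen_digests = set()
    totals = defaultdict(int)
    for _project, author, content in files:
        digest = hashlib.sha256(content).hexdigest()
        if digest in seen_digests:
            # Same bytes already credited: a bundled copy, not new work.
            continue
        seen_digests.add(digest)
        totals[author] += len(content)
    return dict(totals)


if __name__ == "__main__":
    jpeg_source = b"x" * 100  # stand-in for one libjpeg source file
    files = [
        ("libjpeg", "tgl", jpeg_source),
        ("viewer", "tgl", jpeg_source),   # bundled copy
        ("browser", "tgl", jpeg_source),  # bundled copy
        ("viewer", "dev", b"main" * 10),
    ]
    print(credit_bytes(files))
```

This does not answer who actively worked on a project, but it does remove the multiplier that comes purely from being included elsewhere.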
Also, is FSF an Author? Is BSD an Author?
Andrew.
OTOH, it's nice to see some sort of a start at studying the free software community...
"You can never have too many elephants on your team."
"Windows, measured in man-hours, is the single greatest engineering project in the history of humanity."
;)
hmmm... I wonder how many man-hours went into the pyramids and the great wall... Any of you engineers wanna venture an estimate on the G.W.? I think the ancient Chinese beat MS hands down.
Did the original poster even *mention* Linux? Linux is not the same thing as Open Source.
Free software was not a "rational choice" in 1984, if by rational you mean The Best Tool For The Job. If everyone only cared about using the best toolset, gcc would not have been written and none of this open-source explosion would have happened. Your use of the word "rational" suggests the original poster's view is crazy. Well, remember that this whole shebang has been made possible by a man who is "crazy", in the sense of not always wanting to use the short-term best tool for the job.
I agree with your point, that the use of Excel does not detract from this study at all. You're also right about misuse of the word "ironic". Please don't misuse the word "rational".
perl -e 'fork||print for split//,"hahahaha"'
They list their sources as follows:
Debian would have been a more sensible distro to use, because it is overflowing with (packages|crap). Red Hat (presumably) just ship the ones which it makes commercial sense to ship, whereas Debian has everything that anyone's bothered to include whether it's useful or not. For example, Cooledit (my favourite text editor) is missing from the survey. The only problem with Debian would be stuff missing because it is not DFSG-free. Such stuff is available in the non-free/ directory but it's probably not as comprehensive as the main/ directory is.
Having said that, it's very interesting to see what they have got. I didn't know Andrew Tridgell did all that stuff, for example. This could be a good tool for the community to get to know people better.
perl -e 'fork||print for split//,"hahahaha"'
ESR had a colloquium at Cornell a while ago and I brought up Nikolai Bezroukov's critique of his CatB, which he loudly discredited. I wish this survey had come out earlier...I would like to ask him to comment on these statements:
"The top 1271 authors, 10% of the total, accounted for 72.3% of the total code base. The top 10 authors alone (0.08% of the total) are credited for 19.8% of the code base. Free software development may be distributed, but it is most certainly very top heavy."
"Our conclusion: Free software development is less a bazaar of several developers involved in several projects, more a collation of projects developed single mindedly by a large number of authors."
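For anyone who wants to check "top heavy" claims like the ones quoted above against their own numbers, the share held by the top fraction of authors is easy to compute from a list of per-author byte counts. This is a minimal Python sketch; the function name `top_share` and the toy data are invented for illustration, and the survey's real per-author figures are not reproduced here.

```python
def top_share(byte_counts, fraction):
    """Return the share of all bytes credited to the top `fraction` of authors.

    `byte_counts` holds one non-negative number per author;
    `fraction` is between 0 and 1 (0.10 means the top 10% of authors).
    """
    ranked = sorted(byte_counts, reverse=True)
    # Always include at least one author, even for tiny fractions.
    top_n = max(1, round(len(ranked) * fraction))
    return sum(ranked[:top_n]) / sum(ranked)


if __name__ == "__main__":
    # Toy data: five authors, one of whom wrote half of the code.
    toy_counts = [50, 30, 10, 5, 5]
    print(top_share(toy_counts, 0.2))

    # Sanity check on the quoted percentages, using the 12706 authors
    # mentioned elsewhere in this discussion.
    print(round(1271 / 12706 * 100, 1), round(10 / 12706 * 100, 2))
```

Of course, the result is only as good as the byte counts fed in, which is the whole complaint in this thread.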
The question from Bezroukov's paper I didn't bring up was that open source projects look much more cathedralesque and hierarchical as one moves up. E.g., not just anybody gets patches put right in to the Linux or *BSD kernel.
It's 10 PM. Do you know if you're un-American?
What I find most interesting by far is the composition of the contributions when viewed by project. In nearly every project I viewed, there are two or three elite "key contributors" who provide something on the order of 1/3 to 7/10 or more of the code, with the remainder provided by a slew of sub-1% coders.
This relates an interesting story. It appears that, while the real strength of OSS is incremental improvement over time, few projects can exist without a guiding intellect or a handful of ambitious coders on the core team.
Presenting this data to employers who are concerned about losing control of their code may help assuage their fears of open source. Clearly projects that are "owned" by no one are rarities. A corporation *can* have its cake and eat it too.
-konstant
Yes! We are all individuals! I'm not!
12706 developers working several years on 3149 projects, and they've still produced fewer lines of code than a single release of Win2K... is this because Open Source is more efficient, less feature-rich, or because it doesn't carry the burden of backwards compatibility with DOS 1.0?
"Freedom means freedom for everybody" -- Dick Cheney
anyway, the point is that stats can be used to lie, but equally they can be used to extract the truth. For example much of modern materials science is based on statistics. Likewise economic forecasting techniques. Stats aren't always bad, it's just that they can be misused.
"The new wave is not value-added; it's garbage-subtracted" - Esther Dyson, Dec 1994