Slashdot Mirror


Statistics On Free Software projects

GenericBoy writes: "The first edition of The Orbiten Free Software Survey is out online. Some of the stats are number of authors and projects, the top 10 contributing authors, how many MB are in all of the free software projects put together (!) and a bunch more." Now, as they themselves point out in their Scope and Method, the methodology is crude, and I don't think Orbiten could quite submit it to Nature yet or anything, but it's an interesting bunch of stats.

32 of 93 comments

  1. Re:Pretty Bogus by mvw · · Score: 2
    I noticed on the PostgreSQL Hackers list that Thomas Lane said this was very bogus because it appears to re-include his libjpeg as many times as it is used by something else.

    Yep. I came to the same conclusion. The authors of the survey do a brute force analysis and count whatever name shows up.

    So if you manage to show up on some file that gets included in a lot of projects, like the C/C++ libraries, you will score very high. That is what put Ulrich Drepper on number 8.
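
    A minimal sketch of that double counting, and of one way around it. The input shape (project, author, file contents) and the "credit each distinct file only once" rule are assumptions made for illustration, not the survey's actual method:

```python
import hashlib
from collections import defaultdict

def credit_bytes(files, dedupe=True):
    """files: iterable of (project, author, content_bytes) tuples (hypothetical shape).
    Returns {author: credited_bytes}."""
    seen = set()
    totals = defaultdict(int)
    for project, author, content in files:
        digest = hashlib.sha1(content).hexdigest()
        if dedupe and digest in seen:
            continue  # identical file already credited via another project
        seen.add(digest)
        totals[author] += len(content)
    return dict(totals)

# Made-up data: one library file bundled into three projects.
shared = b"/* shared library source */" * 10
files = [
    ("projA", "lib-author", shared),
    ("projB", "lib-author", shared),
    ("projC", "lib-author", shared),
    ("projA", "app-author", b"int main(void) { return 0; }"),
]
naive = credit_bytes(files, dedupe=False)   # library author credited three times
deduped = credit_bytes(files, dedupe=True)  # library author credited once
```

    Hashing whole files only catches verbatim copies; a bundled library that was patched locally would still be counted again.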

    On the contrary I was not able to spot a lot of hard working folks from the BSD crowd. So the authors of the survey did not scan through a FreeBSD, OpenBSD or NetBSD tree. Even giants, like Donald E. Knuth (DEK) did not show up. So TeX was not included either.

    What to think of it?
    The basic idea is nice, the equivalent of an Open Source top ten. It could appeal to the same people who try to score high on distributed.net or Seti. (But especially these projects had people show up who increased their scores by illegal methods.)

    I do, however, like the idea of being able, a few years from now, to look up what stuff I worked on. But I guess this will need a much improved system.

    My conclusion is these guys had the right idea, that the existing body of free code screams to be analyzed. So let's forget that they did it poorly, and let's try to improve things.

    At first they should extend their input; an easy way is to scan the contents of the former Walnut Creek ftp server, as it covers a lot of free software. However, one would need to add a lot of different servers too: the major free systems, commercial stuff like mozilla, projects from science (there is a lot of free Fortran out there too!).

    If anyone is interested in setting up a better attempt, please contact me.

  2. Re:The figures need a lot of work by mvw · · Score: 2
    When I showed this URL to my family, the reaction was "wait a sec! Bottomfeeders? Isn't that a bit derogative?". It took quite some explaining to make it clear that it was the culmination of what I've done over the years: I've joined the hordes of folks who, by submitting small patches, fixes, bits of functionality, have made the difference between making Open Source a hobby of a select few, and making it a (possibly) useful tool.

    Yep. The author credited is usually the person who wrote the first version of a particular file. This neglects the maintainer and the many people who might advance the state with their patches. All of them, plus web masters, documenters, release and source code repository engineers (maybe I forget a couple of important folks too) deserve credit!

    If done properly, patch submitters should be noted in the CVS logs. Some projects (like FreeBSD) route those comments into commit logs too.

    Ergo: scan the cvs trees and not the release packages.

  3. Here's how to establish credibility by Otter · · Score: 3
    Might I propose that from now on, Slashdot posters saying:

    • Oh, yeah? You have the source. Write it yourself, you moron!
    • QT/GTK is for idiots.
    • Apple is so stupid. If they open-sourced everything we'd fix it for them.
    • M$ code is terrible.
    • Why isn't Company X open-sourcing their product? Proprietary software is evil!
    • Free software project X sucks.
    or such things, be expected to link to this site showing exactly how much they've contributed.

    Although, given that the study has managed to overlook my insignificant but non-zero contributions, maybe I shouldn't propose that.
  4. Hey! I'm on that list! :) by Tord · · Score: 2

    Yeah, I'm on that list! Right at position 771 AND 772!

    What!? They counted me TWICE? Once as tord.jansson@swipnet and then later as tord.jansson... hm... 248447 bytes for each of them... Hm, seems like they somehow counted me twice but with the SAME value or maybe they somehow split it in half.

    Let's click on my name and see what projects they have mentioned me participating in, should be just BladeEnc... What!? makeMP3.codd!!! What the heck is THAT program!? Hm, I see... got to be some kind of frontend that has included the BladeEnc code...

    Feels a bit odd getting credited for a program I don't know anything about, but still kind of okay... :)

    On the other hand, I wonder how they came up with 248447 bytes, the BladeEnc code is about 1.5 meg :-/

    But then again, it wouldn't be fair to credit me for more anyway since BladeEnc is so heavily based on the original ISO code and the other BladeEnc contributors haven't gotten any credits since they're just mentioned on the homepage. :(

    Guess this shows how far from precise this study is. A good attempt to measure something quite immeasurable though. Kudos to all the people who must have put an awful lot of work into this, and I hope you can get something useful out of the big picture although the small details are terribly wrong.

    Tord Jansson
    BladeEnc Creator

  5. Based on Redhat by doomy · · Score: 2

    I think this would be quite different if they did the survey on something like Debian.
    --
    ...free your source and the rest would follow...
  6. Error rates by Signal+11 · · Score: 2
    Given the nature of this community, I suspect this is more of a "tip of the iceberg" sample, and has a high error rate. There's alot of projects that helped create (enable for you buzzword people!) more projects - I doubt many people would have gotten in on the free software scene if not for the GNU C Compiler. Comparing authors by quantity instead of quality is a poor way of judging progress. So take this report with a grain of salt - they make no claims of this being comprehensive or telling, and neither should you. Already I see people proclaiming that this is the metric by which contributors should be judged. Sigh.

    Secondly, most of this community, by its very nature, is distributed, decentralized, and hard to account for. That's not a coincidence - many of us like remaining anonymous.. the man behind the scenes. As anecotal(sp?) evidence look at the .sig blocks on slashdot - how many famous people note their OSS accomplishments in their sig? Very few. And as Linus himself said.. it's not like girls are throwing their underwear at him. Many people don't *want* to be counted.. an anonymous patch here and there is sufficient.. "I just want it to work".

    So before people start using this report as a metric of people's contributions, remember two things: Even small contributions count, and this is an inclusive rather than exclusive community - you are welcome here whether you contribute source or not. People who write documentation, help the newbies, and convince management to put their company printers on linux (3Com anyone?) ought to be commended too. There's alot more here than code!

    1. Re:Error rates by generic-man · · Score: 2

      "alot" is a verb.

      Dictionary.com doesn't even give it that much credit. It's an acronym.

      --
      For more information, click here.
  7. The figures need a lot of work by Rich · · Score: 4
    I checked out the stats for some apps I've written and I found they are way out. For example the analysis of kgui gives me 52.789% of the code despite the fact I am the sole author!

    In general the handling of large packages such as KDE seems fairly poor. For example, KDE apparently has no authors according to the by-project listing. I think this is a great idea, but it needs a cleaner source of data; for example, Coolo has been able to give some very interesting and detailed figures by running scripts on the KDE CVS repository. Perhaps this is the sort of thing they need to be using as the initial data set from which they make their analysis.

    Rich.

  8. What's wrong with this survey and why by pjones · · Score: 2
    As one of the authors of a similar but more focused report on the developer community, let me point out a few of the problems with this piece of work.
    • pooling of very unlike data - that is, mixing apples and oranges of communities in such a way that individual creators of smaller projects are mixed with sophisticated, complex projects like Apache and the kernel
    • inconsistent data gathering - as pointed out in other messages, whilst claiming to represent everything, a collection of over 4K projects is missing (LSM projects, which we looked at)
    • gross analysis of data - that is, not trying to understand which data means what, so that licenses are mixed with authors
    • more is more fallacy - that is, saying that "we counted a lot, so we learned a lot"; smart and focused sampling is always better and tells you more
    • gotta read more to tell you more


    on the other hand, the collection of the data -- if it can be arranged in some meaningful manner and then processed in a reasonable way that will yield thoughtful conclusions -- is no small task and rishab and his associates should be applauded for the hard work they did on that portion of the project. i, for one, would be glad to work with them to try to pull out some meaningful reports from their well-meaning but, i think, misfiring project.


    Paul Jones

    --
    Certified Black Helicopter Pilot *** Unwitting Dupe of One World Gov'ment
  9. This has happened with Linux more than once by FreeUser · · Score: 2

    Losing key staff is no longer the exclusive realm of corporations. It sort of surprises me to see this argument brought up in the context of open software! :-)

    Absolutely! What is more, losing "key staff" in an open-source project is generally much less devastating than it is in a closed-source context, as open-source by its very nature tends to distribute expertise on a given project much more widely.

    For example, early in the Linux Years (pre 1.0) the guy (I forget his name) who did a lot of the early networking work abandoned Linux to its own devices, largely due to being flamed for not having written the perfect, most elegant implementation in his first iteration. Another took over that aspect, the kernel lived on, development moved forward, and Linux is now a raging success. The loss of a very key developer caused hardly a hiccup in development (though an awful lot of discussion, flamage, and doomsday saying).

    kNFS was abandoned for almost a year, which caused myself and others a number of headaches in dealing with Linux NFS (and is probably the reason why Linux NFS lags behind the BSDs and commercial UNIXen in performance). That having been said, it was picked up, is being actively developed, with NFS V 3 support in the 2.4-pre kernels. This is probably the best "worst case" or at least "very bad case" example of an open source project being abandoned one can find, at least in the Linux area of endeavor.

    Abandonment of a project can lead to some delay (as with NFS), but as often as not the delay is minimal (gimp, Linux networking) as another active developer takes over. I would submit that delays in closed-source commercial applications are much more common and typically much more lengthy.

    Finally, with open source the project will always be picked up and continued by someone, as long as there is any interest. Contrast this to many closed-source products which are orphaned, leaving developers and users in a serious bind which they can do nothing about, other than remapping their entire engineering or corporate strategy to a completely new, competing product, at great cost in time and money. In the worst case open-source scenario, such a customer would have to finance and perform ongoing development and maintenance themselves, which would often be a less expensive solution than the alternatives. Having said that, I do not know of a single open-source project where anyone was compelled to do this. I do know of a number of orphaned, closed-source products which left consumers in a terrible bind, from bitter, personal experience.

    Our solution, which has to date saved us tens of thousands of dollars and hundreds of developer hours in cost, was to move to an open source platform (Linux and FreeBSD) and require open source libraries to be used wherever possible, limiting our exposure to the orphaning of closed-source products.

    --
    The Future of Human Evolution: Autonomy
  10. Discussion on Advogato by Carl · · Score: 5
    This was already discussed on Advogato yesterday.

    The discussion points out some interesting facts about why some individuals are listed as big contributors (such as the author of libtool. Duh.) and why some aren't listed at all. They even have some comments from the developers of the survey.

    And I just love the comment of Havoc Pennington:

    It shows me as a major contributor to "gnuclear" and nothing else - I don't even know what gnuclear is. ;-)
  11. Finally, a good Slashdot article.. by Bowie+J.+Poag · · Score: 2

    Good to see something like this. However, I have to admit, it's a little bit of a letdown. I've got 10MB worth of gear in Red Hat 6.1, but my name didn't show up anywhere. Yes, yes, I know, it's not code, Bowie..Heh


    Bowie J. Poag

    --
    Bowie J. Poag

    1. Re:Finally, a good Slashdot article.. by Bowie+J.+Poag · · Score: 2

      Business must be pretty slow at VA for you to be able to spend your day trolling on Slashdot. Given VA's recent stock price, I can't say I'm all that surprised.

      FYI, I wasn't whining, dippy. I just find it interesting that this study ignored non-code based contributions to Linux.

      Go back to work, goon.

      Bowie J. Poag

      --
      Bowie J. Poag

  12. the <1%'ers by Randy+Rathbun · · Score: 2

    Wonderful point - and I hope folks that are in the less than 1% crowd don't quit either! Even finding and fixing one line of code is a blessing.

    Heck, as I sit here now I have found three lines of code I need to put in this program I am writing where I did not clean up my linked list. Argh! No wonder the original app has had a tendency to crash over the past 3 years.

    The small stuff is as big as the big stuff.

    1. Re:the <1%'ers by dsplat · · Score: 2
      Wonderful point - and I hope folks that are in the less than 1% crowd don't quit either! Even finding and fixing one line of code is a blessing.


      I fully agree. And there is an important point that shouldn't be missed. The top author, FSF, is not only not a single person, but because of copyright assignments, it isn't even really a single organization. The FSF has been a valuable member of the free software community for a long time. In fact, arguably, free software might not exist as a viable force today without it. But that doesn't make the FSF a single contributor.

      I know that there are some files out there with an FSF copyright on them that I wrote. I don't begrudge them the copyright assignment. They have taken the stewardship of the projects that I contributed those files to. For the sub one percent group, of which I am one, don't ever forget that our strength lies in both numbers and diversity. Jon Bentley quoted someone in his Programming Pearls chapter entitled Bumper Sticker Computer Science:

      Each new user discovers a new class of bugs.


      It would be easy enough to expand that to cover all of the relevant things that a new set of eyes bring to a free software project: new hardware configurations, a new language, new data.... But the original quote stands alone quite well.

      To each and every contributor of code, bug reports, feature requests, reviews, documentation, translations, or anything else, I offer my thanks. The most obvious evidence that you are needed is that you made a contribution. You did what no one else did.
      --
      The net will not be what we demand, but what we make it. Build it well.
  13. Ok kiddies, boost your ranking by hanway · · Score: 2
    As suspect as the data is, it would be nice if people were inspired to develop more free software and pay as much attention to their position on this list as they do to their seti@home rank.

    Well, maybe not quite that much attention. We don't need kiddies who wouldn't know C++ from Excel macros checking in millions of lines of garbage into any open CVS.

  14. Re:Lines of code by rcw-work · · Score: 2
    Actually, it's because they didn't run it over enough stuff - Debian potato alone has around 218 million lines of code (compare to slink's 70 million).

    As for number of projects, potato has 4376 packages, not all of those are separate projects (some are from multi-binary source, some are task packages), but I'm rather sure more than 3149 of them are :)

  15. Completely false statistics by Mop · · Score: 2
    Did you have a chance to have a look at the stats for the biggest individual contributor, namely Gordon Matzigkeit?

    He succeeded in writing the exact same size of code in numerous projects:

    • 35489 bytes of code in 70 different projects (zzplayer xpdf XCGI qbrew pilot-link outguess lxandria981105 lmemory lletters libjpeg-6b LAPACK_D ky kwintv kwebwatch kvoicecontrol kvoctrain kvncviewer kvideogen ktimeclock ksniff ksnes9x ksnapshot kshow ksendmail kreglo kprima kplot3d kpl kpilot-3.1b9 kpasman kover komba knetstart knetdump knc kmud kmp3info kmp3 kmol kmodem kmap kluach klm KKinit kishido kircpoker kinst khotkeys khealthcare kgui kfstab kfibs keasyisdn keasycd kdiskcat kdict kcmpgp kblinsel kbind kBeroFTPD jukebox3.2-pre6 jpegsrc.v6b harnmaker gsynth gpgp gettext gdbm freetype cgicq arts).
    • 52144 bytes in 32 different projects (no list, you understand the idea).
    • 54697 bytes in 31 different projects
    • 45401 bytes in 29 different projects
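
    Repeats like these can be flagged mechanically. A minimal sketch, assuming one author's credits are available as (project, byte count) pairs; that input shape and the "hypothetical-app" entry are made up, while the other three project names and the 35489 figure come from the list above:

```python
from collections import defaultdict

def repeated_sizes(credits, min_projects=2):
    """credits: iterable of (project, byte_count) pairs for ONE author (hypothetical shape).
    Returns {byte_count: [projects]} for counts that recur in min_projects or more projects."""
    by_size = defaultdict(list)
    for project, nbytes in credits:
        by_size[nbytes].append(project)
    return {n: sorted(p) for n, p in by_size.items() if len(p) >= min_projects}

credits = [
    ("xpdf", 35489),
    ("gettext", 35489),
    ("gdbm", 35489),
    ("hypothetical-app", 1200),  # invented entry, for contrast
]
suspects = repeated_sizes(credits)
```

    An identical byte count across many projects suggests the same bundled file being credited over and over, rather than fresh code written for each project.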
  16. How much of their data is way off? by hardaker · · Score: 2

    Interesting results, and certainly the numbers involving lines of code per project are probably accurate.

    However, glancing through a project that I'm the primary author on shows me as the 24th on the list of developers for it, having written 585 bytes. I suspect I've written a few more than that.

    The top of the list was dominated by a mailing list address that isn't even correct. The second name on the list was the UCRegents, who own the copyright (but certainly their lawyers didn't write the code).

    And judging by the other comments, I suspect that the majority of their data is similarly way off. I wonder if they even tested the tool they developed on a few randomly selected projects to see how accurate the results were. They didn't even perform the most obvious data collection method I can think of: "cvs annotate".

    I like the study, but I'd sure like to see it done better.

    --
    The next site to slashdot will be ready soon, but subscribers can beat the rush and start slashdotting it early!
  17. Active vs. Passive OSS Participation by SwissPope · · Score: 3

    I looked at the algorithm used to determine how they collected the names of contributors. They grepped e-mail addresses, rcs ids, and copyright info from various files. I don't think that's the best way to draw any useful conclusions in regards to Open Source software. The only real conclusion found here is that Open Source projects include a lot of code written by other people. That's trivial. This study fails to make a distinction between an active contributor and someone whose code was simply borrowed. This is an important distinction to make! For instance, what if I were to take 1000 physics homework assignments and search for "F=ma" in them. I can't assume that the appearance of "F=ma" on your paper means that Newton helped you with your homework. I can only assume that you used Newton's second law of motion to help you solve the problem.

    Similarly, if you wanted to determine who the most prolific scientific researcher is in a field, would you gather data by simply grepping for names in the texts of papers? No, you'll skew the data by counting the names who appear in the paper's "References" when you should just be counting the actual investigators who are listed as the authors of the paper!

    I would like to see this study repeated but making the distinction between an active contributor to a project and someone whose code was simply included. Only then would a top-heavy distribution suggest anything meaningful in regards to OSS authorship.

    If anyone has looked at the CODD algorithms/code and can show me if they used a more sophisticated method to filter out authors with no active involvement in a project, please post. It's a difficult problem to infer who actively and who passively contributed to a project with just a perl script.
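
    To make the problem concrete, here is a minimal sketch of the naive grep-style attribution described above. The e-mail regex and the rule of splitting a file's size evenly among every address found in it are assumptions for illustration, not CODD's actual algorithm:

```python
import re
from collections import defaultdict

# Deliberately simple e-mail pattern; real addresses can be messier.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")

def naive_credit(sources):
    """sources: {filename: text}. Splits each file's size evenly among
    the e-mail addresses found in it (an assumed rule)."""
    totals = defaultdict(float)
    for text in sources.values():
        found = sorted(set(EMAIL.findall(text)))
        for addr in found:
            totals[addr] += len(text) / len(found)
    return dict(totals)

# Made-up file: bob maintains it, alice is only cited as the origin of one routine.
sources = {
    "main.c": (
        "/* maintainer: bob@example.org */\n"
        "/* hash routine borrowed from alice@example.org */\n"
        "int main(void) { return 0; }\n"
    ),
}
totals = naive_credit(sources)
```

    Both addresses end up with the same credit, although only one of them belongs to someone actively working on the file. Nothing in the text alone tells the script which is which.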

    1. Re:Active vs. Passive OSS Participation by driehuis · · Score: 2
      Similarly, if you wanted to determine who the most prolific scientific researcher is in a field, would you gather data by simply grepping for names in the texts of papers?

      Hmmm, this reminds me of the infamous Quotation Index used in the scientific world. Back when I studied sociology, a professor of mine would spend five minutes each college blasting the practice. As it turned out, a number of his colleagues were quoting each other, thereby bumping each other's ratings. "On the effects of offering free ballpoints to interviewees", being referenced by an article on "A critical review of free ballpoints", referenced by the rebuttal, ad nauseam.

      Doesn't it strike a familiar note in a forum driven by mechanically established karma?

      --

      Bert Driehuis -- All I asked was a friggin' rotatin' chair. Throw me a bone here, people.

  18. Pretty Bogus by HeatherMax · · Score: 2

    I noticed on the PostgreSQL Hackers list that Thomas Lane said this was very bogus because it appears to re-include his libjpeg as many times as it is used by something else.

    Also, is FSF an Author? Is BSD an Author?

    --
    Andrew.
  19. Well... kinda... by El+Volio · · Score: 3
    Yeah, the FSF came out way on top, with Sun and the UCB regents not far behind. OK, but is it really fair to compare them to individuals like Gordon Matzigkeit, et al? I'm not familiar with any of the individuals, but it would seem to me that each of them deserves far more credit.

    OTOH, it's nice to see some sort of a start at studying the free software community...

    --

    "You can never have too many elephants on your team."

    1. Re:Well... kinda... by technos · · Score: 2

      Funny, but dead on..

      Perhaps we should help them with a more intelligent 'author filter', and a better FM source snagger. It's obvious that Mr. Matzigkeit didn't belong that high up on the list, and other entities like UCB are overrepresented as well. Most everything *BSD carries the Berkeley name, regardless of author!!

      --
      .sig: Now legally binding!
  20. Our greatest achievement: Win2000 or Great Wall? by Sun+Tzu · · Score: 2

    "Windows, measured in man-hours, is the single greatest engineering project in the history of humanity."

    hmmm... I wonder how many man-hours went into the pyramids and the great wall... Any of you engineers wanna venture an estimate on the G.W.? I think the ancient Chinese beat MS hands down. ;)

  21. Re:foo22, foo1 by divec · · Score: 2
    you are a wanker.
    a linux wanker.

    Did the original poster even *mention* Linux? Linux is not the same thing as Open Source.


    it's people like you that prevent open-source software from being adopted for serious purposes because you're constantly advocating it even when it is not a rational choice.

    Free software was not a "rational choice" in 1984, if by rational you mean The Best Tool For The Job. If everyone only cared about using the best toolset, gcc would not have been written and none of this open-source explosion would have happened. Your use of the word "rational" suggests the original poster's view is crazy. Well, remember that this whole shebang has been made possible by a man who is "crazy", in the sense of not always wanting to use the short-term best tool for the job.


    I agree with your point, that the use of Excel does not detract from this study at all. You're also right about misuse of the word "ironic". Please don't misuse the word "rational".

    --

    perl -e 'fork||print for split//,"hahahaha"'

  22. They didn't look in the best place by divec · · Score: 4

    They list their sources as follows:


    • RedHat Linux v6.1 source rpms
    • Linux kernel sources version 2.2.14
    • Munitions cryptography/security archive
    • An un-random half of Freshmeat

    Debian would have been a more sensible distro to use, because it is overflowing with (packages|crap). Red Hat (presumably) just ship the ones which it makes commercial sense to ship, whereas Debian has everything that anyone's bothered to include whether it's useful or not. For example, Cooledit (my favourite text editor) is missing from the survey. The only problem with Debian would be stuff missing because it is not DFSG-free. Such stuff is available in the non-free/ directory but it's probably not as comprehensive as the main/ directory is.


    Having said that, it's very interesting to see what they have got. I didn't know Andrew Tridgell did all that stuff, for example. This could be a good tool for the community to get to know people better.

    --

    perl -e 'fork||print for split//,"hahahaha"'

  23. ESR Fodder by Hard_Code · · Score: 2

    ESR had a colloquium at Cornell a while ago and I brought up Nikolai Bezroukov's critique of his CatB, which he loudly discredited. I wish this survey had come up earlier...I would like to ask him to comment on these statements:

    "The top 1271 authors, 10% of the total, accounted for 72.3% of the total code base. The top 10 authors alone (0.08% of the total) are credited for 19.8% of the code base. Free software development may be distributed, but it is most certainly very top heavy."

    "Our conclusion: Free software development is less a bazaar of several developers involved in several projects, more a collation of projects developed single mindedly by a large number of authors."

    The point from Bezroukov's paper that I didn't bring up was that open source projects look much more cathedralesque and hierarchical as one moves up. E.g., not just anybody gets patches put right into the Linux or *BSD kernel.
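
    For anyone wanting to check a "top-heavy" figure like the ones quoted against their own numbers, a minimal sketch follows. The per-author byte counts are made up and the rounding rule for the cutoff is an assumption; none of this is the survey's data or code:

```python
def top_share(byte_counts, fraction):
    """Share of all credited bytes held by the top `fraction` of authors."""
    ranked = sorted(byte_counts, reverse=True)
    k = max(1, round(len(ranked) * fraction))  # at least one author
    return sum(ranked[:k]) / sum(ranked)

# Ten hypothetical authors, 1000 bytes in total.
counts = [500, 200, 100, 50, 50, 40, 30, 15, 10, 5]
share = top_share(counts, 0.10)  # the top 10% is a single author here
```

    Note that a number like this inherits every attribution error upstream: if bundled files are credited repeatedly, the top of the list is inflated and so is the share.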

    --

    It's 10 PM. Do you know if you're un-American?
  24. Key contributors by konstant · · Score: 5

    What I find most interesting by far is the composition of the contributions when viewed by project. In nearly every project I viewed, there are two or three elite "key contributors" who provide something on the order of 1/3 to 7/10 or more of the code, with the remainder provided by a slew of sub-1% coders.

    This relates an interesting story. It appears that, while the real strength of OSS is incremental improvement over time, few projects can exist without a guiding intellect or a handful of ambitious coders on the core team.

    Presenting this data to employers who are concerned about losing control of their code may help assuage their fears of open source. Clearly projects that are "owned" by no one are rarities. A corporation *can* have its cake and eat it too.

    -konstant
    Yes! We are all individuals! I'm not!

  25. Lines of code by El · · Score: 3

    12706 developers working several years on 3149 projects, and they've still produced fewer lines of code than a single release of Win2K... is this because Open Source is more efficient, less feature-rich, or because it doesn't carry the burden of backwards compatibility with DOS 1.0?

    --

    "Freedom means freedom for everybody" -- Dick Cheney

    1. Re:Lines of code by PollMastah · · Score: 2

      Hmm, this gives me an interesting idea... for another Slashdot poll suggestion, of course :-)

      Why does Win2K have more lines of code than all the open source projects combined?

      1. Because open source projects are lean and mean, and pack a lot of punch; not spongy and flabby like M$ bloatware :-)
      2. Because open source programmers don't like their programs to have any features. Features are for M$ spoon-fed victims. (sarcasm)
      3. Because Win2K actually does something, unlike open source software which merely rides on hype (I mean, it takes a lot of effort to cause Linux kernel panic whereas under Win2K it's so easy that sometimes it's even spontaneous -- obviously M$ understands, unlike OSS fanatics, the need for an easy way to crash!)
      4. Because Open Source is just hype, and cannot produce anything close to a real system.
      5. Face it, people, M$ knows what it's doing and ain't a bunch of loud-mouthed teenagers shouting Long Live Open Source without knowing how the real world works.
      6. Because ... how else would there be enough room for all those 64000+ bugs to hide?!
      7. Because that's how M$ programmers avoid getting laid off: Pad every source file with lots of newlines and useless comments (not to mention the occasional bug) so that their employee record shows a high count of number-of-lines-of-code they wrote.
      8. Because Win2K is written in a verbose language known as VB.
      --

      Poll Mastah

  26. Re:How To Lie With Statistics by perky · · Score: 2
    Actually that quote is variously attributed to Mark Twain or the British Prime Minister Disraeli.

    anyway, the point is that stats can be used to lie, but equally they can be used to extract the truth. For example much of modern materials science is based on statistics. Likewise economic forecasting techniques. Stats aren't always bad, it's just that they can be misused.

    --
    "The new wave is not value-added; it's garbage-subtracted" - Esther Dyson, Dec 1994