What Actually Makes Up "Linux"?
David A. Wheeler sent in linkage to his extensive analysis of the true size of Linux. There's an amazing amount of information in here, and although it focuses on Red Hat 7.1, it still has tons of interesting bits of information about the code that makes up the distribution. Breakdowns include languages, licenses, cost estimates, and stats that in no way clear up the legendary GNU/Linux debate that will undoubtedly be engraved on tombstones somewhere.
Windows is made up of the following:
You can moderate this down, but I challenge you to find proof that this situation is otherwise.
This was quite a thorough, well-written document right up until the point he mentioned Bill Gates. Well, actually not Bill Gates himself but the immortalised words from his "Open Letter To Hobbyists".
In particular, the bit about documentation. The thing that Linux lacks these days is decent documentation in a lot of areas, in particular things like devfs (which the author even admits is now poorly documented; the instructions that are available are out of date).
Coming from a BSD background (no, this isn't an excuse for a platform war - just hear me out), documentation is just as important as the code itself. This sometimes means that the availability of certain features in BSD is a generation behind that of Linux, but when they arrive, the documentation is top notch: it contains correct spelling and grammar, notes what bugs are present, provides examples of correct usage (this is especially relevant in documenting programming functions whose incorrect usage may have a security impact) and so on. Overall, it's an issue of documentation quality.
The author of the paper may scoff in the direction of Bill Gates, noting the ability of the Linux community to create and maintain an operating system, but what he's done in the process is bring the whole paper down by exposing the one thing that Linux, as a "disparate sources, one distribution" operating system, can never have and that Microsoft products and, from my perspective, the BSD operating systems do have: documentation that exists in a single form and is written in a style that is consistent across the entire operating system. (This is not the case with Linux. Some things use manpages, others use "info", others use text files, others use HTML documentation. Heaven knows how a new user on Linux (advocacy is about attracting new users, right?) is supposed to navigate this mess without a considerable level of pain and/or persistence.)
And before you let the flames begin, have a poke around the manual page listings on, say, the NetBSD/OpenBSD/FreeBSD websites and compare them to the ones you see on RedHat and so on.
Linux progressed farther in 10 years than Microsoft during that same time frame
I don't see how that's true at all. In both technology and the bottom line, Microsoft is *years* ahead. Technology: let me offer one example: go to a web page (IE) with some kind of table with data in it. Copy the table. Paste it into Word. It actually becomes a Word table! Paste it into Excel. It actually places the data, and the formatting, into the cells! How far is Linux from that level of ease of use, that level of "object linking and embedding" across apps? Do you think the multiple desktop standards help or hinder this task?
And in terms of bottom line, Linux companies are still trying to figure out how to make a buck. Red Hat just now moved into the positive column; VA and others lay off people seemingly every week.
I'm a fan of Linux because I'm a hacker. I like the shell, I like the flexibility and customisability that come with having dozens of "glue" tools. But the fact is, hackers are the minority of computer users, and this is only going to be more and more true in the future. For the masses, ease of use is priority 1, and it seems, at least to me, that the "other" platform has a great lead in that arena.
---
python -c "x='python -c %sx=%s; print x%%(chr(34),repr(x),chr(34))%s'; print x%(chr(34),repr(x),chr(34))"
It would be interesting to see how many MB of space those "This is GPL" disclaimers take up.
... is the people. Seriously, Windows can't really say that because there is no real "Windows community". Mac people can talk about it, but they are still dependent on Apple for all wants and needs. On the other hand, Linux is written, used, and supported by the people themselves. Those figures, all of it from the lines of code to the language percentages, just illustrate who and what we are as a community.
It's something I could go on and on forever about because it really is something special in a world dominated by the shadow of Gates and Jobs. "Those people" who work "over there" don't make this. We do! While all those numbers can start to quantify this, you can't really put a dollar value on it the same way you can't put a dollar value on freedom. Funny thing to be able to say that about a bunch of software...
"I may not have morals, but I have standards."
> Using RedHat as a distro for this project isn't that good of an idea.... it's just an unrepresentative mass of programs and code! I can safely say that most Redhat users will never use about one-quarter of the programs in their distribution...
That's true for any of today's operating systems. No user uses all the code in Windows, either. Even real-time OS's have more code developed for them than is used by any given user. As a measure of effort, though, examining all the code makes sense.
> Since when is the number of lines of code proportional to the quality of the software? If Red Hat 7.1 has 30 million lines of code over 6.2's 17 million, does that mean the product is 76% better? Is the code getting more sloppy as more programmers get involved? I feel like counsel is leading the witness for the author to say 7.1 has "60% more effort" under the COCOMO model.
I never said it was "better", I said it included "60% more effort." Better is a value judgement. Effort is measured in person-years.
> The kernel shouldn't be two million lines of code. How much of that is drivers? And how much of the drivers are duplicated from one driver to another?
Section 3.2 specifically discusses this; 57% of the lines of code are drivers. Duplicate files are only counted once, but "partly duplicated" files are much harder to detect (and to discount when they happen); they certainly happen in the Linux kernel. However, the COCOMO model is based on real project data, and many other projects include cut-and-pasted code (for good or ill).
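To make the distinction between exact and partial duplicates concrete, here is a minimal sketch. It is not how sloccount itself is implemented; the function name, the hash choice, the file names and their contents are all hypothetical, and "lines of code" is simplified to "non-blank lines". Files with byte-identical content are counted once, while a file that differs by even one edit is counted in full.

```python
import hashlib

def count_unique_lines(files):
    """Count non-blank lines across files, skipping exact duplicates.

    `files` maps a path to its content as bytes (hypothetical input shape).
    """
    seen_digests = set()
    total = 0
    for path, data in files.items():
        digest = hashlib.sha256(data).hexdigest()
        if digest in seen_digests:
            continue  # byte-identical copy: already counted
        seen_digests.add(digest)
        total += sum(1 for line in data.splitlines() if line.strip())
    return total

# Hypothetical example files, for illustration only
files = {
    "a/util.c": b"int x;\n\nint y;\n",
    "b/util.c": b"int x;\n\nint y;\n",              # exact copy, skipped
    "c/util.c": b"int x;\n\nint y; /* tweaked */\n",  # partial copy, counted in full
}
unique_total = count_unique_lines(files)
```

An exact-match check like this is cheap, which is why cut-and-pasted code that has been edited slightly slips through and still contributes to the total.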
> Ok, so this guy claims that Linux would cost a little over $1 billion (US) to develop. I wonder what the big deal is. I'm sure Microsoft has spent that much over the years on Office+Win9x+WinNT+Backoffice+etc ... The only thing incredible about this number is that most of that billion was completely unpaid, or at least underpaid.
But I believe that is a big deal. Gates' "Open Letter to Hobbyists" assumed that if people just shared code, no large project would be developed. GNU/Linux and other open source/free software systems show the assumption wrong, and this paper has the numbers to prove it. You can argue which is "better", of course, but the notion that it can't be done is no longer debatable.
> Are there estimate[s of] how much money in form of salaries were ever paid to programmers for the code and how much was in effect done not only voluntarily, but also completely on an unpaid basis?
Unfortunately not; it's not even clear how to find out. You would have to go back to individual patches submitted to every project, and few people identify in their patches "I was paid to do this."
> 2437470 source lines of code for the Linux kernel. Doesn't that worry some people out there? We have a monolithic kernel almost two and a half million lines long. I think that by 2.6 the kernel is going to collapse under its own weight unless the designers decide to reorganize it in a fundamental way.
It's the nature of a monolithic kernel, and in any case, most of that is in modules (which are individually much smaller and only loaded when needed). I see no evidence of a "collapse", though clearly there are competitors (like HURD) that might eventually replace it in the market.
> Quoting statistics/data going back to '95 is way out of date by today's standards, even '99 is now very old.
It may be old, but it helps give perspective. A simple SLOC number doesn't mean much to people, unless it's compared to something else.
> The cost formula includes a term (ksloc**1.05): i.e. thousands of source lines to the power of 1.05. This reflects the fact that the bigger a program becomes, the harder it is to add new lines, because the system you are adding to is more complex. He plugs the size of the entire code base of RH7.2 into this formula. This seems unreasonable to me - these are many almost independent packages.
No, I don't do that (for the reason you cite). Section 2.3 of the paper discusses this: "Each build directory had its effort estimation computed separately; the efforts of each were then totalled." Appendix A mentions that sloccount was given the "--multiproject" option, which implements this.
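The difference between the two approaches can be sketched in a few lines. This is only an illustration, not sloccount's code: it assumes the Basic COCOMO organic-mode formula (effort in person-months = 2.4 * KSLOC**1.05; the 2.4 coefficient is an assumption here, only the 1.05 exponent is quoted above), and the package names and sizes are made up.

```python
def cocomo_effort_person_months(sloc, a=2.4, b=1.05):
    """Basic COCOMO, organic mode (assumed): effort = a * KSLOC**b."""
    return a * (sloc / 1000.0) ** b

# Hypothetical package sizes in SLOC, for illustration only
packages = {"pkg_a": 2_000_000, "pkg_b": 500_000, "pkg_c": 50_000}

# Per-project approach: estimate each package separately, then total
per_project = sum(cocomo_effort_person_months(s) for s in packages.values())

# The approach the question assumed: one estimate for the combined size
as_one = cocomo_effort_person_months(sum(packages.values()))

print(f"summed per project: {per_project / 12:.0f} person-years")
print(f"as one big project: {as_one / 12:.0f} person-years")
```

Because the exponent is greater than 1, the single combined estimate always comes out higher than the sum of the separate estimates, so treating independent packages separately gives the smaller, more conservative figure.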
Anyway, I hope people found this study interesting. It sounds like several people did.
- David A. Wheeler (see my Secure Programming HOWTO)