dwheeler.com · Domains · Slashdot Mirror

slashdotted! by Anonymous Coward · 2002-07-05 02:30 · Score: 0 · on Estimating the Size/Cost of Linux

This paper analyzes the amount of source code in GNU/Linux, using Red Hat Linux 7.1 as a representative GNU/Linux distribution, and presents what I believe are interesting results.

In particular, it would cost over $1 billion ($1,000 million - a Gigabuck) to develop this GNU/Linux distribution by conventional proprietary means in the U.S. (in year 2000 U.S. dollars). Compare this to the $600 million estimate for Red Hat Linux version 6.2 (which had been released about one year earlier). Also, Red Hat Linux 7.1 includes over 30 million physical source lines of code (SLOC), compared to well over 17 million SLOC in version 6.2. Using the COCOMO cost model, this system is estimated to have required about 8,000 person-years of development time (as compared to 4,500 person-years to develop version 6.2). Thus, Red Hat Linux 7.1 represents over a 60% increase in size, effort, and traditional development costs over Red Hat Linux 6.2. This is due to an increased number of mature and maturing open source / free software programs available worldwide.

Many other interesting statistics emerge. The largest components (in order) were the Linux kernel (including device drivers), Mozilla (Netscape's open source web system including a web browser, email client, and HTML editor), the X Window system (the infrastructure for the graphical user interface), gcc (a compilation system), gdb (for debugging), basic binary tools, emacs (a text editor and far more), LAPACK (a large Fortran library for numerical linear algebra), the Gimp (a bitmapped graphics editor), and MySQL (a relational database system). The languages used, sorted by the most lines of code, were C (71% - was 81%), C++ (15% - was 8%), shell (including ksh), Lisp, assembly, Perl, Fortran, Python, tcl, Java, yacc/bison, expect, lex/flex, awk, Objective-C, Ada, C shell, Pascal, and sed.

The predominant software license is the GNU GPL. Slightly over half of the software is simply licensed using the GPL, and the software packages using the copylefting licenses (the GPL and LGPL), at least in part or as an alternative, accounted for 63% of the code. In all ways, the copylefting licenses (GPL and LGPL) are the dominant licenses in this GNU/Linux distribution. In contrast, only 0.2% of the software is public domain.

This paper is an update of my previous paper on estimating GNU/Linux's size, which measured Red Hat Linux 6.2 [Wheeler 2001]. Since Red Hat Linux 6.2 was released in March 2000, and Red Hat Linux 7.1 was released in April 2001, this paper shows what's changed over approximately one year. More information is available at http://www.dwheeler.com/sloc. 1. Introduction The GNU/Linux operating system has gone from an unknown to a powerful market force. Netcraft found that, of the systems running web servers on June 2001, GNU/Linux was now the second most popular operating system (with 29.6%, versus Windows' 49.6%) [Netcraft 2001]. Another survey, of primarily European and educational sites, found that GNU/Linux was used more than any other operating system (of the sites it surveyed) [Zoebelein 1999]. IDC found that 25% of all server operating systems purchased in 1999 were GNU/Linux, making it second only to Windows NT's 38% [Shankland 2000a].

There appear to be many reasons for this, and not simply because GNU/Linux can be obtained at no or low cost. For example, experiments suggest that GNU/Linux is highly reliable. A 1995 study of a set of individual components found that the GNU and GNU/Linux components had a significantly higher reliability than their proprietary Unix competitors (6% to 9% failure rate with GNU and Linux, versus an average 23% failure rate with the proprietary software using their measurement technique) [Miller 1995]. A ten-month experiment in 1999 by ZDnet found that, while Microsoft's Windows NT crashed every six weeks under a ``typical'' intranet load, using the same load and request set the GNU/Linux systems (from two different distributors) never crashed [Vaughan-Nichols 1999].

However, possibly the most important reason for GNU/Linux's popularity among many developers and users is that its source code is generally ``open source software'' and/or ``free software''. A program that is ``open source software'' or ``free software'' is essentially a program whose source code can be obtained, viewed, changed, and redistributed without royalties or other limitations of these actions. A more formal definition of ``open source software'' is available from the Open Source Initiative [OSI 1999], a more formal definition of ``free software'' (as the term is used in this paper) is available from the Free Software Foundation [FSF 2000], and other general information about these topics is available at Wheeler [2000a]. Quantitative rationales for using open source / free software is given in Wheeler [2000b]. The GNU/Linux operating system is actually a suite of components, including the Linux kernel on which it is based, and it is packaged, sold, and supported by a variety of distributors. The Linux kernel is ``open source software''/``free software'', and this is also true for all (or nearly all) other components of a typical GNU/Linux distribution. Open source software/free software frees users from being captives of a particular vendor, since it permits users to fix any problems immediately, tailor their system, and analyze their software in arbitrary ways.

Surprisingly, although anyone can analyze GNU/Linux for arbitrary properties, I have found little published analysis of the amount of source lines of code (SLOC) contained in a GNU/Linux distribution. Microsoft unintentionally published some analysis data in the documents usually called ``Halloween I'' and ``Halloween II'' [Halloween I] [Halloween II]. Another study focused on the Linux kernel and its growth over time is by Godfrey [2000]; this is an interesting study but it focuses solely on the Linux kernel (not the entire operating system). Paul G. Allen posted some results from running Scientific Toolworks, Inc.'s tools on the Linux kernel, but this analysis only considered C code (including headers) - ignoring the many other languages used in constructing the Linux kernel (e.g., assembly language), and only concentrating on the kernel. The Free Code Graphing Project at http://fcgp.sourceforge.net generates a graphical representation of a program (currently, the Linux kernel), but only of the C code. In a previous paper, I examined Red Hat Linux 6.2 and the numbers from the Halloween papers [Wheeler 2001].

This paper updates my previous paper, showing estimates of the size of one of today's GNU/Linux distributions, and it estimates how much it would cost to rebuild this typical GNU/Linux distribution using traditional software development techniques. Various definitions and assumptions are included, so that others can understand exactly what these numbers mean. I have intentionally written this paper so that you do not need to read the previous version of this paper first.

For my purposes, I have selected as my ``representative'' GNU/Linux distribution Red Hat Linux version 7.1. I believe this distribution is reasonably representative for several reasons:

Red Hat Linux is the most popular Linux distribution sold in 1999 according to IDC [Shankland 2000b]. Red Hat sold 48% of all copies in 1999; the next largest distribution in market share sales was SuSE (a German distributor) at 15%. Not all GNU/Linux copies are ``sold'' in a way that this study would count, but the study at least shows that Red Hat's distribution is a popular one.
Many distributions (such as Mandrake) are based on, or were originally developed from, a version of Red Hat Linux. This doesn't mean the other distributions are less capable, but it suggests that these other distributions are likely to have a similar set of components.
All major general-purpose distributions support (at least) the kind of functionality supported by Red Hat Linux, if for no other reason than to compete with Red Hat.
All distributors start with the same set of open source software projects from which to choose components to integrate. Therefore, other distributions are likely to choose the same components or similar kinds of components with often similar size for the same kind of functionality.

Different distributions and versions would produce different size figures, but I hope that this paper will be enlightening even though it doesn't try to evaluate ``all'' distributions. Note that some distributions (such as SuSE) may decide to add many more applications, but also note this would only create larger (not smaller) sizes and estimated levels of effort. At the time that I began this project, version 7.1 was the latest version of Red Hat Linux available, so I selected that version for analysis.

Note that Red Hat Linux 6.2 was released on March 2000, Red Hat Linux 7 was released on September 2000 (I have not counted its code), and Red Hat Linux 7.1 was released on April 2001. Thus, the differences between Red Hat Linux 7.1 and 6.2 show differences accrued over 13 months (approximately one year).

Clearly there is far more open source / free software available worldwide than is counted in this paper. However, the job of a distributor is to examine these various options and select software that they believe is both sufficiently mature and useful to their target market. Thus, examining a particular distribution results in a selective analysis of such software.

Section 2 briefly describes the approach used to estimate the ``size'' of this distribution (more details are in Appendix A). Section 3 discusses some of the results. Section 4 presents conclusions, followed by an appendix. GNU/Linux is often called simply ``Linux'', but technically Linux is only the name of the operating system kernel; to eliminate ambiguity this paper uses the term ``GNU/Linux'' as the general name for the whole system and ``Linux kernel'' for just this inner kernel. 2. Approach My basic approach was to:

install the source code files in uncompressed format; this requires carefully selecting the source code to be analyzed.
count the number of source lines of code (SLOC); this requires a careful definition of SLOC.
use an estimation model to estimate the effort and cost of developing the same system in a proprietary manner; this requires an estimation model.
determine the software licenses of each component and develop statistics based on these categories.

More detail on this approach is described in Appendix A. A few summary points are worth mentioning here, however. 2.1 Selecting Source Code

I included all software provided in the Red Hat distribution, but note that Red Hat no longer includes software packages that only apply to other CPU architectures (and thus packages not applying to the x86 family were excluded). I did not include ``old'' versions of software, or ``beta'' software where non-beta was available. I did include ``beta'' software where there was no alternative, because some developers don't remove the ``beta'' label even when it's widely used and perceived to be reliable.

I used md5 checksums to identify and ignore duplicate files, so if the same file contents appeared in more than one file, it was only counted once (as a tie-breaker, such files are assigned to the first build package it applies to in alphabetic order).

The code in makefiles and Red Hat Package Manager (RPM) specifications was not included. Various heuristics were used to detect automatically generated code, and any such code was also excluded from the count. A number of other heuristics were used to determine if a language was a source program file, and if so, what its language was.

Since different languages have different syntaxes, I could only measure the SLOC for the languages that my tool (sloccount) could detect and handle. The languages sloccount could detect and handle are Ada, Assembly, awk, Bourne shell and variants, C, C++, C shell, Expect, Fortran, Java, lex/flex, LISP/Scheme, Makefile, Objective-C, Pascal, Perl, Python, sed, SQL, TCL, and Yacc/bison. Other languages are not counted; these include XUL (used in Mozilla), Javascript (also in Mozilla), PHP, and Objective Caml (an OO dialect of ML). Also code embedded in data is not counted (e.g., code embedded in HTML files). Some systems use their own built-in languages; in general code in these languages is not counted.