Improving Linux Kernel Performance
developerWorks writes "The first step in improving Linux performance is quantifying it, but how exactly do you quantify performance for Linux or for comparable systems? In this article, members of the IBM Linux Technology Center share their expertise as they describe how they ran several benchmark tests on the Linux 2.4 and 2.5 kernels late last year. The benchmarks provide coverage for a diverse set of workloads, including Web serving, database, and file serving. In addition, we show the various components of the kernel (disk I/O subsystem, for example) that are stressed by each benchmark."
I was also surprised to see that they still use some of the old performance monitoring tools, like looking at /proc and other ASCII tools, rather than something like PCP, which collects all these statistics together so that you can look at any combination of subsystems on the same timeline. Then they could have graphs showing the interaction and load on the disk, CPUs, VM, network, etc.
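To make the "same timeline" idea concrete, here is a rough sketch of folding two /proc sources into one timestamped record. It is not PCP and not what the IBM team ran; the function names (`parse_cpu_line`, `parse_loadavg`, `sample`) and the sample text are made up for illustration, and the field names assume the usual layout of the aggregate `cpu` line in /proc/stat (older kernels expose only the first four counters).

```python
def parse_cpu_line(stat_text):
    """Return the aggregate 'cpu' counters from /proc/stat text as a dict."""
    names = ("user", "nice", "system", "idle", "iowait", "irq", "softirq")
    for line in stat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "cpu":
            # zip() stops at the shorter sequence, so a 4-field line still works.
            values = [int(v) for v in fields[1:1 + len(names)]]
            return dict(zip(names, values))
    raise ValueError("no aggregate cpu line found")


def parse_loadavg(loadavg_text):
    """Return the 1-minute load average from /proc/loadavg text."""
    return float(loadavg_text.split()[0])


def sample(stat_text, loadavg_text, timestamp):
    """Merge both sources into one record keyed by a shared timestamp."""
    record = {"time": timestamp, "load1": parse_loadavg(loadavg_text)}
    for name, value in parse_cpu_line(stat_text).items():
        record["cpu_" + name] = value
    return record


# Illustrative input only; these numbers are invented, not measured.
SAMPLE_STAT = "cpu  4705 150 1120 16250\ncpu0 4705 150 1120 16250\n"
SAMPLE_LOADAVG = "0.20 0.18 0.12 1/80 11206\n"
record = sample(SAMPLE_STAT, SAMPLE_LOADAVG, timestamp=0.0)
print(record)
```

On a live system you would pass in the contents of /proc/stat and /proc/loadavg along with `time.time()`, and collect one record per interval so that every subsystem shares the same time axis.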
There is folly and foolishness on the one side, and daring and calculation on the other. - Admiral Pellew, Hornblower
Why does it seem that all these benchmarks are primarily concerned with CPU performance or network throughput or single-disk reading and writing? For a large category of enterprise applications (which this paper says it is trying to address), I/O performance can usually be the most important part.
The problem is that the typical PC hardware is just not designed for that. Large proprietary Unix or mainframe systems usually have multiple very high speed buses; a single 32-bit PCI bus is rather low-end in comparison. Now of course this is not Linux's fault; but then again Linux is not just a PC operating system! So I guess my question is: if this is about benchmarking Linux for enterprise use, how about some information about Linux running on enterprise-class hardware rather than souped-up PCs? I'm sure IBM must have a few resources there.
In particular I'm interested in how the Linux kernel is designed to handle multiple independent I/O buses. Are the I/O schedulers weighed down with locking issues or interrupt contention? What about the allocation of memory buffers between faster and slower I/O devices? Or even its support for advisory I/O operations (hinting) that some proprietary OSes provide? What about asynchronous I/O?
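For readers unfamiliar with advisory I/O hinting, here is a small sketch of what it looks like from user space, using the POSIX `posix_fadvise` call as exposed by Python's `os` module. This only illustrates the concept on a system that provides the call; it says nothing about what the kernels benchmarked in the article supported. The helper name `read_sequentially` and the temporary file are made up for the example.

```python
import os
import tempfile


def read_sequentially(path, chunk_size=64 * 1024):
    """Read a whole file after hinting that access will be sequential."""
    total = 0
    fd = os.open(path, os.O_RDONLY)
    try:
        # Offset 0 with length 0 covers the whole file. The hint is purely
        # advisory: the kernel is free to ignore it. The hasattr() guard
        # skips the hint on platforms that lack the call.
        if hasattr(os, "posix_fadvise"):
            os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_SEQUENTIAL)
        while True:
            chunk = os.read(fd, chunk_size)
            if not chunk:
                break
            total += len(chunk)
    finally:
        os.close(fd)
    return total


# Demo on a throwaway file so the sketch runs anywhere.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"x" * 200_000)
demo_bytes = read_sequentially(tmp.name)
os.unlink(tmp.name)
print(demo_bytes)
```

The hint changes nothing about correctness; it only gives the kernel a chance to adjust readahead, which is why it is a cheap thing to add to a streaming reader.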
And of course Linux suffers from the general Unix philosophy when it comes to giving I/O the same level of attention as CPU. For instance there are lots of processor use controls, such as process nice levels, processor affinities, real-time schedulers, threading options galore, etc. But how do you say that a given process may only use 30% of the I/O bandwidth on a particular bus? And those are things that mainframes were good at, so how does Linux on mainframes compare?
I've been reading the comments from some Mozilla people ever since Apple came out with Safari based on KHTML, and it's been suggested that the bloat and delay of Mozilla comes from too many developers. Makes me wonder if Linux will succumb to the same problem.
What about kernel developers creating a sub-version of the kernel (used only by those who choose it) that logs and relays information on that kernel's performance on various users' machines?
Is this a bad idea? Would it take too many hours of extra work?
-JB
"I love deadlines. I love the "whooshing" sound they make as they pass by." - Douglas Adams.
After checking out their list, there are only two test machines running strictly Linux, at least among the non-clustered setups. Beyond that, they are all Win 2k Data Center, .NET Server, IBM AIX and Unix. The ones that are running Linux are running Red Hat Advanced Server, and it does not specify if they are optimized.
They also say that most of the best Price/Performance machines are running Windows 2000 Server, or .NET server (betas?). Most Linux admins would argue this, especially given the news article on /. last week that said it is cheaper to run. I wonder how accurate their measures are based on the monitoring tools.
Beyond that, they are not using a unified standard as their monitoring system. All of the Win machines use COM+ and the non-Win machines use a variety.
I wish there were some interactive workload benchmarks - I know this is history, but when the kernel went to 1.2 I found my machine really slow; the benchmarks were better but somehow the usability had gone down. It would be neat to measure the way the mouse tracking feels, the "snap" with which menus open in an application, Netscape getting a page and rendering it, etc. Kernel compilation and numerics are not the main use of a desktop machine these days ...
On a related note, my Mac Powerbook was really sluggish until I managed to kill some unneeded processes; they weren't really eating up time by themselves, but were somehow impacting system reactivity: The load factor hardly moved but the system became responsive to mouse clicks.
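One crude stand-in for that "snap" is wakeup latency: ask to sleep for a short interval and record how late the process actually gets the CPU back. The sketch below is only a rough proxy, under the assumption that late wakeups correlate with sluggish feel; it measures nothing about mouse tracking or rendering, and the names `wakeup_overshoot` and `summarize` are made up for the example.

```python
import time


def wakeup_overshoot(interval=0.01, samples=20):
    """Sleep repeatedly and record how much later than requested we woke up."""
    overshoots = []
    for _ in range(samples):
        start = time.perf_counter()
        time.sleep(interval)
        elapsed = time.perf_counter() - start
        # Clamp at zero so timer granularity never yields a negative value.
        overshoots.append(max(0.0, elapsed - interval))
    return overshoots


def summarize(overshoots):
    """Report the median and the worst overshoot, in seconds."""
    ordered = sorted(overshoots)
    return {"median": ordered[len(ordered) // 2], "worst": ordered[-1]}


print(summarize(wakeup_overshoot()))
```

Running it once on an idle machine and once under a heavy background load gives two summaries to compare; the worst case is usually more telling than the median for perceived responsiveness.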
This is not a signature.
I agree that I/O is currently a weakness of Linux and that it needs a lot more attention. CPU speeds and the ability of Linux to make the most of the processor are very good and have already been very well developed. CPUs have advanced so far in the past few years that the CPU is no longer the main bottleneck of the system. I/O technologies have stood pretty much still, with only small advances, so no wastage or inefficiency on the part of the OS is acceptable.
It is a pity that Linux developers, like Unix developers, have become a little stuck in their ways - hopefully they will do their best to address this in the 2.6 and 3.0 kernels.
I like the idea of a modularised kernel, where people could use the I/O system that best suited their setup - but this could involve an awful lot of division and arguments, and the number of bugs that would result could be huge. Perhaps Linux itself could automatically adapt the way it works to suit its needs - hence solving the problem of Linux's hugely varying performance. Does anyone else have any suggestions or comments on this?
Mozilla uses C++ (and most methods are virtual) and component interfaces like XPCOM. Such things probably enhance developers' productivity, but they incur quite a bit of overhead in code size and (less so) in speed.
It is great that core developers actually care about code size and instruction-level speed (such as the recent syscall patch, or those highly optimized inline functions in headers), and there are many people sending patches to clean up code. Maybe Linux won't get as bloated as Mozilla after all...
That reason would be the cost of the tests and the fact that most linux hackers don't have pockets as deep as billg's.
The IBM paper is interesting, but beyond doing these straightforward kinds of measurements, I can think of a lot of better approaches to improving kernel and core application performance, based on research I've seen... When I was doing profiling work on supercomputer stuff a few years back I surveyed the tools and found some systems that use really novel approaches which could definitely be adapted to this purpose. I suppose word doesn't really get out about some of this stuff; anyway, take a look and see for yourself:
S-Check
S-Check starts with your original source code and points suspected of being bottlenecks. It adds artificial delays at the specific points throughout the parallel code. These delays can be switched ON or OFF. The switched delays generate numerous new versions of the program, with the delays simulating adjustments in code efficiency. S-Check methodically executes the many variants, recording delay settings and corresponding run times. S-Check analyzes the recorded entries against a linear response model using techniques from statistics. The results are a sensitivity analysis from which program problem areas can be identified. This provides a portable, scalable, and generic basis for assaying parallel and network based programs.
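A toy version of that idea, to show the shape of the analysis. This is not S-Check's code or its statistical model: the "program" is a made-up timing function (`simulated_runtime`) in which hypothetical points `a` and `b` run in parallel and `c` runs afterwards, every ON/OFF combination is tried, and the effect of each delay is estimated as the mean run time with the delay ON minus the mean with it OFF.

```python
from itertools import product


def simulated_runtime(delays):
    """Hypothetical program: 'a' and 'b' run in parallel, then 'c' runs alone."""
    a = 5.0 + delays["a"]
    b = 2.0 + delays["b"]
    c = 1.0 + delays["c"]
    return max(a, b) + c


def sensitivity(points, run, delay=1.0):
    """Run every ON/OFF delay combination and estimate each point's effect."""
    runs = []
    for switches in product((0, 1), repeat=len(points)):
        delays = {p: s * delay for p, s in zip(points, switches)}
        runs.append((switches, run(delays)))
    effects = {}
    for i, point in enumerate(points):
        on = [t for s, t in runs if s[i] == 1]
        off = [t for s, t in runs if s[i] == 0]
        # Main effect: average slowdown attributable to this point's delay.
        effects[point] = sum(on) / len(on) - sum(off) / len(off)
    return effects


effects = sensitivity(["a", "b", "c"], simulated_runtime)
print(effects)  # {'a': 1.0, 'b': 0.0, 'c': 1.0}
```

The delay at `b` costs nothing because `b` is hidden behind the longer parallel branch `a`, which is exactly the kind of result that tells you where tuning effort would be wasted. A real tool would time actual runs, repeat them to average out noise, and avoid the full 2^n enumeration when there are many points.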
Paradyn
(overview)
"...a heuristic, goal-seeking algorithm was coupled with a dynamic instrumentation package to drive an automated, systematic inquiry into the performance of a parallel application."
The upshot is tools which can instrument a running system on the fly, and use statistical techniques that identify "hot spots" by looking for the amount of "collateral damage" when adding artificial delays to a particular location. You can even go farther, mapping out relationships, etc.
These are approaches that came out of parallel supercomputing, because in that field traditional approaches to benchmarking and profiling are often useless and/or impractical, and the systems (and programming problems) have become so complex that effective hand tuning becomes nearly impossible as well. Of course the kernel isn't so simple either, and these days you have parallelism to boot... I would love to see these techniques solving a wider range of problems.