2.4, The Kernel of Pain
Joshua Drake has written an article for LinuxWorld.com called
The Kernel of Pain.
He seems to think 2.4 is fine for desktop systems but is only now, after a year of release, approaching stability for high-end use. Slashdot has had its own issues with 2.4, so I know where he's coming from. What have your experiences been? Is it still too soon for 2.4?
We're running the Red Hat 2.4.9-13 kernel on several SMP database servers and they have been perfect (not rebooted since 'rpm -U' of the new kernel) for several weeks. Before that, we were running 2.4.7-something from Red Hat and they were the same -- ran straight from the day we installed the kernel to the day we updated without needing to be restarted.
On my desktop machine, I've taken more risks (installed pretty much every official 2.4.x-linus release as they have come out) and some have been good, while others have been total dogs.
I'm running 2.4.17 right now. It seems okay; I've only had a freeze-up once over the last couple of weeks, though it was a total hard freeze (i.e. no ping, no magic SysRq, no nothing), which I haven't had in Linux for several years.
The obvious issue is VM; if you keep lots of memory (768M, or preferably 1.0G+) in your system, things to much more smoothly, though MP3 playback still skips a little.
Right now, I'd prefer some work on the RAID and IDE performance issues. One or two of the 2.4 series have had disk performance 100%+ better than the current 2.4 kernels. Why? I'd like to get the disk I/O back to reasonable levels.
STOP . AMERICA . NOW
The preemptive patches have made my system a lot more responsive under use. Most notably the mouse cursor doesn't slow down during heavy compiles and audio latency is good enough to play with some of the more interesting sound software projects out for linux.
But it really sounds like your problem isn't with linux but with XFree86. X has its share of problems but if you have a good video card that's supported well under it, you should get more than acceptible 2d drawing performance. I use a 3dfx voodoo3 here and its about as good as win2k running KDE (sometimes you can see it rendering when resizing or moving windows quickly but i like to think of it as a cool effect ;) and its way faster with lighter WM's like blackbox.
There was a bad period where the Soundblaster Live driver (particularly mixer settings) was broken. That lasted through at least three kernel releases. There was a worse period where the VM had fits, and where performance degraded way too rapidly if the system had to swap. That lasted at least six kernel releases. There were at least one or two releases where I discovered that Alan Cox's (usually more bleeding edge) tree was being better behaved.
Of course, whenever I'm playing around with this stuff I don't delete my "last known good" kernel, so if after a couple hours or a couple days I noticed a problem, I just booted back to what worked. The default (albeit heavily patched) Red Hat kernels were good, so "last known good" always existed for me.
To summarize: this hasn't been a source of inconvenience for me, but it has been one of vicarious embarrassment. I've only been using Linux since 2.0.somehighnumber, but this is the worst mess I've seen the "stable" kernel tree go through in that time. Don't get me wrong, I've experienced system-crashing bugs (a tulip driver that freaked at some tulip chipset clones, some really bad OOM behavior a couple years ago) before, and pragmatically I guess that's worse... but those problems were always fixed fast enough that the patches predated my bug reports. Watching even the top kernel developers seem to flounder for months over bugs in a core part of the OS like the virtual memory system just sucked.
Having said that, there are some serious issues with 2.4 on some 8-way 8GB machines that I manage. They have been running 2.4.13-ac7 since November, because that is the last kernel that is usable for me (-ac11 would probably be ok). Newer kernels have terrible behavior under the intense IO load these machines go through. They get 14-30 days of uptime, and then hang or get resource starved or something and have to be rebooted.
I think part of the issue is that there simply aren't that many people running 8-way boxes, so bugs aren't found as easy, this is of course on top of having 8-way SMP being much more complex than a defacto single user, single processor desktop machine. To make it even worse, the machines are pushed hard. They move around GBs of data every day, and often will run for extended periods with loads over 25.
Of course, it is still mostly ok. While the machines are working they mostly work fine. Of course 20 days of uptime is totally unacceptable. I have an alpha running Tru64 pushing 300 days of uptime, and the last time it was down was due to a drive failure, not an OS problem.
My only remaining issue with Linux on "small" machines is an oscillation problem in IO. Data will fill up all available memory before being written to disk, and then everything from memory will be written out, and then memory fills up again before anything new is written to disk. This is a bit inefficient, and the machine's responsiveness at the memory-full part of the cycle is poor.
What are my options though? I guess I could try FreeBSD, but a bit of lurking on their lists and forums reveals plenty of problems there, too. Do I switch and hope things get better, or wait out 2.4 and hope it comes around soon? Aside from a few nasty bugs in some releases, pretty much each successive 2.4 kernel has been better than the previous one, at least on small systems.
Several years ago I was having a hard lockup problem with Tru64 (Digital Unix, at the time) and that was very scary. It took time to get the problem escalated to the OS engineers, instead of just sending an e-mail to lkm. Even then I could only hope that the issue was being addressed, but I had no way to know if anybody was doing anything about it or not. (Turned out to be an bug in the NFS server that would cause the machine to lockup when serving to AIX.) For all of its problems though, it is extremely reassuring for me to be able to monitor the development process of Linux through the linux-kernel-mailing list, and other specialized lists. If I feel that people aren't aware of some problem I am experiencing, I can raise the issue. I am not in the dark about what is happening, and what fixes are being made. I know what changes have gone into each kernel update, so I know if there is a chance of it fixing my problems.
I know BSD can run many Linux binaries, but what about kernel modules?
At least for nVidia, it is being worked on: FreeBSD NVIDIA Driver Initiative
I'm beginning to feel like a broken record, and maybe Linus should just change the terminology so that people stop making the same assumption over and over, but people: wake up and smell the bogomips!
"Stable", in the context of a kernel release refers to the interfaces. When Linus releases 2.<leven>.0, he is saying that this kernel is one that has reached some arbitrary plateu of development stability, and it's now ready for others to begin actuall release engineering on.
You have to understand that the Linux Kernel is released by Linus in a state that is very reasonable for a development team, but that will never be "production quality". Debian puts a lot of realease engineering work into a Kernel. As does Red Hat. As does SuSe, etc, etc.
If you just grab 2.4.x and install it, you're acting as Linux Q/A, and I applaud your effort, but when it breaks in your environment, you should not be stunned.
Once again, production release != stable release. A stable release is just one the developers are happy with (and I've yet to see a 2.4 kernel that I can say developers should not be happy with).
So, maybe next time, 2.6.0 should be called the "post-development" release so that people don't go off half-cocked installing it on production systems.
I almost took this guy seriously until this part:
The kernel seemed to show more stability. Then we hit kernel 2.4.15.
Linux version 2.4.15 contained a bug that was arguably worse than the VM bug. Essentially, if you unmounted a file system via reboot -- or any another common method -- you would get filesystem corruption. A fix, called kernel 2.4.16, was released 24 hours later.
Look, anybody who is deploying a kernel on the day it is released on a production server deserves what they get. One day turnaround on a bug fix is phenomenal. Even if these are marked as "stable" kernels, trying to track the new versions in real time is a dumb thing to do.
This guy has written a moan and groan article based on a small set of bugs, some of which he only could have experienced if he is experimenting on his production system. He obviously requires extreme stability and says he needs this over the new 2.4 features (SMP, 2G memory, 2G files), which makes me ask: why was he putting new kernels on his production system before emperical evidence was there about high stability?
Open source will fix bugs faster the proprietary. It doesn't change reality to make bugs impossible. This is true even in "stable" releases, especially if you are talking about highly stressful production environments.