Slashdot Mirror


2.4, The Kernel of Pain

Joshua Drake has written an article for LinuxWorld.com called The Kernel of Pain. He seems to think 2.4 is fine for desktop systems but is only now, after a year of release, approaching stability for high-end use. Slashdot has had its own issues with 2.4, so I know where he's coming from. What have your experiences been? Is it still too soon for 2.4?

31 of 730 comments

  1. Alphas by Paul Komarek · · Score: 5, Informative

    And this guy appears to be talking about only x86 machines. My lab has had a horrible time with 2.4 on Alphas. In fact, we've moved back to 2.2.18 on some machines. (2.2.20 for Alpha didn't compile properly, and I didn't want to mess with it -- anyone know which 2.2 kernel is best for number-crunching Alphas right now?). Oh, the pain. The lost time. "Kernel of Pain" is a fine description of our 2.4 experience on Alphas.

    -Paul Komarek

    1. Re:Alphas by Dr. Tom · · Score: 4, Informative

      I'm running 2.4.17 and it works great on a DP264. I followed the whole 2.4 series and there were some rough spots in the first dozen or so but it's good to go now.

    2. Re:Alphas by Steffen · · Score: 2, Informative

      I couldn't get 2.2.20 to compile on my alpha (ruffian) either, some problems relating to some PCI code. I tried to fix it myself, (it appeared that a wrong struct type was being used somewhere) but other things broke after that, so I gave up.

      2.2.18 worked grand, and I believe 2.2.19 will as well. I'm running 2.4.16 now which has given me very little trouble (bar a broken network driver)... the machine is rarely stressed so I can't say for sure. One day I'll fire up 10 seti processes and see what happens.

      Hope that's of some use.

  2. My experience by nzhavok · · Score: 5, Informative

    What have your experiences been?

    Well:
    8:33pm up 45 days, 5:49,

    Shameful, I know, but I had to move city; before that I had 6 months. Should have had a UPS ;-)

    This is pretty much a desktop/development box running postgres, JBoss, tomcat, apache, JBuilder and (occasionally) kylix. No problems so far, touch wood.

    I also used to work at the comp-sci department of a university where we had 40 boxes in the linux lab, no real problems except they were running ext2 so only the occasional manual fsck. Now the maclab, that is another story (OS9 not OSX).

    --

    He who defends everything, defends nothing. -- Frederick the Great
  3. Similar problem here... by Cryptnotic · · Score: 3, Informative
    The article is a little short on the details, but we had a similar problem here at work with a new Redhat 7.2 server (kernel 2.4.9) we were setting up. The machine was to be a CVS/file server, running a cvs pserver and Samba. It had 1GB of main memory, and a 180 GB RAID5 array (external via a Mylex RAID card w/ LVD SCSI U160). The machine would seem to run fine, but then in testing, the machine would block on processes for seemingly no reason. It was something in the [kswapd] kernel process that was blocking things. If you logged in at a terminal or over a network, you'd get extreme "stuttering" on your responsiveness. Basically, it became unresponsive with only several processes running, a load that wasn't even excessive.

    Oh yeah, and the machine would crash randomly and lose data. We were using ext3, so the file system was (supposedly) still consistent, but whatever was being worked on would be lost.

    Ultimately, we upgraded the kernel to 2.4.17, and the problems have been fixed. But the "even number == stable reliable" rule failed us that time.

    Since then, I've read that "the entire VM system in 2.4 was replaced around 2.4.10". This really scares me. I hope that Linus and Alan Cox have learned to manage things better now. If not, someone else will have to pick up the slack (maybe RedHat) and manage a stable kernel.

    Cryptnotic

    --
    My other first post is car post.
    1. Re:Similar problem here... by Rentar · · Score: 3, Informative
      I hope that Linus and Alan Cox have learned to manage things better now. If not, someone else will have to pick up the slack (maybe RedHat) and manage a stable kernel.

      Neither Linus nor Alan Cox maintains 2.4 at the moment. Marcelo Tosatti does, and from what I read on LKML some people thought that was a bad move at the beginning, but I think it is working out just great (the first release he made was 2.4.17, IIRC).

  4. Problem more serious in Business Computing by lamj · · Score: 2, Informative

    I notice that a few people mention they don't have problems with 2.4. I find that true based on certain conditions.

    For home use, I really don't find a lot of problems with 2.4 except minor driver problems. But at work, things are very different. I run a few high-load critical servers at work that are still on 2.2; the lab attempt to upgrade to 2.4 (at an early stage) failed because of lockups and performance issues (yes, some due to VM).

    It was not until recently, when I tried again with 2.4.16, that I started getting some reasonable results with the 2.4 series. For your information, performance is about the same on 2.4 with my application; I cannot confirm the high-load stability issue yet as I need more time to test. But initial results tell me 2.4.17 is reasonably stable, with only one lockup so far (in two weeks).

  5. Re:Au contraire by Ace Rimmer · · Score: 4, Informative

    Try the low-latency patches to the 2.4 tree. They have a much better impact than those called "preemptive".

    Also
    nice -n -10 /usr/bin/X11/X
    helps quite a lot on an average desktop linux

    --

    :wq

  6. Re:Au contraire by hansendc · · Score: 5, Informative

    What are you smoking?!? High end box DOES NOT mean your 1.2 GHz Athlon!! We're talking about machines with >8 processors here. Machines which need to use the PPro PAE so that over 4gig of memory can be addressed.
    There are serious VM stability issues with these systems. Ever wonder why Redhat hasn't released a >2.4.9 kernel? It's because 2.4.10 is where the new VM system went in. Redhat is busily porting Rik van Riel's 2.4.9 VM up to the later kernels so that they can use it.
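
For anyone checking a batch of boxes against the 2.4.10 VM switch mentioned above, here is a minimal shell sketch that compares kernel release strings. The helper names kver_num and kver_ge are made up for this example, and it assumes release strings of the form major.minor.patch with an optional suffix such as -ac8 or pre2, which it ignores.

```shell
#!/bin/sh
# kver_num VERSION: print a sortable integer for a kernel release string.
# Hypothetical helper; assumes at least "major.minor" and ignores any
# suffix such as "-ac8" or "pre2". The subshell body keeps variables local.
kver_num() (
    v=${1%%-*}                          # drop "-ac8", "-13", "-generic", ...
    major=${v%%.*}
    rest=${v#*.}
    minor=${rest%%.*}
    case $rest in
        *.*) patch=${rest#*.} ;;        # text after the second dot
        *)   patch=0 ;;                 # "2.4" has no patch level
    esac
    minor=${minor%%[!0-9]*}             # keep leading digits only
    patch=${patch%%[!0-9]*}
    echo $(( major * 1000000 + ${minor:-0} * 1000 + ${patch:-0} ))
)

# kver_ge A B: succeed when release A is the same as or newer than B.
kver_ge() {
    [ "$(kver_num "$1")" -ge "$(kver_num "$2")" ]
}

# Example: report which side of 2.4.10 the running kernel is on.
if kver_ge "$(uname -r)" 2.4.10; then
    echo "running kernel is 2.4.10 or later"
else
    echo "running kernel predates 2.4.10"
fi
```

This only tells you the version number; a vendor kernel labelled 2.4.9 may carry backported patches, so treat the result as a first filter rather than a verdict.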

  7. Re:Kernel too big? by WildThing · · Score: 2, Informative

    For example, why do you put USB support into the kernel? Can't this be moved to a driver or external module or something? The same applies for file systems etc.

    Uh... You can compile USB and many other parts as a module.

  8. Re:Oh, stop it! by Anonymous Coward · · Score: 2, Informative
    First of all, there is no code difference between "server" and "desktop" distributions of Linux. That's right, none.

    No, that's wrong. Red Hat, for instance, which is generally designed to be an industrial-strength server distribution, applies something like 200 patches to Linus's kernel. Red Hat knows that its customers expect a solid, stable server operating system, so they will do what it takes to build one. Mandrake, on the other hand, knows that its customers are mostly desktop users, so it has other priorities (providing games, etc.) than testing and patching the kernel.

  9. Re:Observations & Experiences by schwap · · Score: 4, Informative
    How many people are running a version of Apache in the 1.3.x tree? Well, if you are, that's a development tree and not necessarily stable. Yes, there are stable versions, but you must test!

    Um... 1.3.x is, indeed, the stable version. From the website:

    The Apache Group is pleased to announce the release of the 1.3.22 version of the Apache HTTP server. Apache 1.3.22 is the best version of Apache currently available.

    2.0.x is the unstable tree at the moment.

  10. Comment removed by account_deleted · · Score: 3, Informative

    Comment removed based on user account deletion

  11. One year to stability by Anonymous Coward · · Score: 1, Informative

    Through experience and monitoring of the lkml, it became apparent that the major reason why 2.2.x and 2.4.x kernels failed so frequently for us was down to the implementation of VM by Rik van Riel. Once the decision was made to use Andrea's VM, we found a significant improvement. In both kernels, this decision was made approx. 1 year after initial release.

    Our servers running 2.2.19 and 2.4.16/17 still lock up from time to time, usually every 30-60 days, but using Rik's VM we'd only get about 16 days uptime.

    We were forced to upgrade to the 2.4.x kernels because 2.2.x does not support the chipsets that our servers use.

  12. Re:Au contraire by brinkster · · Score: 2, Informative

    I had the same complaints but I am now using KDE 3 beta and the improvements in speed are amazing. Although quicker, apps still take a while to start, but the responsiveness of switching between windows and menu actions is very much improved. Konqueror has also received a nice speed boost.
    The KDE team and Trolltech have done a great job and when KDE 3 is released in the next month or so it will be definitely worth checking out.

  13. Linux 2.4 on our router by Tack · · Score: 4, Informative

    I had been running 2.4 on our router for many months. That's not to say those months were consecutive. :) I had so many problems. I reached the point where I had to reboot the box at least once a week (usually twice) or else it would suddenly become unresponsive. If I had an uptime over 10 days I was doing REALLY well. I tried about 10 different 2.4 kernels (up to 2.4.13), as well as RedHat's 2.4.7 kernel. (I was forced to use 2.4 because of features I required.) At any rate, after about 6-8 months of this, I was resigned to either putting FreeBSD on the router or recommending we buy a hardware solution next fiscal year (i.e. a Cisco router).

    Well, I put 2.4.14 on the box and I haven't rebooted since. I have 61 days of uptime and that's the most I've seen on that box ever. It is finally stable. The only thing I can conclude is that it's AA's VM that is doing the trick. And in hindsight, it makes sense. The behaviour of the box was that it was thrashing, but at the time it didn't seem that way because I hadn't noticed the HDD light was disconnected from the box and I couldn't hear the disk in the noisy server room.

    So, Linux 2.4 is (knock on wood) stable for my servers, now.

    Jason.

  14. Re:Au contraire, agree by Anonymous Coward · · Score: 1, Informative

    I have to agree with this post... people forget that large machines are completely different beasts than your desktop. Your x86-based machine is going to behave quite differently than, say, a 2, 4, 8 or more CPU machine with gigs and gigs of RAM and 100s, maybe 1000s, of gigs of drive space. I am a Linux advocate myself, but I would not put it on the IBM F50 I used to work on. This is where the community needs to pay attention. We are on what, near 0% of the desktop, so are we pushing for this? Linux used to be making inroads into the server community; let's keep it that way and fix these issues...

  15. Re:Why Linux? by xphase · · Score: 2, Informative

    Additionally, many people are actively developing hardware drivers for Linux, but not so many for BSD.

    Regardless of your other points, this is simply not true. There is a large amount of driver development happening in the FreeBSD project. Most hardware that people actually use is supported. Even the nVidia binary module for Linux is being ported (in some obscure way) to FreeBSD.

    Also, why would you want a linux kernel module running on a FreeBSD kernel?

    --xPhase

    --
    The following sentence is TRUE. The previous sentence is FALSE.
  16. NFS and 2.2 by ansible · · Score: 3, Informative

    There were some lingering problems with NFS (even v2 using UDP) in the 2.2.x kernel series until 2.2.19.

    I recommend that you upgrade the machine that's running 2.2.17, or else apply the NFS patches. If you're using NFS v3 or TCP, you definitely want to upgrade to the latest version, and get the latest NFS utils.

  17. Turn on your hard drives by blazerw11 · · Score: 2, Informative

    Do this for your ide hard drives:
    hdparm -u1 -ci -d1 /dev/whatever
    I can't believe I was running without it. Does anyone know why this is not turned on by default?
    Use man hdparm to learn what these settings do.
    However, your problems sound more like Xwindows problems than kernel problems.

    --
    A great many people think they are thinking when they are merely rearranging their prejudices. -- William James
    1. Re:Turn on your hard drives by cachedout · · Score: 2, Informative

      Beware the -u option: From the hdparm manpage comes the following warning...

      -u Get/set interrupt-unmask flag for the drive. A setting of 1 permits the driver to unmask other interrupts during processing of a disk interrupt, which greatly improves Linux's responsiveness and eliminates "serial port overrun" errors. Use this feature with caution: some drive/controller combinations do not tolerate the increased I/O latencies possible when this feature is enabled, resulting in massive filesystem corruption. In particular, CMD-640B and RZ1000 (E)IDE interfaces can be unreliable (due to a hardware flaw) when this option is used with kernel versions earlier than 2.0.13. Disabling the IDE prefetch feature of these interfaces (usually a BIOS/CMOS setting) provides a safe fix for the problem for use with earlier kernels.
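
Given that warning, a cautious way to try these settings is to look at what the drive currently reports and leave -u off until you know the controller tolerates it. A minimal sketch follows; ide_tune and run are hypothetical helper names, /dev/hda is only an example device, and by default the script just prints the hdparm commands (set RUN=1, as root, to actually execute them).

```shell
#!/bin/sh
# run CMD...: execute the command when RUN=1, otherwise just print it.
run() {
    if [ "${RUN:-0}" = 1 ]; then "$@"; else echo "$*"; fi
}

# ide_tune DEVICE [UNMASK]: query the current values, then enable
# 32-bit I/O (-c1) and DMA (-d1). UNMASK defaults to 0, so the
# interrupt-unmask flag stays off unless you pass 1 deliberately.
ide_tune() {
    dev=$1
    unmask=${2:-0}
    run hdparm -c -d -u "$dev"               # show current settings first
    run hdparm -c1 -d1 "-u$unmask" "$dev"    # apply the chosen settings
}

# Dry run against an example device.
ide_tune /dev/hda
```

Keeping the query and the change as separate steps means you have the old values on screen if you need to put them back.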

  18. Re:Au contraire by ghildstr · · Score: 2, Informative

    Hello. I currently administer and write software for an 8 node cluster at the Naval Surface Warfare Center in Bethesda MD. All of our machines have AMD 1000MHz processors, dual 80GB ATA100 Raid 0, 512 MB PC 133, and 3Com 100bT NICs. We are running Red Hat 7.1, which uses kernel 2.4.2. As of yesterday, the entire cluster, which is used for production work almost daily, had an uptime of 99 days. The software I have written uses lam-mpi for communication. The software is for high volume data analysis, of ship and submarine test data, in the time and frequency domains, which is very network, memory, and cpu intensive. Because of our software design and the size of the data files, we have not experienced heavy swapping, so I cannot comment on the speed or stability of the VM portion of the kernel. Some of our analysis jobs take 6-8 hours to complete at 99.9% cpu load on the compute nodes. I am pleased with 2.4*.

  19. Re:Au contraire by Havoc Pennington · · Score: 5, Informative

    As the author of a window manager and big hunks of GTK, I don't think your analysis is quite right.

    The primary problem is synchronization, not delay. GTK 1.2 is very fast, its geometry code is not causing any slowness. You are confusing slow with flicker. Flicker looks slow but slow is not the problem; no matter how fast code is, if it flickers, you will see it, and it will look slow.

    Similarly when opaque resizing a window; it has nothing to do with quantization or speed, the problem is that the window manager frame and the client are not resized/drawn at the same time resulting in a "tearing" effect. This would be visible no matter how fast you make things.

    As you say, putting the toolkit in the server or putting the WM in the toolkit are overradical ways to fix this. It's not even necessary to backing store all X windows. It could be done with an extension allowing us to push a backing store on a single X window during a resize, for example. However fixing it 100% pretty clearly requires server changes, and that's why you haven't seen a fix yet.

  20. While Linux remains superior to Windows by FreeUser · · Score: 5, Informative

    ... you are absolutely correct in observing that the 2.4 debacle has used up a great deal of Linux's reputation for being stable. I use 2.4.x with SGI's xfs patches both in production systems at work, and at home (like others, we need various features of 2.4.x not available in 2.2.x), and while it has never been anything close to as flakey as the most stable of Microsoft systems, it has in comparison to 2.2.x (and FreeBSD for that matter) been pretty damn unreliable. In comparison to just about everything else it is still quite stable, so happiness is indeed to some degree relative.

    And now for some armchair quarterbacking, all that having been said, I really think Linus needs to exercise some self-discipline and stay away from maintaining even-numbered kernel releases (x.0.x, x.2.x, x.4.x, etc.). By his own admission he isn't good at being a stable kernel maintainer and prefers the more interesting work done in development kernels, and his track record in 2.2 wasn't fantastic (particularly in comparison to 2.0, where he did a fantastic job) and was pretty abysmal in 2.4. As someone who's been using GNU/Linux since the early pre 1.0 days I hope he'll put his efforts where his talents are (managing changes in odd numbered development releases) and leave stable maintenance to Cox and Marcelo (who are very good at maintaining and improving stable releases). But enough commentary from the peanut gallery...

    --
    The Future of Human Evolution: Autonomy
  21. Re:Why didn't he downgrade immediately? by Anonymous Coward · · Score: 1, Informative

    Downgrade from 2.4 to 2.2? Have you actually tried it yet?

    As the owner of the machine in question, let me answer that question directly.

    It's been virtually impossible to downgrade from 2.4 to 2.2 for a variety of reasons, including file system incompatibility and inability of 2.2 to install with the raid controller in that box (lack of compatible drivers).

    Believe me, if downgrading were as easy to do as to say, we'd have done it long long ago.

    We're fed up with getting up at 3 AM, or 5 AM or whatever to check and see if the server has choked and whether or not someone will have to go and reset it.

    - 2.4Insomniac

  22. Re:Unfortunately I have to agree by _johnnyc · · Score: 4, Informative

    Same here.

    At about the time the 2.4 kernel was first released, we were building a server for serving out large media files for encoding. We were on a limited budget, so we put together a PC with about 256 MB RAM running on a K6-2/500. Set it up with a combination of RAID 1 and RAID 5 with 2x40GB and 2x80 GB IDE drives. While running with the stock RH 6.2 kernel we had no problems. But we needed the 2.4 kernel for large files, so we waited until we couldn't wait any longer.

    This turned out to be problematic to say the least. While we had 7 servers running RH 6.2 and never had a crash, the machine serving up the media files would lock up whenever copying large files, or whenever many files were being copied. Kept me working through a few weekends trying the latest kernel and then stress testing the server with large file copies. We wound up reverting back to a 2.2 kernel because the crashes were too frequent.

    I haven't tried the RH kernels for 2.4 on anything other than desktop systems. I can say that, on RH 7.1 at least, the 2.4 kernel in use is rock solid and has never crashed for me at home or on desktop systems at work. I never got the chance to try the kernels on RH 7.1, but I suspect Redhat kernels would probably be more stable. They've got the resources to stress test and modify kernels for specific needs.

    I liked the article. He's not a kernel hacker and writes from his experience of the 2.4 kernel with clients. Only problem I see is WTH was he thinking using Mandrake 8.0 for a server? That version of Mandrake, more than any other I've used, I've found to be very unstable on 2.4.

  23. Re:Guess I've been lucky by Brian Knotts · · Score: 2, Informative

    Problem Exists Between Chair And Keyboard. :-)

  24. the above message may be a troll by denshi · · Score: 3, Informative
    I think your message is highly misinformed and borders on trolling. Maybe you're just new.

    Many hardware setups require recompiling the kernel and experimenting endlessly.
    This is true. On machines with really exotic hardware, I have had to recompile a great many kernel configurations. Usually, however, I can just rmmod & insmod to test the new configurations without rebooting, so the experimenting phase is not overlong.
    Every time you recompile the kernel, you need to recompile some kernel modules.
    You are in no way forced to compile anything as a module -- the kernel will live quite happily as a solitary ELF executable. So don't tell me 'every time'.
    Dependencies and recompilation aren't working correctly--some things don't recompile when they should, and lots of things recompile over and over and over again.
    That's possible anywhere, and I have seen little evidence for your recompilation loop. It has been some time since I last saw an incorrect dependency in the kernel build. And on an average uniprocessor machine, my full builds complete in under two minutes. So I'm not crying for time.
    The kernel itself is a 30Mbyte download.
    Cry me a river. Get DSL. Or learn to use the patch command -- that's why all those patch files are on the kernel mirrors. I've been pulling kernel sources off a 33k modem link for the last 6 months, and I'm not hurting for the speed.
    And the list of problems goes on and on.
    All of which are apparently handwaving. Let's watch.
    The kernel hackers keep telling us that C and make are just great tools for building kernels.
    I agree with you that make sucks. Unfortunately, it still sucks less than almost everything on the field. Please suggest an alternative. I also agree that C sucks. OTOH, C++ sucks even harder, and for its extra demands of space and time and its ability to obfuscate, C++ doesn't deliver any of the benefits that a real language (like LISP) does. C++ has been out for 20 years, and it still hasn't superseded C in close-to-the-metal progging. Figure it out.
    This is not a system I can recommend to non-technical users--commercial distributions can't cover all the possible kernel configurations (even with fully modularized kernels), and recompilation is out of the question for many users.
    I have to agree with you on that, but recent kernels are pretty complete -- most users won't need to recompile.
    It must be possible to write drivers and other kernel modules that can be compiled separately from the kernel and work across many versions. Binary modules really should keep working across minor version number changes (2.2 to 2.4, for example).
    You can do that. Say yes to 'attach version information to modules' in the kernel config.
    It must be possible to write kernel modules with more safety in mind. There should also be some way to apply some memory protection to kernel modules when desired.
    I agree with you, but that's pretty far off. The MIT exokernel is I think the shining example of what you are looking for. In the meantime, most people get the same effect by running your theoretic modules outside of the kernel, in daemons or shared libs or something. The user/kernel protections are usually enough.
    The build system needs to get fixed. There is no reason why adding or removing a module should result in a recompilation of the whole kernel. Maybe it's time to get rid of "make" altogether for the kernel.
    There *is* no reason to recompile the whole kernel to add a module. What are you smoking? "make modules","cd to blah","cp blah.o /lib/modules/x.y.z/","depmod". Or just "make modules; make modules_install". As for 'getting rid of make', what would you use to replace it?

    I saved this one for last:

    Important and mature packages like MOSIX require patching the kernel and aren't integrated into the kernel.
    You see, that's what we call not in the linus kernel. Your impressions of importance and maturity of the patch are really something you should take up with Linus himself. I, for one, wish Ingo's TUX subsystem makes it into the linus tree sometime soon. But you have no basis to say that just b/c a kernel patch is out, and linus hasn't integrated it into his stable tree, the linux process is flawed. Get a clue! Independent patches come out much faster than anyone can pull them into the core; they are usually conflictive and compete with other patches to solve the same problem. So it takes a while. If you want it in the linus tree sooner, help out. Welcome to open source.
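
On the "learn to use the patch command" point above: the kernel mirrors carry incremental patches, so moving between releases over a slow link means fetching a few small files rather than the full tarball. A rough sketch, assuming the usual naming where patch-2.4.N.bz2 applies on top of a 2.4.(N-1) tree; patches_needed is a made-up helper, the version numbers are only examples, and the loop prints the commands instead of running them.

```shell
#!/bin/sh
# patches_needed FROM TO: list, in order, the incremental patch files that
# would take a 2.4.FROM source tree to 2.4.TO. Hypothetical helper; assumes
# the patch-2.4.N.bz2 naming, with each patch applying on top of N-1.
patches_needed() (
    n=$(( $1 + 1 ))
    while [ "$n" -le "$2" ]; do
        echo "patch-2.4.$n.bz2"
        n=$(( n + 1 ))
    done
)

# Dry run: print the commands you would run from inside the source tree,
# with the downloaded patch files sitting one directory up.
for p in $(patches_needed 14 17); do
    echo "bzcat ../$p | patch -p1"
done
```

The patches have to be applied in order and to a clean tree; if one fails partway, it is usually simpler to start again from a pristine copy than to untangle the rejects.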
  25. 2.4, fb, usb and cd write by GodWasAnAlien · · Score: 2, Informative

    2.4.0-? frame buffer had some problems with edge offset.

    -2.4.5 usb-storage was flakey.

    2.4.6 usb stabilized.

    2.4.15 broken
    2.4.16 first stable version with ext3
    2.4.16 ide-scsi hangs with some cd-recorders

    2.4.17 seems stable for me.

  26. Re:Why Linux? by mosch · · Score: 1, Informative
    FreeBSD is more stable.
    FreeBSD is more secure.
    FreeBSD is as fast, or faster depending on the task.
    FreeBSD is more free.
    FreeBSD is much easier.

    There's your solution, and with FreeBSD the BSD'd codebase keeps getting richer!

  27. "My experience" by Kourino · · Score: 2, Informative

    Although I'm sure this is already lost in the flood of comments :)

    I started out running Debian 2.2r2(3?), which had a 2.2 kernel. I never had any problems with it, and I didn't have very fancy hardware. I did have a fun ride getting integrated i815 audio/video to work (that was all I had then). Upgrading to 2.4.4 didn't really help with that issue ... I had impeccable stability, but I never really pushed the envelope :)

    Things got more interesting around 2.4.10, partly because my HD crashed (the Deathstar effect) and I rebuilt my system. Lack of MIDI for the SB Live! finally drove me to use ALSA drivers (SB Live! MIDI hasn't worked for me with OSS drivers). No problems support-wise for the Radeon-based card I had for kernel-related issues (kernel DRI mainly). Nice USB support except for the cheap Visioneer 4400 scanner I had (which isn't the kernel's fault). Stability and performance under medium loads has been generally good. I flirted with the pre-emptive and lock-breaking patches for a while, but things got really messy under 2.4.17 (stuff happened that I stopped using Windows to get away from, ie, sudden hard system freezes), so I dropped back to 2.4.16 with no patches. I'll probably try again with .18 or something. The only speed complaint I've really had is gmc takes a long time to scan directories with lots of stuff in them :)

    Now, I haven't really had many problems. 2.4.13-ac8 oopsed, but since I was panicking about midterms at the time I didn't really look at it that closely, and never got around to actually reading the oops screen (I took a pic of it with someone else's camera ... ). I haven't taken time to find out why yet, but XMMS' memory footprint has this habit of growing unbounded ^^; The ALSA output plugins make it worse.

    For the curious: my system is a P3 866 with 128 megs of RAM to start. I upgraded to 512 megs in July when I was running 2.4.4; that's the most memory my motherboard can take, alas. Heaviest loads include running XMMS, apache, xinetd, xchat, and compiles at the same time, so not too bad. Apache was mainly serving stuff to one or two people at maybe 20-30k/sec. I've been generally happy with the 2.4 series, but since I haven't really pushed my system hard yet I may experience problems later when I do :3