Slashdot Mirror


Putting Linux Reliability to the Test

Frank writes "This paper documents the test results and analysis of the Linux kernel and other core OS components, including everything from libraries and device drivers to file systems and networking, all under some fairly adverse conditions, and over lengthy durations. The IBM Linux Technology Center has just finished this comprehensive testing over a period of more than three months and shares the results of their LTP (Linux Test Project) testing."

19 of 296 comments

  1. Almost 1P, but I RTFAd :( by Anonymous Coward · · Score: 4, Interesting

    Anyone know if the test will be repeated with kernel 2.6.x?

  2. s/w -vs- h/w failure? by Quixote · · Score: 5, Interesting
    I skimmed over the article (heretic!), and was wondering: how do they distinguish between software failures (the purpose of the test) and hardware failures (for example, random bit errors in the memory that could be caused by higher temperatures due to the stress testing)?

    I seem to recall getting random crashes with cheapo memory, and it was a pain to track down the offending component. Of course, one would assume that IBM wouldn't go for cheapo components, but still: how does one point the finger at the software, instead of hardware? Is it just repeatability?

  3. Linux 2.4.19-ull-ppc64-SMP (SLES 8 SP 1) by Proudrooster · · Score: 0, Interesting

    Conclusions

    However, as most Linux kernel testing efforts have only been conducted over short periods of time, this series of tests provides us first-hand data and results of longer runs. The series of tests also provides data for heavy-stress workloads on Linux kernel components, as well as TCP, NFS, and other test components. The tests demonstrate that the Linux system is reliable and stable over long durations and can provide a robust, enterprise-level environment.


    BIG NEWS!!!... IBM says the 2.4.19 kernel in the SuSE distro SLES 8 is enterprise ready. Too bad 2.4 is yesterday's news. I wonder when IBM will start testing the 2.6.0 kernel :) This test report should at least make Ford happy; too bad IBM timed this announcement while Ford is closed for holiday break. I also wonder why IBM didn't use Red Hat for the stress test. Things that make you go hmmmmm....... maybe it's time to learn SuSE and YaST.

    On a side note, does anyone know if SuSE's SLES 8 will run on a single-CPU home PC? I've always wanted to take that version for a test drive, but could never find install CDs for a non-SMP, low-end Intel machine.

    1. Re:Linux 2.4.19-ull-ppc64-SMP (SLES 8 SP 1) by LurkerXXX · · Score: 3, Interesting

      2.6.0 has only been out for a week. I'm going to want to see someone stress test it for a hell of a lot longer than a week before I call it "enterprise ready".

    2. Re:Linux 2.4.19-ull-ppc64-SMP (SLES 8 SP 1) by ramzak2k · · Score: 2, Interesting

      I used to use RH 8.0 and then RH 9.0, and moved on to SuSE 9.0 Pro recently. I noticed that when apps crash, the RH distros took it much better: you could close the unstable application safely and continue working on the rest. SuSE, OTOH, freezes. Did anyone else notice this?

      --

      Siggy Say, Siggy Do
    3. Re:Linux 2.4.19-ull-ppc64-SMP (SLES 8 SP 1) by kiore · · Score: 2, Interesting
      Yes, I do.

      We run SuSE Professional 8.2 on our home machines, one server, two workstations.

      My partner's machine freezes occasionally. Most recent one was yesterday, and even Alt+Ctrl+Backspace wouldn't get control back. I needed to power off!

      I've never been able to work out exactly what causes these freezes (my partner is not very "'puter literate"), but suspect that it may be somehow related to the printing subsystem. One time I did manage to ssh into the machine when it froze and the cups process was feasting on CPU cycles.

      We upgraded to the then-latest patches from the SuSE FTP site about two weeks ago. This did not affect the reliability.

      I can't contrast these observations to any other distro. I've only ever used SuSE (since 6.x).

  4. Results... by rmdir+-r+* · · Score: 3, Interesting

    The Linux kernel and other core OS components -- including libraries, device drivers, file systems, networking, IPC, and memory management -- operated consistently and completed all the expected durations of runs with zero critical system failures. Every run generated a high success rate (over 95%), with a very small number of expected intermittent failures that were the result of the concurrent executions of tests that are designed to overload resources.

    How does that compare with other OSes?

  5. Re:You don't trust Microsoft to evaluate Windows.. by davidstrauss · · Score: 5, Interesting
    Why shouldn't we trust this test?

    The people performing it have a vested financial interest in having it turn out a specific way, notably positive. If the test results showed poor reliability, then I would understand trusting it, because it would go against the motives of the people performing it. Since the test affirms their business model, no matter how well documented it is, it should be suspect.

    It doesn't appear to be a test rigged to make one platform look better than the other.

    It looks a bit skewed to me. Many of the test results depend on the computer systems meeting the expectations of the people testing them, particularly in overload cases. Since the people who tested work in the Linux Technology Center, their expectations stand a greater likelihood of being consistent with the system.

    Take C/C++ and Java. Someone who regularly works with C/C++ knows that certain library functions (notably the character-classification ones) return an int as status, where 0 means false and any non-zero value means true. If someone expects that, the system meets expectations and passes. If someone comes from a different background, say Java, he or she may not expect that, and the system would consequently fail the test of meeting expectations. I would like an evaluation from somewhere in between, not from someone whose years of experience allow them to gloss over what might be problems for another person.
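
    A small sketch of the convention described above, calling the C library's isdigit() from Python through ctypes. It assumes a Unix-like system where ctypes.CDLL(None) exposes the C library; the helper names c_isdigit and is_digit are made up for illustration. The point is that the return value is only guaranteed to be zero or non-zero, so comparing it to 1 can go wrong:

```python
import ctypes

# Assumption: on Linux and most Unix-like systems, CDLL(None) gives access
# to symbols already loaded into the process, including the C library.
libc = ctypes.CDLL(None)
libc.isdigit.argtypes = [ctypes.c_int]
libc.isdigit.restype = ctypes.c_int


def c_isdigit(ch: str) -> int:
    """Return the raw int that C's isdigit() reports for one ASCII character."""
    return libc.isdigit(ord(ch))


def is_digit(ch: str) -> bool:
    """Interpret the result the C way: 0 is false, anything else is true."""
    return c_isdigit(ch) != 0


if __name__ == "__main__":
    raw = c_isdigit("7")
    print("raw isdigit('7'):", raw)          # non-zero, but not necessarily 1
    print("naive raw == 1:", raw == 1)       # may be False on some C libraries
    print("correct raw != 0:", is_digit("7"))
    print("is_digit('x'):", is_digit("x"))
```

    Someone expecting a strict true/false value would write the "raw == 1" check and might call the result a failure; someone used to C would write the "!= 0" check without thinking about it.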

  6. Why? Here's why... by Crypto+Gnome · · Score: 5, Interesting
    • because the test methodologies are documented
    • because it's disclosed up-front that it's the IBM Linux team testing Linux (i.e. no hidden conflict of interest)
    As opposed to the usual (ie in the Microsoft World)
    • ZDNet (and/or others) "testing" Microsoft Products (but only vaguely describing how things were configured)
    • Microsoft paying someone to "report" on the quality/performance of a Microsoft product, but the evaluation is worded in such a way as to convince the user that it's an independent review, and the "funded by Microsoft" fact is never mentioned anywhere in the evaluation
    --
    Visit CryptoGnome in his home.
  7. WHAT is the failure? by SharpFang · · Score: 4, Interesting

    95% success ratio... does that mean that 1 in 20 programs I run segfaults or what? What do they mean by "failure"? Not finishing given task in predefined time? Getting the results wrong? Hanging?

    Sorry but that means nothing. Even if there -was- a comparison to other systems, it would still mean nothing. 95% success ratio, 78% happiness factor and 93% user satisfaction.

    --
    45 5F E1 04 22 CA 29 C4 93 3F 95 05 2B 79 2A B2
  8. Re:You don't trust Microsoft to evaluate Windows.. by Unordained · · Score: 1, Interesting

    The people performing it have a vested financial interest in having it turn out a specific way, notably positive. If the test results showed poor reliability, then I would understand trusting it, because it would go against the motives of the people performing it.

    That may be human instinct, but let's be honest: it's not fair either. Either trust the source of information or don't, but don't trust the result of a test based on the result of the test -- circular dependency.

    It's much simpler to say that this is a Linux test by IBM, and may therefore be tainted. Did I mention that it's good to have truly independent testing centers, albeit expensive for them? This whole independent/free media thing is rather important ... specifically because of this.

  9. Re:You don't trust Microsoft to evaluate Windows.. by willabr · · Score: 1, Interesting

    IBM only wants to sell its hardware; I don't think they could care much less about the OS (remember OS/2). As long as they can ride on the back of a free labor force, they are happy campers.

    The Wall Street Journal reported last week that IBM had told its managers to plan on moving as many as 4,730 high-tech jobs from the United States. (I wonder if they are the Linux testers.)

    Some of whom will be required to train the foreign workers who will replace them.

    Thanks, IBM. You will be remembered. I'm happy I have nothing to do with them.

  10. Failures needed by Error27 · · Score: 3, Interesting

    This test would have been more interesting if there had been failures. Perhaps they could have tried the test on an older version of Linux, or a different operating system.

    I have been trying to write some tests of my own recently. So far I have found a filesystem Oops and a ptrace BUG(), and my system locks up in low-memory situations. The lockup is probably because my ethernet driver allocates memory in the interrupt handler (GFP_ATOMIC) and can't handle the result when there is no memory available.

    I need to fix the lockup first of all so the other tests have time to run...
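
    A toy model of the failure mode described above, in plain Python rather than kernel code. The names AtomicPool, rx_careless and rx_careful are hypothetical; the pool stands in for a GFP_ATOMIC-style allocation, which cannot wait for memory to be freed and so has to be allowed to fail. This is only a guess at the kind of bug involved, not the actual driver:

```python
from typing import Optional


class AtomicPool:
    """Hypothetical stand-in for an allocator that never blocks."""

    def __init__(self, free_buffers: int) -> None:
        self.free_buffers = free_buffers

    def alloc(self, size: int) -> Optional[bytearray]:
        # Like an atomic allocation: return None at once when memory is gone.
        if self.free_buffers == 0:
            return None
        self.free_buffers -= 1
        return bytearray(size)


def rx_careless(pool: AtomicPool, packet: bytes) -> int:
    """Buggy handler: assumes the allocation always succeeds."""
    buf = pool.alloc(len(packet))
    buf[:] = packet  # raises TypeError when buf is None
    return len(buf)


def rx_careful(pool: AtomicPool, packet: bytes, stats: dict) -> int:
    """Checks the result and drops the packet when no buffer is available."""
    buf = pool.alloc(len(packet))
    if buf is None:
        stats["dropped"] += 1
        return 0
    buf[:] = packet
    return len(buf)


if __name__ == "__main__":
    stats = {"dropped": 0}
    pool = AtomicPool(free_buffers=2)
    for _ in range(5):
        rx_careful(pool, b"ping", stats)
    print("dropped under memory pressure:", stats["dropped"])
```

    In the model, the careless handler blows up as soon as the pool runs dry, while the careful one just counts a dropped packet and carries on, which is roughly what a receive path is expected to do when an atomic allocation fails.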

  11. Re:USE BAD HARDWARE! by router · · Score: 2, Interesting

    Exactly. How the fsck do you think they manage to have a whole corporation hidden in the razor-thin margins that exist on commodity hardware? By cutting everything down to the wire. Dell et al. are like the major automakers: saving a quarter on every piece is 3 million on the bottom line....

    andy

  12. Re:Not bad by bonehead · · Score: 2, Interesting

    He actually makes some very good points.

    Windows, even the server versions, are not the enterprise class OSs that they are marketed as. This should come as no surprise, because they were not even designed that way in the first place.

    All you have to do to realize this is boot up W2K AS and use it as a desktop machine for a while. All of the desktop crap is still there sucking up resources. Even Freecell is there, fer cryin' out loud! Try as I might, I can't come up with a good reason for a headless server sitting in a data center to have a copy of Freecell on it.

    I can understand why, with a desktop OS, you would just go ahead and install everything by default, just to make sure that everything works. But why would you do that with an enterprise class server OS? At some level of the chain here, shouldn't MS acknowledge that the intended user of the product actually knows what he's doing?

    At some point in the design process, shouldn't someone have said "Hey, this is going to run on servers in the back room, we could probably ditch Freecell and Solitaire, couldn't we?"

    The fact that they didn't, well... It makes me wonder.

  13. I already knew GNU/Linux was stable by idiotnot · · Score: 3, Interesting

    In fact, it's not much of a question for me anymore -- when there's a problem, it's normally hardware malfunction. I have several machines with 160+ day uptimes, which would be longer if not for an extended power outage at the office.

    IBM just confirmed what I already knew. Guess what, Win2k is pretty stable, too. Sorry, but it's true.

    But, jeez, isn't anyone else drooling over those systems they tested on? Makes me hate my busted whiteboxes and horrible HPs a little more every day.

    Repeat after me....."MMMM, dual Power4......MMMM, dual Power4...."

  14. Look at companies using Linux by JaredOfEuropa · · Score: 2, Interesting
    any one of you know of someone who fills in these criteria.
    The closest you're likely to get is good testimonials from companies using Linux. IBM, Sun, etc. all have a stake in Linux, and the 'independent' research outfits are probably funded by them, or by Microsoft (in case Linux needs a good bashing).

    My client is a big megacorp. Their strategy for the coming years is to migrate all Unix systems to Windows/.Net (client side), and to Linux or NT (server side, depending on which OS fits best). This isn't the kind of corporation that makes such a decision after reading a sales brochure or a Gartner article. They research their options, thoroughly. Apparently the conclusion was that Linux is reliable enough to be entrusted with mission-critical stuff.

    The sad thing is that they will (probably) keep the results of this research confidential. Why help the competition with this knowledge?
    --
    If construction was anything like programming, an incorrectly fitted lock would bring down the entire building...
  15. Re:Diagnosing software vs. hardware is easy. by MoogMan · · Score: 3, Interesting

    You will find that a lot of the trickier bugs depend on certain conditions (race conditions, for example). Such things are very hard to recreate, even under carefully controlled situations. Then you get the heisenbug variety, etc. Such errors could easily be passed off as hardware failure. I'm sure you can dream up your own examples. (I can't; I'm still drunk.)

  16. My experience: Linux survives hard drive crash by crow · · Score: 4, Interesting

    I've been using an old P120 laptop as a firewall/router for my house for the past several years, running 2.2.something. I wondered why it had rebooted after noticing an uptime of only a day or two, but found that I was instead experiencing the uptime rollover bug (at about 497 days; Windows used to crash on a similar bug after about 49.7 days). About a month ago, it stopped giving out DHCP addresses. I went downstairs to investigate, as I couldn't log in remotely, and found that the hard drive was making that nasty clicking sound. I eventually managed to ssh in (sshd and sh were in RAM; I just waited for the logging to time out). I was able to kill syslog and cron, and now DHCP is again giving out addresses.

    It's been running just fine for a month now with a dead hard drive.

    (Yes, I'm getting a replacement because it won't survive an extended power outage on that ancient battery.)
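
    The rollover arithmetic behind the uptime bug mentioned above, as a rough sketch. It assumes a 32-bit tick counter running at 100 ticks per second (the usual rate on x86 kernels of that era) for Linux and a 32-bit millisecond counter for Windows; the function names are made up for illustration:

```python
COUNTER_BITS = 32
SECONDS_PER_DAY = 86_400


def days_until_wrap(ticks_per_second: int, bits: int = COUNTER_BITS) -> float:
    """Days before a counter of the given width wraps back to zero."""
    return (2 ** bits) / ticks_per_second / SECONDS_PER_DAY


def reported_uptime_days(true_days: float, ticks_per_second: int,
                         bits: int = COUNTER_BITS) -> float:
    """Uptime a wrapped counter would show after true_days of running."""
    ticks = int(true_days * SECONDS_PER_DAY * ticks_per_second)
    wrapped = ticks % (2 ** bits)
    return wrapped / ticks_per_second / SECONDS_PER_DAY


if __name__ == "__main__":
    print(f"100 Hz tick counter wraps after {days_until_wrap(100):.1f} days")
    print(f"millisecond counter wraps after {days_until_wrap(1000):.1f} days")
    # A box that has really been up 499 days reports only a couple of days.
    print(f"shown after 499 days: {reported_uptime_days(499, 100):.1f} days")
```

    Under those assumptions, a machine that has been up a little over the wrap point looks freshly rebooted, which matches the "uptime of only a day or two" surprise.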