Bad Lockup Bug Plagues Linux
jones_supa (887896) writes "A hard to track system lockup bug seems to have appeared in the span of couple of most recent Linux kernel releases. Dave Jones of Red Hat was the one to first report his experience of frequent lockups with 3.18. Later he found out that the issue is present in 3.17 too. The problem was first suspected to be related to Xen. A patch dating back to 2005 was pushed for Xen to fix a vmalloc_fault() path that was similar to what was reported by Dave. The patch had a comment that read "the line below does not always work. Needs investigating!" But it looks like this issue was never properly investigated. Due to the nature of the bug and its difficulty in tracking down, testers might be finding multiple but similar bugs within the kernel. Linus even suggested taking a look in the watchdog code. He also concluded the Xen bug to be a different issue. The bug hunt continues in the Linux Kernel Mailing List."
The last mail in the thread, dated the 26th of November, explains that the Xen bug was a Xen bug and that the lockup was something different and traceable once the chap experiencing the bug managed to get a kernel backtrace.
Trying to become famous by taking photos. Visit my homepage please.
> First got into it ... because Linux was totally stable
If stable is your top priority, Fedora is approximately the worst possible choice. Fedora is essentially Red Hat Beta. If you want stable, the devel / beta branch is not for you. You'll probably be much happier with Red Hat or its twin, CentOS.
Also, you mentioned that you did an "upgrade" to Debian Unstable. You didn't mention any _reason_ for doing that. If stability is a top priority for you, don't upgrade just because you can, don't fix it if it aint broke.
Mac OSX may indeed be a good choice for you also. It is certified Unix and if you use the commondand line in Linux you'll find that day-to-day tasks are the same on a Mac. System internals are different of course, but bash, sed, awk, grep, and vim work just like they do on Linux.
So it may be a "bad" lockup bug in the sense that nobody knows exactly what causes it, but it's not "bad" in the sense that people should worry overly.
Why?
Dave Jones sees it only under insane loads (CPU loads of 150+) running a stress tester that is designed to do crazy things (trinity). And he can reproduce it on only one of his machines, and even there it takes hours. And it happens on a debug kernel that has DEBUG_PAGEALLOC and other explicit (and complex) debug code enabled. And even then the bug is a "Hmm. We made no progress in the last 21 seconds", rather than anything stranger.
In other words, it's "bad" in the sense that any unknown behavior is bad, but it's unknown mainly because it's so hard to trigger. Nobody else than core developers should really care. And those developers do care, so it's not like it's worrisome there either. It just takes longer to figure out because the usual "bisect it" approach isn't very easy when it can take a day to reproduce..
Have you ever compared enterprise class software (I also count Windows 7 Enterprise) with OSS Software? Windows does not even reliably support STR and resume. Using multiple monitors is a PITA.
Suspend and multiple monitors have always worked great in Windows for me. Under Linux, they have also worked fine in some machines, but I have also occasionally experienced serious problems with those areas. During recent times I have found out that even laptop screen brightness adjustment cannot be expected to work reliably out of the box under Linux.