Slashdot Mirror


Closure On the Linux Lockup Bug

jones_supa writes: Dave Jones from Red Hat has written a wrap-up of the strange bug that has caused some machines running Linux to freeze. (Previous discussion.) In Dave's final week at Red Hat, just before he gave all his hardware back, Linus Torvalds managed to reproduce similar symptoms by scribbling directly to the HPET timer. He came up with a hack that at least made the kernel survive for him. When Dave tried the same patch, the machine ran for three days before he interrupted it, which was a promising result. The question remains: what was scribbling over the HPET in his case? The only two plausible scenarios Dave could think of were that Trinity generated 0xFED000F0 as a random address and passed it to a syscall which wrote to it, or a hardware bug. That's where the story ends for now. Linus' hacky workaround didn't get committed, but he and John Stultz continue to go back and forth on hardening the clock management code in the face of screwed-up hardware, so maybe soon we'll see something real get committed in that area.

10 of 115 comments

  1. does not sound like closure to me by Narcocide · · Score: 4, Informative

    "probably a hardware bug" is code for "well, we bought new hardware and threw out all the old stuff, sorry"

    1. Re:does not sound like closure to me by thegarbz · · Score: 4, Informative

      Re-read the summary. They know what is causing the lockup, they don't know what is making the system call which is triggering the bug. Once you know what is causing the lockup it can be fixed, and the hack that was written made the lock-ups stop. At no point did anyone throw out or try new hardware, though one thought is everything is originating from a hardware bug.

    2. Re:does not sound like closure to me by sjames · · Score: 4, Interesting

      RTFA, they have good reason to point at the hardware. Then there's the bazillions of servers running on different hardware that have never seen the bug.

      Many teams would have written it off as a hardware bug a long time ago, but the Linux kernel team was willing to consider and investigate the possibility that it was a rarely triggered bug in the software before they passed the buck.

      Sometimes it really is a hardware bug.

    3. Re:does not sound like closure to me by tippen · · Score: 3, Funny

      One of the more memorable quotes I heard while developing embedded systems: "If you can fix it in software, it isn't a hardware bug."

      Annoying as hell to the software team when it is clearly a bug in the hardware, but very true at a practical level for the engineering team trying to get product out the door.

  2. hardening is NOT blaming the hardware by dltaylor · · Score: 4, Interesting

    Too many clueless comments already that don't understand the difference between "blaming the hardware" and hardening to deal with demonstrably-broken hardware (and/or firmware for devices). I've spent years writing drivers for various OSes, including Windows and Linux. It is rare for any complex device to be bug-free at the hardware level (look how many patches are BIOS-applied to CPUs, for example). Sometimes, under NDA, of course, the Windows driver writers are apprised of the deficiencies, or, at least, get better response from the vendor when an anomaly appears. Linux rarely gets that same assistance.

    My favorite example, though, is all-IBM. We were porting AIX to the PS/2s and 370s. We consistently had problems with the diskette interface under AIX, and the response from Boca Raton was always "it works in MS-DOS, so it's your code, not our hardware". When OS/2 came around, they ran into exactly the same problem in the hardware. By then, we had a work-around (slower, more locks, but no more glitches), which was how OS/2 got around it as well.

    1. Re:hardening is NOT blaming the hardware by kad77 · · Score: 3, Funny

      What you posted about his being the 4th post struck me as wrong, given how far it was down the page. I'm bored, so I took a moment to look at how many posts have an earlier timestamp than the one you are slamming (at least 8), and 2 make dismissive statements about hardware, including the first comment of article at 8:12, and another at 8:19 seemingly dismissing hardware as a possibility.

      So your snide comment is not based in fact. It's like you are reading a different page. Maybe you need glasses. An attitude adjustment, for sure.

  3. Call me crazy by Nyall · · Score: 4, Interesting

    Sorry if I've found the wrong stuff. I'm doing this via a quick googling...

    Is this really the code for reading and writing the HPET?

    http://www.cs.fsu.edu/~baker/d...

    I've been a PowerPC programmer in aviation for a while. If you need to read the time base register (also a 64-bit up counter) you have to be aware that your read might coincide with the lower 32 bits incrementing and carrying into the upper 32 bits. So you read the upper 32 bits, read the lower 32 bits, then re-read the upper bits and make sure the upper bits didn't change. If they did, repeat this process. But if they are the same, then you combine the 32-bit halves into a 64-bit time and call it good.
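
The upper/lower/upper read described above can be sketched in C roughly as follows. This is only an illustration, not the kernel's HPET code and not code from the linked page. The accessor type `read32_fn`, the function `read_counter64`, and the simulated counter (`sim_counter`, `sim_read_hi`, `sim_read_lo`) are all hypothetical names made up for this example. On real hardware the two accessors would be volatile MMIO reads or, on 32-bit PowerPC, the time base read instructions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical accessor: returns one 32-bit half of a 64-bit up counter. */
typedef uint32_t (*read32_fn)(void *ctx);

/*
 * Read the upper half, the lower half, then the upper half again.
 * If the upper half changed, a carry happened during the read, so retry.
 */
uint64_t read_counter64(read32_fn read_hi, read32_fn read_lo, void *ctx)
{
    uint32_t hi, lo, hi_again;

    assert(read_hi != NULL && read_lo != NULL);

    do {
        hi = read_hi(ctx);
        lo = read_lo(ctx);
        hi_again = read_hi(ctx);
    } while (hi != hi_again);

    return ((uint64_t)hi << 32) | lo;
}

/* Simulated counter (an assumption for demonstration, not real hardware). */
struct sim_counter {
    uint64_t value;
};

uint32_t sim_read_hi(void *ctx)
{
    const struct sim_counter *c = ctx;
    return (uint32_t)(c->value >> 32);
}

uint32_t sim_read_lo(void *ctx)
{
    struct sim_counter *c = ctx;
    c->value += 1; /* the counter ticks just before the low half is sampled */
    return (uint32_t)c->value;
}
```

With the simulated counter starting at 0x00000000FFFFFFFF, a naive upper-then-lower read would combine the old upper half with the new lower half and come out 2^32 too small. The retry loop notices the changed upper half and reads again.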

    --
    http://en.wikipedia.org/wiki/Jury_nullification
  4. Re:plus don't crash on bad hardware. Hotplugged CP by TechyImmigrant · · Score: 3, Informative

    >Hot swapping the CPU without an immediate crash had to be a million to one shot!

    With QPI interconnect and the voltage and temp supervisory circuits on chip, it's not such a long shot these days, especially on Xeons with failover support that is explicitly intended to cope with a neighbor CPU going down.

    --
    I should use this sig to advertise my book ISBN-13 : 978-1501515132.
  5. Re:plus don't crash on bad hardware. Hotplugged CP by cerberusss · · Score: 4, Funny

    Sometimes it

    Sometimes it -- what? Did someone attempt to hot-swap your CPU again? (-:

    --
    8 of 13 people found this answer helpful. Did you?
  6. Linux CPU hotplug support link by raymorris · · Score: 3, Informative

    Replying to myself, but I figured someone reading this might be interested. Linux does support CPU hotplug where you disable the CPU before removing it. Your motherboard might get mad about it if it's not supported by the board, though.

    http://www.cyberciti.biz/faq/d...