Closure On the Linux Lockup Bug
jones_supa writes: Dave Jones from Red Hat has written a wrap-up of the strange bug that has made some machines running Linux to freeze. (Previous discussion.) In Dave's final week at Red Hat, just before he gave all his hardware back, Linus Torvalds managed to reproduce similar symptoms by scribbling directly to the HPET timer. He came up with a hack that at least made the kernel survive for him. When Dave tried the same patch, the machine ran for three days before he interrupted it, which was a promising result. The question remains: what was scribbling over the HPET in his case? The only two plausible scenarios Dave could think of were that Trinity generated 0xFED000F0 as a random address and passed it to a syscall that wrote to it, or a hardware bug. That's where the story ends for now. Linus' hacky workaround didn't get committed, but him and John Stultz continue to back and forth on hardening the clock management code in the face of screwed-up hardware, so maybe soon we'll see something real get committed in that area.
"probably a hardware bug" is code for "well, we bought new hardware and threw out all the old stuff, sorry"
Closed NOTABUG?
We all know that "blame the hardware" is only correct one out of every million or so times.
"has made some machines running Linux to freeze... but him and John Stultz continue to back and forth"
Really?
"National Security is the chief cause of national insecurity." - Celine's First Law
Hey youse editors, you want I should take the mug out?
Too many clueless comments already from people who don't understand the difference between "blaming the hardware" and hardening to deal with demonstrably broken hardware (and/or firmware for devices). I've spent years writing drivers for various OSes, including Windows and Linux. It is rare for any complex device to be bug-free at the hardware level (look how many patches are BIOS-applied to CPUs, for example). Sometimes, under NDA, of course, the Windows driver writers are apprised of the deficiencies, or at least get a better response from the vendor when an anomaly appears. Linux rarely gets that same assistance.
My favorite example, though, is all-IBM. We were porting AIX to the PS/2s and 370s. We consistently had problems with the diskette interface under AIX, and the response from Boca Raton was always "it works in MS-DOS, so it's your code, not our hardware". When OS/2 came around, they ran into exactly the same problem in the hardware. By then, we had a work-around (slower, more locks, but no more glitches), which was how OS/2 got around it as well.
I wish Slashdot would allow me to mark users not just as "friend" or "foe", but as "neckbeard". :) That must have been 1986 or 1987?
Windows still BSODs and always will.
About as much as this year being the year of the linux desktop... no really, it's gonna be THIS year... promise.
"... him and John Stultz continue to back and forth ..."
What in the world is happening, editors?
Written by me, AC, this OS has no viruses because no one bothers to make viruses for it. It doesn't lock up either, like those bloated OSes: Windows, Mac OS, and Linux.
Use the OS with the name you can trust. Use Anonymous Coward.
Obviously, it's folds in the space-time continuum that are causing the HPET (the High Precision Event Timer) to jump backwards, causing negative deltas and lockups.
Perhaps a future version of ourselves has transcended space-time and is trying to contact us to help us with our bad harvests? Did Linus try to determine any kind of co-ordinates from the glitch?
Has NASA seen any kind of weird portholes near Jupiter?
Many teams would have written it off as a hardware bug a long time ago, but the Linux kernel team was willing to consider and investigate the possibility that it was a rarely triggered bug in the software before they passed the buck.
And try to avoid crashing due to hardware bugs, if possible.
A contractor once hotplugged one of the CPUs in one of my servers. That's right, they took the processor out and replaced it while the machine was running. The box did not crash. It kept running at least for the few minutes it took me to find out what they did and reboot the machine properly. A hardware error doesn't HAVE to mean a crash, though you can't guarantee that it never will.
Of course if you're holding it wrong, that'll always cause problems, because the special rectangle shape needs to be held at the proper aesthetic angle*. ;)
* I use and enjoy Mac pros, which are nice Unix systems. iOS mobile devices - not so much.
Sometimes it
Sorry if I've found the wrong stuff. I'm doing this via a quick googling...
Is this really the code for reading and writing the HPET?
http://www.cs.fsu.edu/~baker/d...
I've been a PowerPC programmer in aviation for a while. If you need to read the time base register (also a 64-bit up counter), you have to be aware that your read might coincide with the lower 32 bits incrementing and carrying into the upper 32 bits. So you read the upper 32 bits, read the lower 32 bits, then re-read the upper bits and make sure the upper bits didn't change. If they did, repeat the process. But if they are the same, then you combine the 32-bit halves into a 64-bit time and call it good.
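For anyone who wants to see the upper/lower/upper read spelled out, here is a minimal sketch in Python. It is only a simulation: `FakeTimeBase`, `read_upper`, `read_lower`, and `read_counter64` are hypothetical names made up for illustration, not a real PowerPC or HPET API, and the fake register simply advances by one tick on every access to mimic time passing between reads.

```python
class FakeTimeBase:
    """Hypothetical stand-in for a 64-bit up counter exposed as two 32-bit halves.

    Each register access advances the counter by `step` ticks, which mimics
    the counter running while the CPU performs separate 32-bit reads.
    """

    def __init__(self, start, step=1):
        self.value = start
        self.step = step

    def _tick(self):
        self.value = (self.value + self.step) & 0xFFFFFFFFFFFFFFFF

    def read_upper(self):
        upper = self.value >> 32
        self._tick()
        return upper

    def read_lower(self):
        lower = self.value & 0xFFFFFFFF
        self._tick()
        return lower


def read_counter64(reg):
    """Read upper, lower, then upper again; retry if the upper half changed."""
    while True:
        hi = reg.read_upper()
        lo = reg.read_lower()
        if reg.read_upper() == hi:
            # No carry into the upper half happened between the two
            # upper reads, so hi and lo belong to the same 64-bit value.
            return (hi << 32) | lo
        # A carry happened somewhere in between; the pair may be torn.


# Start one tick before the lower half wraps, so the first attempt is torn.
reg = FakeTimeBase(0x00000001_FFFFFFFF)
value = read_counter64(reg)
```

The loop costs one extra read in the common case and only retries when a carry actually lands between the reads, which for a 32-bit lower half is rare.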
http://en.wikipedia.org/wiki/Jury_nullification
PS: I meant that in the best possible way. I didn't really think through the connotations of "neck beard" before posting. I was really thinking more "gray beard", including wizardly connotations.
Hot swapping the CPU without an immediate crash had to be a million to one shot!
But yes, resilient software is always a good thing.
I do hope Linus's patch goes in in some form to at least make it clear what the problem is if someone with similarly borked hardware sees the problem.
>Hot swapping the CPU without an immediate crash had to be a million to one shot!
With QPI interconnect and the voltage and temp supervisory circuits on chip, it's not such a long shot these days, especially on Xeons with failover support that is explicitly intended to cope with a neighbor CPU going down.
I should use this sig to advertise my book ISBN-13 : 978-1501515132.
Yes it's great to support hotplugged CPUs! 1969 called and they want to let you know they supported online reconfiguration back then too: http://en.wikipedia.org/wiki/M...
I always wondered where this setting was...
No, it doesn't. Maybe you should upgrade past XP already and use a Windows made in this century.
Sometimes it
Sometimes it -- what? Did someone attempt to hot-swap your CPU again? (-:
8 of 13 people found this answer helpful. Did you?
Sometimes it screws up the post, where "it" is the Android browser.
That's interesting. Apparently it was supported well enough that they actually did hotplug CPUs regularly, as standard practice. I wonder if they "unmounted" the components before removal and "mounted" them upon insertion. That's a much easier approach, especially for CPUs, than handling a CPU suddenly going AWOL.
Replying to myself, but I figured someone reading this might be interested. Linux does support CPU hotplug where you disable the CPU before removing it. Your motherboard might get mad about it if it's not supported by the board, though.
http://www.cyberciti.biz/faq/d...
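A rough sketch of what "disable the CPU before removing it" looks like from userspace, assuming a Linux kernel built with CPU hotplug support that exposes the usual `/sys/devices/system/cpu/cpuN/online` files. The helper names (`online_path`, `is_online`, `set_online`) are made up for illustration. Writing to the file needs root, and on many x86 boxes cpu0 has no `online` file at all.

```python
from pathlib import Path

# Standard sysfs location for per-CPU control files on Linux.
CPU_SYSFS = Path("/sys/devices/system/cpu")


def online_path(cpu):
    """Return the sysfs 'online' control file for logical CPU number `cpu`."""
    return CPU_SYSFS / f"cpu{cpu}" / "online"


def is_online(cpu):
    """True/False if the kernel reports a state, None if there is no control file."""
    path = online_path(cpu)
    if not path.exists():
        return None
    return path.read_text().strip() == "1"


def set_online(cpu, online):
    """Ask the kernel to bring a CPU online (True) or take it offline (False).

    Requires root. The kernel migrates tasks and interrupts away before the
    CPU goes offline, which is the "unmount" step done ahead of any removal.
    """
    online_path(cpu).write_text("1" if online else "0")
```

For example, `set_online(1, False)` followed by `is_online(1)` should report `False` on a system that supports it. Whether the board tolerates the chip physically leaving afterwards is a separate question.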
Yes, I can see that would limit the damage, but it still leaves the OS surprised to have running tasks just go away.
It would likely work less well with AMD processors since a chunk of memory would also go away.
"that has made some machines running Linux TO freeze"...
WTF?
The "to" is not needed. But then, you ARE American, aren't you... Idiot.
Depends on the hardware... I believe some of the mainframes had interlock screws/latches that would cause a signal slightly before the CPU could be removed, and that would initiate the "unmount". Tightening the latch would then signal a "mount".
Thus doing it automatically.
Similar techniques are used by disks: leads of different lengths signal that a drive is about to be removed or has just been plugged in.
USB has slightly longer contacts on the power pins for much the same reason.
comment first, facts later. http://chem.tufts.edu/AnswersInScience/RelativityofWrong.htm
Wasn't one reason mainframes were so highly valued that one could hot-swap virtually anything without interrupting workflow?
comment first, facts later. http://chem.tufts.edu/AnswersInScience/RelativityofWrong.htm
>With QPI interconnect and the voltage and temp supervisory circuits on chip,
Though the voltage regulation on chip is a new thing as of Haswell and will go away with Skylake.
The Intel (alpha) documentation is clear on the race issue with reading the 64 bit HPET timer.
http://www.intel.com/content/dam/www/public/us/en/documents/technical-specifications/software-developers-hpet-spec-1-0a.pdf
Section 2.4.7:
"A race condition comes up if a 32-bit CPU reads the 64-bit register using two separate 32-bit reads. An accuracy problem may arise if just after reading one half, the other half rolls over and changes the first half."
Using a readq (from io.h) that is implemented as two 32-bit reads is therefore invalid for the HPET.
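To put numbers on that race, here is a small worked example in Python. It is pure arithmetic, not real register access, and `torn_read` is a hypothetical helper: it takes the counter value at the moment of the first 32-bit read and the value at the moment of the second, and glues the two halves together the way an unguarded two-read sequence would.

```python
MASK32 = 0xFFFFFFFF


def torn_read(value_at_first_read, value_at_second_read, upper_first=True):
    """Combine two 32-bit halves sampled at different counter values."""
    if upper_first:
        hi = value_at_first_read >> 32
        lo = value_at_second_read & MASK32
    else:
        lo = value_at_first_read & MASK32
        hi = value_at_second_read >> 32
    return (hi << 32) | lo


before = 0x00000001_FFFFFFFF  # lower half is about to wrap
after = before + 1            # 0x00000002_00000000, one tick later

# Upper half read first: old upper, new lower. The result is roughly
# 2**32 ticks in the past relative to either real value.
behind = torn_read(before, after, upper_first=True)

# Lower half read first: old lower, new upper. The result is roughly
# 2**32 ticks in the future.
ahead = torn_read(before, after, upper_first=False)

print(hex(behind), hex(ahead))
```

Either way the counter only moved by one tick, but the combined value is off by about 2**32. The "behind" case is the kind of backward jump that shows up as a negative delta to code that assumes a monotonic clock.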
Nope, it is not Trinity the desktop that is the driver behind this bug hunt (which is what I thought originally, silly me). It is a separate testing system that randomly calls random syscalls in Linux with ... wait for it ... random parameter values.
That the OS runs at all when random garbage is spewed to random syscalls in random sequences is astounding. A totally stupid testing protocol as well, but that is not the subject of the discussion. Just that Linux locks up when handling hundreds of threads throwing garbage at it.
And the only other system to show this bug consistently has the same motherboard. There is another suspect but I didn't find the motherboard in question.
One might wonder how any other OS would survive.
Solaris supported hot pluggable CPUs in the last century!
That this bug wasn't known as Davey Jones' Lockup...
I had the freeze bug in a VM on a Mac running Parallels. I downloaded Ubuntu 14.04 from Parallels and could not get around it. Then I downloaded directly from Canonical and it worked just fine. I assumed it was a bad download from Parallels, but perhaps it is more subtle. The virtual machine has the same vulnerabilities - is that a clue?
I am affected by this bug, but can't seem to find any real place to follow it. I searched https://bugzilla.kernel.org/ but that didn't turn up anything. Anyone know where the source of truth for tracking this issue might be located?
Yes. Exactly this. Pulling the latches on the card generates an interrupt. In the systems I designed (for a mainframe RAID disk system in this case), a little green light would light up when it was ready. So pull the latches out, wait for the green light, pull the card out. The light generally lit up within a few milliseconds, so you could just rip the card out.
I presume this is how it worked for all products from this (very large, well known) manufacturer, because that's what the spec required.
I should use this sig to advertise my book ISBN-13 : 978-1501515132.