Closure On the Linux Lockup Bug
jones_supa writes: Dave Jones from Red Hat has written a wrap-up of the strange bug that has made some machines running Linux to freeze. (Previous discussion.) Right down to his final week at Red Hat before Dave gave all his hardware back, Linus Torvalds managed to reproduce similar symptoms, by scribbling directly to the HPET timer. He came up with a hack that at least made the kernel survive for him. When Dave tried the same patch, the machine ran for three days before he interrupted it, which was a promising result. The question remains, what was scribbling over the HPET in his case? The only two plausible scenarios Dave could think of were that Trinity generated 0xFED000F0 as a random address and passed that to a syscall which wrote to it, or a hardware bug. That's where the story ends for now. Linus' hacky workaround didn't get committed, but him and John Stultz continue to back and forth on hardening the clock management code in the face of screwed up hardware, so maybe soon we'll see something real get committed on that area.
"probably a hardware bug" is code for "well, we bought new hardware and threw out all the old stuff, sorry"
>Hot swapping the CPU without an immediate crash had to be a million to one shot!
With QPI interconnect and the voltage and temp supervisory circuits on chip, it's not such a long shot these days, especially on Xeons with failover support that is explicitly intended to cope with a neighbor CPU going down.
I should use this sig to advertise my book ISBN-13 : 978-1501515132.
Replying to myself, but I figured someone reading this might be interested. Linux does support CPU hotplug where you disable the CPU before removing it. Your motherboard might get mad about it if it's not supported by the board, though.
http://www.cyberciti.biz/faq/d...