Closure On the Linux Lockup Bug

Posted by Soulskill on Friday January 9, 2015 @02:01PM from the it-was-dead-the-whole-time dept.

jones_supa writes: Dave Jones from Red Hat has written a wrap-up of the strange bug that has made some machines running Linux to freeze. (Previous discussion.) Right down to his final week at Red Hat before Dave gave all his hardware back, Linus Torvalds managed to reproduce similar symptoms, by scribbling directly to the HPET timer. He came up with a hack that at least made the kernel survive for him. When Dave tried the same patch, the machine ran for three days before he interrupted it, which was a promising result. The question remains, what was scribbling over the HPET in his case? The only two plausible scenarios Dave could think of were that Trinity generated 0xFED000F0 as a random address and passed that to a syscall which wrote to it, or a hardware bug. That's where the story ends for now. Linus' hacky workaround didn't get committed, but him and John Stultz continue to back and forth on hardening the clock management code in the face of screwed up hardware, so maybe soon we'll see something real get committed on that area.
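
For readers wondering why 0xFED000F0 stands out: on many PCs the firmware maps the HPET register block at 0xFED00000, and the HPET specification puts the main counter value register at offset 0xF0 in that block. A minimal sketch of that arithmetic, assuming the common default base (the real base is reported by the ACPI HPET table, and the helper names here are made up for illustration):

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: HPET mapped at the common default base address.
 * A real system reports the base through the ACPI HPET table. */
#define HPET_ASSUMED_BASE  UINT64_C(0xFED00000)
#define HPET_BLOCK_SIZE    UINT64_C(0x400)  /* 1 KiB register block */
#define HPET_MAIN_COUNTER  UINT64_C(0x0F0)  /* main counter value register */

/* True if phys_addr falls inside the assumed HPET register block. */
static bool hpet_contains(uint64_t phys_addr)
{
    return phys_addr >= HPET_ASSUMED_BASE &&
           phys_addr < HPET_ASSUMED_BASE + HPET_BLOCK_SIZE;
}

/* Offset from the assumed base; only meaningful if hpet_contains() is true. */
static uint64_t hpet_offset(uint64_t phys_addr)
{
    return phys_addr - HPET_ASSUMED_BASE;
}
```

Under that assumption, `hpet_offset(0xFED000F0)` lands exactly on `HPET_MAIN_COUNTER`, which is why a stray write to that address would look like the clock jumping.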

115 comments

does not sound like closure to me by Narcocide · 2015-01-09 14:12 · Score: 4, Informative

"probably a hardware bug" is code for "well, we bought new hardware and threw out all the old stuff, sorry"
1. Re:does not sound like closure to me by thegarbz · 2015-01-09 14:34 · Score: 4, Informative
  
  Re-read the summary. They know what is causing the lockup, they don't know what is making the system call which is triggering the bug. Once you know what is causing the lockup it can be fixed, and the hack that was written made the lock-ups stop. At no point did anyone throw out or try new hardware, though one thought is everything is originating from a hardware bug.
2. Re:does not sound like closure to me by Anonymous Coward · 2015-01-09 14:36 · Score: 1
  
  "probably firmware SMM code messing with the HPET counter behind our back" != "probably a hardware bug"
3. Re:does not sound like closure to me by sjames · 2015-01-09 15:06 · Score: 4, Interesting
  
  RTFA, they have good reason to point at the hardware. Then there's the bazillions of servers running on different hardware that have never seen the bug.
  Many teams would have written it off as a hardware bug a long time ago, but the linux kernel team was willing to consider and investigate the possibility that it was a rarely triggered bug in the software before they passed the buck.
  Sometimes it really is a hardware bug.
4. Re:does not sound like closure to me by Anonymous Coward · 2015-01-09 18:51 · Score: 0
  
  yes, because windows doesn't also suffer such teething problems...
5. Re:does not sound like closure to me by Anonymous Coward · 2015-01-09 19:21 · Score: 0
  
  So you'd rather have a bunch of underpaid employees and clueless middle managers, but ultimately greedy CEOs running things?
6. Re: does not sound like closure to me by Anonymous Coward · 2015-01-09 20:55 · Score: 1
  
  My windows servers have an uptime of 49 years, 31 days, 22 hrs, 15 mins and 4539 ms. No Linux server can beat that
7. Re: does not sound like closure to me by chentiangemalc · 2015-01-09 21:49 · Score: 0
  
  If you are using a GUI desktop, Windows 7 or 8.1 is way more stable than many popular Linux GUIs, unless you load up your Windows machine with crapware / adware. Unfortunately most Windows machines come preloaded with crap.
8. Re: does not sound like closure to me by paulatz · 2015-01-09 22:10 · Score: 1
  
  Did you add up the uptime of all the 4096 servers?
  
  --
  this post contain no useful information, no need to mod it down
9. Re: does not sound like closure to me by Anonymous Coward · 2015-01-09 22:53 · Score: 0
  
  So you never update Windows?
10. Re:does not sound like closure to me by GlowingCat · 2015-01-09 23:09 · Score: 1
  
  Maybe kernel or driver code is writing to the HPET counter accidentally. The kernel and drivers both have access to the same unlimited memory space, right?
11. Re: does not sound like closure to me by Anonymous Coward · 2015-01-09 23:53 · Score: 0
  
  As long as you compare like for like (a supported Linux running on certified hardware), this is complete bullshit. Your problem is that you are comparing it with hobbyists playing around on non-dedicated hardware. Admittedly many IT departments behave like hobbyists when it comes to Linux; however, that reflects on their competence more than anything else.
12. Re:does not sound like closure to me by PoochieReds · 2015-01-10 02:01 · Score: 1
  
  It's still not a given that it's the hardware. It's likely that something is scribbling over the HPET timer. As to whether that's due to faulty hardware or a software bug is still undetermined.
  Random memory corruption is oh so painful. :(
13. Re:does not sound like closure to me by tippen · 2015-01-10 04:47 · Score: 3, Funny
  
  One of the more memorable quotes I heard while developing embedded systems: if you can fix it in software, it isn't a hardware bug
  Annoying as hell to the software team when it is clearly a bug in the hardware, but very true at a practical level for the engineering team trying to get product out the door.
14. Re:does not sound like closure to me by sjames · 2015-01-10 06:02 · Score: 1
  
  I'm familiar with that one. Same thing happens in boot ROMs.
15. Re: does not sound like closure to me by Anonymous Coward · 2015-01-10 06:32 · Score: 0
  
  how does one not have all hardware certified, everyfucking one else manages to do it, its not like its ever expanding and unique anymore
  guess my standard issue realtek soundcard isnt certified, cant even adjust the line in volume without bringing up some 2 decade old command line program to work around it
  its sad
16. Re: does not sound like closure to me by Anonymous Coward · 2015-01-10 06:54 · Score: 0
  
  If you are using a GUI desktop, Windows 7 or 8.1 is way more stable than many popular Linux GUIs, unless you load up your Windows machine with crapware / adware. Unfortunately most Windows machines come preloaded with crap.
  Well backed up with facts there. Love the links to all your sources - they are really helpful and informative.
17. Re:does not sound like closure to me by the_B0fh · 2015-01-10 12:21 · Score: 1
  
  bwahahahahahaha, come on, we need sarcasm font here!!
18. Re:does not sound like closure to me by fidelleon · 2015-01-11 05:37 · Score: 1
  
  Nice try, you troll.
19. Re:does not sound like closure to me by Anonymous Coward · 2015-01-11 09:58 · Score: 0
  
  I agree: Until it's fully understood, fixed and committed it sounds like "clue-sure" to me. In the meantime TFA clearly states (para) "we think it's this but we don't know for sure."
20. Re:does not sound like closure to me by TechyImmigrant · 2015-01-11 15:19 · Score: 1
  
  Someone with the right equipment should be able to do a hardware trace and catch the culprit.
  
  --
  I should use this sig to advertise my book ISBN-13 : 978-1501515132.
21. Re:does not sound like closure to me by Anonymous Coward · 2015-01-11 16:46 · Score: 1
  
  if you can fix it in software, it isn't a hardware bug
  I'm a hardware and software guy, and I can tell you that is entirely bullshit. While I understand it may seem this way because sometimes software guys can't write a driver to save their lives, there are many bugs in hardware which are actual hardware bugs (race conditions, dropped interrupts, whatever) that have workarounds in software.
  I've seen buggy hardware NAND flash ECC units "fixed" by doing ECC entirely in software, leaving the hardware unit unused, and taking a bit of a throughput hit.
  I also seem to recall a problem with some built-in Intel CPU random number generators not delivering as much entropy as advertised. Again, this was "fixed" by mixing it with yet more entropy in software, but that didn't change the fact that the CPU RNG didn't work as advertised.
22. Re: does not sound like closure to me by nobodie · 2015-01-18 05:01 · Score: 1
  
  Everyone else? Like all hardware is OSX certified? Try putting any old HDD or SSD into a macbook and see how that works.
  
  --
  Subversion of spatial scale luxury decoration ideas.
In other words.. by Anonymous Coward · 2015-01-09 14:16 · Score: 2, Funny

Closed NOTABUG?
"Blame the hardware." by Anonymous Coward · 2015-01-09 14:17 · Score: 0

We all know that "blame the hardware" is only correct one out of every million or so times.
Editors, edit! by msauve · 2015-01-09 14:25 · Score: 2

"has made some machines running Linux to freeze... but him and John Stultz continue to back and forth"

Really?

--
"National Security is the chief cause of national insecurity." - Celine's First Law
1. Re:Editors, edit! by Anonymous Coward · 2015-01-09 14:30 · Score: 0
  
  Also, it's not "closure" until the root cause is actually identified and patched. They probably meant "closing in on the Linux lockup bug."
2. Re:Editors, edit! by SeaFox · 2015-01-09 14:32 · Score: 2
  
  The second sentence isn't much better:
  
  Right down to his final week at Red Hat before Dave gave all his hardware back, Linus Torvalds managed to reproduce similar symptoms, by scribbling directly to the HPET timer.
  Was Linus at Dave's place working on the issue? Is the first part a sentence fragment and Dave did something before he gave his hardware back we aren't being told? Or is the first part really a continuation of the first sentence, and Dave was working on his writeup all the way until the deadline for returning his hardware?
3. Re:Editors, edit! by Anonymous Coward · 2015-01-10 01:34 · Score: 0
  
  No, he reproduced it by manually messing up the HPET; on Dave's Red Hat machine the same thing was happening spontaneously.
4. Re:Editors, edit! by Anonymous Coward · 2015-01-10 05:40 · Score: 0
  
  "has made some machines running Linux to freeze... but him and John Stultz continue to back and forth"
  Really?
  Obviously not written by a native English-speaker. How many foreign languages do you speak?
him? by Anonymous Coward · 2015-01-09 14:31 · Score: 1

him and John Stultz

Hey youse editors, you want I should take the mug out?
hardening is NOT blaming the hardware by dltaylor · 2015-01-09 14:33 · Score: 4, Interesting

Too many clueless comments already that don't understand the difference between "blaming the hardware" and hardening to deal with demonstrably-broken hardware (and/or firmware for devices). I've spent years writing drivers for various OS', including Windows and Linux. It is rare for any complex device to be bug-free at the hardware level (look how many patches are BIOS-applied to CPUs, for example). Sometimes, under NDA, of course, the Windows driver writers are apprised of the deficiencies, or, at least, get better response from the vendor when an anomaly appears. Linux rarely gets that same assistance.
My favorite example, though, is all-IBM. We were porting AIX to the PS/2s and 370s. We consistently had problems with the diskette interface under AIX and the response from Boca Raton was always "it works in MS-DOS, so it's your code, not our hardware". When OS/2 came around, they ran into exactly the same problem in the hardware. By then, we had a work-around (slower, more locks, but no more glitches) which was how OS/2 got around it, as well.
1. Re:hardening is NOT blaming the hardware by thegarbz · 2015-01-09 14:36 · Score: 1
  
  Too many clueless comments already
  Not bad given you were the ~4th poster and 2 of them didn't mention the hardware.
2. Re:hardening is NOT blaming the hardware by Anonymous Coward · 2015-01-09 14:54 · Score: 0
  
  Too many clueless comments already that don't understand the difference between "blaming the hardware" and hardening to deal with demonstrably-broken hardware (and/or firmware for devices). I've spent years writing drivers for various OS', including Windows and Linux. It is rare for any complex device to be bug-free at the hardware level (look how many patches are BIOS-applied to CPUs, for example). Sometimes, under NDA, of course, the Windows driver writers are apprised of the deficiencies, or, at least, get better response from the vendor when an anomaly appears. Linux rarely gets that same assistance.
  My favorite example, though, is all-IBM. We were porting AIX to the PS/2s and 370s. We consistently had problems with the diskette interface under AIX and the response from Boca Raton was always "it works in MS-DOS, so it's your code, not our hardware". When OS/2 came around, they ran into exactly the same problem in the hardware. By then, we had a work-around (slower, more locks, but no more glitches) which was how OS/2 got around it, as well.
  Was it a DMA problem?
3. Re:hardening is NOT blaming the hardware by kad77 · 2015-01-09 15:19 · Score: 3, Funny
  
  What you posted about his being the 4th post struck me as wrong, given how far it was down the page. I'm bored, so I took a moment to look at how many posts have an earlier timestamp than the one you are slamming (at least 8), and 2 make dismissive statements about hardware, including the first comment of article at 8:12, and another at 8:19 seemingly dismissing hardware as a possibility.
  So your snide comment is not based in fact. It's like you are reading a different page. Maybe you need glasses. An attitude adjustment, for sure.
4. Re:hardening is NOT blaming the hardware by Dog-Cow · 2015-01-10 17:11 · Score: 1
  
  The other posts were, in fact, made later, but someone was messing around with the HPET timer and, well, bugs.
"friend" and "foe", but no "neckbeard" by raymorris · 2015-01-09 14:39 · Score: 0

I wish Slashdot would allow me to mark users not just as "friend" or "foe", but as "neckbeard". :) That must have been 1986 or 1987?
1. Re:"friend" and "foe", but no "neckbeard" by dltaylor · 2015-01-09 15:13 · Score: 2
  
  0: I do shave my neck. :) In fact, the beard has been gone for more than a year.
  1: a bit later, early 1990; we all got a big laugh out of the 486SX/487 when those came out. https://en.wikipedia.org/wiki/Intel_80486SX
2. Re:"friend" and "foe", but no "neckbeard" by Anonymous Coward · 2015-01-09 15:14 · Score: 2, Funny
  
  AC here, no longer posting as myself since I've long lost my SO account, can't be bothered to find the password for the ancient yahoo email address, and after working on the inside in finance will probably never post an opinion (as my own) again. (Yes, that was a run on sentence.)
  If 1986 qualifies as a "neckbeard" you missed the mark, unless he's a Berkeley neckbeard. The 80's were a magical time when power ties, very bad print shirts, and driving your overpriced car with women and blow was available to any person who could reasonably crank out C or Basic.
  Just saying...
3. Re:"friend" and "foe", but no "neckbeard" by Anonymous Coward · 2015-01-09 15:34 · Score: 0
  
  Are you implying that hookers and blow ever went out of fashion?
4. Re:"friend" and "foe", but no "neckbeard" by sound+vision · 2015-01-09 22:01 · Score: 1
  
  No, I think he's implying that coding has gone out of fashion (or at least no longer guarantees a high-paying job.)
5. Re:"friend" and "foe", but no "neckbeard" by Lunix+Nutcase · 2015-01-10 02:09 · Score: 1
  
  No, I think he's implying that coding has gone out of fashion (or at least no longer guarantees a high-paying job.)
  Coding going out of fashion? Have you been living in a cave these last few years?
6. Re:"friend" and "foe", but no "neckbeard" by Lehk228 · 2015-01-10 15:25 · Score: 1
  
  marking users a "neckbeard" on slashdot has been available since the beginning. all you need to do is check if the user has an account on slashdot, if so, neckbeard is present.
  
  --
  Snowden and Manning are heroes.
In the mean time... by Anonymous Coward · 2015-01-09 14:49 · Score: 1

Windows still BSOD's and always will.
"closure" by Anonymous Coward · 2015-01-09 14:50 · Score: 1

About as much as this year being the year of the linux desktop... no really, it's gonna be THIS year... promise.
"him and John Stultz continue ..." by seyyah · 2015-01-09 15:00 · Score: 2

"... him and John Stultz continue to back and forth ..."
What in the world is happening, editors?
1. Re:"him and John Stultz continue ..." by Rick+Zeman · 2015-01-09 15:08 · Score: 1
  
  "... him and John Stultz continue to back and forth ..."
  What in the world is happening, editors?
  The only editors on slashdot are some vi's, some pines, and a couple of notepads and textedit. Certainly, no human editors....
2. Re:"him and John Stultz continue ..." by Anonymous Coward · 2015-01-09 15:17 · Score: 1
  
  They have obviously outsourced the editing to India.
3. Re:"him and John Stultz continue ..." by Anonymous Coward · 2015-01-09 15:19 · Score: 0
  
  "... him and John Stultz continue to back and forth ..."
  What in the world is happening, editors?
  They have obviously outsourced the editing to India.
4. Re:"him and John Stultz continue ..." by dwye · 2015-01-10 03:11 · Score: 1
  
  They have obviously outsourced the editing to India.
  Or New Jersey
Use my OS instead, Anonymous Coward by Anonymous Coward · 2015-01-09 15:06 · Score: 0

Written by me, AC, this OS has no viruses because no one bothers to make viruses for it. It doesn't lock up either like those bloated OS's; Windows, Mac OS, and Linux.
Use the OS with the name you can trust. Use Anonymous Coward.
Folds in space time continuum by Anonymous Coward · 2015-01-09 15:07 · Score: 1

Obviously, it's folds in the space time continuum that is causing HPET (the high precision hardware timer) to jump backwards, causing negative deltas and lockups.
Perhaps a future version of ourselves has transcended space-time and is trying to contact us to help us with our bad harvests? Did Linus try to determine any kind of co-ordinates from the glitch?
Has NASA seen any kind of weird portholes near Jupiter?
1. Re:Folds in space time continuum by Anonymous Coward · 2015-01-09 16:24 · Score: 0, Offtopic
  
  Has NASA seen any kind of weird portholes near Jupiter?
  No, but I heard they found one near Uranus.
2. Re:Folds in space time continuum by thephydes · 2015-01-09 17:46 · Score: 1
  
  To understand that joke you need to be aware that in some places Uranus is pronounced your-anus (here in Oz, for example). The old 9th grade joke: "Mr R, can you see Uranus with a telescope?" "Yes, if you use a mirror lens" ....
3. Re:Folds in space time continuum by Anonymous Coward · 2015-01-09 22:14 · Score: 0
  
  There are only two ways to pronounce Uranus, and neither one sounds good. It's either your-anus or urine-us.
4. Re:Folds in space time continuum by Teun · 2015-01-10 03:46 · Score: 1
  
  You should for once get out of your English-centric world and use the languages of the people who named the planet.
  
  --
  "The likes of Facebook and WhatsApp are free to those whose privacy is of zero value."
plus don't crash on bad hardware. Hotplugged CPU by raymorris · 2015-01-09 15:39 · Score: 2

> Many teams would have written it off as a hardware bug a long time ago, but the linux kernel team was willing to consider and investigate the possibility that it was a rarely triggered bug in the software before they passed the buck.
And try to avoid crashing due to hardware bugs, if possible.
A contractor once hotplugged one of the CPUs in one of my servers. That's right, they took the processor out and replaced it with the machine running. The box did not crash. It kept running at least for the few minutes it took me to find out what they did and reboot the machine properly. Hardware error doesn't HAVE to mean a crash, though you can't guarantee that it never will.
Of course if you're holding it wrong, that'll always cause problems, because the special rectangle shape needs to be held at the proper aesthetic angle*. ;)
* I use and enjoy Mac pros, which are nice Unix systems. iOS mobile devices - not so much.
Sometimes it
Call me crazy by Nyall · 2015-01-09 15:55 · Score: 4, Interesting

Sorry if I've found the wrong stuff. I'm doing this via a quick googling...
Is this really the code for reading and writing the HPET?
http://www.cs.fsu.edu/~baker/d...
I've been a powerpc programmer in aviation for a while. If you need to read the time base register (also a 64 bit up counter) you have to be aware that your read might coincide with the lower 32 bits incrementing and carrying into the upper 32 bits. So you read the upper 32 bits, read the lower 32 bits, then re-read the upper bits and make sure the upper bits didn't change. If they did, repeat this process. But if they are the same then you combine the 32 bit halves into a 64 bit time and call it good.
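
The upper/lower/upper procedure described above can be sketched in C roughly as follows. This is only an illustration for a counter whose halves are not latched by hardware; the function name and the `hi`/`lo` pointers are hypothetical stand-ins for the two memory-mapped 32-bit halves, not anything taken from the kernel source.

```c
#include <stdint.h>

/* Read a 64-bit up-counter exposed as two unlatched 32-bit halves.
 * hi and lo are hypothetical pointers to the memory-mapped halves. */
static uint64_t read_counter64(const volatile uint32_t *hi,
                               const volatile uint32_t *lo)
{
    uint32_t upper, lower, check;

    do {
        upper = *hi;            /* first read of the upper half */
        lower = *lo;            /* read the lower half */
        check = *hi;            /* re-read the upper half */
    } while (upper != check);   /* upper half changed in between: retry */

    /* Both reads of the upper half agree, so lower belongs to it. */
    return ((uint64_t)upper << 32) | lower;
}
```

The loop normally runs once; it only repeats when the lower half wraps between the two reads of the upper half.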

--
http://en.wikipedia.org/wiki/Jury_nullification
1. Re: Call me crazy by Anonymous Coward · 2015-01-09 16:33 · Score: 0
  
  I have no knowledge of this particular hardware timer but some hardware that I have dealt with latches the counter registers until both halves are read/written. Obviously the timer keeps running on reads, it just latches the output.
2. Re:Call me crazy by myforwik · 2015-01-09 16:34 · Score: 1
  
  And what does writel do?
3. Re:Call me crazy by Anonymous Coward · 2015-01-09 16:40 · Score: 1
  
  Is this really the code for reading and writing the HPET?
  Yup.
  
  I've been a powerpc programmer in aviation for a while. If you need to read the time base register (also a 64 bit up counter) you have to be aware that your read might coincide with the lower 32 bits incrementing and carrying into the upper 32 bits. So you read the upper 32 bits, read the lower 32 bits, then re-read the upper bits and make sure the upper bits didn't change. If they did, repeat this process. But if they are the same then you combine the 32 bit halves into a 64 bit time and call it good.
  That would be entirely wrong here.
  The upper 32 bits of the current timer value are latched into the register at the upper address when the lower 32 bits are read from the lower address.
4. Re:Call me crazy by Anonymous Coward · 2015-01-09 16:53 · Score: 0
  
  http://www.cs.fsu.edu/~baker/devices/lxr/http/source/linux/arch/um/include/asm/io.h#L44
  But I Guess clicking on the link was too much work?
5. Re:Call me crazy by Nyall · 2015-01-09 17:19 · Score: 1
  
  OK then. Where in this return statement are the lower 32 bits read first? I don't believe the bitwise or operator is a sequence point. (The logical one is)
  return readl(addr) | (((unsigned long long)readl(addr + 4)) << 32LL);
  http://www.intel.com/hardwared...
  but I did find the following, which documents the race condition I explained above.
  http://www.intel.com/content/d...
  I will search for newer documentation than a 1.0a.
  
  --
  http://en.wikipedia.org/wiki/Jury_nullification
6. Re: Call me crazy by Anonymous Coward · 2015-01-09 17:20 · Score: 0
  
  Ahh, OK if I brag a bit here, as long as I'm an AC?
  I worked at a computer manufacturer in the 70s where the hardware guys developed a 2 register timer that latched when read like that. The problem was, if you only read the upper register, the thing wouldn't unlatch! It was me, a software guy, who came up with the fix, a hardware timer that would unlatch the thing after awhile. The only software that should have been reading those two registers was DMA, and it was guaranteed to read both before the timer that unlatched.
7. Re:Call me crazy by Anonymous Coward · 2015-01-09 17:28 · Score: 0
  
  OK then. Where in this return statement are the lower 32 bits read first? I don't believe the bitwise or operator is a sequence point.
  It's not read first. It's up to the compiler. If you want to read it first, you have to stick it on a different statement line.
8. Re:Call me crazy by WinstonWolfIT · 2015-01-09 17:31 · Score: 1
  
  Might want to check your first link.
9. Re:Call me crazy by Nyall · 2015-01-09 17:39 · Score: 1
  
  Sorry for the bad post. Yes, the first link does not work, but it is what is documented in hpet.c as the reference. A sentence went missing somewhere saying that I couldn't find it. The second link, which does work, is what I've found so far. I have yet to find something newer which documents the latching behavior that was claimed.
  Sorry again for the bad post.
  -Nyall
  
  --
  http://en.wikipedia.org/wiki/Jury_nullification
10. Re: Call me crazy by Anonymous Coward · 2015-01-09 19:06 · Score: 0
  
  I had a bug that took me a while to find with an asynchronous timer. The timer had some logic to synchronize the register reads with the uP clock. However the interrupt from the timer was a straight shot to the interrupt controller. Occasionally the ISR would beat the synchronization logic, resulting in a stale read of the timer count. The lower bits would be read as 0xFFFF instead of 0x0000, so you'd read
  0x1234FFFF instead of the correct 0x12340000
  Also had one where the synchronization logic was plain busted and that didn't show up on the errata for another 3 months.
  Word of warning the I2C controllers on Freescale ARM processors are jank.
11. Re:Call me crazy by Anonymous Coward · 2015-01-09 19:18 · Score: 0
  
  That would be entirely wrong here.
  The upper 32 bits of the current timer value are latched into the register at the upper address when the lower 32 bits are read from the lower address.
  Classic is when you do that, but with the interrupts on and another thread comes in and also reads the timer counter in the middle.
  Had another IC, a radio where the power on reset wouldn't work reliably on about 2% of the chips. The power on reset control circuit had an 8 bit counter, which if it powered up as 00 would cause the reset to fail to assert. Fixing that caused a two month delay and required adding a 0.20 cent mosfet to allow the system to power cycle the radio IC. X five million units.
12. Re: Call me crazy by Anonymous Coward · 2015-01-09 19:37 · Score: 0
  
  Or as a software guy who now writes hardware who has implemented the same thing.
  You simply latch the value on the first register read, nothing else.
  If you never read the second register, you can start over by reading the first register.
13. Re:Call me crazy by DamnOregonian · 2015-01-09 19:50 · Score: 2
  
  That code doesn't suffer from the problem you think it does.
  
  readq is only defined in that code if undefined elsewhere, and is only used to read counters on 64-bit architectures.
  
  on 32-bit architectures, that code uses readl to read the counter.
  
  readq is undefined in some 32-bit architectures, so is defined there- but only used there to read the configuration register (not likely to roll over ;)
  
  Also, the actual reading of the counter is done indirectly: it's returned from the IRQ handler for the HPET. the direct reading is only done during calibration.
14. Re:Call me crazy by hendric · 2015-01-09 20:22 · Score: 1
  
  http://www.cs.fsu.edu/~baker/devices/lxr/http/source/linux/arch/x86/include/asm/io.h#L49
  Line 49 looks like where readq is defined for x64 architecture.
  
  --
  "Though it may take a thousand years, we shall be FREE."
15. Re:Call me crazy by _merlin · 2015-01-09 20:26 · Score: 1
  
  The upper 32 bits of the current timer value are latched into the register at the upper address when the lower 32 bits are read from the lower address.
  Well in that case, you'd need to ensure the lower 32 bits are read first so you're reading the upper 32 bits that you latched this time through, not last time through. And if that's the case, the code is still wrong because there's nothing to force a sequence point between the two reads. The compiler is free to re-order the two reads in that expression.
16. Re: Call me crazy by Anonymous Coward · 2015-01-10 03:50 · Score: 0
  
  I2C controllers on Freescale ColdFire processors are jank as well. I guess they did a copy-and-paste... Work around is simple, dedicate a pin to toggle the clock if the I2C controller locks up. You'd think that Freescale would know better. Oh, right, there's a work-around so the hardware doesn't need to be fixed.
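
On the ordering question raised in this thread: in an expression like `readl(addr) | (... readl(addr + 4) ...)` C does not specify which operand is evaluated first. If a device really does latch the upper word when the lower word is read, as one commenter claims, the two reads can be put in separate statements. A minimal sketch under that assumption (the function name and the two-word register layout are hypothetical):

```c
#include <stdint.h>

/* Assumption: reading reg[0] (low word) latches the matching high word,
 * which is then read back from reg[1]. */
static uint64_t read64_lo_then_hi(const volatile uint32_t *reg)
{
    /* Each statement ends at a sequence point and both accesses are
     * volatile, so the compiler must perform the low read first. */
    uint32_t lo = reg[0];
    uint32_t hi = reg[1];

    return ((uint64_t)hi << 32) | lo;
}
```

This only pins down the order the compiler emits. Whether the hardware sees the reads in that order is a separate question that depends on the architecture and on how the region is mapped.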
meant in the best possible way. Gray beard. by raymorris · 2015-01-09 16:26 · Score: 1

PS I meant that in the best possible way. I didn't really think through the connotations of "neck beard" before posting. I was really thinking more "gray beard" , including wizardly connotations.
Re:plus don't crash on bad hardware. Hotplugged CP by sjames · 2015-01-09 16:43 · Score: 1

Hot swapping the CPU without an immediate crash had to be a million to one shot!
But yes, resilient software is always a good thing.
I do hope Linus's patch goes in in some form to at least make it clear what the problem is if someone with similarly borked hardware sees the problem.
Re:plus don't crash on bad hardware. Hotplugged CP by TechyImmigrant · 2015-01-09 17:19 · Score: 3, Informative

>Hot swapping the CPU without an immediate crash had to be a million to one shot!
With QPI interconnect and the voltage and temp supervisory circuits on chip, it's not such a long shot these days, especially on Xeons with failover support that is explicitly intended to cope with a neighbor CPU going down.

--
I should use this sig to advertise my book ISBN-13 : 978-1501515132.
Re:plus don't crash on bad hardware. Hotplugged CP by pasamio · 2015-01-09 18:20 · Score: 2

Yes it's great to support hotplugged CPUs! 1969 called and they want to let you know they supported online reconfiguration back then too: http://en.wikipedia.org/wiki/M...

--
I always wondered where this setting was...
No it doesn't by johncandale · 2015-01-09 18:33 · Score: 1

No it doesn't. Maybe you should upgrade past XP already and use a windows made in this century
1. Re:No it doesn't by Anonymous Coward · 2015-01-09 20:50 · Score: 0
  
  In Windows 8.1 it just says something went wrong, gonna reboot.
2. Re:No it doesn't by fnj · 2015-01-09 22:19 · Score: 1
  
  Whether or not you see a blue screen with a lot of text on it is beside the point. Every OS can potentially panic. Even if it's configured to paper over the problem by doing it quietly and rebooting, the system has gone tits up.
3. Re: No it doesn't by Anonymous Coward · 2015-01-09 23:07 · Score: 0
  
  The frustrating part of some lockups is there is nothing about it in the logs. How much would it cost to have a computer which could leave a trace of the cause of a lockup, even if the machine exploded? Surely modern tech is much faster than a shockwave front. Fast enough to run on a small capacitor the fraction of a second it takes to report the malfunction before dying.
4. Re:No it doesn't by Anonymous Coward · 2015-01-09 23:43 · Score: 0
  
  Windows 7 still has lockups. Can't remember ever seeing a bluescreen, everything just freezes and you have to do a hard reset.
5. Re: No it doesn't by drinkypoo · 2015-01-09 23:53 · Score: 1
  
  How much would it cost to have a computer which could leave a trace of the cause of a lockup, even if the machine exploded?
  You would have to have double your main memory, basically. Not really that expensive.
  
  --
  "You're right," Fisheye says. "I should have set it on 'whip' or 'chop.'"
6. Re:No it doesn't by Anonymous Coward · 2015-01-10 00:04 · Score: 0
  
  My Win7 machine at work that runs SolidWorks and not much else pulls a BSOD now and then.
7. Re:No it doesn't by Anonymous Coward · 2015-01-10 01:24 · Score: 0
  
  There is no Windows made in this century. It's a few minor patches on the old piece of shit.. Look at the dialog for changing an environment variable.
8. Re:No it doesn't by Osgeld · 2015-01-10 06:35 · Score: 1
  
  hell I cant recall the last time I saw XP BSOD
9. Re:No it doesn't by Anonymous Coward · 2015-01-10 06:37 · Score: 0
  
  and linux isnt a gigantic patch hack orgy streaming along since 1993, made upon a unix patch hack orgy stemming from 1963
10. Re: No it doesn't by corychristison · 2015-01-10 07:31 · Score: 1
  
  The problem is that when the kernel panics, everything grinds to a standstill, and that includes the hard drive controller/driver. How are you going to write the data if you don't have access to the disks?
  This is by design, as the disk controller could be the reason for the lockup, and you would potentially corrupt your entire disk by trying to write to it.
  I'm sure it's been thought of before, but my first thought is to include a very small chunk of memory on the motherboard, with a stupidly simple API that is designed for dumping kernel panic data into, where it would stay until, say, 3 reboots or it's written over again. I don't design motherboards, so I don't know how feasible this would be... but with Microsoft's pull with the manufacturers I'm sure they could make it happen. The problem then, obviously, is it would be locked down to support only Windows, or it would be redesigned across each manufacturer, each one less compatible than the previous.
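A rough sketch of that "small chunk of memory with a stupidly simple API" idea, in Python purely for illustration. Everything here is an assumption: `PanicStore`, the header layout, and the magic number are made up, and a `bytearray` stands in for whatever memory a real board would provide. It only shows the bookkeeping: write one record, hand it back on each of the next three boots, then let it expire.

```python
import struct

# Hypothetical layout: magic number, boots remaining, payload length.
MAGIC = 0x50414E43
HEADER = struct.Struct("<IBH")


class PanicStore:
    """Toy model of a tiny crash-record area that survives reboots."""

    def __init__(self, size=256):
        # Stand-in for the persistent memory region (assumption).
        self.mem = bytearray(size)

    def record(self, message, keep_boots=3):
        """Overwrite the area with a new panic message."""
        payload = message.encode("utf-8")[: len(self.mem) - HEADER.size]
        HEADER.pack_into(self.mem, 0, MAGIC, keep_boots, len(payload))
        self.mem[HEADER.size:HEADER.size + len(payload)] = payload

    def on_boot(self):
        """Return the stored message, or None if empty or expired."""
        magic, boots, length = HEADER.unpack_from(self.mem, 0)
        if magic != MAGIC or boots == 0:
            return None
        raw = bytes(self.mem[HEADER.size:HEADER.size + length])
        # Count this boot against the record's remaining lifetime.
        HEADER.pack_into(self.mem, 0, MAGIC, boots - 1, length)
        return raw.decode("utf-8", "replace")
```

The write path touches nothing but a fixed offset in memory, which is the point of the proposal: no disk controller or filesystem has to be alive for the record to land.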
Re:plus don't crash on bad hardware. Hotplugged CP by cerberusss · 2015-01-09 18:37 · Score: 4, Funny

Sometimes it
Sometimes it -- what? Did someone attempt to hot-swap your CPU again? (-:

--
8 of 13 people found this answer helpful. Did you?
Re:plus don't crash on bad hardware. Hotplugged CP by raymorris · 2015-01-09 19:51 · Score: 1

Sometimes it screws up the post, where "it" is the Android browser.
Re:plus don't crash on bad hardware. Hotplugged CP by raymorris · 2015-01-09 19:59 · Score: 1

That's interesting. Apparently it was supported well enough that they actually did hotplug CPUs regularly, as standard practice. I wonder if they "unmounted" the components before removal and "mounted" them upon insertion. That's a much easier approach, especially for CPUs, than handling a CPU suddenly going AWOL.
Linux CPU hotplug support link by raymorris · 2015-01-09 20:05 · Score: 3, Informative

Replying to myself, but I figured someone reading this might be interested. Linux does support CPU hotplug where you disable the CPU before removing it. Your motherboard might get mad about it if it's not supported by the board, though.
http://www.cyberciti.biz/faq/d...
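
For anyone who wants to poke at it, here is a minimal sketch of the "disable the CPU first" step, assuming the usual sysfs layout where each CPU has an `online` control file under `/sys/devices/system/cpu/cpuN/`. The helper names are made up for illustration, writing to the real files needs root, and some CPUs (often cpu0) have no control file at all.

```python
from pathlib import Path

# Usual location of the per-CPU control files on Linux.
CPU_SYSFS_ROOT = Path("/sys/devices/system/cpu")


def online_file(cpu, root=CPU_SYSFS_ROOT):
    """Path of the hotplug control file for one CPU."""
    return Path(root) / f"cpu{cpu}" / "online"


def is_online(cpu, root=CPU_SYSFS_ROOT):
    """True if the CPU is online, or cannot be taken offline at all."""
    control = online_file(cpu, root)
    if not control.parent.is_dir():
        raise FileNotFoundError(f"no such CPU directory: {control.parent}")
    if not control.exists():
        # No control file: this CPU is not hot-removable here.
        return True
    return control.read_text().strip() == "1"


def set_online(cpu, online, root=CPU_SYSFS_ROOT):
    """Ask the kernel to bring a CPU online (True) or offline (False)."""
    online_file(cpu, root).write_text("1" if online else "0")
```

Something like `set_online(1, False)` run as root would take cpu1 out of scheduling before you touch anything physical. The `root` parameter is only there so the helpers can be tried against a scratch directory instead of the live system.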
1. Re:Linux CPU hotplug support link by sjames · 2015-01-10 06:30 · Score: 2
  
  Yes. It's mostly used for reconfiguring VMs, but it is possible to do it with real hardware if the board supports it.
  It's interesting how as time goes on, PC hardware is slowly coming to resemble an affordable version of the mainframes they replaced.
Re:plus don't crash on bad hardware. Hotplugged CP by sjames · 2015-01-09 21:41 · Score: 1

Yes, I can see that would limit the damage, but it still leaves the OS surprised to have running tasks just go away.
It would likely work less well with AMD processors since a chunk of memory would also go away.
American grammar by Anonymous Coward · 2015-01-09 22:17 · Score: 0

"that has made some machines running Linux TO freeze"...
WTF?
The "to" is not needed. But then, you ARE American, aren't you... Idiot.
1. Re: American grammar by Anonymous Coward · 2015-01-09 23:16 · Score: 0
  
  Who cares? It is still understandable.
2. Re:American grammar by Anonymous Coward · 2015-01-10 06:40 · Score: 0
  
  Broken, half-assed foreign understanding of English.
  "Machines running Linux freeze", yeah, that's better, direct out of a tech support call from India. Don't you have a bedpan to empty in your front lawn?
Re:plus don't crash on bad hardware. Hotplugged CP by Anonymous Coward · 2015-01-09 23:03 · Score: 0

Depends on the hardware... I believe some of the mainframes had interlock screws/latches that would cause a signal slightly before the CPU could be removed, and that would initiate the "unmount". Tightening the latch would then signal a "mount".
So it all happened automatically.
Disks use a similar technique: leads of different lengths signal a pending removal or a fresh insertion.
Re:plus don't crash on bad hardware. Hotplugged CP by hitmark · 2015-01-10 00:27 · Score: 1

USB has slightly longer contacts on the power pins for much the same reason.

--
comment first, facts later. http://chem.tufts.edu/AnswersInScience/RelativityofWrong.htm
Re:plus don't crash on bad hardware. Hotplugged CP by hitmark · 2015-01-10 00:29 · Score: 1

Was not one reason why mainframes were so highly valued that one could hotswap virtually anything without interrupting workflow?

--
comment first, facts later. http://chem.tufts.edu/AnswersInScience/RelativityofWrong.htm
Re:plus don't crash on bad hardware. Hotplugged CP by Anonymous Coward · 2015-01-10 03:58 · Score: 0

With QPI interconnect and the voltage and temp supervisory circuits on chip,
Though the voltage regulation on chip is a new thing as of Haswell and will go away with Skylake.
Two 32bit reads of the HPET timer is invalid by Anonymous Coward · 2015-01-10 07:10 · Score: 0

The Intel (alpha) documentation is clear on the race issue with reading the 64 bit HPET timer.
http://www.intel.com/content/dam/www/public/us/en/documents/technical-specifications/software-developers-hpet-spec-1-0a.pdf
Section 2.4.7:
"A race condition comes up if a 32-bit CPU reads the 64-bit register using two separate 32-bit reads. An accuracy problem may be arise if just after reading one half, the other half rolls over and changes the first half."
The usage of readq (from io.h) using two 32bit reads for HPET is therefore invalid.
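
To make the race concrete, here is a small simulation, in Python only for illustration. `FakeCounter` is a made-up stand-in for a 64-bit counter exposed as two 32-bit halves, and it ticks after every half-read to mimic the hardware advancing between two bus accesses. `read_naive` shows the torn value; `read_consistent` is the common high-low-high retry pattern, shown as one possible workaround and not as what the kernel actually does.

```python
class FakeCounter:
    """Toy 64-bit free-running counter read as two 32-bit halves."""

    MASK32 = 0xFFFFFFFF
    MASK64 = 0xFFFFFFFFFFFFFFFF

    def __init__(self, start, step=1):
        self.value = start
        self.step = step

    def _tick(self):
        # The counter keeps running between the two half-reads.
        self.value = (self.value + self.step) & self.MASK64

    def read_low(self):
        low = self.value & self.MASK32
        self._tick()
        return low

    def read_high(self):
        high = self.value >> 32
        self._tick()
        return high


def read_naive(counter):
    """Two independent 32-bit reads: can tear across a rollover."""
    low = counter.read_low()
    high = counter.read_high()
    return (high << 32) | low


def read_consistent(counter):
    """Re-read the high half until it is stable around the low read."""
    while True:
        high_before = counter.read_high()
        low = counter.read_low()
        high_after = counter.read_high()
        if high_before == high_after:
            return (high_before << 32) | low
```

Starting the toy counter at `0x1FFFFFFFF`, the naive read pairs the old low half with the new high half and lands almost 2^32 ticks in the future. The retry loop notices the high half changed and goes around again.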
Why don't they mention the trinity testing system? by Anonymous Coward · 2015-01-10 10:47 · Score: 0

Nope, it is not Trinity the desktop that is the driver behind this bug hunt (which is what I thought originally, silly me). It is a separate testing system that randomly calls random syscalls in Linux with ... wait for it ... random parameter values.
That the OS runs at all when random garbage is spewed to random syscalls in random sequences is astounding. A totally stupid testing protocol as well, but that is not the subject of the discussion. Just that Linux locks up when handling hundreds of threads throwing garbage at it.
And the only other system to show this bug consistently has the same motherboard. There is another suspect but I didn't find the motherboard in question.
One might wonder how any other OS would survive.
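
For a feel of what that kind of testing looks like, here is a toy sketch in Python. It is not Trinity and calls no real syscalls: `toy_read` and `toy_write` are made-up handlers, and `toy_write` has a planted bug that fires on `0xFED000F0`, the address mentioned in the summary. The fuzzer mixes a few "interesting" values with random 64-bit garbage, treats `ValueError` as a clean rejection, and logs anything worse.

```python
import random

HPET_ADDRESS = 0xFED000F0  # address quoted in the story summary
INTERESTING_VALUES = [0, 1, 0xFFFFFFFF, HPET_ADDRESS]


def toy_read(fd, size):
    """Made-up handler that validates its arguments properly."""
    if fd > 1024 or size > 4096:
        raise ValueError("rejected")
    return 0


def toy_write(addr, value):
    """Made-up handler with a planted bug at one magic address."""
    if addr == HPET_ADDRESS:
        raise RuntimeError("scribbled on a device register")
    return 0


def random_arg(rng):
    # Half the time pick a boundary-like value, otherwise pure garbage.
    if rng.random() < 0.5:
        return rng.choice(INTERESTING_VALUES)
    return rng.getrandbits(64)


def fuzz(handlers, iterations, seed=0):
    """Call random handlers with random arguments; collect real failures."""
    rng = random.Random(seed)
    names = sorted(handlers)
    failures = []
    for _ in range(iterations):
        name = rng.choice(names)
        args = (random_arg(rng), random_arg(rng))
        try:
            handlers[name](*args)
        except ValueError:
            pass  # a clean rejection is the expected outcome
        except RuntimeError as exc:
            failures.append((name, args, str(exc)))
    return failures
```

Dumb as it looks, the recorded `(name, args)` pairs are exactly what you need to replay a failure, and a fixed seed makes a run repeatable.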
Re:plus don't crash on bad hardware. Hotplugged CP by the_B0fh · 2015-01-10 12:24 · Score: 1

Solaris supported hot pluggable CPUs in the last century!
I am really surprised by Anonymous Coward · 2015-01-10 15:55 · Score: 0

That this bug wasn't known as Davey Jones' Lockup...
1. Re:I am really surprised by Hognoxious · 2015-01-11 03:04 · Score: 1
  
  Was it caused by Monkeeing around?
  
  --
  Confucius say, "Find worm in apple - bad. Find half a worm - worse."
Freezes on Mac under Parallels by iMactheKnife · 2015-01-11 07:36 · Score: 1

I had the freeze bug in a VM system on a Mac running Parallels. I downloaded Ubuntu 14.04 from Parallels and could not get around it. Then I downloaded directly from Canonical and it worked just fine. I assumed it was a bad download from Parallels, but perhaps it is more subtle. The virtual machine has the same vulnerabilities - is that a clue?
How to Follow this Bug by 4rest · 2015-01-11 08:28 · Score: 1

I am affected by this bug, but can't seem to find any real place to follow it. I searched https://bugzilla.kernel.org/ but that didn't turn up anything. Anyone know where the source of truth for tracking this issue might be located?
Re:plus don't crash on bad hardware. Hotplugged CP by TechyImmigrant · 2015-01-12 07:07 · Score: 1

Yes. Exactly this. Pulling the latches on the card generates an interrupt. In the systems I designed (for a mainframe RAID disk system in this case), a little green light would light up when it was ready. So pull the latches out, wait for green light, pull the card out. The light generally lit up in a few milliseconds, so you could just rip the card out.
I presume this is how it worked for all products from this (very large, well known) manufacturer, because that's what the spec required.

--
I should use this sig to advertise my book ISBN-13 : 978-1501515132.