AMD Confirms CPU Bug Found By DragonFly BSD's Matt Dillon
An anonymous reader writes "Matt Dillon of DragonFly BSD just announced that AMD confirmed a CPU bug he found. Matt quotes part of the mail exchange and it looks like 'consecutive back-to-back pops and (near) return instructions can create a condition where the processor incorrectly updates the stack pointer.' The specific manifestations in DragonFly were random segmentation faults under heavy load."
Though, it's still very serious. At least it generally causes your program to crash rather than spitting out a wrong answer. And it sounds like the sequence of instructions that causes it is not commonly found.
I can well understand the guy who found it being all excited. The CPU is the last place you'd look for a bug, and finding one is pretty impressive, especially a really elusive one like this.
Need a Python, C++, Unix, Linux develop
I'm wondering if they will. This seems like a very odd timing issue that may be a problem in the electronics. Of course, I suppose they could just put in some microcode to wait after certain operations to make sure things settle and so avoid the hardware bug.
What has Taiwan got to do with this ?
I mean, was the CPU bug somehow introduced by TSMC ?
Muchas Gracias, Señor Edward Snowden !
It matters because it's impressive. It also seems fair to associate some of the positive impression with DragonflyBSD, and I cannot see any downside to throwing good PR at any BSD flavor.
I can only imagine the time and effort spent on tracking down this problem - a rare CPU condition is exponentially more difficult to narrow down than most programming mistakes. A lot of progress in IT depends on engineers like this, who obsessively solve problems even when it's much easier to just ignore them, try to hack around them or pass the buck around. Kudos.
A pertinent addition to the submission would be which CPUs have been found to be affected.
The second link says Opteron 6168 and Phenom II X4 820. For a second I thought that bulldozer hasn't managed to do anything right, but these two examples are pre bulldozer.
No doubt this is not an exhaustive list.
Matt Dillon is a rather famous programmer (as programmers go). I assume that's why they mention him by name. I think a very large percentage of old Amiga hackers know who he is. He's also done work on the Linux kernel. Despite all that, he's best known for his work on FreeBSD and on his DragonflyBSD project. While a lot of old timers will know that, not everyone else will.
FWIW:
The failure has been observed on three different machines, all running AMD cpus. A quad opteron 6168 (48 core) box, and two Phenom II x4 820 boxes.
When information is power, privacy is freedom.
I have to worry about stack smashing bugs here... can there be a way for (say) a data pattern in a media file, or carefully crafted javascript or java code that's been JIT-compiled, to break out of its sandbox? What about a hostile OS kernel running inside a VPS container taking over the hypervisor or bare iron? Hmm.
A floating point precision error. Floating point cannot exactly represent quite a diverse collection of numbers, which is especially problematic when you're doing intersections with small objects. Say a ray projected from an object will, because of the minute errors in floating point, collide with the same object (which produces some cool patterns).
Floating points are kind of crappy. Not that I have a better option with viable performance on a desktop machine. That's not a division bug, that's just the nature of representing numbers in binary with a fixed number of bits.
Matt Dillon, desperate after unsuccessfully chasing Mary in There's Something About Mary, radically changed jobs and started to study computer science...
Slashdot, fix the reply notifications... You won't get away with it...
Pop two off the stack and ret to the calling routine seems fairly common to me. Lots of functions use two arguments and are called with near calls in various programming languages.
That might have been true on 386s.
But currently we're in 2012 and the most widely used instruction set for Linux on AMD processors is x86_64. Because these 64-bit processors feature a large number of registers, the two arguments will be passed in registers, not on the stack. So the sequence of instructions indeed isn't common.
"Sufficiently advanced satire is indistinguishable from reality." - [Tips: 1DrYakQDKCQ6y52z6QbnkxHXAocMZJE61o ]
You try and find something that "the other guy" had a problem with and bring it up as worse so as to try and "protect" the thing you are a fan about? Because I see nothing about the FDIV bug anywhere but your post.
Oh and you know what that bug applied to, right? The Intel Pentium, the ORIGINAL Pentium. Not the Pentium MMX, not the Pentium Pro, not the Pentium II, not the Pentium III, not the Pentium 4, not the Core, not the Core 2, not the Core i, not the second generation Core i. And yes, that's how many major processor versions from Intel there have been since then (with another to launch in the next couple weeks). The original Pentium chips that had this problem came out almost 2 decades ago, 1993.
So seriously, leave off it. I get tired of any time there is a problem with $Product_X fans of it will point out how $Product_Y had a similar or worse error way back in the day and that somehow changes things.
No it doesn't. The story is about the AMD chips, nobody gives a shit about the FDIV bug and I'll wager there are people reading Slashdot who weren't alive when it happened.
The good news for AMD is that processors can often patch around this shit in microcode these days so a recall may not be needed. Have to see, but the potential is there for a software (so to speak) fix.
512bit calculations aren't that expensive
Yes they are.
"His name was James Damore."
Except I very much doubt that would solve whatever "problems" this guy was having. As a newbie programmer, it's entirely understandable that he wouldn't know about the fun you can (or can't) have with floating point operations. However, I very much doubt that sheer accuracy was the issue, rather he was probably making assumptions such as 1.0 - 1.0 == 0.0, when in reality the result isn't necessarily exactly 0.0. Considering it's an MMO, he probably had something like "Why is this guy not dying, he has 4 HP left and this attack does exactly 4 damage? Must be a bug!".
Really, it doesn't matter a huge amount. If such "accuracy" is important to your game, then instead of doing "if (Health is less than 0.0) /* die */", you do something like "if (Health is less than 0.0 + epsilon) /* die */", with "epsilon" being a very small number (such as 0.00000001).
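The epsilon comparison described in this comment can be sketched in C roughly as follows. This is only an illustration: the function names and `HEALTH_EPSILON` are made up for the example, and the tolerance simply reuses the 0.00000001 figure from the comment rather than a value tuned for any real game.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical tolerance, taken from the 0.00000001 suggested in the comment. */
#define HEALTH_EPSILON 1e-8

/* Subtract damage from health; negative damage is treated as a caller bug. */
double apply_damage(double health, double damage)
{
    assert(damage >= 0.0);
    return health - damage;
}

/* Naive check: a health of exactly 0.0, or a tiny positive remainder left
 * behind by rounding, does not count as dead. */
bool is_dead_naive(double health)
{
    return health < 0.0;
}

/* Tolerant check: anything below a small positive epsilon counts as dead. */
bool is_dead(double health)
{
    return health < 0.0 + HEALTH_EPSILON;
}
```

With these definitions, `is_dead(apply_damage(4.0, 4.0))` is true while `is_dead_naive(apply_damage(4.0, 4.0))` is false, which is the "4 HP left, exactly 4 damage" situation from the comment. The same tolerant check also catches a small positive remainder left over by repeated fractional subtractions.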
The real fun with floats, however, is that each platform does something different. It's possible that the OP ran the game on Intel hardware and got one result (which may have seemed more "correct"), then ran it on an AMD machine and got a different (seemingly less-correct) result - you can see why he naturally jumped to the conclusion that the AMD system had a bug.
In reality, chances are both systems were "wrong" anyway, they just happen to use different implementations for floating-point logic. To solve this, once again higher rates of calculations aren't the answer, but rather there's a compiler switch (/fp:strict in VS) that will use the ISO standard floating point model. It's not as fast as the other methods, but you will at least get the same results across different platforms (assuming that CPU has implemented the standard correctly which these days is almost certain).
There's LOTS of fantastic info on this here: http://gafferongames.com/networking-for-game-programmers/floating-point-determinism/
+1 IDisagreeSoHeMustBeATrollOrAnAstroturferOrAShill
I'm pretty sure it was with the introduction of the Pentium (which had the famous FDIV bug) that John Carmack officially made the switch to single precision FP for most things because it was finally fast enough. FP wasn't cheap, per se, but the simplification it brings over keeping track of binary points and precision/range tradeoffs in integerized algorithms should not be underestimated either.
For example, if I want to do a floating point multiply and add, I just say: f3 = f0 * f1 + f2. Before I even start writing a fixed-point multiply and add, I need to ask what the Q points (binary points) are for each of the terms, what Q point you'd like for the result, and what sort of rounding (if any) the result requires for stability. You can end up with a monstrosity like this, assuming all four numbers are at the same Q point:
x3 = (int)(((long long)x0 * x1 + (1LL << (Q - 1))) >> Q) + x2;
Ok, maybe you hide that behind a macro, but what about cases where some of the terms are at different Q points? A fully general macro (which is no fun to write, BTW) would also have a ton of arguments, and only reduce you to something like x3 = FXMULADD(x0, Q0, x1, Q1, x2, Q2, Q3); which won't win you any awards in the clarity department.
And look at the operations themselves, too. You have type promotion, extra adds and shifts... the instruction sequence itself isn't super efficient. It pays off when floating point takes 10s and 100s of cycles, but is a dubious win when most of the core FP starts coming down into the single digits. With the Pentium's dual pipes and the fact you could keep integer instructions flowing in parallel to the float, that's effectively what happened. And notice we haven't even talked about dynamic range and overflow errors and how they screw you up. If you have to add tests for that... yuck. With floating point, you degrade gracefully if your dynamic range spikes a little higher than you expect.
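For readers who haven't written fixed-point code, here is a minimal C sketch of the same-Q-point multiply and add discussed above, with the rounding step spelled out. The helper names (`fx_muladd`, `fx_from_double`, `fx_to_double`) are invented for this illustration, and the code assumes that right-shifting a negative signed value is an arithmetic shift, which is implementation-defined in C but is what mainstream compilers do. It does no overflow checking at all, which is part of the point being made.

```c
#include <assert.h>
#include <stdint.h>

/* Convert a double to fixed point at Q point q, rounding to nearest. */
int32_t fx_from_double(double v, int q)
{
    double scaled = v * (double)(1LL << q);
    return (int32_t)(scaled + (scaled >= 0.0 ? 0.5 : -0.5));
}

/* Convert a fixed-point value at Q point q back to a double. */
double fx_to_double(int32_t x, int q)
{
    return (double)x / (double)(1LL << q);
}

/* Compute x0 * x1 + x2, with all three inputs and the result at Q point q. */
int32_t fx_muladd(int32_t x0, int32_t x1, int32_t x2, int q)
{
    assert(q > 0 && q < 31);
    int64_t prod = (int64_t)x0 * x1;   /* type promotion; product is at Q point 2q */
    prod += 1LL << (q - 1);            /* add half an LSB so the shift rounds */
    return (int32_t)(prod >> q) + x2;  /* shift back to Q point q, then add */
}
```

At Q16, `fx_muladd(fx_from_double(1.5, 16), fx_from_double(2.0, 16), fx_from_double(0.25, 16), 16)` converts back to 3.25. Compare that with the single line `f3 = f0 * f1 + f2` in floating point.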
Anyway, getting back on topic: This isn't the first time an x86 has had a stack-pointer related bug. I remember the 80386s that had the so-called "POPAD bug". That one was a bit easier to hit.
Hopefully, AMD will be able to publish a microcode update or something to work around theirs. That's one thing modern x86s have over their predecessors: A good number of CPU bugs can be patched around with microcode updates. I believe Intel added that with the Pentium Pro, and AMD followed suit. I believe my Phenom is one of the affected parts. I guess I'll have to keep an eye out for such a patch.
Program Intellivision!
AMD has indicated to me that the Bulldozer is not affected, which is a relief.
I guess I should have realized this would get slashdotted. In any case, it took quite a bit of effort to track the bug down. It was very difficult to reproduce reliably. It isn't a show stopper in that it really takes a lot of work to get it to happen and most people will never see it, but it's certainly a significant bug owing to the fact that it can be reproduced with normal instruction sequences.
I began to suspect it might be a cpu bug last year and after exhaustive testing I posted my suspicions in December:
http://leaf.dragonflybsd.org/mailarchive/kernel/2011-12/msg00025.html
Older versions of GCC were more prone to generate the sequence of POPs + RET, coupled with a deep recursion and other stack state, that could result in the bug. It just so happened that DragonFly's buildworld hit the right combination inside gcc, and even then the bug only occurred sometimes and only on a small subset of .c files being compiled (like maybe 2-3 files). The bug never manifested anywhere else, doing anything else, running any other application. Ever.
In particular the bug disappeared with later versions of GCC and disappeared when I messed with the optimizations. We use -O by default, not -O2. The bug disappeared when I produced code with gcc -O2 (using 4.4.7).
It is really unlikely that Linux is affected... the sensitivity to particular code sequences laid out in the compiler is so fine that adding a single instruction virtually anywhere could make the bug disappear. Even just shifting the stack pointer a little bit would make it disappear.
In any case, for a programmer like me being able to find an honest-to-god cpu bug in a modern cpu is very cool :-)
-Matt
What's really amusing is that I've been on the scene for so long if you google my name 'Matthew Dillon', the first entry is actually... me! And not the actor(s). I'm sure that grinds a bit but I do bask in the occasional fan mail reaching my inbox, just before I hit the 'delete' key.
In recent years it's started to flip back and forth, and I expect Hollywood will again take over the top spot after things die down again :-)
-Matt
Intel has had quite a few serious chip bugs too, all in errata. A number of new cpu bugs in both AMD and Intel chips always appear in new generations, but both companies have very large test suites and the number of new bugs goes down in every generation.
Don't forget that Intel had to recall a sandybridge chipset early in the sandybridge cycle, which cost them something like a billion dollars because the related motherboards had to be thrown away and replaced. That was due to internal on-chip circuitry related to a SATA port burning out.
Right at this moment AMD has two issues facing it in order to compete on workstations: (1) Power and (2) Performance. Their initial bulldozer release clearly depends too much on compiler optimizations to make full use of the architecture. They will clearly have to bulk-up some of the simplifications they made that made their cpu cores a little too sensitive to instruction sequences generated by compilers and I hope their next few releases will do better.
On power consumption it comes down to the Fab as much as anything else. Their dependence on the Fab is clearly a problem and they've made a break for it to try to solve it, even though it is costing them dearly. At the same time Intel has made some major advances in their three fabs, to the point where Intel can do their entire production on just two of those three fabs now but they decided to keep the third fab because they think they can 'grow into' it.
So AMD definitely has some work ahead of it, and I am hoping they reserve some of their focus for the high-end and don't concentrate entirely on laptops. I always like to say that I love AMD, but in the stock market I invest in Intel. That's just business. But I got on the AMD bandwagon big-time when they got to 64-bit first and I stuck with them all the way through the Phenom II.
Now, at this moment, Intel's SandyBridge has the best value and AMDs bulldozer is quite far behind, so new purchases for me right now are Intel. That may change in the next year or two and when it does my new purchases will happily be in the AMD camp again. Frankly, AMD only has to get within shouting distance (~8%) of Intel and I will happily use AMD. AMD doesn't have to beat Intel.
I think there are a number of things AMD can do right now to compete better with Intel. One of the biggest is in the mini-server department (albeit clearly with lower volumes than their current focus on laptops & integrated graphics). AMD consumer cpus (aka Phenom II) always had ECC support but very few motherboards actually supported it, which made it difficult to use AMD for mini-servers and avoid the Intel Xeon tax to get ECC. If AMD worked on the mobo vendors to ALWAYS support an ECC option that would allow them to compete against Intel Xeons on price, even if they are unable to compete on performance.
On the opterons AMD clearly has the right idea going with high-core-count cpus, but the memory subsystem is lagging too much to really be able to make use of all those cores. That seems to be low-hanging fruit to me, something which should be readily addressable by AMD. The opterons still have a lot of value and potentially can have a radical improvement in value with Bulldozer, but only if AMD can push the core count and improve the memory subsystem.
On large multi-core boxes AMD also needs to improve CMPXCHG and other atomic instructions in situations where contention is high. Right now multi-chip opteron systems seriously lag Intel on contended latency due to cache coherency inefficiencies. Will Bulldozer fix those latency issues? I don't know.
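To show why contended latency on CMPXCHG matters, here is a small C11 sketch of the kind of compare-and-swap retry loop that compilers typically turn into a LOCK CMPXCHG on x86. The function name `cas_add` is made up for this example, and it is not taken from DragonFly or any other kernel. Every failed attempt means another core changed the value first, so the cache line has to be fetched again before the retry.

```c
#include <assert.h>
#include <stdatomic.h>

/* Add delta to *target using a compare-and-swap loop.
 * Returns the number of failed attempts, a rough measure of contention. */
unsigned cas_add(atomic_long *target, long delta)
{
    assert(target);
    unsigned retries = 0;
    long seen = atomic_load(target);
    /* On failure, 'seen' is refreshed with the current value and we retry. */
    while (!atomic_compare_exchange_weak(target, &seen, seen + delta)) {
        retries++;
    }
    return retries;
}
```

Called from a single thread this normally succeeds on the first attempt. With many cores hammering the same variable, the retry count and the cost of each cache line transfer are what add up to the latency being described.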
AMD only needs to get within shouting distance of Intel for me to buy their chips, and work their mobo producers a bit more to get better overall support for their chip's capabilities. They don't have to beat Intel.
-Matt