AMD Confirms CPU Bug Found By DragonFly BSD's Matt Dillon
An anonymous reader writes "Matt Dillon of DragonFly BSD just announced that AMD confirmed a CPU bug he found. Matt quotes part of the mail exchange and it looks like 'consecutive back-to-back pops and (near) return instructions can create a condition where the processor incorrectly updates the stack pointer.' The specific manifestations in DragonFly were random segmentation faults under heavy load."
Though, it's still very serious. At least it generally causes your program to crash rather than spitting out a wrong answer. And it sounds like the sequence of instructions that causes it is not commonly found.
I can well understand the guy who found it being all excited. The CPU is the last place you'd look for a bug, and finding one is pretty impressive, especially a really elusive one like this.
Need a Python, C++, Unix, Linux develop
I wonder if AMD likes apples.
I assume they'll be able to fix it via a microcode patch. Intel had to learn that the hard way...
Large print giveth, and the small print taketh away
This is cool, but...?
Why does it matter that it's the lead developer of DragonflyBSD?
Find free books.
Nice work tracking that one down. It must have been very frustrating - what we used to call a "ring-tailed b1tch"
Breathe continuously
What has Taiwan got to do with this ?
I mean, was the CPU bug somehow introduced by TSMC ?
Muchas Gracias, Señor Edward Snowden !
After reading the links, and knowing how much time I have spent trying to find out why something doesn't work right, I think I understand why he is so stoked at finding the root of the problem. Good for Matt, maybe they will send you a fixed processor someday.
You were not hitting the division bug, it happens only with certain combinations of numbers and is quite rare.
You probably just had the rounding mode set wrong.
I can only imagine the time and effort spent on tracking down this problem - a rare CPU condition is exponentially more difficult to narrow down than most programming mistakes. A lot of progress in IT depends on engineers like this, who obsessively solve problems even when it's much easier to just ignore them, try to hack around them or pass the buck around. Kudos.
A pertinent addition to the submission would be which CPUs have been found to be affected.
The second link says Opteron 6168 and Phenom II X4 820. For a second I thought that bulldozer hasn't managed to do anything right, but these two examples are pre bulldozer.
No doubt this is not an exhaustive list.
Windows? Does this mean that those users and devs aren't so important as far as total CPU load?
Sent as ripples into the electromagnetic field. No single photon has been harmed in the process.
Anyone know if that "Test case" image is available? I'd like to check if my Phenom II x6 is affected.
Ohh, I'm sure AMD will want you to believe that :-)
-- no sig today
FWIW:
The failure has been observed on three different machines, all running AMD cpus. A quad opteron 6168 (48 core) box, and two Phenom II x4 820 boxes.
When information is power, privacy is freedom.
I have to worry about stack smashing bugs here... can there be a way for (say) a data pattern in a media file, or carefully crafted javascript or java code that's been JIT-compiled, to break out of its sandbox? What about a hostile OS kernel running inside a VPS container taking over the hypervisor or bare iron? Hmm.
I remember division not being exactly accurate, so the solution I needed to use was to round up and down results that are really close.
Just so you know, division is never accurate in floats, even when the CPU doesn't have bugs. If you're using doubles you'll get better accuracy, but with a 32 bit floating point number, you shouldn't be surprised to find errors in the third digit after the decimal point.
Just to be clear, it's not limited to division. Hell, errors can creep in just by converting a decimal number to floating point. This is why calculators use decimal arithmetic, well some of them - like Perpenso Calc for iPhone iPad RPN Scientific Stats Business Hex. Try "0.5 - 0.4 - 0.1" in your favorite calculator app; the result might indicate whether the app is using the FPU or decimal arithmetic. Of course the app may be doing something naive like the "BASIC MMORG", rounding results. It's naive because it is another source of rounding error: some results are legitimately a little bit off from a nicely round number.
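For anyone who wants to try the "0.5 - 0.4 - 0.1" test without a calculator app, here is a minimal Python sketch. It assumes the usual IEEE 754 double-precision floats for the binary case and uses the standard `decimal` module for the decimal case; the function names are made up for illustration.

```python
from decimal import Decimal


def binary_result():
    # Plain Python floats are binary doubles; 0.4 and 0.1 have no exact
    # binary representation, so a small error survives the subtraction.
    return 0.5 - 0.4 - 0.1


def decimal_result():
    # Decimal stores base-10 digits, so these three values are exact.
    return Decimal("0.5") - Decimal("0.4") - Decimal("0.1")


if __name__ == "__main__":
    print("binary floats:", binary_result())    # tiny nonzero value
    print("decimal      :", decimal_result())   # exactly zero
```

A calculator that shows a tiny nonzero number for this input is most likely using binary floating point without rounding the display.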
Matt Dillon, desperate after unsuccessfully chasing Mary in There's Something About Mary, radically changed jobs and started to study computer science...
Slashdot, fix the reply notifications... You won't get away with it...
Pop two off the stack and ret to the calling routine seems fairly common to me. Lots of functions use two arguments and are called with near calls in various programming languages.
That might have been true on 386s.
But currently we're in 2012 and the most widely used instruction set for Linux on AMD processors is x86_64. Because these 64-bit processors feature a large number of registers, the two arguments will be passed in registers, not on the stack. So the sequence of instructions indeed isn't common.
"Sufficiently advanced satire is indistinguishable from reality." - [Tips: 1DrYakQDKCQ6y52z6QbnkxHXAocMZJE61o ]
Linux in 64-bit mode uses registers to pass arguments to functions.
So the most common sequence at the end of a function isn't "bunch of pops then a (near) return", but "move the results into target registers and then return".
Thus the bug sequence doesn't happen that often.
In a way this bug reminds me of a problem I found on a certain software-hardware combination involving a ten gigabit Ethernet card. We were seeing mysteriously corrupted SSL connections when there was no reason for them to be corrupted. This condition occurred roughly every couple dozen gigabytes of transferred data on a connection that carried both MSS-sized and very short data segments. After a couple of weeks of debugging, I was able to conclude that the TCP receive offload engine on the card occasionally injected Ethernet padding as data into the TCP stream, which the SSL library eventually processed as its input! Neither Ethernet nor TCP checksums saved us from this extra data, since it was conceptually generated on a higher layer; SSL caught it, though. Only God knows how many connections this semi-hardware (but at least software-fixable) bug had corrupted in a less obvious manner elsewhere before it was found out...
You try and find something that "the other guy" had a problem with and bring it up as worse so as to try and "protect" the thing you are a fan about? Because I see nothing about the FDIV bug anywhere but your post.
Oh and you know what that bug applied to, right? The Intel Pentium, the ORIGINAL Pentium. Not the Pentium MMX, not the Pentium Pro, not the Pentium II, not the Pentium III, not the Pentium 4, not the Core, not the Core 2, not the Core i, not the second generation Core i. And yes, that's how many major processor versions from Intel there have been since then (with another to launch in the next couple weeks). The original Pentium chips that had this problem came out almost 2 decades ago, 1993.
So seriously, leave off it. I get tired of any time there is a problem with $Product_X fans of it will point out how $Product_Y had a similar or worse error way back in the day and that somehow changes things.
No it doesn't. The story is about the AMD chips, nobody gives a shit about the FDIV bug and I'll wager there are people reading Slashdot who weren't alive when it happened.
The good news for AMD is that processors can often patch around this shit in microcode these days so a recall may not be needed. Have to see, but the potential is there for a software (so to speak) fix.
Presumably AMD will announce affected CPUs fairly soon, after they get done testing. This isn't the kind of thing they would be able to sit on, even if they wanted to. If your CPU has been working for you in general it isn't like it is going to suddenly go and beat up your cat or something, it'll be fine for a bit longer while AMD figures out which ones are all affected and figures out how to fix it.
As I noted in another post, depending on the bug, it may be possible to fix it via microcode. CPUs aren't "pure" hardware these days. They have a bit of software that tells them how to do things and on some of them (Intel CPUs I know for sure) it is field upgradable. So they may find a way to patch out the bug.
Just keep an eye on their page, maybe send them an e-mail saying you'd like a notice when they know. Should be soon I'd imagine.
for a guy like me, this is pretty much an honest-to-god bughunt ;)
...the CPUs suck :-)
Actually how (Free)BSD made use of available hardware resources back then in the 90s was the big reason for me to use it instead of Linux.
I'm pretty sure that TSMC doesn't even fab AMD CPUs. They did GPUs for ATI, and likely still handle some portion of the discrete GPU parts under AMD; but I believe that AMD CPU production is all still in their formerly-in-house-now-spun-off fabs...
Since adding up terms of half the power of the previous number goes up only to 2, 4 cannot be represented.
You have to increase the exponent.
Can we say "WOOOOSH!"
paintball
I'm very interested in this, since the original posting by Matt Dillon hints at the bug being present in all Opteron and Phenom models. The bug seems hard enough to replicate, but still, corrupting the stack is no minor detail.
Then why has it only been seen on DragonFly BSD?
My opinion is that the sequence that leads to the bug is rare. DragonFly BSD happens to have it somewhere (due to sheer luck, because otherwise the sequence is uncommon) and under some stress condition, the bug got triggered.
Under the same stress condition Linux didn't barf, simply because the sequence is so rare that the Linux kernel didn't have it.
Can someone do some screening of the binary image of a few kernels and see how common the pop-pop-nearret sequence is?
(Lazy and don't have the material to do it here now :-P )
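A rough way to do that screening, as a sketch only: the Python below counts byte patterns that look like two consecutive x86-64 POP register instructions followed by a near RET (opcode 0xC3). It assumes one-byte POP opcodes 0x58 to 0x5F, optionally preceded by the 0x41 REX prefix for r8 to r15, and it does no real disassembly, so bytes in data sections or in the middle of other instructions can give false hits. The function name is invented for illustration, and the image would need to be an uncompressed one.

```python
def count_pop_pop_ret(data: bytes) -> int:
    """Count locations that look like POP r64; POP r64; RET (near).

    Naive byte matching, not a disassembler: treat the result as a rough
    upper-bound style estimate, not an exact instruction count.
    """
    POP_LOW, POP_HIGH = 0x58, 0x5F  # POP rax..rdi (one-byte opcodes)
    REX_B = 0x41                    # prefix that selects r8..r15
    RET_NEAR = 0xC3

    def pop_length(i: int) -> int:
        # Length in bytes of a POP encoding starting at offset i, or 0.
        if i < len(data) and POP_LOW <= data[i] <= POP_HIGH:
            return 1
        if (i + 1 < len(data) and data[i] == REX_B
                and POP_LOW <= data[i + 1] <= POP_HIGH):
            return 2
        return 0

    ret_offsets = set()  # dedupe by RET position to avoid double counting
    for i in range(len(data)):
        first = pop_length(i)
        if not first:
            continue
        second = pop_length(i + first)
        if not second:
            continue
        j = i + first + second
        if j < len(data) and data[j] == RET_NEAR:
            ret_offsets.add(j)
    return len(ret_offsets)
```

Usage would be something like `count_pop_pop_ret(open(path, "rb").read())`, where `path` is a placeholder for whatever image is being checked. Even a high count would only show that the pattern exists; it says nothing about whether the other conditions needed to trigger the bug are ever met.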
AMD has indicated to me that the Bulldozer is not affected, which is a relief.
I guess I should have realized this would get slashdotted. In any case, it took quite a bit of effort to track the bug down. It was very difficult to reproduce reliably. It isn't a show stopper in that it really takes a lot of work to get it to happen and most people will never see it, but it's certainly a significant bug owing to the fact that it can be reproduced with normal instruction sequences.
I began to suspect it might be a cpu bug last year and after exhaustive testing I posted my suspicions in December:
http://leaf.dragonflybsd.org/mailarchive/kernel/2011-12/msg00025.html
Older versions of GCC were more prone to generate the sequence of POPs + RET, coupled with a deep recursion and other stack state, that could result in the bug. It just so happened that DragonFly's buildworld hit the right combination inside gcc, and even then the bug only occurred sometimes and only on a small subset of .c files being compiled (like maybe 2-3 files). The bug never manifested anywhere else, doing anything else, running any other application. Ever.
In particular the bug disappeared with later versions of GCC and disappeared when I messed with the optimizations. We use -O by default, not -O2. The bug disappeared when I produced code with gcc -O2 (using 4.4.7).
It is really unlikely that Linux is affected... the sensitivity to particular code sequences laid out in the compiler is so fine that adding a single instruction virtually anywhere could make the bug disappear. Even just shifting the stack pointer a little bit would make it disappear.
In any case, for a programmer like me being able to find an honest-to-god cpu bug in a modern cpu is very cool :-)
-Matt
"there is never any need for self modifying code"
I'm sorry to burst your bubble, but the linux kernel uses self-modifying code to provide the ability to build a kernel that will boot on many machines but can still be tweaked at boot to run faster depending on what the cpu supports.
Actually I believe some of the APUs, such as the Brazos platform, are made by TSMC. Yes, according to this link they inked a deal with TSMC to use their SOI process for Fusion APUs, and since I can't find any links saying this deal was later canceled, I have to assume that is what they did. It makes sense, as TSMC has the experience with ATI GPUs and the whole point of Fusion is wedding the ATI GPU to various AMD CPUs such as Stars (Llano), Bobcat (Brazos) and Bulldozer (FX). The overall performance of those chips depends on how well the GPUs come out, and since they can't really bin the GPUs like they normally do without ending up with a bazillion variations, it makes sense they went with the fab that has more GPU experience.
So does anybody have any REAL info on what's going on? What chips? What sockets? Is it only on server, or are desktop and mobile affected as well? Is it a one in a million bug or an easy to hit bug? Considering CPU bugs can go from the Pentium I "you'll likely never hit it" to the Phenom I quad "Good luck getting above 2.4GHz because the third core (Core 2) is buggy and unstable", there can be a pretty wide variance when it comes to odds. As someone who owns several AMD units and has sold even more, some more info would be nice.
ACs don't waste your time replying, your posts are never seen by me.
Intel has had quite a few serious chip bugs too, all in errata. A number of new cpu bugs always appear in new generations of both AMD and Intel chips, but both companies have very large test suites and the number of new bugs goes down in every generation.
Don't forget that Intel had to recall a sandybridge chipset early in the sandybridge cycle, which cost them something like a billion dollars because the related motherboards had to be thrown away and replaced. That was due to internal on-chip circuitry related to a SATA port burning out.
Right at this moment AMD has two issues facing it in order to compete on workstations: (1) Power and (2) Performance. Their initial bulldozer release clearly depends too much on compiler optimizations to make full use of the architecture. They will clearly have to bulk-up some of the simplifications they made that made their cpu cores a little too sensitive to instruction sequences generated by compilers and I hope their next few releases will do better.
On power consumption it comes down to the Fab as much as anything else. Their dependence on the Fab is clearly a problem and they've made a break for it to try to solve it, even though it is costing them dearly. At the same time Intel has made some major advances in their three fabs, to the point where Intel can do their entire production on just two of those three fabs now but they decided to keep the third fab because they think they can 'grow into' it.
So AMD definitely has some work ahead of it, and I am hoping they reserve some of their focus for the high-end and don't concentrate entirely on laptops. I always like to say that I love AMD, but in the stock market I invest in Intel. That's just business. But I got on the AMD bandwagon big-time when they got to 64-bit first and I stuck with them all the way through the Phenom II.
Now, at this moment, Intel's SandyBridge has the best value and AMD's bulldozer is quite far behind, so new purchases for me right now are Intel. That may change in the next year or two and when it does my new purchases will happily be in the AMD camp again. Frankly, AMD only has to get within shouting distance (~8%) of Intel and I will happily use AMD. AMD doesn't have to beat Intel.
I think there are a number of things AMD can do right now to compete better with Intel. One of the biggest is in the mini-server department (albeit clearly with lower volumes than their current focus on laptops & integrated graphics). AMD consumer cpus (aka Phenom II) always had ECC support but very few motherboards actually supported it, which made it difficult to use AMD for mini-servers and avoid the Intel Xeon tax to get ECC. If AMD worked on the mobo vendors to ALWAYS support an ECC option that would allow them to compete against Intel Xeons on price, even if they are unable to compete on performance.
On the opterons AMD clearly has the right idea going with high-core-count cpus, but the memory subsystem is lagging too much to really be able to make use of all those cores. That seems to be low-hanging fruit to me, something which should be readily addressable by AMD. The opterons still have a lot of value and potentially can have a radical improvement in value with Bulldozer, but only if AMD can push the core count and improve the memory subsystem.
On large multi-core boxes AMD also needs to improve CMPXCHG and other atomic instructions in situations where contention is high. Right now multi-chip opteron systems seriously lag Intel on contended latency due to cache coherency inefficiencies. Will Bulldozer fix those latency issues? I don't know.
AMD only needs to get within shouting distance of Intel for me to buy their chips, and work their mobo producers a bit more to get better overall support for their chip's capabilities. They don't have to beat Intel.
-Matt
I'm getting ready to rebuild my system. Before I buy, can someone tell me *which* AMD "processor families" have this bug?
mark
a hardware issue. When it is developed by someone else, then it is a software issue, at least that is what I tell my boss.
Yeah, Intel. Quality, so long as you don't need to get the right answer from floating point math or locking operations to not eat data. Hardware bugs happen to both vendors, you're a fool if you think Intel are infallible.
* http://en.wikipedia.org/wiki/Pentium_F00F_bug
* http://en.wikipedia.org/wiki/Pentium_FDIV_bug
If you have float scale = 0.1; in your source code, I'd like to know how you get the binary representation to be "exact" while using IEEE 754. We wouldn't need numerical analysis if computers were able to accurately represent real numbers, rather than a subset of the rationals.
Yes, there are rules for dealing with things, but if you're trying to test if two floats are equal, you're probably doing it wrong if you're not including a bit of a fudge factor, because all sorts of rules that apply to real numbers no longer come out exactly right when you're dealing with numbers the computer can actually represent.
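To make the "fudge factor" point concrete, here is a small Python sketch. It assumes IEEE 754 floats, uses `struct` to round a value to 32 bits the way a C `float scale = 0.1;` would be stored, and uses the standard `math.isclose` for the tolerant comparison. The helper name and the tolerance values are illustrative choices, not recommendations.

```python
import math
import struct


def as_float32(x: float) -> float:
    """Round a Python float (a double) to the nearest 32-bit float."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


# A 32-bit float cannot hold 0.1 exactly, so it differs from the double 0.1.
scale = as_float32(0.1)
exact_equal = (scale == 0.1)
close_enough = math.isclose(scale, 0.1, rel_tol=1e-6)

# The same effect in doubles: ten additions of 0.1 do not land exactly on 1.0.
total = sum([0.1] * 10)
total_exact = (total == 1.0)
total_close = math.isclose(total, 1.0)  # default relative tolerance 1e-09

if __name__ == "__main__":
    print(exact_equal, close_enough)
    print(total_exact, total_close)
```

How large the tolerance should be depends on the computation; a value that is fine for one scale of numbers can hide real differences at another.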
The AMD CPU was probably designed in the US and made in Germany.