AMD Confirms CPU Bug Found By DragonFly BSD's Matt Dillon
An anonymous reader writes "Matt Dillon of DragonFly BSD just announced that AMD confirmed a CPU bug he found. Matt quotes part of the mail exchange and it looks like 'consecutive back-to-back pops and (near) return instructions can create a condition where the processor incorrectly updates the stack pointer.' The specific manifestations in DragonFly were random segmentation faults under heavy load."
And who reveals it? Open source. Clearly it is the threat to the integrity of our Taiwanese-made American CPUs!
Though, it's still very serious. At least it generally causes your program to crash rather than spitting out a wrong answer. And it sounds like the sequence of instructions that causes it is not commonly found.
I can well understand the guy who found it being all excited. The CPU is the last place you'd look for a bug, and finding one is pretty impressive, especially a really elusive one like this.
I wonder if AMD likes apples.
I assume they'll be able to fix it via a microcode patch. Intel had to learn that the hard way...
Large print giveth, and the small print taketh away
This is cool, but...?
Why does it matter that it's the lead developer of DragonflyBSD?
Nice work tracking that one down. It must have been very frustrating - what we used to call a "ring-tailed b1tch"
Breathe continuously
After reading the links, and knowing how hard a time I have trying to find out why something doesn't work right, I think I understand why he is so stoked at finding the root of the problem. Good for Matt, maybe they will send you a fixed processor someday.
Buggy buggy buttholes!!!!
UNITE with the Campaign for a Free Internet because today, our future begins with tomorrow!
You were not hitting the division bug, it happens only with certain combinations of numbers and is quite rare.
You probably just had the rounding mode set wrong.
I can only imagine the time and effort spent on tracking down this problem - a rare CPU condition is exponentially more difficult to narrow down than most programming mistakes. A lot of progress in IT depends on engineers like this, who obsessively solve problems even when it's much easier to just ignore them, try to hack around them or pass the buck around. Kudos.
A pertinent addition to the submission would be which CPUs have been found to be affected.
The second link says Opteron 6168 and Phenom II X4 820. For a second I thought that Bulldozer hadn't managed to do anything right, but these two examples are pre-Bulldozer.
No doubt this is not an exhaustive list.
Windows? Does this mean that those users and devs aren't so important as far as total CPU load?
Sent as ripples into the electromagnetic field. No single photon has been harmed in the process.
Anyone know if that "Test case" image is available? I'd like to check if my Phenom II x6 is affected.
FWIW:
The failure has been observed on three different machines, all running AMD cpus. A quad opteron 6168 (48 core) box, and two Phenom II x4 820 boxes.
When information is power, privacy is freedom.
I have to worry about stack smashing bugs here... can there be a way for (say) a data pattern in a media file, or carefully crafted javascript or java code that's been JIT-compiled, to break out of its sandbox? What about a hostile OS kernel running inside a VPS container taking over the hypervisor or bare iron? Hmm.
I remember division not being exactly accurate, so the solution I needed to use was to round up and down results that are really close.
Just so you know, division is never accurate in floats, even when the CPU doesn't have bugs. If you're using doubles you'll get better accuracy, but with a 32 bit floating point number, you shouldn't be surprised to find errors in the third digit after the decimal point.
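To make the 32-bit case concrete, here is a minimal Python sketch (not from the thread) that rounds a value to single precision and back using the standard struct module. The helper name to_float32 and the sample values are illustrative assumptions only.

```python
import struct

def to_float32(x: float) -> float:
    """Round a Python float (a 64-bit double) to the nearest 32-bit float and back."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

# 0.5 is a power of two and survives unchanged; the other two do not.
for value in (0.5, 0.1, 123456.789):
    as_f32 = to_float32(value)
    print(f"{value!r:>12} -> {as_f32!r}  (error {abs(as_f32 - value):.3e})")
```

Single precision carries roughly 7 significant decimal digits, so which decimal place goes wrong depends on how many digits sit in front of the decimal point.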
Just to be clear, it's not limited to division. Hell, errors can creep in just by converting a decimal number to floating point. This is why calculators use decimal arithmetic (well, some of them, like Perpenso Calc for iPhone iPad RPN Scientific Stats Business Hex). Try "0.5 - 0.4 - 0.1" in your favorite calculator app; the result may indicate whether the app is using the FPU or decimal arithmetic. Of course, the app may be doing something naive like the "BASIC MMORG": rounding results. It's naive because it is another source of rounding error; some results are legitimately a little bit off from a nicely round number.
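The "0.5 - 0.4 - 0.1" test is easy to try in Python, whose float is an IEEE 754 double. In this sketch the standard decimal module stands in for a calculator that does decimal arithmetic; the variable names are made up for the example.

```python
from decimal import Decimal

# Binary floating point: 0.4 and 0.1 have no exact binary representation,
# so the result is a tiny nonzero residue instead of zero.
binary_result = 0.5 - 0.4 - 0.1

# Decimal arithmetic: each operand is stored exactly as written.
decimal_result = Decimal("0.5") - Decimal("0.4") - Decimal("0.1")

print("binary :", binary_result)
print("decimal:", decimal_result)
```

Note that the Decimal values are built from strings. Writing Decimal(0.1) would first create the binary float and carry its representation error along.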
Matt Dillon, desperate after unsuccessfully chasing Mary in There's Something About Mary, radically changed jobs and started to study computer science...
Slashdot, fix the reply notifications... You won't get away with it...
Pop two off the stack and ret to the calling routine seems fairly common to me. Lots of functions use two arguments and are called with near calls in various programming languages.
That might have been true on 386s.
But currently we're in 2012 and the most widely used instruction set for Linux on AMD processors is x86_64. Because these 64-bit processors feature a large number of registers, the two arguments will be passed in registers, not on the stack. So the sequence of instructions indeed isn't common.
"Sufficiently advanced satire is indistinguishable from reality." - [Tips: 1DrYakQDKCQ6y52z6QbnkxHXAocMZJE61o ]
Linux in 64-bit mode uses registers to pass arguments to functions.
So the most common sequence at the end of a function isn't "bunch of pops then a (near) return", but "move the results into target registers and then return".
Thus the bug sequence doesn't happen that often.
flurpy slippah
In a way this bug reminds me of a problem I found on a certain software-hardware combination involving a ten gigabit Ethernet card. We were seeing mysteriously corrupted SSL connections when there was no reason for them to be corrupted. This condition occurred roughly every couple dozen gigabytes of transferred data on a connection that had both MSS-sized and very short data segments. After a couple of weeks of debugging, I was able to conclude that the TCP receive offload engine on the card occasionally injected Ethernet padding as data into the TCP stream that the SSL library eventually processed as its input! Neither Ethernet nor TCP checksums saved us from this extra data, since it was conceptually generated on a higher layer; SSL caught it, though. Only God knows how many connections this semi-hardware (but at least software-fixable) bug had corrupted in a less obvious manner elsewhere before it was found out...
You try and find something that "the other guy" had a problem with and bring it up as worse so as to try and "protect" the thing you are a fan about? Because I see nothing about the FDIV bug anywhere but your post.
Oh and you know what that bug applied to, right? The Intel Pentium, the ORIGINAL Pentium. Not the Pentium MMX, not the Pentium Pro, not the Pentium II, not the Pentium III, not the Pentium 4, not the Core, not the Core 2, not the Core i, not the second generation Core i. And yes, that's how many major processor versions from Intel there have been since then (with another to launch in the next couple weeks). The original Pentium chips that had this problem came out almost 2 decades ago, 1993.
So seriously, leave off it. I get tired of any time there is a problem with $Product_X fans of it will point out how $Product_Y had a similar or worse error way back in the day and that somehow changes things.
No it doesn't. The story is about the AMD chips, nobody gives a shit about the FDIV bug and I'll wager there are people reading Slashdot who weren't alive when it happened.
The good news for AMD is that processors can often patch around this shit in microcode these days so a recall may not be needed. Have to see, but the potential is there for a software (so to speak) fix.
Presumably AMD will announce affected CPUs fairly soon, after they get done testing. This isn't the kind of thing they would be able to sit on, even if they wanted to. If your CPU has been working for you in general it isn't like it is going to suddenly go and beat up your cat or something, it'll be fine for a bit longer while AMD figures out which ones are all affected and figures out how to fix it.
As I noted in another post, depending on the details it may be possible to fix it via microcode. CPUs aren't "pure" hardware these days. They have a bit of software that tells them how to do things and on some of them (Intel CPUs I know for sure) it is field upgradable. So they may find a way to patch out the bug.
Just keep an eye on their page, maybe send them an e-mail saying you'd like a notice when they know. Should be soon I'd imagine.
for a guy like me, this is pretty much an honest-to-god bughunt ;)
...the CPUs suck :-)
Actually how (Free)BSD made use of available hardware resources back then in the 90s was the big reason for me to use it instead of Linux.
Since adding up terms of half the power of the previous number goes up only to 2, 4 cannot be represented.
You have to increase the exponent.
Can we say "WOOOOSH!"
paintball
I'm very interested in this, since the original posting by Matt Dillon hints at the bug being present in all Opteron and Phenom models. The bug seems hard enough to replicate, but still, corrupting the stack is no minor detail.
Then why has it only been seen on DragonFly BSD?
My opinion is that the sequence that leads to the bug is rare. DragonFly BSD happens to have it somewhere (due to sheer luck, because otherwise the sequence is uncommon) and under some stress condition, the bug got triggered.
Under the same stress condition Linux didn't barf, simply because the sequence is so rare that the Linux kernel didn't have it.
Can someone do some screening of the binary image of a few kernels and see how common the pop-pop-nearret sequence is?
(Lazy, and don't have the material to do it here now :-P )
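A rough way to do that kind of screening, sketched in Python under some loud assumptions: on x86-64 the one-byte POP opcodes are 0x58 to 0x5F (optionally preceded by the 0x41 REX prefix for r8 to r15) and the near RET is 0xC3. The sketch scans raw bytes without disassembling, so it can also count data or bytes in the middle of other instructions, and it knows nothing about the extra conditions (recursion depth, stack state) that the bug reportedly needs. The name count_pop_pop_ret is hypothetical.

```python
import re

# Two or more POPs (0x58-0x5F, optional 0x41 REX prefix for r8-r15)
# immediately followed by a near RET (0xC3).
POP_POP_RET = re.compile(rb"(?:\x41?[\x58-\x5f]){2,}\xc3")

def count_pop_pop_ret(image: bytes) -> int:
    """Count non-overlapping byte runs that look like pop, pop, ..., ret."""
    return sum(1 for _ in POP_POP_RET.finditer(image))

# Tiny demo on hand-assembled bytes: pop rbx; pop rbp; ret
sample = bytes.fromhex("5b5dc3")
print(count_pop_pop_ret(sample))
```

To screen a real image, read the file in binary mode and pass its contents to the function. Treat the count as an upper-bound style estimate, not a measurement of how often a CPU would actually hit the condition.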
AMD has indicated to me that the Bulldozer is not affected, which is a relief.
I guess I should have realized this would get slashdotted. In any case, it took quite a bit of effort to track the bug down. It was very difficult to reproduce reliably. It isn't a show stopper in that it really takes a lot of work to get it to happen and most people will never see it, but it's certainly a significant bug owing to the fact that it can be reproduced with normal instruction sequences.
I began to suspect it might be a cpu bug last year and after exhaustive testing I posted my suspicions in December:
http://leaf.dragonflybsd.org/mailarchive/kernel/2011-12/msg00025.html
Older versions of GCC were more prone to generate the sequence of POPs + RET, coupled with a deep recursion and other stack state, that could result in the bug. It just so happened that DragonFly's buildworld hit the right combination inside gcc, and even then the bug only occurred sometimes and only on a small subset of .c files being compiled (like maybe 2-3 files). The bug never manifested anywhere else, doing anything else, running any other application. Ever.
In particular the bug disappeared with later versions of GCC and disappeared when I messed with the optimizations. We use -O by default, not -O2. The bug disappeared when I produced code with gcc -O2 (using 4.4.7).
It is really unlikely that Linux is affected... the sensitivity to particular code sequences laid out in the compiler is so fine that adding a single instruction virtually anywhere could make the bug disappear. Even just shifting the stack pointer a little bit would make it disappear.
In any case, for a programmer like me being able to find an honest-to-god cpu bug in a modern cpu is very cool :-)
-Matt
"there is never any need for self modifying code"
I'm sorry to burst your bubble, but the Linux kernel uses self-modifying code to provide the ability to build a kernel that will boot on many machines but can still be tweaked at boot to run faster depending on what the CPU supports.
Use AMD, get burned.
Also, AMD is divesting from that chip fab.
Looks like they're going down!
Just because it CAN be done, doesn't mean it should!
I'm getting ready to rebuild my system. Before I buy, can someone tell me *which* AMD "processor families" have this bug?
mark
a hardware issue. When it is developed by someone else, then it is a software issue, at least that is what I tell my boss.
Whassa madda? CIA got you pushing too many penzilz?
If you have float scale = 0.1; in your source code, I'd like to know how you get the binary representation to be "exact" while using IEEE 754. We wouldn't need numerical analysis if computers were able to accurately represent real numbers, rather than a subset of the rationals.
Yes, there are rules for dealing with things, but if you're trying to test if two floats are equal, you're probably doing it wrong if you're not including a bit of a fudge factor, because all sorts of rules that apply to real numbers no longer come out exactly right when you're dealing with numbers the computer can actually represent.
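A minimal Python sketch of the "fudge factor" idea, using math.isclose from the standard library. The tolerance values and the helper name nearly_equal are assumptions for illustration; the right tolerance depends on the computation.

```python
import math

def nearly_equal(a: float, b: float, rel_tol: float = 1e-9, abs_tol: float = 0.0) -> bool:
    """Compare two floats with a relative and (optional) absolute tolerance."""
    return math.isclose(a, b, rel_tol=rel_tol, abs_tol=abs_tol)

total = 0.1 + 0.2

print(total == 0.3)                           # exact comparison fails
print(nearly_equal(total, 0.3))               # tolerant comparison succeeds
print(nearly_equal(1e-12, 0.0))               # relative tolerance alone is useless near zero
print(nearly_equal(1e-12, 0.0, abs_tol=1e-9)) # an absolute tolerance handles that case
```

The last two lines show why a single fixed fudge factor is not enough: a relative tolerance scales with the operands, so comparisons against zero need an absolute one as well.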