Flawed AMD Chip Can Lead To Data Corruption
Brandonski writes "Apparently AMD allowed some flawed chips to slip through their detection grid. The problem affects only a small number of chips and only single core 2.6 and 2.8 GHz CPUs." From the article: "It is believed that the glitch is triggered when the affected chip's FPU is made to loop through a series of memory-fetch, multiplication and addition operations without any condition checks on the result of the calculations. The loop has to run over and over again for long enough to cause localized heating which together with high ambient temperatures could combine to cause the result of the operation to be recorded incorrectly, leading to data corruption."
The big question is: will someone write malware or a virus to somehow take advantage of this flaw?
I am curious how a virus could possibly exploit this. It would have to a) hog the resources so that it ran nearly exclusively, which would mean the virus already had control, and b) somehow cause a floating point error to result in a privilege error (privilege and security routines rarely use floating point numbers). Also, why would a kernel patch be released for this? It would hurt performance for the rest of us. Customers with defective chips should simply return and replace them.
Philosophy.
> Only a few of the AMD chips, and AMD has only what, 30% of the market.
The intel fanboys have been too noisy lately! AMD has more than 50% of the market since this year already!
AMD has a unique opportunity to do the right thing: offering to replace all the defective chips. If AMD does the right thing, then it will only help AMD in its litigation against Intel and in various attempts to increase marketshare. After all, would you not prefer to buy from a reputable company instead of a dishonest, shifty company?
This is different from the Intel bug; that was a logic flaw, where the chip computed a floating point quantity using an incorrect algorithm. This is an implementation error. In fact, the article mentions that they're going to re-spec the parts and they'll be fine. So if you've got a 2.8 GHz part, and you run this loop at 2.8 GHz (within the old spec), it's like you're "overclocking" (because you're actually outside of AMD's new spec). My guess is that if you over-bought your heatsink and got something better than the stock OEM cooling solution, you would be fine even if you ran this loop all day. Yay, arctic silver!
No CPU can be guaranteed free of corruption; the goal of the designer is just to minimize the likelihood of corruption. The design margins are usually such that proper operation is ensured, except for the statistical outliers. However, even CPUs with several error checking and correcting mechanisms can still corrupt data; it is just extremely unlikely. A CPU can never know for sure if it can compute a result accurately, or if an operation was performed correctly, just like no communications system can achieve bit error rates of 0.
Data corruption in integrated circuits can come from several different sources. Cosmic rays are likely to alter memory values, especially so in DRAM cells. Typically, only ICs for space applications are actually radiation hardened. Much less likely, transistor device noise can corrupt data. Transistor device noise is usually more of an issue in RF circuits. Finally, not all manufacturing defects can be found during manufacturing test, since most test sequences don't even achieve 100% fault coverage under currently used fault models, and this does not even consider how closely the models represent the actual circuit failure modes.
Really, for most people this floating point data corruption is probably a non-issue. It is even more unlikely that errors in floating point data lead to exploits. It is more likely that some bits of your DRAM memory will get corrupted. On my system with ECC RAM that is a few years old, logs show that I get about 1 or 2 (correctable) errors per day...
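For anyone wondering what a "correctable" error means in practice, here is a toy sketch of single-bit error correction using a Hamming(7,4) code. This is only an illustration of the principle: real ECC DIMMs use wider codes over 64-bit words, and the function names here are made up for the example.

```python
def hamming74_encode(data):
    """Encode 4 data bits (list of 0/1) into a 7-bit Hamming codeword."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # parity over codeword positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over codeword positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4  # parity over codeword positions 4, 5, 6, 7
    # Codeword layout, positions 1..7: p1 p2 d1 p3 d2 d3 d4
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_correct(word):
    """Return (corrected_word, position); position 0 means no error was seen."""
    w = list(word)
    s1 = w[0] ^ w[2] ^ w[4] ^ w[6]
    s2 = w[1] ^ w[2] ^ w[5] ^ w[6]
    s3 = w[3] ^ w[4] ^ w[5] ^ w[6]
    position = s1 + 2 * s2 + 4 * s3  # syndrome is the 1-based index of the bad bit
    if position:
        w[position - 1] ^= 1  # flip the bad bit back
    return w, position


def hamming74_data(word):
    """Extract the 4 data bits from a 7-bit codeword."""
    return [word[2], word[4], word[5], word[6]]


if __name__ == "__main__":
    stored = hamming74_encode([1, 0, 1, 1])
    damaged = list(stored)
    damaged[4] ^= 1  # simulate one flipped bit, e.g. from a cosmic ray
    repaired, position = hamming74_correct(damaged)
    print("bit flipped at position", position, "data:", hamming74_data(repaired))
```

A single flipped bit gets located and fixed, which is what shows up in the logs as a correctable error. Two flipped bits in the same word would be miscorrected by this toy code, which is why real memory adds an extra parity bit to at least detect that case.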
I agree with your comments on the current story. In reality, all modern processors have flaws that only occur in extremely unlikely circumstances. This one is not any different.
sheep.horse - does not contain information on sheep or horses.
Jesus. The things that people attribute to AMD's "moral superiority" here on Slashdot... It's astounding.
If AMD does "the right thing" it won't be because of a moral high road. It's because Intel already stepped on a similar PR landmine long ago. Learning from your rival's huge mistakes is not worth high praise. It's just common sense.
-- Mojo Tooth : exploring our world as only an idiot can.
I'll go you one better - I have formed my own personal postulate/theory/law that:
No sufficiently complex system can ever be completely bug-free.
and its corollary:
It is impossible to completely test a sufficiently complex system in every possible way to be certain that it's bug-free.
In that vein, someone once said "Foolproof is impossible because fools are so ingenious", and "As soon as an idiot-proof system is devised, they go and invent a better idiot!"
-- You are in a maze of little, twisty passages, all different... --
There is a HUGE difference between this AMD problem and the FDIV bug. The FDIV bug, once found, was one of those "1,2, BANG" bugs (do step 1, then step 2 and BANG, the bug is there). With this AMD bug, you have to do the same operation many times before you see the problem, and then the problem is random (only if it overheats enough). Another possible solution to this is to use better heat sinks. This AMD problem isn't 1,2,BANG. Bugs that are of this nature are orders of magnitude harder to find and characterize.
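To make the "run it many times and it only sometimes fails" point concrete, here is a rough sketch of how you might look for this kind of intermittent fault: repeat the same multiply-add loop and compare each result to the first one. This is purely an illustration of the logic, not AMD's test. The function names are invented, and a Python loop spends most of its time in the interpreter, so it would not heat the FPU the way a tight native loop would.

```python
import random


def multiply_accumulate(xs, ys):
    """Fetch, multiply, add, with no check on the intermediate results."""
    total = 0.0
    for x, y in zip(xs, ys):
        total += x * y
    return total


def count_mismatches(xs, ys, runs):
    """Repeat the same computation and count results that differ from the first.

    On healthy hardware the same inputs always give the same bits, so any
    mismatch points at corruption. Assumption: the first run is taken as the
    reference, so a fault during that run would be misattributed.
    """
    reference = multiply_accumulate(xs, ys)
    mismatches = 0
    for _ in range(runs):
        if multiply_accumulate(xs, ys) != reference:
            mismatches += 1
    return mismatches


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so every run uses identical inputs
    xs = [rng.random() for _ in range(10000)]
    ys = [rng.random() for _ in range(10000)]
    print("mismatches:", count_mismatches(xs, ys, runs=100))
```

The catch is exactly what the parent describes: a clean result proves nothing, because the fault only shows up once the chip is hot enough, so you would need a long soak at high ambient temperature before drawing any conclusion.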
But you're right, since Intel blundered so badly on their handling of the FDIV bug, everyone else learned from it.
When AMD has a problem, it only affects 3000 or so processors and causes minor corruption when a million-line-long piece of code is called without being stopped at any time. When Intel has a problem it affects millions of processors and crashes your computer when a single 32-bit command is called. I know whom I'll be buying from.
Please, for the good of Humanity, vote Obama.
I was an Intel man for many years. It's like being a Ford or a Chevy man: you know, you ignore all the good things about the competition and smugly goof on all their mistakes while ignoring your favorite's eccentricities.
My wakeup call came as I was looking into building a cheap comp to play UT 2K4 on. I went through the benchmark results to find a good processor for cheap and was appalled at the prices that Intel wanted for middle-of-the-road dreck while AMD had several budget choices that were faster. I finally settled on a Sempron 3100+ and I can't believe how many games I can play with just an Nvidia 6600 LE, and they all rock. I tried out a friend's Dell that was supposed to be high end and it couldn't match my resolution or fps, and he paid 2300 for his Intel boat anchor while I paid exactly 404.17 with shipping for my budget screamer. All this and ethical treatment for customers too? Long Live AMD
I think someone's confusing user error/not enough troubleshooting with an almost non-reproducible issue. TFA mentions a long run of FPU instructions without enough of a pause for the FPU to cool down. This isn't your bug if you're playing WC3. WC3 uses TCP/IP. TCP/IP generates interrupts - lots of interrupts. So many interrupts that your FPU has plenty of time to cool down between calculations. There are many handy ways of troubleshooting this issue of yours, and I'd bet you're not going to identify the problem by some slashdot story submission.
You are apt to be doing this extensively when processing audio or video streams.
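As a concrete case, the inner loop of a plain FIR filter (the bread and butter of audio processing) has exactly that shape: fetch a sample, multiply by a coefficient, add to an accumulator, repeat. A minimal sketch, with a made-up function name and no claim that this is the code pattern AMD identified:

```python
def fir_filter(samples, taps):
    """Direct-form FIR filter: out[n] = sum over k of taps[k] * samples[n - k]."""
    out = []
    for n in range(len(samples)):
        acc = 0.0
        # Only use samples that exist; the loop bound is on the index,
        # never on the computed result.
        for k in range(min(n + 1, len(taps))):
            acc += taps[k] * samples[n - k]  # memory fetch, multiply, add
        out.append(acc)
    return out


if __name__ == "__main__":
    # Two-tap moving average over a short test signal.
    print(fir_filter([1.0, 3.0, 5.0], [0.5, 0.5]))
```

A native implementation running this over minutes of audio, with hundreds of taps per output sample, keeps the FPU busy for a long time without ever branching on a result.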
"Prove all things; hold fast that which is good." [KJV: I Thessalonians 5:21]