Slashdot Mirror


Flawed AMD Chip Can Lead To Data Corruption

Brandonski writes "Apparently AMD allowed some flawed chips to slip through their detection grid. The problem affects only a small number of chips and only single core 2.6 and 2.8 GHz CPUs." From the article: "It is believed that the glitch is triggered when the affected chip's FPU is made to loop through a series of memory-fetch, multiplication and addition operations without any condition checks on the result of the calculations. The loop has to run over and over again for long enough to cause localized heating which together with high ambient temperatures could combine to cause the result of the operation to be recorded incorrectly, leading to data corruption."
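As a rough illustration of the access pattern the article describes, here is a minimal Python sketch of a fetch-multiply-add loop that never inspects its intermediate results. The function names and test data are made up for illustration. Interpreted Python carries far more overhead than a compiled FPU loop, so this shows only the shape of the workload and does not reproduce the fault. Repeating the computation and comparing the results is one simple, assumed way to notice a miscalculation.

```python
def multiply_accumulate(a, b):
    """Fetch a pair of operands, multiply them, add to the running total, repeat."""
    total = 0.0
    for x, y in zip(a, b):
        total += x * y  # no condition check on the intermediate result
    return total


def runs_agree(a, b, repeats=3):
    """Repeat the same computation and report whether every run matched the first."""
    reference = multiply_accumulate(a, b)
    return all(multiply_accumulate(a, b) == reference for _ in range(repeats))


# Hypothetical input data, chosen only to give the loop something to chew on.
a = [0.5 * i for i in range(1000)]
b = [1.0 / (i + 1) for i in range(1000)]
print(runs_agree(a, b))
```

On correctly working hardware the runs are deterministic and should always agree; a mismatch would point to something corrupting the arithmetic or the data.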

13 of 203 comments

  1. Re:Kernel fix? by Umbral+Blot · · Score: 4, Insightful

    The big question is will someone write malware/virus to somehow take advantage of this flaw?

    I am curious how a virus could possibly exploit this. It would have to a) hog the resources so that it ran nearly exclusively, which would mean the virus already had control, and b) somehow cause a floating point error to result in a privilege error (privilege and security routines rarely use floating point numbers). Also, why would a kernel patch be released for this? It would hurt performance for the rest of us; customers with defective chips should simply return and replace them.

  2. Re:Fearmongering? by Saven+Marek · · Score: 2, Insightful

    > Only a few of the AMD chips, and AMD has only what, 30% of the market.

    The intel fanboys have been too noisy lately! AMD has more than 50% of the market since this year already!

  3. Deja Vu: Intel Processor's Bug in 1994 by reporter · · Score: 3, Insightful
    In 1994, Intel's Pentium processor suffered from a division error. Intel handled the problem by initially requiring customers to "prove" that the error caused a serious impact on the customers' lives before Intel would agree to replace the defective chips. Later, after much pressure and lost credibility, Intel agreed to replace all the defective chips without requiring the customer to "prove" his case.

    AMD has a unique opportunity to do the right thing: offering to replace all the defective chips. If AMD does the right thing, then it will only help AMD in its litigation against Intel and in various attempts to increase marketshare. After all, would you not prefer to buy from a reputable company instead of a dishonest, shifty company?

  4. It's like you're overclocking when you're not by IvyMike · · Score: 4, Insightful

    This is different than the Intel bug; that was a logic flaw, where the chip computed a floating point quantity using an incorrect algorithm. This is an implementation error. In fact, the article mentions that they're going to re-spec the parts and they'll be fine. So if you've got a 2.8 GHz part, and you run this loop at 2.8 GHz (within the old spec), it's like you're "overclocking" (because you're actually outside of AMD's new spec). My guess is that if you over-bought your heatsink and got something better than the stock OEM cooling solution, you would be fine even if you ran this loop all day. Yay, arctic silver!

  5. Re:Corruption by leendertv · · Score: 5, Insightful

    No CPU can be guaranteed free of corruption; the goal of the designer is just to minimize the likelihood of corruption. The design margins are usually such that proper operation is ensured, except for the statistical outliers. However, even CPUs with several error checking and correcting mechanisms can still corrupt data; it is just extremely unlikely. A CPU can never know for sure if it can compute a result accurately, or if an operation was performed correctly, just like no communications system can achieve bit error rates of 0.

    Data corruption in integrated circuits can come from several different sources. Cosmic rays are likely to alter memory values, especially so in DRAM cells. Typically, only ICs for space applications are actually radiation hardened. Much less likely, transistor device noise can corrupt data. Transistor device noise is usually more an issue in RF circuits. Finally, not all manufacturing defects can be found during manufacturing test, since most test sequences don't even achieve 100% fault coverage under currently used fault models, and this does not even consider how closely the models represent the actual circuit failure modes.

    Really, for most people this floating point data corruption is probably a non-issue. It is even more unlikely that errors in floating point data lead to exploits. It is more likely that some bits of your DRAM memory will get corrupted. On my system with ECC RAM that is a few years old, logs show that I get about 1 or 2 (correctable) errors per day...
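To make "correctable errors" concrete, here is a toy Python sketch of single-bit error correction using a Hamming(7,4) code. This is only an illustration of the principle: real ECC memory typically uses wider codes over 64-bit words, and the function names here are invented for the example.

```python
def encode(data):
    """Encode 4 data bits into a 7-bit Hamming codeword (parity at positions 1, 2, 4)."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4
    p2 = d1 ^ d3 ^ d4
    p4 = d2 ^ d3 ^ d4
    return [p1, p2, d1, p4, d2, d3, d4]


def decode(code):
    """Return (data bits, error position); position 0 means no error was detected."""
    c = list(code)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s4  # 1-based position of the flipped bit
    if syndrome:
        c[syndrome - 1] ^= 1  # flip the bad bit back
    return [c[2], c[4], c[5], c[6]], syndrome


word = [1, 0, 1, 1]
stored = encode(word)
stored[4] ^= 1  # simulate a single bit flip in memory
recovered, position = decode(stored)
print(recovered, position)  # prints [1, 0, 1, 1] 5
```

Any single flipped bit in the 7-bit codeword is located by the syndrome and repaired; two simultaneous flips would be miscorrected by this simple code, which is why practical schemes add an extra parity bit to detect double errors.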

  6. Re:An old problem by AndrewStephens · · Score: 2, Insightful

    I agree with your comments on the current story. In reality, all modern processors have flaws that only occur in extremely unlikely circumstances. This one is not any different.

    --
    sheep.horse - does not contain information on sheep or horses.
  7. Re:Deja Vu: Intel Processor's Bug in 1994 by mojotooth · · Score: 1, Insightful

    Jesus. The things that people attribute to AMD's "moral superiority" here on Slashdot... It's astounding.

    If AMD does "the right thing" it won't be because of a moral high road. It's because Intel already stepped on a similar PR landmine long ago. Learning from your rival's huge mistakes is not worth high praise. It's just common sense.

    --
    -- Mojo Tooth : exploring our world as only an idiot can.
  8. Re:An old problem by Mister+Transistor · · Score: 4, Insightful

    I'll go you one better - I have formed my own personal postulate/theory/law that:

    No sufficiently complex system can ever be completely bug-free.

    and its corollary:

    It is impossible to completely test a sufficiently complex system in every possible way to be certain that it's bug-free.

    In that vein, someone once said "Foolproof is impossible because fools are so ingenious", and "As soon as an idiot-proof system is devised, they go and invent a better idiot!"

    --
    -- You are in a maze of little, twisty passages, all different... --
  9. Re:An old problem by AWhistler · · Score: 3, Insightful

    There is a HUGE difference between this AMD problem and the FDIV bug. The FDIV bug, once found, was one of those "1,2, BANG" bugs (do step 1, then step 2 and BANG, the bug is there). With this AMD bug, you have to do the same operation many times before you see the problem, and then the problem is random (only if it overheats enough). Another possible solution to this is to use better heat sinks. This AMD problem isn't 1,2,BANG. Bugs that are of this nature are orders of magnitude harder to find and characterize.

    But you're right, since Intel blundered so badly on their handling of the FDIV bug, everyone else learned from it.

  10. Re:Haha by WilliamSChips · · Score: 2, Insightful

    When AMD has a problem, it only affects 3000 or so processors and causes minor corruption when a million-line-long piece of code is called without being stopped at any time. When Intel has a problem it affects millions of processors and crashes your computer when a single 32-bit command is called. I know whom I'll be buying from.

    --
    Please, for the good of Humanity, vote Obama.
  11. This is why I switched by Anonymous Coward · · Score: 1, Insightful

    I was an Intel man for many years. It's like being a Ford or a Chevy man: you know, you ignore all good things about the competition and smugly goof on all their mistakes while ignoring your favorite's eccentricities. My wakeup call came as I was looking into building a cheap comp to play UT 2K4 on. I went through the benchmark results to find a good processor for cheap and was appalled at the prices that Intel wanted for middle of the road dreck while AMD had several budget choices that were faster. I finally settled on a Sempron 3100+ and I can't believe how many games I can play with just an Nvidia 6600LE, and they all rock. I tried out a friend's Dell that was supposed to be high end and it couldn't match my resolution or fps, and he paid 2300 for his Intel boat anchor while I paid exactly 404.17 with shipping for my budget screamer. All this and ethical treatment for customers too? Long Live AMD

  12. Re:Humanly reproducable :) by fimbulvetr · · Score: 2, Insightful

    I think someone's confusing user error/not enough troubleshooting with an almost non-reproducible issue. TFA mentions a long run of FPU instructions without enough of a pause for the chip to cool down. This isn't your bug if you're playing WC3. WC3 uses TCP/IP. TCP/IP generates interrupts - lots of interrupts. So many interrupts that your FPU has plenty of time to cool down between calculations. There are many handy ways of troubleshooting this issue of yours, and I'd bet you're not going to identify the problem by some slashdot story submission.

  13. This flaw seems damned serious to me... by anubi · · Score: 2, Insightful
    ... because the multiply-add is the basic building block of digital signal processing.

    You are apt to be doing this extensively when processing audio or video streams.
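For readers unfamiliar with the point above, here is a minimal Python sketch of a direct-form FIR filter, a common signal-processing operation whose inner loop is nothing but multiply-adds. The function name and sample values are illustrative assumptions, not taken from any particular audio or video codec.

```python
def fir_filter(samples, taps):
    """Direct-form FIR filter: each output is a sum of tap * past-sample products."""
    out = []
    for n in range(len(samples)):
        acc = 0.0
        for k, tap in enumerate(taps):
            if n - k >= 0:  # bounds check only; the result itself is never inspected
                acc += tap * samples[n - k]  # the multiply-add step
        out.append(acc)
    return out


# A two-tap moving average over a short hypothetical signal.
print(fir_filter([1.0, 2.0, 3.0, 4.0], [0.5, 0.5]))  # prints [0.5, 1.5, 2.5, 3.5]
```

A real-time audio or video pipeline would run a loop like this continuously over long streams, which is the kind of sustained multiply-add workload the comment is worried about.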

    --
    "Prove all things; hold fast that which is good." [KJV: I Thessalonians 5:21]