Slashdot Mirror


Flawed AMD Chip Can Lead To Data Corruption

Brandonski writes "Apparently AMD allowed some flawed chips to slip through their detection grid. The problem affects only a small number of chips and only single core 2.6 and 2.8 GHz CPUs." From the article: "It is believed that the glitch is triggered when the affected chip's FPU is made to loop through a series of memory-fetch, multiplication and addition operations without any condition checks on the result of the calculations. The loop has to run over and over again for long enough to cause localized heating which together with high ambient temperatures could combine to cause the result of the operation to be recorded incorrectly, leading to data corruption."

24 of 203 comments

  1. I Have an AMD CPU by ozmanjusri · · Score: 5, Funny

    Hey, I have an AMD 2.8Ghz. Maybe I should stop refresðN9'óI]öR9ù¥Î6ýPoe}+èa(ê{

    --
    "I've got more toys than Teruhisa Kitahara."
    1. Re:I Have an AMD CPU by zaguar · · Score: 5, Funny
      ðN9'óI]öR9ù¥Î6ýPoe}+èa(ê{

      Interesting Perl script.

      --
      "Sure there's porn and piracy on the Web but there's probably a downside too."
    2. Re:I Have an AMD CPU by Minwee · · Score: 4, Funny
      ðN9'óI]öR9ù¥Î6ýPoe}+èa(ê{

      Interesting Perl script.

      It's also rule number 26 in sendmail.cf.

  2. An old problem by AndrewStephens · · Score: 4, Informative
    Something similar used to happen on very old processors, back in the day. If certain instructions were executed in tight loops, the chips would experience localised heating and eventually malfunction (sometimes with permanent damage).

    I'm too young to remember the details (I think it goes back to the early eighties at least), but perhaps some of the elder gods that lurk around here might be able to supply more details.

    --
    sheep.horse - does not contain information on sheep or horses.
    1. Re:An old problem by Alien+Being · · Score: 3, Funny

      I used to burn out a lot of abacus beads.

    2. Re:An old problem by Jerf · · Score: 4, Funny

      Do not meddle in the affairs of the Elder Gods, for you are crunchy, and good with ketchup.

    3. Re:An old problem by Mister+Transistor · · Score: 5, Informative

      You may be referring to the early MC6800 8-bit processors. The first ones had a major problem in that the internal registers were dynamic RAM style memory, and synchronized to the internal state clock. If you halted the processor for an extended period of time, the refresh clock to them ceased and the registers got hot, drew too much current and burned up!

      I'm pretty sure that gave rise to the joke "Halt and Catch Fire"...

      I always figured that if you were to burn out a register from overuse, it would be the carry bit ;)

      Anyway, as to the story at hand, it sounds like this would only ever occur a) to only 3000 processors total - MAYBE, and b) would only ever happen under such an artificially contrived laboratory stress-test/benchmark situation. Any CPU running in a real system would a) have to do other things like service hardware interrupts, and b) wouldn't do something useless like perform a looping calculation without checking to see if it was done periodically. It really sounds like this is a big non-issue in reality.

      --
      -- You are in a maze of little, twisty passages, all different... --
    4. Re:An old problem by Mister+Transistor · · Score: 4, Insightful

      I'll go you one better - I have formed my own personal postulate/theory/law that:

      No sufficiently complex system can ever be completely bug-free.

      and its corollary:

      It is impossible to completely test a sufficiently complex system in every possible way to be certain that it's bug-free.

      In that vein, someone once said "Foolproof is impossible because fools are so ingenious", and "As soon as an idiot-proof system is devised, they go and invent a better idiot!"

      --
      -- You are in a maze of little, twisty passages, all different... --
    5. Re:An old problem by something_wicked_thi · · Score: 5, Informative

      RTFA. They are offering a free replacement. However, the FDIV bug was overblown. For most people, it didn't matter (few people were using software that required division precise enough to be affected). This bug is even less worrisome. Its effect is, at the moment, completely unobserved in the wild using real world applications. The FDIV bug was apparent to anyone with a calculator.

      I'm not saying AMD should be let off the hook completely, but the bug isn't a big problem, they are offering free replacements, and they publicized it. The FDIV bug was bigger (though still hardly catastrophic), Intel refused (at first) to offer replacements, and they sat on it. The two scenarios are nowhere near similar. Maybe AMD just has more character than Intel, or maybe they were watching in 94/95 when the FDIV bug happened and they've actually learned from Intel's mistakes. Regardless, this whole story is more of a heads-up to concerned buyers than a criticism of AMD.
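For anyone who never saw the "calculator" test for FDIV: the check that circulated at the time was a single expression, x - (x / y) * y with x = 4195835 and y = 3145727. A correct FPU gives a result that is zero or within rounding error of it; flawed Pentiums were widely reported to give 256. Below is a minimal sketch of that check in Python. The function names and the tolerance are my own choices for illustration, not taken from any Intel or AMD tool.

```python
# Classic Pentium FDIV sanity check (illustrative sketch; names are hypothetical).
X = 4195835.0
Y = 3145727.0


def fdiv_residual(x: float = X, y: float = Y) -> float:
    """Return x - (x / y) * y, which should be zero up to rounding error."""
    return x - (x / y) * y


def looks_like_fdiv_bug(tolerance: float = 1e-6) -> bool:
    """True if the residual is far larger than ordinary rounding error."""
    return abs(fdiv_residual()) > tolerance


if __name__ == "__main__":
    print("residual:", fdiv_residual())
    print("FDIV bug suspected:", looks_like_fdiv_bug())
```

That is the "1,2,BANG" quality in a nutshell: one division, one comparison, same answer every time. Nothing comparable exists for a heat-dependent fault.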

    6. Re:An old problem by AWhistler · · Score: 3, Insightful

      There is a HUGE difference between this AMD problem and the FDIV bug. The FDIV bug, once found, was one of those "1,2, BANG" bugs (do step 1, then step 2 and BANG, the bug is there). With this AMD bug, you have to do the same operation many times before you see the problem, and then the problem is random (only if it overheats enough). Another possible solution to this is to use better heat sinks. This AMD problem isn't 1,2,BANG. Bugs that are of this nature are orders of magnitude harder to find and characterize.

      But you're right, since Intel blundered so badly on their handling of the FDIV bug, everyone else learned from it.

  3. Re:Kernel fix? by Umbral+Blot · · Score: 4, Insightful

    The big question is will someone write malware/virus to somehow take advantage of this flaw?

    I am curious how a virus could possibly exploit this. It would have to a) hog the resources so that it ran nearly exclusively, which would mean the virus already had control, and b) somehow cause a floating point error to result in a privilege error (privilege and security routines rarely use floating point numbers). Also, why would a kernel patch be released for this? It would hurt performance for the rest of us; customers with defective chips should simply return and replace them.

  4. Uh oh.. by BigZaphod · · Score: 5, Funny

    Wow! AMD has invented a way to crash an infinite loop! Awesome! Intel? I bet their solution will take twice as long to crash this loop:

    10 PRINT "HELLO WORLD"
    20 GOTO 10

    AMD is always innovating.

  5. Deja Vu: Intel Processor's Bug in 1994 by reporter · · Score: 3, Insightful
    In 1994, Intel's Pentium processor suffered from a division error. Intel handled the problem by initially requiring customers to "prove" that the error caused a serious impact on the customers' lives before Intel would agree to replace the defective chips. Later, after much pressure and lost credibility, Intel agreed to replace all the defective chips without requiring the customer to "prove" his case.

    AMD has a unique opportunity to do the right thing: offering to replace all the defective chips. If AMD does the right thing, then it will only help AMD in its litigation against Intel and in various attempts to increase marketshare. After all, would you not prefer to buy from a reputable company instead of a dishonest, shifty company?

  6. Re:Deja Vu: Intel Processor's Bug in 1994 by Anonymous Coward · · Score: 4, Informative

    "The company is also working with OEMs to identify affected parts and contact customers who could be affected - if they are, they will be offered free replacements."

    Fourth paragraph in TFA.

  7. nice! by B3ryllium · · Score: 3, Interesting

    Wow, that was fast. FreeBSD already has a patch for this.

    Judging from the posting date, I *really* need to be updating my sources more often. :)

    20060419: p7 FreeBSD-SA-06:14.fpu
                    Correct a local information leakage bug affecting AMD FPUs.

    (could be an unrelated correction, I guess, it doesn't provide much more information in /usr/src/UPDATING)

    1. Re:nice! by larry+bagina · · Score: 3, Informative
      it is an unrelated correction:

      ...As a result of this discrepancy remaining unnoticed until now, the FreeBSD kernel does not restore the contents of the FOP, FIP, and FDP registers between context switches.

      source

      --
      Do you even lift?

      These aren't the 'roids you're looking for.

  8. Re:Kernel fix? by larry+bagina · · Score: 5, Informative
    I'm sure someone will have a kernel patch to prevent this from happening in linux in very short order.

    Not likely. This is valid user code that is being executed. On other CPUs, the same code wouldn't cause a problem. Something like the F00F bug is fixable in the kernel by mucking with exception handler. This is pure user-land code.

    --
    Do you even lift?

    These aren't the 'roids you're looking for.

  9. It's like you're overclocking when you're not by IvyMike · · Score: 4, Insightful

    This is different than the Intel bug; that was a logic flaw, where the chip computed a floating point quantity using an incorrect algorithm. This is an implementation error. In fact, the article mentions that they're going to re-spec the parts and they'll be fine. So if you've got a 2.8Ghz part, and you run this loop at 2.8Ghz (within the old spec), it's like you're "overclocking" (because you're actually outside of AMD's new spec). My guess is that if you over-bought your heatsink and got something better than the stock OEM cooling solution, you would be fine even if you ran this loop all day. Yay, arctic silver!

  10. Re:Corruption by leendertv · · Score: 5, Insightful

    No CPU can guarantee to be free of corruption, the goal of the designer is just to minimize the likelihood of corruption. The design margins are usually such that proper operation is ensured, except for the statistical outliers. However, even CPUs with several error checking and correcting mechanisms can still corrupt data, it is just extremely unlikely. A CPU can never know for sure if it can compute a result accurately, or if an operation was performed correctly, just like no communications system can achieve bit error rates of 0.

    Data corruption in integrated circuits can come from several different sources. Cosmic rays are likely to alter memory values, especially so in DRAM cells. Typically, only ICs for space applications are actually radiation hardened. Much less likely, transistor device noise can corrupt data. Transistor device noise is usually more an issue in RF circuits. Finally, not all manufacturing defects can be found during manufacturing test, since most test sequences don't even achieve 100% fault coverage under currently used fault models, and this does not even consider how closely the models represent the actual circuit failure modes.

    Really, for most people this floating point data corruption is probably a non-issue. It is even more unlikely that errors in floating point data lead to exploits. It is more likely that some bits of your DRAM memory will get corrupted. On my system with ECC RAM that is a few years old, logs show that I get about 1 or 2 (correctable) errors per day...
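To make "correctable error" concrete, here is a toy single-error-correcting code, Hamming(7,4), sketched in Python. This is an illustration of the principle only: real ECC memory typically uses wider codes that also detect double-bit errors, and the function names here are made up for the example.

```python
# Toy Hamming(7,4) code: 4 data bits, 3 parity bits, corrects any single bit flip.
# Codeword positions are numbered 1..7; parity bits sit at positions 1, 2 and 4.


def hamming74_encode(data_bits):
    """Encode 4 data bits (list of 0/1) as a 7-bit codeword."""
    d1, d2, d3, d4 = data_bits
    p1 = d1 ^ d2 ^ d4  # covers positions 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # covers positions 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # covers positions 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def hamming74_decode(codeword):
    """Return (data_bits, error_position); error_position is 0 if none was seen."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]  # parity check over positions 1, 3, 5, 7
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]  # parity check over positions 2, 3, 6, 7
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]  # parity check over positions 4, 5, 6, 7
    syndrome = s1 + 2 * s2 + 4 * s4  # equals the 1-based position of a single flip
    if syndrome:
        c[syndrome - 1] ^= 1  # flip the offending bit back
    return [c[2], c[4], c[5], c[6]], syndrome


if __name__ == "__main__":
    word = hamming74_encode([1, 0, 1, 1])
    word[4] ^= 1  # simulate a single-bit upset at position 5
    print(hamming74_decode(word))
```

Note the limit: flip two bits in the same word and this toy code "corrects" to the wrong data without complaint, which is the point about corruption being made unlikely rather than impossible.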

  11. CALL ESP by Myria · · Score: 3, Interesting

    Probably the easiest erratum to come by is the instruction "CALL ESP" (or "CALL RSP"). On AMD CPUs, "CALL ESP" will jump to the address in ESP, *then* push the return address. However, on Intel CPUs, it will push the return address first, then jump to the value it just pushed. This is, of course, disastrous if you try to use it.

    According to Intel errata documents, this is a bug in the Pentium Pro that has been kept for several generations. The Pentium and below, except the 8086 and 8088, worked correctly with this instruction.

    If you want to differentiate Intel and AMD in your program and don't want to use CPUID, you can set up a test with CALL ESP.

    Melissa

    --
    "Screw Sun, cross-platform will never work. Let's move on and steal the Java language." - Visual J++ Product Manager
  12. Quality Control at AMD must be good. by MROD · · Score: 5, Interesting

    Having read a lot about this flaw it's actually amazing that AMD's quality control found the problem in the first place.

    The actions needed to cause the problem to arise are so extreme that they'd never happen in the field. i.e. Loop through tight floating-point only instructions without any comparisons for maybe hours before the error occurs.

    This would *NEVER* happen in the field. Firstly, in any modern OS the process would have been pre-empted long before any problem could occur (causing other instructions to run and hence stopping the overheating). Secondly, no real-world program would ever do this sort of thing as there would always be a comparison in the loop within the timeframe.

    This is a theoretical problem only in the real world, especially as it only affects about 3000 processors in total (it has been quoted). This is why AMD gave it such a low priority. We should just forget about it and move on.

    --

    Agrajag: "Oh no, not again!"
  13. Re:Sounds familiar by smash · · Score: 4, Informative
    Hmm.... I doubt you'd need a few million cells though.

    Some of the tendering spreadsheets i've seen for a few companies i've worked for have had quite a lot of calculation going on in them - change a few cells that others depend on that have others depending on them, etc.... do that all day, it adds up quick.

    You only need 1 of those operations in that instance to screw up and you could be down a few million dollars, if it's not picked up.

    Even forgetting that it's just the moral thing to do...Risk vs replacement cost = no brainer. If only 3000 cpus are affected at say $300 each for amd to sell retail (i'm sure their cost is FAR less), they'd be mad not to just do it (maybe even offer a free speed bump) and reap the positive PR.

    All it needs is for ONE company to blame a budget blowout on them and it's well and truly paid for...

    smash.

    --
    I run: Windows, OS X, Linux, FreeBSD. Just because you have a hammer, doesn't mean everything is a nail.
  14. This will not happen to you by Bloater · · Score: 4, Informative

    If you have any interrupts coming in, or your loop has a termination condition. I think you have to have your hardware set to send an interrupt many hours in the future then start an otherwise nonterminating loop.

    So under normal conditions on normal PC hardware, this simply won't happen.

  15. Surprising. AMD uses my `cpuburn` by redelm · · Score: 4, Informative
    About 7 years ago, I wrote a suite of open-source CPU stress-tests I called `cpuburn`. Little optimized assembler pgms designed to stress different parts of the CPU. `burnK7` does precisely this FPU dot product.

    Of course, I expect AMD's production testing dept to have far better code, since they will devote more job hours to it and know proprietary chip details. Still, different parts of AMD have emailed me several times to thank me because they found the pgms useful. Great.

    But these guys know what they're doing. Heat transfer from the hot multipliers has to be carefully analysed [3D finite element heat transfer analysis]. I suspect something far more mundane, like someone reducing die or slug thickness, or a mfg problem with the die/slug gap or thermal goop.
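For readers wondering what a dot-product consistency test looks like in outline, here is a rough sketch in Python. It is not `cpuburn` code and not AMD's test: the function names, vector length and iteration count are all assumptions for illustration, and interpreted Python will not load the FPU anywhere near as hard as hand-tuned assembler. It only shows the shape of the idea: repeat the same fetch-multiply-add sequence and flag any run whose result differs from the first.

```python
import random


def dot(a, b):
    """Plain multiply-accumulate over two equal-length lists of floats."""
    total = 0.0
    for x, y in zip(a, b):
        total += x * y
    return total


def count_mismatches(iterations=1000, length=256, seed=1):
    """Repeat the same dot product and count results that differ from the first.

    The operations run in the same order every time, so on healthy hardware
    every result is bit-identical and the count is 0.
    """
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    a = [rng.random() for _ in range(length)]
    b = [rng.random() for _ in range(length)]
    reference = dot(a, b)
    mismatches = 0
    for _ in range(iterations):
        if dot(a, b) != reference:
            mismatches += 1
    return mismatches


if __name__ == "__main__":
    print("mismatches:", count_mismatches())
```

The comparison after each pass is what makes this a checker. By the article's description, the failing case is a loop that runs with no condition checks on the results at all, so a test like this can only catch the damage after the fact.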