Flawed AMD Chip Can Lead To Data Corruption

Posted by ryuzaki0 on Friday April 28, 2006 @05:39PM from the crunchy-mistakes dept.

Brandonski writes "Apparently AMD allowed some flawed chips to slip through their detection grid. The problem affects only a small number of chips and only single core 2.6 and 2.8 GHz CPUs." From the article: "It is believed that the glitch is triggered when the affected chip's FPU is made to loop through a series of memory-fetch, multiplication and addition operations without any condition checks on the result of the calculations. The loop has to run over and over again for long enough to cause localized heating which together with high ambient temperatures could combine to cause the result of the operation to be recorded incorrectly, leading to data corruption."
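
For readers wondering what that kind of loop looks like, here is a minimal Python sketch of the pattern the article describes: fetch two operands from memory, multiply, add to an accumulator, and repeat, with no branch that depends on the computed value. The function names and the "run it twice and compare" check are illustrative assumptions, not AMD's test or the actual trigger sequence, which the summary does not give. An interpreted loop like this does far too much other work to stress an FPU the way hand-tuned native code would, so it only shows the shape of the workload.

```python
def multiply_accumulate(a, b, passes):
    """Repeat a fetch-multiply-add loop over two equal-length lists.

    The only branches are the loop counters; nothing ever tests the
    accumulated result, which is the pattern the article describes.
    """
    acc = 0.0
    for _ in range(passes):
        for i in range(len(a)):
            acc += a[i] * b[i]  # memory fetch, multiply, add
    return acc


def results_disagree(a, b, passes):
    """Hypothetical check: run the same loop twice and compare.

    On a healthy CPU both runs perform identical operations in the
    same order, so the two results should be bit-identical.
    """
    return multiply_accumulate(a, b, passes) != multiply_accumulate(a, b, passes)
```

Calling `results_disagree([1.0, 2.0, 3.0], [4.0, 5.0, 6.0], 1000)` should return `False` on a working machine; a `True` would mean the same arithmetic produced two different answers.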

8 of 203 comments

I Have an AMD CPU by ozmanjusri · 2006-04-28 17:50 · Score: 5, Funny

Hey, I have an AMD 2.8Ghz. Maybe I should stop refresðN9'óI]öR9ù¥Î6ýPoe}+èa(ê{

--
"I've got more toys than Teruhisa Kitahara."
1. Re:I Have an AMD CPU by zaguar · 2006-04-28 18:52 · Score: 5, Funny
  
  ðN9'óI]öR9ù¥Î6ýPoe}+èa(ê{
  Interesting Perl script.
  
  --
  "Sure there's porn and piracy on the Web but there's probably a downside too."
Uh oh.. by BigZaphod · 2006-04-28 18:11 · Score: 5, Funny

Wow! AMD has invented a way to crash an infinite loop! Awesome! Intel? I bet their solution will take twice as long to crash this loop:

10 PRINT "HELLO WORLD"
20 GOTO 10

AMD is always innovating.

--
Hexy - a strategy game for iPhone/iPod Touch
Re:Kernel fix? by larry bagina · 2006-04-28 18:39 · Score: 5, Informative

I'm sure someone will have a kernel patch to prevent this from happening in linux in very short order.
Not likely. This is valid user code that is being executed. On other CPUs, the same code wouldn't cause a problem. Something like the F00F bug is fixable in the kernel by mucking with the exception handler. This is pure user-land code.

--
Do you even lift?
These aren't the 'roids you're looking for.
Re:An old problem by Mister Transistor · 2006-04-28 18:41 · Score: 5, Informative

You may be referring to the early MC6800 8-bit processors. The first ones had a major problem in that the internal registers were dynamic RAM style memory, and synchronized to the internal state clock. If you halted the processor for an extended period of time, the refresh clock to them ceased and the registers got hot, drew too much current and burned up!

I'm pretty sure that gave rise to the joke "Halt and Catch Fire"...

I always figured that if you were to burn out a register from overuse, it would be the carry bit ;)

Anyway, as to the story at hand, it sounds like this would only ever occur a) to only 3000 processors total - MAYBE, and b) would only ever happen under such an artificially contrived laboratory stress-test/benchmark situation. Any CPU running in a real system would a) have to do other things like service hardware interrupts, and b) wouldn't do something useless like perform a looping calculation without checking to see if it was done periodically. It really sounds like this is a big non-issue in reality.

--
-- You are in a maze of little, twisty passages, all different... --
Re:Corruption by leendertv · 2006-04-28 18:51 · Score: 5, Insightful

No CPU can be guaranteed free of corruption; the goal of the designer is just to minimize the likelihood of corruption. The design margins are usually such that proper operation is ensured, except for the statistical outliers. However, even CPUs with several error checking and correcting mechanisms can still corrupt data; it is just extremely unlikely. A CPU can never know for sure if it can compute a result accurately, or if an operation was performed correctly, just like no communications system can achieve bit error rates of 0.

Data corruption in integrated circuits can come from several different sources. Cosmic rays are likely to alter memory values, especially so in DRAM cells. Typically, only ICs for space applications are actually radiation hardened. Much less likely, transistor device noise can corrupt data. Transistor device noise is usually more an issue in RF circuits. Finally, not all manufacturing defects can be found during manufacturing test, since most test sequences don't even achieve 100% fault coverage under currently used fault models, and this does not even consider how closely the models represent the actual circuit failure modes.

Really, for most people this floating point data corruption is probably a non-issue. It is even more unlikely that errors in floating point data lead to exploits. It is more likely that some bits of your DRAM memory will get corrupted. On my system with ECC RAM that is a few years old, logs show that I get about 1 or 2 (correctable) errors per day...
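
To make "correctable" concrete, here is a toy Python sketch of single-bit error correction using a Hamming(7,4) code. This is an illustrative stand-in, not what ECC memory actually implements: real ECC DIMMs typically use wider single-error-correct, double-error-detect codes over 64-bit words, and the function names below are made up for the example.

```python
def hamming74_encode(data_bits):
    """Encode 4 data bits into a 7-bit codeword with 3 parity bits.

    Codeword positions 1..7 hold: p1 p2 d1 p3 d2 d3 d4.
    """
    d1, d2, d3, d4 = data_bits
    p1 = d1 ^ d2 ^ d4  # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # covers positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4  # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_correct(codeword):
    """Return (corrected codeword, 1-based error position or 0 if clean).

    Recomputing the three parity checks yields a syndrome that, read as
    a binary number, is the position of a single flipped bit.
    """
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    position = s1 + 2 * s2 + 4 * s3
    if position:
        c[position - 1] ^= 1  # flip the bad bit back
    return c, position
```

Flipping any one bit of an encoded word and passing it to `hamming74_correct` recovers the original word and reports which position was hit; that kind of event is what gets logged as a correctable error. Two flipped bits in the same word would be miscorrected by this toy code, which is one reason real memory uses a stronger scheme.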
Quality Control at AMD must be good. by MROD · 2006-04-28 19:46 · Score: 5, Interesting

Having read a lot about this flaw it's actually amazing that AMD's quality control found the problem in the first place.

The actions needed to cause the problem to arise are so extreme that they'd never happen in the field, i.e. looping through tight floating-point-only instructions without any comparisons for maybe hours before the error occurs.

This would *NEVER* happen in the field. Firstly, in any modern OS the process would have been pre-empted long before any problem could occur (causing other instructions to run and hence stopping the overheating). Secondly, no real-world program would ever do this sort of thing as there would always be a comparison in the loop within the timeframe.

This is only a theoretical problem in the real world, especially as it reportedly affects only about 3000 processors in total. This is why AMD gave it such a low priority. We should just forget about it and move on.

--

Agrajag: "Oh no, not again!"
Re:An old problem by something_wicked_thi · 2006-04-28 20:14 · Score: 5, Informative

RTFA. They are offering a free replacement. However, the FDIV bug was overblown. For most people, it didn't matter (few people were using software that required division precise enough to be affected). This bug is even less worrisome. Its effect is, at the moment, completely unobserved in the wild using real world applications. The FDIV bug was apparent to anyone with a calculator.

I'm not saying AMD should be let off the hook completely, but the bug isn't a big problem, they are offering free replacements, and they publicized it. The FDIV bug was bigger (though still hardly catastrophic), Intel refused (at first) to offer replacements, and they sat on it. The two scenarios are nowhere near similar. Maybe AMD just has more character than Intel, or maybe they were watching in 94/95 when the FDIV bug happened and they've actually learned from Intel's mistakes. Regardless, this whole story is more of a heads-up to concerned buyers than a criticism of AMD.