Flawed AMD Chip Can Lead To Data Corruption

Posted by ryuzaki0 on Friday April 28, 2006 @05:39PM from the crunchy-mistakes dept.

Brandonski writes "Apparently AMD allowed some flawed chips to slip through their detection grid. The problem affects only a small number of chips and only single core 2.6 and 2.8 GHz CPUs." From the article: "It is believed that the glitch is triggered when the affected chip's FPU is made to loop through a series of memory-fetch, multiplication and addition operations without any condition checks on the result of the calculations. The loop has to run over and over again for long enough to cause localized heating which together with high ambient temperatures could combine to cause the result of the operation to be recorded incorrectly, leading to data corruption."

11 of 203 comments

Re:What? by qbwiz · 2006-04-28 17:49 · Score: 2, Interesting

Generally, chips aren't supposed to have localized heating problems. Either it should all have a problem, or none of it should.

--
Ewige Blumenkraft.
Fearmongering? by zaguar · 2006-04-28 17:54 · Score: 2, Interesting

Is it reasonable to be afraid of this? To exploit this in a way that allows running arbitrary code, you would need a buffer overflow, which is what this AMD weakness purportedly allows. However, how many are affected? Only a few of the AMD chips, and AMD has only, what, 30% of the market. So to code an exploit, you would be writing for a very limited audience, to the point where it is futile. Why not just exploit the latest createTextRange or WMF exploit in IE/Windows? Much more money in that.

--
"Sure there's porn and piracy on the Web but there's probably a downside too."
nice! by B3ryllium · 2006-04-28 18:28 · Score: 3, Interesting

Wow, that was fast. FreeBSD already has a patch for this.

Judging from the posting date, I *really* need to be updating my sources more often. :)

20060419: p7 FreeBSD-SA-06:14.fpu
Correct a local information leakage bug affecting AMD FPUs.

(could be an unrelated correction, I guess, it doesn't provide much more information in /usr/src/UPDATING)
CALL ESP by Myria · 2006-04-28 19:42 · Score: 3, Interesting

Probably the easiest erratum to come by is the instruction "CALL ESP" (or "CALL RSP"). On AMD CPUs, "CALL ESP" will jump to the address in ESP, *then* push the return address. However, on Intel CPUs, it will push the return address first, then jump to the value it just pushed. This is, of course, disastrous if you try to use it.

According to Intel errata documents, this is a bug in the Pentium Pro that has been kept for several generations. The Pentium and below, except the 8086 and 8088, worked correctly with this instruction.

If you want to differentiate Intel and AMD in your program and don't want to use CPUID, you can set up a test with CALL ESP.
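
As a rough illustration of the two orderings described above, here is a toy Python model. It is not an emulator and makes no claim about what any real CPU does; the function name `call_esp`, the `push_first` flag, the 4-byte stack slot, and the sample addresses are all assumptions made for the sketch.

```python
# Toy model of the two orderings described in the comment above.
# Everything here (names, 4-byte slots, addresses) is illustrative only.

def call_esp(esp, return_address, push_first):
    """Return (jump_target, new_esp, stack) for a modelled CALL ESP."""
    stack = {}
    if push_first:
        # Ordering attributed to Intel above: push first, then read ESP.
        esp -= 4
        stack[esp] = return_address
        target = esp
    else:
        # Ordering attributed to AMD above: read ESP as the target, then push.
        target = esp
        esp -= 4
        stack[esp] = return_address
    return target, esp, stack


amd_target, amd_esp, _ = call_esp(0x1000, 0x401005, push_first=False)
intel_target, intel_esp, _ = call_esp(0x1000, 0x401005, push_first=True)

# Both models leave ESP in the same place, but the jump targets differ by one slot.
print(hex(amd_target), hex(intel_target))
```

In this model the push-first ordering lands on the slot that now holds the return address, so the bytes of that address would be executed as code, which is one way to read "disastrous".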

Melissa

--
"Screw Sun, cross-platform will never work. Let's move on and steal the Java language." - Visual J++ Product Manager
Quality Control at AMD must be good. by MROD · 2006-04-28 19:46 · Score: 5, Interesting

Having read a lot about this flaw it's actually amazing that AMD's quality control found the problem in the first place.

The actions needed to cause the problem to arise are so extreme that they'd never happen in the field. That is, you would have to loop through tight floating-point-only instructions without any comparisons for maybe hours before the error occurs.

This would *NEVER* happen in the field. Firstly, in any modern OS the process would have been pre-empted long before any problem could occur (causing other instructions to run and hence stopping the overheating). Secondly, no real-world program would ever do this sort of thing as there would always be a comparison in the loop within the timeframe.

In the real world this is only a theoretical problem, especially as it affects only about 3000 processors in total (the figure that has been quoted). This is why AMD gave it such a low priority. We should just forget about it and move on.

--

Agrajag: "Oh no, not again!"
Re:An old problem by Soul-Burn666 · 2006-04-28 20:25 · Score: 2, Interesting

Actually hardware IS different. As complex as hardware is, it is much less complex than software and has much simpler logic to check. This allows for systems for "formal verification" which happen to work exceedingly well for hardware. For example IBM's "RuleBase" is a system that uses temporal logic to verify a certain piece of "code" (which will later be compiled to hardware) against a set of logical rules.
When the system can be used, it helps clear out logic bugs very efficiently.
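
To give a feel for what "checking logic against a rule" means, here is a minimal Python sketch of explicit-state invariant checking. It is not RuleBase and does not use its rule language; the toy round-robin arbiter, the `buggy` flag, and every function name are assumptions made up for the illustration.

```python
from collections import deque
from itertools import product


def arbiter_step(state, buggy=False):
    """Yield every successor of state = (last, grant_a, grant_b) for all request inputs."""
    last, _, _ = state
    for req_a, req_b in product((False, True), repeat=2):
        if req_a and req_b:
            if buggy:
                # Deliberate design error: both clients are granted at once.
                yield (last, True, True)
                continue
            # Round robin: grant whoever was not served last.
            grant_a = last != "a"
            grant_b = not grant_a
        else:
            grant_a, grant_b = req_a, req_b
        new_last = "a" if grant_a else "b" if grant_b else last
        yield (new_last, grant_a, grant_b)


def find_violation(initial, step, invariant):
    """Breadth-first search of all reachable states; return a violating state or None."""
    seen = {initial}
    queue = deque([initial])
    while queue:
        state = queue.popleft()
        if not invariant(state):
            return state
        for nxt in step(state):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return None


def mutual_exclusion(state):
    """Rule: the two grants are never active together."""
    _, grant_a, grant_b = state
    return not (grant_a and grant_b)


initial = ("b", False, False)
good = find_violation(initial, arbiter_step, mutual_exclusion)
bad = find_violation(initial, lambda s: arbiter_step(s, buggy=True), mutual_exclusion)
print(good, bad)
```

The point of the sketch is that every reachable state is visited, so a rule either holds everywhere or a concrete counterexample comes back. Real tools handle far larger state spaces and richer temporal rules than this single invariant.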

That being said, today's microprocessors are huge and therefore have to be split into modules in order to be tested like this. Moreover, it only tests logic. Other systems have to be used to test issues of overheating, cross-talk and actual physical design.

--
^_^
Re:What? by tomstdenis · 2006-04-29 00:37 · Score: 2, Interesting

There are two parts to that. First off, the composition of the die is varied. Some parts are the ALU, FPU, cache, etc. So depending where the current is going changes the heat [no duh]. The FPU is particularly nasty as unlike the ALU it takes at least 2 EX cycles to do anything and most complicated instructions are at least 4 EX cycles. This means something in the FPU is running for 4 cycles at a time, cannot be interrupted, etc.

So getting heat local to the FPU isn't too surprising. There are various things in place to mitigate that, for example, the heat spreader. But it can only absorb heat so fast. The lack of APIC interrupts (e.g. timers) makes this test rather artificial. If I recall correctly, OSes send timer interrupts to processors to schedule tasks. So this would have to be something that is beyond an OS's control. Like you'd have to write your own mini-OS or something.

The other part though is you have to keep in mind making processors is not an exact process. My two x85 series opterons probably have slightly different features (e.g. exact alignment) even though they're made from the same design. If I sliced them open and got "my first electron microscope" and looked at them I'd probably be able to measure slight differences. There are other controlled issues (quality of material, chemicals, etc). So that a batch of processors exhibit this problem is concerning but not impossible.

I'll bet you they probably have another test on the QA line now :-)

Tom

--
Someday, I'll have a real sig.
Re:Uh oh.. by JollyFinn · 2006-04-29 01:43 · Score: 2, Interesting

No, that won't crash it; it's a FLOATING POINT memory fetch, addition and multiplication loop! Then we need to unroll the loop enough to hide the floating-point unit latency, so that it stays active.

10 I = 10.1
20 K = I
21 K2 = I
22 K3 = I
23 K4 = I
30 K = K + 2.1
40 K = K * 2.1
50 K2 = K2 + 2.1
60 K2 = K2 * 2.1
70 K3 = K3 + 2.1
80 K3 = K3 * 2.1
90 K4 = K4 + 2.1
100 K4 = K4 * 2.1
110 GOTO 20
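
For anyone who would rather read it without line numbers, here is a rough Python transcription of the same unrolled structure: four independent add-then-multiply chains per pass. The function names `unrolled_pass` and `run_passes` are made up for this sketch, and the interpreter adds its own branches and overhead, so it only shows the shape of the loop and is not expected to stress an FPU the way the article describes.

```python
def unrolled_pass(seed=10.1, step=2.1):
    """One pass of the unrolled loop: four independent add/multiply chains."""
    k1 = k2 = k3 = k4 = seed
    k1 = k1 + step
    k1 = k1 * step
    k2 = k2 + step
    k2 = k2 * step
    k3 = k3 + step
    k3 = k3 * step
    k4 = k4 + step
    k4 = k4 * step
    return k1, k2, k3, k4


def run_passes(count):
    """Repeat the pass a fixed number of times (the BASIC version never stops)."""
    result = None
    for _ in range(count):
        result = unrolled_pass()
    return result


print(run_passes(1000))
```

Because each chain is reseeded at the top of every pass, the values stay bounded no matter how long the loop runs.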

--
Emacs is good operating system, but it has one flaw: Its text editor could be better.
Re:An old problem by AWhistler · 2006-04-29 05:05 · Score: 2, Interesting

Then you REALLY need to get new MBA textbooks, since the one you have been reading is too politically correct to be useful. Here is a link from the guy who discovered the bug which includes a timeline (I can't believe his FAQ is still online!)...

http://www.trnicely.net/pentbug/pentbug.html/

Pay close attention to questions 9, 10, and 11. It explains what REALLY happened, and the author's opinions on the matter, which to my memory are quite accurate. How do I know? At the time I owned a Gateway Pentium 90 that I could use the Windows calculator on to verify the bug. So once Intel announced the recall, I called to get mine replaced. The box they shipped the replacement in was about 12" x 12" x 8" and very overpadded...way larger than the boxes they ship processors in now for sale. I had to replace my Pentium, box it up and send it back to Intel, and I had to give them my credit card number so that they could bill me $500 ($600?) if I failed to return the old CPU.
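
For reference, the check most often circulated for the Pentium division flaw divides 4195835 by 3145727 and then multiplies back; flawed chips were widely reported to leave a residual of 256 instead of roughly zero. A minimal Python sketch of that check follows. The function name `fdiv_residual` and the tolerance of 1.0 are assumptions chosen for the illustration.

```python
def fdiv_residual(x=4195835.0, y=3145727.0):
    """Return x - (x / y) * y, which should be (almost) zero on a sound FPU."""
    return x - (x / y) * y


residual = fdiv_residual()
# A tiny rounding residue is fine; anything near the reported 256 is not.
verdict = "looks correct" if abs(residual) < 1.0 else "looks suspect"
print("FPU division", verdict, "- residual:", residual)
```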
Re:An old problem by LurkerXXX · 2006-04-29 14:02 · Score: 2, Interesting

What the hell kind of crappy MBA program did you go to? Intel did *NOT* handle it well. I had one of those CPUs. Intel tried to tell me (a scientific researcher) that my computations didn't need that level of FPU accuracy, and that they wouldn't replace it. It was only after we, the users, screamed bloody murder and brought lawsuits that they decided to back down and replace them all.
The PR nightmare was *caused* specifically by the way Intel handled the discovery. They thought that they had the right to decide which users did or didn't 'need' accurate FPU computations.
I have been an 'AMD fanboy' from that day forward, specifically because of Intel's totally botched handling of an engineering glitch.
Interesting! by seebs · 2006-04-29 14:13 · Score: 1, Interesting

A friend of mine and I can reliably crash some similar-generation AMD chips with a loop setting a region of memory to all zeroes, but not with a loop setting it to 0xaaaaaaaa. The chips just lock up. Takes anywhere from a few seconds (Linux) to a few minutes (Windows).
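
The kind of loop described above can be sketched in Python as a fill-then-verify pass over a buffer. This is only an illustration of the access pattern; it is not the original test program, and the name `fill_and_verify`, the region size, and the pass count are assumptions. A lock-up would simply show up as the call never returning.

```python
import struct


def fill_and_verify(pattern, words=4096, passes=8):
    """Write a 32-bit pattern across a buffer repeatedly and count readback mismatches."""
    region = bytearray(4 * words)
    mismatches = 0
    for _ in range(passes):
        # Fill: store the pattern into every 32-bit slot.
        for offset in range(0, len(region), 4):
            struct.pack_into("<I", region, offset, pattern)
        # Verify: read every slot back and compare.
        for (value,) in struct.iter_unpack("<I", region):
            if value != pattern:
                mismatches += 1
    return mismatches


for pattern in (0x00000000, 0xAAAAAAAA):
    print(hex(pattern), "mismatches:", fill_and_verify(pattern))
```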

--
My blog: http://www.seebs.net/log/ --- My iPhone/iPad app: http://www.seebs.net/seebsfrac/