Flawed AMD Chip Can Lead To Data Corruption
Brandonski writes "Apparently AMD allowed some flawed chips to slip through their detection grid. The problem affects only a small number of chips and only single core 2.6 and 2.8 GHz CPUs." From the article: "It is believed that the glitch is triggered when the affected chip's FPU is made to loop through a series of memory-fetch, multiplication and addition operations without any condition checks on the result of the calculations. The loop has to run over and over again for long enough to cause localized heating which together with high ambient temperatures could combine to cause the result of the operation to be recorded incorrectly, leading to data corruption."
Generally, chips aren't supposed to have localized heating problems. Either it should all have a problem, or none of it should.
Ewige Blumenkraft.
Is it reasonable to be afraid of this? To exploit it in a way that allows running arbitrary code, you would need a buffer overflow - which is what this AMD weakness is purported to allow. However, how many are affected? Only a few of the AMD chips, and AMD has only what, 30% of the market? So to code an exploit, you would be writing for a very limited audience, to the point where it is futile. Why not just exploit the latest createTextRange or WMF hole in IE/Windows? Much more money in that.
"Sure there's porn and piracy on the Web but there's probably a downside too."
Wow, that was fast. FreeBSD already has a patch for this. :)
20060419: p7 FreeBSD-SA-06:14.fpu
Correct a local information leakage bug affecting AMD FPUs.
(could be an unrelated correction, I guess, it doesn't provide much more information in /usr/src/UPDATING)
Judging from the posting date, I *really* need to be updating my sources more often.
Probably the easiest erratum to come by is the instruction "CALL ESP" (or "CALL RSP"). On AMD CPUs, "CALL ESP" will jump to the address in ESP, *then* push the return address. However, on Intel CPUs, it will push the return address first, then jump to the value it just pushed. This is, of course, disastrous if you try to use it.
According to Intel errata documents, this is a bug in the Pentium Pro that has been kept for several generations. The Pentium and below, except the 8086 and 8088, worked correctly with this instruction.
If you want to differentiate Intel and AMD in your program and don't want to use CPUID, you can set up a test with CALL ESP.
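To picture the two orderings described above, here is a toy Python model. It is only a sketch, not an x86 emulator: registers are plain integers, the stack is a dict, the word size is assumed to be 4 bytes, and the function names (including the `looks_like_push_first` detector) are made up for illustration.

```python
# Toy model of the two CALL ESP orderings described above.
# Assumptions: 32-bit words, stack grows downward, stack is a plain dict.

def call_esp_target_first(esp, return_addr, stack):
    """Read the jump target from ESP first, then push the return address."""
    target = esp                 # target is ESP as it was before the push
    esp -= 4                     # push: decrement ESP ...
    stack[esp] = return_addr     # ... and store the return address there
    return target, esp

def call_esp_push_first(esp, return_addr, stack):
    """Push the return address first, then read the target from the updated ESP."""
    esp -= 4
    stack[esp] = return_addr
    target = esp                 # target now points at the slot holding the return address
    return target, esp

def looks_like_push_first(call_impl, esp=0x1000, return_addr=0x401005):
    """Hypothetical detector: does the jump land on the freshly pushed slot?"""
    stack = {}
    target, new_esp = call_impl(esp, return_addr, stack)
    return target == new_esp
```

The detector mirrors the idea of telling the two behaviours apart without CPUID: after the call, check whether execution landed on the original stack top or 4 bytes below it.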
Melissa
"Screw Sun, cross-platform will never work. Let's move on and steal the Java language." - Visual J++ Product Manager
Having read a lot about this flaw it's actually amazing that AMD's quality control found the problem in the first place.
The actions needed to trigger the problem are so extreme that they'd never happen in the field, i.e. looping through tight floating-point-only instructions without any comparisons, for maybe hours, before the error occurs.
This would *NEVER* happen in the field. Firstly, in any modern OS the process would have been pre-empted long before any problem could occur (causing other instructions to run and hence stopping the overheating). Secondly, no real-world program would ever do this sort of thing, as there would always be a comparison in the loop within the timeframe.
In the real world this is only a theoretical problem, especially as it reportedly affects only about 3000 processors in total. This is why AMD gave it such a low priority. We should just forget about it and move on.
Agrajag: "Oh no, not again!"
Actually hardware IS different. As complex as hardware is, it is much less complex than software and has much simpler logic to check. This allows for systems for "formal verification" which happen to work exceedingly well for hardware. For example IBM's "RuleBase" is a system that uses temporal logic to verify a certain piece of "code" (which will later be compiled to hardware) against a set of logical rules.
When the system can be used, it helps clear out logic bugs very efficiently.
That being said, today's microprocessors are huge and therefore have to be split into modules in order to be tested like this. Moreover, it only tests logic. Other systems have to be used to test issues of overheating, cross-talk and actual physical design.
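For a feel of what checking a design against a logical rule means, here is a minimal Python sketch. It is not RuleBase and does not use its rule language; it is a brute-force reachability check of a single "this must never happen" property on a made-up two-requester arbiter. Real tools handle far larger state spaces with symbolic techniques, and every name below is hypothetical.

```python
from collections import deque

def check_invariant(initial, next_states, invariant):
    """Breadth-first search over all reachable states.

    Returns None if the invariant holds in every reachable state,
    otherwise a path (list of states) from the initial state to a violation.
    """
    parent = {initial: None}
    queue = deque([initial])
    while queue:
        state = queue.popleft()
        if not invariant(state):
            path = []
            while state is not None:     # walk back to the initial state
                path.append(state)
                state = parent[state]
            return path[::-1]
        for nxt in next_states(state):
            if nxt not in parent:        # visit each state only once
                parent[nxt] = state
                queue.append(nxt)
    return None

# Hypothetical design: state is (grant0, grant1).
def arbiter_ok(state):
    g0, g1 = state
    if g0 or g1:
        return [(0, 0)]                  # whoever holds the grant releases it
    return [(1, 0), (0, 1)]              # grant exactly one requester

def arbiter_buggy(state):
    g0, g1 = state
    if g0 or g1:
        return [(0, 0)]
    return [(1, 0), (0, 1), (1, 1)]      # bug: both grants can be raised together

def mutual_exclusion(state):
    """The rule: the two grants are never high at the same time."""
    return not (state[0] and state[1])
```

Calling `check_invariant((0, 0), arbiter_buggy, mutual_exclusion)` returns the shortest trace leading to the violation, which is the kind of counterexample a verification tool hands back to the designer.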
^_^
There are two parts to that. First off, the composition of the die is varied. Some parts are the ALU, FPU, cache, etc. So where the current is going changes where the heat is [no duh]. The FPU is particularly nasty as, unlike the ALU, it takes at least 2 EX cycles to do anything, and most complicated instructions take at least 4 EX cycles. This means something in the FPU is running for 4 cycles at a time, cannot be interrupted, etc.
So getting heat local to the FPU isn't too surprising. There are various things in place to mitigate that, for example, the heat spreader. But it can only absorb heat so fast. The lack of APIC interrupts (e.g. timers) makes this test rather artificial. If I recall correctly, OSes send timer interrupts to processors to schedule tasks. So this would have to be something that is beyond an OS's control. Like you'd have to write your own mini-OS or something.
The other part though is you have to keep in mind that making processors is not an exact process. My two x85 series opterons probably have slightly different features (e.g. exact alignment) even though they're made from the same design. If I sliced them open and got "my first electron microscope" and looked at them I'd probably be able to measure slight differences. There are other controlled issues (quality of material, chemicals, etc). So a batch of processors exhibiting this problem is concerning but not impossible.
I'll bet you they probably have another test on the QA line now :-)
Tom
Someday, I'll have a real sig.
No, that won't crash it. It's a FLOATING POINT memory fetch, addition and multiplication loop! Then we need to unroll the loop enough to hide the floating-point unit latency, so that it stays active.
10 I = 10.1
20 K = I
21 K2 = I
22 K3 = I
23 K4 = I
30 K = K + 2.1
40 K = K * 2.1
50 K2 = K2 + 2.1
60 K2 = K2 * 2.1
70 K3 = K3 + 2.1
80 K3 = K3 * 2.1
90 K4 = K4 + 2.1
100 K4 = K4 * 2.1
110 GOTO 20
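The same loop shape in Python, as a rough sketch only. An interpreter like CPython spends most of its time on bookkeeping, so this would not keep an FPU busy the way hand-written assembly would; it just shows the structure: reload the operand, then run four independent add/multiply chains whose results are never compared. The function name and the `iterations` parameter are made up for illustration (the BASIC version loops forever).

```python
def unrolled_fp_loop(iterations, seed=10.1, step=2.1):
    """Four independent add/multiply chains; the results are never tested inside the loop."""
    k1 = k2 = k3 = k4 = 0.0
    for _ in range(iterations):      # integer loop counter only
        k1 = k2 = k3 = k4 = seed     # reload the operand each pass (the "memory fetch")
        k1 = (k1 + step) * step      # chain 1
        k2 = (k2 + step) * step      # chain 2, independent of chain 1
        k3 = (k3 + step) * step      # chain 3
        k4 = (k4 + step) * step      # chain 4
    return k1, k2, k3, k4
```

Because no chain depends on another, a pipelined FPU can start the next operation while the previous one is still in flight, which is the point of the unrolling.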
Emacs is good operating system, but it has one flaw: Its text editor could be better.
Then you REALLY need to get new MBA textbooks, since the one you have been reading is too politically correct to be useful. Here is a link from the guy who discovered the bug which includes a timeline (I can't believe his FAQ is still online!)...
http://www.trnicely.net/pentbug/pentbug.html/
Pay close attention to questions 9, 10, and 11. It explains what REALLY happened, and the author's opinions on the matter, which to my memory are quite accurate. How do I know? At the time I owned a Gateway Pentium 90 that I could use the Windows calculator on to verify the bug. So once Intel announced the recall, I called to get mine replaced. The box they shipped the replacement in was about 12" x 12" x 8" and very overpadded...way larger than the boxes they ship processors in now for sale. I had to replace my Pentium, box it up and send it back to Intel, and I had to give them my credit card number so that they could bill me $500 ($600?) if I failed to return the old CPU.
The PR nightmare was *caused* specifically by the way Intel handled the discovery. They thought that they had the right to decide which users did or didn't 'need' accurate FPU computations.
I have been an 'AMD fanboy' from that day forward, specifically because of Intel's totally botched handling of an engineering glitch.
A friend of mine and I can reliably crash some similar-generation AMD chips with a loop setting a region of memory to all zeroes, but not with a loop setting it to 0xaaaaaaaa. The chips just lock up. Takes anywhere from a few seconds (Linux) to a few minutes (Windows).
My blog: http://www.seebs.net/log/ --- My iPhone/iPad app: http://www.seebs.net/seebsfrac/