Flawed AMD Chip Can Lead To Data Corruption
Brandonski writes "Apparently AMD allowed some flawed chips to slip through their detection grid. The problem affects only a small number of chips and only single core 2.6 and 2.8 GHz CPUs." From the article: "It is believed that the glitch is triggered when the affected chip's FPU is made to loop through a series of memory-fetch, multiplication and addition operations without any condition checks on the result of the calculations. The loop has to run over and over again for long enough to cause localized heating which together with high ambient temperatures could combine to cause the result of the operation to be recorded incorrectly, leading to data corruption."
Fetch Div, son of Eff Div, Heir to Count Zero, and Lord of a new generation of digital serfs, soon to be labled as having "emotional problems."
Generally, chips aren't supposed to have localized heating problems. Either it should all have a problem, or none of it should.
Ewige Blumenkraft.
Hey, I have an AMD 2.8Ghz. Maybe I should stop refresðN9'óI]öR9ù¥Î6ýPoe}+èa(ê{
"I've got more toys than Teruhisa Kitahara."
Corruption is the cardinal sin of a CPU. If it can't compute a result accurately, it should shut down rather than give a wrong answer.
I'm too young to remember the details (I think it goes back to the early eighties at least), but perhaps some of the elder gods that lurk around here might be able to supply more details.
sheep.horse - does not contain information on sheep or horses.
Is it reasonable to be afraid of this. To exploit this, in a way to allow running of arbitary code, you would need a buffer overflow - which is what this AMD weakness is purporting to allow. However, how many are affected? Only a few of the AMD chips, and AMD has only what, 30% of the market. So to code an exploit, you would be writing to a very limited audience, to a point where it is futile. Why not just exploit the latest create.Textrange of WMF exploit in IE/Windows? Much more money in that.
"Sure there's porn and piracy on the Web but there's probably a downside too."
The big question is will someone write malware/virus to somehow take advantage of this flaw?
I am curious how a virus could possibly exploit this. It would have to a) hog the resources so that it ran nearly exclusively, which would mean the virus already had control, and b) somehow cause a floating point error to result in a priviliages error. (priviliages and security routines rarely use floating point numbers). Also why would a kernel patch be released for this? It would hurt performance for the rest of us, customers with defective chips should simply return and replace them.
Philosophy.
Wow! AMD has invented a way to crash an infinite loop! Awesome! Intel? I bet their solution will take twice as long to crash this loop:
10 PRINT "HELLO WORLD"
20 GOTO 10
AMD is always innovating.
Hexy - a strategy game for iPhone/iPod Touch
AMD has a unique opportunity to do the right thing: offering to replace all the defective chips. If AMD does the right thing, then it will only help AMD in its litigation against Intel and in various attempts to increase marketshare. After all, would you not prefer to buy from a reputable company instead of a dishonest, shifty company?
"The company is also working with OEMs to identify affected parts and contact customers who could be affected - if they are, they will be offered free replacements."
forth paragraph in TFA.
Wow, that was fast. FreeBSD already has a patch for this.
:)
/usr/src/UPDATING)
Judging from the posting date, I *really* need to be updating my sources more often.
20060419: p7 FreeBSD-SA-06:14.fpu
Correct a local information leakage bug affecting AMD FPUs.
(could be an unrelated correction, I guess, it doesn't provide much more information in
Not likely. This is valid user code that is being executed. On other CPUs, the same code wouldn't cause a problem. Something like the F00F bug is fixable in the kernel by mucking with exception handler. This is pure user-land code.
Do you even lift?
These aren't the 'roids you're looking for.
This is different than the Intel bug; that was a logic flaw, where the chip computed a floating point quantity using an incorrect algorithm. This is an implementation error. In fact, the article mentions that they're going to re-spec the parts and they'll be fine. So if you've got a 2.8Ghz part, and you run this loop at 2.8Ghz (within the old spec), it's like you're "overclocking" (because you're actually outside of AMD's new spec). My guess is that if you over-bought your heatsink and got something better than the stock OEM cooling solution, you would be fine even if you ran this loop all day. Yay, arctic silver!
Is it reasonable to be afraid of this. To exploit this, in a way to allow running of arbitary code, you would need a buffer overflow
I think you are misunderstanding the nature of the problem. This is not data corruption as in buffer overflow, this is data corruption as in the calculation comes up with an incorrect answer. For some people that is not acceptible.
Just imagine if you had one of those Pinnacle chips and accidently pressed @[=g3,8d]\&fbb=-q]/hk%fg followed by delete..
I have used Prime95 in the past to identify problematic configurations. It's a tool whose main goal is to find prime numbers, but it can be used as an excellent stress test for the processor and memory units.
Could Prime95 be used to identify those AMD chips?
Probably the easiest errata to come by is the instruction "CALL ESP" (or "CALL RSP"). On AMD CPUs, "CALL ESP" will jump to the address in ESP, *then* push the return address. However, on Intel CPUs, it will push the return address first, then jump to the value it just pushed. This is, of course, disasterous if you try to use it.
According to Intel errata documents, this is a bug in the Pentium Pro that has been kept for several generations. The Pentium and below, except the 8086 and 8088, worked correctly with this instruction.
If you want to differentiate Intel and AMD in your program and don't want to use CPUID, you can set up a test with CALL ESP.
Melissa
"Screw Sun, cross-platform will never work. Let's move on and steal the Java language." - Visual J++ Product Manager
Having read a lot about this flaw it's actually amazing that AMD's quality control found the problem in the first place.
The actions needed to cause the problem to arise are so extreme that they'd never happen in the field. i.e. Loop through tight floating-point only instructions without any comparisons for maybe hours before the error occurs.
This would *NEVER* happen in the field. Firstly, in any modern OS the process would have been pre-empted long before any problem could occur (causing other instructions to run and hence stopping the overheating). Secondly, no real-world program would ever do this sort of thing as there would always be a comparison in the loop within the timeframe.
This is a theoretical problem only in the real world, especially as it only affects about 3000 processors in total (it has been quoted). This is why AMD gave it such a low priority. We should just forget about it and move on.
Agrajag: "Oh no, not again!"
Some of the tendering spreadsheets i've seen for a few companies i've worked for have had quite a lot of calculation going on in them - change a few cells that others depend on that have others depending on them, etc.... do that all day, it adds up quick.
You only need 1 of those operations in that instance to screw up and you could be down a few million dollars, if it's not picked up.
Even forgetting that it's just the moral thing to do...Risk vs replacement cost = no brainer. If only 3000 cpus are affected at say $300 each for amd to sell retail (i'm sure their cost is FAR less), they'd be mad not to just do it (maybe even offer a free speed bump) and reap the positive PR.
All it needs is for ONE company to blame a budget blowout on them and it's well and truly paid for...
smash.
I run: Windows, OS X, Linux, FreeBSD. Just because you have a hammer, doesn't mean everything is a nail.
If you have any interrupts coming in, or your loop has a termination condition. I think you have to have your hardware set to send an interrupt many hours in the future then start an otherwise nonterminating loop.
So under normal conditions on normal PC hardware, this simply won't happen.
There are two parts to that. First off, the composition of the die is varied. Some parts are the ALU, FPU, cache, etc. So depending where the current is going changes the heat [no duh]. The FPU is particularly nasty as unlike the ALU it takes at least 2 EX cycles to do anything and most complicated instructions are at least 4 EX cycles. This means something in the FPU is running for 4 cycles at a time, cannot be interrupted, etc.
:-)
So getting heat local to the FPU isn't too surprising. There are various things in place to mitigate that, for example, the heat spreader. But it can only absorb heat so fast. The lack of APIC interrupts (e.g. timers) makes this test rather artificial. If I recall correctly OSes send timer interrupts to processors to schedule tasks. So this would have to be something that is beyond an OSes control. Like you'd have to write your own mini-OS or something.
The other part though is you have to keep in mind making processors is not an exact process. My two x85 series opterons probably have slightly different features (e.g. exact alignment) even though they're made from the same design. If I sliced them open and got "my first electron microscope" and looked at them I'd probably be able to measure slight differences. There are other controlled issues (quality of material, chemcials, etc). So that a batch of processors exhibit this problem is concerning but not impossible.
I'll bet you they probably have another test on the QA line now
Tom
Someday, I'll have a real sig.
When AMD has a problem, it only affects 3000 or so processors and causes minor corruption when a million-line-long piece of code is called without being stopped at any time. When Intel has a problem it affects millions of processors and crashes your computer when a single 32-bit command is called. I know whom I'll be buying from.
Please, for the good of Humanity, vote Obama.
Of course, I expect AMD's production testing dept to have far better code, since they will devote more job hours to it and know proprietary chip details. Still, different parts of AMD as emailed me several times to thank me because they found the pgms useful. Great.
But these guys know what they're doing. Heat transfer from the hot multipliers has to be carefully analysed [3D finite element heat transfer analysis]. I suspect something far more mundane, like someone reducing die or slug thickness, or a mfg problem with the die/slug gap or thermal goop.
I think someone's confusing user error/not enough troubleshooting with an almost not reproducable issue. TFA mentions a lot of instructions without enough pause of FPU code to cool down. This isn't your bug if you're playing WC3. WC3 uses TCP/IP. TCP/IP generates interrupts - lots of interrupts. So many interrupts that your FPU has plenty of time to cool down between calculations. There are many handy ways of troubleshooting this issue of yours, and I'd bet you're not going to identify the problem by some slashdot story submission.
You are apt to be doing this extensively when processing audio or video streams.
"Prove all things; hold fast that which is good." [KJV: I Thessalonians 5:21]