Flawed AMD Chip Can Lead To Data Corruption
Brandonski writes "Apparently AMD allowed some flawed chips to slip through their detection grid. The problem affects only a small number of chips and only single core 2.6 and 2.8 GHz CPUs." From the article: "It is believed that the glitch is triggered when the affected chip's FPU is made to loop through a series of memory-fetch, multiplication and addition operations without any condition checks on the result of the calculations. The loop has to run over and over again for long enough to cause localized heating which together with high ambient temperatures could combine to cause the result of the operation to be recorded incorrectly, leading to data corruption."
Fetch Div, son of Eff Div, Heir to Count Zero, and Lord of a new generation of digital serfs, soon to be labeled as having "emotional problems."
Generally, chips aren't supposed to have localized heating problems. Either it should all have a problem, or none of it should.
Ewige Blumenkraft.
Hey, I have an AMD 2.8Ghz. Maybe I should stop refresðN9'óI]öR9ù¥Î6ýPoe}+èa(ê{
"I've got more toys than Teruhisa Kitahara."
Corruption is the cardinal sin of a CPU. If it can't compute a result accurately, it should shut down rather than give a wrong answer.
I'm too young to remember the details (I think it goes back to the early eighties at least), but perhaps some of the elder gods that lurk around here might be able to supply more details.
sheep.horse - does not contain information on sheep or horses.
Is it reasonable to be afraid of this? To exploit this in a way that allows running arbitrary code, you would need a buffer overflow, which is what this AMD weakness is purported to allow. However, how many are affected? Only a few of the AMD chips, and AMD has only what, 30% of the market? So to code an exploit, you would be writing for a very limited audience, to the point where it is futile. Why not just exploit the latest createTextRange or WMF exploit in IE/Windows? Much more money in that.
"Sure there's porn and piracy on the Web but there's probably a downside too."
The big question is will someone write malware/virus to somehow take advantage of this flaw?
I am curious how a virus could possibly exploit this. It would have to a) hog the resources so that it ran nearly exclusively, which would mean the virus already had control, and b) somehow cause a floating point error to result in a privilege error (privilege and security routines rarely use floating point numbers). Also, why would a kernel patch be released for this? It would hurt performance for the rest of us; customers with defective chips should simply return and replace them.
Philosophy.
To trigger the effect, the loop has to be run millions of times, an AMD customer source told Reg Hardware, potentially for hours at a time with no other operations being introduced during the run.
A flaw is a flaw, no doubt. However, how likely is this particular scenario to happen other than in a benchmark test? And 3000 CPUs? Given the news lately, this almost categorizes as an oopsy. Security firms are losing millions of customers' SSNs and everything. AMD could probably tell you how to identify the CPU and afford to set up an exchange program.
That which does not kill me only postpones the inevitable.
Memory fetch, multiplication, addition... where have I heard this before? Oh, I know. 3D graphics. Typically, those results go right to the screen and don't cause much damage if they get corrupted. I would be more worried about video or audio encoding, though, since those results do make a difference. Otherwise, I can't think of much else that would trigger this bug.
loop through a series of memory-fetch, multiplication and addition operations without any condition checks on the result of the calculations
I've been saying that for ages, check your results, but naah! Them young'uns and their series of memory-fetch, multiplication and addition operations.
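For the young'uns, here is a minimal sketch of what "check your results" could look like for this kind of loop. It is written in Python for brevity; a real trigger would be a tight compiled loop, and the function names are made up for illustration. It runs the same fetch-multiply-add loop twice and compares the two results bit for bit, on the assumption that a transient hardware fault is unlikely to corrupt both passes identically.

```python
import struct


def multiply_accumulate(xs, ys):
    """Fetch, multiply, add: the loop shape described in the article."""
    total = 0.0
    for x, y in zip(xs, ys):
        total += x * y  # no condition check on the intermediate result
    return total


def bits(value):
    """Raw 64-bit pattern of a double, so the comparison is exact."""
    return struct.unpack("<Q", struct.pack("<d", value))[0]


def checked_multiply_accumulate(xs, ys):
    """Run the loop twice and refuse to return a result that differs."""
    first = multiply_accumulate(xs, ys)
    second = multiply_accumulate(xs, ys)
    if bits(first) != bits(second):
        raise ArithmeticError("passes disagree: possible hardware fault")
    return first


if __name__ == "__main__":
    xs = [0.5 * i for i in range(1000)]
    ys = [0.25 * i for i in range(1000)]
    print(checked_multiply_accumulate(xs, ys))
```

Doubling the work is the obvious cost, which is probably why nobody does it in an inner loop.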
It's not the first time their server chips have experienced heat problems...
Wow! AMD has invented a way to crash an infinite loop! Awesome! Intel? I bet their solution will take twice as long to crash this loop:
10 PRINT "HELLO WORLD"
20 GOTO 10
AMD is always innovating.
Hexy - a strategy game for iPhone/iPod Touch
"...customers with defective chips should simply return and replace them."
Simple for whom? It can be a real pain in the ass to swap a CPU.
AMD says that from now on, chips that have this problem will be rerated to lower clock speeds. It would be nice if they offered customers the option of turning down the clockspeed in exchange for a partial refund.
AMD has a unique opportunity to do the right thing: offering to replace all the defective chips. If AMD does the right thing, then it will only help AMD in its litigation against Intel and in various attempts to increase marketshare. After all, would you not prefer to buy from a reputable company instead of a dishonest, shifty company?
"The company is also working with OEMs to identify affected parts and contact customers who could be affected - if they are, they will be offered free replacements."
Fourth paragraph in TFA.
Wow, that was fast. FreeBSD already has a patch for this.
:)
20060419: p7 FreeBSD-SA-06:14.fpu
Correct a local information leakage bug affecting AMD FPUs.
(could be an unrelated correction, I guess, it doesn't provide much more information in /usr/src/UPDATING)
Judging from the posting date, I *really* need to be updating my sources more often.
AMD has a unique opportunity to do the right thing: offering to replace all the defective chips. If AMD does the right thing, then it will only help AMD in its litigation against Intel and in various attempts to increase marketshare. After all, would you not prefer to buy from a reputable company instead of a dishonest, shifty company?
AMD have probably learnt from Intel's PR disaster. Without Intel's FP bug and the precedent it set, there's a good chance AMD would attempt to handle this problem the same way Intel did theirs. Business is business. An indication of this is that every component in the computer you're using probably has both documented and undocumented errata, and recalls are pretty unusual events.
In theory, a malicious user could exploit this vulnerability in a routine that's already calling that particular series of instructions. In practice, however, I think it would be nearly impossible to do anything useful, because you'll have no control over the values being written to memory; you'll just know that they aren't correct.
Just because it can't be explained doesn't mean it isn't true. Science fits into reality... not the other way around.
Not likely. This is valid user code that is being executed. On other CPUs, the same code wouldn't cause a problem. Something like the F00F bug is fixable in the kernel by mucking with the exception handler. This is pure user-land code.
Do you even lift?
These aren't the 'roids you're looking for.
This is different than the Intel bug; that was a logic flaw, where the chip computed a floating point quantity using an incorrect algorithm. This is an implementation error. In fact, the article mentions that they're going to re-spec the parts and they'll be fine. So if you've got a 2.8Ghz part, and you run this loop at 2.8Ghz (within the old spec), it's like you're "overclocking" (because you're actually outside of AMD's new spec). My guess is that if you over-bought your heatsink and got something better than the stock OEM cooling solution, you would be fine even if you ran this loop all day. Yay, arctic silver!
Jesus. The things that people attribute to AMD's "moral superiority" here on Slashdot... It's astounding.
If AMD does "the right thing" it won't be because of a moral high road. It's because Intel already stepped on a similar PR landmine long ago. Learning from your rival's huge mistakes is not worth high praise. It's just common sense.
-- Mojo Tooth : exploring our world as only an idiot can.
I sure wouldn't want to find out which spots were hotter than others... touching an overheated chip that much would hurt! (Plus, if it gets too hot the CMOS should kill the machine anyway)
I make websites and stuff. Buy one.
There's no way the kernel can do anything about it, from the description of the problem.
And, contrary to AMD's attempts to downplay this issue, there are two immediate areas that I can think of which are affected. The first are certain scientific calculations (even worse, those involving Beowulf clusters). The second are CAD simulations.
Both areas can involve calculations which run for days at a time; far in excess of the hours mentioned in the fine article.
In general, people don't really seem to pay much attention to either the reliability of the CPU or the quality of the RAM though. Witness the number of really cheap systems that people buy for this type of work. Perhaps this will be a "heads up" that yes, even the most basic subsystem of your computer can go haywire, skewing your results, and wasting your time.
The best way to predict the future is to create it. - Peter Drucker.
AMD says that from now on, chips that have this problem will be rerated to lower clock speeds..
...
And then end users will overclock these CPUs
Is it reasonable to be afraid of this? To exploit this in a way that allows running arbitrary code, you would need a buffer overflow
I think you are misunderstanding the nature of the problem. This is not data corruption as in buffer overflow, this is data corruption as in the calculation comes up with an incorrect answer. For some people that is not acceptable.
Yes i have a faulty cpu..
Turn its clock down, right, yep done that.
So now I'll never be affected by this obscure glitch that is almost totally unreproducible outside of synthetic testing, oh thanks very much.
can i have the check now please ?
*check arrives*
*cashes check*
*clocks cpus back up*
XML - A clever joke would be here if
These flaws only occur in unlikely circumstances, but they will be useful tools when fighting our new computer overlords.
Just imagine if you had one of those Pinnacle chips and accidentally pressed @[=g3,8d]\&fbb=-q]/hk%fg followed by delete..
I have used Prime95 in the past to identify problematic configurations. It's a tool whose main goal is to find prime numbers, but it can be used as an excellent stress test for the processor and memory units.
Could Prime95 be used to identify those AMD chips?
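For anyone wondering why a prime finder doubles as a hardware test: it repeats a long computation whose correct answer is already known, so any mismatch points at the machine rather than the math. Below is a rough sketch of that known-answer idea using the Lucas-Lehmer test for Mersenne numbers. Big caveat: this version uses Python's exact integer arithmetic, whereas Prime95 does its multiplications with floating-point FFTs, so this sketch shows the self-check principle only and would not exercise the FPU the same way. The names `KNOWN` and `self_test` are made up for illustration.

```python
def lucas_lehmer(p):
    """Return True if 2**p - 1 is prime. Valid for odd prime exponents p."""
    m = (1 << p) - 1
    s = 4
    for _ in range(p - 2):
        s = (s * s - 2) % m
    return s == 0


# Known answers for small odd prime exponents:
# True means 2**p - 1 is a Mersenne prime.
KNOWN = {
    3: True, 5: True, 7: True, 11: False, 13: True,
    17: True, 19: True, 23: False, 29: False, 31: True,
}


def self_test(rounds=1):
    """Recompute the known answers; return the exponents that came out wrong."""
    failures = []
    for _ in range(rounds):
        for p, expected in KNOWN.items():
            if lucas_lehmer(p) != expected:
                failures.append(p)
    return failures


if __name__ == "__main__":
    bad = self_test(rounds=100)
    print("all results match" if not bad else f"mismatches at exponents {bad}")
```

Whether a test like this would catch these particular chips is an open question, since the reported trigger needs hours of uninterrupted FPU work.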
How about that, I was wondering why my computer was giving me the message "all your base belong to us". heh ok that was dumb but hey! its slashdot, i know one of you laughed. But seriously, I do have this chip and my computer is evil, therefore, it must be the chip! Not the fact I have 98 gigabytes of music porn and uhh porn.
Actually it's very common for cpu manufacturers to just underclock overheating chips.
Also, an AMD chip is only rated up to around 75°C anyway, from memory.
If his overclocking it causes a problem he can kick his own ass.
Probably the easiest erratum to come by is the instruction "CALL ESP" (or "CALL RSP"). On AMD CPUs, "CALL ESP" will jump to the address in ESP, *then* push the return address. However, on Intel CPUs, it will push the return address first, then jump to the value it just pushed. This is, of course, disastrous if you try to use it.
According to Intel errata documents, this is a bug in the Pentium Pro that has been kept for several generations. The Pentium and below, except the 8086 and 8088, worked correctly with this instruction.
If you want to differentiate Intel and AMD in your program and don't want to use CPUID, you can set up a test with CALL ESP.
Melissa
"Screw Sun, cross-platform will never work. Let's move on and steal the Java language." - Visual J++ Product Manager
Having read a lot about this flaw it's actually amazing that AMD's quality control found the problem in the first place.
The actions needed to cause the problem to arise are so extreme that they'd never happen in the field. i.e. Loop through tight floating-point only instructions without any comparisons for maybe hours before the error occurs.
This would *NEVER* happen in the field. Firstly, in any modern OS the process would have been pre-empted long before any problem could occur (causing other instructions to run and hence stopping the overheating). Secondly, no real-world program would ever do this sort of thing as there would always be a comparison in the loop within the timeframe.
This is a theoretical problem only in the real world, especially as it only affects about 3000 processors in total (it has been quoted). This is why AMD gave it such a low priority. We should just forget about it and move on.
Agrajag: "Oh no, not again!"
as long as i have the nice fat "refund" check for clocking down the cpus, who cares :P
XML - A clever joke would be here if
don't "we" commonly measure code in terms of how many errors per KLoC it contains? How does any of this pale in comparison?
Some information derived from an old leaked MS email: The internal rule in Redmond is "4 bugs per KLoC".
That's one bug per 250 lines of code: one bug every couple of functions for the smaller functions, several bugs per function for the more intensive ones.
And it's not like OSS is safe from this either. Sure, having error-prone hardware will make perfection even more impossible, but seriously, you are far more likely to receive faulty RAM out of the box than to run into an issue with one of these suckers.
Vehicle Stars used car search is my current project
This can't trigger if you have any interrupts coming in, or if your loop has a termination condition. I think you would have to set your hardware to send an interrupt many hours in the future and then start an otherwise nonterminating loop.
So under normal conditions on normal PC hardware, this simply won't happen.
True, but think of all the money and resources lost by Intel for a rare error that would not affect the vast majority of the customers using the chip. How many perfectly decent chips were just thrown away, passing the cost on to the consumer who had to make up for the money lost in remanufacturing these chips? Here is a report done by Intel on how often an average user might see an error: http://www.intel.com/support/processors/pentium/fdiv/wp/6.htm
It's certainly bad PR for AMD and they will most likely offer an exchange program like Intel, but the practical need for exchanges isn't really there (if what I am reading in other comments is correct).
I've run Prime95 on two of my boxes. Fine on one. My AMD Athlon XP 2000 (Socket A), which I often suspected to be unstable, reliably dies after less than an hour of running the Prime95 code (originally discovered from running the Seventeen or Bust client, which includes the same Prime95 code). BIOS says temperature is normal, so I'll just blame the motherboard caps for now.
Either way, Prime95 has given proof to my suspicion like no other tool could. I no longer run important/intensive apps there.
Intel had a very similar problem with some of their Itanium chips recently too, however i don't recall them offering free replacements, i believe they just told customers to clock down affected processors!
However, very few people cared because very few people use itanium chips, and those who do are used to them not performing as advertised.
http://spamdecoy.net - free throwaway anonymous email - avoid spam!
You could have bought a "downrated" chip and overclocked it, too. The clock rate on the box is merely specification, the chip can remain operational at higher clock rates, the manufacturer just won't be responsible for what happens then.
Justice is the sheep getting arrested while an impartial judge declares the vote void.
... I just got my pair of 285s! ... well fortunately I don't do a lot of FPU work like that. That and I run cpufreq in "ondemand" mode so I don't care about heat...
Tom
Someday, I'll have a real sig.
There are two parts to that. First off, the composition of the die is varied. Some parts are the ALU, FPU, cache, etc. So depending where the current is going changes the heat [no duh]. The FPU is particularly nasty as unlike the ALU it takes at least 2 EX cycles to do anything and most complicated instructions are at least 4 EX cycles. This means something in the FPU is running for 4 cycles at a time, cannot be interrupted, etc.
:-)
So getting heat local to the FPU isn't too surprising. There are various things in place to mitigate that, for example, the heat spreader. But it can only absorb heat so fast. The lack of APIC interrupts (e.g. timers) makes this test rather artificial. If I recall correctly, OSes send timer interrupts to processors to schedule tasks. So this would have to be something that is beyond an OS's control. Like you'd have to write your own mini-OS or something.
The other part though is you have to keep in mind making processors is not an exact process. My two x85 series Opterons probably have slightly different features (e.g. exact alignment) even though they're made from the same design. If I sliced them open and got "my first electron microscope" and looked at them I'd probably be able to measure slight differences. There are other controlled issues (quality of material, chemicals, etc). So that a batch of processors exhibits this problem is concerning but not impossible.
I'll bet you they probably have another test on the QA line now
Tom
Someday, I'll have a real sig.
Overheating leading to data corruption? Since when is this a flaw in chip design?
Since a normal temperature of functionning is written in the specifications of the hip
The Wise adapts himself to the world. The Fool adapts the world to himself. Therefore, all progress depends on the Fool.
Did anyone else have the reflex of doing cat /proc/cpuinfo?
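If you want more than eyeballing the output, here is a small sketch that pulls the vendor, model name, and clock out of that file. It assumes the usual x86 Linux layout (one block of `key : value` lines per logical CPU, blank line between blocks) and the field names `vendor_id`, `model name`, and `cpu MHz`; the helper names are made up. Note it can only tell you the model and speed, not whether your particular chip is from the affected batch.

```python
def parse_cpuinfo(text):
    """Split /proc/cpuinfo text into one dict per processor block."""
    cpus = []
    for block in text.strip().split("\n\n"):
        fields = {}
        for line in block.splitlines():
            key, sep, value = line.partition(":")
            if sep:  # skip lines that have no "key : value" shape
                fields[key.strip()] = value.strip()
        if fields:
            cpus.append(fields)
    return cpus


def summarize(cpus):
    """Return (vendor, model name, MHz) for each processor block."""
    return [
        (c.get("vendor_id", "?"), c.get("model name", "?"), c.get("cpu MHz", "?"))
        for c in cpus
    ]


if __name__ == "__main__":
    with open("/proc/cpuinfo") as f:
        for row in summarize(parse_cpuinfo(f.read())):
            print(*row, sep=" | ")
```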
Floating point is hopelessly problematic for the average programmer and too many average programmers wrote the programs from Excel to MS Calculator and by any number of other vendors, all of which had "Pentium bugs" reported, that didn't need particular Intel hardware to be reproduced.
When AMD has a problem, it only affects 3000 or so processors and causes minor corruption when a million-line-long piece of code is called without being stopped at any time. When Intel has a problem it affects millions of processors and crashes your computer when a single 32-bit command is called. I know whom I'll be buying from.
Please, for the good of Humanity, vote Obama.
Since a normal temperature of functionning is written in the specifications of the hip
;-)
I can't find any written instructions on my hip. Which is another piece of circumstantial evidence of my theory that my parents bought me from a chinese clone factory.
If J.K.R wrote Windows: Puteulanus fenestra mortalis!
Funny, the ad that appeared on the comments page had some code. P.S. Anyone remember the HCF instruction (halt and catch fire)?
Although the article specifies 2.6 and 2.8 GHz Opterons, I've crashed my Venice core 3000+ socket 754 7 times from online gaming conditions generated by a particular application (Warcraft 3 TFT)
I thought it was the graphic card at first, but the type of crash I've been experiencing and the difficulty to reproduce it (I generally have to play AT with a pro gamer and go on about a 7 game win streak to get game conditions right for the crash) and it does have to be warm in my room...
WC3TFT can reproducibly create a lot of memory operations at very high speeds, repeatedly. Millions of times? Try millions of operations over a 10 minute game. Sounds like it's not just 'hypothetical' to me.
https://www.gnu.org/philosophy/free-sw.html
So basically you have to stand on one leg, be male, wearing a pink tutu, live in Niger with exactly 3 children who happen to be eating pizza during a lunar eclipse for this to happen?
The best argument against democracy is a five-minute conversation with the average voter.
- Winston Churchill
On the other hand, a compiler fix is plausible. The idea would be to avoid generating this kind of code. I'm sure some compiler gurus can point out precedent for this sort of thing.
Of course, I expect AMD's production testing dept to have far better code, since they will devote more job hours to it and know proprietary chip details. Still, different parts of AMD have emailed me several times to thank me because they found the pgms useful. Great.
But these guys know what they're doing. Heat transfer from the hot multipliers has to be carefully analysed [3D finite element heat transfer analysis]. I suspect something far more mundane, like someone reducing die or slug thickness, or a mfg problem with the die/slug gap or thermal goop.
The NO CARRIER joke was so nice but does not fit with this probl*NO DATA*
--- I am known for the ones who want to find me on the net. Is that a privacy risk or a privilege? One might wonder..
http://www.heatsink-guide.com/content.php?content=maxtemp.shtml
OK, so Windows is immune too. These operating systems have a clock tick that interrupts at 100, 250, or 1000 Hz. That interrupts the FPU.
Crank up the clock rate even more if you are worried and you just have to run your CPU in tropical temperatures. You could also ping flood the machine, causing plenty of network interrupts.
I was an Intel man for many years. It's like being a Ford or a Chevy man: you know, you ignore all good things about the competition and smugly goof on all their mistakes while ignoring your favorite's eccentricities. My wakeup call came as I was looking into building a cheap comp to play UT 2K4 on. I went through the benchmark results to find a good processor for cheap and was appalled at the prices that Intel wanted for middle of the road dreck while AMD had several budget choices that were faster. I finally settled on a Sempron 3100+ and I can't believe how many games I can play with just an nvidia 6600le and they all rock. I tried out a friend's Dell that was supposed to be high end and it couldn't match my resolution or fps and he paid 2300 for his Intel boat anchor while I paid exactly 404.17 with shipping for my budget screamer. All this and ethical treatment for customers too? Long Live AMD
I have formed my own personal postulate/theory/law... and its corollary: It is impossible to completely test a sufficiently complex system in every possible way to be certain that it's bug-free.
Along those lines - many years ago, Professor Turing set out to find a test for [among other things] the possible presence of an infinite loop within a computer program.
Sadly, though, he didn't get very far with that line of inquiry...
Hard to say. This is a design margin thing, depending upon worst case conditions plus localized heating, and localized heating (AFAIK) isn't generally modeled. Writing test vectors to find all logic errors is difficult, unpleasant, and labor intensive work. Even if software identifies the worst case path, it won't account for localized heating.
I'd guess there are other problems out there like this, but they generally can be avoided by staying well away from maximum operating conditions: keep your chip cool and within the specified voltage range, and don't overclock.
Contribute to civilization: ari.aynrand.org/donate
I suspect something far more mundane, like someone reducing die or slug thickness, or a mfg problem with the die/slug gap or thermal goop.
Care to go into a bit more detail for us noobs?
See, the point was, in response to the GP's suggestion that they should give people money to downclock the chips themselves, I was pointing out the inherent flaw: there's nothing stopping them from just pretending to have a faulty chip, pretending to underclock it, leaving it unchanged, and pocketing a pile of money
XML - A clever joke would be here if
Yes - the ability to take corruption into account is what differentiates mainframes (and also high-end IBM UNIX servers like p595) from PCs.
You can defy gravity... for a short time
I'm not sure why TFM didn't link to AMD for their disclosure of the problem, but here it is: http://www.amd.com/us-en/0,,3715_13965,00.html
If they have a way of verifying whether a CPU is affected they could indeed give the money back and tell the customer that his warranty now only applies to the lower frequency. Since these are server CPUs they probably have maintenance contracts attached and those require that the CPU is clocked at manufacturer spec. Warranty means a lot more to companies than home users.
I mean, they could exchange the CPU (maybe just changing the clock multiplier and sending it back) but nothing would stop you from operating it at the higher frequency except for the warranty.
Justice is the sheep getting arrested while an impartial judge declares the vote void.
No sufficiently complex system can ever be completely bug-free.
What do you get if you multiply six by nine?!
Constitutional rights may be respected, repealed, or modified; but they must never be ignored.
I don't know about fearmongering, but it's certainly going to some trouble to make a false accusation. Either it slipped through their "detection grid", or it was detected and ignored. It can't have been both.
You are apt to be doing this extensively when processing audio or video streams.
"Prove all things; hold fast that which is good." [KJV: I Thessalonians 5:21]
A friend of mine and I can reliably crash some similar-generation AMD chips with a loop setting a region of memory to all zeroes, but not with a loop setting it to 0xaaaaaaaa. The chips just lock up. Takes anywhere from a few seconds (linux) to a few minutes (windows).
My blog: http://www.seebs.net/log/ --- My iPhone/iPad app: http://www.seebs.net/seebsfrac/
I have a P90 (one of those that was remarked down to P75 for the market sweet spot, but because it's really a P90, it runs fine at 90MHz) that has some sort of FP bug... it passes the Calculator test, but locks up with certain math-intensive screen savers, like the old After Dark kaleidoscope. It never showed any other symptoms in its 6 years of useful life, so I didn't bother to RMA it.
I don't consider this as bad as the Sept.1998 batch of K6-2 450Mhz CPUs that could not run certain 32bit code AT ALL (neither Win32 Setup nor any species of Linux would run). AMD refused to replace those at all.
~REZ~ #43301. Who'd fake being me anyway?
Interrupts are not sufficient. You can make a tight loop and still hog >99% of the CPU scheduler. As long as the interrupts don't exceed the thermal time constant of the cooling solution you can easily write a virus to do this (assuming you know the loop).
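A back-of-the-envelope sketch of that claim, using the 100, 250, and 1000 Hz tick rates mentioned elsewhere in this thread. The 10 microsecond cost per tick is purely an assumed figure for illustration, not a measurement, and the function name is made up. The point is just that even at the highest tick rate the loop keeps nearly all of the CPU time.

```python
def loop_duty_cycle(tick_hz, handler_seconds):
    """Fraction of wall time left to the loop when every tick costs handler_seconds."""
    busy = tick_hz * handler_seconds
    if not 0.0 <= busy <= 1.0:
        raise ValueError("interrupt load must be between 0 and 1")
    return 1.0 - busy


if __name__ == "__main__":
    HANDLER_SECONDS = 10e-6  # assumed cost of one timer interrupt
    for hz in (100, 250, 1000):
        gap_ms = 1000.0 / hz  # uninterrupted stretch between ticks
        share = loop_duty_cycle(hz, HANDLER_SECONDS)
        print(f"{hz} Hz: tick every {gap_ms:g} ms, loop gets {share:.2%} of the CPU")
```

Whether a gap of a millisecond or so is short compared to the thermal time constant of the die and cooler is the part this arithmetic cannot answer.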
https://www.accountkiller.com/removal-requested
From Webster's, the first definition is "posing a problem : difficult to solve or decide". It is the model that is at fault and presents the problem. It is problematic not because it is plagued by problems, but because it is the problem, the riddle: how can you use it with good results for complicated computations?