Flawed AMD Chip Can Lead To Data Corruption
Brandonski writes "Apparently AMD allowed some flawed chips to slip through their detection grid. The problem affects only a small number of chips and only single core 2.6 and 2.8 GHz CPUs." From the article: "It is believed that the glitch is triggered when the affected chip's FPU is made to loop through a series of memory-fetch, multiplication and addition operations without any condition checks on the result of the calculations. The loop has to run over and over again for long enough to cause localized heating which together with high ambient temperatures could combine to cause the result of the operation to be recorded incorrectly, leading to data corruption."
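For readers who want to picture the loop the summary describes, here is a rough sketch in Python. It is not the actual trigger code, which the article does not give; the function name `multiply_accumulate` and the array sizes are invented for the example. A Python interpreter also runs plenty of branches of its own, so this only shows the shape of the loop (fetch, multiply, add, no check on the result), not something that would load an FPU the way a tight native loop would.

```python
import random


def multiply_accumulate(a, b, passes):
    """Sweep two arrays repeatedly, doing fetch-multiply-add with no checks."""
    acc = 0.0
    for _ in range(passes):
        for x, y in zip(a, b):
            # Fetch two operands, multiply, add to the running sum.
            # Nothing here ever inspects or compares the value of acc.
            acc += x * y
    return acc


# Hypothetical inputs: two arrays of 1000 pseudo-random floats.
rng = random.Random(0)
a = [rng.random() for _ in range(1000)]
b = [rng.random() for _ in range(1000)]
total = multiply_accumulate(a, b, passes=100)
```

Per the article, the real case needs a loop like this to run uninterrupted for a very long time on an affected part in a warm environment; running the sketch above will not reproduce anything.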
Fetch Div, son of Eff Div, Heir to Count Zero, and Lord of a new generation of digital serfs, soon to be labeled as having "emotional problems."
Overheating leading to data corruption? Since when is this a flaw in chip design?
Hey, I have an AMD 2.8Ghz. Maybe I should stop refresðN9'óI]öR9ù¥Î6ýPoe}+èa(ê{
"I've got more toys than Teruhisa Kitahara."
Corruption is the cardinal sin of a CPU. If it can't compute a result accurately, it should shut down rather than give a wrong answer.
I'm sure someone will have a kernel patch to prevent this from happening in Linux in very short order. The big question is: will someone write malware or a virus to somehow take advantage of this flaw?
I'm too young to remember the details (I think it goes back to the early eighties at least), but perhaps some of the elder gods that lurk around here might be able to supply more details.
sheep.horse - does not contain information on sheep or horses.
Is it reasonable to be afraid of this? To exploit this, in a way to allow running of arbitrary code, you would need a buffer overflow, which is what this AMD weakness is purporting to allow. However, how many are affected? Only a few of the AMD chips, and AMD has only what, 30% of the market. So to code an exploit, you would be writing for a very limited audience, to the point where it is futile. Why not just exploit the latest createTextRange or WMF exploit in IE/Windows? Much more money in that.
"Sure there's porn and piracy on the Web but there's probably a downside too."
To trigger the effect, the loop has to be run millions of times, an AMD customer source told Reg Hardware, potentially for hours at a time with no other operations being introduced during the run.
A flaw is a flaw, no doubt. However, how likely is this particular scenario to happen outside of a benchmark test? And 3,000 CPUs? Given the news lately, this almost qualifies as an oopsy. Security firms are losing millions of customers' SSNs and everything. AMD could probably tell you how to identify the affected CPUs and afford to set up an exchange program.
That which does not kill me only postpones the inevitable.
Memory fetch, multiplication, addition... where have I heard this before? Oh, I know. 3D graphics. Typically, those results go right to the screen and don't cause much damage if they get corrupted. I would be more worried about video or audio encoding, though, since those results do make a difference. Otherwise, I can't think of much else that would trigger this bug.
loop through a series of memory-fetch, multiplication and addition operations without any condition checks on the result of the calculations
I've been saying that for ages, check your results, but naah! Them young'uns and their series of memory-fetch, multiplication and addition operations.
It's not the first time their server chips have experienced heat problems...
Wow! AMD has invented a way to crash an infinite loop! Awesome! Intel? I bet their solution will take twice as long to crash this loop:
10 PRINT "HELLO WORLD"
20 GOTO 10
AMD is always innovating.
Hexy - a strategy game for iPhone/iPod Touch
AMD has a unique opportunity to do the right thing: offering to replace all the defective chips. If AMD does the right thing, then it will only help AMD in its litigation against Intel and in various attempts to increase marketshare. After all, would you not prefer to buy from a reputable company instead of a dishonest, shifty company?
"The company is also working with OEMs to identify affected parts and contact customers who could be affected - if they are, they will be offered free replacements."
forth paragraph in TFA.
Wow, that was fast. FreeBSD already has a patch for this. :)
Judging from the posting date, I *really* need to be updating my sources more often.
20060419: p7 FreeBSD-SA-06:14.fpu
Correct a local information leakage bug affecting AMD FPUs.
(could be an unrelated correction, I guess, it doesn't provide much more information in /usr/src/UPDATING)
AMD has a unique opportunity to do the right thing: offering to replace all the defective chips. If AMD does the right thing, then it will only help AMD in its litigation against Intel and in various attempts to increase marketshare. After all, would you not prefer to buy from a reputable company instead of a dishonest, shifty company?
AMD have probably learnt from Intel's PR disaster. Without Intel's FP bug and the precedent it set, there's a good chance AMD would attempt to handle this problem the same way Intel did theirs. Business is business. An indication of this is that every component in the computer you're using probably has both documented and undocumented errata, and recalls are pretty unusual events.
This is different than the Intel bug; that was a logic flaw, where the chip computed a floating point quantity using an incorrect algorithm. This is an implementation error. In fact, the article mentions that they're going to re-spec the parts and they'll be fine. So if you've got a 2.8Ghz part, and you run this loop at 2.8Ghz (within the old spec), it's like you're "overclocking" (because you're actually outside of AMD's new spec). My guess is that if you over-bought your heatsink and got something better than the stock OEM cooling solution, you would be fine even if you ran this loop all day. Yay, arctic silver!
That would be the first thing I do with my CPU. How about you!?
Jesus. The things that people attribute to AMD's "moral superiority" here on Slashdot... It's astounding.
If AMD does "the right thing" it won't be because of a moral high road. It's because Intel already stepped on a similar PR landmine long ago. Learning from your rival's huge mistakes is not worth high praise. It's just common sense.
-- Mojo Tooth : exploring our world as only an idiot can.
I sure wouldn't want to find out which spots were hotter than others... touching an overheated chip that much would hurt! (Plus, if it gets too hot the CMOS should kill the machine anyway)
I make websites and stuff. Buy one.
There's no way the kernel can do anything about it, from the description of the problem.
And, contrary to AMD's attempts to downplay this issue, there are two immediate areas that I can think of which are affected. The first is certain scientific calculations (even worse, those involving Beowulf clusters). The second is CAD simulations.
Both areas can involve calculations which run for days at a time; far in excess of the hours mentioned in the fine article.
In general, people don't really seem to pay much attention to either the reliability of the CPU or the quality of the RAM though. Witness the number of really cheap systems that people buy for this type of work. Perhaps this will be a "heads up" that yes, even the most basic subsystem of your computer can go haywire, skewing your results, and wasting your time.
The best way to predict the future is to create it. - Peter Drucker.
AMD says that from now on, chips that have this problem will be re-rated to lower clock speeds...
...
And then end users will overclock these CPUs.
From TFA
AMD said it has introduced another screening test to catch any further affected parts. Chips caught in this test in future will be re-rated at a lower clock speed to prevent the problem.
Don't you find this to be a bit disturbing? (pun intended)
I wonder how hot the circuit had to be to fail. Let's say 100°C.
Now, that means AMD test their chips at less than that, let's say 90°C.
At least this bug was found. How many more like it are there, but we simply don't have the proper trace to find it?
Is it reasonable to be afraid of this? To exploit this, in a way to allow running of arbitrary code, you would need a buffer overflow
I think you are misunderstanding the nature of the problem. This is not data corruption as in buffer overflow; this is data corruption as in the calculation comes up with an incorrect answer. For some people that is not acceptable.
These flaws only occur in unlikely circumstances, but they will be useful tools when fighting our new computer overlords.
Just imagine if you had one of those Pinnacle chips and accidentally pressed @[=g3,8d]\&fbb=-q]/hk%fg followed by delete..
I have used Prime95 in the past to identify problematic configurations. It's a tool whose main goal is to find prime numbers, but it can be used as an excellent stress test for the processor and memory units.
Could Prime95 be used to identify those AMD chips?
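The general idea behind using a stress tool this way can be sketched as below: run a deterministic floating-point workload over and over and compare each result against a reference value, since on healthy hardware the answers must be identical. This is only an illustration of the compare-against-a-reference idea, not Prime95's actual algorithm, and nothing here is known to detect the specific AMD fault. The names `workload` and `stress_check` are made up for the example.

```python
def workload(n):
    """A deterministic multiply-and-add workload over n terms."""
    acc = 0.0
    for i in range(1, n + 1):
        acc += (i * 0.5) * (i * 0.25)
    return acc


def stress_check(rounds, n):
    """Rerun the workload and count rounds that disagree with the first run."""
    reference = workload(n)
    mismatches = 0
    for _ in range(rounds):
        # Identical inputs must give bit-identical output on sound hardware,
        # so any difference points at the machine rather than the code.
        if workload(n) != reference:
            mismatches += 1
    return mismatches


if __name__ == "__main__":
    # Hypothetical sizes; a real stress run would go on for hours.
    print("mismatching rounds:", stress_check(rounds=10, n=100_000))
```

One weakness of this sketch is that the reference is computed on the same machine, so a consistently wrong result would go unnoticed; comparing against a value computed elsewhere avoids that.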
How about that, I was wondering why my computer was giving me the message "all your base belong to us". heh ok that was dumb but hey! its slashdot, i know one of you laughed. But seriously, I do have this chip and my computer is evil, therefore, it must be the chip! Not the fact I have 98 gigabytes of music porn and uhh porn.
Probably the easiest erratum to come by is the instruction "CALL ESP" (or "CALL RSP"). On AMD CPUs, "CALL ESP" will jump to the address in ESP, *then* push the return address. However, on Intel CPUs, it will push the return address first, then jump to the value it just pushed. This is, of course, disastrous if you try to use it.
According to Intel errata documents, this is a bug in the Pentium Pro that has been kept for several generations. The Pentium and below, except the 8086 and 8088, worked correctly with this instruction.
If you want to differentiate Intel and AMD in your program and don't want to use CPUID, you can set up a test with CALL ESP.
Melissa
"Screw Sun, cross-platform will never work. Let's move on and steal the Java language." - Visual J++ Product Manager
Having read a lot about this flaw it's actually amazing that AMD's quality control found the problem in the first place.
The actions needed to cause the problem to arise are so extreme that they'd never happen in the field. i.e. Loop through tight floating-point only instructions without any comparisons for maybe hours before the error occurs.
This would *NEVER* happen in the field. Firstly, in any modern OS the process would have been pre-empted long before any problem could occur (causing other instructions to run and hence stopping the overheating). Secondly, no real-world program would ever do this sort of thing as there would always be a comparison in the loop within the timeframe.
This is a theoretical problem only in the real world, especially as it only affects about 3000 processors in total (it has been quoted). This is why AMD gave it such a low priority. We should just forget about it and move on.
Agrajag: "Oh no, not again!"
You don't understand. We're all excruciatingly dimwitted here. So we can only comprehend things in dirt simple terms, like "good" and "evil". And if something is "evil", its rival must be "good", because if there was an ounce more of nuance or complexity involved than that, us Slashdotters' brains would explode!
don't "we" commonly measure code in terms of how many errors per KLoC it contains? How does any of this pale in comparison?
Some information derived from an old leaked MS email: The internal rule in Redmond is "4 bugs per KLoC".
that's one bug per 250 lines of code: one bug every couple of functions for the smaller functions, several bugs per function for the more intensive ones.
and it's not like OSS is safe from this either. Sure, having error-prone hardware will make perfection even more impossible, but seriously, you are far more likely to receive faulty RAM out of the box than to run into an issue with one of these suckers.
Vehicle Stars used car search is my current project
If you have any interrupts coming in, or your loop has a termination condition. I think you have to have your hardware set to send an interrupt many hours in the future then start an otherwise nonterminating loop.
So under normal conditions on normal PC hardware, this simply won't happen.
True, but think of all the money and resources lost by Intel for a rare error that would not affect the vast majority of the customers using the chip. How many perfectly decent chips were just thrown away, passing the cost on to the consumer, who had to make up for the money lost in remanufacturing these chips? Here is a report done by Intel on how often an average user might see an error: http://www.intel.com/support/processors/pentium/fdiv/wp/6.htm
It's certainly bad PR for AMD and they will most likely offer an exchange program like Intel, but the practical need for exchanges isn't really there (if what I am reading in other comments is correct).
I've run Prime95 on two of my boxes. Fine on one. My AMD Athlon XP 2000 (Socket A), which I often suspected to be unstable, reliably dies after less than an hour of running the Prime95 code (originally discovered from running the Seventeen or Bust client, which includes the same Prime95 code). BIOS says temperature is normal, so I'll just blame the motherboard caps for now.
Either way, Prime95 has given proof to my suspicion like no other tool could. I no longer run important/intensive apps there.
Netcraft confirms it. FreeBSD is dying.
The second post of the story already examined this otherwise humorous jest. And just so you know... the NO CARRIER thing is over now... you can stop trying to use it.
... I just got my pair of 285s! ... well fortunately I don't do a lot of FPU work like that. That and I run cpufreq in "ondemand" mode so I don't care about heat...
Tom
Someday, I'll have a real sig.
Did anyone else have the reflex of doing cat /proc/cpuinfo?
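For anyone who would rather script that check, here is a small sketch that pulls the "model name" lines out of /proc/cpuinfo-style text. It assumes the usual Linux "key : value" layout; the function name `cpu_models` and the sample text are invented for illustration and are not output from a real affected machine. It tells you only what model the kernel reports, not whether a given chip is one of the faulty ones.

```python
def cpu_models(cpuinfo_text):
    """Return the 'model name' value for each processor entry in the text."""
    models = []
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        # Skip blank lines and lines without a "key : value" shape.
        if sep and key.strip() == "model name":
            models.append(value.strip())
    return models


# Illustrative sample only (hypothetical, not captured from real hardware).
SAMPLE = """processor\t: 0
vendor_id\t: AuthenticAMD
model name\t: AMD Opteron(tm) Processor 154
cpu MHz\t\t: 2800.000
"""

if __name__ == "__main__":
    print(cpu_models(SAMPLE))
    # On a Linux box you could instead read the real file:
    #   with open("/proc/cpuinfo") as f:
    #       print(cpu_models(f.read()))
```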
Floating point is hopelessly problematic for the average programmer and too many average programmers wrote the programs from Excel to MS Calculator and by any number of other vendors, all of which had "Pentium bugs" reported, that didn't need particular Intel hardware to be reproduced.
When AMD has a problem, it only affects 3000 or so processors and causes minor corruption when a million-line-long piece of code is called without being stopped at any time. When Intel has a problem it affects millions of processors and crashes your computer when a single 32-bit command is called. I know whom I'll be buying from.
Please, for the good of Humanity, vote Obama.
Funny, the ad that appeared on the comments page had some code. P.S. Anyone remember the HCF instruction (halt and catch fire)?
Although the article specifies 2.6 and 2.8 GHz Opterons, I've crashed my Venice core 3000+ socket 754 7 times from online gaming conditions generated by a particular application (Warcraft 3 TFT)
I thought it was the graphic card at first, but the type of crash I've been experiencing and the difficulty to reproduce it (I generally have to play AT with a pro gamer and go on about a 7 game win streak to get game conditions right for the crash) and it does have to be warm in my room...
WC3TFT can reproducibly create a lot of memory operations at very high speeds, repeatedly. Millions of times? Try millions of operations over a 10 minute game. Sounds like it's not just 'hypothetical' to me.
https://www.gnu.org/philosophy/free-sw.html
So basically you have to stand on one leg, be male, wearing a pink tu tu, live in niger with exactly 3 children who happen to be eating pizza during a lunar eclipse for this to happen?
The best argument against democracy is a five-minute conversation with the average voter.
- Winston Churchill
A whole paragraph in Forth ?! Ingenious!
Of course, I expect AMD's production testing dept to have far better code, since they will devote more job hours to it and know proprietary chip details. Still, different parts of AMD have emailed me several times to thank me because they found the programs useful. Great.
But these guys know what they're doing. Heat transfer from the hot multipliers has to be carefully analysed [3D finite element heat transfer analysis]. I suspect something far more mundane, like someone reducing die or slug thickness, or a mfg problem with the die/slug gap or thermal goop.
The NO CARRIER joke was so nice but does not fit with this probl*NO DATA*
--- I am known for the ones who want to find me on the net. Is that a privacy risk or a privilege? One might wonder..
Wait... a few weeks ago they talked about flaws with an Intel chip... and the AMD Fanboys assured me this would never happen, that only Intel can make mistakes! I thought AMD was teh l33t hax0r!! Why God Why!!
OK, so Windows is immune too. These operating systems have a clock tick that interrupts at 100, 250, or 1000 Hz. That interrupts the FPU.
Crank up the clock rate even more if you are worried and you just have to run your CPU in tropical temperatures. You could also ping flood the machine, causing plenty of network interrupts.
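To put rough numbers on the tick rates mentioned above, here is a back-of-the-envelope sketch of how many CPU cycles pass between timer interrupts. The 2.8 GHz figure comes from the story; the function name `cycles_between_ticks` is made up for the example. Whether an interruption that brief is enough to stop the localized heating is the poster's claim, and this arithmetic does not settle it.

```python
def cycles_between_ticks(clock_hz, tick_hz):
    """Whole CPU cycles that elapse between two timer interrupts."""
    return clock_hz // tick_hz


if __name__ == "__main__":
    clock_hz = 2_800_000_000  # 2.8 GHz part, as in the story
    for tick_hz in (100, 250, 1000):
        cycles = cycles_between_ticks(clock_hz, tick_hz)
        print(f"{tick_hz} Hz tick: {cycles:,} cycles between interrupts")
```

Even at a 1000 Hz tick the loop gets a few million uninterrupted cycles at a time, so the argument rests on how long the heat takes to build up rather than on the loop never running at all.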
I was an Intel man for many years. It's like being a Ford or a Chevy man: you know, you ignore all the good things about the competition and smugly goof on all their mistakes while ignoring your favorite's eccentricities. My wakeup call came as I was looking into building a cheap comp to play UT 2K4 on. I went through the benchmark results to find a good processor for cheap and was appalled at the prices that Intel wanted for middle-of-the-road dreck while AMD had several budget choices that were faster. I finally settled on a Sempron 3100+ and I can't believe how many games I can play with just an Nvidia 6600 LE, and they all rock. I tried out a friend's Dell that was supposed to be high end and it couldn't match my resolution or fps, and he paid 2300 for his Intel boat anchor while I paid exactly 404.17 with shipping for my budget screamer. All this and ethical treatment for customers too? Long live AMD.
I have formed my own personal postulate/theory/law... and its corollary: It is impossible to completely test a sufficiently complex system in every possible way to be certain that it's bug-free.
Along those lines - many years ago, Professor Turing set out to find a test for [among other things] the possible presence of an infinite loop within a computer program.
Sadly, though, he didn't get very far with that line of inquiry...
I suspect something far more mundane, like someone reducing die or slug thickness, or a mfg problem with the die/slug gap or thermal goop.
Care to go into a bit more detail for us noobs?
Yes - the ability to take corruption into account is what differentiates mainframes (and also high-end IBM UNIX servers like the p595) from PCs.
You can defy gravity... for a short time
I'm not sure why TFM didn't link to AMD for their disclosure of the problem, but here it is: http://www.amd.com/us-en/0,,3715_13965,00.html
Sorry pal, but the world ain't black and white. Replacing those parts is the WRONG thing to do from everything but a PR perspective. This kind of thing is called errata, design flaws which are either worked around or have little impact on the end user. There's not been a processor released without a list of these. What Intel failed to do was convince people that this was no big deal.
If you want to demand 100% perfectly designed processors expect little innovation and a price tag an order of magnitude higher. The consumer ultimately pays for every recall.
One more thing, this is totally irrelevant to the lawsuit. And small/underdog does not mean "all good" and big/topdog does not mean "all bad."
As an engineer old enough to remember listening for tight loops on an AM radio, I suggest trying Prime95. There may be other undiscovered overheating bugs, and not just on these AMD chips either.
No sufficiently complex system can ever be completely bug-free.
What do you get if you multiply six by nine?!
Constitutional rights may be respected, repealed, or modified; but they must never be ignored.
I don't know about fearmongering, but it's certainly going to some trouble to make a false accusation. Either it slipped through their "detection grid", or it was detected and ignored. It can't have been both.
You are apt to be doing this extensively when processing audio or video streams.
"Prove all things; hold fast that which is good." [KJV: I Thessalonians 5:21]
A friend of mine and I can reliably crash some similar-generation AMD chips with a loop setting a region of memory to all zeroes, but not with a loop setting it to 0xaaaaaaaa. The chips just lock up. Takes anywhere from a few seconds (linux) to a few minutes (windows).
My blog: http://www.seebs.net/log/ --- My iPhone/iPad app: http://www.seebs.net/seebsfrac/
I have a P90 (one of those that was remarked down to P75 for the market sweet spot, but because it's really a P90, it runs fine at 90MHz) that has some sort of FP bug... it passes the Calculator test, but locks up with certain math-intensive screen savers, like the old After Dark kaleidoscope. It never showed any other symptoms in its 6 years of useful life, so I didn't bother to RMA it.
I don't consider this as bad as the Sept.1998 batch of K6-2 450Mhz CPUs that could not run certain 32bit code AT ALL (neither Win32 Setup nor any species of Linux would run). AMD refused to replace those at all.
~REZ~ #43301. Who'd fake being me anyway?
"Problematic" means "debatable" or "dubious". It does not mean "plagued by problems". Did you mean "problematic" or "plagued by problems"?
From Webster's, the first definition is "posing a problem : difficult to solve or decide". It is the model that is at fault and presents the problem. It is not that it is plagued by problems, but that it is the problem, the riddle: how you can use it with good results for complicated computations.