Flawed AMD Chip Can Lead To Data Corruption
Brandonski writes "Apparently AMD allowed some flawed chips to slip through their detection grid. The problem affects only a small number of chips and only single core 2.6 and 2.8 GHz CPUs." From the article: "It is believed that the glitch is triggered when the affected chip's FPU is made to loop through a series of memory-fetch, multiplication and addition operations without any condition checks on the result of the calculations. The loop has to run over and over again for long enough to cause localized heating which together with high ambient temperatures could combine to cause the result of the operation to be recorded incorrectly, leading to data corruption."
Corruption is the cardinal sin of a CPU. If it can't compute a result accurately, it should shut down rather than give a wrong answer.
I'm too young to remember the details (I think it goes back to the early eighties at least), but perhaps some of the elder gods that lurk around here might be able to supply more details.
sheep.horse - does not contain information on sheep or horses.
"The company is also working with OEMs to identify affected parts and contact customers who could be affected - if they are, they will be offered free replacements."
forth paragraph in TFA.
Not likely. This is valid user code that is being executed. On other CPUs, the same code wouldn't cause a problem. Something like the F00F bug is fixable in the kernel by mucking with exception handler. This is pure user-land code.
Do you even lift?
These aren't the 'roids you're looking for.
source
Do you even lift?
These aren't the 'roids you're looking for.
Is it reasonable to be afraid of this. To exploit this, in a way to allow running of arbitary code, you would need a buffer overflow
I think you are misunderstanding the nature of the problem. This is not data corruption as in buffer overflow, this is data corruption as in the calculation comes up with an incorrect answer. For some people that is not acceptible.
I have used Prime95 in the past to identify problematic configurations. It's a tool whose main goal is to find prime numbers, but it can be used as an excellent stress test for the processor and memory units.
Could Prime95 be used to identify those AMD chips?
Some of the tendering spreadsheets i've seen for a few companies i've worked for have had quite a lot of calculation going on in them - change a few cells that others depend on that have others depending on them, etc.... do that all day, it adds up quick.
You only need 1 of those operations in that instance to screw up and you could be down a few million dollars, if it's not picked up.
Even forgetting that it's just the moral thing to do...Risk vs replacement cost = no brainer. If only 3000 cpus are affected at say $300 each for amd to sell retail (i'm sure their cost is FAR less), they'd be mad not to just do it (maybe even offer a free speed bump) and reap the positive PR.
All it needs is for ONE company to blame a budget blowout on them and it's well and truly paid for...
smash.
I run: Windows, OS X, Linux, FreeBSD. Just because you have a hammer, doesn't mean everything is a nail.
If you have any interrupts coming in, or your loop has a termination condition. I think you have to have your hardware set to send an interrupt many hours in the future then start an otherwise nonterminating loop.
So under normal conditions on normal PC hardware, this simply won't happen.
The actions needed to cause the problem to arise are so extreme that they'd never happen in the field.
This kind of thing is standard practice. If you want to stress test a piece of hardware, you write specialised test code which will consume the maximum amount of power possible, not a real world program. You have to be sure that nobody will be able to write software which will drive the processor harder than your tests have. Its good that AMD found this fault, and even better that they owned up to it, but it's not remarkable.
If I seem short sighted, it is because I stand on the shoulders of midgets
Of course, I expect AMD's production testing dept to have far better code, since they will devote more job hours to it and know proprietary chip details. Still, different parts of AMD as emailed me several times to thank me because they found the pgms useful. Great.
But these guys know what they're doing. Heat transfer from the hot multipliers has to be carefully analysed [3D finite element heat transfer analysis]. I suspect something far more mundane, like someone reducing die or slug thickness, or a mfg problem with the die/slug gap or thermal goop.