Flawed AMD Chip Can Lead To Data Corruption

I dub thee by Anonymous Coward · 2006-04-28 17:46 · Score: 2, Funny

Fetch Div, son of Eff Div, Heir to Count Zero, and Lord of a new generation of digital serfs, soon to be labeled as having "emotional problems."

Re:I dub thee by Overly+Critical+Guy · 2006-04-29 11:05 · Score: 1

Since this is an AMD problem, expect lots of justifications and defensiveness compared to if this was an Intel problem.

--
"Sufferin' succotash."
Re:I dub thee by Anarke_Incarnate · 2006-04-29 12:43 · Score: 1

If you read the article, it sounded like some sort of apologetic approach anyhow: "If star A is in Uranus while star B is in Neptune, and you look at a pr0n site, then there is a chance that the girl's boobies will appear larger than they are." Basically it states that if the thermals are bad, and you cause a hotspot on the CPU, there MAY be corruption in the data processed. That's a bit of a stretch. It's not good, but it's not major.

Re:What? by qbwiz · 2006-04-28 17:49 · Score: 2, Interesting

Generally, chips aren't supposed to have localized heating problems. Either it should all have a problem, or none of it should.

--
Ewige Blumenkraft.

I Have an AMD CPU by ozmanjusri · 2006-04-28 17:50 · Score: 5, Funny

Hey, I have an AMD 2.8Ghz. Maybe I should stop refresðN9'óI]öR9ù¥Î6ýPoe}+èa(ê{

--
"I've got more toys than Teruhisa Kitahara."

Re:I Have an AMD CPU by zaguar · 2006-04-28 18:52 · Score: 5, Funny

ðN9'óI]öR9ù¥Î6ýPoe}+èa(ê{
Interesting Perl script.

--
"Sure there's porn and piracy on the Web but there's probably a downside too."
Re:I Have an AMD CPU by MadUndergrad · 2006-04-28 19:27 · Score: 1

Mod parent "LotR nerd". Me too, for knowing what that is.
Re:I Have an AMD CPU by ozmanjusri · 2006-04-28 20:35 · Score: 1

Interesting Perl script.
Let's see...
$ ./line_noise.pl
Warning: Programmer attempting to re-invent the wheel. There's a function that does the exact same thing on CPAN. Sometimes it actually works.
ERROR: Unable to create life
Exiting

--
"I've got more toys than Teruhisa Kitahara."
Re:I Have an AMD CPU by arkhan_jg · 2006-04-28 21:08 · Score: 1

You have a 2.8Ghz Opteron in your desktop PC at home? Don't you have someone to press the refresh button for you?

--
Remember kids, it's all fun and games until someone commits wholesale galactic genocide.
Re:I Have an AMD CPU by Minwee · 2006-04-29 01:33 · Score: 4, Funny

ðN9'óI]öR9ù¥Î6ýPoe}+èa(ê{
Interesting Perl script.
It's also rule number 26 in sendmail.cf.
Re:I Have an AMD CPU by chris_eineke · 2006-04-29 01:51 · Score: 1

Yeah, I tested it out...

It's an operating system.

With drivers.

And GUI.

And emacs. ;)

--
"All you have to do is be fragile and grateful. So stay the underdog." Chuck Palahniuk, Choke
Re:I Have an AMD CPU by ozmanjusri · 2006-04-29 02:01 · Score: 2, Funny

Don't you have someone to press the refresh button for you?
Yeah, but unfortunately I hired a MCSE and it's turning out to be tougher than I thought training him.

--
"I've got more toys than Teruhisa Kitahara."
Re:I Have an AMD CPU by Dolda2000 · 2006-04-29 05:23 · Score: 1

It's also "Hello World" in APL.
Re:I Have an AMD CPU by dascandy · 2006-04-29 19:53 · Score: 1

Did you try bananas? They seem to work quite well.
Re:I Have an AMD CPU by ozmanjusri · 2006-04-29 22:45 · Score: 1

Did you try bananas? They seem to work quite well.
Yeah. They did better than the MCSE, but they run out of momentum too quickly.

--
"I've got more toys than Teruhisa Kitahara."

Corruption by XanC · 2006-04-28 17:51 · Score: 2, Informative

Corruption is the cardinal sin of a CPU. If it can't compute a result accurately, it should shut down rather than give a wrong answer.

Re:Corruption by frosty_tsm · 2006-04-28 17:58 · Score: 1

As in, commit electronic seppuku?
Re:Corruption by leendertv · 2006-04-28 18:51 · Score: 5, Insightful

No CPU can guarantee freedom from corruption; the designer's goal is just to minimize its likelihood. The design margins are usually such that proper operation is ensured, except for the statistical outliers. However, even CPUs with several error checking and correcting mechanisms can still corrupt data; it is just extremely unlikely. A CPU can never know for sure whether it can compute a result accurately, or whether an operation was performed correctly, just as no communications system can achieve a bit error rate of 0.

Data corruption in integrated circuits can come from several different sources. Cosmic rays are likely to alter memory values, especially so in DRAM cells. Typically, only ICs for space applications are actually radiation hardened. Much less likely, transistor device noise can corrupt data. Transistor device noise is usually more an issue in RF circuits. Finally, not all manufacturing defects can be found during manufacturing test, since most test sequences don't even achieve 100% fault coverage under currently used fault models, and this does not even consider how closely the models represent the actual circuit failure modes.

Really, for most people this floating point data corruption is probably a non-issue. It is even more unlikely that errors in floating point data lead to exploits. It is more likely that some bits of your DRAM memory will get corrupted. On my system with ECC RAM that is a few years old, logs show that I get about 1 or 2 (correctable) errors per day...
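
For anyone wondering what a "correctable" error means in practice, here is a minimal sketch of single-error correction using a textbook Hamming(7,4) code. This is an illustration only: real ECC DIMMs use wider codes over 64-bit words, and the function names here are made up for the example.

```python
from typing import List, Tuple

def hamming74_encode(data: List[int]) -> List[int]:
    """Encode 4 data bits into a 7-bit codeword (positions 1..7)."""
    code = [0] * 8                      # index 0 unused, positions 1..7
    code[3], code[5], code[6], code[7] = data
    code[1] = code[3] ^ code[5] ^ code[7]   # covers positions 1,3,5,7
    code[2] = code[3] ^ code[6] ^ code[7]   # covers positions 2,3,6,7
    code[4] = code[5] ^ code[6] ^ code[7]   # covers positions 4,5,6,7
    return code[1:]

def hamming74_decode(word: List[int]) -> Tuple[List[int], int]:
    """Return (data bits, syndrome). A non-zero syndrome is the position
    of the single flipped bit, which is corrected before extraction."""
    code = [0] + list(word)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos
    if syndrome:
        code[syndrome] ^= 1             # flip the faulty bit back
    return [code[3], code[5], code[6], code[7]], syndrome

if __name__ == "__main__":
    word = hamming74_encode([1, 0, 1, 1])
    word[4] ^= 1                        # simulate a cosmic-ray bit flip
    print(hamming74_decode(word))       # data recovered, syndrome names the bit
```

A code like this corrects any single flipped bit per word; two flips in the same word would be miscorrected, which is why the point above still stands: the probability of corruption goes down, never to zero.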
Re:Corruption by smash · 2006-04-28 19:43 · Score: 1

Granted, for most people this may well be a non-issue, and data corruption is a fact of life.
However, when a CPU is KNOWN DEFECTIVE in a repeatable, data-corrupting way, it is the vendor's responsibility to replace/fix it.
Similar to vehicle recalls. Most people would never be affected by many of the things vehicles are recalled for, but that doesn't mean that known *serious* defects are simply let go.
smash.

--
I run: Windows, OS X, Linux, FreeBSD. Just because you have a hammer, doesn't mean everything is a nail.
Re:Corruption by dubl-u · 2006-04-28 19:55 · Score: 1

If it can't compute a result accurately, it should shut down rather than give a wrong answer.

Congress could learn a lot from this.
Re:Corruption by caspper69 · 2006-04-29 03:39 · Score: 2, Informative

Granted, for most people this may well be a non-issue, and data corruption is a fact of life. However, when a CPU is KNOWN DEFECTIVE in a repeatable, data-corrupting way, it is the vendor's responsibility to replace/fix it.

Similar to vehicle recalls. Most people would never be affected by many of the things vehicles are recalled for, but that doesn't mean that known *serious* defects are simply let go.

I have actually studied this bug, and it is only observed when the FPU code is iterated MILLIONS of times without ever executing another instruction (only a tight FPU loop). In addition, the environmental temperature must be high (think tropical). AMD has stated (1) that this problem has never been identified in actual production code (only a single benchmark in these environmental conditions); and (2) that they are identifying and replacing (for free) all affected CPUs. It is estimated that 2,000-3,000 chips have this particular defect (out of the millions shipped). Further, AMD has added an additional validation step to identify processors affected by this glitch, which will cause them to be pushed down to a lower speed grade (i.e. 2.8GHz affected CPUs will be sold as 2.6GHz parts), where this problem does not manifest itself.

I for one am happy that this story broke 2 days ago, and 1 day ago AMD had already figured out which CPU batches could potentially be affected, and is offering free replacements (without the customer complaining first). Now today it's on Slashdot. At least this isn't the F00F bug which Intel didn't tell anyone about until the public discovered it and raised hell. Further, the likelihood of data corruption caused by this glitch, even in FPU-heavy code, is extremely low, as there would be other non-FPU instructions executed in between in nearly every case (except extreme benchmarking-- i.e. the reason AMD discovered the problem in the first place).
Re:Corruption by afidel · 2006-04-29 18:42 · Score: 1

This is why the old HP MIPS CPUs were so cool: every memory area was ECC and all calculations were run on two cores. If the cores disagreed, they ran the calculation again; if they disagreed again, the CPU shut down and the operation was offloaded to another CPU in the machine.
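
The compare, retry, then fail-over scheme described above can be sketched roughly like this. It is a software model only, assuming each "CPU" is a pair of callables standing in for the two cores; the names `run_in_lockstep` and `attempts` are invented for the illustration and say nothing about how the actual hardware was built.

```python
from typing import Callable, Sequence, Tuple

Core = Callable[[float], float]

def run_in_lockstep(cpus: Sequence[Tuple[Core, Core]], x: float,
                    attempts: int = 2) -> float:
    """Run the operation on both cores of a CPU and accept the result
    only if they agree. After `attempts` disagreements the CPU is
    treated as failed and the work moves to the next CPU."""
    for core_a, core_b in cpus:
        for _ in range(attempts):
            result_a, result_b = core_a(x), core_b(x)
            if result_a == result_b:
                return result_a
        # Both attempts disagreed: "shut down" this CPU, try the next one.
    raise RuntimeError("no CPU produced two matching results")

if __name__ == "__main__":
    good = lambda v: v * v
    stuck = lambda v: v * v + 1          # a core with a persistent fault
    print(run_in_lockstep([(good, stuck), (good, good)], 3.0))
```

Note that this only catches faults where the two cores fail differently; a design bug that makes both cores give the same wrong answer passes straight through.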

--
There are 4 boxes to use in the defense of liberty: soap, ballot, jury, ammo. Use in that order. Starting now.

An old problem by AndrewStephens · 2006-04-28 17:53 · Score: 4, Informative

Something similar used to happen on very old processors, back in the day. If certain instructions were executed in tight loops, the chips would experience localised heating and eventually malfunction (sometimes with permanent damage).

I'm too young to remember the details (I think it goes back to the early eighties at least), but perhaps some of the elder gods that lurk around here might be able to supply more details.

--
sheep.horse - does not contain information on sheep or horses.

Re:An old problem by Alien+Being · 2006-04-28 18:18 · Score: 3, Funny

I used to burn out a lot of abacus beads.
Re:An old problem by Jerf · 2006-04-28 18:20 · Score: 4, Funny

Do not meddle in the affairs of the Elder Gods, for you are crunchy, and good with ketchup.
Re:An old problem by dhall · 2006-04-28 18:27 · Score: 1

Beads? Using our sexagesimal system, we didn't have the true concept of zero as a number!
Re:An old problem by Dadoo · 2006-04-28 18:38 · Score: 2, Informative

If certain instructions were executed in tight loops, the chips would experience localised heating and eventually malfunction (sometimes with permanent damage).

You're thinking about magnetic cores.

Whenever you reverse a core's magnetic field, its temperature rises a little. Keep reversing the field fast enough and for a long enough period of time, and the core (or maybe the wires running through it?) will melt, permanently damaging that bit.

--
Sit, Ubuntu, sit. Good dog.
Re:An old problem by Mister+Transistor · 2006-04-28 18:41 · Score: 5, Informative

You may be referring to the early MC6800 8-bit processors. The first ones had a major problem in that the internal registers were dynamic RAM style memory, and synchronized to the internal state clock. If you halted the processor for an extended period of time, the refresh clock to them ceased and the registers got hot, drew too much current and burned up!

I'm pretty sure that gave rise to the joke "Halt and Catch Fire"...

I always figured that if you were to burn out a register from overuse, it would be the carry bit ;)

Anyway, as to the story at hand, it sounds like this would a) only ever affect about 3000 processors in total - MAYBE, and b) only ever happen under an artificially contrived laboratory stress-test/benchmark situation. Any CPU running in a real system would a) have to do other things like service hardware interrupts, and b) wouldn't do something useless like perform a looping calculation without checking to see if it was done periodically. It really sounds like this is a big non-issue in reality.

--
-- You are in a maze of little, twisty passages, all different... --
Re:An old problem by AndrewStephens · 2006-04-28 18:52 · Score: 2, Insightful

I agree with your comments on the current story. In reality, all modern processors have flaws that only occur in extremely unlikely circumstances. This one is not any different.

--
sheep.horse - does not contain information on sheep or horses.
Re:An old problem by Mister+Transistor · 2006-04-28 19:05 · Score: 4, Insightful

I'll go you one better - I have formed my own personal postulate/theory/law that:

No sufficiently complex system can ever be completely bug-free.

and its corollary:

It is impossible to completely test a sufficiently complex system in every possible way to be certain that it's bug-free.

In that vein, someone once said "Foolproof is impossible because fools are so ingenious", and "As soon as an idiot-proof system is devised, they go and invent a better idiot!"

--
-- You are in a maze of little, twisty passages, all different... --
Re:An old problem by Warg!+The+Orcs!! · 2006-04-28 19:27 · Score: 1

Well I am no Elder God but I knew them when they were kids.

There used to be ways of programming certain early personal computers to make smoke come out of them. I think the BBC computer and the ZX80 were the main ones. The BBC was vulnerable (dredges memory) to a POKE command that would make it fall over and die howibbly

Prince Ludwig - You will die howibbly!
Blackadder - Howibbly?
Prince Ludwig - Howibbly howibbly

--
Travelling forward in time at a rate of 1 second per second.
Re:An old problem by smash · 2006-04-28 19:36 · Score: 1

Now, not having a go at the parent post, but if Intel were to release a statement like this, the /. community would be having a field day.
A defect that is known to give incorrect calculations is a serious issue that should be rectified via a microcode update, or by exchanging the CPU for free (if microcode can't fix it).
Intel got raked over the coals for the FDIV problem, and so should AMD unless they do the right thing and offer an exchange/free fix so that users get the functional CPU they intended to purchase.
smash.

--
I run: Windows, OS X, Linux, FreeBSD. Just because you have a hammer, doesn't mean everything is a nail.
Re:An old problem by dubl-u · 2006-04-28 19:53 · Score: 1

For those wondering: Jargon File and Wikipedia entries for HCF.
Re:An old problem by something_wicked_thi · 2006-04-28 20:07 · Score: 1

No sufficiently complex system can ever be completely bug-free.
Not to come off as being too sarcastic, but, wow! You came up with that all by yourself?
Seriously, though, that's one of the main tenets of software engineering. Hardware is no different. The phenomenon is well studied and it basically results from the fact that any change you make to a complex system has a certain likelihood of introducing new problems. Also, as you reduce the number of bugs, the complexity of the remaining bugs increases, thus requiring more complicated fixes. Usually, these factors result in a "steady state" number of bugs (that is, there is a point at which the number of bugs you are fixing is equal to the number of bugs you are adding by fixing them).
Therefore, once a system becomes sufficiently complex, it is impossible to eliminate all the bugs. The only way to fix more bugs, therefore, is to reduce complexity. There are other techniques that can be employed to reduce the number of bugs that are introduced per change (e.g. automated regression testing), but that just lowers the steady-state bug count.
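
To make the "steady state" idea concrete, here is a toy model. It is purely illustrative and rests on a made-up assumption: each fix removes one bug but introduces, on average, `k / bugs` new ones, so regressions become more likely as the remaining bugs get harder. The constant `k` and the function name are hypothetical and not taken from any study.

```python
def simulate_bug_count(initial_bugs: float, k: float, fixes: int) -> float:
    """Iterate the toy model bugs -> bugs - 1 + k / bugs.

    Each fix removes one bug and adds k / bugs expected regressions.
    The count settles where one fix adds one bug, i.e. at bugs == k.
    Assumes initial_bugs > 0 and k >= 1.
    """
    if initial_bugs <= 0 or k < 1:
        raise ValueError("need initial_bugs > 0 and k >= 1")
    bugs = float(initial_bugs)
    for _ in range(fixes):
        bugs = bugs - 1.0 + k / bugs
    return bugs

if __name__ == "__main__":
    # Same starting point, two levels of "complexity" k.
    print(simulate_bug_count(100, 20, 1000))
    print(simulate_bug_count(100, 5, 1000))
```

In this model the bug count never reaches zero; it levels off at `k`. Lowering `k` (simpler system, better regression testing) lowers the plateau, which matches the argument above.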
Re:An old problem by something_wicked_thi · 2006-04-28 20:14 · Score: 5, Informative

RTFA. They are offering a free replacement. However, the FDIV bug was overblown. For most people, it didn't matter (few people were using software that required division precise enough to be affected). This bug is even less worrisome. Its effect is, at the moment, completely unobserved in the wild using real world applications. The FDIV bug was apparent to anyone with a calculator.

I'm not saying AMD should be let off the hook completely, but the bug isn't a big problem, they are offering free replacements, and they publicized it. The FDIV bug was bigger (though still hardly catastrophic), refused (at first) to offer replacements, and they sat on it. The two scenarios are nowhere near similar. Maybe AMD just has more character than Intel, or maybe they were watching in 94/95 when the FDIV bug happened and they've actually learned from Intel's mistakes. Regardless, this whole story is more of a heads-up to concerned buyers than a criticism of AMD.
Re:An old problem by Bill+Dog · 2006-04-28 20:14 · Score: 1

Bet there won't be any "I am AMD of Borg, you will be approximated" jokes. After all, sacred cows and humor do not mix.

--
Attention zealots and haters: 00100 00100
Re:An old problem by Soul-Burn666 · 2006-04-28 20:25 · Score: 2, Interesting

Actually hardware IS different. As complex as hardware is, it is much less complex than software and has much simpler logic to check. This allows for systems for "formal verification" which happen to work exceedingly well for hardware. For example IBM's "RuleBase" is a system that uses temporal logic to verify a certain piece of "code" (which will later be compiled to hardware) against a set of logical rules.
When the system can be used, it helps clear out logic bugs very efficiently.

That being said, today's microprocessors are huge and therefore have to be split into modules in order to be tested like this. Moreover, it only tests logic. Other systems have to be used to test issues of overheating, cross-talk and actual physical design.

--
^_^
Re:An old problem by Tim+C · 2006-04-28 20:53 · Score: 1

Yes and no; I've never heard any such rumour about the ZX80 or the BBC Micro. I heard that POKEing a certain memory location on the Commodore Pet would cause it to burst into flames, but never saw it happen so can't confirm it. A quick google turned up this page, which has details about the Pet rumour and the BBC Micro one, but nothing about the ZX80.

--
It's official. Most of you are morons.
Re:An old problem by Warg!+The+Orcs!! · 2006-04-28 21:04 · Score: 1

swiftly researched...

It's an 'urban myth' which was made up about the BBC Micro. However, it was based on a true story about the PET - there was a location you could poke, to do with the graphics frequency, which if set wrong could drive the HT supply in the monitor way over-voltage, which would sometimes break down the transformer. This came up in the PCW magazine* after someone wrote "it is impossible to damage a computer with bad software".

So I've been urban-mythed again, eh? I apologise to the designers of the BBC Micro for perpetuating the notion that their machine would combust.

--
Travelling forward in time at a rate of 1 second per second.
Re:An old problem by Cicero382 · 2006-04-28 23:13 · Score: 1

Yup! This sort of thing happened, not with the processor itself, but with the core. For those of you who are too young to know (sigh! - ie most of you - I feel old) core memory was made up of tiny little rings of ferromagnetic material which was threaded with thin wires to energise the core element and another to sense its state. Err.. you might get a better explanation from http://en.wikipedia.org/wiki/Core_memory

Anyway, while I was at UMIST, we wrote a program to make the front panel lights flash in interesting ways on a Nova. Unfortunately, we had a hot loop and managed to damage the core. We were in serious trouble. Fortunately, the Dean was quite impressed with the programming - so we didn't have to pay for it.

So, it's hardly a new problem.

BTW. Did anyone else use a radio to listen to the uP to detect tight loops etc?
Re:An old problem by myowntrueself · 2006-04-28 23:57 · Score: 2, Funny

"Keep reversing the field fast enough and for a long enough period of time, and the core (or maybe the wires running through it?) will melt"

Shit, so for once reversing the polarity does more harm than good?

--
In the free world the media isn't government run; the government is media run.
Re:An old problem by AWhistler · 2006-04-29 00:50 · Score: 3, Insightful

There is a HUGE difference between this AMD problem and the FDIV bug. The FDIV bug, once found, was one of those "1,2, BANG" bugs (do step 1, then step 2 and BANG, the bug is there). With this AMD bug, you have to do the same operation many times before you see the problem, and then the problem is random (only if it overheats enough). Another possible solution to this is to use better heat sinks. This AMD problem isn't 1,2,BANG. Bugs that are of this nature are orders of magnitude harder to find and characterize.

But you're right, since Intel blundered so badly on their handling of the FDIV bug, everyone else learned from it.
Re:An old problem by Sebastopol · 2006-04-29 03:56 · Score: 1

What exactly do you mean "blundered badly"?

It is a textbook case in many MBA programs how _WELL_ Intel handled this.

They recalled EVERY CPU at their own expense of millions of dollars. Managing the recall, the disposal, the resupply, the competition, AND the PR nightmare was handled so well that this incident has become canon for MBA candidates.

--
https://www.accountkiller.com/removal-requested
Re:An old problem by AWhistler · 2006-04-29 05:05 · Score: 2, Interesting

Then you REALLY need to get new MBA textbooks, since the one you have been reading is too politically correct to be useful. Here is a link from the guy who discovered the bug which includes a timeline (I can't believe his FAQ is still online!)...

http://www.trnicely.net/pentbug/pentbug.html/

Pay close attention to questions 9, 10, and 11. It explains what REALLY happened, and the author's opinions on the matter, which to my memory are quite accurate. How do I know? At the time I owned a Gateway Pentium 90 that I could use the Windows calculator on to verify the bug. So once Intel announced the recall, I called to get mine replaced. The box they shipped the replacement in was about 12" x 12" x 8" and very overpadded...way larger than the boxes they ship processors in now for sale. I had to replace my Pentium, box it up and send it back to Intel, and I had to give them my credit card number so that they could bill me $500 ($600?) if I failed to return the old CPU.
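
For the curious, the calculator check looked roughly like the sketch below. It uses the operand pair that was widely circulated at the time (4195835 and 3145727); a flawed Pentium was reported to give a residual of 256 instead of zero. The tolerance is an arbitrary choice for the example, and on a modern interpreter the division may not even go through the x87 FDIV instruction, so treat this as a historical illustration rather than a real diagnostic.

```python
def fdiv_residual(x: float = 4195835.0, y: float = 3145727.0) -> float:
    """Return x - (x / y) * y, which should be zero up to rounding."""
    return x - (x / y) * y

def looks_flawed(tolerance: float = 1e-3) -> bool:
    """True if the residual is far larger than ordinary rounding error."""
    return abs(fdiv_residual()) > tolerance

if __name__ == "__main__":
    print("residual:", fdiv_residual())
    print("FDIV-style error detected:", looks_flawed())
```

The point is that the check is deterministic: the same two numbers give the same wrong answer every time on an affected chip, which is exactly what made it so easy to demonstrate.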
Re:An old problem by something_wicked_thi · 2006-04-29 05:14 · Score: 1

I think you've missed the point a little. The FDIV bug didn't really harm Intel long-term; it was mostly a lot of hype and not a large problem. I don't really know why MBAs would learn from it, unless they have a course in handling overblown non-problems, in which case Intel handled it very poorly, because the problem got hyped quite a bit. My father, who was buying a computer at the time, was considering purchasing a 486 instead just because he had heard about the bug.

The problem is that most people don't like thinking that parts of their computers are fallible. A bug like the FDIV bug calls that faith into jeopardy. That, along with Intel's original downplaying of the bug, is what caused the hype. People were afraid of getting a bad CPU and not being able to get it replaced as Intel was saying it wasn't a big enough problem to warrant a replacement. Intel was probably right, but it doesn't change the fact that people really want their computers to give correct answers all the time.

Oh, and before you start trying to explain what CPU steppings are, I'll mention that most errata fixed in the various CPU steppings are very minor, even compared to the FDIV bug and this bug.
Re:An old problem by jizmonkey · 2006-04-29 05:20 · Score: 1

What exactly do you mean "blundered badly"? It is a textbook case in many MBA programs how _WELL_ Intel handled this. They recalled EVERY CPU at their own expense of millions of dollars. Managing the recall, the disposal, the resupply, the competition, AND the PR nightmare was handled so well that this incident has become canon for MBA candidates.
My lord, this reinforces just about every stereotype of b-school students I developed while living in Schwab.
First Intel refused to replace the chips, except for certain classes of customers who really needed accurate results (in the sole opinion of Intel). Then there was a PR war with companies like IBM piling on against Intel to look good to their own customers. There were ridiculous press releases from all companies estimating how often the average user would come across the bug, and whether or not that meant the user had to worry about it. Everyday software programs started to include options to disable the FPU to avoid the bug, and those options are still present in the programs today. It became a total media circus because Intel fucked up its response. To kill the furor, Intel agreed to replace anyone's broken CPU. They took a big charge, and it went away (as you said). Aside from the lingering bad PR (people still remember the incident, unlike all the other errata that were in early CPUs), Intel took a much bigger charge (billions, I think) than it had to by letting its CPU bug get on the evening news. If Intel had rolled over in the very beginning, its obstinacy wouldn't have made the news, and only people who really cared would have learned about the bug and cared enough to replace their chip.
You might compare this to the Nintendo Famicom recall. If you don't remember that motherboard issue, it's probably because Nintendo handled it much better even though it was essentially the same general kind of issue.

--
With great power comes great fan noise.
Re:An old problem by something_wicked_thi · 2006-04-29 05:25 · Score: 1

Great link (note that the link doesn't work as-is; the terminating slash must be removed).
I'd never seen that before, but it seems that the original discoverer thinks pretty much as I do: that the original bug was overblown. However, the last sentence in there is interesting:

In several of these instances, I had no reason to be suspicious of the result except that a second machine produced a different result.

With the FDIV bug, in an all-pentium environment, he wouldn't notice that difference. The AMD bug would be noticed because, even if both machines overheated, they'd likely produce different (random) results. Thus, he provides yet another reason why this bug isn't as severe as the FDIV bug.
Re:An old problem by AWhistler · 2006-04-29 05:48 · Score: 1

This is what made the FDIV bug so insidious. If you only bought one brand of computer in your lab (or in your home), which even today is a very common practice, you would always get the same results, never thinking anything is wrong. If you published your results (financial, scientific) or were responsible for something critical (space launches, health care), you could lose millions of dollars, any scientific credibility, expensive equipment, or even cause people's deaths. Even if the possibility of this error even mattering was remote, the uneasiness of whether your computer was reliable is too great for most people to tolerate.

In order to work around the FDIV bug, you had to rewrite your software for a Mac, or run it on older PCs, or disable the FPU. What is the point of buying the fastest PC of the day if you have to rerun it again on slower PCs, or run on another platform? Why not just use the slower PC or the other platform all the time? This is the crux of why most people wanted their FDIV bug fixed. It didn't matter that it was a minuscule problem. It's that you couldn't rely on your PC.

At least with the AMD problem you can run the same software on two AMD PCs of the same brand (again, a very common practice) at the same time and compare the results. If something starts diverging, then you know something is wrong. Run it on three PCs at the same time, and you can overcome any discrepancies...just let the erroneous PC rest a while before resuming. Better yet...use good coding practice and check the results along the way (I'm not completely sure what "checking" is required to avoid the problem).
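
The three-PC idea is plain majority voting. A rough sketch, assuming each machine hands back one result for the same input and that a corrupted result is unlikely to match another machine's result by accident (the function name `majority_vote` is made up for this example):

```python
from collections import Counter
from typing import List, Sequence, Tuple

def majority_vote(results: Sequence[float]) -> Tuple[float, List[int]]:
    """Return (agreed value, indexes of machines that disagreed).

    Raises ValueError when no value is reported by a strict majority,
    in which case the run cannot be trusted at all.
    """
    value, count = Counter(results).most_common(1)[0]
    if count * 2 <= len(results):
        raise ValueError("no majority; rerun the calculation")
    dissenters = [i for i, r in enumerate(results) if r != value]
    return value, dissenters

if __name__ == "__main__":
    # Machine 2 overheated and produced garbage.
    print(majority_vote([1.3338204491, 1.3338204491, 1.3337390689]))
```

The dissenter list tells you which box to let "rest a while". As noted above, this works for the AMD problem because the errors are random; it would not have helped with three identical FDIV-affected Pentiums, which would all have agreed on the same wrong answer.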
Re:An old problem by assassinator42 · 2006-04-29 07:44 · Score: 1

I never knew about this. If I have a processor with this bug, can I still get it replaced? ;)
Re:An old problem by I+Like+Pudding · 2006-04-29 10:03 · Score: 1

Isn't that all just a special case of Godel's incompleteness theorem?
Re:An old problem by Gleenie · 2006-04-29 12:45 · Score: 1

You are indeed correct, though about 70 years late. :( Alan Turing proved that theorem way back before the war.

--
-- Your mother uses Emacs.
Re:An old problem by LurkerXXX · 2006-04-29 14:02 · Score: 2, Interesting

What the hell kind of crappy MBA program did you go to? Intel did *NOT* handle it well. I had one of those CPUs. Intel tried to tell me (a scientific researcher) that my computations didn't need that level of FPU accuracy, and that they wouldn't replace it. It was only after we, the users, screamed bloody murder and brought lawsuits that they decided to back down and replace them all.
The PR nightmare was *caused* specifically by the way Intel handled the discovery. They thought that they had the right to decide which users did or didn't 'need' accurate FPU computations.
I have been an 'AMD fanboy' from that day forward, specifically because of Intel's totally botched handling of an engineering glitch.
Re:An old problem by QuantumFTL · 2006-04-29 21:17 · Score: 1

No sufficiently complex system can ever be completely bug-free.

and its corollary:

It is impossible to completely test a sufficiently complex system in every possible way to be certain that it's bug-free.
This is why we have automated reasoning systems, theorem provers, etc. They allow us to reduce the set of all possible states down to a set of orthogonal equivalence classes, only one example from each of which needs to actually be tested.

Now, of course, at some point non-ideal physical characteristics can interfere with this, along with incomplete knowledge from noisy measuring devices. Of course one could also say that "No sufficiently complex system can ever be completely bug-free." is not true, rather that "No sufficiently complex system can ever be completely bug-free with high probability," or that "No sufficiently complex system can ever be completely bug-free and also known to be bug free with 100% certainty."

Ahh, the hair splitting that is possible with such nebulous concepts as "complexity."
Re:An old problem by Criton · 2006-04-30 07:26 · Score: 1

The FDIV bug was far worse than the AMD heating bug, and at least AMD is offering free replacements right off, versus having to be sued into offering free replacements like another chip company that had the FDIV bug. Also, this requires a tight repeated loop; the FDIV bug gave an error right off. The FDIV bug was not harmless: if the chip was used in a guidance system, the FDIV bug could cause something to go far off course, and if you were using an early Pentium in scientific research, the cumulative errors could ruin your data.
Re:An old problem by Mr+Z · 2006-04-30 10:10 · Score: 1

I believe it was one of the Data General Nova series that effectively had a Halt and Catch Fire instruction. It went something like this (memory may be faulty): Reading a 0 took 3 times as much energy as reading a 1 on those machines. The Jump-Indirect instruction was encoded as all zeros (or nearly all zeros). If you put JMP @0 at location 0 and branched to it, you'd repeatedly fetch zeros at location 0 at the fastest rate the machine could manage.

*poof*

(The reason the DG burned 3x the energy for 0 is that it apparently sensed 0s by writing 1s and looking to see if that changed the value by looking for a pulse on the sense line. Writes are destructive. If it turned out 0, it'd have to rewrite the 0 afterwards.)
--Joe

--
Program Intellivision!
Re:An old problem by Mr+Z · 2006-04-30 11:35 · Score: 1

Might I suggest an experiment? Go out to your car, and switch the two wires going to the battery, and see how that works out. :-)
--Joe

--
Program Intellivision!
Re:An old problem by Listen+Up · 2006-04-30 15:10 · Score: 1

Your ideas as a whole are false, but are on the correct track. Using words such as "impossible", "sufficiently" and "complex" can be highly ambiguous. What is true is that the universe operates Mathematically. As such, a much more accurate rephrasing of your ideas could say:

"Given a sufficiently computationally complex system, although complete accuracy can be calculated, it is often difficult to reach high levels of accuracy within reasonable periods of time. Therefore assumptions and other shortcuts may be used which can give rise to possible errors and omissions in testing".

There is a fundamental difference between the fact that everything in the universe can be calculated and whether or not such calculations will actually yield improved results given the problem that is being solved.

In this case, AMD and other manufacturers test the issues which they feel are the most important to successfully bring their product to market. They do not fully test for issues which may or may not improve their product but would take extra time they cannot afford, since in the end the testing may cause them to miss a product deadline or announcement window, or otherwise affect their time to market.

Note: The Heisenberg Uncertainty Principle -only- applies to subatomic phenomena, not macroscopic physical phenomena. As the problem being discussed above is macroscopic in nature, the Heisenberg Uncertainty Principle is irrelevant to this discussion. As well, while there is empirical evidence supporting the Heisenberg Uncertainty Principle, there are mathematical constructs which can be created to calculate beyond the time-energy uncertainty relation. Therefore, there is a debate as to whether the Heisenberg Uncertainty Principle is completely correct.

Fearmongering? by zaguar · 2006-04-28 17:54 · Score: 2, Interesting

Is it reasonable to be afraid of this? To exploit this in a way that allows running arbitrary code, you would need a buffer overflow, which is what this AMD weakness is purported to allow. However, how many are affected? Only a few of the AMD chips, and AMD has only, what, 30% of the market. So to code an exploit, you would be writing for a very limited audience, to the point where it is futile. Why not just exploit the latest createTextRange or WMF exploit in IE/Windows? Much more money in that.

--
"Sure there's porn and piracy on the Web but there's probably a downside too."

Re:Fearmongering? by Saven+Marek · 2006-04-28 18:09 · Score: 2, Insightful

> Only a few of the AMD chips, and AMD has only what, 30% of the market.

The intel fanboys have been too noisy lately! AMD has more than 50% of the market since this year already!
Re:Fearmongering? by andy_t_roo · 2006-04-28 23:10 · Score: 1

do you have a reference for that figure? (for the record i just bought an AMD)

Re:Kernel fix? by Umbral+Blot · 2006-04-28 18:00 · Score: 4, Insightful

The big question is will someone write malware/virus to somehow take advantage of this flaw?

I am curious how a virus could possibly exploit this. It would have to a) hog the resources so that it ran nearly exclusively, which would mean the virus already had control, and b) somehow cause a floating point error to result in a privilege error (privilege and security routines rarely use floating point numbers). Also, why would a kernel patch be released for this? It would hurt performance for the rest of us; customers with defective chips should simply return and replace them.

--

Philosophy.

having this happen by mikesd81 · 2006-04-28 18:01 · Score: 1

> To trigger the effect, the loop has to be run millions of times, an AMD customer source told Reg Hardware, potentially for hours at a time with no other operations being introduced during the run.

A flaw is a flaw, no doubt. However, how likely is this particular scenario to happen other than in a benchmark test? And 3000 CPUs? Given the news lately, this almost categorizes as an oopsy. Security firms are losing millions of customers' SSNs and everything. AMD could probably tell you how to identify the CPU and afford to set up an exchange program.

--
That which does not kill me only postpones the inevitable.

Sounds familiar by swansontec · 2006-04-28 18:02 · Score: 1

Memory fetch, multiplication, addition... where have I heard this before? Oh, I know. 3D graphics. Typically, those results go right to the screen and don't cause much damage if they get corrupted. I would be more worried about video or audio encoding, though, since those results do make a difference. Otherwise, I can't think of much else that would trigger this bug.

Re:Sounds familiar by smash · 2006-04-28 19:32 · Score: 1

Business spreadsheets (price = cost + (cost*markup%))? Scientific modelling?
There, that wasn't so hard to think of?
smash.

--
I run: Windows, OS X, Linux, FreeBSD. Just because you have a hammer, doesn't mean everything is a nail.
Re:Sounds familiar by corychristison · 2006-04-28 19:33 · Score: 1

This actually happened to a customer of mine.
He's a heavy gamer... for the longest time we thought it was something like the power supply or maybe the RAM... turns out it was the processor.
Long story short, we replaced the processor and I haven't heard any complaints yet.
Re:Sounds familiar by swansontec · 2006-04-28 19:40 · Score: 1

It would only work if your spreadsheet had a few million cells in it. Now that you mention scientific modelling, though, I feel stupid. That is probably the single biggest user of repetitive floating-point operations around.
Re:Sounds familiar by smash · 2006-04-28 20:08 · Score: 4, Informative

Hmm.... I doubt you'd need a few million cells though.
Some of the tendering spreadsheets i've seen for a few companies i've worked for have had quite a lot of calculation going on in them - change a few cells that others depend on that have others depending on them, etc.... do that all day, it adds up quick.
You only need 1 of those operations in that instance to screw up and you could be down a few million dollars, if it's not picked up.
Even forgetting that it's just the moral thing to do...Risk vs replacement cost = no brainer. If only 3000 cpus are affected at say $300 each for amd to sell retail (i'm sure their cost is FAR less), they'd be mad not to just do it (maybe even offer a free speed bump) and reap the positive PR.
All it needs is for ONE company to blame a budget blowout on them and it's well and truly paid for...
smash.
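
The risk-versus-replacement arithmetic above can be written out as a quick sketch. The 3000 chips and the $300 retail price are the figures from this post; the function name is only illustrative, and AMD's real cost per chip is presumably lower, so this is an upper bound.

```python
def worst_case_replacement_cost(affected_chips, retail_price_usd):
    """Upper bound on the bill if every affected chip is replaced at retail price."""
    return affected_chips * retail_price_usd


# Figures taken from the post: ~3000 affected CPUs at ~$300 each.
cost = worst_case_replacement_cost(3000, 300)
print(f"Worst-case replacement cost: ${cost:,}")  # Worst-case replacement cost: $900,000
```

Under those assumptions the whole recall costs less than the "few million dollars" a single bad tender could lose.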

--
I run: Windows, OS X, Linux, FreeBSD. Just because you have a hammer, doesn't mean everything is a nail.
Re:Sounds familiar by caspper69 · 2006-04-29 03:54 · Score: 1

> Hmm.... I doubt you'd need a few million cells though. Some of the tendering spreadsheets i've seen for a few companies i've worked for have had quite a lot of calculation going on in them - change a few cells that others depend on that have others depending on them, etc.... do that all day, it adds up quick.

> You only need 1 of those operations in that instance to screw up and you could be down a few million dollars, if it's not picked up.

> Even forgetting that it's just the moral thing to do...Risk vs replacement cost = no brainer. If only 3000 cpus are affected at say $300 each for amd to sell retail (i'm sure their cost is FAR less), they'd be mad not to just do it (maybe even offer a free speed bump) and reap the positive PR.

> All it needs is for ONE company to blame a budget blowout on them and it's well and truly paid for...

> smash.

This could never happen in an Excel spreadsheet. If you'd bother to read the actual AMD errata, first, you need extremely high environmental temperatures. I don't know about where you work, but most people doing spreadsheets are not doing so in a 90+ degree office with high humidity. Second, the FPU code needs to run UNINTERRUPTED (i.e. no checking of counters, no checking of results, no non-FPU instructions, period, for millions upon millions of iterations of a tight loop). This could never happen on Windows (or Linux for that matter). The scheduler would kick in at its given time slice and most certainly execute instructions other than those in the tight FPU loop, which would give sufficient time for this localized heat buildup to dissipate. This is a theoretical bug caused by running a tight FPU loop directly on the bare metal (no OS) in very high environmental temperatures. Just how often is an end user going to be executing code in this scenario? My best guess is with a distributed system where CPUs can be set to run and only return at the end (i.e. no preemption or time-slicing). Oh yeah, and the designer of this system would have to put in inadequate cooling.

Seems not only highly unlikely, but given that only 2-3,000 chips are affected, just what would be the real world odds of this error actually manifesting itself?
Re:Sounds familiar by Slithe · 2006-04-29 12:49 · Score: 1

> Seems not only highly unlikely, but given that only 2-3,000 chips are affected, just what would be the real world odds of this error actually manifesting itself?

If some typewriter-monkey at a major corporation enters some customer data incorrectly, and if the negative PR from their mistake could cost the company MEGA-BUCKS, they will look for a scape-goat, and AMD will be a good choice.

--
---- "XML is like violence. If it doesn't fix the problem, you aren't using enough."

Obligatory by suv4x4 · 2006-04-28 18:03 · Score: 1

loop through a series of memory-fetch, multiplication and addition operations without any condition checks on the result of the calculations

I've been saying that for ages, check your results, but naah! Them young'uns and their series of memory-fetch, multiplication and addition operations.

Bug or chip quality? by daybot · 2006-04-28 18:05 · Score: 1

So is it a flaw in the design or simply a few high temp FPU batches that cook when hot? Do those outside the 3K "affected" chips have the same inherent flaws but simply don't get so hot?

It's not the first time their server chips have experienced heat problems...

Uh oh.. by BigZaphod · 2006-04-28 18:11 · Score: 5, Funny

Wow! AMD has invented a way to crash an infinite loop! Awesome! Intel? I bet their solution will take twice as long to crash this loop:

10 PRINT "HELLO WORLD"
20 GOTO 10

AMD is always innovating.

--
Hexy - a strategy game for iPhone/iPod Touch

Re:Uh oh.. by ultranova · 2006-04-28 19:40 · Score: 1

Wow! AMD has invented a way to crash an infinite loop! Awesome! Intel? I bet their solution will take twice as long to crash this loop:

10 PRINT "HELLO WORLD"
20 GOTO 10

This loop won't crash. Memory fetch, addition and multiplication, remember ? So you'd need something like this:

10 I = 10
20 K = I
30 K = K + 2
40 K = K * 2
50 GOTO 20

--
Forget magic. Any technology distinguishable from divine power is insufficiently advanced.
Re:Uh oh.. by JollyFinn · 2006-04-29 01:43 · Score: 2, Interesting

No, that won't crash: it's a FLOATING POINT memory fetch, addition and multiplication loop! We also need to unroll the loop enough to hide the floating-point unit latency, so that it stays active.

10 I = 10.1
20 K = I
21 K2 = I
22 K3 = I
23 K4 = I
30 K = K + 2.1
40 K = K * 2.1
50 K2 = K2 + 2.1
60 K2 = K2 * 2.1
70 K3 = K3 + 2.1
80 K3 = K3 * 2.1
90 K4 = K4 + 2.1
100 K4 = K4 * 2.1
110 GOTO 20
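
For anyone who wants to poke at the shape of that loop outside BASIC, here is a rough Python sketch of the same idea: four independent add/multiply chains that are reloaded from the same value on every pass, like the GOTO 20 above. The function name and the iteration cap are assumptions for illustration (the original loop never terminates), and interpreter overhead means this will not keep an FPU anywhere near saturated. It only shows the structure.

```python
def fp_kernel(iterations):
    """Four independent fetch/add/multiply chains, reloaded on every pass."""
    i = 10.1
    k1 = k2 = k3 = k4 = i
    for _ in range(iterations):
        # "Memory fetch": reload every chain from I, as the BASIC GOTO 20 does.
        k1 = k2 = k3 = k4 = i
        # Independent chains, so one chain's latency can overlap with another's.
        k1 = (k1 + 2.1) * 2.1
        k2 = (k2 + 2.1) * 2.1
        k3 = (k3 + 2.1) * 2.1
        k4 = (k4 + 2.1) * 2.1
    return k1, k2, k3, k4


print(fp_kernel(1_000_000))
```

Note that the loop counter here is exactly the kind of non-FPU work the reported failure condition excludes, so a loop like this is an illustration rather than a reproduction.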

--
Emacs is good operating system, but it has one flaw: Its text editor could be better.

Re:Kernel fix? by Alien+Being · 2006-04-28 18:12 · Score: 1

"...customers with defective chips should simply return and replace them."

Simple for whom? It can be a real pain in the ass to swap a CPU.

AMD says that from now on, chips that have this problem will be rerated to lower clock speeds. It would be nice if they offered customers the option of turning down the clockspeed in exchange for a partial refund.

Deja Vu: Intel Processor's Bug in 1994 by reporter · 2006-04-28 18:23 · Score: 3, Insightful

In 1994, Intel's Pentium processor suffered from a division error. Intel handled the problem by initially requiring customers to "prove" that the error caused a serious impact on the customers' lives before Intel would agree to replace the defective chips. Later, after much pressure and lost credibility, Intel agreed to replace all the defective chips without requiring the customer to "prove" his case.

AMD has a unique opportunity to do the right thing: offering to replace all the defective chips. If AMD does the right thing, then it will only help AMD in its litigation against Intel and in various attempts to increase marketshare. After all, would you not prefer to buy from a reputable company instead of a dishonest, shifty company?

Re:Deja Vu: Intel Processor's Bug in 1994 by Anonymous Coward · 2006-04-28 18:28 · Score: 4, Informative

"The company is also working with OEMs to identify affected parts and contact customers who could be affected - if they are, they will be offered free replacements."

Fourth paragraph in TFA.

nice! by B3ryllium · 2006-04-28 18:28 · Score: 3, Interesting

Wow, that was fast. FreeBSD already has a patch for this.

Judging from the posting date, I *really* need to be updating my sources more often. :)

20060419: p7 FreeBSD-SA-06:14.fpu
Correct a local information leakage bug affecting AMD FPUs.

(could be an unrelated correction, I guess, it doesn't provide much more information in /usr/src/UPDATING)

Re:nice! by B3ryllium · 2006-04-28 18:43 · Score: 1

Ah, I believe I may be incorrect - the longer description sounds like an unrelated FPU bug:

FXSAVE and FXRSTOR
Re:nice! by larry+bagina · 2006-04-28 18:44 · Score: 3, Informative

it is an unrelated correction:
...As a result of this discrepancy remaining unnoticed until now, the FreeBSD kernel does not restore the contents of the FOP, FIP, and FDP registers between context switches.
source

--
Do you even lift?
These aren't the 'roids you're looking for.
Re:nice! by B3ryllium · 2006-04-28 18:48 · Score: 1

Caught that already. Sorry for the disinformation. :)

Re:Deja Vu: Intel Processor's Bug in 1994 by cowbutt · 2006-04-28 18:34 · Score: 1

Later, after much pressure and lost credibility, Intel agreed to replace all the defective chips without requiring the customer to "prove" his case.

AMD has a unique opportunity to do the right thing: offering to replace all the defective chips. If AMD does the right thing, then it will only help AMD in its litigation against Intel and in various attempts to increase marketshare. After all, would you not prefer to buy from a reputable company instead of a dishonest, shifty company?

AMD have probably learnt from Intel's PR disaster. Without Intel's FP bug and the precedent it set, there's a good chance AMD would attempt to handle this problem the same way Intel did theirs. Business is business. An indication of this is that every component in the computer you're using probably has both documented and undocumented errata, and recalls are pretty unusual events.

Re:Kernel fix? by Crazyscottie · 2006-04-28 18:38 · Score: 1

In theory, a malicious user could exploit this vulnerability in a routine that's already calling that particular series of instructions. In practice, however, I think it would be nearly impossible to do anything useful, because you'll have no control over the values being written to memory; you'll just know that they aren't correct.

--
Just because it can't be explained doesn't mean it isn't true. Science fits into reality... not the other way around.

Re:Kernel fix? by larry+bagina · 2006-04-28 18:39 · Score: 5, Informative

I'm sure someone will have a kernel patch to prevent this from happening in linux in very short order.

Not likely. This is valid user code that is being executed. On other CPUs, the same code wouldn't cause a problem. Something like the F00F bug is fixable in the kernel by mucking with the exception handler. This is pure user-land code.

--
Do you even lift?

These aren't the 'roids you're looking for.

It's like you're overclocking when you're not by IvyMike · 2006-04-28 18:50 · Score: 4, Insightful

This is different than the Intel bug; that was a logic flaw, where the chip computed a floating point quantity using an incorrect algorithm. This is an implementation error. In fact, the article mentions that they're going to re-spec the parts and they'll be fine. So if you've got a 2.8GHz part, and you run this loop at 2.8GHz (within the old spec), it's like you're "overclocking" (because you're actually outside of AMD's new spec). My guess is that if you over-bought your heatsink and got something better than the stock OEM cooling solution, you would be fine even if you ran this loop all day. Yay, arctic silver!

Re:Deja Vu: Intel Processor's Bug in 1994 by mojotooth · 2006-04-28 18:54 · Score: 1, Insightful

Jesus. The things that people attribute to AMD's "moral superiority" here on Slashdot... It's astounding.

If AMD does "the right thing" it won't be because of a moral high road. It's because Intel already stepped on a similar PR landmine long ago. Learning from your rival's huge mistakes is not worth high praise. It's just common sense.

--
-- Mojo Tooth : exploring our world as only an idiot can.

If my chip is overheating... by Ethan+Allison · 2006-04-28 18:59 · Score: 1

I sure wouldn't want to find out which spots were hotter than others... touching an overheated chip that much would hurt! (Plus, if it gets too hot the CMOS should kill the machine anyway)

--
I make websites and stuff. Buy one.

Mod parent up please by btarval · 2006-04-28 19:03 · Score: 1

Agreed; the GP doesn't understand the problem. At best, you might modify gcc; but I suspect that might be a pain, considering it's such a limited problem (according to the rumor mentioned in TFA).

There's no way the kernel can do anything about it, from the description of the problem.

And, contrary to AMD's attempts to downplay this issue, there are two immediate areas that I can think of which are affected. The first are certain scientific calculations (even worse, those involving Beowulf clusters). The second are CAD simulations.

Both areas can involve calculations which run for days at a time; far in excess of the hours mentioned in the fine article.

In general, people don't really seem to pay much attention to either the reliability of the CPU or the quality of the RAM though. Witness the number of really cheap systems that people buy for this type of work. Perhaps this will be a "heads up" that yes, even the most basic subsystem of your computer can go haywire, skewing your results, and wasting your time.

--
The best way to predict the future is to create it. - Peter Drucker.

Re:Mod parent up please by something_wicked_thi · 2006-04-28 19:34 · Score: 1

From what I read, you don't really understand the problem, either.

Even scientific applications are probably not going to be affected by this because even checking the counter on a for loop seems to provide enough of a break to let the FPU cool off. They also say that no applications that were tested exhibited the problem, and I'd say it's pretty likely the first thing they thought of when they tried applications were heavy floating point applications (scientific apps, GIMPS, etc.).

Maybe you'd see the problem on some really tight FPU code, like something in GIMPS during the torture test, but I doubt it. I think the only way you'd manage to get this problem is if you wrote code specifically to do it.

That said, I agree with your thesis: people really need to be aware that computers can fail for *no good reason*. There are techniques to correct that (e.g. quorum sensing), and any large-scale cluster that requires accuracy should be using such techniques, anyway, as MTTF is inversely related to the number of components being utilized.
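
A minimal sketch of the redundancy idea, assuming "quorum" here means a simple majority vote over repeated runs of the same task (ideally on different machines). The helper names are hypothetical, and a real cluster would also need to decide what to do when no majority exists.

```python
from collections import Counter


def quorum_result(results):
    """Return the value reported by a strict majority of redundant runs."""
    if not results:
        raise ValueError("no results to vote on")
    value, count = Counter(results).most_common(1)[0]
    if count * 2 <= len(results):
        raise ValueError("no majority among redundant results")
    return value


def run_redundantly(task, replicas=3):
    """Run the same task several times and keep the majority answer."""
    return quorum_result([task() for _ in range(replicas)])


# One replica returned a corrupted value; the other two outvote it.
print(quorum_result([42.0, 42.0, 41.5]))  # 42.0
```

The cost is obvious: three replicas means three times the compute, which is why people skip it until a bad result bites them.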
Re:Mod parent up please by btarval · 2006-04-29 02:11 · Score: 1

Ummm, I take it you've never had to optimize FP code, have you? Nor have you even looked at code in this category.
For the time-intensive calculations, people actually spend a lot of time optimizing the code. First they put it into assembly, and then they pore over every single assembly statement. You see, a tiny efficiency tweak saving X number of cycles does indeed add up if you're running it for days or weeks at a time.
This is why I mentioned that mods to gcc might be a solution, but I doubt any changes there would happen. Perhaps I should've been more clear, but I thought that was self-explanatory.

--
The best way to predict the future is to create it. - Peter Drucker.

Overclocking ... by AHumbleOpinion · 2006-04-28 19:05 · Score: 1

> AMD says that from now on, chips that have this problem will be rerated to lower clock speeds.

And then end users will overclock these CPUs ...

Fearmongering? No, you misunderstand ... by AHumbleOpinion · 2006-04-28 19:09 · Score: 2, Informative

> Is it reasonable to be afraid of this? To exploit this in a way that allows running arbitrary code, you would need a buffer overflow

I think you are misunderstanding the nature of the problem. This is not data corruption as in buffer overflow; this is data corruption as in the calculation comes up with an incorrect answer. For some people that is not acceptable.

Re:Fearmongering? No, you misunderstand ... by larry+bagina · 2006-04-28 19:58 · Score: 2, Funny

yeah, 1.0 + 1.0 = 3.0, for sufficiently large values of 1.0

--
Do you even lift?
These aren't the 'roids you're looking for.

Re:Kernel fix? by Lucractius · 2006-04-28 19:10 · Score: 1

Yes i have a faulty cpu..

Turn its clock down, right, yep done that.

So now I'll never be affected by this obscure glitch that is almost totally unreproducible outside of synthetic testing, oh thanks very much.

can i have the check now please ?

*check arrives*
*cashes check*
*clocks cpus back up*

--
XML - A clever joke would be here if /. didn't mangle tag brackets.

Re:An old problem.. now used to fight the overlords by __aaijsn7246 · 2006-04-28 19:14 · Score: 1

These flaws only occur in unlikely circumstances, but they will be useful tools when fighting our new computer overlords.

Could be worse by Khith · 2006-04-28 19:16 · Score: 2, Funny

Just imagine if you had one of those Pinnacle chips and accidentally pressed @[=g3,8d]\&fbb=-q]/hk%fg followed by delete..

Prime95 as a detection tool? by Antiocheian · 2006-04-28 19:20 · Score: 2, Informative

I have used Prime95 in the past to identify problematic configurations. It's a tool whose main goal is to find prime numbers, but it can be used as an excellent stress test for the processor and memory units.

Could Prime95 be used to identify those AMD chips?
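
The general principle behind using a number-cruncher as a detection tool can be sketched like this: repeat a deterministic floating-point workload and flag any run whose result differs from a reference. This is only an illustration of the idea, not Prime95's actual method, and the workload and function names are made up for the example. Whether a check like this would ever hit the specific condition AMD describes is a separate question.

```python
import math


def fp_workload(n):
    """A deterministic floating-point workload: the same input gives the same output."""
    acc = 0.0
    for i in range(1, n + 1):
        acc += math.sqrt(i) * 1.000001
    return acc


def consistency_check(rounds, n):
    """Repeat the workload and report whether every run matches the first one."""
    reference = fp_workload(n)
    return all(fp_workload(n) == reference for _ in range(rounds))


print(consistency_check(rounds=5, n=100_000))
```

On healthy hardware every run should match bit for bit, so any mismatch points at the machine rather than the math.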

Re:Prime95 as a detection tool? by PatrickThomson · 2006-04-28 22:27 · Score: 1

I was always under the impression that prime numbers weren't floating point. I've never seen a prime finding algorithm that exclusively used floating point.

--
I am one of many. My idea is not unique, nor do I expect my voice alone to sway you. I speak in a chorus of opinion.
Re:Prime95 as a detection tool? by ettlz · 2006-04-28 23:52 · Score: 1

The GIMPS clients do use the FPU (read the FAQ). Something to do with intensive use of FFTs.
Re:Prime95 as a detection tool? by smallfries · 2006-04-29 00:48 · Score: 1

It seems unlikely. The synthetic benchmark that they describe is four particular FP instructions, repeated several million times. Firstly no compiler would unroll a loop that far. Secondly, the four instructions would never occur in practice because you would need some memory accesses at some point to load/store data. Between the loop control, and the memory accesses, the FP instructions are broken up enough that they don't trigger the failure.

--
Slashdot: where don knuth is an idiot because he cant grasp the awesome power of php

Uh oh by IAMTHEMEDIA · 2006-04-28 19:23 · Score: 1, Funny

How about that, I was wondering why my computer was giving me the message "all your base belong to us". Heh, OK, that was dumb, but hey, it's Slashdot; I know one of you laughed. But seriously, I do have this chip and my computer is evil; therefore, it must be the chip! Not the fact that I have 98 gigabytes of music, porn and, uhh, porn.

Re:Quality Assurance? by fabs64 · 2006-04-28 19:26 · Score: 1

Actually it's very common for cpu manufacturers to just underclock overheating chips.
Also, an AMD chip is only rated up to around 75°C anyway, from memory.

Re:Kernel fix? by Alien+Being · 2006-04-28 19:28 · Score: 1

If his overclocking causes a problem, he can kick his own ass.

CALL ESP by Myria · 2006-04-28 19:42 · Score: 3, Interesting

Probably the easiest erratum to come by is the instruction "CALL ESP" (or "CALL RSP"). On AMD CPUs, "CALL ESP" will jump to the address in ESP, *then* push the return address. However, on Intel CPUs, it will push the return address first, then jump to the value it just pushed. This is, of course, disastrous if you try to use it.

According to Intel errata documents, this is a bug in the Pentium Pro that has been kept for several generations. The Pentium and below, except the 8086 and 8088, worked correctly with this instruction.

If you want to differentiate Intel and AMD in your program and don't want to use CPUID, you can set up a test with CALL ESP.
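
To make the difference concrete, here is a toy Python model of the two orderings exactly as described in this post. It is not verified against any vendor documentation; the function names, the 4-byte push size (32-bit mode) and the addresses are all assumptions for illustration.

```python
PUSH_SIZE = 4  # assumption: 32-bit mode, 4-byte return address


def call_esp_read_then_push(esp, return_addr, memory):
    """Ordering the post attributes to AMD: take the target from ESP, then push."""
    target = esp
    esp -= PUSH_SIZE
    memory[esp] = return_addr
    return target, esp


def call_esp_push_then_read(esp, return_addr, memory):
    """Ordering the post attributes to Intel: push first, then take the new ESP."""
    esp -= PUSH_SIZE
    memory[esp] = return_addr
    target = esp
    return target, esp


mem_a, mem_b = {}, {}
target_a, _ = call_esp_read_then_push(0x1000, 0x401005, mem_a)
target_b, _ = call_esp_push_then_read(0x1000, 0x401005, mem_b)
print(hex(target_a))  # 0x1000
print(hex(target_b))  # 0xffc
```

In the second ordering the jump target is the stack slot that now holds the return address, so the CPU would start executing the bytes of that address as code.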

Melissa

--
"Screw Sun, cross-platform will never work. Let's move on and steal the Java language." - Visual J++ Product Manager

Quality Control at AMD must be good. by MROD · 2006-04-28 19:46 · Score: 5, Interesting

Having read a lot about this flaw it's actually amazing that AMD's quality control found the problem in the first place.

The actions needed to cause the problem to arise are so extreme that they'd never happen in the field. i.e. Loop through tight floating-point only instructions without any comparisons for maybe hours before the error occurs.

This would *NEVER* happen in the field. Firstly, in any modern OS the process would have been pre-empted long before any problem could occur (causing other instructions to run and hence stopping the overheating). Secondly, no real-world program would ever do this sort of thing as there would always be a comparison in the loop within the timeframe.

This is a theoretical problem only in the real world, especially as it only affects about 3000 processors in total (it has been quoted). This is why AMD gave it such a low priority. We should just forget about it and move on.

--

Agrajag: "Oh no, not again!"

Re:Quality Control at AMD must be good. by Izrath · 2006-04-28 20:49 · Score: 1

> Having read a lot about this flaw it's actually amazing that AMD's quality control found the problem in the first place

I'm an AMD fanboy, but I used to know a guy that worked QA for one of the Intel plants here in AZ. He said they run the chips through very intense stress tests and such for days... if one has a problem they toss the whole batch.
Not too keen on the manufacturing process of chipsets myself, but I would think AMD's QA is comparable.
Re:Quality Control at AMD must be good. by kinnell · 2006-04-28 23:38 · Score: 2, Informative

> Having read a lot about this flaw it's actually amazing that AMD's quality control found the problem in the first place.
> The actions needed to cause the problem to arise are so extreme that they'd never happen in the field.

This kind of thing is standard practice. If you want to stress test a piece of hardware, you write specialised test code which will consume the maximum amount of power possible, not a real world program. You have to be sure that nobody will be able to write software which will drive the processor harder than your tests have. It's good that AMD found this fault, and even better that they owned up to it, but it's not remarkable.

--
If I seem short sighted, it is because I stand on the shoulders of midgets
Re:Quality Control at AMD must be good. by Ruie · 2006-04-29 02:41 · Score: 1

> This would *NEVER* happen in the field. Firstly, in any modern OS the process would have been pre-empted long before any problem could occur (causing other instructions to run and hence stopping the overheating). Secondly, no real-world program would ever do this sort of thing as there would always be a comparison in the loop within the timeframe.

I am not so sure. TFA said millions of instructions, and the chips are capable of billions per second. So with HZ=100 there is room for 28e6 instructions to be executed between timer ticks - plenty to trigger the problem.
Also, TFA specifically mentioned the FPU, so it is possible that an integer loop counter gets processed fast enough and in parallel so that the FPU is always loaded.
In particular, suppose you have an array of 10 million floats and you want to find the mean - would this trigger the bug?
I do agree that it is very nice that they checked and announced the problem.
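
The 28e6 figure above can be reproduced with a one-line calculation. The sketch below assumes a 2.8 GHz clock (inferred from the 28e6 number and HZ=100) and, as the post does, roughly one instruction per clock cycle. The function name is illustrative.

```python
def cycles_per_timer_tick(clock_hz, timer_hz):
    """Clock cycles available between two consecutive timer interrupts."""
    return clock_hz // timer_hz


# Assumed: 2.8 GHz part, kernel timer at HZ=100 (one tick every 10 ms).
print(cycles_per_timer_tick(2_800_000_000, 100))  # 28000000
```

With HZ=1000 the window shrinks to 2.8 million cycles, which is still in the "millions" range.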
Re:Quality Control at AMD must be good. by Sebastopol · 2006-05-01 08:49 · Score: 1

You sound EXACTLY like Intel immediately after FDIV:

[[AMD states the test conditions involved running floating-point intensive code sequences, a highly computational task usually performed in research labs. "It's very hard to imagine this type of [tight FP loop] code in our [financial services] environment,"]]

[[Flashback to 1995 - "Intel distorted how serious [FDIV bug] was telling people it was only an issue to researchers and scientists"]]

--
https://www.accountkiller.com/removal-requested

Re:Kernel fix? by Lucractius · 2006-04-28 19:50 · Score: 1

as long as i have the nice fat "refund" check for clocking down the cpus, who cares :P

--
XML - A clever joke would be here if /. didn't mangle tag brackets.

Who cares by Kwiik · 2006-04-28 20:02 · Score: 1

don't "we" commonly measure code in terms of how many errors per KLoC it contains? How does any of this pale in comparison?

Some information derived from an old leaked MS email: The internal rule in Redmond is "4 bugs per KLoC".

that's one bug per 250 lines of code.. one bug every couple of functions, for the smaller functions.. several bugs per function, for the more intensive functions.

and it's not like OSS is safe from this either. Sure, having error-prone hardware will make perfection even more impossible, but seriously, you are far more likely to receive faulty RAM out of the box than to run into an issue with one of these suckers.

--
Vehicle Stars used car search is my current project

This will not happen to you by Bloater · 2006-04-28 20:38 · Score: 4, Informative

If you have any interrupts coming in, or your loop has a termination condition. I think you have to have your hardware set to send an interrupt many hours in the future then start an otherwise nonterminating loop.

So under normal conditions on normal PC hardware, this simply won't happen.

Re:This will not happen to you by Sebastopol · 2006-05-01 08:47 · Score: 1

What exactly do you assume the service time of said interrupt is, 10 minutes?

Let's say I kick off Maya to render a fairly complex scene. I don't touch my CPU for 20 minutes. Interrupts come in with the CPU tick so that the OS can reschedule threads, but in general, nothing is happening but rendering.

Now, if interrupts take 100's of milliseconds, no problem: the die can probably cool down at that point because the thermal time constant is on the order of milliseconds. But most interrupts are serviced in MICROSECONDS.

Sorry, interrupts will not save the machine. You could sit there and type a novel and wildly move your mouse: you'd probably use less than 1% of the CPU while the die heats.

Since dies heat up fairly quickly, I severely doubt the technical accuracy of this article, unless AMD is specifically addressing a cooling solution problem or dies that tested at a higher temperature but weren't labeled as such.
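
As a back-of-the-envelope check on the "less than 1% of the CPU" claim, here is a small sketch. The interrupt rate and the per-interrupt service time are assumed numbers chosen only to be in the microsecond range mentioned above; they are not measurements.

```python
def interrupt_cpu_fraction(interrupts_per_second, service_time_seconds):
    """Fraction of each second the CPU spends servicing interrupts."""
    return interrupts_per_second * service_time_seconds


# Assumed: 1000 interrupts per second, 5 microseconds each.
fraction = interrupt_cpu_fraction(1000, 5e-6)
print(f"{fraction:.2%} of CPU time goes to interrupts")
```

Under those assumptions the render keeps the CPU for well over 99% of the time, which is the point being made about the die not getting a chance to cool.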

--
https://www.accountkiller.com/removal-requested

Re:Deja Vu: Intel Processor's Bug in 1994 by dukiebbtwin · 2006-04-28 21:34 · Score: 1

True, but think of all the money and resources lost by Intel over a rare error that would not affect the vast majority of the customers using the chip. How many perfectly decent chips were just thrown away, passing the cost on to the consumer, who had to make up for the money lost in remanufacturing these chips? Here is a report done by Intel on how often an average user might see an error: http://www.intel.com/support/processors/pentium/fdiv/wp/6.htm It's certainly bad PR for AMD and they will most likely offer an exchange program like Intel, but the practical need for exchanges isn't really there (if what I am reading in other comments is correct).

Mod parent up! by Mixel · 2006-04-28 21:38 · Score: 1

I've run Prime95 on two of my boxes. Fine on one. My AMD Athlon XP 2000 (Socket A), which I often suspected to be unstable, reliably dies after less than an hour of running the Prime95 code (originally discovered from running the Seventeen or Bust client, which includes the same Prime95 code). BIOS says temperature is normal, so I'll just blame the motherboard caps for now.

Either way, Prime95 has given proof to my suspicion like no other tool could. I no longer run important/intensive apps there.

Re:Quality Assurance? by Bert64 · 2006-04-28 21:50 · Score: 1

Intel had a very similar problem with some of their Itanium chips recently too, however i don't recall them offering free replacements, i believe they just told customers to clock down affected processors!

However, very few people cared because very few people use itanium chips, and those who do are used to them not performing as advertised.

--
http://spamdecoy.net - free throwaway anonymous email - avoid spam!

Re:Kernel fix? by KDR_11k · 2006-04-28 21:55 · Score: 1

You could have bought a "downrated" chip and overclocked it, too. The clock rate on the box is merely specification, the chip can remain operational at higher clock rates, the manufacturer just won't be responsible for what happens then.

--
Justice is the sheep getting arrested while an impartial judge declares the vote void.

NOOOOOOOO! by tomstdenis · 2006-04-29 00:25 · Score: 1

... I just got my pair of 285s! ... well fortunately I don't do a lot of FPU work like that. That and I run cpufreq in "ondemand" mode so I don't care about heat...

Tom

--
Someday, I'll have a real sig.

Re:NOOOOOOOO! by tomstdenis · 2006-04-29 02:33 · Score: 1

Admittedly I didn't RTFA until after I posted...

phew..

Tom

--
Someday, I'll have a real sig.

Re:What? by tomstdenis · 2006-04-29 00:37 · Score: 2, Interesting

There are two parts to that. First off, the composition of the die is varied. Some parts are the ALU, FPU, cache, etc. So depending where the current is going changes the heat [no duh]. The FPU is particularly nasty as unlike the ALU it takes at least 2 EX cycles to do anything and most complicated instructions are at least 4 EX cycles. This means something in the FPU is running for 4 cycles at a time, cannot be interrupted, etc.

So getting heat local to the FPU isn't too surprising. There are various things in place to mitigate that, for example, the heat spreader. But it can only absorb heat so fast. The lack of APIC interrupts (e.g. timers) makes this test rather artificial. If I recall correctly OSes send timer interrupts to processors to schedule tasks. So this would have to be something that is beyond an OS's control. Like you'd have to write your own mini-OS or something.

The other part though is you have to keep in mind making processors is not an exact process. My two x85 series opterons probably have slightly different features (e.g. exact alignment) even though they're made from the same design. If I sliced them open and got "my first electron microscope" and looked at them I'd probably be able to measure slight differences. There are other controlled issues (quality of material, chemicals, etc). So that a batch of processors exhibit this problem is concerning but not impossible.

I'll bet you they probably have another test on the QA line now :-)

Tom

--
Someday, I'll have a real sig.

Re:What? by Yvanhoe · 2006-04-29 01:07 · Score: 1

Overheating leading to data corruption? Since when is this a flaw in chip design?

Since a normal temperature of functionning is written in the specifications of the hip

--
The Wise adapts himself to the world. The Fool adapts the world to himself. Therefore, all progress depends on the Fool.

Phew, I'm not affected by int19h · 2006-04-29 01:14 · Score: 1

Did anyone else have the reflex of doing cat /proc/cpuinfo?
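
For anyone who wants that reflex in script form, here is a minimal Python sketch that pulls the "model name" lines out of /proc/cpuinfo on Linux. The sample text is hypothetical and only there so the script runs elsewhere; a model name alone does not tell you whether a particular chip is one of the affected ones.

```python
from pathlib import Path


def cpu_model_names(cpuinfo_text: str) -> list[str]:
    """Return the 'model name' value for each processor entry."""
    names = []
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "model name":
            names.append(value.strip())
    return names


# Hypothetical sample in the usual /proc/cpuinfo layout, for illustration only.
SAMPLE = """processor\t: 0
vendor_id\t: AuthenticAMD
model name\t: AMD Opteron(tm) Processor 285
processor\t: 1
vendor_id\t: AuthenticAMD
model name\t: AMD Opteron(tm) Processor 285
"""

if __name__ == "__main__":
    path = Path("/proc/cpuinfo")
    # Fall back to the sample on systems without /proc/cpuinfo.
    text = path.read_text() if path.exists() else SAMPLE
    for name in cpu_model_names(text):
        print(name)
```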

Re:Phew, I'm not affected by springbox · 2006-04-29 04:10 · Score: 1

I tried, but I think I might have one of the affected chips:
C:\>cat /proc/cpu meow
Re:Phew, I'm not affected by int19h · 2006-04-30 09:46 · Score: 1

I can tell that you're not running Tiger.

99% of reported Pentium bugs were program flaws by expro · 2006-04-29 01:42 · Score: 1

Floating point is hopelessly problematic for the average programmer and too many average programmers wrote the programs from Excel to MS Calculator and by any number of other vendors, all of which had "Pentium bugs" reported, that didn't need particular Intel hardware to be reproduced.

Re:Haha by WilliamSChips · 2006-04-29 02:16 · Score: 2, Insightful

When AMD has a problem, it only affects 3000 or so processors and causes minor corruption when a million-line-long piece of code is called without being stopped at any time. When Intel has a problem it affects millions of processors and crashes your computer when a single 32-bit command is called. I know whom I'll be buying from.

--
Please, for the good of Humanity, vote Obama.

Re:What? by LarsG · 2006-04-29 02:19 · Score: 1

Since a normal temperature of functionning is written in the specifications of the hip

I can't find any written instructions on my hip. Which is another piece of circumstantial evidence of my theory that my parents bought me from a chinese clone factory. ;-)

--
If J.K.R wrote Windows: Puteulanus fenestra mortalis!

Coincident Advertising by lildogie · 2006-04-29 02:54 · Score: 1

Funny, the ad that appeared on the comments page had some code P.S. Anyone remember the HCF instruction (halt and catch fire).

Re:Coincident Advertising by Watson+Ladd · 2006-04-29 13:57 · Score: 1

Which one? There are a lot. Some were bus overclocking errors, others were current overloads due to many transistors turning on at once.

--
Inventions have long since reached their limit, and I see no hope for further development.-- Frontinus, 1st cent. AD

Humanly reproducable :) by kesuki · 2006-04-29 03:20 · Score: 1

Although the article specifies 2.6 and 2.8 GHz Opterons, I've crashed my Venice core 3000+ socket 754 7 times from online gaming conditions generated by a particular application (warcraft 3 TFT)

I thought it was the graphic card at first, but the type of crash I've been experiencing and the difficulty to reproduce it (I generally have to play AT with a pro gamer and go on about a 7 game win streak to get game conditions right for the crash) and it does have to be warm in my room...

WC3TFT can reproducibly create a lot of memory operations at very high speeds, repeatedly. Millions of times? Try millions of operations over a 10 minute game. Sounds like it's not just 'hypothetical' to me.

--
https://www.gnu.org/philosophy/free-sw.html

Re:Humanly reproducable :) by fimbulvetr · 2006-04-29 06:15 · Score: 2, Insightful

I think someone's confusing user error/not enough troubleshooting with an almost not reproducible issue. TFA mentions a lot of instructions without enough pause of FPU code to cool down. This isn't your bug if you're playing WC3. WC3 uses TCP/IP. TCP/IP generates interrupts - lots of interrupts. So many interrupts that your FPU has plenty of time to cool down between calculations. There are many handy ways of troubleshooting this issue of yours, and I'd bet you're not going to identify the problem by some slashdot story submission.
Re:Humanly reproducable :) by kesuki · 2006-04-30 13:23 · Score: 1

well, I can't quite discount what you've said, could you clarify how these 'interrupts' that can occur 1000 times a minute (that's how many tcp/ip frames a pro level match, with several pro level players generates) allow the fpu to cool down?

ah well. Now i'm back to trying to figure out if it's the sound or graphic subsystem that's the issue, or if it's simply a 'disconnect' hack that exploits a flaw in the amd-64 cpu. meh.

--
https://www.gnu.org/philosophy/free-sw.html
Re:Humanly reproducable :) by beetlefeet · 2006-05-04 19:09 · Score: 1

Not to mention all the non-FPU code run every time the mouse moves, a unit moves, a keypress happens, the sound code in the game wants to do something, anything animates, the AI code runs (pathfinding), any counter anywhere changes, any other app or driver running within the OS does anything, the OS itself does any multitasking or other operations.

This fault occurs when you have nothing else running, just a piece of code that continuously uses the floating point operations over and over. Like continuously squaring a floating point number, without even counting how many times it has done it, just doing it over and over as fast as possible. Then it might get some data corruption.
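
To make the shape of such a test concrete, here is a minimal Python sketch: run the same chain of floating-point multiply-adds several times and compare each result against the first. All names and constants are made up for illustration. Interpreted Python spends most of its time in non-FPU interpreter code, so this would not create the kind of hot spot described; real stress tools are hand-tuned native code. It only shows the repeat-and-compare idea.

```python
def fpu_loop(iterations: int) -> float:
    """Run a fixed chain of floating-point multiply-adds and return the result."""
    acc = 0.0
    x = 1.000000001
    for _ in range(iterations):
        acc = acc * 0.999999 + x * x  # one multiply-add per pass
    return acc


def run_consistency_check(rounds: int, iterations: int) -> int:
    """Repeat the same computation; count rounds whose result differs from the first."""
    reference = fpu_loop(iterations)
    mismatches = 0
    for _ in range(rounds):
        if fpu_loop(iterations) != reference:
            mismatches += 1
    return mismatches


if __name__ == "__main__":
    bad = run_consistency_check(rounds=20, iterations=100_000)
    print(f"mismatching rounds: {bad}")
```

The computation is deterministic, so on healthy hardware every round should match the reference; any mismatch would mean a result got corrupted somewhere along the way.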

OK.... by NVP_Radical_Dreamer · 2006-04-29 03:23 · Score: 1

So basically you have to stand on one leg, be male, wearing a pink tu tu, live in niger with exactly 3 children who happen to be eating pizza during a lunar eclipse for this to happen?

--
The best argument against democracy is a five-minute conversation with the average voter.

- Winston Churchill

Re:Kernel fix? by Coppit · 2006-04-29 03:25 · Score: 1

On the other hand, a compiler fix is plausible. The idea would be to avoid generating this kind of code. I'm sure some compiler gurus can point out precedent for this sort of thing.

Surprising. AMD uses my `cpuburn` by redelm · 2006-04-29 04:27 · Score: 4, Informative

About 7 years ago, I wrote a suite of open-source CPU stress-tests I called `cpuburn`. Little optimized assembler pgms designed to stress different parts of the CPU. `burnK7` does precisely this FPU dot product.

Of course, I expect AMD's production testing dept to have far better code, since they will devote more job hours to it and know proprietary chip details. Still, different parts of AMD have emailed me several times to thank me because they found the pgms useful. Great.

But these guys know what they're doing. Heat transfer from the hot multipliers has to be carefully analysed [3D finite element heat transfer analysis]. I suspect something far more mundane, like someone reducing die or slug thickness, or a mfg problem with the die/slug gap or thermal goop.

Obnoxious? by freaker_TuC · 2006-04-29 04:51 · Score: 1

The NO CARRIER joke was so nice but does not fit with this probl*NO DATA*

--
--- I am known for the ones who want to find me on the net. Is that a privacy risk or a privilege? One might wonder..

Re:Quality Assurance? by deadlocked · 2006-04-29 04:59 · Score: 1

http://www.heatsink-guide.com/content.php?content=maxtemp.shtml

Linux is immune by r00t · 2006-04-29 05:42 · Score: 1

OK, so Windows is immune too. These operating systems have a clock tick that interrupts at 100, 250, or 1000 Hz. That interrupts the FPU.

Crank up the clock rate even more if you are worried and you just have to run your CPU in tropical temperatures. You could also ping flood the machine, causing plenty of network interrupts.

This is why I switched by Anonymous Coward · 2006-04-29 05:46 · Score: 1, Insightful

I was an Intel man for many years. It's like being a ford or a chevy man
you know, you ignore all good things about the competition and smugly
goof on all their mistakes while ignoring your favorite's eccentricities.
My wakeup call came as I was looking into building a cheap comp to play
UT 2K4 on. I went through the benchmark results to find a good processor
for cheap and was appalled at the prices that intel wanted for middle of the road dreck while AMD had several budget choices that were faster. I finally settled on a sempron 3100+ and I can't believe how many games I can play with just an nvidia 6600le and they all rock. I tried out a friends dell that was supposed to be high end and it couldn't match my resolution or fps and he paid 2300 for his intel boat anchor while I paid exactly 404.17 with shipping for my budget screamer. All this and ethical
treatment for customers too? Long Live AMD

Professor Turing once contemplated this... by mosel-saar-ruwer · 2006-04-29 05:48 · Score: 1

I have formed my own personal postulate/theory/law... and its corollary: It is impossible to completely test a sufficiently complex system in every possible way to be certain that it's bug-free.

Along those lines - many years ago, Professor Turing set out to find a test for [among other things] the possible presence of an infinite loop within a computer program.

Sadly, though, he didn't get very far with that line of inquiry...

Re:Quality Assurance? by ChrisMaple · 2006-04-29 05:56 · Score: 1

"At least this bug was found. How many more like it are there, but we simply don't have the proper trace to find it?"

Hard to say. This is a design margin thing, depending upon worst case conditions plus localized heating, and localized heating (AFAIK) isn't generally modeled. Writing test vectors to find all logic errors is difficult, unpleasant, and labor intensive work. Even if software identifies the worst case path, it won't account for localized heating.

I'd guess there are other problems out there like this, but they generally can be avoided by staying well away from maximum operating conditions: keep your chip cool and within the specified voltage range, and don't overclock.

--
Contribute to civilization: ari.aynrand.org/donate

Re:Surprising. AMD uses my `cpuburn` by fimbulvetr · 2006-04-29 06:18 · Score: 1

I suspect something far more mundane, like someone reducing die or slug thickness, or a mfg problem with the die/slug gap or thermal goop.

Care to go into a bit more detail for us noobs?

Re:Kernel fix? by Lucractius · 2006-04-29 06:55 · Score: 1

see the point was that in response to the GP, that they should give people money to down clock them themselves... and i was pointing out the inherent flaw that theres nothing stopping them from just pretending to have a faulty chip, pretending to underclock it, and leaving unchanged, and pocketing a pile of money

--
XML - A clever joke would be here if /. didn't mangle tag brackets.

Mainframe ;) by kompiluj · 2006-04-29 08:11 · Score: 1

Yes - the ability to take corruption into account is what differentiates mainframes (and also high-end IBM UNIX servers like p595) from PCs.

--
You can defy gravity... for a short time

Here is the info from AMD themselves by Swave+An+deBwoner · 2006-04-29 08:25 · Score: 1

I'm not sure why TFM didn't link to AMD for their disclosure of the problem, but here it is: http://www.amd.com/us-en/0,,3715_13965,00.html

Re:Kernel fix? by KDR_11k · 2006-04-29 08:51 · Score: 1

If they have a way of verifying whether a CPU is affected they could indeed give the money back and tell the customer that his warranty now only applies to the lower frequency. Since these are server CPUs they probably have maintenance contracts attached and those require that the CPU is clocked at manufacturer spec. Warranty means a lot more to companies than home users.

I mean, they could exchange the CPU (maybe just changing the clock multiplier and sending it back) but nothing would stop you from operating it at the higher frequency except for the warranty.

--
Justice is the sheep getting arrested while an impartial judge declares the vote void.

Woah! by modecx · 2006-04-29 10:01 · Score: 1

No sufficiently complex system can ever be completely bug-free.

What do you get if you multiply six by nine?!

--
Constitutional rights may be respected, repealed, or modified; but they must never be ignored.

Re:Woah! by Mister+Transistor · 2006-04-29 11:47 · Score: 1

I get a calculator! Actually, I get 54. Why do you ask?

--
-- You are in a maze of little, twisty passages, all different... --

"allowed...to slip through...detection grid..." by CarpetShark · 2006-04-29 11:13 · Score: 1

I don't know about fearmongering, but it's certainly going to some trouble to make a false accusation. Either it slipped through their "detection grid", or it was detected and ignored. It can't have been both.

Re:Surprising. AMD uses my `cpuburn` by redelm · 2006-04-29 11:14 · Score: 1

More detail? Sure: Modern CPUs are a tough heat-transfer problem. Some circuits throw off a lot of heat, and some don't. This heat first goes into the die, where it spreads in 3D. Too thin, and it can't spread before it has to cross out of the die, through the thermal goop (mfrs have to find _really_ good stuff) and into the big shiny copper heatslug. Too thick, and it gets too warm because silicon isn't the best thermal conductor.

This flaw seems damned serious to me... by anubi · 2006-04-29 13:26 · Score: 2, Insightful

... because the multiply-add is the basic building block of digital signal processing.

You are apt to be doing this extensively when processing audio or video streams.
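
As a minimal illustration of why the multiply-add is everywhere in DSP, here is a plain-Python direct-form FIR filter. The signal and taps are made up, and real audio/video code would use vectorised native routines rather than a loop like this:

```python
def fir_filter(samples: list[float], taps: list[float]) -> list[float]:
    """Direct-form FIR: each output is a sum of tap * delayed-sample products."""
    out = []
    for n in range(len(samples)):
        acc = 0.0
        for k, tap in enumerate(taps):
            if n - k >= 0:
                acc += tap * samples[n - k]  # the multiply-add in question
        out.append(acc)
    return out


# 3-tap moving average over a short, made-up signal.
signal = [3.0, 6.0, 9.0, 12.0]
taps = [1.0 / 3.0] * 3
print(fir_filter(signal, taps))
```

Every output sample costs one multiply-add per tap, so a long filter over a continuous stream is an unbroken run of exactly this operation.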

--
"Prove all things; hold fast that which is good." [KJV: I Thessalonians 5:21]

Interesting! by seebs · 2006-04-29 14:13 · Score: 1, Interesting

A friend of mine and I can reliably crash some similar-generation AMD chips with a loop setting a region of memory to all zeroes, but not with a loop setting it to 0xaaaaaaaa. The chips just lock up. Takes anywhere from a few seconds (linux) to a few minutes (windows).

--
My blog: http://www.seebs.net/log/ --- My iPhone/iPad app: http://www.seebs.net/seebsfrac/

Re:Interesting! by Slashcrap · 2006-04-30 07:38 · Score: 1

A friend of mine and I can reliably crash some similar-generation AMD chips with a loop setting a region of memory to all zeroes, but not with a loop setting it to 0xaaaaaaaa. The chips just lock up. Takes anywhere from a few seconds (linux) to a few minutes (windows).

For Christ's sake, does the fact that the time required to crash is completely different depending on the OS not suggest to you that it's a software issue?

If this was a pure hardware problem then surely the timing would mainly depend on the number of loop iterations? So do you think that your loop runs 60 times faster under Windows?

Let's face it, the overwhelmingly likely explanation is that your code, which is presumably running at Ring 0 and written in assembly (let me guess - you don't normally program in assembly), is clobbering some registers used by the OS. Nothing happens until the OS tries to use those registers, then bang - game over. The time taken for that to happen depends on the OS. Linux just happens to do it slower than Windows.

I can't program in any language, but this scenario seems rather obvious to me. Did it not occur to you before you started to suspect an incredibly unlikely hardware bug? Or do you think that CPU errata are more common than bugs in your code?

Re:99% of reported Pentium bugs were program flaw by Reziac · 2006-04-29 18:24 · Score: 1

I have a P90 (one of those that was remarked down to P75 for the market sweet spot, but because it's really a P90, it runs fine at 90MHz) that has some sort of FP bug... it passes the Calculator test, but locks up with certain math-intensive screen savers, like the old After Dark kaleidoscope. It never showed any other symptoms in its 6 years of useful life, so I didn't bother to RMA it.

I don't consider this as bad as the Sept.1998 batch of K6-2 450Mhz CPUs that could not run certain 32bit code AT ALL (neither Win32 Setup nor any species of Linux would run). AMD refused to replace those at all.

--
~REZ~ #43301. Who'd fake being me anyway?

Re:Kernel fix? by Sebastopol · 2006-05-01 08:52 · Score: 1

Interrupts are not sufficient. You can make a tight loop and still hog >99% of the CPU scheduler. As long as the interrupts don't exceed the thermal time constant of the cooling solution you can easily write a virus to do this (assuming you know the loop).

--
https://www.accountkiller.com/removal-requested

posing a problem : difficult to solve or decide by expro · 2006-05-02 00:58 · Score: 1

From Webster's, the first definition is "posing a problem : difficult to solve or decide". It is the model that is at fault and presents the problem. It is not because it is plagued by problems, but because it is the problem, the riddle, how you can use it with good results for complicated computations.
