Erratum Plagues Quad-Core Opterons, Phenoms

Posted by kdawson on Tuesday December 4, 2007 @11:43AM from the correct-or-fast-choose-at-most-one dept.

theraindog writes "Errata are not uncommon with new processors, but a problem with the TLB logic in AMD's quad-core Opteron and Phenom processors appears to be quite serious. The erratum is so severe that AMD has issued a 'stop ship' order on all quad-core Opterons. AMD has also blamed this bug for the delay of the 2.4GHz Phenom, despite the fact that the erratum is unrelated to clock speed. A BIOS-based workaround for the issue has been made available to motherboard makers, but it apparently carries a 10-20% performance penalty. What's more disturbing is that AMD knew of the erratum and the potential performance hit associated with fixing it before it launched the Phenom processor. Hardware provided to the press for reviews did not include the fix, conveniently overstating Phenom performance."

What??? by GregPK · 2007-12-04 11:47 · Score: 5, Informative

I'm a geek and all, but I've never heard of "erratum".

But dictionary.com is your friend.

Design errors and mistakes in a CPU's hardwired microcode may also be referred to as an erratum. One well publicised example is Intel's "flag" erratum in early Pentium Pro processors. This made the conversion of floating point numbers to integers unreliable due to an exception not being signaled under certain conditions.
1. Re:What??? by nuzak · 2007-12-04 12:08 · Score: 4, Informative
  
  Erratum is singular. Errata is plural.
  
  The conventional terms used for erratum, however, are usually "error" or "bug".
  
  --
  Done with slashdot, done with nerds, getting a life.
2. Re:What??? by Carnildo · 2007-12-04 13:26 · Score: 4, Informative
  
  Well... I can't remember any for my beloved 6502.
  
  They may not have been published, but there are at least three:
  1) A memory-indirect jump where the address is stored across a 256-byte boundary will read the second byte of the address from the wrong location.
  2) The arithmetic status flags are not valid when performing arithmetic in BCD mode.
  3) If a hardware interrupt occurs while the processor is fetching a BRK instruction, the BRK instruction is ignored.
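  Item 1 is the easiest of the three to show in code. Below is a minimal sketch in Python, not a full emulator: the function name `jmp_indirect_target`, the `nmos_bug` switch, and the memory contents are made up for illustration. It assumes the commonly documented behaviour of the original NMOS 6502, where the address of the second (high) byte wraps around inside the same 256-byte page instead of carrying into the next one.

```python
def jmp_indirect_target(memory, pointer, nmos_bug=True):
    """Return the address a 6502 'JMP (pointer)' would jump to.

    memory   -- 64 KiB bytearray standing in for the address space
    pointer  -- address where the 16-bit target is stored (low byte first)
    nmos_bug -- hypothetical switch: True models the page-wrap bug
    """
    lo = memory[pointer]
    if nmos_bug:
        # Only the low 8 bits of the pointer are incremented, so a pointer
        # ending in $FF fetches its high byte from the start of the SAME page.
        hi_addr = (pointer & 0xFF00) | ((pointer + 1) & 0x00FF)
    else:
        # Intended behaviour: a plain 16-bit increment.
        hi_addr = (pointer + 1) & 0xFFFF
    hi = memory[hi_addr]
    return (hi << 8) | lo


memory = bytearray(0x10000)
memory[0x02FF] = 0x34  # low byte of the intended target $1234
memory[0x0300] = 0x12  # high byte, stored across the page boundary
memory[0x0200] = 0x56  # unrelated byte that the buggy fetch picks up

buggy = jmp_indirect_target(memory, 0x02FF, nmos_bug=True)
fixed = jmp_indirect_target(memory, 0x02FF, nmos_bug=False)
print(f"buggy: ${buggy:04X}  intended: ${fixed:04X}")
```

  With the pointer at $02FF the buggy path combines the byte at $02FF with the byte at $0200, so the jump goes somewhere the programmer never intended.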
  
  --
  "They redundantly repeated themselves over and over again incessantly without end ad infinitum" -- ibid.
3. Re:What??? by jamesh · 2007-12-04 22:51 · Score: 2, Informative
  
  Any Futurama fan will also know that Bender's brain is a 6502, as revealed by the 'F Ray' in 'Fry and the Slurm Factory'
Re:NDA not enforcible by TheThiefMaster · 2007-12-04 12:20 · Score: 5, Informative

The patch is under the NDA, the kernel is under GPL, so the resulting work (patched kernel) can't be distributed, because the licenses are incompatible.

The GPL only applies to redistribution. Private-use changes don't have to be GPL'd.

IANAL,TIJHIUI (I Am Not A Lawyer, This Is Just How I Understand It).
Re:Expect Theo de Raadt by the_brobdingnagian · 2007-12-04 12:45 · Score: 2, Informative

I don't know what Theo de Raadt has to do with this; I certainly did not see a reaction from him about this on any of the OpenBSD mailing lists. Can you at least explain what this erratum has to do with security? Because it does look like you're trolling. I do think this is not an isolated event and we can expect more and more processor bugs in the coming years. It's time to leave the antiquated x86 design behind us and move to a cleaner architecture.
Re:Old issue, really by CajunArson · 2007-12-04 12:45 · Score: 4, Informative

The old opty 170 didn't have an L3 cache, which is where the bug lies. This bug is rare, but it is reproducible when the CPU is under heavy load, and it was one of the reasons why AMD was trying to get hardware reviewers to come to an AMD event in Tahoe to run benchmarks on AMD-approved systems instead of just dropping chips into FedEx packages. Causing a full-blown system freeze is also on the serious side when it comes to bugs. There have been even more problems: techreport has a story that, unlike the hand-selected systems that ran at Tahoe, many of the actual consumer Phenoms you can buy today use slower HT speeds (1.8GHz vs. 2.0GHz in the demos). This means that the memory subsystem (AMD's one theoretical strength over Intel right now) is slowed down, so the somewhat unimpressive initial results are actually overstatements of what the consumer chips can do. (article here).

AMD is in a world of hurt right now. The "true" quad-core line appears to be nothing more than marketing hyperbole since year-old q6600's are faster clock-for-clock than Phenom is. AMD will hopefully get these bugs ironed out... by next February. Even then though, AMD will have chips that are MASSIVELY expensive to make, but that they can't sell for the higher prices Intel is able to command. AMD would be fine if they had an expensive chip they could sell at a premium, or a very cheap to produce chip they could sell for the budget crowd, but right now they have Acura production costs coupled with Kia per-unit revenues: bad times.

--
AntiFA: An abbreviation for Anti First Amendment.
They did by DreadSpoon · 2007-12-04 12:52 · Score: 5, Informative

AMD admitted there were errors in the early Phenom CPUs back before launch. They even put it in their presentations in the press conferences and such. They also said before launch that they were going to include the proper fix in the revised core used in the higher end Phenom, hence the delay.
1. Re:They did by Wavicle · 2007-12-04 19:37 · Score: 2, Informative
  
  AMD said there was a bug that only affected the 2.4GHz Phenom. Read this and note where they say:
  AMD already issued a fix to all of its motherboard/system partners, so if you already own a 790FX motherboard or plan to buy a Phenom system, make sure to update the BIOS. 9500 (2.2 GHz) and 9600 (2.3 GHz) parts are unaffected by the errata.
  Now we learn that the slower parts were affected as well.
  
  --
  Education is a better safeguard of liberty than a standing army.
  Edward Everett (1794 - 1865)
Re:"because", not "despite" by mr_mischief · 2007-12-04 13:23 · Score: 3, Informative

IANAEE (electrical engineer) and I've never built my own CPU, even from TTLs or in a simulator. It makes sense to me, though, that while the presence of the error in a chip may not be tied to specific clock frequencies, the chances of encountering the bug still could be.

If it's a race condition in hardware, there's a good chance it's clock-sensitive. The bug probably exists in the whole line, sure. It'll manifest more as the clock ticks get closer together, because the margin for error without triggering the reversal of steps is smaller. If it's a matter of the wrong signal sometimes being asserted because the edge of a clock line transition was missed, it's logically going to happen more when the clock cycles are shorter.

A bug being in the whole line regardless of clock frequency and that bug becoming more of an issue at higher clock frequencies are not at all mutually exclusive conditions. The higher frequencies and higher rates of the error may not coincide, but there's nothing in the article to logically say they don't.

The erratum probably does apply to the whole line equally but probably manifests as a percentage of the time in use as some function of the frequency.
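To put rough numbers on the "smaller margin" idea, here is a small Python sketch of timing slack: the clock period minus the time a signal needs to arrive and settle. Everything in it is an assumption for illustration. The 0.40 ns path delay and 0.02 ns setup time are invented, not Phenom figures, and nothing in the article says the TLB erratum is actually a timing problem of this kind.

```python
def clock_period_ns(freq_ghz):
    """Length of one clock cycle in nanoseconds."""
    return 1.0 / freq_ghz


def timing_slack_ns(freq_ghz, path_delay_ns, setup_ns):
    """Time left over in a cycle after the signal has propagated and settled.

    Negative slack means the signal arrives too late to be latched reliably.
    """
    return clock_period_ns(freq_ghz) - (path_delay_ns + setup_ns)


# Hypothetical numbers, chosen only so the effect is visible.
PATH_DELAY_NS = 0.40  # made-up worst-case gate propagation delay
SETUP_NS = 0.02       # made-up latch setup time

slacks = {f: timing_slack_ns(f, PATH_DELAY_NS, SETUP_NS) for f in (2.2, 2.3, 2.4)}
for freq, slack in slacks.items():
    status = "ok" if slack > 0 else "VIOLATION"
    print(f"{freq:.1f} GHz: slack {slack:+.4f} ns ({status})")
```

With these invented delays the same circuit has a little room to spare at 2.2 and 2.3 GHz and runs out of it at 2.4 GHz, which is the sense in which one bug can be present across the whole line and still bite harder at the top clock speed.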

For any geek wanting a basic understanding of issues like latching times, gate propagation delays, and other analog electrical signaling issues inside a digital CPU, I recommend the first few chapters of Structured Computer Organization. The book builds upon basic designs of computers from using TTLs to designing a CPU, then up by layers through microcode, designing an assembly language, and more. I have an older edition at home which covers up through the 68030 and the 80386 as examples. The newer one covers up through the Pentium II, the UltraSparc, and the Java chips. The book won't make you an electrical engineer by any means, but the discussions of the tricky timing issues within even simple CPUs might be useful here.

As for the clock speed not affecting the percentage loss in efficiency due to the microcode fix... well, yeah. The microcode is the same across the line regardless of the clock speed. If you insert two identical strings of instructions A1 and A2 into an identical pair of microcode stores B1 and B2, the resulting patched microcodes C1 and C2 will likewise be identical. The faster processor will decode and execute the microcode at the same clock speed as before, and so will the slower one. They'll each have the same percentage slowdown relative to their own clock speeds, because they're running the same microcode. We're not talking about two different generations of processors or even two different revisions. It's the same processor design at two clock speeds. One is going to get the same nerfs and buffs for any microcode change proportional to their clock speeds as the other.
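The arithmetic behind that can be sketched in a few lines of Python. This is a simplified model under stated assumptions: the cycle counts are invented, the 15% figure is just a value inside the 10-20% range quoted in the summary, and the model assumes the workload's cycle count does not itself change with clock speed (real memory latency does not scale that neatly).

```python
def run_time_seconds(cycles, clock_hz):
    """Wall-clock time for a workload that needs a fixed number of cycles."""
    return cycles / clock_hz


def relative_slowdown(base_cycles, patched_cycles, clock_hz):
    """Fractional slowdown of the patched chip versus the unpatched one."""
    base = run_time_seconds(base_cycles, clock_hz)
    patched = run_time_seconds(patched_cycles, clock_hz)
    return patched / base - 1.0


# Hypothetical workload: the fix costs 15% more cycles for the same work.
BASE_CYCLES = 1_000_000_000
PATCHED_CYCLES = 1_150_000_000

slow = relative_slowdown(BASE_CYCLES, PATCHED_CYCLES, 2.2e9)  # 2.2 GHz part
fast = relative_slowdown(BASE_CYCLES, PATCHED_CYCLES, 2.4e9)  # 2.4 GHz part
print(f"2.2 GHz: {slow:.1%} slower, 2.4 GHz: {fast:.1%} slower")
```

The clock frequency cancels out of the ratio, so both parts lose the same percentage even though the faster one still finishes sooner in absolute terms.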
Re:Old issue, really by MrFlibbs · 2007-12-05 02:52 · Score: 2, Informative

An excellent post, but one of your details is wrong -- The P6 was not designed in Israel. That design was done in Hillsboro, Oregon. Most of the Pentium Pros sold into the marketplace were from the "P6s", a shrink of the original design, and that was done in Folsom, California.

The design team in Israel added the MMX instructions into the last P5 and then worked on the ill-fated Timna design (integrated memory controller with RDRAM interface) while the P6 was ramping. After that they began the low-power design that became the Pentium M. They also did the Core and Core2 designs. (Except for the new Penryn, which is from Folsom.)