Slashdot Mirror


AMD Confirms Linux 'Performance Marginality Problem' On Ryzen (phoronix.com)

An anonymous reader writes: Ryzen customers experiencing segmentation faults under Linux when firing off many compilation processes have now had their problem officially acknowledged by AMD. The company describes it as a "performance marginality problem" affecting some Ryzen customers and only on Linux. AMD confirmed Threadripper and Epyc processors are unaffected; they will be dealing with the issue on a customer-by-customer basis, and their future consumer products will see better Linux testing/validation. Ryzen customers believed to be affected by the problem can contact AMD Customer Care. Michael Larabel writes via Phoronix: "With the Ryzen segmentation faults on Linux they are found to occur with many, parallel compilation workloads in particular -- certainly not the workloads most Linux users will be firing off on a frequent basis unless intentionally running scripts like ryzen-test/kill-ryzen. As I've previously written, my Ryzen Linux boxes have been working out great except in cases of intentional torture testing with these heavy parallel compilation tasks. [AMD's] analysis has also found that these Ryzen segmentation faults aren't isolated to a particular motherboard vendor or the like, contrary to rumors/noise online due to the complexity of the problem."

2 of 120 comments (clear)

  1. You guys new or something? by Orgasmatron · · Score: 4, Interesting

    Not (necessarily) a big deal. CPUs have bugs. The kernel, the compilers and the standard libraries are all stuffed full of workarounds for various CPU errors. They are called "errata" and pretty much every CPU has them. (One could argue that corrigendum would be a more appropriate word for them.) Intel has had some big ones, the most memorable (off the top of my head) were FOOF and FDIV. The 286 was so riddled with bugs that everyone gave up trying to write a protected mode kernel and just waited for the 386.

    Basically, they'll figure out what is causing the error and how to avoid it. If the workaround is easy, like "have the compiler reorder some instructions", a few patches will go out and life goes on, no big deal.

    If the workaround is less easy, like "don't utilize all cores", or "bump the clock multiplier down to overcome a thermal or electrical issue", that is a much bigger deal. If you don't meet marketing numbers, your choices are refund or replace. Intel spent a half billion dollars replacing CPUs because of the FDIV bug, even though they calculated that most people would never encounter it and it was relatively easy to patch around (but the patch would have been a drag on FPU performance - and marketing again had made promises).

    --
    See that "Preview" button?
  2. Re:so how does that work? by rew · · Score: 3, Interesting

    There MUST be some things in hardware to execute anything. While they (the chip manufacturers) have surprised me in the past, not all bugs CAN be fixed with a microcode update.

    A long, long time ago, people wrote "self modifying code". Say for doing bit-operations on parts of the screen buffer, you might pass 1 for AND 2 for OR and 3 for XOR. The function could then place the AND/OR/XOR opcode in the middle of the doit loop and then perform the loop.... So one day the manufacturer guarantees that the new machine will execute everything the old one did. Bad move. Turns out the new machine is faster because it prefetches instructions. By the time the code has determined the opcode for inside the loop, the loop (with the last AND/OR/XOR opcode in place) has already been prefetched. This prefetching is at the core of why the machine is fast. Implemented in hardware. Can you fix that with a microcode update? Apparently in the case at hand (PR1ME9955): yes.

    But I can easily see it happen that either you disable the whole prefetching stuff (slow everything down enormously) or you need say an extra comparator ("Is the store happening near my PC, possibly near my prefetch queue?") to allow for "normal" cases to use the prefetch queue, but this special case to flush the queue only when necessary. In any case, the microcode was updated and stuff worked properly again.