Slashdot Mirror


AMD Confirms Linux 'Performance Marginality Problem' On Ryzen (phoronix.com)

An anonymous reader writes: Ryzen customers experiencing segmentation faults under Linux when firing off many compilation processes have now had their problem officially acknowledged by AMD. The company describes it as a "performance marginality problem" affecting some Ryzen customers and only on Linux. AMD confirmed Threadripper and Epyc processors are unaffected; they will be dealing with the issue on a customer-by-customer basis, and their future consumer products will see better Linux testing/validation. Ryzen customers believed to be affected by the problem can contact AMD Customer Care. Michael Larabel writes via Phoronix: "With the Ryzen segmentation faults on Linux they are found to occur with many, parallel compilation workloads in particular -- certainly not the workloads most Linux users will be firing off on a frequent basis unless intentionally running scripts like ryzen-test/kill-ryzen. As I've previously written, my Ryzen Linux boxes have been working out great except in cases of intentional torture testing with these heavy parallel compilation tasks. [AMD's] analysis has also found that these Ryzen segmentation faults aren't isolated to a particular motherboard vendor or the like, contrary to rumors/noise online due to the complexity of the problem."

27 of 120 comments

  1. Just like FDIV by Anonymous Coward · · Score: 3, Insightful

    Will only affect a few people, so we aren't replacing any CPUs. Way to hand Intel the business, AMD!

    1. Re:Just like FDIV by arglebargle_xiv · · Score: 3, Insightful

      Except it doesn't apply to Threadripper, Epyc, or Ryzen Pro.

      We don't even know if it's an AMD problem, it could be any one of a number of previously-unnoticed Linux issues that happen to show up on Ryzen (note that the text says "may also affect other Unix-like operating systems", not "exists under FreeBSD as well", so currently it's pure speculation that it extends past Linux). We'll have to wait and see what further investigation turns up...

  2. oblig by Anonymous Coward · · Score: 5, Informative

    certainly not the workloads most Linux users will be firing off on a frequent basis

    I run Gentoo you insensitive clod!

    1. Re: oblig by GameboyRMH · · Score: 2

      Have you tried watching H.265/HEVC-encoded anime? :-P

      --
      "When information is power, privacy is freedom" - Jah-Wren Ryel
    2. Re:oblig by Misagon · · Score: 4, Insightful

      How was the parent modded as "Funny"?
      This is definitely not funny. Some users of compiled distros such as Gentoo have encountered the bug on a fairly regular basis when trying to compile the distro -- which is needed to install it.

      --
      "We mustn't be caught by surprise by our own advancing technology" -- Aldous Huxley
  3. so how does that work? by Jodka · · Score: 4, Insightful

    It is not like the CPU is testing for that particular combination of conditions alone and conditionally segfaulting. Really, there is a flaw in the CPU design which so far has only been demonstrated to exhibit itself under those conditions. That is much more worrying than the summary leads us to believe.

    I like AMD and Ryzen is a good bargain compared to Intel. It will be my next CPU purchase, though I am holding out until they fix the bug. But I don't like the way they are minimizing the impact.

    --
    Ceci n'est pas une signature.
    1. Re:so how does that work? by AmiMoJo · · Score: 3

      All modern CPUs run microcode that is updated on boot by the BIOS. So fixing this will just be a microcode update, i.e. a BIOS update. AMD has been quite good at getting vendors to ship such updates for their motherboards and systems, but if for some reason they don't you could load it via a driver under Linux too.
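
      To see which microcode revision a Linux box is actually running, the kernel exposes it per logical CPU in /proc/cpuinfo on x86. A minimal sketch, assuming that file layout; the function name is made up for illustration and the revision string in the comment is a placeholder, not a real AMD revision:

```python
import re
from pathlib import Path


def microcode_revisions(cpuinfo_text: str) -> set[str]:
    """Collect the distinct 'microcode' values found in /proc/cpuinfo text."""
    # Each logical CPU contributes a line such as "microcode\t: 0x1234"
    # (0x1234 is a placeholder value used only for illustration).
    pattern = r"^microcode\s*:\s*(\S+)"
    return set(re.findall(pattern, cpuinfo_text, flags=re.MULTILINE))


if __name__ == "__main__":
    path = Path("/proc/cpuinfo")  # Linux-only; some architectures omit the field
    if path.exists():
        print(microcode_revisions(path.read_text()) or "no microcode field reported")
```

      More than one value in the result would suggest the cores are not all on the same revision. Whether a given revision addresses this particular problem is not something the sketch can tell you.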

      --
      const int one = 65536; (Silvermoon, Texture.cs)
      SJW, n: "Someone I don't like, and by the way I'm a fuckwit" - AC
    2. Re:so how does that work? by rew · · Score: 3, Interesting

      There MUST be some things in hardware to execute anything. While they (the chip manufacturers) have surprised me in the past, not all bugs CAN be fixed with a microcode update.

      A long, long time ago, people wrote "self modifying code". Say, for doing bit-operations on parts of the screen buffer, you might pass 1 for AND, 2 for OR, and 3 for XOR. The function could then place the AND/OR/XOR opcode in the middle of the doit loop and then perform the loop.... So one day the manufacturer guarantees that the new machine will execute everything the old one did. Bad move. Turns out the new machine is faster because it prefetches instructions. By the time the code has determined the opcode for inside the loop, the loop (with the last AND/OR/XOR opcode in place) has already been prefetched. This prefetching is at the core of why the machine is fast. Implemented in hardware. Can you fix that with a microcode update? Apparently in the case at hand (PR1ME9955): yes.

      But I can easily see it happen that either you disable the whole prefetching stuff (slow everything down enormously) or you need say an extra comparator ("Is the store happening near my PC, possibly near my prefetch queue?") to allow for "normal" cases to use the prefetch queue, but this special case to flush the queue only when necessary. In any case, the microcode was updated and stuff worked properly again.
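
      The hazard is easy to show with a toy model. This is only a sketch of the idea, not of any real machine: the two-slot "program", the PATCH pseudo-instruction, and the run() parameters are all invented for illustration, and opcode names stand in for the 1/2/3 selector.

```python
import operator

OPS = {"AND": operator.and_, "OR": operator.or_, "XOR": operator.xor}


def run(patch_to: str, prefetch_depth: int, snoop_stores: bool) -> int:
    """Toy machine: slot 0 patches slot 1, slot 1 combines 0b1100 and 0b1010."""
    # Slot 1 still holds the AND left over from a previous call.
    memory = [("PATCH", patch_to), ("AND",)]
    # Copies of the first `prefetch_depth` instructions, taken before execution.
    queue = list(memory[:prefetch_depth])
    result = -1
    for pc in range(len(memory)):
        # Execute the prefetched copy if one exists, else fetch from memory.
        instr = queue[pc] if pc < len(queue) else memory[pc]
        if instr[0] == "PATCH":
            memory[1] = (instr[1],)  # the self-modifying store
            if snoop_stores:
                # "Comparator": the store landed in the prefetched range,
                # so flush everything queued after the current instruction.
                del queue[pc + 1:]
        else:
            result = OPS[instr[0]](0b1100, 0b1010)
    return result


print(run("XOR", prefetch_depth=0, snoop_stores=False))  # no prefetch
print(run("XOR", prefetch_depth=2, snoop_stores=False))  # stale AND executes
print(run("XOR", prefetch_depth=2, snoop_stores=True))   # flush on store
```

      With no prefetch, or with the flush-on-store check, the patched XOR runs. With prefetch and no check, the old AND runs even though memory already holds XOR.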

  4. Don't worry... by ckatko · · Score: 5, Insightful

    ..the faults only happen for people with massive parallel loads.

    You know... the main reason people buy the CPUs.

    1. Re:Don't worry... by alvinrod · · Score: 2

      Well they do say Threadripper and Epyc are unaffected, and I think those chips are a different stepping than the initial batch of Ryzen chips so the problem may already be fixed. It may be possible to fix the others with a firmware update, though who knows how long that will take to roll out depending on other things AMD is working on and their other priorities.

  5. Re:MT was what AMD had over Intel by F.Ultra · · Score: 3, Insightful

    Well you can (run make -j), just be prepared to rerun that if/when it segfaults... For most people so far, they only get the segfault if they do "make clean && make -jX" a few times, so a single make of even a large project should probably work most of the time. It will be interesting to see if/when AMD will be able to fix it, and in particular why Windows does not seem to suffer from it.
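
    The "do it a few times" part can be scripted. A rough sketch, assuming nothing beyond the Python standard library; count_failures is a name made up here, and it only counts non-zero exits, so an ordinary compile error looks the same as a segfaulting compiler:

```python
import subprocess


def count_failures(command: list[str], runs: int) -> int:
    """Run `command` `runs` times and return how many runs exited non-zero."""
    failures = 0
    for _ in range(runs):
        completed = subprocess.run(
            command,
            stdout=subprocess.DEVNULL,  # build output is not needed here
            stderr=subprocess.DEVNULL,
        )
        if completed.returncode != 0:
            failures += 1
    return failures
```

    For example, count_failures(["sh", "-c", "make clean && make -j16"], runs=10) in a source tree, where 16 and 10 are arbitrary picks. A tree that builds cleanly nine times and fails once without any source change is the pattern people are describing.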

  6. why would I buy a processor that *might* segfault by iggymanz · · Score: 4, Insightful

    never mind my load type today, what about 2 years from now? why would I spend money on something that *might* segfault and for which the vendor isn't going to provide a solution to *everyone*. case by case basis my ass, that's the sign of a tech hardware vendor which should be shunned.

  7. Phoronix FAIL by Anonymous Coward · · Score: 5, Insightful

    Phoronix: "certainly not the workloads most Linux users will be firing off on a frequent basis"

    Bullshit. Anyone who does video encoding will easily max out a Ryzen. Anyone who builds software for a living will max out a Ryzen. In fact, just about anybody who needs more computing power than a Chromebook will max out a Ryzen.

    AMD you fucked up big time. Bigly.

    And Phoronix, who are you to say what people should be doing with their machines? People paid for this computational hardware and should expect it to perform as advertised.

    1. Re:Phoronix FAIL by 0123456 · · Score: 3, Informative

      Not to mention that one of the reasons we want more cores in our desktop machines is to speed up C++ compiles by compiling more files in parallel.

  8. Re: Micro needle in mega haystack. by Anonymous Coward · · Score: 5, Insightful

    Processors are not components where you design for the average case and accept failures during peak load. How can a single byte of anything compiled on this processor from now on be trusted not to have been silently corrupted? Does multithreaded disk access run the risk of silently corrupting my files? Until fixed, this processor is toast.

  9. So far so good by I'm+just+joshin · · Score: 4, Informative

    Anecdote here...

    Ryzen 1700 w/ 64GB running Proxmox and 6 virtual machines - 1 Debian, 1 Gentoo (build machine), 1 pfSense, and 3 Windows.

    Been rock solid doing full world builds on Gentoo, PCI passthrough of a GTX 1070 card to one of the Windows VMs (gaming actually works well), and has only been rebooted once since getting it going. Uptime of 24 days.

    No segfaults.

    It is amazingly fast & quiet. Quite the upgrade from my i7-3770K.

    1. Re:So far so good by I'm+just+joshin · · Score: 2

      I mostly followed this: https://pve.proxmox.com/wiki/P.... If you're passing an nVidia GPU, be sure to pull a copy of its BIOS and pass it to KVM.

      In addition, I passed most USB ports, and my PCI-E Soundblaster card to the Windows VM.

      Good luck.

  10. You guys new or something? by Orgasmatron · · Score: 4, Interesting

    Not (necessarily) a big deal. CPUs have bugs. The kernel, the compilers and the standard libraries are all stuffed full of workarounds for various CPU errors. They are called "errata" and pretty much every CPU has them. (One could argue that corrigendum would be a more appropriate word for them.) Intel has had some big ones, the most memorable (off the top of my head) were F00F and FDIV. The 286 was so riddled with bugs that everyone gave up trying to write a protected mode kernel and just waited for the 386.

    Basically, they'll figure out what is causing the error and how to avoid it. If the workaround is easy, like "have the compiler reorder some instructions", a few patches will go out and life goes on, no big deal.

    If the workaround is less easy, like "don't utilize all cores", or "bump the clock multiplier down to overcome a thermal or electrical issue", that is a much bigger deal. If you don't meet marketing numbers, your choices are refund or replace. Intel spent a half billion dollars replacing CPUs because of the FDIV bug, even though they calculated that most people would never encounter it and it was relatively easy to patch around (but the patch would have been a drag on FPU performance - and marketing again had made promises).

    --
    See that "Preview" button?
    1. Re:You guys new or something? by Misagon · · Score: 2

      The first bug report with a test case that reproduced the bug was submitted to AMD in April, and they have only now acknowledged the bug.

      And how long would we have to wait for a microcode update?

      --
      "We mustn't be caught by surprise by our own advancing technology" -- Aldous Huxley
  11. Mod Points by bobbuck · · Score: 2
    "I have no idea how this isn't +5."

    Well, the last time I had mod points, I wasted them on comments in a post announcing the invention of the telegraph so don't expect much modding from me.

  12. Re: Micro needle in mega haystack. by LostMyBeaver · · Score: 5, Insightful

    I tend to buy at least one AMD system from each generation to give it a go and see if we can't get somewhere without these problems.

    amd486 - system/memory clock (same thing back then) was unstable and too high. This caused all kinds of issues with Maxwell's theorem and it was impossible to run a VESA local bus IDE or VGA adapter reliably. Also consider that the CPU was implemented almost entirely without x86 debug registers which made debugging GPFs a complete nightmare. Very often, Windows NT 3.1 and 3.5 would crash on there and people immediately pointed a finger at Microsoft for the GPFs and blue screens. In reality on AMD CPUs, nearly 50 percent of the GPFs were actually AMD's fault.

    amd586 and 686... these CPUs were huge improvements, but there was some weird issue with the NMI that made debugging code almost impossible. They also had a really bad tendency of bursting capacitors on the system board.

    AMD with later generations
    - Built-in MMU was implemented for users, not servers and developers. It was absolutely horrifying wondering whether my code was going to come out right. Memory protection was more of a suggestion to them than a rule.
    - AMD was killing every desktop benchmark, I actually loved AMD at this time as I was playing games and I had bought myself four Shuttle Cubes with the nVidia chipsets and AMD CPUs. I programmed on a dual-Celeron system at work with Linux because it was just faster and better.
    - P4 vs Athlon days. Intel botched the P4 in so many ways it was terrible. It was almost not a challenge for AMD to out-perform Intel as the P4 architecture was an endless mess of cache miss hell. Now... let's be REALLY REALLY fair. P4 would have been the ultimate winner if CPUs were meant for DOS. What I mean is that on a system where there is only a single task (not including hardware interrupt handlers) the P4 pipeline is still a thing of true beauty. But the whole world had moved to Windows XP (got XP and my first P4 on the same shopping trip) and people left DOS, Windows 95/98/ME behind to run a real operating system for the first time... And the P4 was dead before it left the door. The Athlon which was basically equal to a higher clocked Pentium III with an internal MMU ... which in itself was the best thing they ever did.... was amazingly fast. Instead of making a fancier CPU, AMD just kept making the same one and in each generation, focused on moving more bottlenecking systems on-die so the chip performance wouldn't be throttled by external buses. Unfortunately, during this era, both Intel and AMD sucked for development. GCC was a hot wreck as it was still running the crap based on Richard Stallman's code, 2.77 was useless for optimization and 2.89-2.95 was absolutely unreliable. RedHat was trying to make a living porting Linux to every damn device and make it run on ARM (SHITTY DEVELOPMENT PLATFORM at the time), etc... Visual C++ was great and Intel C++ was amazing but you weren't allowed to say that out loud. See, Microsoft was truly evil at the time.
    Following generations of AMD (not including Ryzen)
    - Branding hell... no one that didn't take an obsessive interest in AMD could tell what generation of chip they were buying or even what tier. Even now, having owned many of them, I couldn't tell you which ones were good or bad because I was lost. Intel's current numbering is bad... but not that bad.
    - Memory problems. Yeh... wasted 5 days trying to debug a buffer overflow... then I switched to my Intel based laptop and it showed up in the debugger on the first try. AMD still can't make a fucking MMU. How the hell are you supposed to write a memory manager for an operating system if you can't trap buffer overflows when you clearly defined in the GDT and/or LDT where it should set bounds.
    - Order of execution. On an Intel Core CPU, I can write multiprocessing code, set core affinity based on the position of the core relative to the ring buses. Then I can queue tasks that read/write L1/L2/L3 cache and based on the queui

  13. Re:Only on Linux by Dagger2 · · Score: 2

    You could just as easily argue that the fact that Linux works fine on other Ryzen processors, AMD's older processors and Intel's processors, and only segfaults on these specific Ryzen models, tells you that it's these processors that are broken, not Linux.

    Of course -- and I shouldn't really have to explain this on Slashdot of all places -- neither of these observations actually tells you where the problem is. Doing that involves doing some investigation, and the fact that AMD appear to be accepting blame suggests that they've done the investigation and believe it's their fault.

  14. Re:These are a bear to track down by Misagon · · Score: 4, Informative

    It has been confirmed to be a processor bug, not a software bug.
    BSD kernel developer Matt Dillon sent AMD a reproducible test case back in April.
    You can read more about it here.

    --
    "We mustn't be caught by surprise by our own advancing technology" -- Aldous Huxley
  15. Re:MT was what AMD had over Intel by OneAhead · · Score: 2

    This. The very existence of that flag in a ubiquitous utility that is commonly run even by end users (of at least some distros ;-)) makes the following sentence in TFA sound quite ignorant at best (and dishonest at worst):
    With the Ryzen segmentation faults on Linux they are found to occur with many, parallel compilation workloads in particular -- certainly not the workloads most Linux users will be firing off on a frequent basis unless intentionally running scripts like ryzen-test/kill-ryzen.

  16. Something is bugging me about that by dbIII · · Score: 2

    Intel has been rock-solid since forever

    Complete F00F.

  17. Re:why would I buy a processor that *might* segfau by iCEBaLM · · Score: 2

    Intel has been rock-solid since forever.

    https://arstechnica.com/inform...

  18. Re:MT was what AMD had over Intel by rew · · Score: 2

    Wait!

    What is happening is that the CPU will mis-execute some instruction so that some "data" becomes invalid. When a compiler is running such data is often a pointer and the wrong pointer often results in a segfault.

    But especially while we don't know what's going on exactly, this could also corrupt data. i.e. give the wrong results in a computation, or result in a bad binary when the running program is a compiler.

    So you're suggesting I trust the resulting binaries when the compilation doesn't segfault? Even when I have to try several times? Ehh. not me!
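
    One partial check is to produce the same output several times and compare hashes. A sketch only: outputs_agree and its produce argument are names invented here, and it assumes the build is byte-for-byte reproducible (no embedded timestamps or paths), which many builds are not without extra effort.

```python
import hashlib
from typing import Callable


def outputs_agree(produce: Callable[[], bytes], runs: int) -> bool:
    """Return True if `produce` yields byte-identical output on every run."""
    # A set keeps one entry per distinct digest, so one entry means all
    # runs matched. With runs == 0 the set is empty and this returns False.
    digests = {hashlib.sha256(produce()).hexdigest() for _ in range(runs)}
    return len(digests) == 1
```

    Here produce would be something like "rebuild, then read the resulting binary as bytes". Matching digests don't prove the binary is right, since the same wrong result could in principle come out every time, but a mismatch on a reproducible build is a strong hint that something got corrupted along the way.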