AMD Confirms Linux 'Performance Marginality Problem' On Ryzen (phoronix.com)
An anonymous reader writes: Ryzen customers experiencing segmentation faults under Linux when firing off many compilation processes have now had their problem officially acknowledged by AMD. The company describes it as a "performance marginality problem" affecting some Ryzen customers and only on Linux. AMD confirmed Threadripper and Epyc processors are unaffected; they will be dealing with the issue on a customer-by-customer basis, and their future consumer products will see better Linux testing/validation. Ryzen customers believed to be affected by the problem can contact AMD Customer Care. Michael Larabel writes via Phoronix: "With the Ryzen segmentation faults on Linux they are found to occur with many, parallel compilation workloads in particular -- certainly not the workloads most Linux users will be firing off on a frequent basis unless intentionally running scripts like ryzen-test/kill-ryzen. As I've previously written, my Ryzen Linux boxes have been working out great except in cases of intentional torture testing with these heavy parallel compilation tasks. [AMD's] analysis has also found that these Ryzen segmentation faults aren't isolated to a particular motherboard vendor or the like, contrary to rumors/noise online due to the complexity of the problem."
Not (necessarily) a big deal. CPUs have bugs. The kernel, the compilers and the standard libraries are all stuffed full of workarounds for various CPU errors. They are called "errata" and pretty much every CPU has them. (One could argue that corrigendum would be a more appropriate word for them.) Intel has had some big ones; the most memorable (off the top of my head) were F00F and FDIV. The 286 was so riddled with bugs that everyone gave up trying to write a protected mode kernel and just waited for the 386.
Basically, they'll figure out what is causing the error and how to avoid it. If the workaround is easy, like "have the compiler reorder some instructions", a few patches will go out and life goes on, no big deal.
If the workaround is less easy, like "don't utilize all cores", or "bump the clock multiplier down to overcome a thermal or electrical issue", that is a much bigger deal. If you don't meet marketing numbers, your choices are refund or replace. Intel spent a half billion dollars replacing CPUs because of the FDIV bug, even though they calculated that most people would never encounter it and it was relatively easy to patch around (but the patch would have been a drag on FPU performance - and marketing again had made promises).
See that "Preview" button?
It seems that Ryzen's hyperthreading, on Linux, under very rare circumstances, can cause memory errors. And Intel is spending millions flooding every tech forum and tech site with shill propaganda declaring this to be the 'end of the world'.
But Intel would like you to forget that its first two generations of hyperthreading were so broken, you had to switch it off altogether to do any serious work.
Hyperthreading needs scheduling to be sane and sympathetic. So no issues on the vastly better coded Windows. Sadly Linux is a joke from a software stability POV. So two threads on one core with inter-dependencies have many possibilities to cause bugs.
I once had Windows crash rarely when launching video. Turned out that I had a driver (emulating a DVD ROM) that failed to prevent its IRQ driver from 'paging out' under memory 'pressure'. And for some reason playing video had a real chance of grabbing the memory used by the interrupt code. The bug was 100% the fault of the IRQ code. And when I tracked it down, turned out there was a driver update that fixed the very bug.
Seems the Linux bug on Ryzen is the same sort of thing. One thread, apparently, has to be an interrupt. The compile load has to be so very taxing, the entire system RAM is under constant load. And I bet my bottom dollar the hopeless Linux coder has failed to flag the interrupt handling code as 'non-paging'. Or the Linux scheduler screws up ring zero ultra-priority interrupt handlers, and lets them 'time out' under pressure.
Before you say "but Intel works": WRONG. The person (sponsored by Intel) flooding forums with this 'bug' and the script to trigger it had to change the script code over and over again when users discovered it was triggering the same errors on Intel systems as well. What we know for REAL (as opposed to this fake news) is that certain compile workloads on Intel and AMD cause memory issues if hyperthreading is on. And the reason is certain to be bad Linux coding.
If versions 1, 2, 3, 4, 5 and 6 of the workload script crashed both Intel and AMD, and version 7 so far (so it's claimed) only affects some Ryzen chips, well the problem is clearly not unique to Ryzen.
PS again the people responsible for banging on about the issue are sponsored by Intel, and Intel has a very large active bounty for anyone who can 'prove' faults in Ryzen.
That tells me someone's code is fucked up, not that AMD's processors are screwed. Ain't happening on my Hackintosh, ain't happening on my Windows box.
Did someone let Grsecurity do the SMT kernel code?
There MUST be some things in hardware to execute anything. While they (the chip manufacturers) have surprised me in the past, not all bugs CAN be fixed with a microcode update.
A long, long time ago, people wrote "self modifying code". Say for doing bit-operations on parts of the screen buffer, you might pass 1 for AND, 2 for OR and 3 for XOR. The function could then place the AND/OR/XOR opcode in the middle of the doit loop and then perform the loop.... So one day the manufacturer guarantees that the new machine will execute everything the old one did. Bad move. Turns out the new machine is faster because it prefetches instructions. By the time the code has determined the opcode for inside the loop, the loop (still holding the AND/OR/XOR opcode left over from the previous call) has already been prefetched. This prefetching is at the core of why the machine is fast. Implemented in hardware. Can you fix that with a microcode update? Apparently in the case at hand (PR1ME9955): yes.
But I can easily see it happen that either you disable the whole prefetching stuff (slowing everything down enormously) or you need, say, an extra comparator ("Is the store happening near my PC, possibly near my prefetch queue?") so that "normal" cases keep using the prefetch queue and only this special case flushes it, when necessary. In any case, the microcode was updated and stuff worked properly again.
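For anyone who wants to see the stale-prefetch effect rather than take my word for it, here is a toy simulation in Python. It is only a sketch under made-up assumptions: the instruction set (LOAD, PATCH, AND/OR/XOR, HALT), the `run` and `blit_program` names, and the queue depth are all hypothetical and have nothing to do with the real PR1ME hardware or its microcode. The "loop" is reduced to a single patched instruction slot to keep it short.

```python
from collections import deque

# Bitwise operations the patched slot may hold.
OPS = {
    "AND": lambda a, b: a & b,
    "OR": lambda a, b: a | b,
    "XOR": lambda a, b: a ^ b,
}

def blit_program(opcode, old_opcode="AND"):
    """Self-modifying routine: instruction 1 overwrites instruction 2."""
    return [
        ("LOAD", 0b1100),                # acc = 0b1100
        ("PATCH", 2, (opcode, 0b1010)),  # store the wanted opcode into slot 2
        (old_opcode, 0b1010),            # slot 2: leftover from the previous call
        ("HALT",),
    ]

def run(program, prefetch_depth, snoop_stores):
    """Run on a toy CPU and return the accumulator.

    prefetch_depth: how many instructions are fetched ahead (1 = none).
    snoop_stores:   the "extra comparator"; flush the queue when a store
                    hits an address that has already been prefetched.
    """
    mem = list(program)  # code lives in ordinary writable memory
    queue = deque()      # prefetched (address, instruction) pairs
    fetch_pc = 0
    acc = 0
    while True:
        # Fetch stage: top up the prefetch queue from memory.
        while len(queue) < prefetch_depth and fetch_pc < len(mem):
            queue.append((fetch_pc, mem[fetch_pc]))
            fetch_pc += 1
        if not queue:
            break
        addr, instr = queue.popleft()
        op = instr[0]
        if op == "LOAD":
            acc = instr[1]
        elif op == "PATCH":
            target, new_instr = instr[1], instr[2]
            mem[target] = new_instr
            if snoop_stores and any(a == target for a, _ in queue):
                # Comparator hit: discard stale copies, refetch after us.
                queue.clear()
                fetch_pc = addr + 1
        elif op in OPS:
            acc = OPS[op](acc, instr[1])
        elif op == "HALT":
            break
    return acc

if __name__ == "__main__":
    prog = blit_program("XOR")
    print("old machine, no prefetch:      ", bin(run(prog, 1, False)))
    print("new machine, blind prefetch:   ", bin(run(prog, 4, False)))
    print("new machine, store comparator: ", bin(run(prog, 4, True)))
```

With no lookahead the patched XOR is what gets executed. With a four-deep queue and no comparator, the leftover AND was already fetched before the store landed, so the routine silently computes the wrong thing. Turning on the comparator flushes the queue only when a store actually hits a prefetched address, which is the "flush only when necessary" case described above.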