Tracking Down The AMD "Processor Bug"

Posted by ryuzaki0 on Thursday January 24, 2002 @07:23AM from the complex-causes dept.

tercero writes: "Over at the Gentoo Linux website there is an update on the AMD processor bug mentioned here. The summary is that AMD claims it's not a bug with the Athlon processor, but with the motherboard. More detailed information can be found in this LKML post." An Anonymous Coward points to a similar explanation at Linux Weekly News. Update: 01/25 01:25 GMT by T : Daniel Robbins from Gentoo clarifies: "AMD is not calling this a 'motherboard' issue, it is an interaction between a feature of the Athlon called 'speculative writes' and the design of the GART, which is not cache-coherent. It's an 'Athlon/cache coherency/GART' problem, not a 'motherboard' problem."

7 of 237 comments

Re:Kernel parameter vs LILO config file by Anonymous Coward · 2002-01-24 07:55 · Score: 1, Interesting

> mem=nopentium

This does NOT help.

Only Option "NvAGP" "0" solves the TuxRacer problems. Go read the linked docs and you will understand why.
It is (not?) a CPU bug. by crandall · 2002-01-24 07:56 · Score: 2, Interesting

If the bug doesn't appear on Intel chips, then how are we supposed to believe that it's not an AMD bug? Sure, we could blame the motherboard... but wouldn't that mean VIA/Intel solutions would carry the same issue?

Anyone have any knowledge as to how Intel treats these 4 MB pages differently?

I mean, if the bug is caused by AMD's precaching of AGP GART-mapped memory, and Intel just doesn't precache that memory, then how is it NOT an AMD processor bug?

When two processors aren't equal, there has to be a reason for the difference in running software.

(Note that I prefer AMD, so I'm just looking for answers, not trolling).
Re:Easy - Buy Intel. The cost of using 2nd party.. by SirSlud · 2002-01-24 08:19 · Score: 3, Interesting

>Lower costs typically means lower performance

What planet are you from? Lower costs (in the case of demonstrated similarity in performance) typically mean lower demand and lower consumer valuation of the brand name, which means a smaller user base, which means that it generally takes longer to run into compatibility flaws.

For instance, Nike is more expensive than Puma. Does that mean Nike shoes are better? Of course not; it means people are more willing to buy Nike, because they perceive that the brand gives them additional value. In the world of shoes, that value is conformity and fashion .. in CPUs, it's the value of a larger consumer base, which essentially means latent design flaws are more likely to have been found already (i.e., they exist in the costlier platform as well, but are found earlier because of the larger user base), plus the value of being in the same boat as everyone else should a product fail in some fashion.

The funniest thing is you're talking about performance. Performance is how well something works when it works. When it /doesn't/ work, that's not performance; it's either compatibility with the outside world or a design flaw. Anyhow, I feel sorry for your view, because I guess you're paying a lot of money for brand security .. but every in-the-know computer geek I know (I'm a C++ developer, so I'm not talking tech fanboys here) knows that you'd have to enjoy wasting money to justify buying Intel CPUs at this point in time.

Lest you cite this situation as a reason why I might be wrong .. it has already been fixed in Windows, and there is a known Linux workaround. So really, there's not much of an issue, and my AMD chip still cost me half the price of an Intel CPU, and benchmarks faster than the Intel, to boot! Keep buying your Nikes! I just want the shoe. :)

--
"Old man yells at systemd"
Re:this is not a motherboard bug either... by barawn · 2002-01-24 08:24 · Score: 4, Interesting

Interestingly enough, this feature of AGP is not really critical to increasing performance in games - in fact, it could be counterproductive to it.

The AGP GART (Graphics Address Remapping Table, I believe) maps "video card memory addresses" to "main memory addresses", i.e., it's to allow the graphics card to grab textures, etc. directly from main memory without going through the CPU.
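
To make the remapping idea concrete, here is a minimal sketch in Python of a table that translates "video card" addresses into main-memory addresses. It is only an illustration of the concept described above: the class name `Gart`, the 4 KB page size, the aperture base address, and the frame numbers are all assumptions, not details of any real chipset.

```python
PAGE_SIZE = 4096  # assumed page size for the illustration


class Gart:
    """Toy remapping table: aperture page number -> physical page frame."""

    def __init__(self, aperture_base, num_pages):
        self.aperture_base = aperture_base
        self.table = [None] * num_pages  # one entry per aperture page

    def map_page(self, aperture_page, physical_frame):
        # Point one aperture page at an arbitrary page of main memory.
        self.table[aperture_page] = physical_frame

    def translate(self, address):
        # Split the aperture address into a page index and an offset.
        page, offset = divmod(address - self.aperture_base, PAGE_SIZE)
        if not 0 <= page < len(self.table) or self.table[page] is None:
            raise LookupError(f"address {address:#x} is not mapped")
        return self.table[page] * PAGE_SIZE + offset


gart = Gart(aperture_base=0xE0000000, num_pages=4)
# Scattered physical frames look contiguous from the graphics card's side.
gart.map_page(0, 0x1234)
gart.map_page(1, 0x0042)
texture_address = gart.translate(0xE0000000 + PAGE_SIZE + 0x10)
print(hex(texture_address))  # 0x42010
```

The point of the table is that the card sees one contiguous aperture while the pages behind it can live anywhere in main memory.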

Many motherboard manufacturers use this feature to provide on-board video that borrows main memory, so they don't have to include any dedicated memory for the graphics card.

Of course, since this blows so massively performance-wise, it's mostly abandoned now.

Is the GART actually useful for anything except extending the video card's onboard memory? I'm not really sure...
Re:This is embarassing by whovian · 2002-01-24 08:31 · Score: 2, Interesting

how was AMD supposed to know that Linux was doing the same bad thing-

How did AMD know that Windows-* was doing the bad things? I guess it didn't occur to AMD to download and inspect the kernel source code or talk to the Linux kernel mailing list(s) and developers? It seems to me that that is effectively what they would have had to do with Microsoft.

OFF-TOPIC: This sort of touches on the point another poster made 1-2 weeks ago or so. I will probably recall the specific accusations incorrectly (and hence get flamed), but the gist of that post was the hypothesis that AMD has a loyal following of users, in particular Linux users, and it would be nice if AMD reciprocated a little in recognition of that. I am largely ignorant of AMD's contributions to the community per se, so put the flame on a low setting, ok?, as I am an AMD newlywed myself :)

--
To-do List: Receive telemarketing call during a tornado warning. Check.
Re:You are assuming... by Salamander · 2002-01-24 09:26 · Score: 5, Interesting

Why does the same code that causes the athlon to crash work fine on pentiums? Apparently the GART is cacheable on pentium systems? And the Athlon is billed as pentium-compatible...

There are different types and levels of compatibility. The Athlon claims base-instruction-set and register compatibility with the Pentium, but it's not pin-compatible and may also differ in any number of behavioral/timing characteristics. This is one such case. The behavior in question is perfectly acceptable within the bounds of the compatibility and standards compliance that AMD claims.

Why does disabling large pages fix the problem?

Because it's the large pages that are (incorrectly) marked as cacheable. No large pages, no incorrect mappings, no problem.

But to claim it's not a hardware bug is ludicrous. It's a bug with the Athlon CPU, or with certain GARTs found in Athlon chipsets, or both.

Nope. It's a bug in the OS. Anyone who works with memory systems should know the dangers inherent in mixing cache-coherent and non-coherent accesses to the same memory, and should mark pages accordingly.

It's very tempting to criticize AMD for their handling of speculative writes, but that handling is really irrelevant. It seems to me that the cache line's contents should not be marked dirty before the processor has actually written to it (which in this case it never does). Under normal conditions, though, this would only be a performance issue. If a coherent access were made from elsewhere, invalidation and writeback would ensue; the writeback would be unnecessary but not harmful, because it would be writing the same data that were already in main memory. However, the cache wouldn't be involved in the first place if the pages were mapped correctly. There would be no write-allocate, no invalidation, no writeback, and no problem. The invalid mapping turns a slightly silly but legal and normally-harmless processor behavior into a serious coherency problem.
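
The sequence described above can be sketched as a toy simulation in Python. This is a deliberately simplified model, not a description of real Athlon or GART behavior: the `ToyCache` class, the single-word "lines", the address `0x1000`, and the string values are all made up for illustration. It only shows why a dirty line that was never really written becomes harmful once another access bypasses the cache.

```python
class ToyCache:
    """Write-back cache with single-word 'lines'; not a model of real hardware."""

    def __init__(self, memory):
        self.memory = memory
        self.lines = {}  # address -> [value, dirty]

    def speculative_write(self, address):
        # Write-allocate: fetch the line and mark it dirty even though the
        # write is never carried out, so the line holds the old memory value.
        self.lines[address] = [self.memory[address], True]

    def evict(self, address):
        # Write the line back if it is dirty, then drop it.
        value, dirty = self.lines.pop(address)
        if dirty:
            self.memory[address] = value


def run(cacheable):
    memory = {0x1000: "old texture"}
    cache = ToyCache(memory)
    if cacheable:
        # Page wrongly mapped cacheable: the speculative write allocates a line.
        cache.speculative_write(0x1000)
    # Non-coherent write that bypasses the cache (no snoop, no invalidation).
    memory[0x1000] = "new texture"
    # Later the cache evicts whatever it holds for that address.
    if 0x1000 in cache.lines:
        cache.evict(0x1000)
    return memory[0x1000]


print(run(cacheable=True))   # old texture: stale write-back clobbers new data
print(run(cacheable=False))  # new texture: no cache line, nothing to write back
```

With the page mapped uncacheable, the cache never holds a line for that address, which is the "no write-allocate, no invalidation, no writeback" case from the comment.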

--
Slashdot - News for Herds. Stuff that Splatters.
Re:VM Implications? by WNight · 2002-01-24 10:24 · Score: 4, Interesting

I think most people see the VM as eventually becoming quite complex. Profiling memory and disk usage (well, having hooks to allow the disk cache to cache based on memory use) allows you to guess when something will be needed and not page it out if it's needed immediately, or to page out something because you know it's not going to be needed for a while.

And eventually, every memory management system will hit an out-of-memory condition (even with a reserved cache, the OS can still grow beyond safety margins) and will have to either stall or kill processes. While some people feel that Rik is focusing a little heavily on the process-killing side, it is something you have to be prepared to do, so you want to kill a less useful task (a forked Apache server, not the main process, for example) instead of killing something critical to operation.
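
As a rough illustration of "kill a less useful task first", here is a small Python sketch of victim selection. It is not how the Linux OOM killer actually scores processes; the `Process` fields, the `critical` and `is_worker` flags, and the ranking rule are all hypothetical and chosen only to mirror the Apache example above.

```python
from dataclasses import dataclass


@dataclass
class Process:
    pid: int
    name: str
    memory_kb: int
    critical: bool = False   # hypothetical flag: never kill these
    is_worker: bool = False  # hypothetical flag: a forked child, cheap to lose


def pick_victim(processes):
    """Return the process to kill, or None if every process is critical."""
    candidates = [p for p in processes if not p.critical]
    if not candidates:
        return None
    # Prefer expendable workers, then whichever frees the most memory.
    return max(candidates, key=lambda p: (p.is_worker, p.memory_kb))


procs = [
    Process(1, "init", 500, critical=True),
    Process(100, "apache (parent)", 9000),
    Process(101, "apache (worker)", 6000, is_worker=True),
    Process(102, "apache (worker)", 7000, is_worker=True),
]
victim = pick_victim(procs)
print(victim.pid)  # 102
```

The parent uses the most memory here, but a worker is chosen anyway because losing it costs the least.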

You can usually come up with a simple solution that covers 95% of the cases very well, but it'll fall apart on that last 5% in a bad way. The complex solutions often offer lower performance in everyday situations but guarantee performance will never get as bad as the easy solutions would allow.

So, I think anyone with design experience expects Rik's VM (or one like it) to go back into the kernel eventually.

Personally, I think Rik should look at the issue of having "Emergency" swap that you don't go into except for OS processes. Once main swap is filled, all non-OS processes fail to allocate any new RAM. This lets the system function well enough for non-kernel code (ideally more customizable) to make a system-specific determination on how to proceed. For instance, kill any processes from /usr/bin/games and see if that helps the issue... But, I'll admit to not being an expert and that this is only an educated guess.
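
A minimal Python sketch of the "emergency swap" idea, purely as an illustration of the accounting involved. The `SwapPolicy` class, the page counts, and the `is_os_process` flag are hypothetical; nothing here reflects an existing kernel interface.

```python
class SwapPolicy:
    """Toy accounting for main swap plus a reserve only OS processes may use."""

    def __init__(self, main_pages, emergency_pages):
        self.main_free = main_pages
        self.emergency_free = emergency_pages

    def allocate(self, pages, is_os_process):
        # Everyone draws from main swap first.
        if pages <= self.main_free:
            self.main_free -= pages
            return True
        # Main swap cannot satisfy the request: only OS processes may take
        # the rest of main swap plus part of the emergency reserve.
        if is_os_process and pages <= self.main_free + self.emergency_free:
            self.emergency_free -= pages - self.main_free
            self.main_free = 0
            return True
        return False


swap = SwapPolicy(main_pages=100, emergency_pages=20)
results = [
    swap.allocate(90, is_os_process=False),  # fits in main swap
    swap.allocate(20, is_os_process=False),  # refused: main swap too full
    swap.allocate(20, is_os_process=True),   # allowed: dips into the reserve
    swap.allocate(5, is_os_process=False),   # refused: main swap exhausted
]
print(results)  # [True, False, True, False]
```

Once ordinary allocations start failing, the reserve leaves room for whatever system-specific cleanup step is configured to run.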