Slashdot Mirror


Tracking Down The AMD "Processor Bug"

tercero writes: "Over at the Gentoo Linux website there is an update on the AMD processor bug mentioned here. The upshot is that AMD claims it's not a bug with the Athlon processor, but with the motherboard. More detailed information can be found in this LKML post." An Anonymous Coward points to a similar explanation at Linux Weekly News. Update: 01/25 01:25 GMT by T : Daniel Robbins from Gentoo clarifies: "AMD is not calling this a 'motherboard' issue, it is an interaction between a feature of the Athlon called 'speculative writes' and the design of the GART, which is not cache-coherent. It's an 'Athlon/cache coherency/GART' problem, not a 'motherboard' problem."

19 of 237 comments

  1. Bug? by smack_attack · · Score: 4, Funny

    2+2=3.9999999999999999999999999999983774

    Oh wait, that was Intel.

  2. Think "Matrix" by SpookComix · · Score: 5, Funny
    AMD claims it's not a bug with the Athlon processor, but with the motherboard

    According to young bald children everywhere, "There is no bug".

    In related news, the motherboard manufacturers are quoted as saying, "It's not a bug with the motherboard, but with the Athlon processor."

    --SC

    --
    You read fiction? I write it! Lemme know what you th
  3. I work in software by Anonymous Coward · · Score: 4, Funny

    And it's never our program that has your bug.

    Meanwhile, we're feverishly fixing your bug in our software.

    "Yes sir, we've patched around the OS problem and this should get rid of that nasty bug you were seeing."

  4. Don't blame AMD entirely by ekrout · · Score: 5, Insightful

    Don't blame AMD entirely. They acknowledged the bug back in September of 2000 and immediately released patches for Windows 2000. Consequently, it doesn't affect users of Windows XP either. It's been around for over a year and now it's "news"? This should've been fixed in the Linux kernel months ago. Sorry for sounding so harsh.

    --

    If you celebrate Xmas, befriend me (538
  5. It's not a bug!! by ender-iii · · Score: 4, Funny

    It's an optimization for Windows XP!!

    --
    ender-iii
  6. Kernel parameter vs LILO config file by DragonHawk · · Score: 5, Informative

    The kernel will look for the parameter

    mem=nopentium

    and turn off 4MB pages (which may or may not prevent the problem from manifesting -- the situation is unclear at this time). You can do this at the boot prompt like this

    LILO boot: linux mem=nopentium

    or by placing the configuration directive

    append="mem=nopentium"

    in your /etc/lilo.conf configuration file.

    See the manual page for lilo.conf for the details.
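
    To confirm after a reboot that the option actually reached the kernel, one can look at /proc/cmdline. A minimal sketch follows; the helper names has_nopentium and read_cmdline are made up for illustration and are not part of any tool.

```python
def has_nopentium(cmdline: str) -> bool:
    """Return True if the kernel command line carries the mem=nopentium option."""
    # Kernel parameters are whitespace-separated tokens, so compare whole tokens
    # rather than substrings (this avoids matching e.g. "xmem=nopentium").
    return "mem=nopentium" in cmdline.split()


def read_cmdline(path: str = "/proc/cmdline") -> str:
    """Read the command line the running Linux kernel was booted with."""
    with open(path) as f:
        return f.read().strip()


if __name__ == "__main__":
    # Only meaningful on a Linux system where /proc is mounted.
    print("mem=nopentium active:", has_nopentium(read_cmdline()))
```

    This only shows that the parameter was passed; it says nothing about whether the option is enough to avoid the problem.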

    --

    dragonhawk@iname.microsoft.com
    I do not like Microsoft. Remove them from my email address.
  7. Re:this is something.. by NanoGator · · Score: 5, Funny

    Mac users don't have to worry about using the term 'Gigahertz' either.

    --
    "Derp de derp."
  8. More information by DragonHawk · · Score: 5, Informative

    Yesterday, information became widely available that described possible stability issues (system crashes, hangs, etc.) when using an AGP video card under Linux in conjunction with an AMD Athlon processor. It was generally called a "bug" in the Athlon CPU.

    More information is now available at http://www.gentoo.org, including an analysis of AMD's response. AMD's official response was posted to LKML, and is available at http://www.geocrawler.com/lists/3/Linux/35/175/7626960/.

    There is apparently some kind of bad interaction between the AGP GART ("Graphics Address Remapping Table", I think?), speculative memory operations performed by the Athlon processor, the memory mappings used by the kernel, and cache coherency. The details are beyond me, but the practical upshot appears to be that the wrong data ends up being written back to main memory at some point.

    I recommend reading the above LKML thread if you suspect you are affected by this issue. Information is still being uncovered, and it is not immediately clear how this occurs, what causes it, who is affected by it, and how to work around it.

    In particular, there is some uncertainty as to whether the "mem=nopentium" option actually prevents the problem, or merely makes it less likely to occur.

    --

    dragonhawk@iname.microsoft.com
    I do not like Microsoft. Remove them from my email address.
  9. All of the above. by Christopher+Thomas · · Score: 5, Informative

    AMD claims it's not a bug with the Athlon processor, but with the motherboard

    According to young bald children everywhere, "There is no bug".

    In related news, the motherboard manufacturers are quoted as saying, "It's not a bug with the motherboard, but with the Athlon processor."


    Funny, I didn't think I was bald...

    It's an Athlon bug if you think doing speculative writes is a bug.

    It's a motherboard chipset bug if you think that the AGP controller should play nicely with cache-coherence protocols (right now it doesn't, presumably to gain a speed boost).

    It's an OS bug if you think that the OS should be bright enough not to make AGP-touched memory cacheable (it wasn't intended to be).

    I'm voting for option 3), myself.

    1. Re:All of the above. by WNight · · Score: 4, Funny

      > Wow, Windows and Linux stricken by the same bug. What's the probability of that?

      Probably quite good. I imagine if you examine both systems carefully you'll see a BSD license agreement in the system binaries that deal with AGP. :)

  10. Don't cache it then! by Papineau · · Score: 4, Insightful

    From the LKML post linked in the story, it seems the cause is that some 4MiB pages (I couldn't work out why 4KiB pages aren't affected, if indeed they aren't) are allocated for the AGP (the GART, more specifically) with bits set that mark them as cacheable.
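
    For reference, the "bits" in question can be sketched with the flag layout of a 32-bit, non-PAE x86 page-directory entry. This is only an illustration of the page-table side: the effective memory type on real hardware also depends on the MTRRs (and PAT where present), and the helper names below are hypothetical.

```python
# Flag bits of a 32-bit (non-PAE) x86 page-directory entry.
PDE_PRESENT = 1 << 0  # P: entry is valid
PDE_PWT = 1 << 3      # PWT: page-level write-through
PDE_PCD = 1 << 4      # PCD: page-level cache disable
PDE_PS = 1 << 7       # PS: entry maps a 4 MiB page directly (needs CR4.PSE)


def is_cacheable_large_page(pde: int) -> bool:
    """True if the entry is a present 4 MiB mapping with caching left enabled."""
    present = bool(pde & PDE_PRESENT)
    large = bool(pde & PDE_PS)
    cache_disabled = bool(pde & PDE_PCD)
    return present and large and not cache_disabled


def make_uncached(pde: int) -> int:
    """Return the same entry with the cache-disable bit set."""
    return pde | PDE_PCD
```

    A present 4 MiB entry without PCD set counts as cacheable in this sketch; setting PCD on it is the page-table equivalent of "don't cache it then".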

    Why would somebody want to cache the AGP memory? I'm pretty sure it's used 99.99% of the time as write-only memory, because it's the main output method of most computers. What's the point of caching that? It can only prevent the use of the CPU cache by some more important things, no?

    Feel free to correct me if I'm wrong, I'm not very familiar with the usage of AGP memory (or GARTs).

  11. Re:It is (not?) a CPU bug. by tommck · · Score: 5, Insightful
    If the bug doesn't appear on intel chips, then how are we supposed to believe that it's not an AMD bug?

    Well, based on my reading of other posts, it is a simple case of AMD taking advantage of some features of AGP that are within spec and that Intel does not use. When the OS assumes that things are done Intel's way instead of adhering to the spec, problems will show up on an AMD processor and not on an Intel one.

    AMD is doing things correctly, albeit differently from Intel. This is exactly how we are supposed to believe that it's not an AMD bug.

    T

    --
    ---- It puts the lotion on its skin or else it gets the hose again. It does this whenever it's told.
  12. VM Implications? by mjh · · Score: 4, Insightful
    From the gentoo article, I found the following very interesting:
    Yesterday, Rik van Riel, William Lee Irwin and myself were able to discuss this issue of Athlon/AGP instability with AMD....

    ...But now that the problem is out in the open, the solution is clear. The Linux kernel's approach to memory management must become more sophisticated in order to address potential conflicts between the highly-speculative nature of Athlon processors and the non-cache-coherent AGP GART.

    When Linus switched to the AA VM, I got the impression that one of the key differences between the AA VM and the RvR VM is that Rik's VM is much more flexible, but with that flexibility comes complexity, which is why Linus switched to AA's VM. AA's was much simpler to understand and helped to stabilize the VM problems. Does the above quote mean that the AA VM isn't going to be able to handle the requirements to fix this bug? Is this a plug to put back RvR's VM?

    I'm not trying to start a flame war here, just want to understand if I understood what the final paragraph was saying. Please mod me down if I'm way off base, but help me understand too!

    --
    Key to financial independence: Spend less than you earn. Save and invest the difference. Do it for a long time.
    1. Re:VM Implications? by WNight · · Score: 4, Interesting

      I think most people see the VM as eventually becoming quite complex. Profiling memory and disk usage (well, having hooks to allow the disk cache to cache based on memory use) allows you to guess when something will be needed and not page it out if it's needed immediately, or to page out something because you know it's not going to be needed for a while.

      And eventually, all memory management systems will run out of memory (even with a reserved cache, the OS can still grow beyond safety margins) and must either stall or kill processes. While some people feel that Rik is focusing a little heavily on the killing-processes side, it is something you have to be prepared to do, so you want to kill a less useful task (a forked apache server, not the main process, for example) instead of killing something critical to operation.

      You can usually come up with a simple solution that covers 95% of the cases very well, but it'll fall apart on that last 5% in a bad way. The complex solutions often offer lower performance in everyday situations but guarantee performance will never get as bad as the easy solutions would allow.

      So, I think anyone with design experience expects Rik's VM (or one like it) to go back into the kernel eventually.

      Personally, I think Rik should look at the issue of having "Emergency" swap that you don't go into except for OS processes. Once main swap is filled all non-OS processes fail to allocate any new RAM. This lets the system function well enough for non-kernel code (ideally more customizable) to make a system-specific determination on how to proceed. For instance, kill any processes from /usr/bin/games and see if that helps the issue... But, I'll admit to not being an expert and that this is only an educated guess.

  13. Re:this is not a motherboard bug either... by barawn · · Score: 4, Interesting

    Interestingly enough, this feature of AGP is not really critical to increasing performance in games - in fact, it could be counterproductive to it.

    The AGP GART (Graphics Address Remapping Table, I believe) maps "video card memory addresses" to "main memory addresses", i.e., it's to allow the graphics card to grab textures, etc. directly from main memory without going through the CPU.
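
    As a rough picture of that remapping, here is a toy model: a flat table that turns an address inside the AGP aperture into a main-memory address. The 4 KiB entry size, the ToyGart class, and the example addresses are all assumptions made for the sketch, not details of any particular chipset.

```python
PAGE_SIZE = 4096  # assumed size of the page covered by one table entry


class ToyGart:
    """Toy remapping table: aperture page index -> physical page base."""

    def __init__(self, aperture_base: int, num_pages: int):
        self.aperture_base = aperture_base
        self.entries = [None] * num_pages

    def map_page(self, index: int, phys_page: int) -> None:
        """Point one aperture page at a page-aligned physical address."""
        if phys_page % PAGE_SIZE:
            raise ValueError("physical page must be page-aligned")
        self.entries[index] = phys_page

    def translate(self, aperture_addr: int) -> int:
        """Translate an aperture address to the main-memory address behind it."""
        offset = aperture_addr - self.aperture_base
        if offset < 0 or offset >= len(self.entries) * PAGE_SIZE:
            raise LookupError("address is outside the aperture")
        index, within_page = divmod(offset, PAGE_SIZE)
        if self.entries[index] is None:
            raise LookupError("aperture page is not mapped")
        return self.entries[index] + within_page
```

    The point of the table is that scattered physical pages look contiguous from the graphics card's side: neighbouring aperture pages can map to unrelated places in main memory.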

    Many motherboard manufacturers use this feature to provide on-board video without any dedicated memory so they don't have to include any additional memory for the graphics card.

    Of course, since this blows so massively performance-wise, it's mostly abandoned now.

    Is the GART actually useful for anything except extending the video card's onboard memory? I'm not really sure...

  14. You are assuming... by Arker · · Score: 5, Insightful

    You are assuming that AMD's current explanation is 100% true, correct, and complete. There are good reasons to doubt this.


    The "explanation" so far has just raised more questions. Why does the same code that causes the athlon to crash work fine on pentiums? Apparently the GART is cacheable on pentium systems? And the Athlon is billed as pentium-compatible...


    Why does disabling large pages fix the problem? If their explanation is correct, that fix should not work, because it doesn't address the issue they claim to be the problem.


    I'm sure this will get worked around in software (and the linux fix will actually work around the underlying problem, rather than just making it less likely as the windows world seems to be satisfied with) once the real details of this are known. But to claim it's not a hardware bug is ludicrous. It's a bug with the Athlon CPU, or with certain GARTS found in Athlon chipsets, or both. If AMD were less worried about spin-controlling it and claiming it's the software at fault maybe they would be more forthcoming about what is really going on here.

    --
    =-=-=-=-=-=-=-=-=-=-=-=-=-=-
    Friends don't let friends enable ecmascript.
    1. Re:You are assuming... by Salamander · · Score: 5, Interesting
      Why does the same code that causes the athlon to crash work fine on pentiums? Apparently the GART is cacheable on pentium systems? And the Athlon is billed as pentium-compatible...

      There are different types and levels of compatibility. The Athlon claims base-instruction-set and register compatibility with the Pentium, but it's not pin-compatible and may also differ in any number of behavioral/timing characteristics. This is one such case. The behavior in question is perfectly acceptable within the bounds of the compatibility and standards compliance that AMD claims.

      Why does disabling large pages fix the problem?

      Because it's the large pages that are (incorrectly) marked as cacheable. No large pages, no incorrect mappings, no problem.

      But to claim it's not a hardware bug is ludicrous. It's a bug with the Athlon CPU, or with certain GARTS found in Athlon chipsets, or both.

      Nope. It's a bug in the OS. Anyone who works with memory systems should know the dangers inherent in mixing cache-coherent and non-coherent accesses to the same memory, and should mark pages accordingly.

      It's very tempting to criticize AMD for their handling of speculative writes, but that handling is really irrelevant. It seems to me that the cache line's contents should not be marked dirty before the processor has actually written to it (which in this case it never does). Under normal conditions, though, this would only be a performance issue. If a coherent access were made from elsewhere, invalidation and writeback would ensue; the writeback would be unnecessary but not harmful, because it would be writing the same data that were already in main memory. However, the cache wouldn't be involved in the first place if the pages were mapped correctly. There would be no write-allocate, no invalidation, no writeback, and no problem. The invalid mapping turns a slightly silly but legal and normally-harmless processor behavior into a serious coherency problem.
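
      The sequence can be played through in a toy model. This is a sketch of the reasoning only, under loud assumptions: one value per address, a write-back cache that marks a line dirty on a speculative write-allocate, and a second path to memory that never snoops the cache. It is not a description of the Athlon's real microarchitecture, and all names are invented for the example.

```python
class ToyMemorySystem:
    """Main memory plus a write-back cache that is not snooped by everyone."""

    def __init__(self):
        self.ram = {}    # addr -> value in main memory
        self.cache = {}  # addr -> [value, dirty]

    def speculative_write_allocate(self, addr, cacheable):
        """CPU prepares a write that never retires; the data is unchanged."""
        if not cacheable:
            return  # correctly mapped uncached: the cache is never involved
        # The line is fetched and (in this model) marked dirty straight away.
        self.cache[addr] = [self.ram.get(addr, 0), True]

    def noncoherent_write(self, addr, value):
        """A write that reaches memory without snooping the cache."""
        self.ram[addr] = value

    def evict(self, addr):
        """Evict a line; dirty lines are written back over main memory."""
        line = self.cache.pop(addr, None)
        if line is not None and line[1]:
            self.ram[addr] = line[0]


def run(cacheable):
    """Return what ends up in main memory for one address."""
    m = ToyMemorySystem()
    m.ram[0x1000] = 0xAA                             # original contents
    m.speculative_write_allocate(0x1000, cacheable)  # line cached, marked dirty
    m.noncoherent_write(0x1000, 0xBB)                # new data goes behind the cache
    m.evict(0x1000)                                  # stale line written back, if any
    return m.ram[0x1000]
```

      With the cacheable mapping, run(True) leaves the stale 0xAA in main memory because the writeback clobbers the newer data; with the uncached mapping, run(False) leaves the 0xBB intact.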

      --
      Slashdot - News for Herds. Stuff that Splatters.
    2. Re:You are assuming... by Dahan · · Score: 5, Informative
      Apparently the GART is cacheable on pentium systems?

      There are Pentium systems with an AGP port? If you mean the Pentium II and up, I don't see why the GART would be cacheable there either; I don't know if the P4 chipsets have changed things, but with the PII and PIII, here's what Intel had to say about the subject:

      For current hardware implementations, the OS will make AGP memory (like other video memory) non-cacheable, so that there is no coherency problem between the CPU caches and the data that the graphics controller uses. Otherwise, graphics controller accesses to AGP memory would require "snooping" the CPU caches, which would cause delays in execution in some cases.

      -- AGP and Graphics Optimization Techniques

      (Emphasis added). As for why the bug doesn't happen on Intel CPUs, it sounds like the Athlon has more aggressive speculative writes and can change memory that wasn't explicitly written to, dirtying the cache. But in any case, even on Intel CPUs, the AGP area is supposed to be mapped non-cacheable.

      Why does disabling large pages fix the problem?

      Don't know about that one; I haven't read the various tech docs for the Athlon. Perhaps the cache works slightly differently with 4MB pages vs 4KB pages?

  15. In the spirit of... by Refrag · · Score: 5, Funny

    ...Slashdotters that always point out their favorite OS isn't vulnerable to a particular bug.

    My Macintosh isn't affected by this bug due to its PowerPC processor.

    --
    I have a website. It's about Macs.