Slashdot Mirror


Tracking Down The AMD "Processor Bug"

tercero writes: "Over at the Gentoo Linux website there is an update on the AMD processor bug mentioned here. In short, AMD claims it's not a bug with the Athlon processor, but with the motherboard. More detailed information can be found on this LKML post." An Anonymous Coward points to a similar explanation at Linux Weekly News. Update: 01/25 01:25 GMT by T : Daniel Robbins from Gentoo clarifies: "AMD is not calling this a 'motherboard' issue, it is an interaction between a feature of the Athlon called 'speculative writes' and the design of the GART, which is not cache-coherent. It's an 'Athlon/cache coherency/GART' problem, not a 'motherboard' problem."

11 of 237 comments

  1. Don't blame AMD entirely by ekrout · · Score: 5, Insightful

    Don't blame AMD entirely. They acknowledged the bug back in September of 2000 and immediately released patches for Windows 2000. Consequently, it doesn't affect users of Windows XP either. It's been around for over a year and now it's "news"? This should've been fixed in the Linux kernel months ago. Sorry for sounding so harsh.

    --

    If you celebrate Xmas, befriend me (538
  2. This is embarrassing by jidar · · Score: 3, Insightful

    This is embarrassing to the Linux community as a whole, and it also explains why I've had problems with crashes on two different systems running Linux and Athlons.

    What I don't understand is how this could have made it so far. This is exactly the sort of problem I have been telling people we don't have in the Linux world, and now it looks like I was wrong. Is this pointing out an underlying problem we have with QA in the Linux kernel? With Open Source in general? What can we do to make sure that bugs of this magnitude are detected more quickly?

    --
    Sigs are awesome huh?
    1. Re:This is embarrassing by Grelli · · Score: 2, Insightful
      This is embarrassing to the Linux community as a whole

      Actually, it isn't embarrassing at all. It wasn't the "Linux Community"'s fault. This is the fault of AMD, who announced/classified the bug as a Windows 2000 issue instead of a hardware issue. Many posters have pointed out that kernel hackers probably don't follow hardware bug reports for OTHER operating systems.

      The failing was on AMD's part, and nobody else's. But don't get me wrong, I love AMD, and this won't change my overall opinion of them. If things like this continually happen, then I may have to reconsider. But if this is a one-time thing, I'm not going to get overly mad, and I hope no one else does either.

    2. Re:This is embarrassing by Dahan · · Score: 3, Insightful
      How did AMD know that Windows-* was doing the bad things?

      Maybe because Microsoft reported the problem to them and asked for help?

      Perhaps I worded my question poorly--why would AMD even think that Linux had the same bug as Windows 2000? Whenever you see a Windows bug, do you usually wonder if Linux has the same bug? They're completely different codebases, and there's no reason to think that a bug in one OS would be present in the other.

  3. Re:A kernel bug -- not a motherboard bug by heretic · · Score: 2, Insightful

    Sheesh! Read the above article where it states "...AMD claims it's not a bug with the Athlon processor, but with the motherboard". AMD is claiming no such thing! They are claiming it's a Linux kernel bug.

  4. Don't cache it then! by Papineau · · Score: 4, Insightful

    From the LKML post linked in the story, it seems the cause is that some 4MiB pages are allocated for the AGP (the GART, more specifically) with bits set marking them as cacheable. (I couldn't work out why 4KiB pages aren't affected, if indeed they aren't.)

    Why would somebody want to cache the AGP memory? I'm pretty sure it's used 99.99% of the time as write-only memory, because it's the main output method of most computers. What's the point of caching that? It can only prevent the use of the CPU cache by some more important things, no?

    Feel free to correct me if I'm wrong, I'm not very familiar with the usage of AGP memory (or GARTs).
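
A rough illustration of the page-table bits the comment above refers to, not taken from the LKML post. On 32-bit x86 without PAE, a page-directory entry with the PS bit set maps a 4MiB page directly, and the PCD bit disables caching for what that entry maps. The helper name `describe_pde` and the example entry values are made up for illustration.

```python
# Flag bits in a 32-bit x86 page-directory entry (non-PAE paging).
PDE_PRESENT = 1 << 0  # P: entry is valid
PDE_PWT = 1 << 3      # PWT: write-through caching
PDE_PCD = 1 << 4      # PCD: caching disabled
PDE_PS = 1 << 7       # PS: entry maps a 4MiB page directly

LARGE_PAGE_SIZE = 4 * 1024 * 1024
SMALL_PAGE_SIZE = 4 * 1024


def describe_pde(pde):
    """Decode the size and cache flags from one page-directory entry."""
    large = bool(pde & PDE_PS)
    return {
        "present": bool(pde & PDE_PRESENT),
        "maps_large_page": large,
        # With PS=1, bits 31..22 hold the 4MiB frame base; with PS=0 the entry
        # points at a page table whose entries carry their own cache bits.
        "base": pde & (0xFFC00000 if large else 0xFFFFF000),
        "cache_disabled": bool(pde & PDE_PCD),
    }


# Hypothetical entries: the same 4MiB frame, once cacheable and once uncached.
cached = describe_pde(0x00400083)
uncached = describe_pde(0x00400093)
print(cached["maps_large_page"], cached["cache_disabled"])
print(uncached["cache_disabled"])
print(LARGE_PAGE_SIZE // SMALL_PAGE_SIZE)
```

One 4MiB entry covers 1024 times as much memory as a 4KiB entry, so its single PCD bit decides cacheability for the whole range at once.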

  5. The Nature of the Bug by 4of12 · · Score: 3, Insightful

    Hmmmm.

    Is the Bug...

    • (A) In the Athlon cache?
    • (B) In the chipset?
    • (C) In the AGP-using devices misusing memory?
    • (D) In the Linux kernel?
    Well, AFAICT, the real bug is in the communication of relevant knowledge.

    These kinds of bugs would have significantly shorter duration if the specifications for all four possible culprits in (A)-(D) were openly published, completely, for all to see.

    --
    "Provided by the management for your protection."
  6. Re:It is (not?) a CPU bug. by tommck · · Score: 5, Insightful
    If the bug doesn't appear on intel chips, then how are we supposed to believe that it's not an AMD bug?

    Well, based on my reading of other posts, it is a simple case of AMD taking advantage of some features of AGP that are within spec and that Intel does not use. When the OS assumes that things are done Intel's way instead of adhering to the spec, problems will show up on an AMD processor and not on an Intel.

    AMD is doing things correctly, albeit differently from Intel. This is exactly how we are supposed to believe that it's not an AMD bug.

    T

    --
    ---- It puts the lotion on its skin or else it gets the hose again. It does this whenever it's told.
  7. Re:All of the above. by Stiletto · · Score: 3, Insightful

    It's an OS bug if you think that the OS should be bright enough not to make AGP-touched memory cacheable (it wasn't intended to be).

    I'm voting for option 3), myself.


    I thought one of the main benefits of AGP was the ability to remap a bunch of non-contiguous physical blocks into one address space, so the entire bunch could be marked as cacheable (for instance when DMA'ing a bunch of vertices across the bus).
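
A toy model of the remapping idea described above, assuming a flat table with one entry per 4KiB aperture page. The `Gart` class, the aperture base, and the physical addresses are invented for illustration; real GART hardware and its driver interface differ.

```python
PAGE_SIZE = 4096


class Gart:
    """Toy remapping table: aperture page index -> physical page address."""

    def __init__(self, aperture_base, num_pages):
        self.aperture_base = aperture_base
        self.table = [None] * num_pages

    def bind(self, index, phys_page):
        """Point one aperture page at a page-aligned physical address."""
        if phys_page % PAGE_SIZE:
            raise ValueError("physical page must be page-aligned")
        self.table[index] = phys_page

    def translate(self, addr):
        """Map an address inside the aperture to its backing physical address."""
        offset = addr - self.aperture_base
        if not 0 <= offset < len(self.table) * PAGE_SIZE:
            raise ValueError("address is outside the aperture")
        index, within = divmod(offset, PAGE_SIZE)
        phys = self.table[index]
        if phys is None:
            raise LookupError("aperture page is not bound")
        return phys + within


# Two scattered physical pages appear back to back in the aperture.
gart = Gart(aperture_base=0xE0000000, num_pages=4)
gart.bind(0, 0x01234000)
gart.bind(1, 0x00042000)
print(hex(gart.translate(0xE0000010)))
print(hex(gart.translate(0xE0001008)))
```

In this model the device sees one contiguous range, while the same physical pages remain reachable at their original addresses, which is where two mappings with different cache attributes can coexist.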

  8. VM Implications? by mjh · · Score: 4, Insightful
    From the gentoo article, I found the following very interesting:
    Yesterday, Rik van Riel, William Lee Irwin and myself were able to discuss this issue of Athlon/AGP instability with AMD....

    ...But now that the problem is out in the open, the solution is clear. The Linux kernel's approach to memory management must become more sophisticated in order to address potential conflicts between the highly-speculative nature of Athlon processors and the non-cache-coherent AGP GART.

    When Linus switched to the AA VM, I got the impression that one of the key differences between the AA VM and the RvR VM is that Rik's VM is much more flexible, but with that flexibility comes complexity, which is why Linus switched to AA's VM. AA's was much simpler to understand and helped to stabilize the VM problems. Does the above quote mean that the AA VM isn't going to be able to handle the requirements to fix this bug? Is this a plug to put back RvR's VM?

    I'm not trying to start a flame war here, just want to understand if I understood what the final paragraph was saying. Please mod me down if I'm way off base, but help me understand too!

    --
    Key to financial independence: Spend less than you earn. Save and invest the difference. Do it for a long time.
  9. You are assuming... by Arker · · Score: 5, Insightful

    You are assuming that AMD's current explanation is 100% true, correct, and complete. There are good reasons to doubt this.


    The "explanation" so far has just raised more questions. Why does the same code that causes the Athlon to crash work fine on Pentiums? Apparently the GART is cacheable on Pentium systems? And the Athlon is billed as Pentium-compatible...


    Why does disabling large pages fix the problem? If their explanation is correct, that fix should not work, because it doesn't address the issue they claim to be the problem.


    I'm sure this will get worked around in software (and the Linux fix will actually work around the underlying problem, rather than just making it less likely as the Windows world seems to be satisfied with) once the real details of this are known. But to claim it's not a hardware bug is ludicrous. It's a bug with the Athlon CPU, or with certain GARTs found in Athlon chipsets, or both. If AMD were less worried about spin-controlling it and claiming it's the software at fault, maybe they would be more forthcoming about what is really going on here.

    --
    =-=-=-=-=-=-=-=-=-=-=-=-=-=-
    Friends don't let friends enable ecmascript.