Slashdot Mirror


Major Linux/Athlon CPU bug discovered

GeorgeFrancisco writes "I recently installed the nVidia drivers so I could play TuxRacer on my Athlon. Problem is it kept inexplicably hanging Linux. Now I know why. The CPU bug affects Athlon/Duron/Athlon MP AGP users. Fortunately there's a way around it, and: "Alan [Cox] is going to try to add some kind of Athlon/AGP CPU bug detection code to the kernel so that it will be able to auto-downgrade to 4K pages when necessary." Read more on the Gentoo Linux site."

32 of 402 comments

  1. For once Microsoft managed to fix it first by bob1000 · · Score: 2, Informative
    1. Re:For once Microsoft managed to fix it first by Skuto · · Score: 2, Informative

      >Why place all the blame on AMD? If you write
      >pentium-optimized code, what's so surprising if it
      >won't work exactly right on an AMD?

      It has _nothing_ _whatsoever_ to do with Pentium-optimized code. It's a new feature that both Intel and AMD CPUs support. Or, in AMD's case, are supposed to support.

      --
      GCP

    2. Re:For once Microsoft managed to fix it first by Anonymous Coward · · Score: 1, Informative
      The question is if AMD documented this bug in their errata, or just fixed for Windows 2000 and figured that was good enough.

      AFAICT from AMD's Technical Resources, the patch is all there is. So AMD is in fact concealing the bug, trying to make it look like a tiny "registry problem".

      Sorry, but that was a bad move, guys. Much worse than the Pentium bug thingy, which was rather theoretical, anyway.

  2. And we were blaming the NVIDIA drivers... by npietraniec · · Score: 2, Informative

    It really shows up if you use the pre-empt kernel patch. Ever since I added the workaround, things have been pretty solid. (not that it's been that long)

  3. Don't think so by Metrollica · · Score: 2, Informative

    I don't think so. AMD reverse engineered the x86 and made their own implementation without Intel's crap in it.

    AMD's version of the x86 that is in the Athlon and the Duron runs faster than Intel's chips because of this reverse engineering.

    This bug could be a problem of reverse engineering the x86. It doesn't say Intel's chips have the problem.

    --
    --Metrollica
  4. Re:Nice write-up. by bob1000 · · Score: 2, Informative

    Add 'mem=nopentium' to your lilo/grub/whatever boot options, or compile the kernel for i386 to avoid extended CPU operations. The fault is something in the page size extension and AGP, which is strange because I thought AGP would be more of a chipset issue than a processor one.
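
    A minimal sketch for checking whether a kernel command line already carries the option. The helper name has_nopentium is hypothetical, and the sample command line is made up for illustration; on a running Linux system the real string can be read from /proc/cmdline.

```python
def has_nopentium(cmdline: str) -> bool:
    """Return True if 'mem=nopentium' appears as its own boot option."""
    # Boot options are whitespace-separated, so compare whole tokens
    # rather than searching for a substring.
    return "mem=nopentium" in cmdline.split()


# Hypothetical command line, as a boot loader might pass it.
sample = "root=/dev/hda1 ro mem=nopentium"
print(has_nopentium(sample))
```

    Matching whole tokens avoids a false positive on a longer option that merely contains the same text.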

  5. Another mirror/summary here by Afrosheen · · Score: 3, Informative

    Karma whoring, here I come. Hopefully this server can withstand a mild slashdotting. Link

  6. The quick answer: by Doctor+K · · Score: 5, Informative

    The site seems to be down. However, last week I contacted nVidia about this problem on my two dual Athlon MP workstations (random hangs when OpenGL is invoked). So the quick answer is you can

    Boot your system with the following option on your kernel command line: "mem=nopentium"

    or

    Disable AGP in your XFree86 config (i.e. Option "NvAGP" "0" in the "Device" section).

    nVidia clued me into the first approach about a week and a half ago. It made my system completely stable. However, there was still some texture flakiness in some OpenGL applications. Since my workstations are number crunchers (and thus Quake FPS don't matter to me), the latter option eliminated both the stability problems and the texture flakiness (at the expense of some graphics speed).

    By the way, nVidia mentioned the same issue exists on Win2K / Athlon boxes.

    Enjoy,
    Kevin

  7. Re:More info? by Sadfsdaf · · Score: 2, Informative

    Disable Fast AGP write (AGP Turbo?) in your BIOS.

    Read the manual. http://205.158.109.140/XFree86_40/1.0-2313/README.txt

  8. why we blame NVIDIA by Anonymous Coward · · Score: 0, Informative

    It is impossible to debug closed-source
    drivers like the NVIDIA one. So any NVIDIA
    bugs can't be found.

    But you say "this is an AMD bug"...

    How could we know that? The presence of
    closed-source drivers in the kernel made
    us unable to determine what was at fault.
    Video drivers can cause non-video problems,
    so in all cases only NVIDIA can help you.

  9. Re:Should AMD do the right thing? by Linux+Freak · · Score: 3, Informative
    Heh, microcode bugs go back, WAYYYY back as far as microprocessors do themselves.



    Shit happens. Work around it. ;-)
  10. Re:Incredible as it may... by Anonymous Coward · · Score: 1, Informative

    If their flow is the same as most other semis, they do functional verification both before (in simulation) and after tapeout. A chip is almost always rev'ed a few times before it gets 'production' status and ships to customers in large quantities. Looks like they had a hole in their functional test plan and missed this one.

  11. Re:Is this present in Athlon optimized kernels? by Sits · · Score: 2, Informative

    Almost definitely not. It sounds like the existence of this bug was not known until recently and K7 options almost definitely enable all memory enhancements.

  12. Re:NO AMD BASHING by spauldo · · Score: 5, Informative
    Why are you worried about running 32-bit code on a 64-bit processor?

    Just as an aside, if you ever deal with ultrasparcs, you'll quickly find that the majority of the code used is 32 bit.

    The reason for it is simple: most applications will run slower at 64 bit than at 32 bit. The ultrasparc chips were designed to take this into account. Hell, due to a firmware bug, Solaris on my Ultra 1 installs as a 32 bit kernel by default - and runs no slower because of it (although it can't run 64 bit apps that way). After a firmware patch, it is easy to change to running the 64 bit kernel though.

    In all reality, why would most apps need 64 bit integers and whatnot? Most don't, and doing so is a waste of memory. If the processor is designed right, it can handle 32 bit code with no problems whatsoever.

    --
    Those who can't do, teach. Those who can't teach either, do tech support.
  13. The equivalent Win2k bug fix by LadyLucky · · Score: 3, Informative
    can be found here

    Funny, I knew something was wrong...

    --
    dominionrd.blogspot.com - Restaurants on
  14. Using Test Suites to Validate the Linux Kernel by goingware · · Score: 5, Informative
    Let me take this opportunity to plug Using Test Suites to Validate the Linux Kernel.

    Thank you for your attention.

    --
    -- Could you use my software consulting serv
  15. Quake 3 benchmarks by Sits · · Score: 5, Informative

    Quake 3 demo was run with \timedemo 1 and \demo DEMO001 . Each test was run three times. The system load average was < 0.5 before Quake 3 was run.

    Without mem=nopentium
    FPS = 79.4 (79.4, 79.4, 79.4)

    With mem=nopentium
    FPS = 79.2 (79.1, 79.3, 79.2)

    System tested:
    Athlon 850, 384MB RAM, Geforce 1 DDR, VIA KT133 Chipset
    Athlon/Duron/K7 optimised 2.4.17 kernel (optimising the kernel above pentium makes very little difference though)
    NVidia 1.0-2313 video drivers using agpgart
    Mandrake 8.0

    Quake 3 settings
    Texture depth = 16 bits
    Colour depth = 16 bits
    Geometric detail = High
    Texture detail = High
    Dynamic lights = On
    Video mode = 1024x768

    Looks like there is a difference, but it's very slight (about 0.25%), and my benchmarks aren't very scientific. Either way, if there is an improvement in stability, this tradeoff is easily worth it. Here's hoping that you don't run Linux just for its Quake 3 scores though...
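
    Working the percentage out from the posted runs, as a quick sketch. The variable names are made up for illustration; the FPS figures are the ones listed in this comment.

```python
from statistics import mean

without_opt = [79.4, 79.4, 79.4]  # FPS without mem=nopentium
with_opt = [79.1, 79.3, 79.2]     # FPS with mem=nopentium

mean_without = mean(without_opt)
mean_with = mean(with_opt)

# Relative slowdown, expressed as a percentage of the faster result.
drop_percent = (mean_without - mean_with) / mean_without * 100

print(f"{mean_without:.1f} -> {mean_with:.1f} FPS, drop {drop_percent:.2f}%")
# prints: 79.4 -> 79.2 FPS, drop 0.25%
```

    With only three runs per setting, a gap this small is within what run-to-run noise could plausibly produce.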

  16. Re:Should AMD do the right thing? by Eric+Smith · · Score: 4, Informative
    That third article about the supposed "HCF" instruction on the 4004 is complete and utter BS. None of the instructions on the 4004 will cause it to burn up, even on the earliest production parts.

    Several processors had self-test instructions known as "HCF". The 6800 family and the 6502 had such instructions. They caused the processor to start fetching consecutive locations, thus continuously incrementing the address bus. Didn't damage the processor, even if you left it running that way. The "Catch Fire" was a figurative description of what was happening on the address bus, nothing more.

    On the original NMOS 6502, about 13 of the undefined opcodes had this effect. This was the most common cause of computer lockups if the code went into the weeds.

    On some of the later 6800 family members, the test instructions were actually documented, but Motorola's published description did not include any mnemonic for them.

  17. Re:Performance hit? by Sits · · Score: 3, Informative

    You may want to take a look at the benchmarks posted later.

  18. "It does not affect FreeBSD" by Anonymous Coward · · Score: 1, Informative
  19. Other Hackers did it better . . . by Jeff+Kelly · · Score: 5, Informative
    Here is a Posting from Terry Lambert on the FreeBSD -stable Mailing List regarding this "Bug".
    Maybe it sheds some light on this issue.


    > Recently I found Linux 2.4 kernel is affected by the
    > bug of extended paging in AMD Athlon through the
    > following link. I don't know if FreeBSD is also
    > affected.
    >
    > http://linuxtoday.com/news_story.php3?ltsn=2002-01-21-001-20-NW-KN

    I am well aware of this bug.

    It does not affect FreeBSD, which only uses 4M pages for
    the first 4M of the kernel itself.

    I've worked on code that enables 4M pages on other memory
    used in FreeBSD, that had this problem, but only if you
    were really stupid in your allocation mechanism.

    There's a workaround for this problem which is fairly
    trivial to implement in software, and should probably be
    done when 4M pages are enabled, if you are using an Athlon,
    and are adding 4M pages.
    [...]
    In any case, this will not be a problem for FreeBSD, and is
    only a problem for Linux because of the strange way they
    initialize things.
    1. Re:Other Hackers did it better . . . by jelle · · Score: 2, Informative

      When an OS doesn't use a CPU feature (4M pages, using it just for the kernel doesn't count), that doesn't make the hacker better, it makes the OS not taking advantage of all CPU features (and therefore not running into the related CPU bugs...).

      So this guy tried to do 4M pages, it didn't work well (he encountered the bug), and decided not to implement 4M pages at all. And for Linux, the guys just happened to implement 4M pages long before AMD created the processors with the bug.

      Different history, all good hackers.

      --
      --- Hindsight is 20/20, but walking backwards is not the answer.
    2. Re:Other Hackers did it better . . . by Jeff+Kelly · · Score: 2, Informative

      When an OS doesn't use a CPU feature (4M pages, using it just for the kernel doesn't count), that doesn't make the hacker better, it makes the OS not taking advantage of all CPU features (and therefore not running into the related CPU bugs...).


      Read again. The Posting states that "I've worked on code that enables 4M pages on other memory
      used in FreeBSD, that had this problem, but only if you
      were really stupid in your allocation mechanism."

      He encountered the Problem in his _own_ code and fixed it there. He also states: "There's a workaround for this problem which is fairly
      trivial to implement in software, and should probably be
      done when 4M pages are enabled, if you are using an Athlon,
      and are adding 4M pages." He very clearly states that 4M pages are not currently supported in FreeBSD (should be in 4.5) but that a workaround exists. (And it is _not_ deactivating the 4M paging as in linux).

      So although they are not affected by the bug because they do not use that particular feature, at least they know that it exists, and they have a workaround ready _now_ so that by the time this feature is implemented the bug will not cause any trouble. Which is more than I can say about the Linux hackers, who don't even bother to read the docs provided by AMD.

  20. AMD Rev A5/CPUID 662 by lanalyst · · Score: 2, Informative

    Recently purchased 2 XP 1600+s (1 in Dec and 1 in Jan) - both indicate they are Rev A5 (CPUID 662) and do not have the INVLPG bug according to AMD's errata sheet.

  21. 64 bit Performance by digitalEric · · Score: 2, Informative

    Yes, UltraSPARCs run significantly slower in 64 bit mode. IIRC, this is because it takes more instructions to load 64 bit constants and access 64 bit pointers. This is not true of all 64 bit processors -- and it is not true of x86-64.

    The x86-64 architecture allows 64 bit programs to take advantage of the extra precision (and doubles the number of general-purpose registers, which x86 desperately needs), without forcing them to take the performance hit of using the full 64 bit addressing. It also adds a new IP-relative addressing mode, which makes position-independent code (i.e., shared libraries) much more efficient. There will be an increase in code size (and possibly a performance drop, but this depends on how AMD implements the 'movabs' instruction) when you start using more than 4GB of data. And when you start using >4GB of code, things get yucky (requiring indirect jumps).

    But, the point is, x86-64 will run all your 32 bit x86 code at full speed, and if you're able to re-compile your programs for 64 bit mode, you should get a performance boost, if only from getting 9 more registers (8 + no longer need to keep a pointer to the GOT).

  22. Re:Can registered and ECC RAM help? by Tazzy531 · · Score: 2, Informative

    It's not a matter of the type or quality of the memory but of how the chip addresses the memory. There is a flaw in the chip itself. A layman's analogy might be a telephone book that only lists the first 5 digits of each phone number. What you are suggesting is to replace all the telephones in the world. Even if you do, the phone book still won't work because the phone numbers are incorrect. What has to be fixed is the phone book [or the way of finding phone numbers]. Go here for more technical information.

    --


    _______________________________
    "I'm not Conceited...I'm just a realist..."
  23. Re:Is this the same as the Win2k bug? by DeeKayWon · · Score: 5, Informative
    The only revision without the bug is the A5 stepping (CPUID 662) Athlon XP/MP/Mobile Athlon 4. See the Athlon model 4 revision guide and the Athlon model 6 revision guide, erratum 16.

    Basically, if you run "cat /proc/cpuinfo" and see these:

    cpu family: 6
    model : 6
    stepping : 2

    Then you should be safe.
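
    A rough sketch of the same check done in code. The function names parse_cpuinfo and is_a5_stepping are hypothetical, and the extra vendor_id test is an assumption added here, since the family and model numbers alone are not unique to AMD parts.

```python
def parse_cpuinfo(text: str) -> dict:
    """Parse the first processor block of /proc/cpuinfo-style text."""
    fields = {}
    for line in text.splitlines():
        if not line.strip():
            if fields:
                break  # a blank line ends the first processor block
            continue
        key, _, value = line.partition(":")
        fields[key.strip()] = value.strip()
    return fields


def is_a5_stepping(fields: dict) -> bool:
    """True for family 6, model 6, stepping 2 (the values quoted above)."""
    return (
        # Assumption: also require an AMD vendor string.
        fields.get("vendor_id") == "AuthenticAMD"
        and fields.get("cpu family") == "6"
        and fields.get("model") == "6"
        and fields.get("stepping") == "2"
    )
```

    On a Linux box this could be fed the contents of /proc/cpuinfo, e.g. is_a5_stepping(parse_cpuinfo(open("/proc/cpuinfo").read())).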

  24. Optimised kernels still buggy by Sits · · Score: 2, Informative

    I've posted this elsewhere but to clarify - it looks like this will still happen regardless of which processor you have selected (even i386!). This is because the test for whether your processor supports PSE seems to be run at startup (I think it's done by arch/i386/mm/init.c __init pagetable_init).

    As an aside, as far as I can tell the only (extra) things that optimising a kernel for a K7 seems to set are gcc options (someone please correct me if I'm wrong).

  25. Re:I just want to know by rew · · Score: 2, Informative

    ...and that never seemed to be an issue.

    The AMD erratum says that it is an issue if bit 21 of an address is actually 1. Thus you may have been lucky in where your video card got mapped.
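
    To see whether a given address has that bit set, something like the following would do. This is only a sketch: it assumes bits are counted from 0 at the least significant end, the helper name bit21_set is made up, and the erratum's full trigger conditions are not reproduced here.

```python
def bit21_set(address: int) -> bool:
    """Return True if bit 21 of the address is 1 (bit 0 = least significant)."""
    return bool((address >> 21) & 1)


# Hypothetical example addresses: 0x200000 is exactly 1 << 21.
print(bit21_set(0x00200000))  # True
print(bit21_set(0x00400000))  # False
```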

    Roger.

  26. Re:Is this the same as the Win2k bug? by evilpaul13 · · Score: 2, Informative

    Cool, that includes my Athlon XP which I picked up this week!

  27. Re:what's the point of 4MB pages? by Anonymous Coward · · Score: 2, Informative

    It saves page table entries, which saves an irrelevantly small amount of memory.

    Much more importantly, it saves TLB entries, which makes more room for user memory, speeding up virtual->physical translations.
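
    The scale of the saving is easy to sketch. The 64 MiB region below is a hypothetical size chosen for illustration, and pages_needed is a made-up helper; each page needs its own translation, so fewer pages means fewer TLB entries consumed.

```python
KIB = 1024
MIB = 1024 * KIB


def pages_needed(region_bytes: int, page_bytes: int) -> int:
    """Number of pages (and so translations) needed to map a region."""
    return -(-region_bytes // page_bytes)  # ceiling division


region = 64 * MIB  # hypothetical region size
print(pages_needed(region, 4 * KIB))  # 16384
print(pages_needed(region, 4 * MIB))  # 16
```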

  28. This AMD bug exists on the AMD K6-3 by narfbot · · Score: 2, Informative

    I have an AGP Nvidia GeForce 2 MX, and an AMD K6-3 333 MHz. I have experienced these memory corruption, graphical anomalies, and lockups in Linux and Windows 95.

    I noticed that the AMD K6-3 was not mentioned, but the bug has to exist on it. The K6-3 was made with the same instruction set as a pre-Athlon. Thus the bug definitely exists.

    Not sure about K6/K6-2, but it is possible.