Slashdot Mirror


Major Linux/Athlon CPU bug discovered

GeorgeFrancisco writes "I recently installed the nVidia drivers so I could play TuxRacer on my Athlon. Problem is it kept inexplicably hanging Linux. Now I know why. The CPU bug affects Athlon/Duron/Athlon MP AGP users. Fortunately there's a way around it, and: "Alan [Cox] is going to try to add some kind of Athlon/AGP CPU bug detection code to the kernel so that it will be able to auto-downgrade to 4K pages when necessary." Read more on the Gentoo Linux site."

24 of 402 comments

  1. Is this the same as the Win2k bug? by sprayNwipe · · Score: 4, Interesting

    There was a Win2k bug a while back that did the exact same thing, and you had to install a "LargePageMinimum" patch to stop it crashing. Is this the Linux equivalent of that? And if so, why has it taken so long to surface and be fixed?

    1. Re:Is this the same as the Win2k bug? by kilrogg · · Score: 5, Funny
      RTFA, AMD released a patch for w2k but never mentioned anything to the kernel developers.

      Instead of saying "oops, there's a hardware bug", they said, "oops, here's a patch for w2k". Looks like none of the kernel developers knew they had to look at w2k bug fixes to find out about hardware bugs.

    2. Re:Is this the same as the Win2k bug? by Anonymous Coward · · Score: 4, Redundant
      It's slashdotted. Here's the article:

      The bad news is that a major Athlon CPU bug has been discovered, and it affects Linux 2.4. Note that this is a bug in the actual CPU itself, and is not a Linux bug. However, it becomes our problem because there are very many semi-broken Athlon/Duron/Athlon MP CPUs out there.

      Here are the details. As you may know, x86 systems have traditionally managed memory using 4K pages. However, with the introduction of the Pentium processor, Intel added a new feature called extended paging, which allows 4MB pages to be used instead. Here's the problem -- many Athlon and Duron CPUs experience memory corruption when extended paging is used in conjunction with AGP. And, this problem hits us because Linux 2.4 kernels compiled with a Pentium-Classic or higher Processor family kernel configuration setting will automatically take advantage of extended paging (for kernel hackers out there, this is the X86_FEATURE_PSE constant defined in include/asm-i386/cpufeature.h).

      Fortunately, there is a quick and easy fix for this problem. If you have been experiencing lockups on your Athlon, Duron or Athlon MP system when using AGP video, try passing the mem=nopentium option to your kernel (using GRUB or LILO) at boot-time. This tells Linux to go back to using 4K pages, avoiding this CPU bug. In addition, it should also be possible to avoid this problem by not using AGP on affected systems.

      As soon as I discovered that this CPU bug existed (which happened, unfortunately, because my CPU has the bug), I informed kernel hacker Andrew Morton of the issue; he put me in touch with Alan Cox. Alan is going to try to add some kind of Athlon/AGP CPU bug detection code to the kernel so that it will be able to auto-downgrade to 4K pages when necessary.

      The unfortunate thing about this situation is that AMD and others have known of this bug since September 2000. In fact, AMD's CPG technical marketing division announced this bug on September 21, 2000 in a technical note entitled Microsoft Windows 2000 Patch for AGP Applications on AMD Athlon and AMD Duron Processors (Technical Note TN17 revision 1). And, the kind folks at AMD even created a simple patch for Windows 2000 that disables extended paging by tweaking the registry. However, apparently AMD didn't realize that Linux 2.4 also uses extended paging when the kernel is compiled with a Pentium-Classic or higher Processor family kernel configuration setting. And, it looks like no one in the Linux community noticed that this "Microsoft Windows 2000/AGP Athlon/Duron bug" also applied to Linux 2.4 systems, probably because it was presented by AMD technical marketing as just that -- a Windows 2000-related AGP bug. An unfortunate miscommunication, which has resulted in lots of problems for Athlon, Duron and Athlon MP users.

      Here's something that's even more unsettling -- consider what kind of Linux users actually use AGP. That's right -- desktop users. And in what area has Linux been struggling? Yes, the desktop. One wonders how many negative desktop Linux experiences have resulted from this unfortunate problem.

      I don't know if any particular party is to blame for this issue. After all, AMD did prominently announce this bug when it was discovered. But due to an apparently unfortunate series of events, we Linux people never benefited from this knowledge. But Microsoft Windows 2000 and XP users did. Let's hope that all parties involved can keep things like this from happening in the future.

    3. Re:Is this the same as the Win2k bug? by DeeKayWon · · Score: 5, Informative
      The only revision without the bug is the A5 stepping (CPUID 662) Athlon XP/MP/Mobile Athlon 4. See the Athlon model 4 revision guide and the Athlon model 6 revision guide, erratum 16.

      Basically, if you run "cat /proc/cpuinfo" and see these:

      cpu family : 6
      model      : 6
      stepping   : 2

      Then you should be safe.
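
      The same check can be scripted. A minimal Python sketch, assuming the /proc/cpuinfo field names shown above; parse_cpuinfo and is_unaffected_stepping are hypothetical helper names, and the family 6 / model 6 / stepping 2 test only encodes the rule stated in this comment, not anything taken from AMD's revision guides directly:

```python
def parse_cpuinfo(text):
    """Return the key/value pairs of the first processor block in /proc/cpuinfo text."""
    fields = {}
    for line in text.splitlines():
        if not line.strip():
            if fields:
                break  # a blank line ends the first processor block
            continue
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def is_unaffected_stepping(fields):
    """True only for family 6, model 6, stepping 2 (CPUID 662), per the comment above."""
    try:
        triple = tuple(int(fields[k]) for k in ("cpu family", "model", "stepping"))
    except (KeyError, ValueError):
        return False  # missing or non-numeric fields: treat as not known-safe
    return triple == (6, 6, 2)
```

      On a Linux box, feeding the contents of /proc/cpuinfo to parse_cpuinfo and then to is_unaffected_stepping applies the rule; anything other than 6/6/2 is reported as not known-safe.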

  2. Re:For once Microsoft managed to fix it first by kilrogg · · Score: 4, Redundant

    Rather, AMD fixed it for Microsoft: they made the w2k patch but didn't release a Linux patch.

  3. The quick answer: by Doctor+K · · Score: 5, Informative

    The site seems to be down. However, last week I contacted nVidia about this problem on my two dual Athlon MP workstations (random hangs when OpenGL is invoked). So the quick answer is that you can either

    Boot your system with the following option on your kernel command line: "mem=nopentium"

    or

    Disable AGP in the XFree86 config (i.e. Option "NvAGP" "0" in the "Device" section).

    nVidia clued me into the first approach about a week and a half ago. It made my system completely stable. However, there was still some texture flakiness in some OpenGL applications. Since my workstations are number crunchers (and thus Quake FPS don't matter to me), the latter option eliminated both the stability problems and the texture flakiness (at the expense of some graphics speed).

    By the way, nVidia mentioned the same issue exists on Win2K / Athlon boxes.

    Enjoy,
    Kevin

  4. Performance hit? by mojo-raisin · · Score: 4, Interesting

    So does anyone know how performance is affected by this 4MB->4KB page thing?

    1. Re:Performance hit? by larien · · Score: 4, Interesting
      That's a rather naive assumption; it assumes that a 4KB page takes the same amount of time to move as a 4MB page. Admittedly, there will be 1024 times as much loop activity in order to move 4MB, but that probably isn't the real bottleneck, which would be memory/disk bandwidth. Also, you may gain some efficiency if you only want to move, say, 512KB.

      In short, you're better off with 4MB pages if it's stable, but I don't know by how much. I guess some benchmarks would be easy enough to do; e.g. run Q3A with and without the mem= options.

    2. Re:Performance hit? by andrewgaul · · Score: 5, Interesting

      The performance hit from using the smaller pages is mostly unrelated to paging to disk. When a CPU accesses a virtual address (all addressing in "protected mode" with paging enabled is virtual), it must be translated to a physical address before data can be accessed. The translation tables are stored in memory, and on x86 the CPU walks them in hardware, which costs extra memory accesses. To avoid this cost, the CPU caches recent translations in the Translation Look-aside Buffer (TLB). Most of the entries in this cache are for 4KB pages, but there are a few for 4MB pages, which are generally used for kernel memory (I am unsure if any OSes use the big pages for user programs).

      That said, there should be a modest performance hit. Bigger pages can store more data, which results in fewer TLB misses. Hopefully someone will post benchmarks.
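
      A back-of-the-envelope sketch in Python of why larger pages mean fewer TLB misses. The 4KB and 4MB page sizes come from the discussion; the entry count of 32 is a made-up illustrative figure, not a real Athlon TLB size, and tlb_reach and pages_needed are hypothetical helper names:

```python
KB = 1024
MB = 1024 * KB

SMALL_PAGE = 4 * KB
LARGE_PAGE = 4 * MB


def tlb_reach(entries, page_size):
    """Bytes of address space that `entries` cached translations can cover."""
    return entries * page_size


def pages_needed(region_bytes, page_size):
    """Number of pages (and so translations) needed to map a region, rounding up."""
    return -(-region_bytes // page_size)


ASSUMED_ENTRIES = 32  # illustrative only; not taken from any AMD datasheet

print(tlb_reach(ASSUMED_ENTRIES, SMALL_PAGE) // KB, "KB reach with 4KB pages")
print(tlb_reach(ASSUMED_ENTRIES, LARGE_PAGE) // MB, "MB reach with 4MB pages")
print(pages_needed(4 * MB, SMALL_PAGE), "small pages per 4MB region")
```

      Whatever the real entry count is, the ratio is what matters: each 4MB entry covers as much memory as 1024 of the 4KB entries.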

  5. Re:NO AMD BASHING by NanoGator · · Score: 4, Insightful

    AMD didn't turn interesting until the Athlon came out. The previous versions of its processors were decidedly inferior. This is *worse* than recalling for a bad, rarely used function call. I can't take a processor back 6 months after I bought it because it sucks, but I can get it replaced if it has a bona-fide bug.

    If this is a bug in the processor, AMD really should fix it and offer replacement processors to those who need it. If they don't, and they expect you to patch your OS instead, then that definitely shakes my faith in that company. When you're an artist dependent on OpenGL, you can't have problems like this.

    And finally...

    Why are you worried about running 32-bit code on a 64-bit processor? 64-bit processors are supposed to run 64-bit code. Intel's not marketing 64-bit processors to replace desktop computers (today), they're for servers and high-end graphics with custom code. They don't NEED to run 32-bit code. I hardly think that's a point against Intel, especially considering they don't make it a big secret that 32-bit code runs slower on it.

    --
    "Derp de derp."

  6. Nvidia + AGP + Irongate + Athlon by hack0rama · · Score: 4, Interesting

    Nvidia's drivers force AGP to 1x because of corruption caused by signal-integrity problems in the AMD Irongate chipset (mentioned in the README for the Nvidia 1.0-2313 drivers).

    Is this newly discovered memory corruption with Athlon + AGP contributing to the Irongate signal-integrity problem, or is it a separate bug?

    Anyway this makes AMD look very bad in my view. There is a bug in the CPU and their chipset screws up my AGP to 1x. Sigh.

  7. How-To: lilo workaround by Anonymous Coward · · Score: 4, Redundant

    If you're using lilo, and just want to apply the workaround quickly, edit /etc/lilo.conf.

    Before the first image= line, insert the line:

    append="mem=nopentium"
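
    For anyone scripting this, a minimal Python sketch of that edit applied to the text of lilo.conf. add_nopentium is a hypothetical helper; it only handles the simple case described above (no existing append= line), and it does not write the file or rerun /sbin/lilo, which is still needed before the change takes effect:

```python
def add_nopentium(conf_text):
    """Insert append="mem=nopentium" before the first image= line of lilo.conf text."""
    lines = conf_text.splitlines()
    if any(line.strip().startswith("append") for line in lines):
        return conf_text  # leave existing append= lines for a human to merge
    for i, line in enumerate(lines):
        if line.strip().startswith("image"):
            lines.insert(i, 'append="mem=nopentium"')
            break
    else:
        return conf_text  # no image= line found; nothing sensible to do
    return "\n".join(lines) + "\n"


sample = "boot=/dev/hda\nprompt\nimage=/boot/vmlinuz\n\tlabel=linux\n"
print(add_nopentium(sample))
```

    Refusing to touch a file that already has an append= line is deliberate: merging kernel options by script is easy to get wrong, so that case is left to a person.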

  8. Does this happen if kernel compiled for K7? by Nicolas+MONNET · · Score: 4, Interesting

    The article says it happens when the kernel is compiled for Pentium processors; but does this happen if the kernel is compiled for a K7?

    By the way, I had to shelve my nVidia card a couple months ago because of this ... I have an Athlon and it kept hard freezing. The bug doesn't happen with a Voodoo card.

  9. Re:NO AMD BASHING by spauldo · · Score: 5, Informative
    > Why are you worried about running 32-bit code on a 64-bit processor?

    Just as an aside, if you ever deal with ultrasparcs, you'll quickly find that the majority of the code used is 32 bit.

    The reason for it is simple; most applications will run slower at 64 bit than at 32 bit. The ultrasparc chips were designed to take this into account. Hell, due to a firmware bug, Solaris on my Ultra 1 installs as a 32 bit kernel by default - and runs no slower because of it (although it can't run 64 bit apps that way). After a firmware patch, it is easy to change to running the 64 bit kernel though.

    In all reality, why would most apps need 64 bit integers and whatnot? Most don't, and doing so is a waste of memory. If the processor is designed right, it can handle 32 bit code with no problems whatsoever.

    --
    Those who can't do, teach. Those who can't teach either, do tech support.

  10. Buggy Features by Perdo · · Score: 5, Funny

    MShaft: "Not-a-bug-it's-a-feature"

    Intel: "Not a bug it's erratum."

    VIA: "We slowed it down to keep it cool."

    Nvidia: "That was a leak! We are not doing public driver beta testing!"

    ATI: "Who the hell plays Quack3?"

    AMD: "the patch is here"

    --

    If voting were effective, it would be illegal by now.

  11. Using Test Suites to Validate the Linux Kernel by goingware · · Score: 5, Informative
    Let me take this opportunity to plug Using Test Suites to Validate the Linux Kernel.

    Thank you for your attention.

    --
    -- Could you use my software consulting serv

  12. Quake 3 benchmarks by Sits · · Score: 5, Informative

    Quake 3 demo was run with \timedemo 1 and \demo DEMO001 . Each test was run three times. The system load average was < 0.5 before Quake 3 was run.

    Without mem=nopentium
    FPS = 79.4 (79.4, 79.4, 79.4)

    With mem=nopentium
    FPS = 79.2 (79.1, 79.3, 79.2)

    System tested:
    Athlon 850, 384MB RAM, Geforce 1 DDR, VIA KT133 Chipset
    Athlon/Duron/K7 optimised 2.4.17 kernel (optimising the kernel above pentium makes very little difference though)
    NVidia 1.0-2313 video drivers using agpgart
    Mandrake 8.0

    Quake 3 settings
    Texture depth = 16 bits
    Colour depth = 16 bits
    Geometric detail = High
    Texture detail = High
    Dynamic lights = On
    Video mode = 1024x768

    Looks like there is a difference, but it's very slight (about 0.25%), and my benchmarks aren't very scientific. Either way, if there is an improvement in stability this tradeoff is easily worth it. Here's hoping that you don't run Linux just for its Quake 3 scores though...
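
    A small Python sketch that recomputes the relative difference from the six runs listed above and nothing else; as a fraction it comes out at roughly 0.0025, i.e. about a quarter of a percent:

```python
from statistics import mean

without_nopentium = [79.4, 79.4, 79.4]  # FPS, 4MB pages in use
with_nopentium = [79.1, 79.3, 79.2]     # FPS, 4KB pages only

baseline = mean(without_nopentium)
slowdown = (baseline - mean(with_nopentium)) / baseline

print(f"mean without: {baseline:.1f} FPS")
print(f"mean with:    {mean(with_nopentium):.1f} FPS")
print(f"slowdown:     {slowdown:.2%}")
```

    With only three runs per configuration and a 0.2 FPS spread between them, a gap this small is hard to tell apart from noise.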

  13. Re:Should AMD do the right thing? by Eric+Smith · · Score: 4, Informative
    That third article about the supposed "HCF" instruction on the 4004 is complete and utter BS. None of the instructions on the 4004 will cause it to burn up, even on the earliest production parts.

    Several processors had self-test instructions known as "HCF". The 6800 family and the 6502 had such instructions. They caused the processor to start fetching consecutive locations, thus continuously incrementing the address bus. Didn't damage the processor, even if you left it running that way. The "Catch Fire" was a figurative description of what was happening on the address bus, nothing more.

    On the original NMOS 6502, about 13 of the undefined opcodes had this effect. This was the most common cause of computer lockups if the code went into the weeds.

    On some of the later 6800 family members, the test instructions were actually documented, but Motorola's published description did not include any mnemonic for them.

  14. Alternate, faster? workaround by jquirke · · Score: 5, Interesting

    The current workaround gets around this problem by disabling 4M (2M?) pages (PSE). Hence we go back to 4K pages, so mapping large slabs of VM is a little slower and wastes memory (we need another page table for each 4M slab), and we obviously get more TLB misses and wasted TLB space: to touch the whole 4M region, the CPU needs to do up to 1024 page table lookups instead of 1.

    As discussed this may have performance implications.

    According to the AMD docs, the problem is only when flushing TLB entries with INVLPG and the page is a 4M page, _and_ the virtual address's bit 21 is set (which does not affect the 4M block of memory the address is in - eg: 0x400000 (2^22) vs 0x600000 (2^22|2^21) are both in the second 4M block).

    Hence, when invlpg'ing a VA we just need to do INVLPG(address & ~(1 << 21)). This only requires a single ANDL instruction. But we would need to distinguish a 4M page first, so I don't know.
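
    A minimal Python sketch of the address arithmetic being proposed, just to check the bit math. It models addresses as plain integers and does not touch real page tables or INVLPG; clear_bit21 and block_4m are hypothetical names:

```python
PAGE_4M_SHIFT = 22          # 4M pages: 2**22 bytes
BIT21 = 1 << 21             # the bit named in the comment above


def clear_bit21(vaddr):
    """Return vaddr with bit 21 cleared, i.e. address & ~(1 << 21)."""
    return vaddr & ~BIT21


def block_4m(vaddr):
    """Index of the 4M-aligned block that contains vaddr."""
    return vaddr >> PAGE_4M_SHIFT


for addr in (0x400000, 0x600000, 0x7FFFFF):
    masked = clear_bit21(addr)
    # Clearing bit 21 must never move the address into a different 4M page.
    assert block_4m(masked) == block_4m(addr)
    print(hex(addr), "->", hex(masked), "block", block_4m(addr))
```

    Because bit 21 lies below the 4M boundary (bit 22), masking it changes only the offset inside the page, which is why the masked address still names the same 4M mapping.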

    Heck maybe we should just do it the FreeBSD way and recursively map the Pagedir :-)

    Any ideas? Will this work?

    --JQuirke

  15. What bloody bug? by DABANSHEE · · Score: 5, Funny

    None of the Athlons or Durons I've built have had any problems with Tux Racer (Mostly on Man8.1 default install).

    My nephew spends hours sliding that little penguin around with that bloody elevator music going, & not once has there been a freeze or lockup, much to my disappointment.

  16. Other Hackers did it better . . . by Jeff+Kelly · · Score: 5, Informative
    Here is a posting from Terry Lambert on the FreeBSD-stable mailing list regarding this "Bug".
    Maybe it sheds some light on this issue.


    > Recently I found Linux 2.4 kernel is affected by the
    > bug of extended paging in AMD Athlon through the
    > following link. I don't know if FreeBSD is also
    > affected.
    >
    > http://linuxtoday.com/news_story.php3?ltsn=2002-01-21-001-20-NW-KN

    I am well aware of this bug.

    It does not affect FreeBSD, which only uses 4M pages for
    the first 4M of the kernel itself.

    I've worked on code that enables 4M pages on other memory
    used in FreeBSD, that had this problem, but only if you
    were really stupid in your allocation mechanism.

    There's a workaround for this problem which is fairly
    trivial to implement in software, and should probably be
    done when 4M pages are enabled, if you are using an Athlon,
    and are adding 4M pages.
    [...]
    In any case, this will not be a problem for FreeBSD, and is
    only a problem for Linux because of the strange way they
    initialize things.

  17. Re:NO AMD BASHING by mikera · · Score: 4, Interesting

    I've lost count of the number of times I wanted 64-bit integers, in pretty general purpose apps.

    Not because I do big databases or suchlike, but they let you do loads of optimisations that wouldn't otherwise be possible. For example, you can pass around 8-byte structures in a single register, which is damn useful given the lack of available registers in the x86 architecture.

    Example: I've recently been coding a large hexagonal grid component. Each point in the grid is indexed by 2 32-bit (x,y) integers. With a 64-bit register, you could put a full co-ordinate into a single register.

    Why is this useful? Well, one of my requirements was to be able to manage large sets of co-ordinates (think reachable spaces for an AI). You want to be able to combine sets of co-ordinates, which basically requires merging two lists. In order to merge lists efficiently, you need to sort them. And with the 64-bit representation, you can do this with just one subtraction and one branch rather than a combination of two subtracts and two branches. This is a definite speedup if you are hand-coding, and possibly an even bigger one if your compiler doesn't inline all the 32-bit code properly.
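
    A sketch of the packing idea in Python, assuming unsigned 32-bit co-ordinates (signed ones would need a bias before packing); pack and unpack are hypothetical names. Python integers are arbitrary-precision, so this only illustrates the ordering property, not the register-level speedup:

```python
MASK32 = 0xFFFFFFFF


def pack(x, y):
    """Pack two unsigned 32-bit coordinates into one 64-bit key, x in the high half."""
    return ((x & MASK32) << 32) | (y & MASK32)


def unpack(key):
    """Inverse of pack: recover (x, y) from a 64-bit key."""
    return key >> 32, key & MASK32


points = [(3, 7), (1, 4_000_000_000), (3, 2), (0, 9)]

# Sorting the packed keys gives the same order as sorting the (x, y) pairs,
# so a merge needs one integer comparison per step instead of two.
by_key = [unpack(k) for k in sorted(pack(x, y) for x, y in points)]
assert by_key == sorted(points)
print(by_key)
```

    Putting x in the high half is what makes a single integer comparison agree with comparing x first and y second.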

    Other example: 32-bits are large enough for most integer applications (you couldn't enumerate all the people on the planet though....) but they tend to fall down when you multiply, e.g. 100,000 * 100,000 has already blown the 32-bit limit, and neither of those is a particularly big number. Whenever you start doing a reasonable amount of multiplication, 64-bit becomes useful.
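
    The multiplication example is easy to check. A Python sketch that emulates 32-bit wraparound with a mask (Python's own integers do not overflow); mul_u32 is a hypothetical helper:

```python
UINT32_MAX = 2**32 - 1


def mul_u32(a, b):
    """Multiply as an unsigned 32-bit machine would, keeping only the low 32 bits."""
    return (a * b) & UINT32_MAX


exact = 100_000 * 100_000
print("exact product:  ", exact)
print("fits in 32 bits:", exact <= UINT32_MAX)
print("fits in 64 bits:", exact <= 2**64 - 1)
print("wrapped to u32: ", mul_u32(100_000, 100_000))
```

    The exact product is 10,000,000,000, a little over twice the largest unsigned 32-bit value, so a 32-bit multiply silently keeps only the remainder.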

    Also, 64-bits is big enough to encode the positions of pieces on a chess board. You can use bitwise logic to analyse and store positions. GNU chess certainly does it this way. I expect a *considerable* speedup in the top chess-playing algorithms when 64-bit becomes widespread.
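
    A minimal bitboard sketch in Python: one bit per square, so the squares occupied by one piece type fit in a single 64-bit word. The square numbering (a1 = bit 0, h8 = bit 63) and the helper names are assumptions for illustration, not GNU chess's actual layout:

```python
def square_bit(file, rank):
    """Bit for a square; file and rank are 0-7, with a1 = bit 0 and h8 = bit 63."""
    return 1 << (rank * 8 + file)


RANK_2 = 0xFF << 8                         # all eight squares on the second rank
white_pawns = RANK_2                       # starting position
occupied = white_pawns | square_bit(4, 2)  # pretend something stands on e3

# Single pushes: shift every pawn up one rank, then drop blocked squares.
pushes = (white_pawns << 8) & ~occupied & (2**64 - 1)

print(f"{pushes:016x}")
print("pawn count:", bin(white_pawns).count("1"))
```

    One shift and two ANDs handle all eight pawns at once, which is the kind of saving a native 64-bit register gives.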

    I'm really keen to see 256-bit arrive to be honest, 2^(2^3) has more elegance than 2^(2*3) and it would allow you to store a set of bytes in one register. Would allow some very cool text-processing tricks.

    Course, it might never happen - I predict a move towards massively parallel 64-bit computers rather than stonking 256-bit ones as the next major evolution in processor power.

  18. Re:Should AMD do the right thing? by flatrock · · Score: 5, Interesting

    First of all, this bug is not that significant performance wise. Very little software is going to use 4 MB pages. I don't think you even have an option of allocating memory with 4 MB pages in user space. This appears to be an issue with being able to optimise drivers, however, if AMD's processors can't do this, and Intel's can, why don't we see Intel's processors greatly outperforming AMD's in Win2k? This is a minor bug, and it's easily worked around without patching the kernel in both Win2k and Linux.

    The processors are basically all their Athlon and Duron processors. For AMD or any chip maker to replace chips with bugs in them is VERY expensive. They already have a low profit margin. Replacing all "defective" Athlon and Duron processors would simply bankrupt AMD. Realistically, all complex software or hardware has bugs. Bugs in hardware are much more difficult and expensive to fix. The truly significant hardware bugs are usually found early in testing. Other bugs are fixed in software, usually in the system BIOS, but sometimes in the OS code. This isn't something new. It's pretty much always been this way. Why has it been this way? Because no one wants to pay the outlandish prices that would result from trying to make hardware perfect. It costs a tremendous amount of money to reroll a processor. It's not as simple as making a quick code change and recompiling software. THERE WILL ALWAYS BE BUGS IN PROCESSORS! A truly significant bug like the Pentium floating point bug needs to be fixed in the hardware, and that one was even significant enough to deserve a recall of the processor. This bug is simple to work around, and isn't truly a significant problem.

    The question you asked in the subject is "Should AMD do the right thing?" The answer is yes, they should correct their Technology Bulletin to actually say what the processor bug is, rather than just say here's a workaround to a bug that affects Win2k.

    I'm really surprised that someone at NVidia didn't pass this on to Linux kernel developers much sooner, since people at that company seem to have been aware of this for some time.

  19. Annoyed at something else. by Lemmy+Caution · · Score: 4, Insightful
    The article notes that AMD has been proclaiming the bug in public for a while.

    What irks me is this: I got hit with this bug. I posted bug reports to Debian, to NVidia, and on different forums, reporting lock-ups in certain OpenGL situations. I got generally hand-waving "read the fucking manual" responses.

    As the article notes, this isn't just a problem with AMD. It suggests that there's an ongoing problem with troubleshooting and resolving the sorts of issues that desktop users are going to have in Linux. (And "paying for support" would not have resolved much, would it have? The problem is the lack of coordination, not the lack of money.)