Slashdot Mirror


Major Linux/Athlon CPU bug discovered

GeorgeFrancisco writes "I recently installed the nVidia drivers so I could play TuxRacer on my Athlon. Problem is it kept inexplicably hanging Linux. Now I know why. The CPU bug affects Athlon/Duron/Athlon MP AGP users. Fortunately there's a way around it, and: "Alan [Cox] is going to try to add some kind of Athlon/AGP CPU bug detection code to the kernel so that it will be able to auto-downgrade to 4K pages when necessary." Read more on the Gentoo Linux site."

13 of 402 comments

  1. how lame by Jeff+Probst · · Score: 0, Insightful

    just so he could play tux racer. why not just play the windows version?

  2. Re:NO AMD BASHING by NanoGator · · Score: 4, Insightful

    AMD didn't turn interesting until the Athlon came out. The previous versions of its processors were decidedly inferior. This is *worse* than recalling for a bad, rarely used function call. I can't take a processor back 6 months after I bought it because it sucks, but I can get it replaced if it has a bona-fide bug.

    If this is a bug in the processor, AMD really should fix it and offer replacement processors to those who need them. If they don't, and they expect you to patch your OS instead, then that definitely shakes my faith in that company. When you're an artist dependent on OpenGL, you can't have problems like this.

    And finally...

    Why are you worried about running 32-bit code on a 64-bit processor? 64-bit processors are supposed to run 64-bit code. Intel's not marketing 64-bit processors to replace desktop computers (today), they're for servers and high-end graphics with custom code. They don't NEED to run 32-bit code. I hardly think that's a point against Intel, especially considering they don't make it a big secret that 32-bit code runs slower on it.

    --
    "Derp de derp."
  3. Should AMD do the right thing? by NanoGator · · Score: 2, Insightful

    I should start by saying I haven't read the article yet, can't get to it. *hopes the /. traffic dies down soon...*

    If it is a defect in the processor, I wonder if AMD will replace my existing processor. It may not seem like all that big of a deal to most people here at Slashdot, but as a 3D artist I am *dependent* on OpenGL.

    Don't get me wrong, I'm not having this problem now. (I'm not a Linux user.) But when I built my Athlon I had to install a patch for a similar type of problem in order to get the machine to work. At what point do we say "it's no longer ok to work around a CPU bug"?

    If Intel has one set of bugs in their processors, and AMD has another, that divides the market. Software companies shouldn't have to put the effort into scrutinizing their code based on which CPU they are on; it's bad enough they are trying to optimize for one or the other. What happens when they get used to the workaround, but then it gets fixed? Worse yet, what happens when a company says "I'm sick of this, I'm only supporting one processor"?

    So it's not so much that I think AMD should replace the processors with this specific bug, but I think we should be vigilant in not allowing them to let errors like that run rampant.

    --
    "Derp de derp."
  4. Re:Buggy Features by Anonymous Coward · · Score: 0, Insightful

    Linux Nut: "It's not a bug.. you're just too stupid... god you disgust me... fix it yourself moron.. wait, you're too stupid."

  5. Re:NO AMD BASHING by billcopc · · Score: 2, Insightful

    I'll second that: Iwill boards are consistently better than average in terms of both performance and stability. Abit sucks ass though, they try to push things too far and forget that a super-overclocked machine that hangs every hour isn't worth shit.

    ECS are off to a very impressive start with the K7S5A board. Using the SIS 735 chipset, it is unsurpassed in reliability and offers very decent performance as well. Overclocking isn't its strong point, but at a mere $65 price tag you can invest the money saved in a faster CPU.

    (no I'm not sponsored by ECS, I just hate my Abit KT7-Raid and am jealous of all my friends who have the ECS board)

    --
    -Billco, Fnarg.com
  6. Re:Same as Intel's F00F problem by RKloti · · Score: 1, Insightful

    Probably for marketing reasons.
    (another buzzword)

  7. You want vectors not huge integers by yerricde · · Score: 2, Insightful

    For example, you can pass around 8-byte structures in a single register, which is damn useful given the lack of available registers in the x86 architecture.

    And when you want to use or change one byte in the structure, what do you do? Shift it out and put it in another register. You can beat the "lack of registers" argument by switching to any current architecture but x86; you'll get at least 16, most likely 32, or even 64 registers.

    And with the 64-bit representation, you can do this with just one subtraction and one branch rather than a combination of two subtracts and two branches.

    One problem with your algorithm: one subtraction will "carry" over into the next because the processor assumes you're subtracting whole integers. What you want isn't really 64-bit integers but rather vector SIMD as found in MMX, SSE, and 3DNow!. In fact, AltiVec on the G4 processor is 128-bit.
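
    To make the "carry" problem concrete, here is a minimal sketch that models a 64-bit register holding eight packed bytes. The helper names (pack_bytes, sub_whole, sub_lanes) are made up for illustration; sub_lanes only imitates what a packed byte subtract such as MMX's PSUBB does, it is not real SIMD.

```python
MASK64 = (1 << 64) - 1

def pack_bytes(values):
    """Pack eight 0..255 values into one 64-bit integer, first value in the lowest byte."""
    assert len(values) == 8
    word = 0
    for i, v in enumerate(values):
        word |= (v & 0xFF) << (8 * i)
    return word

def unpack_bytes(word):
    """Split a 64-bit integer back into its eight bytes, lowest byte first."""
    return [(word >> (8 * i)) & 0xFF for i in range(8)]

def sub_whole(a, b):
    """Ordinary 64-bit integer subtraction: a borrow in one byte leaks into the next."""
    return (a - b) & MASK64

def sub_lanes(a, b):
    """Per-byte subtraction with wraparound; each lane is independent."""
    return pack_bytes([(x - y) & 0xFF
                       for x, y in zip(unpack_bytes(a), unpack_bytes(b))])

a = pack_bytes([1, 5, 5, 5, 5, 5, 5, 5])
b = pack_bytes([2, 1, 1, 1, 1, 1, 1, 1])

whole = unpack_bytes(sub_whole(a, b))   # byte 1 is off by one because of the borrow
lanes = unpack_bytes(sub_lanes(a, b))   # byte 1 is the expected 5 - 1 = 4
print(whole)
print(lanes)
```

    The lowest byte underflows (1 - 2), and in the whole-integer version that borrow quietly corrupts its neighbour.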

    --
    Will I retire or break 10K?
  8. This is flaw in how Linux is (not) managed by Anonymous Coward · · Score: 2, Insightful

    First off, yes, this is a rather major bug.

    But is it enough to warrant not buying the processor or flaming AMD???? Hardly!

    EVERY piece of hardware out there has some bug in it! Have any of you ever sat down and read the list of errata on Intel parts and the list of how many flaws are fixed in each stepping? The list of bugs fixed over the life of the P3/Celeron core is a rather lengthy document to say the least.

    And I can't really fault AMD at all on this one other than that they HAD a bug...for Win2K etc, they released a fix/patch in very short order and notified everyone rather quickly.

    And don't forget this was back in 2000! What version was the norm for deployed kernels back then (over a year ago!)???

    From what I gather, the 4MB AGP paging didn't show up until kernel 2.4 builds -- which I do not think were final at that time. Regardless, I feel the Linux kernel community should have been a bit more proactive in noting a DOCUMENTED bug and correcting for it.

    Regardless, this bug in no way affects whether or not I would buy an Athlon/Duron. It is basically trivial to work around and results in almost no performance loss. In essence, my Athlon XP 1900+ with the fix will still beat the crap out of most P4 2GHz machines in 90% of all applications (for half the price).

    This is basically a failing of the entire Linux concept more than a failing of AMD.
    Is there any central authority who regularly checks AMD, Intel, Via, Transmeta, etc. errata sheets for bugs that might potentially affect the kernel? Based on this, I strongly doubt it.

    Don't get me wrong, Linux is a great OS, but the lack of centralized control and build management is starting to cause problems. There are so many changes to different modules that version dependencies crop up all the time and no one is managing them.

    I am not a big fan of Microshaft myself, but I would put money that they have at least one or two people whose job is to do nothing but monitor the processor manufacturers' errata to make sure no major problems submarine the sales of Windows XP! Bill Gates may be many things, but stupid is not one of them. H*ll, if I was Microshaft, I could have a marketing field day on this one -- it could be a very persuasive argument to lots of upper management types as to why Windows is better than Linux.

    Is this bug a problem? Yes.

    Was the original problem AMD's? Yes.

    Did they address it and notify people? Yes.

    Did anyone in the Linux community actually notice? NO

    Regardless, any bug that can be worked around this easily is not THAT big a deal people...but it does point out some serious flaws in how Linux kernel development is managed. If Linux is to survive, some order had better start arising out of the spiraling chaos!

    So to sum up the appropriate response to this bug: LEARN FROM YOUR MISTAKES AND GET OVER IT!!!!!!!!

    "The sky is falling! The sky is falling!"

    sheesh...

  9. Annoyed at something else. by Lemmy+Caution · · Score: 4, Insightful

    The article notes that AMD has been proclaiming the bug in public for a while.

    What irks me is this: I got hit with this bug. I posted bug reports to Debian, with NVidia, on different forums, reporting lock-ups in certain OpenGL situations. I got generally hand-waving "read the fucking manual" responses.

    As the article notes, this isn't just a problem with AMD. It suggests that there's an ongoing problem with troubleshooting and resolving the sorts of issues that desktop users are going to have in Linux. (And "paying for support" would not have resolved much, would it have? The problem is the lack of coordination, not the lack of money.)

  10. Re:NO AMD BASHING by jelle · · Score: 2, Insightful

    "when you multiply, e.g. 100,000 * 100,000"

    When you multiply 2 32-bit numbers and really need the full precision of the 64-bit result, yes, then you need some 64-bit registers. However, that does not mean you need to have a multiply instruction that accepts 64-bit inputs. Also, often you don't need more than 32 bits of the result. In that case a barrel shifter in the chip right after the multiplier would already give you what you want without needing the large and slow 64x64 multiplier in the chip.
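
    A quick sketch of the arithmetic, assuming unsigned 32-bit inputs; the function names are invented for illustration and mul32_shifted only stands in for "a shifter right after the multiplier":

```python
MASK32 = 0xFFFFFFFF

def mul32_full(a, b):
    """Full product of two unsigned 32-bit values; needs up to 64 bits."""
    return (a & MASK32) * (b & MASK32)

def mul32_low(a, b):
    """What is left if only the low 32 bits of the product are kept."""
    return mul32_full(a, b) & MASK32

def mul32_shifted(a, b, shift):
    """Keep a 32-bit window of the product starting at bit `shift`."""
    return (mul32_full(a, b) >> shift) & MASK32

full = mul32_full(100_000, 100_000)
low = mul32_low(100_000, 100_000)
print(full, full.bit_length())              # the true product needs 34 bits
print(low)                                  # truncated to 32 bits: wrong value
print(mul32_shifted(100_000, 100_000, 2))   # scaled by 4, but fits in 32 bits
```

    Both inputs are 32-bit here; only the result is wider, which is the distinction being made above.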

    On DSPs, you can often choose between 'integer mode' and 'fixed point mode'. In the former case they mean integer input values just like the CPU has, and in the latter case they mean values in the range [-1,1), which places the decimal point 31 bits more towards the LSB. In 'fixed point mode', it's intuitively easier to stick with 32 bit precision if more precision is not needed.
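
    As a rough illustration of 'fixed point mode', the sketch below uses the common Q31 convention (a signed 32-bit integer n stands for n / 2^31). The helper names are made up, and real DSPs differ in how they round and saturate, so treat this as an assumption-laden model rather than any particular chip's behaviour.

```python
Q = 31
Q31_MAX = (1 << Q) - 1
Q31_MIN = -(1 << Q)

def to_q31(x):
    """Convert a real number in [-1, 1) to a Q31 integer, clamping at the top end."""
    if not -1.0 <= x < 1.0:
        raise ValueError("value outside [-1, 1)")
    return min(int(round(x * (1 << Q))), Q31_MAX)

def from_q31(n):
    """Convert a Q31 integer back to a float."""
    return n / (1 << Q)

def q31_mul(a, b):
    """Multiply two Q31 values: the raw product is Q62, so shift right by 31."""
    result = (a * b) >> Q
    # (-1) * (-1) would give +1, which Q31 cannot hold, so saturate.
    return max(Q31_MIN, min(result, Q31_MAX))

half = to_q31(0.5)
print(from_q31(q31_mul(half, half)))           # 0.25
print(from_q31(q31_mul(to_q31(-0.5), half)))   # -0.25
```

    The result of the multiply stays in 32 bits because the shift throws away the low half of the product, which is the "stick with 32 bit precision" case.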

    Additionally, DSPs have 'MAC' instructions: "accum_out = accum_in + (in1*in2)". Often, the number of bits in the 'accum' registers is larger than the number of bits in the 'in1' and 'in2' multiply inputs. A 16-bit DSP often (always?) has at least 32 bit wide 'accum' registers, often more than that, with up to 4 or 8 overflow bits in some cases. You need the overflow bits when you use the MAC instruction repeatedly (which is done often in typical DSP algorithms). With 4 overflow bits, you can use the MAC instruction 2^4 = 16 times and be guaranteed you'll never overflow 'accum'.

    Personally, I'd much prefer CPUs to get more DSP features than a simple increase in 'bits'.
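
    The 16-accumulation guarantee can be checked with a few lines. This sketch assumes a signed accumulator of (product bits + overflow bits) and, conservatively, treats every product as an arbitrary signed 32-bit value; max_safe_macs is a made-up helper, not a DSP instruction.

```python
def signed_range(bits):
    """Smallest and largest values of a two's-complement number with `bits` bits."""
    return -(1 << (bits - 1)), (1 << (bits - 1)) - 1

def max_safe_macs(product_bits, guard_bits):
    """Largest N such that N worst-case products always fit in the accumulator."""
    acc_lo, acc_hi = signed_range(product_bits + guard_bits)
    prod_lo, prod_hi = signed_range(product_bits)
    n = 0
    # Keep adding one more worst-case product while both extremes still fit.
    while (n + 1) * prod_lo >= acc_lo and (n + 1) * prod_hi <= acc_hi:
        n += 1
    return n

print(max_safe_macs(32, 4))   # 4 overflow bits
print(max_safe_macs(32, 8))   # 8 overflow bits
```

    Each extra overflow bit doubles the number of MACs that can run back to back before a check or a rescale is needed.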

    --
    --- Hindsight is 20/20, but walking backwards is not the answer.
  11. what's the point of 4MB pages? by RelliK · · Score: 3, Insightful

    First, as has already been pointed out, there is no performance hit.

    But I still did not get the answer to my question. What is the purpose of having 4MB pages in the first place? It is inconceivable that an OS would use 4MB pages exclusively. The internal fragmentation would be enormous.

    To give you an analogy, think of what would happen if your file system used 4MB blocks. When you create a file, space would be allocated 4MB at a time so a 1 byte file would waste (4MB - 1byte) of disk space; (4MB + 1byte) file would take up two blocks, also wasting (4MB - 1byte) of disk space. On average, each file wastes 1/2 of the last block. Similarly, each process wastes on average 1/2 of the last page. That's not a problem if the pages are 4KB in size, but with 4MB pages there's lots of space wasted. That's like throwing away paging altogether.
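
    The numbers in the analogy work out as follows. This is only a back-of-the-envelope sketch; the function names are invented, and the "half a page per process" figure is the average assumed above, not a measurement.

```python
KB = 1024
MB = 1024 * KB

def pages_needed(size, page_size):
    """Number of whole pages (or blocks) needed to hold `size` bytes."""
    return -(-size // page_size)   # ceiling division

def wasted(size, page_size):
    """Bytes allocated but unused in the last page."""
    return pages_needed(size, page_size) * page_size - size

def expected_waste(count, page_size):
    """Average total waste for `count` allocations, at half a page each."""
    return count * page_size // 2

print(wasted(1, 4 * MB))              # 1-byte file on 4MB blocks
print(wasted(4 * MB + 1, 4 * MB))     # (4MB + 1 byte) file takes two blocks
print(wasted(1, 4 * KB))              # same 1-byte file on 4KB blocks
print(expected_waste(100, 4 * KB) // KB, "KB")   # 100 processes, 4KB pages
print(expected_waste(100, 4 * MB) // MB, "MB")   # 100 processes, 4MB pages
```

    With a hundred processes the expected waste goes from about 200KB to about 200MB, which is the objection in a nutshell.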

    So, I ask again, what is the point of having 4MB pages?

    --
    ___
    If you think big enough, you'll never have to do it.
  12. Re:NO AMD BASHING by mz001b · · Score: 3, Insightful

    As someone pointed out elsewhere, this would make the processors too expensive, if the vendor had to ship replacement processors each time a bug was found. Lots of bugs exist in processors, and typically they are fixed with each new stepping. Look at /proc/cpuinfo and see how many bugs it checks for (fdiv_bug, hlt_bug, f00f_bug, coma_bug on my system). This bug will probably be just another line. There is a simple workaround for it too, so it is not that bad. The real problem (as many people state) is that AMD did not inform the kernel developers about this problem long ago, so a fix could already be implemented.
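
    For anyone who wants to pull those lines out programmatically, here is a small sketch. It assumes the "name_bug : yes/no" layout implied by the field names above; SAMPLE is a hypothetical excerpt, not output from a real machine, and other kernel versions may lay this information out differently.

```python
def parse_bug_flags(cpuinfo_text):
    """Return {name: bool} for every '<name>_bug : yes/no' line in the text."""
    flags = {}
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        key = key.strip()
        if sep and key.endswith("_bug"):
            flags[key] = value.strip().lower() == "yes"
    return flags

# Hypothetical excerpt in the style of /proc/cpuinfo.
SAMPLE = """\
processor\t: 0
vendor_id\t: AuthenticAMD
fdiv_bug\t: no
hlt_bug\t\t: no
f00f_bug\t: no
coma_bug\t: no
"""

flags = parse_bug_flags(SAMPLE)
for name, present in flags.items():
    print(name, "yes" if present else "no")
```

    On a Linux box the same function can be fed the real file, e.g. parse_bug_flags(open("/proc/cpuinfo").read()).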

  13. Re:Incredible as it may... by Hoser+McMoose · · Score: 2, Insightful

    Of course it's due to money! Coming up with a fix to a bug like this doesn't just happen overnight, and since the errata in processors barely ever affect anyone and can usually be easily worked around in software (and the software fix for this bug is trivial), most companies have better things to do with their time. As it turns out, AMD did eventually get around to fixing this issue with Stepping A5 of the AthlonXP/MP core.

    In the same vein, out of the 83 bugs that Intel currently has listed for their Pentium III processor, quite a bit more than 50% of them are listed as "NoFix", i.e. Intel has no plans on ever fixing these bugs.

    The real question I have to ask is why no one caught this earlier? This bug is well documented in AMD's errata list, complete with a workaround. AMD's Athlon chips only have something like 10-15 known bugs listed, which is quite a few less than the 59 known bugs for Intel's P4 or the 83 known bugs for Intel's PIII processors, so going through the list of AMD bugs should be a fairly easy thing to do (aside: one could argue either that AMD chips have fewer bugs than Intel or simply that Intel documents their chips better.. I don't want to take either side on that flame war though).

    If anyone is really interested in this sort of thing, both AMD and Intel have their list of known bugs up on their website under "specification updates" for each of their processors.