Slashdot Mirror


AMD Confirms CPU Bug Found By DragonFly BSD's Matt Dillon

An anonymous reader writes "Matt Dillon of DragonFly BSD just announced that AMD confirmed a CPU bug he found. Matt quotes part of the mail exchange and it looks like 'consecutive back-to-back pops and (near) return instructions can create a condition where the processor incorrectly updates the stack pointer.' The specific manifestations in DragonFly were random segmentation faults under heavy load."

292 comments

  1. This isn't nearly as bad as the division bug by Omnifarious · · Score: 4, Insightful

    Though, it's still very serious. At least it generally causes your program to crash rather than spitting out a wrong answer. And it sounds like the sequence of instructions that causes it is not commonly found.

    I can well understand the guy who found it being all excited. The CPU is the last place you'd look for a bug, and finding one is pretty impressive, especially a really elusive one like this.

    1. Re:This isn't nearly as bad as the division bug by XDirtypunkX · · Score: 4, Insightful

      Both are equally bad from the perspective of a software developer who spends a month trying to work out just exactly what is wrong with their code, especially if something like this occurs on a test machine but not on a development machine.

    2. Re:This isn't nearly as bad as the division bug by icebike · · Score: 3, Informative

      And it sounds like the sequence of instructions that causes it is not commonly found.

      Really?
      Pop two off the stack and ret to the calling routine seems fairly common to me. Lots of functions use two arguments and are called with near calls in various programming languages.

      --
      Sig Battery depleted. Reverting to safe mode.
    3. Re:This isn't nearly as bad as the division bug by Anonymous Coward · · Score: 0

      Well, I may have come across this bug a few times already.

      Phenom X4 820

      make -j4 seems to trigger it inside GCC. GCC dies with "internal error" compiling some of my software. Running make again, no problem, intermittent.

      IMHO, a test case for CPU errors would be something run under DOS (FreeDOS, etc.) so you have 100% control over the CPU at all times.

    4. Re:This isn't nearly as bad as the division bug by GoodNewsJimDotCom · · Score: 4, Funny

      I found out about the division bug as a beginner programmer! I was trying to write the first MMORPG using Quick Basic. I remember division not being exactly accurate, so the solution I needed to use was to round up and down results that are really close. It fixed it, but new programmers shouldn't be forced to deal with stuff like that.

      I've preferred AMDs to Intels because AMD was one of the first sponsors to Esports back in 99. Too bad Columbine happened and I suspect they wanted to distance themselves from Quake tournaments. Another thing I like about AMD was that their processors don't melt if they get hot because they have a self preservation shutdown mode. People said Intel had this, but I melted a processor just a few months ago on SWTOR.

    5. Re:This isn't nearly as bad as the division bug by sjames · · Score: 4, Insightful

      Crash bugs are frustrating, but nowhere NEAR as scary as a bug that results in an incorrect but plausible computation. If the program crashes, you KNOW it crashed and you know the runs before that didn't crash are OK.

      Note that IRL the two cases can overlap. That is, a bug that might trigger a crash or might trigger an incorrect computation that might be plausible depending on luck of the draw.

    6. Re:This isn't nearly as bad as the division bug by Smauler · · Score: 5, Funny

      I was trying to write the first MMORPG using Quick Basic.

      Sounds like the division bug was the least of your problems....

    7. Re:This isn't nearly as bad as the division bug by Corbets · · Score: 4, Funny

      I found out about the division bug as a beginner programmer! I was trying to write the first MMORPG using Quick Basic.

      I've never heard "choosing the wrong programming language" described as a bug, but hey, however you want to play it off, man.

    8. Re:This isn't nearly as bad as the division bug by Anonymous Coward · · Score: 5, Insightful

      Floating point operations are never fully precise. Simple numbers such as 4.0 would be represented as 4.0000000000000213 or 3.99999999999973 if you arrive at this after doing a bunch of calculations.

      This is an inherent limitation of how floating point works, and not something that has been "fixed". Programmers still have to worry about this.
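      A minimal sketch of the point above, in Python (the thread itself names no language for this, so that choice is an assumption): repeated additions drift away from the "obvious" result, and the usual remedy is to compare with a tolerance instead of testing for exact equality. The tolerance value is illustrative only.

```python
import math

# Add 0.1 ten times. 0.1 has no exact binary representation,
# so the rounding errors accumulate across the additions.
total = 0.0
for _ in range(10):
    total += 0.1

exactly_one = (total == 1.0)                          # False: tiny error remains
close_to_one = math.isclose(total, 1.0, rel_tol=1e-9) # True: within tolerance

print(exactly_one, close_to_one)
```

      Comparing with a tolerance is the same idea as "rounding results that are really close", just stated explicitly.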

    9. Re:This isn't nearly as bad as the division bug by synthesizerpatel · · Score: 4, Insightful

      If your program is 'the kernel' then that qualifies as 'as bad as the division bug' && 'it's a big deal'.

    10. Re:This isn't nearly as bad as the division bug by phantomfive · · Score: 1

      I remember division not being exactly accurate, so the solution I needed to use was to round up and down results that are really close.

      Just so you know, division is never accurate in floats, even when the CPU doesn't have bugs. If you're using doubles you'll get better accuracy, but with a 32 bit floating point number, you shouldn't be surprised to find errors in the third digit after the decimal point.

      --
      "First they came for the slanderers and i said nothing."
    11. Re:This isn't nearly as bad as the division bug by tlhIngan · · Score: 0

      Crash bugs are frustrating, but nowhere NEAR as scary as a bug that results in an incorrect but plausible computation. If the program crashes, you KNOW it crashed and you know the runs before that didn't crash are OK.

      Well, there are several problems - in a production environment, you're looking at a possible DoS issue for this. If it happens in the kernel, it can BSoD or kernel panic - putting the whole system offline. Or even worse, it'll continue for a little while and corrupt data in weird and wonderful ways before the misaligned stack finally causes the CPU to walk off the plank.

      If it's an application, having some Line of Business app continually crashing causes its own share of problems. Especially since developers may not realize it's a CPU issue and spend weeks debugging lines of code "that should work". And probably not found if you single-step.

      Even worse, it happens during heavy load. I'm sure if Anonymous decides to DDoS you, having your server crash just adds icing to the cake. Or if it experiences heavy load during some of the bigger shopping days.

      On the bright side, it probably happens very rarely so most production servers probably WON'T see it.

    12. Re:This isn't nearly as bad as the division bug by Forever+Wondering · · Score: 4, Interesting

      Though, it's still very serious. At least it generally causes your program to crash rather than spitting out a wrong answer. And it sounds like the sequence of instructions that causes it is not commonly found.

      I can well understand the guy who found it being all excited. The CPU is the last place you'd look for a bug, and finding one is pretty impressive, especially a really elusive one like this.

      Actually, it could be occurring in other places/programs that aren't crashing but are [silently] producing bad results. The floating point bug, once isolated, could be probed for, and compensated for.

      From what I can tell from reading the assembly code, the function is unremarkable except for the fact that it's recursive. It isn't doing anything exotic with the stack (e.g. just pushes at prolog and pops at epilog). The epilog is starting at +160 and the only thing I notice is that there are several conditional jumps there and just above it is a recursion call with a fall through. But, from the AMD analysis, it appears that it's the specific order of the push/pops that is the culprit. In this instance, it's r14, r13, r12, rbp, rbx

      The workaround for this bug might be that the compiler has to put a nop at the start of all function epilogs (e.g. a nop before the pop sequence) on every function because you can't predict which function will be susceptible. Or, you have to guarantee that the push/pop sequence doesn't emit the sequence that causes the problem (e.g. move the rbp push to the first in sequence as I suspect that putting it in the middle is what is causing the problems)

      --
      Like a good neighbor, fsck is there ...
    13. Re:This isn't nearly as bad as the division bug by bzipitidoo · · Score: 5, Interesting

      Oh, I've found CPU bugs before. But I never found one others hadn't already found. The 16MHz 80386 had a bug with counters. If you did a REP MOVSW or similar instruction in a 16 bit mode, starting on an odd address, and you made the pointer registers roll over, the CPU would lock up. Couldn't handle the transition from 0xFFFF to 0x0001 in either direction. That was fixed in all the faster 386's. As I recall, there were about a dozen bugs in the 386. Of course later processors were all checked for those specific bugs, so they never happened again.

      Then there's unintended features such as pipeline oddities. If you have self modifying code, and it changes the destination of a jump instruction immediately before executing it, the computer will jump to the old address. Step through those same instructions in a debugger, and it will jump to the new address. Strictly speaking, jumping to the old address is incorrect, but it doesn't break any good code and fixing it would wreck pipelining. This behavior has been known for a long time, and every CPU from at least the 386 to the Pentium 4 behaves this way. It wasn't an important problem because so little code was self modifying. Wasn't any good as a copy protection method either, as only an amateur would be fooled by it. I think it's been resolved in at least 2 ways. First, by amending the documentation for the instruction set to expressly state that behavior is undefined in such a case, and second, by proving that there is never any need for self modifying code. And making the separation between code and data explicit. Now we have No eXecution bits.

      There are sometimes even Easter eggs. For some processors, a few unassigned opcodes performed a useful operation. It wasn't by design. Is that a bug? Another case was the use of out of bounds values. For instance, the ancient 6502 supports this packed decimal arithmetic mode, in which 0x99 meant 99. So what happened when some joker gave it an illegal value such as 0xFF? 0xFF was interpreted as 15*10+15 = 165, and one could perform some math on it and get correct results. Divide 0xFF by 2 (shift right), and it would compute the correct result of 0x82. That sort of thing makes life tough for emulators, and I have yet to find an Apple II emulator that reproduces that behavior faithfully.
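      The arithmetic described above can be sketched as follows. This only reproduces the interpretation given in the comment (each nibble treated as a decimal digit, even when it is out of range); it is not a model of real 6502 hardware, and the function names are made up for illustration.

```python
def packed_decimal_value(byte: int) -> int:
    """Interpret a byte as two decimal digits, one per nibble.

    Nibbles above 9 are not valid packed decimal, but the same
    formula still yields a number, which is the quirk described above.
    """
    high, low = byte >> 4, byte & 0x0F
    return high * 10 + low


def to_packed_decimal(value: int) -> int:
    """Encode 0..99 as a packed decimal byte (hypothetical helper)."""
    if not 0 <= value <= 99:
        raise ValueError("packed decimal byte holds 0..99")
    return ((value // 10) << 4) | (value % 10)


print(packed_decimal_value(0x99))                          # valid input
print(packed_decimal_value(0xFF))                          # out-of-range nibbles
print(hex(to_packed_decimal(packed_decimal_value(0xFF) // 2)))
```

      Under this reading, 0xFF stands for 165, and halving it gives 82, which encodes as 0x82.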

      --
      Intellectual Property is a monopolistic, selfish, and defective concept. It is "tyranny over the mind of man"
    14. Re:This isn't nearly as bad as the division bug by sjames · · Score: 5, Insightful

      Imagine, there is a tiny bug that makes your floating point results just slightly wrong once in 1000 times. You run an iterative dynamic simulation of a bridge under load that runs for a million cycles. The results LOOK right...

    15. Re:This isn't nearly as bad as the division bug by Anonymous Coward · · Score: 1

      Hmmm... I don't want to sound cocky, but when a *BSD crashes... it's usually the hardware.

      I remember when my FreeBSD based Server crashed regularly, I figured out that the Xeon CPU was broken (cache defect, appeared only under very heavy load).

    16. Re:This isn't nearly as bad as the division bug by mysidia · · Score: 2

      Though, it's still very serious. At least it generally causes your program to crash rather than spitting out a wrong answer. And it sounds like the sequence of instructions that causes it is not commonly found.

      It may be uncommon to be found... but that doesn't equate to not exploitable

    17. Re:This isn't nearly as bad as the division bug by Darinbob · · Score: 3, Insightful

      CPUs have plenty of bugs. It's not necessarily the last place to look, especially for less popular processors. The only reason it's rarer with Intel and Intel-copying CPUs is because the market is so much bigger and therefore the resources for QA. Actually the bigger and more complex the processors are becoming the more likely it is to have bugs. Of course most are things people don't worry about or that can be worked around by following advice in the errata.

      In fact, so many people assume CPUs have bugs only in the rarest of cases that it's hard to convince others you have actually found a bug that's not in the errata. The same thing happens with compilers: you tell people that the bug must be in the compiler and they roll their eyes at you.

    18. Re:This isn't nearly as bad as the division bug by JWSmythe · · Score: 4, Informative

          Anyone who's programmed long enough has found unexplainable bugs that are eventually traced down to some bad hardware. :)

          I've preferred AMD over Intel for years. Long ago, in a distant computer store, far away.... We sold 386s, 486s, and Pentiums (or their reasonable clone) from Intel, IBM, AMD, and Cyrix. At the time, I didn't really care who made the chip, they were just built out for the customer.

          Over the years, I learned to prefer AMD for both the price and performance. Plenty of people will argue "but this Pentium is faster than that AMD". Well, it's all nice, but I don't *have* to stay bleeding edge. I never liquid cooled my CPU, video card, and memory. Friends did. I was always impressed with how much they wasted. I'd just wait 6 months or so, and get something better, faster, and cheaper. :) I do like having a high performance computer, so I upgrade every year or so.

          For example, I just set up a couple servers from COTS parts. They used AMD FX-8120's (8 core, 4.0Ghz turbo) for $199.99/ea. It seems the comparable Intel is the i7-980 (6 core, 3.6Ghz), which is selling at $589.99. For the difference in price, I could build out a 3rd server, and still have money left over. Toms hardware suggests the i5-2500K (4 core, 3.7Ghz turbo) for $224.99 or i7-2600K (4 core 3.8Ghz turbo) for $324.99 as comparable. If I wanted to spend a little more, I could have gone with the AMD FX-8150 (8 core, 4.2Ghz turbo) for $249.99. Was $50 for .2Ghz worth it? Not really. Something bigger, better, and faster will be out next year, and the year after, and then I'll buy something new.

            I used newegg.com for all the prices, so it would be fairly even.

          The servers actually use as many cores as I can throw at them, so it's extremely beneficial to have more cores at high speeds.

          My desktop/gaming machine still has a Phenom IIx6 1100T in it. All the games I play, I can leave all the settings turned all the way up. Maybe if I ran benchmarks, I'd see something else gets a slightly faster frame rate, but I can't see any difference. As we all know, various benchmarks show different things.

      --
      Serious? Seriousness is well above my pay grade.
    19. Re:This isn't nearly as bad as the division bug by GoodNewsJimDotCom · · Score: 5, Interesting

      Heh. I coded a nice tile based RPG out of it, but I couldn't make it an MMOG because there is no socket code in Quick Basic. The trick to making big games in Quick Basic is to write your own Virtual Disk so you can get past the 640k memory limit. Once you have a virtual disk, you can write an interpreted language inside Quick Basic, then your code is simply loaded up in a custom database. I rewrote the whole thing in C/C++ because people told me I could get socket libraries in it, but I gave up on my game entirely when Ultima Online came out because I felt I wouldn't be able to build up a market because my graphics are so bad. I was partially right in thinking there is only enough room for one MMORPG at a time back in 97, but I think I shouldn't have given up after having coded for thousands of hours, with things like Farmville succeeding today.

    20. Re:This isn't nearly as bad as the division bug by Anonymous Coward · · Score: 0

      Your comment could be misinterpreted. I can't tell if that's because you misunderstand, or because it was just ambiguously stated.

      Division is very accurate in floats. IEEE division returns the closest float to the correct answer. The same rules apply for addition, subtraction, multiplication, and square root. The problem I think you are referring to is that the float format, having a finite number of bits, cannot exactly represent the infinite range of real numbers. Thus, while the result of an IEEE float division will be the closest float possible, it is usually not the exact real number that you wanted.

      This happens with IEEE floats, base-10 floats, and indeed any finite representation tasked with performing arbitrary arithmetic.

      The error after a single operation should be on the order of 1/16 million. If you are finding errors in the third digit after the decimal point then either you have done a lot of operations, perhaps using an unstable algorithm, or your answer is 10,000 and your error is 0.001.
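      A rough way to check the "1/16 million" figure, sketched in Python. Python floats are 64-bit, so this emulates 32-bit floats by rounding through `struct`; the helper names are hypothetical. Dividing two 32-bit values in 64-bit arithmetic and rounding once more is assumed here to match a true 32-bit division, which holds for division because the wider format has more than twice the precision.

```python
import struct
from fractions import Fraction


def to_float32(x: float) -> float:
    """Round a 64-bit Python float to the nearest 32-bit float value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def float32_divide(a: float, b: float) -> float:
    """Divide two values as 32-bit floats would."""
    return to_float32(to_float32(a) / to_float32(b))


def relative_error(a: float, b: float) -> Fraction:
    """Exact relative error of the 32-bit quotient, using rationals."""
    exact = Fraction(to_float32(a)) / Fraction(to_float32(b))
    approx = Fraction(float32_divide(a, b))
    return abs(approx - exact) / abs(exact)


# Half a unit in the last place for a 24-bit significand: 1 / 16,777,216.
HALF_ULP_BOUND = Fraction(1, 2**24)

for a, b in [(1.0, 3.0), (10000.0, 7.0), (2.0, 9.0)]:
    err = relative_error(a, b)
    print(a, b, float(err), err <= HALF_ULP_BOUND)
```

      The absolute error scales with the size of the answer, which is why an answer near 10,000 can be off by about 0.001 while still being the closest 32-bit float.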

    21. Re:This isn't nearly as bad as the division bug by drolli · · Score: 0

      What is the problem with Quick Basic? It came for free and it was quite ok.

    22. Re:This isn't nearly as bad as the division bug by Daniel+Phillips · · Score: 1

      Just so you know, division is never accurate in floats, even when the CPU doesn't have bugs.

      What kind of lather are you people working up? The subject was, a division bug. Out of spec operation. Not normal IEEE precision issues.

      --
      Have you got your LWN subscription yet?
    23. Re:This isn't nearly as bad as the division bug by Simon+Rowe · · Score: 1

      It fixed it, but new programmers shouldn't be forced to deal with stuff like that.

      You're new here aren't you? Software always has to fix up the screwups the hardware engineers made.

    24. Re:This isn't nearly as bad as the division bug by Anonymous Coward · · Score: 0

      The CPU is the last place you'd look for a bug

      If you were in the HW business you'd know better:
      CPU bugs happen all the time (e.g., here's a 30+ bug list for the core: http://blog.pi3.com.pl/?p=55).
      Now a CPU bug that is actionable in user space, that is indeed not easy to find, but I'm not even sure that it is the case here (heck, the guy said he made his own OS to demonstrate it).

    25. Re:This isn't nearly as bad as the division bug by phantomfive · · Score: 3, Insightful

      Just because you find an error in a division when you were programming your MMORPG in visual basic doesn't mean you've found the pentium bug. If you noticed it happening a lot, it probably wasn't the bug, just normal IEEE precision issues.

      --
      "First they came for the slanderers and i said nothing."
    26. Re:This isn't nearly as bad as the division bug by billcopc · · Score: 2

      Pardon my curiosity, but it sounds like you're building ~$400 servers out of basic desktop components. What kind of workload are you putting on these boxes that scales so well, yet doesn't justify the added expense of high-end server class hardware ? Maybe I'm at the other end of the spectrum, but I wouldn't dream of running a server without redundant power supplies and premium boards that have been built and tested to rigorous specs. The added hardware expense more than makes up for decreased maintenance and downtime.

      I used to work for a guy who built servers out of whatever spare parts he had lying around - obsolete desktops, refurbs, ebay junk, whatever. For a while, we were spending at least 10-15 hours a week keeping those things up, or driving down to the datacenter to physically reboot them. I eventually convinced him to spend a LOT more money on fancier hardware, with IPMI, redundant everything and high-efficiency power supplies. Spending that extra thousand up-front meant we could boot them up and practically forget them, uptimes have gone way up and we were logging more billable hours instead of juggling cheap gear. The results spoke for themselves.

      --
      -Billco, Fnarg.com
    27. Re:This isn't nearly as bad as the division bug by Pieroxy · · Score: 3, Informative

      What is the problem with Quick Basic? It came for free and it was quite ok.

      No network access? Might be fine for you, but for an MMORPG programmer on the other hand...

    28. Re:This isn't nearly as bad as the division bug by Mprx · · Score: 1

      Running under DOS typically does not give you 100% control over the CPU:

      http://en.wikipedia.org/wiki/System_Management_Mode

    29. Re:This isn't nearly as bad as the division bug by Anonymous Coward · · Score: 2, Insightful

      > It came for free

      You meant QBasic.
      QuickBasic is for money.

    30. Re:This isn't nearly as bad as the division bug by billcopc · · Score: 1

      there is never any need for self modifying code

      There is when you're on a memory-constrained platform, which admittedly the PC is not. Selfmod code is still used in demo coding, especially with 256-byte and 4096-byte competitions, but that is exclusively an academic exercise.

      On an embedded system with just a few kbytes of memory, like say an ARM-powered gadget, self-modifying code is still relevant, even in 2012. Just because we can put 4 gigs of Ram in a toaster doesn't mean we should.

      --
      -Billco, Fnarg.com
    31. Re:This isn't nearly as bad as the division bug by AmiMoJo · · Score: 3, Interesting

      Most of the undocumented op-codes on older CPUs were down to the fact that they were designed by hand rather than having the circuits computer generated. A computer will make sure all illegal op-codes are caught and generate an exception, but human beings didn't bother. Designers put in test op-codes as well which were usually just left in there for production. Even the way humans design circuits makes them more likely to produce useful undocumented op-codes and side-effects.

      It was somewhat risky to use them though because the manufacturer might decide to change CPU. The Z80 design was licensed out and any number of companies could supply them, all with their own unique bugs. Some games liked to use these features for copy protection and then broke when the producer switched supplier.

      --
      const int one = 65536; (Silvermoon, Texture.cs)
      SJW, n: "Someone I don't like, and by the way I'm a fuckwit" - AC
    32. Re:This isn't nearly as bad as the division bug by wvmarle · · Score: 3, Insightful

      Google is known to build their servers from cheap parts.

      Like a RAID, but then a RAIS (Redundant Array of Independent Servers). Load distribution may be an issue as it has to seamlessly reassign tasks when a server is down for whatever reason. But for sufficiently large operations (five servers or more) this sounds to me like the way to go. Instead of trying to make every individual server highly reliable, go with the still very reliable user-grade stuff and get your reliability by redundancy. And companies like Google need more than one server anyway.

    33. Re:This isn't nearly as bad as the division bug by the+linux+geek · · Score: 1

      Take a look at the benchmarks. The FX-8150 really doesn't come out looking good against the 2500k, much less against the i7-980.

    34. Re:This isn't nearly as bad as the division bug by Anonymous Coward · · Score: 0

      Couldn't handle the transition from 0xFFFF to 0x0001 in either direction. That was fixed in all the faster 386's

      IIRC even 80286 would fire a GPF for accessing a word at 0xffff since it technically accesses 65537th byte of the segment and the limit in real mode is set at 65536.

      and second, by proving that there is never any need for self modifying code.

      Even on pentium it was still important: add reg,[mem] is 3 times slower than add reg,const. This meant quite a bit in a trifiller. Only later it started to incur such devastating penalties that it didn't pay off anymore.

      For instance, the ancient 6502 supports this packed decimal arithmetic mode, in which 0x99 meant 99.

      Similar well known thing is that some x86 instructions which have "decimal" in the name contain an interesting 0xa byte in the opcode.. turns out it doesn't need to be 0xa.

    35. Re:This isn't nearly as bad as the division bug by Anonymous Coward · · Score: 0

      "For some processors, a few unassigned opcodes performed a useful operation. It wasn't by design. Is that a bug?"

      No, it was luck. Those opcodes are not unheard of at all - they were described to me in my computer science degree during a course on designing CPUs (we designed our own very simple one). Essentially the opcodes all get checked through combinations of NAND/NOR etc gates to do the useful things. Any unused opcodes have undefined behaviour and the circuitry is simpler (smaller, faster, cheaper) if you use fewer gates so some of the combinations will lock an operation down to a single defined opcode, but the gates can overlap with undefined opcodes too because you haven't bothered adding extra gates that rule the undefined ones out. Some of the undefined opcodes will just duplicate other opcodes, while others may overlap with multiple defined opcodes and as a result do 2 operations in a single opcode which are the useful ones. Problem is some of them won't be that useful because they'll interfere with each other, but others can indeed be very useful. You're always running a risk though - they're completely undefined and just working because of the combination of gates on your particular CPU. Your program'll work fine on that CPU and possibly other similar ones but eventually it'll run on another CPU that uses a slightly different gate layout and your code'll suddenly stop working.
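      A toy sketch of that partial decoding, in Python. The 3-bit instruction set, the two operations and the bit assignments are all invented for illustration and do not correspond to any real CPU. Each control line checks only the bits it needs, so undefined opcodes either duplicate a defined one or fire two lines at once.

```python
# Each control line is (mask, pattern): it fires when opcode & mask == pattern.
# The decoder deliberately ignores bits it does not need (fewer gates).
CONTROL_LINES = {
    "INC": (0b001, 0b001),  # only looks at bit 0
    "SHL": (0b010, 0b010),  # only looks at bit 1
}

# The documented opcodes of this made-up CPU.
DEFINED_OPCODES = {0b001: "INC", 0b010: "SHL"}


def fired_lines(opcode: int) -> list[str]:
    """Return the control lines that an opcode activates."""
    return [name for name, (mask, pattern) in CONTROL_LINES.items()
            if opcode & mask == pattern]


def execute(opcode: int, acc: int) -> int:
    """Apply every fired operation to an 8-bit accumulator."""
    lines = fired_lines(opcode)
    if "INC" in lines:
        acc = (acc + 1) & 0xFF
    if "SHL" in lines:
        acc = (acc << 1) & 0xFF
    return acc


for opcode in range(8):
    status = DEFINED_OPCODES.get(opcode, "undefined")
    print(f"{opcode:03b} {status:9} fires {fired_lines(opcode)}")
```

      Here the undefined opcode 101 behaves like INC, while 011 increments and shifts in one instruction. A different gate layout (different masks) would change both behaviours, which is the portability risk described above.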

    36. Re:This isn't nearly as bad as the division bug by Anonymous Coward · · Score: 0

      "proving that there is never any need for self modifying code"
      I'm curious about where the hell this is proved.

    37. Re:This isn't nearly as bad as the division bug by Anonymous Coward · · Score: 0

      With the division bug, it was possible to detect it and automatically switch over to a software stack when it was found.

      Is there any way to do the same for this bug?

      Congratulations to Matt Dillon on finding it. Race conditions and issues that only crop up under severe loads are a real bear to debug.
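      For reference, the kind of probe used for the division bug can be sketched like this, using the operand pair that was widely circulated at the time. It is written in Python purely for illustration and says nothing about how to detect the AMD stack-pointer issue discussed in the story.

```python
def fdiv_residual(x: float = 4195835.0, y: float = 3145727.0) -> float:
    """Return x - (x / y) * y.

    With a correctly rounding divider this comes out as exactly 0.0
    for these operands. Flawed Pentiums were widely reported to
    return 256 for the same expression.
    """
    return x - (x / y) * y


def divider_looks_correct() -> bool:
    """True when the known-bad operand pair divides correctly."""
    return fdiv_residual() == 0.0


print(divider_looks_correct())
```

      A program could run such a check once at startup and fall back to software division if it fails, which is the approach the comment above refers to.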

    38. Re:This isn't nearly as bad as the division bug by Rockoon · · Score: 1

      Pop two off the stack and ret to the calling routine seems fairly common to me. Lots of functions use two arguments and are called with near calls in various programming languages.

      The target function of the call has no business pushing or popping its arguments, ever. It doesn't work. Never has. The caller pushes the arguments and then in some calling conventions (such as STDCALL) the target function removes them from the stack using the return instruction itself ("ret 8" will remove 8 bytes of parameters) while in others the caller itself is responsible for removing the parameters (such as CDECL).

      Let me repeat that what you are describing is not possible. When the target function begins executing, the top of the stack is the return address. A pop will be popping that valuable return address, not the first or last parameter which are under it. To be specific, [esp] is the return address and [esp + 4] is the first or last parameter.

      Now don't speak when you don't know what you are talking about. Let's be honest here.. you knew that you didn't... now get off my lawn.

      --
      "His name was James Damore."
    39. Re:This isn't nearly as bad as the division bug by Joce640k · · Score: 1

      with a 32 bit floating point number, you shouldn't be surprised to find errors in the third digit after the decimal point.

      So working with millimeters is completely impossible then?

      Bummer. There goes my plan to write a CAD system using metric measurements.

      --
      No sig today...
    40. Re:This isn't nearly as bad as the division bug by Kjella · · Score: 2

      For example, I just set up a couple servers from COTS parts. They used AMD FX-8120's (8 core, 4.0Ghz turbo) for $199.99/ea. It seems the comparable Intel is the i7-980 (6 core, 3.6Ghz), which is selling at $589.99.

      Modded informative? Only on slashdot... Also you compare turbo speeds (and GHz is silly anyway due to the difference in IPC), yet say:

      The servers actually use as many cores as I can throw at them, so it's extremely beneficial to have more cores at high speeds.

      If all cores are 100% loaded, you're not going to get anywhere close to max turbo. That's the extra boost it can give if only one core is working.

      Toms hardware suggests the i5-2500K (4 core, 3.7Ghz turbo) for $224.99 or i7-2600K (4 core 3.8Ghz turbo) for $324.99 as comparable.

      Tomshardware never tested the FX-8120, so that's a lie. They tested the FX-8150 and found:

      In the very best-case scenario, when you can throw a ton of work at the FX and fully utilize its eight integer cores, it generally falls in between Core i5-2500K and Core i7-2600K

      The FX-8120 has 500 MHz lower base frequency which is far more significant than the 200 MHz lower max turbo. Not many have tested it but xbitlabs did:

      Slower eight-core modification, AMD FX-8120, looks even less convincing, because it has significantly lower clock frequencies. In terms of performance, this processor ranks even below the quad-core competitor solutions. Moreover, FX-8120 is also slower than the top previous-generation AMD CPU - Phenom II X6 1100T.

      So just admit it, you use AMD because you like AMD but clearly you have no clue what the competition offers.

      --
      Live today, because you never know what tomorrow brings
    41. Re:This isn't nearly as bad as the division bug by Anonymous Coward · · Score: 3, Informative

      What is the problem with Quick Basic? It came for free and it was quite ok.

      NO it did NOT. What came free was QBasic, which was a stripped-down version of Quick Basic. The full Quick Basic did not have the 640k memory limitation, was able to fully link/compile stand-alone executables, and had a host of other Professional features that QBasic lacked.

      Don't get me wrong- QBasic was great for a free environment (at the time). But it was severely limited, and all the references to "quick basic" in this thread appear to be referring to shortcomings in QBasic, which were not present in Quick Basic.

    42. Re:This isn't nearly as bad as the division bug by hairyfeet · · Score: 1

      But there is a REALLY important question that needs to be asked here, which is "What are the REAL odds that a normal user will hit this?" I mean is it like the Pentium bug back in the day where unless you were working on some very specific math problems you'd never hit it in a million years, or is it like the Phenom I where even on BE chips you'd often have Core 2 BSOD the system if you went past 2.4GHz?

      Because with all the problems AMD has had getting GloFlo up to speed the LAST thing they need is a chip recall, hell depending on the severity it could torpedo their profits for a year. Now TFA is probably about the most light on details we have seen in a while, but given the fact he mentions 48 cores I can only assume he is talking Opteron. Does this affect the desktops? What about the mobile? AM3, or only the newer chips?

      Without more information all I can go by is my own experience and so far slamming the hell out of AM3 and Brazos chips I haven't run into anything causing crashes, it's all smooth sailing. So maybe someone who actually knows more details can fill us in?

      --
      ACs don't waste your time replying, your posts are never seen by me.
    43. Re:This isn't nearly as bad as the division bug by Joce640k · · Score: 1

      Just so you know, division is never accurate in floats

      Um, yes it is.

      --
      No sig today...
    44. Re:This isn't nearly as bad as the division bug by Anonymous Coward · · Score: 0

      When the target function begins executing, the top of the stack is the return address. A pop will be popping that valuable return address, not the first or last parameter which are under it.

      We are not talking about the start of the function, but the end. At that time, anything the function has pushed onto the stack will be on the top of the stack, before the return functions. Often there are faster instructions than individual pops, but in a calling convention where some registers are "callee-saved", you want to pop those registers. The advantage of callee-saved registers is that the function knows which registers it overwrites, and thus which registers it needs to save. Instructions such as pusha/popa don't really fit the case where you need to save only a few registers.

    45. Re:This isn't nearly as bad as the division bug by Joce640k · · Score: 2

      I used to work for a guy who built servers out of whatever spare parts he had lying around - obsolete desktops, refurbs, ebay junk, whatever. For a while, we were spending at least 10-15 hours a week keeping those things up, or driving down to the datacenter to physically reboot them.

      YMMV but I've got junker machines which have been running as servers 24/7 for years without a glitch. I'm just about to replace a few of them with Intel Atom boxes to save power/eardrums and I'm worrying about the reliability of the new machines. I'm going to keep the old machines lying around for at least a couple of months.

      --
      No sig today...
    46. Re:This isn't nearly as bad as the division bug by Joce640k · · Score: 1

      And it sounds like the sequence of instructions that causes it is not commonly found.

      Really?
      Pop two off the stack and ret to the calling routine seems fairly common to me. Lots of functions use two arguments and are called with near calls in various programming languages.

      So how come AMD systems aren't crashing all over the place? Why did it take him a year to be able to reproduce it reliably? I think my desktop machine has one of those chips in it but it's never had an unexplained crash.

      --
      No sig today...
    47. Re:This isn't nearly as bad as the division bug by Anonymous Coward · · Score: 1

      Some floating point operations are precise - if the numbers and the results are fully representable, then the operation is precise.
      4.0 / 2.0 is fully precise.

    48. Re:This isn't nearly as bad as the division bug by JWSmythe · · Score: 5, Interesting

      Well, $604.94/ea. The memory came with an 8GB Class 4 micro SD, and we got a $10 newegg gift card each. I forget what that was bundled with. If you consider a gift card as cash, they were under $600/ea.

      13-131-767 @$94.99/ea ASUS M5A97 AM3+
      17-822-008 @$24.99/ea DIABLOTEK PSDA500 500W RT
      19-103-961 @$199.99/ea AMD 8-CORE FX-8120 3.1G
      20-220-609 @$84.99/ea 4Gx4 PATRIOT PGD316G1600ELQK
      22-148-725 @$99.99/ea Seagate 1.5TB ST1500DL003 (x2)

      All of those are quantity 1, except the hard drives. They are 8 core 4Ghz (always running in Turbo mode), with 16GB ram, and RAID 1 on the drives. I opted to go more like Google's topless server. I used cable ties to mount up everything on wire racks from Home Depot. Ya, the same plastic/rubber coated ones you'd use in your closet. This is serving out of my house on a business FiOS line, so no one at a datacenter can complain. :) They're running amazingly cool. Because there's nothing interrupting normal convection air currents, all the heat sinks and drives are cool to the touch. They're a bit quieter than my desktop PC, because I don't require any extra fans to pull the hot air out of the case. My regular desktop has a 250cfm fan on it to keep it cool. Without it, and with the side on, it can overheat in a few minutes when gaming.

      The room does have an air conditioning return in it, which helps keep the room cool. The only fan I added was a HEPA filter. It's oversized for the room, but it'll help keep dust off the machines. The room is the same temperature as the rest of the house, so I'm happy with it. It serves no purpose for cooling the machines, since it's not even pointed at them. :)

      I have some pretty low load servers. Rather than buying a dozen of anything, I opted for using virtual machines. These two servers are hosting 4 VMs at this time, and there will be more. It's a young setup, and I have a lot of work to do on it. I opted to use VirtualBox. It works very well. I had intended trying VMWare ESXi or Citrix XenServer. Unfortunately, neither would use the crappy software RAID that the boards provide, and I wasn't willing to drop money on real RAID controllers. I looked around a bit, and it seems that you can try to use some workarounds, but I didn't have the time or inclination to do it, where I could have VirtualBox going in less than an hour.

      The VMs are redundant between servers. Further on, you can read more about how I did it in the past between physical boxes. So if a single VM crashes, who cares. If a VM host crashes, well, it's reduced redundancy, but I'm still operating. I'm going to put out more VM hosts, and increase the redundancy. 4 machines with 6 VMs each is like 24 physical boxes. That's a serious savings, especially where the VM host costs about $600.

      Let me give you a little history. :)

      Long before Google made the pictures of the way they do servers, the company I was at was using COTS parts. That was voyeurweb.com (NSFW). They were hosting with a company not to be named (as in, I can't remember), who sold them on a $50k investment of a Sun server. They promised it was more power than anyone could ever want. That lasted about 3 days. It was after this, I got involved with them. We dropped about $15k on 10 servers. They were fairly cheap machines. Asus gaming motherboards, AMD K6/2 300 CPU, 512MB RAM, 8GB and 20GB IDE drives. The most expensive part at the time was the cases. It was pretty much what you'd be using at home at the time.

      We had the occasional failures, but they were usually due to load or CPU fan failures. At the time, they had under 1 million daily viewers, so we could handle that load on 4 of the 10 machines. Load balancing was done with DNS round robin. I know people say it's a poor system, but it worked well. There was typically a 3 second delay if you happened to hit a bad server, and then you'd roll off to the nex

      --
      Serious? Seriousness is well above my pay grade.
    49. Re:This isn't nearly as bad as the division bug by JWSmythe · · Score: 2

          It's too late in the evening for me to go chase down the Tom's Hardware link. I closed that tab a while ago.

          As for the rest... It works. It works well. It's cheaper. I don't upgrade desktops or servers every day. No one does, unless you're filthy rich and don't know where to spend your money. If that's the case, you can send me some of that via PayPal on my site.

            This round of upgrades is replacing dual Opteron 1.4Ghz boxes, putting their responsibilities on VMs on these hosts. They're doing great. Not just great. Really great. Each VM is set to use 2 cores, and up to 90% of the CPU. That was an arbitrary limit I imposed on my expectation of growth on the servers, and my next planned server purchases.

          Why should I spend extra money on marginally faster equipment? In what world does that make sense? The customers don't care. They only care that I offer fair pricing and reliable service. For my desktop, I want it to run, and keep running, where I don't have to worry about random crashes while I'm working. In that, it does it perfectly.

            Next time around, I'll probably be looking at 16 cores and 6Ghz. The difference between a few Mhz that can be argued til you turn blue and are spitting at me, doesn't matter in the least. People argued the wonders of the 486/50 versus my 486/33. Other people argued that we'd never go higher because of radio and television interference. They swore we'd never get Ghz CPUs because microwave radiation would kill us all. Oh, how I miss FidoNet.

          Spending extra money on the latest, greatest, bleeding edge, fast as I can get, is a game for the insanely rich and foolish. Which one are you?

         

      --
      Serious? Seriousness is well above my pay grade.
    50. Re:This isn't nearly as bad as the division bug by JWSmythe · · Score: 0

          Ok, I skimmed it. I see the results going either way, depending on the benchmark tested. Are you still saying I was wrong for saving several hundred dollars?

      --
      Serious? Seriousness is well above my pay grade.
    51. Re:This isn't nearly as bad as the division bug by TheRaven64 · · Score: 1

      GCC dies with "internal error" compiling some of my software

      Is this different from the expected behaviour in some way?

      --
      I am TheRaven on Soylent News
    52. Re:This isn't nearly as bad as the division bug by TheRaven64 · · Score: 1

      Can I come and live in your world where FreeBSD 5.x never happened please?

      --
      I am TheRaven on Soylent News
    53. Re:This isn't nearly as bad as the division bug by complete+loony · · Score: 1

      Not all numbers can be accurately converted to the form; +/- (1 [+ 1/2] [+ 1/4] [+ 1/8] .... ) * 2^n. Basically floating point can't accurately represent any rational number that has a prime factor other than 2. Of course 4 is a number that *can* be accurately described.

      --
      09F91102 no, 455FE104 nope, F190A1E8 uh-uh, 7A5F8A09 that's not it, C87294CE no. Ah! 452F6E403CDF10714E41DFAA257D313F.
    54. Re:This isn't nearly as bad as the division bug by Anonymous Coward · · Score: 0

      I was trying to write the first MMORPG using Quick Basic

      Interesting. Seeing as how QBasic didn't exist until 1985, and MUD1 / EssexMUD came online in 1980. Many MUD's and MUSHes supported player counts - especially in a given instance/dungeon/room/area/server - just as high if not higher than most modern MMO's do, so the "massively" portion definitely isn't a disqualifier.

      Oh, and if "graphics" are your requirement, we had a MUSH on MidnightEscapeBBS in Atlanta supporting up to 32 simultaneous players with a GUI and rough nethack-like ANSI graphics in 1994.

    55. Re:This isn't nearly as bad as the division bug by Anonymous Coward · · Score: 0

      Actually, 4 is represented exactly, as is any power of 2 within the range that can be expressed by that size of floating-point number.

      Where things get ugly is when you need to express something that's not a sum of halves, quarters, eighths, etc. Thirds fail in this regard, as do (unfortunately) tenths.
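
      One way to check this for yourself is to convert a stored double back into an exact fraction and compare. A minimal sketch in Python (picked only for illustration; the helper name is_exact is made up here):

```python
from fractions import Fraction

def is_exact(numerator, denominator):
    """True if numerator/denominator is stored in a double without rounding."""
    exact = Fraction(numerator, denominator)
    stored = Fraction(numerator / denominator)  # exact value of the double
    return stored == exact

print(is_exact(4, 1))   # True: a power of two
print(is_exact(3, 8))   # True: a quarter plus an eighth
print(is_exact(1, 3))   # False: a third
print(is_exact(1, 10))  # False: a tenth
```

      Fraction(some_float) gives the exact value the hardware stored, so any mismatch is precisely the rounding that happened.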

    56. Re:This isn't nearly as bad as the division bug by Anonymous Coward · · Score: 0

      Sure you can, measure everything in millimeters as long values and there is no single decimal point to worry about...

    57. Re:This isn't nearly as bad as the division bug by Anonymous Coward · · Score: 0

      Imagine there is a tiny bug that makes your floating point results just slightly wrong once in 1000 times. You run a rocket motor. Your satellite orbit is 50 km off. Is it worse than a crash?

    58. Re:This isn't nearly as bad as the division bug by Anonymous Coward · · Score: 0

      Intel and AMD procs both do thermal shutdown. To that end, if you remember fiddling with AMD XP-era chips, they were quite prone to death if they weren't adequately cooled. Stock cooling didn't always qualify for adequate. It's quite a bit different today, however I really doubt you had the same experience with a modern Core-i Intel chip...

    59. Re:This isn't nearly as bad as the division bug by Anonymous Coward · · Score: 0

      You must be a serious sysadmin hacker. Please tell us about your cache defect troubleshooting ways!

    60. Re:This isn't nearly as bad as the division bug by Rakishi · · Score: 1

      You saved $25. Actually since you're not OCing you could have gone with a 2500 for $210.

      So no you didn't save hundreds.

    61. Re:This isn't nearly as bad as the division bug by Anonymous Coward · · Score: 0

      I think my desktop machine has one of those chips in it but it's never had an unexplained crash.

      Sure, but are you running a MMORPG written in QuickBASIC? Huh? Obviously not.

    62. Re:This isn't nearly as bad as the division bug by Anonymous Coward · · Score: 0

      And, of course, this is in Computer Programming 101, Math 101, Physics 101... (they teach it to you in at least three places because error margins ARE really important).

      But yeah, I'd not expect teenagers to know about it, even in the already restricted demographics of "people who at least know how to use basic QBASIC". I do recall I was really excited when I figured out state machines when I was 12 years old, while reading z80 assembly to write trainers for Konami MSX games. I learned event-driven programming and state machines by partially reverse engineering Konami games in order to find the place where lives/ammo/health was decreased :-)

    63. Re:This isn't nearly as bad as the division bug by kevingolding2001 · · Score: 1

      especially if something like this occurs on a test machine but not on a development machine.

      I'd rather that than the other way around.

    64. Re:This isn't nearly as bad as the division bug by kevingolding2001 · · Score: 1

      Wait, never mind. Just re-read it properly. Forget I posted.

    65. Re:This isn't nearly as bad as the division bug by Anonymous Coward · · Score: 0

      No, due to stack framing and the ia32 ABI, it is NOT that common at all.

    66. Re:This isn't nearly as bad as the division bug by Rockoon · · Score: 4, Informative

      We are not talking about the start of the function, but the end.

      Who is this "we" .. are you, the anonymous coward, teamed up with icebike (68054)? Clearly you shouldn't be, since he most definitely stated his belief that a two-parameter function would pop its two input parameters near its final return statement.

      anything the function has pushed onto the stack will be on the top of the stack, before the return functions.

      What you are saying is not news to me. The problem with your argument is that in the x86-64 calling conventions (which is what the article is talking about) there are plenty of volatile registers to use. To be specific there are 7 general purpose registers (64-bit) as well as 6 SSE registers (128-bit) that are considered volatile. If a function really uses so many registers that it requires saving a few of the non-volatile registers, then the function is also most often going to be so non-trivial that it must maintain 16-byte stack alignment.

      Only leaf functions can safely violate the 16-byte alignment rule and are allowed to push and pop willy-nilly, but leaf functions also don't need non-volatile registers themselves because they aren't calling anything that might destroy the registers they use. So we are talking about a very narrow situation where the function is (a) a leaf function and (b) takes many parameters (more than 4, certainly) in order to create the register pressure required to need to spill some of them onto the stack someplace other than the mandatory scratch stack space (for the first 4 arguments) required by the calling convention.

      My lawn...

      --
      "His name was James Damore."
    67. Re:This isn't nearly as bad as the division bug by Anonymous Coward · · Score: 0

      A little candy for a guy who goes on his own way. This would be worthwhile even if DF-BSD was a folly, but it's actually turning out to be one hell of an OS. Good on him, and his peeps.

    68. Re:This isn't nearly as bad as the division bug by Anonymous Coward · · Score: 0

      Another thing I like about AMD was that their processors don't melt if they get hot because they have a self preservation shutdown mode. People said Intel had this, but I melted a processor just a few months ago on SWTOR.

      Video of Intel and AMD processors when heatsink removed (Tom's hardware vid at youtube)

    69. Re:This isn't nearly as bad as the division bug by Anonymous Coward · · Score: 0

      You're an idiot. You'll get 6 GHz NEVER. Try to understand the physical limits.

    70. Re:This isn't nearly as bad as the division bug by machine321 · · Score: 1

      For example, I just set up a couple servers from COTS parts. [...] for $199.99/ea. It seems the comparable Intel is the i7-980 (6 core, 3.6Ghz), which is selling at $589.99. For the difference in price, I could build out a 3rd server, and still have money left over.

      if 3 x $199.99 < $589.99, I think maybe you hit the Pentium division bug discussed earlier.

    71. Re:This isn't nearly as bad as the division bug by fnj · · Score: 1

      If the program crashes, you KNOW it crashed and you know the runs before that didn't crash are OK.

      Sorry, no. Just no. The real world is not that simple. If there are very infrequent random stack corruptions, some of the time you'll just get bad data with no crash, and sometimes you'll get bad return addresses resulting in a crash. So you don't know the runs that didn't crash are automatically "OK". You only know that by running repeated identical tests on different hardware and comparing the results.

    72. Re:This isn't nearly as bad as the division bug by dzfoo · · Score: 1

      I think the point is that finding a CPU or compiler bug is a more rare occurrence than finding a bug in your own higher-level code.

      This is why people roll their eyes at you when you claim to have found a compiler bug. It's not a belief that such a thing is impossible. It's just that you must prove exhaustively that you have accounted for and discarded all other possibilities.

      The same people will admire you and sing your praises when you show them thorough proof of your discovery.

              -dZ.

      --
      Carol vs. Ghost
      ...Can you save Christmas?
    73. Re:This isn't nearly as bad as the division bug by aaron552 · · Score: 1

      The workaround for this bug might be that the compiler has to put a nop at the start of all function epilogs

      More likely it'll be "fixed" in microcode and made available in a firmware update by motherboard manufacturers. It's a fairly small subset of all available CPUs.

      --
      I had a sig once. It was lost in the great storm of '09.
    74. Re:This isn't nearly as bad as the division bug by dalias · · Score: 4, Informative

      This is not insightful; it's wrong. Floating point on any modern system conforms, or at least is intended and assumed to conform, to IEEE 754. There are exact answers specified for every basic arithmetic operation and non-transcendental functions. Of course there are decimals that have no representation in binary, but 4.0 is not one of them.
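
      For anyone who wants to see it rather than take it on faith, here is a small Python sketch (my own illustration; division_error is a made-up helper) that measures the exact gap between a double-precision quotient and the true quotient of the same two operands:

```python
from fractions import Fraction

def division_error(a, b):
    """Exact difference between the double a / b and the true quotient a / b."""
    return Fraction(a / b) - Fraction(a) / Fraction(b)

print(division_error(4.0, 2.0))     # 0
print(division_error(100.0, 10.0))  # 0

# 1/3 has no finite binary expansion, so some rounding is unavoidable,
# but the result is still the nearest double: off by at most half a unit
# in the last place (2**-55 for a result between 0.25 and 0.5).
err = division_error(1.0, 3.0)
print(err != 0)                      # True
print(abs(err) <= Fraction(1, 2**55))  # True
```

      When the true quotient is representable you get it exactly; when it isn't, you get the closest value that is.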

    75. Re:This isn't nearly as bad as the division bug by DuckDodgers · · Score: 1

      Interesting post, thanks for sharing.

    76. Re:This isn't nearly as bad as the division bug by Fizzl · · Score: 1

      /facepalm

      The amount of fuck in just this single slashdot thread offends my eyes. I browsed reddit couple of years ago daily, but it seems the glory days are over.

      Unless GoodNewsJimDotCom is trolling, slashdot needs an enema.

      Replying to this specific reply because I am impressed by the amount of idiocy one can contain in four words of less than four letters each.

    77. Re:This isn't nearly as bad as the division bug by dalias · · Score: 1

      This is completely wrong. Corrupting the stack pointer is not a "crash bug". It's a code execution vulnerability. If there's a pattern to the corruption that happens and an attacker can control contents elsewhere on the stack, it's likely to turn into arbitrary code execution.

    78. Re:This isn't nearly as bad as the division bug by Anonymous Coward · · Score: 0

      Or maybe he meant Micro$oft QuickBASIC...

    79. Re:This isn't nearly as bad as the division bug by Anonymous Coward · · Score: 0

      Really, the first MMO? You were writing it in QB? Was C not invented yet (wait for it...)? You *melted* a processor playing starwars? Have you even seen a modern processor data sheet?

      Your processor must be accumulating some excess heat from the giant load of sh*t you just posted.

    80. Re:This isn't nearly as bad as the division bug by Anonymous Coward · · Score: 0

      Depends on what your base unit is.

    81. Re:This isn't nearly as bad as the division bug by Anonymous Coward · · Score: 0

      Another thing I like about AMD was that their processors don't melt if they get hot because they have a self preservation shutdown mode. People said Intel had this, but I melted a processor just a few months ago on SWTOR.

      So you took up Slicing or Scavenging on SWTOR, what's that got to do with anything?

      Intel CPU's have thermal shutdown too. If you melted a CPU, it's possible you had a defective sensor, or some overvoltage, but don't blame the design.

    82. Re:This isn't nearly as bad as the division bug by Anonymous Coward · · Score: 0

      The people who said you could get socket connectivity on pre-XP Blowdose stack never apparently tried it themselves.

    83. Re:This isn't nearly as bad as the division bug by drinkypoo · · Score: 1

      Just because we can put 4 gigs of Ram in a toaster doesn't mean we should.

      if we can do it more cheaply (in all senses of the word) than making our toast any other way (what, are you going to overclock the ram and use it as the heating element?) then we should.

      --
      "You're right," Fisheye says. "I should have set it on 'whip' or 'chop.'"
    84. Re:This isn't nearly as bad as the division bug by Eskarel · · Score: 1

      Even if that were true, and it's not, stack pointer errors can potentially cause issues other than segfaults, the bigger difference is very simple. Intel in the early 90's could afford a CPU recall. AMD in 2012 cannot, nor can they afford the hit to their reputation. This is a big deal for AMD, and indirectly a big deal for the PC market. I don't much like AMD anymore, their chips aren't as good as the competition and they're not all that much cheaper, but I do like competition existing.

    85. Re:This isn't nearly as bad as the division bug by Anonymous Coward · · Score: 0

      Technically, he was right. Division, using floating-point variables *is* accurate. It's just not *precise*. Accuracy is a measure of the consistency of the result. Precision is a measure of the deviation of the result from the actual answer.

    86. Re:This isn't nearly as bad as the division bug by juancn · · Score: 1

      It might be way worse. Since this updates the stack pointer, it might be exploitable.

    87. Re:This isn't nearly as bad as the division bug by Anonymous Coward · · Score: 0

      For that I would imagine whoever wrote the program would use an arbitrary precision library.

    88. Re:This isn't nearly as bad as the division bug by Anonymous Coward · · Score: 0

      Yeah, never saw my nineties-end era Coppermine pc's certain undocumented BIOS call listed running DOS nor Windows, until I installed Linux.
      You know, (since forgot the address) the one called CIA BOM. Hooray for SMM.
      Wonder, though, if those nasty DRAM flash operations are still available today? Whatcha bet :) ?

    89. Re:This isn't nearly as bad as the division bug by Joce640k · · Score: 1

      /facepalm

      The amount of fuck in just this single slashdot thread offends my eyes. I browsed reddit couple of years ago daily, but it seems the glory days are over.

      Unless GoodNewsJimDotCom is trolling, slashdot needs an enema.

      Replying to this specific reply because I am impressed by the amount of idiocy one can contain in four words of less than four letters each.

      Please explain how dividing 100 by 10 is inaccurate in floating point.

      --
      No sig today...
    90. Re:This isn't nearly as bad as the division bug by Joce640k · · Score: 1

      True, but you don't need to go that far. There's plenty of pairs of numbers which can be divided exactly in floating point. The claim that it's "never accurate" is laughable.

      --
      No sig today...
    91. Re:This isn't nearly as bad as the division bug by wbr1 · · Score: 1

      I actually coded a multi-player tetris in QB. Even though there is no socket code, there is file access. It was done simply by reading and writing to a pair of files in a network mapped drive. The only reason it was a pair of files was that one file could be locked briefly when a machine was reading/writing to it. It worked quite well on the 486's of the day. I have no idea how it would scale though.

      --
      Silence is a state of mime.
    92. Re:This isn't nearly as bad as the division bug by ifiwereasculptor · · Score: 1

      Yes, but which family of CPUs was affected? I couldn't find that information in TFAs. Bulldozers? Llanos? A specific subset of either?

    93. Re:This isn't nearly as bad as the division bug by AvitarX · · Score: 1

      Interesting, I'd say it scales poorly, but it doesn't really matter.

      "under 1 million" = 10 300 MHz servers
      1-2 million = 25-50 1Ghz servers
      7-11 million = 150 even faster servers

      A solution that scaled well would look more like 110 300 MHz servers for 11 million visitors, and still have 2.5x the number of servers needed. Obviously some extra is needed on the back-end, but I am assuming that even the original site used database and scripting.
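
      The back-of-envelope math, as a rough Python sketch. The input figures are the ones quoted from the parent post; the strictly linear model and all the names are my own assumptions:

```python
# Figures quoted from the parent post; the linear model is an assumption.
BASE_VISITORS_MILLIONS = 1   # "under 1 million" daily viewers
BASE_SERVERS = 10            # 300 MHz machines deployed at that load
SERVERS_ACTUALLY_BUSY = 4    # the load reportedly fit on 4 of the 10

def servers_if_linear(visitors_millions):
    """Servers needed if the fleet grew in direct proportion to visitors."""
    return BASE_SERVERS * visitors_millions // BASE_VISITORS_MILLIONS

# Spare capacity built into the original deployment.
headroom = BASE_SERVERS / SERVERS_ACTUALLY_BUSY

print(servers_if_linear(11))  # 110
print(headroom)               # 2.5
```

      So 110 of the original boxes would cover 11 million visitors with the same 2.5x headroom, versus the 150 much faster machines actually used.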

      --
      Wow, sent an e-mail as suggested when clicking on "use classic" banner, and got a fast response that addressed my msg
    94. Re:This isn't nearly as bad as the division bug by Galactic+Dominator · · Score: 1

      I agree these high-end server class machines have their place and in some situations are a requirement. However functionality like KVM over IP isn't limited to simply IPMI. Anyways, if you need BIOS and power level access remotely on a regular basis I would go as far as saying perhaps you should reconsider the arrangement.

      without redundant power supplies and premium boards that have been built and tested to rigorous specs

      All that gets you is a huge pricetag and is only worthwhile in absolutely mission-critical no downtime environments. The components in that class of hardware still fail, IME with almost the same regularity as consumer level components (and statistically this observation is correct concerning consumer/enterprise HDDs). The one thing the server class hardware can provide is on-site, next day replacement. Consumer level NIC's can't handle sustained high throughput so that's the one thing which may need upgrading if performance is an issue.

      The mindset of new sysadmins is sometimes mind-boggling. They will spend so much time and money to ensure a system is able to handle a power supply and NIC failure that it would simply be easier and cheaper to set up the service on white box hardware and make it HA.

      we were spending at least 10-15 hours a week keeping those things up

      I kind of find this hard to believe. If a unit was reliable in desktop life, it's almost universally reliable in server life. But if you put junk in your servers, they won't work well. I've seen companies pay big money for hardware that did this. Of course they got warranty support, but the point remains.

      --
      brandelf -t FreeBSD /brain
    95. Re:This isn't nearly as bad as the division bug by Jamu · · Score: 1

      You could use a different representation of course. Decimal floating-point or even a base-60 based representation would handle some numbers better. However you'll not be able to represent as many numbers exactly in the same storage space (98% as many for example). However if the numbers you do want to use are better represented, it might make sense.

      --
      Who ordered that?
    96. Re:This isn't nearly as bad as the division bug by bill_mcgonigle · · Score: 1

      The difference between a few Mhz that can be argued til you turn blue and are spitting at me, doesn't matter in the least.

      heh, some people try to eke out those extra few MHz on some part and wind up introducing additional wait-states. AMD is usually good about offering parts that are a true multiple of commonly available memory speeds.

      There are many dimensions to building a good system - as you point out price and suitability aren't to be ignored.

      I'd suggest that somebody who has extra money for the highest speed CPU take one step down instead and use the difference to buy a few goats for third-world families.

      --
      My God, it's Full of Source!
      OUTSIDE_IP=$(dig +short my.ip @outsideip.net)
    97. Re:This isn't nearly as bad as the division bug by Mr+Z · · Score: 1

      Modern compilers also don't tend to push/pop anyway. They tend to atomically allocate a frame with a single SP update, and use frame-pointer-relative addressing to manipulate the stack frame. This allows for more parallelism, since any moves into/out-of the frame can run in parallel, and the only serialization is the single SP update. For example: (and pardon Slashdot's mandatory mangling of the formatting...)

      pushq %rbp #
      movq %rsp, %rbp #,
      subq $64, %rsp #,
      movq %rdi, -40(%rbp) # spec, spec
      movl %esi, -44(%rbp) # source_x, source_x
      movl %edx, -48(%rbp) # source_y, source_y
      movl %ecx, -52(%rbp) # target_x, target_x
      movl %r8d, -56(%rbp) # target_y, target_y
      movl %r9d, -60(%rbp) # bpp, bpp

      That's a function entry with a pretty beefy stack frame. All RBP-relative addressing and a single stack move with the SUBQ. The other end of the function releases the stack frame with only a single "pop":

      leave
      ret

      That moves RBP into RSP and pops the old RBP. No need to explicitly pop all those passed-on-the-stack arguments. This function isn't even a leaf function--it calls calloc, at least.

      Now I do see some functions that push/pop RBX, so there are a few push/pops out there, and they also happen to be leaf functions. In fact, I do see an example of a "pop, pop, ret" right here:

      gfx_scale_set_palette:
      pushq %rbp #
      movq %rsp, %rbp #,
      pushq %rbx #

      ....

      popq %rbx #
      leave
      ret

      Iiiiiinteresting. I don't even see RBX getting used explicitly elsewhere in the body of the function. I wonder what that is all about? RBX isn't used implicitly by anything that I remember offhand. That's usually AX/DX (or EAX/EDX, RAX/RDX). Hmmm.

    98. Re:This isn't nearly as bad as the division bug by mcgrew · · Score: 4, Funny

      Ah, the memories...

      Though, it's still very serious. At least it generally causes your program to crash rather than spitting out a wrong answer.

      At Intel, Quality is Job 0.99989960954

      Q: What is a mad scientist?
      A: A researcher with a Pentium

      Q: How many Pentium designers does it take to screw in a light bulb?
      A: 1.99904274017, but that's close enough for non-technical people.

      Q: What's another name for the "Intel Inside" sticker they put on Pentiums?
      A: The warning label.

      Q: Why didn't Intel call the Pentium the 586?
      A: Because they added 486 and 100 on the first Pentium and got 585.999983605.

      Q: Did you hear about the new "morning after" pill being developed as a replacement for RU-486???
      A: It's called RU-Pentium. It causes the embryo to not divide correctly.

    99. Re:This isn't nearly as bad as the division bug by DocSavage64109 · · Score: 1

      Maybe I'm nitpicking, but you sure that wasn't 8MB on those K6/2s instead of 8GB? Otherwise, thanks for the awesome writeup!

    100. Re:This isn't nearly as bad as the division bug by bill_mcgonigle · · Score: 1

      neither would use the crappy software RAID that the boards provide

      Generally you don't want to use that - just configure them as normal drives and use linux RAID-1 in software. It's usually faster and almost always more reliable.

      If you ever need to rebuild these boxes, have a look at Xen 4 on Linux (Fedora works now, but CentOS 6 should be coming fairly soon - reminds me I need to send some patches upstream...). You wind up with all kinds of flexibility. The last one I built uses some big SATA drives with four SSDs in front of them, using Facebook's flashcache for performance and energy benefits.

      --
      My God, it's Full of Source!
      OUTSIDE_IP=$(dig +short my.ip @outsideip.net)
    101. Re:This isn't nearly as bad as the division bug by phantomfive · · Score: 1

      Suggest not using floats. Also, if you are writing a financial program, or anything keeping track of money, or anything requiring 100% accuracy to a specific tolerance, don't use floats.
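
      A minimal sketch of why binary floats are a poor fit for money, assuming Python's standard decimal module and made-up amounts of 0.10 and 0.20 (the variable names are illustrative only):

```python
from decimal import Decimal

# Binary floats cannot store 0.10 or 0.20 exactly, so the sum drifts.
float_total = 0.10 + 0.20
print(float_total == 0.30)

# Decimal values built from strings keep exact base-10 amounts.
decimal_total = Decimal("0.10") + Decimal("0.20")
print(decimal_total == Decimal("0.30"))

# Counting in integer cents is another common option.
cents_total = 10 + 20
print(cents_total == 30)
```

      Either Decimal or integer cents keeps the books exact; which one is more convenient depends on how much rounding policy the program has to express.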

      --
      "First they came for the slanderers and i said nothing."
    102. Re:This isn't nearly as bad as the division bug by Anonymous Coward · · Score: 0

      Not true. Floating point can represent all 24-bit/53-bit integers (float vs double). So out goes your claim that there can't be a prime factor other than 2. In fact, lots of rational numbers are rational as floating point. It's just that the two differ on which are rational and which are irrational. What you described is not floating point.

      Read http://en.wikipedia.org/wiki/Single_precision for more information on how it's stored (double slightly increases the exponent but mostly increases the fraction).
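
      The 24-bit and 53-bit limits can be checked directly. This is a sketch assuming CPython, where float is an IEEE 754 double; the helper to_float32 is a hypothetical name that rounds through single precision using the struct module:

```python
import struct

def to_float32(x: float) -> float:
    """Round a Python float (a double) to single precision and back."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

# Single precision has a 24-bit significand.
single_limit = 2 ** 24
single_ok = to_float32(float(single_limit)) == single_limit
single_next_ok = to_float32(float(single_limit + 1)) == single_limit + 1
print(single_ok, single_next_ok)

# Double precision has a 53-bit significand.
double_limit = 2 ** 53
double_ok = float(double_limit) == double_limit
double_next_ok = float(double_limit + 1) == double_limit + 1
print(double_ok, double_next_ok)

# An integer with odd prime factors is still stored exactly.
print(to_float32(105.0) == 3 * 5 * 7)
```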

    103. Re:This isn't nearly as bad as the division bug by dubbreak · · Score: 1

      Toms hardware suggests the i5-2500K (4 core, 3.7Ghz turbo) for $224.99 or i7-2600K (4 core 3.8Ghz turbo) for $324.99 as comparable.

      Tomshardware never tested the FX-8120, so that's a lie. They tested the FX-8150 and found:

      In the very best-case scenario, when you can throw a ton of work at the FX and fully utilize its eight integer cores, it generally falls in between Core i5-2500K and Core i7-2600K

      The FX-8120 has 500 MHz lower base frequency which is far more significant than the 200 MHz lower max turbo. Not many have tested it but xbitlabs did:

      I hope Tom's hardware wasn't recommending the 2500K or 2600K for virtualization. Those models are unlocked for over clocking but have the virtualization extensions disabled. I just ran into that when evaluating processors for running vmware.

      --
      "If you are going through hell, keep going." - Winston Churchill
    104. Re:This isn't nearly as bad as the division bug by m.dillon · · Score: 2

      The pushes and pops involved are for call-saved registers, not for arguments. Over the years GCC has kinda flip-flopped over the best way to handle that... whether to use PUSH and POP or to use SUB/MOV/MOV/MOV/... the MOV sequences produce much longer instructions, so if you are space-conscious (e.g. -Os), you are more likely to get PUSH/POP.

      Intel and AMD cpus, over the years, have been better or worse at optimizing instructions which adjust the stack pointer. These days PUSH/POP sequences should be as fast as MOV sequences... maybe slightly slower fully cached but they'd get it back with reduced L1 instruction cache misses. I haven't done any exhaustive testing, however. Modern cpus can have so many instructions in-flight at once that simple non-dependent sequences such as PUSH/PUSH/PUSH or MOV/MOV/MOV are generally going to not bottleneck anything.

      -Matt

    105. Re:This isn't nearly as bad as the division bug by phantomfive · · Score: 1

      Can you even get an ARM chip with less than 1MB RAM anymore?

      --
      "First they came for the slanderers and i said nothing."
    106. Re:This isn't nearly as bad as the division bug by Anonymous Coward · · Score: 0

      I used to work for a guy who built servers out of whatever spare parts he had lying around - obsolete desktops, refurbs, ebay junk, whatever. For a while, we were spending at least 10-15 hours a week keeping those things up, or driving down to the datacenter to physically reboot them.

      There's a difference between "cheap crap" and truly broken crap. If you were spending 15 hours a week fixing the computer, your guy was using actual crap (that is, hardware which simply doesn't work and wasn't even suitable for desktops). "Cheap crap" a.k.a. basic desktop components, OTOH, are awesome value and usually pretty reliable. (Especially true nowadays; hardware has gotten just plain better (not just faster) in the last 20 years.)

      I'm glad you ended up being able to "boot them up and practically forget them" but with desktop components that's usually what happens too. I'm not putting down redundant power supplies (really!) and there are situations where you really need the possibility of a machine failing to be astronomically low, but for most servers it's ok if it crashes every ten years because of a bad power supply.

    107. Re:This isn't nearly as bad as the division bug by BitZtream · · Score: 0

      No, not really. It could result in 2.00000001 as the result, that's the way floating point works. It's different depending on each IMPLEMENTATION. AMD CPUs don't even return the same results as Intel CPUs. Two different Intel model lines (Pentium versus Core 2 for instance) may not even produce the same results.

      1.0 - 1.0 != 0.0 in floating point in almost any CPU I've ever worked on.

      --
      Persistent Volume manager for Kubernetes - https://github.com/dwimsey/openshift-pvmanager
    108. Re:This isn't nearly as bad as the division bug by X0563511 · · Score: 1

      Visual Basic? Where the hell did that come from?

      --
      For large sets, this will be our guide even unto death, for the LORD will work for each type of data it is applied to...
    109. Re:This isn't nearly as bad as the division bug by sjames · · Score: 1

      So, a stack corruption bug is one of those overlap cases. At least it signals its presence with an occasional crash.

    110. Re:This isn't nearly as bad as the division bug by Anonymous Coward · · Score: 0

      This was an awesome post. Thanks a lot for sharing!

    111. Re:This isn't nearly as bad as the division bug by X0563511 · · Score: 1

      Just make sure he doesn't get caught up thinking you are asking to divide 100 by 10.0000000000000235. After all, you said "10" specifically.

      --
      For large sets, this will be our guide even unto death, for the LORD will work for each type of data it is applied to...
    112. Re:This isn't nearly as bad as the division bug by sjames · · Score: 1

      Unfortunately, if they do that, the simulation won't be done until they're ready to decommission the bridge.

      That and they would then be depending on a less well tested library rather than the FPU.

    113. Re:This isn't nearly as bad as the division bug by phantomfive · · Score: 1

      Or any other kind of basic :)

      --
      "First they came for the slanderers and i said nothing."
    114. Re:This isn't nearly as bad as the division bug by sjames · · Score: 1

      *IF* the corruption follows a particular pattern reliably enough it could possibly be used as a remote execution bug, but TFA doesn't show if that is the case or not. It looks more like a possible DOS bug.

    115. Re:This isn't nearly as bad as the division bug by Anonymous Coward · · Score: 0

      If the visitors, with time, get more bandwidth and the size of pictures or videos are getting bigger, we can conclude that the solution scaled well.

    116. Re:This isn't nearly as bad as the division bug by hairyfeet · · Score: 1

      Blowdose? What are you, 9? And since you must be too young to remember your history: WinNT, which came out in 1993 (Man, has it been that long? Seems like yesterday I was supporting NT 3.x and NT 4), had the BSD network stack until MSFT wrote their own for Win2K, and BSD most certainly DID have socket connectivity, so you are beyond wrong and into completely off base. And if you are talking Win9x, as the Linux guys point out with their OS, it was just a DE over a separate core, in this case DOS, and one of the nice things about it was you could simply choose which portions you wanted to run. Just FYI, from Win95 SR2 one could use Winsock, which again was based on BSD and most certainly did have socket support. Kids these days with not knowing their history, geez.

      --
      ACs don't waste your time replying, your posts are never seen by me.
    117. Re:This isn't nearly as bad as the division bug by Anonymous Coward · · Score: 0

      >Diablotek PSU
      Enjoy your exploding servers.

    118. Re:This isn't nearly as bad as the division bug by JWSmythe · · Score: 3, Interesting

      It got more complex as it grew. This is all from memory, so there may be some inconsistencies with what I'm saying.

      The main site primarily used 3 subdomains. There were also 3 pay sites. For a while there were 6 machines for video streaming. The video streaming site went away, mostly due to cost vs income. We added free hosting, which was small to start, and picked up popularity and grew to dozens of servers. There were side projects launched on their own servers. Some worked. Some were dropped.

      Masterstats.com was an example of that. In function, it was similar to Google Analytics. It started as 1 DB and 2 web servers. Because of the load thrown at it, it became 3 DB, 4 web, and 2 offline calculation servers. The offline machines were just to process stats that were too intensive to do live, and created an undue load on the web servers.

      Another example was our backups. If you go digging through my journal posts, you'll find me talking about multi TB arrays, back when the largest drives on the market were 250GB. Back in the first iteration, it was fairly simple to keep backups. That grew as we kept throwing more into it.

      These are the rough counts just for the main sites.

      Before I took over, there were 3 subdomains (www, voy, and ww3.) Each was on one server. If one server failed, that part of the site stopped. Needless to say, that was bad if (and when) www. went down.

      4 servers in one city. All 4 machines had all 3 subdomains.

      8 servers in two cities. That's 4 machines in 2 cities. Each set could handle the full load, in the event we lost a city. That meant each city handled 50% of the traffic normally, or 100% in a city failure. We could operate on 2 per city, but the extra 2 provided for redundancy.

      When we scaled out to 3 cities, we split the 3 subdomains between groups of servers in each city. We also divided up the load equally, so each city got 33% of the traffic. Having a city fail increased the traffic to the other cities by 17%. The site's data existed on all the servers, so in the event of a failure of one server, we could distribute that load. So now we had 3 to 5 servers per group, in 3 cities. Newer hardware was faster, so those would have combined duties. Clearly a dual 1.4GHz machine could handle more than a single 350MHz machine. At this point, we had retired almost all of the 350MHz machines, except for a few that were recycled to do DNS and other low-load tasks.
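
      The 33% and 17% figures follow from splitting traffic evenly. A small sketch of that arithmetic, assuming an even split and a single city failing (share_per_city is a hypothetical helper, not anything from the actual setup):

```python
from fractions import Fraction

def share_per_city(cities: int) -> Fraction:
    """Fraction of total traffic each city carries under an even split."""
    return Fraction(1, cities)

normal_share = share_per_city(3)        # all three cities up
degraded_share = share_per_city(2)      # one city down
extra_share = degraded_share - normal_share

print(f"normal:   {float(normal_share) * 100:.1f}% of total traffic")
print(f"degraded: {float(degraded_share) * 100:.1f}% of total traffic")
print(f"increase: {float(extra_share) * 100:.1f} points of total traffic")
```

      Read this way, the 17% is measured against total site traffic: each surviving city goes from a third to a half of it.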

      When we got to 5 cities (New York, Los Angeles, San Diego, Tampa, and Frankfurt), the loads got divided up differently. Frankfurt had one 100Mb/s circuit. San Diego had one 100Mb/s circuit. New York had two GigE circuits. Los Angeles had 2 GigE circuits, which grew to 3 when we retired San Diego. Tampa had 2 GigE circuits. Frankfurt was retired, and the traffic was just added to New York. That barely made a blip on the bandwidth graphs.

      Members sites had a warm friendly map so they could pick their city to view from. We looked at other options, but it was simple to give them a map to choose where to serve from, and they could pick another city any time they wanted. Forums had people discussing which ones were "faster". It was very subjective. People in the West would pick Los Angeles or San Diego, with most preferring Los Angeles. People in the East picked Tampa or New York. People in Europe said Frankfurt was too slow, and they had better access via New York or Tampa.

      Each city had different load characteristics. Free hosting servers were deployed in 3 cities. There were some special case servers too. For example, someone who had a very high load "free hosting" site that made a lot of money via the AVS could get their own server or servers.

      Pet projects got their own servers as needed.

      I

      --
      Serious? Seriousness is well above my pay grade.
    119. Re:This isn't nearly as bad as the division bug by drerwk · · Score: 1

      Crash bugs are frustrating, but nowhere NEAR as scary as a bug that results in an incorrect but plausible computation. If the program crashes, you KNOW it crashed and you know the runs before that didn't crash are OK.

      Alleluia friend. The number of times management has suggested I just turn off assertions so a program will not crash is amazing to me. Long discussion ensues - but it only crashes if there is a bug in the code or my understanding of the code....but we don't want it to crash...but do you want it to work correctly????

    120. Re:This isn't nearly as bad as the division bug by Cute+Fuzzy+Bunny · · Score: 1

      You prefer AMD to Intel due to thermal protection? You do realize that Intel CPUs have always had thermal protection, while AMD didn't include it for many, many years to save money? The number of Intel CPUs that die from excess heat caused by the CPU is extremely low, while there have been thousands of AMD CPUs that have died from excess heat.

      It's one thing to grasp at weak features to outline a preference. It's another thing to claim a benefit where there isn't any.

      The cpu you had that 'melted' probably experienced user failure.

    121. Re:This isn't nearly as bad as the division bug by JWSmythe · · Score: 1

      Actually, thinking about it, I was off on the spec.

          512MB RAM (I think).
          one 8 GB drive for the OS, system logs, etc.
          one 80 GB drive for the site data.

          I can't be much more precise. It's been an awful long time. All those drives were destroyed long ago, and the parts given away or recycled.

      --
      Serious? Seriousness is well above my pay grade.
    122. Re:This isn't nearly as bad as the division bug by alanebro · · Score: 2

      He said he was trying to make, "the first MMORPG using Quick Basic."

      He didn't say he was trying to make, "the first MMORPG ever created, while also using Quick Basic."

    123. Re:This isn't nearly as bad as the division bug by DeadCatX2 · · Score: 1

      The answer to self-modifying code problems with the cache is to flush and invalidate the cache. I know that for a PowerPC architecture, if you modify an assembly instruction, you should use the following sequence of instructions to flush and invalidate.

      dcbf
      sync
      icbi
      isync

      --
      :(){ :|:& };:
    124. Re:This isn't nearly as bad as the division bug by Waffle+Iron · · Score: 1

      Not according to this:

      The number type represents real (double-precision floating-point) numbers. Lua has no integer type, as it does not need it. There is a widespread misconception about floating-point arithmetic errors and some people fear that even a simple increment can go weird with floating-point numbers. The fact is that, when you use a double to represent an integer, there is no rounding error at all (unless the number is greater than 100,000,000,000,000). Specifically, a Lua number can represent any long integer without rounding problems. Moreover, most modern CPUs do floating-point arithmetic as fast as (or even faster than) integer arithmetic.

      Perhaps you have a code snippet for a specific IEEE 754 machine that can prove otherwise?
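
      For reference, a sketch of where integer increments actually stop being exact, assuming IEEE 754 doubles as used by CPython's float. The boundary is 2**53 (about 9.0e15), so the 10**14 figure in the quoted text is on the conservative side:

```python
# Incrementing just past 10**14 is still exact in a double.
x = float(10 ** 14)
exact_near_1e14 = (x + 1.0 == 10 ** 14 + 1)
print(exact_near_1e14)

# At 2**53 the spacing between neighbouring doubles becomes 2,
# so adding 1.0 rounds straight back to the same value.
limit = float(2 ** 53)
increment_lost = (limit + 1.0 == limit)
print(increment_lost)

# One step below the limit everything is still exact.
exact_below_limit = (limit - 1.0 == 2 ** 53 - 1)
print(exact_below_limit)
```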

    125. Re:This isn't nearly as bad as the division bug by Anonymous Coward · · Score: 1

      No, not really. It could result in 2.00000001 as the result, that's the way floating point works. It's different depending on each IMPLEMENTATION.

      Division between powers of two should always result in an exact (correct) result. 4/2 should equal 2 no matter what implementation you are using.

      1.0 - 1.0 != 0.0 in floating point in almost any CPU I've ever worked on.

      I find that really hard to believe, too. The mantissas and exponents are identical, so the result should be exactly zero. Unless one of the numbers which you think is "1.0" was actually the result of floating-point computations that didn't quite result in 1.0, but that's a completely separate issue.
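
      A quick check of both claims, sketched in Python and assuming IEEE 754 doubles. The last part illustrates the "separate issue": a value that merely looks like 1.0 because it came out of earlier arithmetic:

```python
# Literal operands: both results are exact.
division_exact = (4.0 / 2.0 == 2.0)
difference_exact = (1.0 - 1.0 == 0.0)
print(division_exact, difference_exact)

# Division between powers of two only changes the exponent, so it stays exact.
powers_exact = all(
    (2.0 ** a) / (2.0 ** b) == 2.0 ** (a - b)
    for a in range(-20, 21)
    for b in range(-20, 21)
)
print(powers_exact)

# A "1.0" produced by adding 0.1 ten times is not quite 1.0.
almost_one = sum([0.1] * 10)
print(almost_one == 1.0, almost_one - 1.0)
```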

    126. Re:This isn't nearly as bad as the division bug by Anonymous Coward · · Score: 0

      Integer arithmetics can be done exactly with IEEE-754.

    127. Re:This isn't nearly as bad as the division bug by JWSmythe · · Score: 1

          Ya, I'm a huge fan of the Linux raids. I ended up going with Slackware64 for the host system, and running mdadm before the setup to get the arrays going. They've stood up better than all kinds of other RAID solutions. The nicer part has been, if the machine fails, I can just stick the drives in any other machine, and bring the array up. It hasn't happened to me, but it has happened to friends on "server" hardware. They were very thankful when I could just stick the drives in another server, and go right back to work.

          I should be purchasing 2 more of these machines in the next month or so, personal funds permitting. I broke a tooth, and my work doesn't offer dental insurance any more, so fixing the one tooth will probably cost as much as the two servers. When they get here, I should have more time to experiment with other options. These first two had to be done quick. I ordered the line when my old server (on "server" hardware) started taking a dump. I ordered the hardware once they got the install done right, I got the hardware shipped. It took them a while to figure out how to provision more than one IP to me. {sigh}. The day the hardware arrived, my old hardware crashed 4 times during the install. well, not "crash". It just hung up requiring a hard reboot each time. It was well beyond EOL, so I just attribute it to being ancient. I'll have time to experiment with the next two, and migrate things over if I choose to use a different platform.

         

      --
      Serious? Seriousness is well above my pay grade.
    128. Re:This isn't nearly as bad as the division bug by Anonymous Coward · · Score: 0

      I think you missed his point. The difference for two servers is 2 x $380, or $760. With $760, he can build a third server with money left over.

    129. Re:This isn't nearly as bad as the division bug by TheRealMindChild · · Score: 1

      I spent a summer of working and saving to buy myself a copy of PDS (QuickBasic) 7.1. Back then, I preferred it, as I previously grew with GWBasic->QBasic->QuickBasic. Severely limited, I wouldn't call it. Its graphic routines were amazingly simple and were taken care of by the runtime. The floating point on it, while slow, didn't suffer from what C/C++ and the like had with inaccurate representations. With the right compiler settings and good structure, your executable could be "C/C++ fast", or likely came close enough. Also, you can take a C/C++ .lib file and convert it to a .qlb file, which imported the library's functions right into the project. And ultimately, if you REALLY needed to get nitty gritty, you could always use CallAbsolute() to call in-memory segments of code (granted, you had to compile asm outside of the project, but the fact that a BASIC could do such was good enough for me). The only real thing I longed for was an actual class type definition.

      --

      "When life gives you lemons, don't make lemonade. Make life take the lemons back!" -- Cave Johnson
    130. Re:This isn't nearly as bad as the division bug by JWSmythe · · Score: 1

          Really? Are they bad? I hadn't used them before, but the price was right. At least it's commodity equipment, I can go to the store and pick up a new PS and be back up quick. Well, more like I'd jack the PS out of my desktop to bring the server back up, and then go buy a new PS for my desktop. :)

      --
      Serious? Seriousness is well above my pay grade.
    131. Re:This isn't nearly as bad as the division bug by TheRealMindChild · · Score: 2

      Yes really. From http://en.wikipedia.org/wiki/IEEE_754:

      Finite numbers, which may be either base 2 (binary) or base 10 (decimal). Each finite number is described by three integers: s = a sign (zero or one), c = a significand (or 'coefficient'), q = an exponent. The numerical value of a finite number is (-1)^s × c × b^q
      where b is the base (2 or 10). For example, if the sign is 1 (indicating negative), the significand is 12345, the exponent is -3, and the base is 10, then the value of the number is -12.345.


      The coefficient can always represent a whole integer without rounding errors.
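
      A sketch of the IEEE 754 value formula, value = (-1)^s × c × b^q, using exact rational arithmetic so nothing gets rounded along the way. The function name finite_value is made up for illustration:

```python
from fractions import Fraction

def finite_value(s: int, c: int, q: int, b: int) -> Fraction:
    """Exact value of a finite number: (-1)^s * c * b^q."""
    return Fraction(-1) ** s * c * Fraction(b) ** q

# Decimal example: sign 1, significand 12345, exponent -3, base 10.
decimal_example = finite_value(1, 12345, -3, 10)
print(float(decimal_example))

# With exponent 0 the significand is the value, so whole integers are exact.
integer_example = finite_value(0, 12345, 0, 2)
print(integer_example)

# Binary example: 5 * 2^-1.
binary_example = finite_value(0, 5, -1, 2)
print(float(binary_example))
```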

      --

      "When life gives you lemons, don't make lemonade. Make life take the lemons back!" -- Cave Johnson
    132. Re:This isn't nearly as bad as the division bug by Rockoon · · Score: 1

      In two cases, the "undocumented" opcodes were really a normal opcode with a "hidden" operand .. "hidden" in the sense that an assembler (such as MASM) normally didn't give you the option to set the operand and filled it in for you (in these cases, the value '10' was forced, as the instructions were typically used for operations on binary coded decimals)

      I cut my first x86 assembler teeth on the 80386, and even at that time these "undocumented" instructions were well known.

      --
      "His name was James Damore."
    133. Re:This isn't nearly as bad as the division bug by enrevanche · · Score: 1

      While this is true, it is only useful in trivial cases. The point is that when using floating point to calculate things, in practice you can rarely assume exact results, only close ones. That it is exact sometimes is irrelevant. For example with a valid value x, (1.0/x)*x will not result in exactly 1.0 in many cases. The situation gets worse with more complex calculations.
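
      A small experiment along those lines, assuming IEEE 754 doubles: scan the integers 1 through 100 and collect the values of x for which (1.0/x)*x does not come back as exactly 1.0.

```python
# Values of x for which the round trip through 1.0/x loses exactness.
failures = [x for x in range(1, 101) if (1.0 / x) * x != 1.0]
print(failures)

# Powers of two never fail, since 1.0/x is exactly representable.
powers_of_two_ok = all((1.0 / 2 ** k) * 2 ** k == 1.0 for k in range(50))
print(powers_of_two_ok)
```

      The list is short but not empty, which is the point: exactness holds often enough to be tempting and fails often enough to bite.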

    134. Re:This isn't nearly as bad as the division bug by Onymous+Coward · · Score: 1

      Idiocy is when you don't realize 6 GHz is beyond some kind of physical limit for microprocessor clock speeds?

      And yet stating that as if it were common knowledge, not explaining it for the public benefit, and just being abusive are, I assume, somehow supposed to be a smart practice?

    135. Re:This isn't nearly as bad as the division bug by TD-Linux · · Score: 1

      On an embedded system with just a few kbytes of memory, like say an ARM-powered gadget, self-modifying code is still relevant, even in 2012.

      I work on embedded systems and can tell you that this is not the case. Generally on a system that is very limited in RAM (talking in the range 128 bytes to 128KB), the programs are stored and executed on flash ROM and are never copied into RAM, preventing the use of self-modifying code. Even on a system that could, you never write self-modifying code. The tiny performance gains you might see just aren't worth the incredible pain and risk. The exception being JIT compilers, which usually only run on machines with a lot of RAM anyway.

    136. Re:This isn't nearly as bad as the division bug by TD-Linux · · Score: 1

      Can you even get an ARM chip with less than 1MB RAM anymore?

      Quite easily. Cortex-M series microcontrollers often have 8KB of RAM or less. See the STM32 for an example. NXP also has some extremely small Cortex-M0 chips.

    137. Re:This isn't nearly as bad as the division bug by Forever+Wondering · · Score: 2

      The workaround for this bug might be that the compiler has to put a nop at the start of all function epilogs

      More likely it'll be "fixed" in microcode and made available in a firmware update by motherboard manufacturers. It's a fairly small subset of all available CPUs.

      I know that with modern Intel CPU chips, some have a processor microcode patch capability. But, this must be reloaded on every reboot, and [usually] it's done by the OS [from disk] in direct communication with the CPU. The BIOS (and thus the motherboard vendor) is not involved. Under Linux, the utility is /sbin/microcode_ctl and the patch file is /etc/microcode.dat

      When I did my first post, I was not aware that AMD had also implemented an update capability but I just checked the kernel source and it has references to AMD having it as well. So, yes, you're right. That is the preferred way to handle the problem.

      But, Intel/AMD processors are not fully microprogrammable the way IBM mainframe architectures are. So, it also depends upon whether the bug can be fixed by an update.

      --
      Like a good neighbor, fsck is there ...
    138. Re:This isn't nearly as bad as the division bug by Galactic+Dominator · · Score: 1

      You sound like a person who should subscribe to Dillon's mailing list.

      Save yourself some money:

      http://leaf.dragonflybsd.org/mailarchive/users/2011-05/msg00063.html

      --
      brandelf -t FreeBSD /brain
    139. Re:This isn't nearly as bad as the division bug by Anonymous Coward · · Score: 0

      Just saying. But there are EMS/XMS libraries and TCP/IP libraries for Qbasic 4.5. http://www.uncreativelabs.net/programming/tcpip.php http://www.phatcode.net/articles.php?id=155&action=print

    140. Re:This isn't nearly as bad as the division bug by idontgno · · Score: 1

      Not all numbers can be accurately converted to the form; +/- (1 [+ 1/2] [+ 1/4] [+ 1/8] .... ) * 2^n

      Not literally true. All you need is the ability for n to be infinite... in other words, arbitrary precision.

      Of course, (A) the same can be said of decimal, since the decimal expansion of a lot (an infinite number) of rational numbers are also non-terminating; and (B) the idea of infinite precision is pretty much not subject to real-world (algorithmic) implementation.
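
      To see what a finite binary expansion does to a decimal fraction, here is a sketch using Python's fractions module, which can recover the exact value a double actually stores (is_power_of_two is a hypothetical helper):

```python
from fractions import Fraction

def is_power_of_two(n: int) -> bool:
    return n > 0 and n & (n - 1) == 0

half = Fraction(0.5)     # exact value stored for 0.5
tenth = Fraction(0.1)    # exact value stored for 0.1

print(half == Fraction(1, 2))     # 1/2 is a finite sum of powers of two
print(tenth == Fraction(1, 10))   # 1/10 is not, so the stored value differs
print(tenth)

# Whatever is stored always has a power-of-two denominator.
print(is_power_of_two(tenth.denominator))
```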

      --
      Welcome to the Panopticon. Used to be a prison, now it's your home.
    141. Re:This isn't nearly as bad as the division bug by Score+Whore · · Score: 1

      I enjoy reading about Google's data center operations and architecture. But I know that, for the most part, it's useless for a typical enterprise's data center. Why? Because the vast majority of Google's computing capability doesn't actually have a "correct" answer to give. It doesn't actually matter if your search results are delivered from an index that is missing 2% of indexed pages.

      On the other hand, I bet their payroll system runs on bog standard enterprise class equipment purchased from one of the big vendors (IBM, Oracle, HP, or Dell.) And they are running JD Edwards or something similar.

    142. Re:This isn't nearly as bad as the division bug by Anonymous Coward · · Score: 0

      Way to both go overboard and POS on the PSU. If anything fails...it'll be that and probably take your system with it

    143. Re:This isn't nearly as bad as the division bug by LocalH · · Score: 2

      Actually, I think you're wrong, based on this:

      I was partially right in thinking there is only enough room for one MMORPG at a time back in 97

      Maybe one "modern" MMO, but I'm pretty sure there were several MUDs and MUSHs online at the time.

      --
      FC Closer
    144. Re:This isn't nearly as bad as the division bug by ThurstonMoore · · Score: 1

      Did you really put $25 power supplies in your servers?

    145. Re:This isn't nearly as bad as the division bug by BitZtream · · Score: 0

      especially with 256-byte and 4096-byte competitions, but that is exclusively an academic exercise.

      Naw, at those sizes, it's not about the academic exercise anymore, it's just pure art at this point.

      I mean the code itself, not the output, but doing anything particularly impressive in those limits makes it art to me.

      --
      Persistent Volume manager for Kubernetes - https://github.com/dwimsey/openshift-pvmanager
    146. Re:This isn't nearly as bad as the division bug by Anonymous Coward · · Score: 0

      heh, some people try to eek out those extra few MHz on some part and wind up introducing additional wait-states. AMD is usually good about offering parts that are a true multiple of commonly available memory speeds.

      Uh, you might want to update your knowledge a bit. That is solidly in the territory of "yesteryear's problem". Neither AMD nor Intel cares about keeping CPU and memory clocks at an integer ratio any more.

      In fact, AMD's CPUs these days run the integrated northbridge at a different frequency from the CPU core, and it's almost never a nice clean multiple. On top of that, the northbridge is usually 2.0 or 2.2 GHz, which isn't particularly friendly to lots of standard memory frequencies. So, modern AMD processors probably have two async domain crossings between the CPU core and DDR3.

      It just doesn't matter, though. From the perspective of a CPU core at 3+ GHz, DDR3 (any speed) is going to take upwards of 150 cycles to return data, approaching 200 as you creep up towards 4 GHz. The tiny latency penalty for crossing clock domains is lost in the noise of 150-200 cycles. It's not like the old days when eliminating it could remove a substantial amount of the total latency.
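
      The cycle counts are just latency multiplied by clock rate. A sketch of that arithmetic, assuming a round 50 ns DRAM latency and a 4-cycle clock-domain crossing penalty; both numbers are hypothetical and chosen only to illustrate the scale:

```python
def latency_in_cycles(latency_ns: float, core_ghz: float) -> float:
    """Core cycles elapsed = nanoseconds * cycles per nanosecond (GHz)."""
    return latency_ns * core_ghz

ASSUMED_DRAM_LATENCY_NS = 50.0   # hypothetical round number
ASSUMED_CROSSING_CYCLES = 4.0    # hypothetical crossing penalty

cycles_at_3ghz = latency_in_cycles(ASSUMED_DRAM_LATENCY_NS, 3.0)
cycles_at_4ghz = latency_in_cycles(ASSUMED_DRAM_LATENCY_NS, 4.0)
crossing_share = ASSUMED_CROSSING_CYCLES / cycles_at_3ghz

print(cycles_at_3ghz, cycles_at_4ghz)
print(f"crossing penalty is about {crossing_share:.1%} of the total at 3 GHz")
```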

    147. Re:This isn't nearly as bad as the division bug by _0xd0ad · · Score: 1

      Multiplication and division are actually pretty accurate - it's addition and subtraction that you really have to watch out for.
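
      A sketch of that difference, assuming IEEE 754 doubles with round-to-nearest. A single multiplication is off by at most half a unit in the last place, while addition can swallow a small operand and subtraction can expose earlier rounding:

```python
from fractions import Fraction

# Multiplication: relative error of one operation stays below 2**-53.
x, y = 0.1, 3.0
exact_product = Fraction(x) * Fraction(y)
mul_rel_err = abs(Fraction(x * y) - exact_product) / exact_product
print(mul_rel_err <= Fraction(1, 2 ** 53))

# Addition: 1.0 is absorbed completely next to 1e16.
big = 1e16
absorbed = (big + 1.0) - big
print(absorbed)

# Subtraction: cancelling the leading digits exposes the rounding of the sum.
tiny = 1e-15
recovered = (1.0 + tiny) - 1.0
cancel_rel_err = abs(recovered - tiny) / tiny
print(cancel_rel_err)
```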

    148. Re:This isn't nearly as bad as the division bug by Mr+Z · · Score: 1

      Then there's unintended features such as pipeline oddities. If you have self modifying code, and it changes the destination of a jump instruction immediately before executing it, the computer will jump to the old address.

      Long before CPUID, something like this was used to detect whether you were running an 8088 or an 8086. If I recall correctly, the BIU would fetch up to 3 "bus widths" ahead, which on the 8088 is 3 bytes, but on the 8086 is 6 bytes. If you modified some code in that 3 byte window of difference, the 8088 would see it but the 8086 would not. A branch, though, would flush the BIU.

    149. Re:This isn't nearly as bad as the division bug by AlienIntelligence · · Score: 1

      Another thing I like about AMD was that their processors don't melt if they get hot because they have a self preservation shutdown mode. People said Intel had this, but I melted a processor just a few months ago on SWTOR.

      Sorry, without you specifying the Intel processor, I call BS on that.

      If it was one of the modern chips on the 32nm or less die, it didn't happen.

      The Intel thermal protection works quite well... to the point that if the stupid
      holddowns don't do the trick on the stock cooler and your chip is just sitting
      there with no cooler... it won't kill itself.

      -AI

      --
      For me, it is far better to grasp the Universe as it really is than to persist in delusion
    150. Re:This isn't nearly as bad as the division bug by AlienIntelligence · · Score: 1

      But yeah, I'd not expect teenagers to know about it, even in the already restricted demographics of "people who at least know how to use basic QBASIC". I do recall I was really excited when I figured out state machines when I was 12 years old, while reading Z80 assembly to write trainers for Konami MSX games. I learned event-driven programming and state machines by partially reverse engineering Konami games in order to find the place where lives/ammo/health was decreased :-)

      No shit, I asked a neighbor "kid" about to graduate HS about
      Pythagoras' Theorem and she was like, what is that?

      Wow...

      just wow...

      -AI

      --
      For me, it is far better to grasp the Universe as it really is than to persist in delusion
    151. Re:This isn't nearly as bad as the division bug by AlienIntelligence · · Score: 1

      I hope Tom's hardware wasn't recommending the 2500K or 2600K for virtualization. Those models are unlocked for over clocking but have the virtualization extensions disabled. I just ran into that when evaluating processors for running vmware.

      Link? Not doubting, but I just got an i7 the other day, love it so much, was thinking of building
      a low cost server with an i5. But now?

      Googled, couldn't find anything that mentioned the VT disabling.

      -AI

      --
      For me, it is far better to grasp the Universe as it really is than to persist in delusion
    152. Re:This isn't nearly as bad as the division bug by AlienIntelligence · · Score: 1

      Another thing I like about AMD was that their processors don't melt if they get hot because they have a self preservation shutdown mode. People said Intel had this, but I melted a processor just a few months ago on SWTOR.

      Video of Intel and AMD processors when heatsink removed (Tom's hardware vid at youtube)

      Filmed with a potato?

      -AI

      lol, gave me fond memories of when we were testing bias currents in digital class to show where thermal meltdown occurred.

      --
      For me, it is far better to grasp the Universe as it really is than to persist in delusion
    153. Re:This isn't nearly as bad as the division bug by eoinatstraylight · · Score: 1

      Just because you have two systems which conform to IEEE 754 doesn't guarantee you'll get the same answer for every floating point operation. You're easily able to produce different answers depending on the width of the registers in the FPU. Take a single-precision device which has a 64-bit internal floating point register vs one with a 32-bit one, both conform to 754, however due to rounding the one with a 32-bit register can provide a subtly different answer. The book Accuracy and Stability in Numerical Algorithms is pretty good on this.

    154. Re:This isn't nearly as bad as the division bug by dubbreak · · Score: 1

      Just check Intel's site and the specs for each chip. I.e. 2500 vs 2500K.

      --
      "If you are going through hell, keep going." - Winston Churchill
    155. Re:This isn't nearly as bad as the division bug by dubbreak · · Score: 1

      OK, had a spare second. Both the k and non-k models have VT-x but only the non-K has VT-d.

      So it depends on whether you need I/O MMU virtualization or not.

      --
      "If you are going through hell, keep going." - Winston Churchill
    156. Re:This isn't nearly as bad as the division bug by dalias · · Score: 1

      In any case where there's a bug that causes modification to the stack in ways that could affect the return address, in the absence of proof that it cannot happen, you must consider this a full code execution vulnerability.

    157. Re:This isn't nearly as bad as the division bug by dalias · · Score: 1

      This is a C issue, not an IEEE 754 one; C allows extra precision for intermediate results. If however you put a cast on every intermediate result back to the correct nominal type, and if your compiler respects these casts as required by the C standard (unfortunately, gcc does not...), then results are completely well-defined for arithmetic and non-transcendental functions.
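
      A minimal sketch of the cast-every-intermediate idea, in C. The function name sum_then_sub is hypothetical, and the volatile store is an extra assumption added here for compilers that do not honor the cast on their own; it is not something the comment above prescribes.

```c
/* Sketch: force an intermediate result back to its nominal type.
 * sum_then_sub is a hypothetical name used only for illustration. */
static float sum_then_sub(float a, float b)
{
    /* The cast asks for rounding to float, as the C standard requires.
     * The volatile store is a belt-and-braces assumption: it forces the
     * value through a real float object even if the cast is ignored. */
    volatile float t = (float)(a + b);

    /* With t rounded to float, this subtraction no longer depends on
     * whether the hardware evaluated a + b in a wider register. */
    return t - a;
}
```

      With a = 16777216.0f (2^24) and b = 1.0f, the sum is not representable as a float, so the rounded intermediate equals a and the function returns 0. An implementation that kept the sum in a wider register without rounding could return 1 instead.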

    158. Re:This isn't nearly as bad as the division bug by dave87656 · · Score: 1

      That's a floating point issue and it's because you can't directly represent every floating point number in binary. Java, for example, provides the BigDecimal package for absolute precision but it is a huge PIA. I assume C++, .NET and so on also offer decimal alternatives to floating point. One unfortunate hole is the inability of some databases to provide variable precision. If you sell things of various units of measure (pieces, weight, length) you need variable precision to deal with that. The only thing you can do with databases like MySQL is to define an arbitrarily large precision and size. PostgreSQL does allow you to specify a decimal value which can vary by row.

    159. Re:This isn't nearly as bad as the division bug by lsatenstein · · Score: 1

      A CPU vendor will have a backdoor capability to update the CPU. Intel does it with all their processors, and normally the BIOS vendors insert the corrective code into the CPU at boot time. AMD will have a BIOS update or a Windows/Linux/Mac/whatever patch forthcoming. Otherwise they will have to ask compiler writers to insert a no-operation instruction to resolve the problem, which is probably due to timing.

      Failing that kind of update, the alternative is to replace the CPU. CPUs are not normally soldered to motherboards.

      --
      Leslie Satenstein Montreal Quebec Canada
    160. Re:This isn't nearly as bad as the division bug by tzot · · Score: 1

      The thing is: you replied "Um, yes it is", while you meant "Um, sometimes it is".

      --
      I speak England very best
  2. He sounds wicked smart. by Zorque · · Score: 1

    I wonder if AMD likes apples.

    1. Re:He sounds wicked smart. by Anonymous Coward · · Score: 0

      I'm inclined to say AMD doesn't need a hit to its rep right now. I don't like AMD, but Intel needs competition pronto, and I'd prefer the lesser of two evils.

    2. Re:He sounds wicked smart. by Anonymous Coward · · Score: 0

      You're thinking of the wrong Matt.

    3. Re:He sounds wicked smart. by Anonymous Coward · · Score: 1

      Matt has been a personal hero of mine since he wrote the DICE compiler for the Amiga (late 80's, 90's). He produced so much code back then, that I didn't really think it was a single person producing it all. I think he did some RF work with the 68000 back then too. Really talented person!

    4. Re:He sounds wicked smart. by utoddl · · Score: 1

      I was a big fan of the text editor "DME" that he wrote and maintained in those days. Good to know he's still kickin' bits.

    5. Re:He sounds wicked smart. by Zorque · · Score: 1

      Oh, really? Obviously I'd assumed the actor had taken some time off between making millions of dollars on screen to maintain a BSD distro.

  3. Microcode patch by zonker · · Score: 0

    I assume they'll be able to fix it via a microcode patch. Intel had to learn that the hard way...

    1. Re:Microcode patch by Omnifarious · · Score: 3, Informative

      I'm wondering if they will. This seems like a very odd timing issue that may be a problem in the electronics. Of course, I suppose they could just put in some microcode to wait after certain operations to make sure things settle and so avoid the hardware bug.

    2. Re:Microcode patch by Anonymous Coward · · Score: 1

      Interestingly enough, after mentioning this to my dad, his reply was 'It sounds like crosstalk in the decoder logic' (this may be slightly inaccurate since my memory is lousy). So hopefully it's microcode fixable, but given how long it's taken to track down, I assume it hasn't bitten nearly as many people as it should have. Although I have had a number of quirky crashes running an overclocked Sempron that sound very similar to what this was doing.

    3. Re:Microcode patch by lightknight · · Score: 1

      Indeed. I instantly thought of the microcode update as well.

      Their other options are to do a processor recall (like Intel + the infamous Pentium bug), or inform the compiler manufacturers that 'there be the dragons' (special case inserted into the code for the affected processor / architecture to bypass the affliction).

      --
      I am John Hurt.
    4. Re:Microcode patch by dargaud · · Score: 1

      So hopefully it's microcode fixable

      Another even easier fix could be to add a (3000 thousandth) compiler option to insert a NOP or two before the RET.

      --
      Non-Linux Penguins ?
    5. Re:Microcode patch by TheRaven64 · · Score: 1

      Okay, you do that in your compiler first, and then we'll add more binary size and i-cache usage benchmarks to our marketing materials...

      --
      I am TheRaven on Soylent News
    6. Re:Microcode patch by ratboy666 · · Score: 1

      It affects the Opteron -- used in, for example, the SUN X series servers (Dell and HP also use it).

      So, these are in a lot of "mission critical" financial and other applications. The preferred fix is a microcode update. A recall would be an alternative, but the costs of that would be enormous. And AMD would be expected to cover it.

      If a suitable microcode update cannot be pushed, I would short AMD. This has the potential to kill the company, especially considering the other problems they are having now.

      I don't believe that a compiler modification would be a suitable fix. That would require building and QA'ing the entire software stack (as it is a "software change"). A very expensive proposition. My recommendation would be to lean on AMD, and get a "hardware change", either as a microcode patch or a new processor. That way, the bulk of the expense gets pushed back to AMD.

      --
      Just another "Cubible(sic) Joe" 2 17 3061
    7. Re:Microcode patch by Mr+Z · · Score: 1

      Given the range of clock speeds and devices (a 1.9GHz Opteron and a 3GHz Phenom II), that seems unlikely to me. If I'm not mistaken, they're even in different technology nodes (65nm vs. 45nm). So, an electrical issue seems unlikely to me. It seems more likely to me that it's a return-stack related logic issue, since it affects a deeply recursive function.

      Since it happens only when the system is under load, I'm guessing you need a well timed cache miss or TLB miss to make it happen. The NOP that "fixes things" may be forcing some serialization which prevents the code above it from interfering with the function epilog.

  4. cool, but...? by bcrowell · · Score: 1

    This is cool, but...?

    Why does it matter that it's the lead developer of DragonflyBSD?

    1. Re:cool, but...? by drobety · · Score: 2

      I suppose to be sure he is not confused with the other Matt Dillon.

    2. Re:cool, but...? by icebike · · Score: 1

      Because if it were Joe Random Programmer AMD would not have even listened to him?

      --
      Sig Battery depleted. Reverting to safe mode.
    3. Re:cool, but...? by Provocateur · · Score: 1

      Because people read the name and they think it's Matt "There's Something About Mary", "Wild Things" oh ya with the two babes at once Dillon.

      Same reason they have to specify millionaire-playboy Bruce Wayne.

      --
      WARNING: Smartphones have side effects--most of them undocumented.
    4. Re:cool, but...? by ffflala · · Score: 3, Insightful

      It matters because it's impressive. It also seems fair to associate some of the positive impression with DragonflyBSD, and I cannot see any downside to throwing good PR at any BSD flavor.

    5. Re:cool, but...? by wrook · · Score: 5, Interesting

      Matt Dillon is a rather famous programmer (as programmers go). I assume that's why they mention him by name. I think a very large percentage of old Amiga hackers know who he is. He's also done work on the Linux kernel. Despite all that, he's best known for his work on FreeBSD and on his DragonflyBSD project. While a lot of old timers will know that, not everyone else will.

    6. Re:cool, but...? by phantomfive · · Score: 2

      Because now we know not only that something cool was done, but also who did it. Both are relevant.

      --
      "First they came for the slanderers and i said nothing."
    7. Re:cool, but...? by maroberts · · Score: 1

      This is cool, but...?

      Why does it matter that it's the lead developer of DragonflyBSD?

      Because they want to ensure that the world famous programmer and developer is not confused with the little known actor with a similar name.

      --

      Donte Alistair Anderson Roberts - hi son!
      Karma: Chameleon

    8. Re:cool, but...? by MichaelSmith · · Score: 1

      Why does it matter that it's the lead developer of DragonflyBSD?

      As a kernel developer he works on code which manipulates CPUs at a low level. That's why he found the bug.

    9. Re:cool, but...? by jones_supa · · Score: 1

      Why does it matter that it's the lead developer of DragonflyBSD?

      It's nice to mention what the guy is known for. Besides, the bug came up when he was tinkering with DragonFly BSD.

      If we were completely pedantic, it ultimately does not matter. Anyone interested in computers and programming with enough talent could have found the bug. But yeah.

    10. Re:cool, but...? by Anonymous Coward · · Score: 0

      Yes. Matt is a well known code hacker in the Amiga arena, and we respected the hell out of him while he was cranking the hell out of Amiga code while attending UC Berkeley. Everyone shut up and listened when Matt spoke up in the BIX (Byte Information Exchange) forums, and when he went on to work on BSD, we all knew he'd shine in that arena as well. Rock on, Matt.

       

    11. Re:cool, but...? by Anonymous Coward · · Score: 0

      Thanks from all the old-timers :) Good work, Mr. Dillon!

      -- amigamouse

    12. Re:cool, but...? by m.dillon · · Score: 4, Funny

      What's really amusing is that I've been on the scene for so long if you google my name 'Matthew Dillon', the first entry is actually... me! And not the actor(s). I'm sure that grinds a bit but I do bask in the occasional fan mail reaching my inbox, just before I hit the 'delete' key.

      In recent years it's started to flip back and forth, and I expect Hollywood will again take over the top spot after things die down again :-)

      -Matt

    13. Re:cool, but...? by Anonymous Coward · · Score: 0

      What I know him for was his awesome contribution to the Amiga scene.... All this BSD stuff is relatively recent :-)

  5. Nice work by gkndivebum · · Score: 1

    Nice work tracking that one down. It must have been very frustrating - what we used to call a "ring-tailed b1tch"

    --
    Breathe continuously
  6. Re:another horrible cpu bug by Taco+Cowboy · · Score: 4, Insightful

    What has Taiwan got to do with this ?

    I mean, was the CPU bug somehow introduced by TSMC ?

    --
    Muchas Gracias, Señor Edward Snowden !
  7. hidden bugs by AbrasiveCat · · Score: 1

    After reading the links, and knowing how much time I have spent trying to find out why something doesn't work right, I think I understand why he is so stoked at finding the root of the problem. Good for Matt, maybe they will send you a fixed processor someday.

  8. you are mistaken by Anonymous Coward · · Score: 0

    You were not hitting the division bug, it happens only with certain combinations of numbers and is quite rare.
    You probably just had the rounding mode set wrong.

    1. Re:you are mistaken by Sir_Sri · · Score: 4, Informative

      A floating point precision error. Floating points cannot represent quite a diverse collection of numbers, this is especially problematic when you're doing intersections with small objects. Say a ray projected from an object will, because of the minute errors in floating point, collide with the same object (which produces some cool patterns).

      Floating points are kind of crappy. Not that I have a better option with viable performance on a desktop machine. That's not a division bug, that's just the nature of representing numbers in binary with a fixed number of bits.

    2. Re:you are mistaken by Anonymous Coward · · Score: 0

      Floating points are kind of crappy. Not that I have a better option with viable performance on a desktop machine.

      How about fixpoint or even integer?
      512bit calculations aren't that expensive and should deal with a reasonably sized universe in planck-length precision.

    3. Re:you are mistaken by Rockoon · · Score: 3, Insightful

      512bit calculations aren't that expensive

      Yes they are.

      --
      "His name was James Damore."
    4. Re:you are mistaken by Anonymous Coward · · Score: 0

      A floating point precision error. Floating points cannot represent quite a diverse collection of numbers, this is especially problematic when you're doing intersections with small objects. Say a ray projected from an object will, because of the minute errors in floating point, collide with the same object (which produces some cool patterns).
      Floating points are kind of crappy. Not that I have a better option with viable performance on a desktop machine.

      Back when the Intel bug was an issue, you would have been better off using fixed point math, which is what most of the top performing games and raytracing programs were using.

      These days, most of those calculations are going to be handled on a GPU which should be much more capable of dealing with the levels of precision you require.

    5. Re:you are mistaken by neokushan · · Score: 4, Insightful

      Except I very much doubt that would solve whatever "problems" this guy was having. As a newbie programmer, it's entirely understandable that he wouldn't know about the fun you can (or can't) have with floating point operations. However, I very much doubt that sheer accuracy was the issue, rather he was probably making assumptions such as 1.0 - 1.0 == 0.0, when in reality the result isn't necessarily exactly 0.0. Considering it's an MMO, he probably had something like "Why is this guy not dying, he has 4 HP left and this attack does exactly 4 damage? Must be a bug!".
      Really, it doesn't matter a huge amount, if such "accuracy" is important to your game then instead of doing "if(Health is less than 0.0) /* die */", you do something like "if (Health is less than 0.0 + epsilon) /* die */", with "epsilon" being a very small number (such as 0.00000001).
      The real fun with floats, however, is that each platform does something different. It's possible that the OP ran the game on Intel hardware and got one result (which may have seemed more "correct"), then ran it on an AMD machine and got a different (seemingly less-correct) result - you can see why he naturally jumped to the conclusion that the AMD system had a bug.
      In reality, chances are both systems were "wrong" anyway, they just happen to use different implementations for floating-point logic. To solve this, once again higher rates of calculations aren't the answer, but rather there's a compiler switch (/fp:strict in VS) that will use the ISO standard floating point model. It's not as fast as the other methods, but you will at least get the same results across different platforms (assuming that CPU has implemented the standard correctly which these days is almost certain).

      There's LOTS of fantastic info on this here: http://gafferongames.com/networking-for-game-programmers/floating-point-determinism/
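
      The epsilon comparison described above could be sketched in C roughly as follows. The helper names nearly_equal and is_dead are hypothetical, and the tolerance is an assumption that depends on the scale of the values being compared; treat this as an illustration rather than a general-purpose comparison routine.

```c
#include <stdbool.h>

/* Hypothetical helper: true when a and b differ by at most eps.
 * A fixed absolute tolerance only suits values of a known, modest scale. */
static bool nearly_equal(double a, double b, double eps)
{
    double d = a - b;
    if (d < 0.0)
        d = -d;            /* absolute difference, without needing libm */
    return d <= eps;
}

/* Hypothetical helper mirroring "Health is less than 0.0 + epsilon". */
static bool is_dead(double health, double eps)
{
    return health < 0.0 + eps;
}
```

      For example, is_dead(1e-17, 1e-8) is true even though the health value is not exactly zero, and nearly_equal(0.1 + 0.2, 0.3, 1e-9) is true even though the two sides are not bit-identical as doubles.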

      --
      +1 IDisagreeSoHeMustBeATrollOrAnAstroturferOrAShill
    6. Re:you are mistaken by cnettel · · Score: 1

      The same binary on the same platform (x86 in this case) will render the same results. That's part of the contract of the x86 ISA. However, unless you're using /fp:strict or equivalent in your compiler, just recompiling with ANY kind of changes can alter behavior.

    7. Re:you are mistaken by Mr+Z · · Score: 4, Informative

      I'm pretty sure it was with the introduction of the Pentium (which had the famous FDIV bug) that John Carmack officially made the switch to single precision FP for most things because it was finally fast enough. FP wasn't cheap, per se, but the simplification it brings over keeping track of binary points and precision/range tradeoffs in integerized algorithms should not be underestimated either.

      For example, if I want to do a floating point multiply and add, I just say: f3 = f0 * f1 + f2. Before I even start writing a fixed-point multiply and add, I need to ask what the Q points (binary points) are for each of the terms, what Q point you'd like for the result, and what sort of rounding (if any) the result requires for stability. You can end up with a monstrosity like this, assuming all four numbers are at the same Q point:

      x3 = (int)(((long long)x0 * x1 + (1LL > Q) + x2;

      Ok, maybe you hide that behind a macro, but what about cases where some of the terms are at different Q points? A fully general macro (which is no fun to write, BTW) would also have a ton of arguments, and only reduce you to something like x3 = FXMULADD(x0, Q0, x1, Q1, x2, Q2, Q3); which won't win you any awards in the clarity department.

      And look at the operations themselves, too. You have type promotion, extra adds and shifts... the instruction sequence itself isn't super efficient. It pays off when floating point takes 10s and 100s of cycles, but is a dubious win when most of the core FP starts coming down into the single digits. With the Pentium's dual pipes and the fact you could keep integer instructions flowing in parallel to the float, that's effectively what happened. And notice we haven't even talked about dynamic range and overflow errors and how they screw you up. If you have to add tests for that... yuck. With floating point, you degrade gracefully if your dynamic range spikes a little higher than you expect.

      Anyway, getting back on topic: This isn't the first time an x86 has had a stack-pointer related bug. I remember the 80386s that had the so-called "POPAD bug". That one was a bit easier to hit.

      Hopefully, AMD will be able to publish a microcode update or something to work around theirs. That's one thing modern x86s have over their predecessors: A good number of CPU bugs can be patched around with microcode updates. I believe Intel added that with the Pentium Pro, and AMD followed suit. I believe my Phenom is one of the affected parts. I guess I'll have to keep an eye out for such a patch.

    8. Re:you are mistaken by Mr+Z · · Score: 1

      Ugh... this expression was correct when I typed it. Part of it got eaten due to missing HTML escapes. Grrr...

      x3 = (int)(((long long)x0 * x1 + (1LL << (Q-1))) >> Q) + x2;
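
      For readers who want to try the fixed-point multiply-add, here is a small sketch in C for the simple case where all four numbers share one Q point. The name fx_muladd and the choice of Q16.16 (FX_Q = 16) are assumptions made for illustration; they do not come from the comments above.

```c
#include <stdint.h>

/* Assumed Q point: 16 fractional bits (Q16.16). Illustration only. */
#define FX_Q 16

/* Hypothetical helper: returns x0 * x1 + x2 with every operand and the
 * result at the same Q point, rounding the product to nearest. */
static int32_t fx_muladd(int32_t x0, int32_t x1, int32_t x2)
{
    /* Widen first so the product keeps all 2*FX_Q fractional bits. */
    int64_t prod = (int64_t)x0 * x1;

    /* Add half of the least significant result bit before shifting. */
    prod += (int64_t)1 << (FX_Q - 1);

    /* Shift back down to FX_Q fractional bits. Right-shifting a negative
     * value is implementation-defined in C; mainstream compilers use an
     * arithmetic shift. No overflow checks are done here. */
    return (int32_t)(prod >> FX_Q) + x2;
}
```

      In Q16.16, 1.5 is 98304, 2.0 is 131072 and 0.25 is 16384, so fx_muladd(98304, 131072, 16384) gives 212992, which is 3.25. Supporting a different Q point per operand, or detecting overflow, is where the extra macro arguments and tests mentioned above come in.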

    9. Re:you are mistaken by Sir_Sri · · Score: 1

      Um.... gpu's work on floats. I specifically know about the problem because I discovered it using GPU's for something. I only noticed it because I had the same problem ray tracing.

      Floats are floats. The IEEE specifies how they behave. GPU's happen to be able to crunch a lot of them compared to a CPU, even when it's actually implementing the floating point specification correctly (early GPU's didn't quite). It's a matter of the capacity to represent decimal numbers in binary. When you only have X bits for mantissa, Y bits for an exponent there's only so much you can do. It's not a matter of a particular brand of hardware having precision, they're all the same, or supposed to be.

      And yes, fixed point was the way to go, but for today, floats are probably your best bet performance wise, you just need to be aware that they will have errors because you just can't represent some numbers differently in floating point.

    10. Re:you are mistaken by Frnknstn · · Score: 1

      then ran it on an AMD machine and got a different (seemingly less-correct) result

      You are mistaken. The floating point bug was in Intel processors, not AMD.

      http://en.wikipedia.org/wiki/Pentium_FDIV_bug

      --
      If it's in you sig, it's in your post.
    11. Re:you are mistaken by neokushan · · Score: 2

      No, you are mistaken because you didn't read my post (or any of the posts above it). We're talking about floating point rounding irregularities that are present in ALL modern processors, not the floating point bug you're referring to.

      In any case, there is a different floating-point bug that affected some AMD CPU's as well - http://www.reghardware.com/2006/04/28/amd_opteron_fpu_bug/

      --
      +1 IDisagreeSoHeMustBeATrollOrAnAstroturferOrAShill
    12. Re:you are mistaken by Frnknstn · · Score: 2

      I stand by my original post.

      I did not take issue with the floating point irregularities. In fact, I also believe that the issues he experienced were not due to the FDIV problem he believed to be the cause. I probably would have used the fact that the last release of QuickBasic was in about 1989, before the widespread inclusion of FPUs in PCs, and that QuickBasic would almost certainly use software emulation for floating point arithmetic. It therefore would not have triggered a bug with the FDIV instruction.

      What I did take issue with was your notion that he would have run on the AMD chip and seen a less accurate result. As I said, the bug he was talking about was the FDIV bug.

      The idea that the QuickBasic would trigger an overheating-related bug on a 2006 Opteron is even more laughable than the OP's original troll post. :p

      --
      If it's in you sig, it's in your post.
    13. Re:you are mistaken by Mitchell314 · · Score: 1

      I would argue that if 32 bit (or maybe 64 bit) isn't enough, then you might want to be partitioning up your space anyways. Actually, if you're implementing more than rudimentary physics then it's a good idea to do anyways for optimization's sake.

      --
      I read TFA and all I got was this lousy cookie
    14. Re:you are mistaken by neokushan · · Score: 1

      I never said that he would have seen a less accurate result, I said he would have (or may have) seen a seemingly less accurate result simply due to the fact that different hardware will give slightly different results. In the above example of 4.0 (health) - 4.0 (damage), it's entirely possible that one system would truly calculate 0.0, while another would calculate 0.00000000000000001. The point is, when dealing with floating point numbers if you get 2 different results and are unaware of the complexities of floating point operations, it's a logical conclusion to believe that one may be right and one may be wrong (even though both are technically as right/wrong as each other). I wasn't saying that the AMD system gave an incorrect result, just that it may have appeared to from his perspective. A quick google of "AMD CPU floating point incorrect" would have certainly brought up an article or two detailing a known bug, even though it didn't actually affect him.

      However, you raise a point about QuickBasic. If it truly does use software floating point calculations, then the above point is moot. Since it's entirely speculative anyway, it doesn't really matter. All I was doing was demonstrating as to how an inexperienced programmer can come to a certain conclusion through no real fault of their own.

      --
      +1 IDisagreeSoHeMustBeATrollOrAnAstroturferOrAShill
    15. Re:you are mistaken by AlienIntelligence · · Score: 1

      Anyway, getting back on topic: This isn't the first time an x86 has had a stack-pointer related bug. I remember the 80386s that had the so-called "POPAD bug". That one was a bit easier to hit.

      Hopefully, AMD will be able to publish a microcode update or something to work around theirs. That's one thing modern x86s have over their predecessors: A good number of CPU bugs can be patched around with microcode updates. I believe Intel added that with the Pentium Pro, and AMD followed suit. I believe my Phenom is one of the affected parts. I guess I'll have to keep an eye out for such a patch.

      I was thinking... how the heck does he remember that flaw.
      Then I saw your UID. Good job, carry on sir.

      -AI

      --
      For me, it is far better to grasp the Universe as it really is than to persist in delusion
    16. Re:you are mistaken by dave87656 · · Score: 1

      We're talking about floating point rounding irregularities that are present in ALL modern processors

      No, only those that work in binary ;-)

  9. Kudos by Mannfred · · Score: 5, Insightful

    I can only imagine the time and effort spent on tracking down this problem - a rare CPU condition is exponentially more difficult to narrow down than most programming mistakes. A lot of progress in IT depends on engineers like this, who obsessively solve problems even when it's much easier to just ignore them, try to hack around them or pass the buck around. Kudos.

    1. Re:Kudos by justforgetme · · Score: 1

      Yep, indeed. Kudos to Matt and his insight.

      --
      -- no sig today
    2. Re:Kudos by paradigm82 · · Score: 2

      I agree but remember that most engineers are working on company time. For most companies it wouldn't be rational to have an engineer working months to isolate/reproduce this CPU bug. After all, this work will not particularly benefit this company over all the other companies and at any rate it would be much cheaper to just do the workaround (which might be necessary anyway). However, a good engineer probably couldn't resist looking into this in his free time (and maybe in company time with nobody looking!) at least to prove that he was right. Those engineers are usually so much more valuable than the average engineer, that even if they sometimes spend their time on things that are not rational for the company to spend their time on, it is still worth it to have them on the payroll :)

    3. Re:Kudos by fa2k · · Score: 1

      a rare CPU condition is exponentially more difficult to narrow down than most programming mistakes.

      You could say that the difficulty is proportional to the number of possibilities you have to check (~complexity). If the bug is at a lower level of abstraction, you have to check all the possible errors in that underlying platform as well as your own code.

      Thus you can indeed say something like "difficulty ~ exp(highest level of abstraction - level of abstraction where the bug appears)"

  10. Affected CPUs by Anonymous Coward · · Score: 5, Informative

    A pertinent addition to the submission would be which CPUs have been found to be affected.
    The second link says Opteron 6168 and Phenom II X4 820. For a second I thought that Bulldozer hasn't managed to do anything right, but these two examples are pre-Bulldozer.
    No doubt this is not an exhaustive list.

    1. Re:Affected CPUs by unixisc · · Score: 1

      That's exactly what I was wondering. The e-mail exchange didn't seem specific about which CPU was impacted. Please don't tell me it's every Opteron or Phenom in AMD's lines.

    2. Re:Affected CPUs by oldhack · · Score: 1

      Mod the parent up. Any link to full AMD announcement with the list of CPU models affected and the status on any workaround in the works?

      --
      Fuck systemd. Fuck Redhat. Fuck Soylent, too. Wait, scratch the last one.
    3. Re:Affected CPUs by fa2k · · Score: 1

      I don't even know where to check for more info... I definitely want to get my CPU replaced if it has this bug. I reckon AMD will not be eager to publicize this (if they linked to it from their support page, they would already be better than Intel)

    4. Re:Affected CPUs by dargaud · · Score: 3, Interesting
      I just went and checked on their microcode page, but the last download is fairly old. Anyway, the explanation on how to update on Linux is not clear:

      Support for updating microcode for the AMD processors listed above will be available starting with kernel version 2.6.29. Microcode update for AMD processors uses the firmware loading infrastructure.

      Does that mean that the kernel uploads the new microcode on boot ? How does it get it ?

      --
      Non-Linux Penguins ?
    5. Re:Affected CPUs by scheme · · Score: 4, Informative

      Does that mean that the kernel uploads the new microcode on boot ? How does it get it ?

      The microcode module loads the microcode for the cpu from /lib/firmware/amd if it's newer than the one on the cpu. You can download and place new microcode updates from amd in this directory if needed or just let your distro provider update the microcode files when they push new packages out.
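
      A rough way to check what scheme describes, as a sketch only: it assumes a Linux kernel that exposes a "microcode" field in /proc/cpuinfo (older kernels may not), and it uses the directory name from the comment above as-is. The function names are made up for illustration.

```python
from pathlib import Path


def microcode_revisions(cpuinfo_text):
    """Collect the distinct 'microcode' values from /proc/cpuinfo-style text."""
    revisions = set()
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "microcode":
            revisions.add(value.strip())
    return revisions


def firmware_files(directory):
    """List file names in a firmware directory; empty list if it does not exist."""
    path = Path(directory)
    return sorted(p.name for p in path.iterdir()) if path.is_dir() else []


if __name__ == "__main__":
    cpuinfo = Path("/proc/cpuinfo")
    if cpuinfo.exists():
        print("running microcode:", microcode_revisions(cpuinfo.read_text()))
    # Directory name taken from the comment above; distros may use another path.
    print("firmware files:", firmware_files("/lib/firmware/amd"))
```

      An empty file list just means nothing is installed at that path, which matches the "have to wait" situation.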

      --
      "When you sit with a nice girl for two hours, it seems like two minutes. When you sit on a hot stove for two minutes, it
    6. Re:Affected CPUs by dargaud · · Score: 1

      OK, thanks. I'm on one of the affected processors, but there's no /lib/firmware/amd directory. Have to wait I guess.

      --
      Non-Linux Penguins ?
    7. Re:Affected CPUs by Lisandro · · Score: 1

      It certainly seems to be that way :( The fact that Matt's tests on both Opterons and Phenoms broke in the exact same way is a bad indicator.

    8. Re:Affected CPUs by fnj · · Score: 1

      I have an Intel CPU and run RHEL6[*], but I suspect it's handled similarly. You can see it loading firmware at boot time. If I run dmesg I see a boot-time record containing, among a bunch of other lines:


          platform microcode: firmware: requesting intel-ucode/06-2a-07
          microcode: CPU0 sig=0x206a7, pf=0x2, revision=0x18 ...

      once for each core.

      I'm not sure where it gets the data file, but if you go to this download page at Intel, choose "Processors" in column 1, "Desktop" in column 2, and "Intel Core i3 Desktop Processor" in column 3, it takes you to a new page where you can enter "Linux" in column 1; column 2 will automatically be set to "Firmware" Download Type, and the first line of the results will be "Linux Processor Microcode Data File, 12/12/2011". If you go there, you can press "Download" and end up with a tarball named "microcode-20111110.tgz", which extracts to a single big text file "microcode.dat". Actually, regardless of what you entered along the way, it appears the file covers every Intel x86 processor (server, desktop, and mobile).

      The file contains a big bunch of hex numbers and some unilluminating comments and tags.

      I assume the distro packager gets updates periodically from the same underlying source.

      ~~~~~
      [*] Actually a free repackaging, PUIAS.

    9. Re:Affected CPUs by blade8086 · · Score: 1

      Latest info from the thread on the DragonFly kernel dev list -

      http://permalink.gmane.org/gmane.os.dragonfly-bsd.kernel/14523

      has some more specifics - with more details about AMD's plans for followup.

    10. Re:Affected CPUs by fa2k · · Score: 1

      Thanks for the link. I hope this can be fixed by microcode.

  11. And Linux? by aglider · · Score: 1

    Windows? Does this mean that those users and devs aren't so important as far as total CPU load?

    --
    Sent as ripples into the electromagnetic field. No single photon has been harmed in the process.
  12. Test case by geminidomino · · Score: 1

    Anyone know if that "Test case" image is available? I'd like to check if my Phenom II x6 is affected.

  13. Re:another horrible cpu bug by justforgetme · · Score: 1

    Ohh, I'm sure AMD will want you to believe that :-)

    --
    -- no sig today
  14. Confirmed CPUs by Jah-Wren+Ryel · · Score: 4, Informative

    FWIW:

    The failure has been observed on three different machines, all running AMD cpus. A quad opteron 6168 (48 core) box, and two Phenom II x4 820 boxes.

    --
    When information is power, privacy is freedom.
    1. Re:Confirmed CPUs by DigiShaman · · Score: 1

      Interesting. That sounds like a logical bug that can quickly be patched with a microcode update (BIOS or OS level).

      --
      Life is not for the lazy.
  15. security exploit? by Anonymous Coward · · Score: 3, Interesting

    I have to worry about stack smashing bugs here... can there be a way for (say) a data pattern in a media file, or carefully crafted javascript or java code that's been JIT-compiled, to break out of its sandbox? What about a hostile OS kernel running inside a VPS container taking over the hypervisor or bare iron? Hmm.

  16. This is why calculators use decimal arithmetic by perpenso · · Score: 1

    I remember division not being exactly accurate, so the solution I needed to use was to round up and down results that are really close.

    Just so you know, division is never accurate in floats, even when the CPU doesn't have bugs. If you're using doubles you'll get better accuracy, but with a 32 bit floating point number, you shouldn't be surprised to find errors in the third digit after the decimal point.

    Just to be clear, it's not limited to division. Hell, errors can creep in just by converting a decimal number to floating point. This is why calculators use decimal arithmetic, well some of them - like Perpenso Calc for iPhone iPad RPN Scientific Stats Business Hex. Try "0.5 - 0.4 - 0.1" in your favorite calculator app, it might indicate whether the app is using the FPU or decimal arithmetic. Of course the app may be doing something naive like the "BASIC MMORG", rounding results. It's naive because it is another source of rounding error, some results are legitimately a little bit off from a nicely round number.
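
    The "0.5 - 0.4 - 0.1" test is easy to try outside a calculator app. A minimal sketch, assuming Python's built-in float (an IEEE 754 double on common platforms) stands in for "the FPU" and the standard decimal module stands in for decimal arithmetic:

```python
from decimal import Decimal

# Binary floating point: 0.4 and 0.1 are not exactly representable.
binary_result = 0.5 - 0.4 - 0.1

# Decimal arithmetic: all three operands are exact.
decimal_result = Decimal("0.5") - Decimal("0.4") - Decimal("0.1")

print("binary  == 0:", binary_result == 0.0)
print("decimal == 0:", decimal_result == 0)
```

    The binary result is a tiny nonzero value, while the decimal result is exactly zero.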

    1. Re:This is why calculators use decimal arithmetic by sgunhouse · · Score: 2

      Division is division, regardless of the base used. The issue is that in base 10 (aka decimal numbers), division by 2 and 5 always comes out to a finite decimal; in binary numbers only division by 2 comes out to a finite decimal. Dividing by any primes other than 2 and 5 (and numbers involving those primes) will require rounding in both bases (and they may not necessarily round the same way). That is, unless you're only dividing by combinations of 2 and 5, there really is no preferred base.
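
      The rule above can be sketched as a small check: 1/n has a finite expansion in a base exactly when every prime factor of n also divides the base. The helper name is made up for illustration.

```python
from math import gcd


def terminates(denominator, base):
    """True if 1/denominator has a finite expansion in the given base."""
    d = denominator
    g = gcd(d, base)
    while g > 1:
        # Strip every factor that the denominator shares with the base.
        while d % g == 0:
            d //= g
        g = gcd(d, base)
    # Anything left over is a prime factor the base does not have.
    return d == 1


for n in (2, 5, 10, 3):
    print(n, "base 10:", terminates(n, 10), "base 2:", terminates(n, 2))
```

      Dividing by 5 or 10 is exact in decimal but not in binary, and dividing by 3 needs rounding in both.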

      The main problem with a QBASIC "single" (a 32-bit float) is the extremely limited precision of that type, and not so much how rounding is done. Most calculators these days can handle 8-12 digits, you have to use 64-bit floats (a QBASIC "double") to get anything like that from your program.

    2. Re:This is why calculators use decimal arithmetic by Joce640k · · Score: 2

      Multiplying and dividing are the least of your worries in floating point. Adding and subtracting are where the real problems happen.

      eg.

      float a = 0.1;
      float b = 0.2;
      if (a == b) {
          print("Before the add, a is equal to b");
      }
      float c = 10000000;
      a += c;
      b += c;
      if (a == b) {
          print("After the add, a is equal to b");
      }

      What's the output?
      What happens if you multiply by c instead of adding it?
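
      For anyone who wants to try the quiz without a C compiler, here is a sketch that emulates 32-bit floats in Python by rounding through struct after every operation. The f32 helper is an assumption of this sketch, not part of the snippet above.

```python
import struct


def f32(x):
    """Round a Python float (a 64-bit double) to the nearest 32-bit float."""
    return struct.unpack("f", struct.pack("f", x))[0]


a, b, c = f32(0.1), f32(0.2), f32(10000000.0)

print("before the add, a == b:", a == b)
# Near 1e7 adjacent 32-bit floats are 1.0 apart, so the fractions are lost.
print("after the add, a == b:", f32(a + c) == f32(b + c))
# Multiplying scales the values instead of swamping them.
print("after a multiply, a == b:", f32(a * c) == f32(b * c))
```

      With this emulation the values compare unequal before the add, equal after it, and unequal again if you multiply instead.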

      --
      No sig today...
    3. Re:This is why calculators use decimal arithmetic by rrohbeck · · Score: 1

      Well I learned in my first semester that thou shalt not compare reals for equality, at least not if they are the result of an arithmetic operation.
      There are languages that don't have an == operator for reals. Good idea if you ask me.

  17. Matt Dillon of Dragon Fly by hcs_$reboot · · Score: 4, Funny

    Matt Dillon, desperate after chasing unsuccessfully mary in Something about Mary radically changed jobs and started to study computer science...

    --
    Slashdot, fix the reply notifications... You won't get away with it...
    1. Re:Matt Dillon of Dragon Fly by Forever+Wondering · · Score: 1

      Matt Dillon, desperate after chasing unsuccessfully mary in Something about Mary radically changed jobs and started to study computer science...

      Matt Dillon and Cameron Diaz had been dating for three years prior to "There's Something About Mary", but split up shortly thereafter. So, maybe, not totally unsuccessful [from a certain point of view--Obiwan] ...

      --
      Like a good neighbor, fsck is there ...
    2. Re:Matt Dillon of Dragon Fly by midom · · Score: 1

      I had it the opposite way: once I watched a Matt Dillon movie, and "oh wow, now he is acting!" :-)

  18. x86_64 ABI by DrYak · · Score: 4, Interesting

    Pop two off the stack and ret to the calling routine seems fairly common to me. Lots of functions use two arguments and are called with near calls in various programming languages.

    That might have been true on 386s.

    But currently we're in 2012 and the most widely used instruction set for Linux on AMD processors is x86_64. Because these 64-bit processors feature a big number of registers, the two arguments will be passed in registers, not on the stack. So the sequence of instructions indeed isn't common.

    --
    "Sufficiently advanced satire is indistinguishable from reality." - [Tips: 1DrYakQDKCQ6y52z6QbnkxHXAocMZJE61o ]
    1. Re:x86_64 ABI by TheRaven64 · · Score: 2

      Not true. pop, pop, ret is a sequence that you are likely to see in any function that makes use of two or more callee-save registers - it will push them before using them and then pop them at the end. If you're lucky, the register allocator will have done some peephole optimisation and moved the pops earlier...

      --
      I am TheRaven on Soylent News
    2. Re:x86_64 ABI by Anonymous Coward · · Score: 0

      If it were that common, AMD would probably have caught the bug before mass production.

    3. Re:x86_64 ABI by TheRaven64 · · Score: 2

      Nonsense. It doesn't happen every time, or it would have been trivial to find. It happens under load, so it is most likely dependent on some other factor (e.g. CPU temperature, cache miss ratio, context switch timing) which may not have appeared in testing.

      --
      I am TheRaven on Soylent News
    4. Re:x86_64 ABI by Anonymous Coward · · Score: 0

      Parent is wrong about 32-bit cdecl or stdcall being uncommon, but he's accidentally right because those functions typically end with "LEAVE; RET."

      The LEAVE instruction is effectively "MOV ESP, EBP; POP EBP", so you don't end up with back-to-back pops before a return instruction.

    5. Re:x86_64 ABI by BitZtream · · Score: 0

      64 bit x86 processors have a lot of registers? Since when? Compared to what?

      x86 has always been register hindered; the lack of registers and the lack of orthogonality (not sure if that's actually a word, but you should get the point) make working with the x86 processor at an assembly level so painful that I personally would rather start from scratch with pretty much any other processor on the planet than deal with x86 assembly.

      Contrary to popular ignorance, x86_64 just added more registers with the same retarded issues as before, but it STILL ISN'T anywhere near even some simple 8bit CPUs I work with.

      Second, there are multiple calling conventions. There may be one that GCC uses as a default type, but there are 4 different call types used, and 1 of them would result in neither argument being stored in registers. Internally for non-exported functions a good compiler (I doubt GCC does it) will use the fastest convention internally that suits it AND the code around it.

      As far as how common that sequence is, well, I think you might want to go look at your favorite OS's assembly code and come back with corrected details.

      --
      Persistent Volume manager for Kubernetes - https://github.com/dwimsey/openshift-pvmanager
  19. Linux - not so much concerned by DrYak · · Score: 1

    Linux in 64-bit mode uses registers to pass arguments to functions.
    So the most common sequence at the end of a function isn't "bunch of pops then a (near) return", but "move the results into target registers and then return".
    Thus the bug sequence doesn't happen that often.

    --
    "Sufficiently advanced satire is indistinguishable from reality." - [Tips: 1DrYakQDKCQ6y52z6QbnkxHXAocMZJE61o ]
    1. Re:Linux - not so much concerned by dragonk · · Score: 2

      This would be insightful and all -- except that it isn't -- because DragonFly BSD uses the same x86-64 calling conventions as Linux.

    2. Re:Linux - not so much concerned by Rockoon · · Score: 2

      Indeed, even WIN64 uses the same essential calling convention because the hardware itself is designed with it in mind.. specifically the 64-bit structured exception handling requires 16-byte aligned read and write operations by design.

      --
      "His name was James Damore."
  20. War story from the network trenches by Anonymous Coward · · Score: 0

    In a way this bug reminds me of a problem I found on a certain software-hardware combination involving a ten gigabit Ethernet card. We were seeing mysteriously corrupted SSL connections when there was no reason for them to be corrupted. This condition occurred roughly every couple dozen gigabytes of transferred data on a connection that had both MSS-sized and very short data segments. After a couple of weeks of debugging, I was able to conclude that the TCP receive offload engine on the card occasionally injected Ethernet padding as data into the TCP stream, which the SSL library eventually processed as its input! Neither Ethernet nor TCP checksums saved us from this extra data, since it was conceptually generated on a higher layer; SSL caught it, though. Only God knows how many connections this semi-hardware (but at least software-fixable) bug had corrupted in a less obvious manner elsewhere before it was found out...

    1. Re:War story from the network trenches by m.dillon · · Score: 2

      Once in the late 1990's we had a weird bug where FTPing or RCPing a particular file between two offices would often result in a corrupt file on the other end. We kept scratching our heads trying to figure out what could possibly be corrupting the file. FTPing it anywhere else succeeded... no corruption. Everything else between the offices seemed to work ok.

      It wound up being a hardware issue with the T3 between the two offices. The hardware would corrupt the bitstream in a manner that tended to PASS the TCP/IP checksum, resulting in corrupted data. It required a particular pattern of 1's and 0's for the bitstream to be corrupted in a manner that passed the checksum, which this particular file happened to have.

      These days, of course, I use scp to transfer files whenever possible. SSH will detect that sort of corruption and fail with a protocol error. Encryption has certain uses beyond just encrypting the data, it seems!

      -Matt

    2. Re:War story from the network trenches by Anonymous Coward · · Score: 0

      This is a classical example of how individual flaky or failing gates or memory cells combine poorly with naive checksum algorithms (potentially with dedicated hardware-based implementations), and are able to create conditions where potentially critical memory corruption goes unnoticed.

      This is one of the reasons why iSCSI uses CRC-32C. It's not as robust as cryptographic hashing, but it's vastly more robust against minor hardware flakiness than many earlier checksum algorithms.

    3. Re:War story from the network trenches by Mr+Z · · Score: 1

      TCP's checksum is notoriously weak though. For example, if you send a file filled with 0xFF, 0x00, 0xFF, 0x00.., but that triggered a 1 bit shift framing error so it came out, say, 0xFE, 0x01, 0xFE, 0x01, the TCP checksum would be the same for both. Heck, you can even swap bytes and TCP won't notice, since the TCP checksum is just a 1s complement 16-bit sum.

      I'm going to guess that just about any fancier checksum probably would have caught the problem. Even a plain 16-bit CRC.
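
      To make the weakness concrete, here is a sketch of the 16-bit ones' complement checksum in the style of RFC 1071, showing the reordering case only: two payloads whose 16-bit words are in a different order get the same checksum. The function name and sample bytes are made up for illustration, and real TCP also sums a pseudo-header that is left out here.

```python
def internet_checksum(data):
    """Ones' complement of the ones' complement sum of big-endian 16-bit words."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
        total = (total & 0xFFFF) + (total >> 16)  # fold the carry back in
    return ~total & 0xFFFF


original = bytes([0x12, 0x34, 0xAB, 0xCD, 0x00, 0x01])
reordered = bytes([0xAB, 0xCD, 0x00, 0x01, 0x12, 0x34])  # same words, new order

print(hex(internet_checksum(original)), hex(internet_checksum(reordered)))
```

      Addition is commutative, so any reordering of whole 16-bit words goes unnoticed; a CRC depends on bit position and would not be fooled this way.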

  21. So is this the fanboy way to deflect from it? by Sycraft-fu · · Score: 4, Interesting

    You try and find something that "the other guy" had a problem with and bring it up as worse so as to try and "protect" the thing you are a fan about? Because I see nothing about the FDIV bug anywhere but your post.

    Oh and you know what that bug applied to, right? The Intel Pentium, the ORIGINAL Pentium. Not the Pentium MMX, not the Pentium Pro, not the Pentium II, not the Pentium III, not the Pentium 4, not the Core, not the Core 2, not the Core i, not the second generation Core i. And yes, that's how many major processor versions from Intel there have been since then (with another to launch in the next couple weeks). The original Pentium chips that had this problem came out almost 2 decades ago, 1993.

    So seriously, leave off it. I get tired of any time there is a problem with $Product_X fans of it will point out how $Product_Y had a similar or worse error way back in the day and that somehow changes things.

    No it doesn't. The story is about the AMD chips, nobody gives a shit about the FDIV bug and I'll wager there are people reading Slashdot who weren't alive when it happened.

    The good news for AMD is that processors can often patch around this shit in microcode these days so a recall may not be needed. Have to see, but the potential is there for a software (so to speak) fix.

    1. Re:So is this the fanboy way to deflect from it? by Daniel+Phillips · · Score: 1

      I don't disagree with your rant, however it is just not a good idea to dismiss a processor bug as "happened a long time ago". The point is, it happened. Processor bugs happen. And here is one that happened last year if you must.

      --
      Have you got your LWN subscription yet?
    2. Re:So is this the fanboy way to deflect from it? by gnasher719 · · Score: 1

      So seriously, leave off it. I get tired of any time there is a problem with $Product_X fans of it will point out how $Product_Y had a similar or worse error way back in the day and that somehow changes things.

      Unless the $Product_X fans also point out that the maker of $Product_Y paid an awful lot of money to replace the broken CPUs.

    3. Re:So is this the fanboy way to deflect from it? by c · · Score: 2

      > I get tired of any time there is a problem with $Product_X fans of it will point
      > out how $Product_Y had a similar or worse error way back in the day and that
      > somehow changes things.

      The FDIV bug was really in a class of its own as CPU bugs go; it was trivially user accessible. You could test for the presence of the bug using a *spreadsheet*. This differs from pretty much every other CPU bug where you pretty much have to be cranking out some odd code before you see anything.

      Being as accessible as it is, it's always going to be the first thing someone thinks of when you say "CPU bug". Even though, when you get right down to it, it really doesn't represent a typical CPU bug.

      --
      Log in or piss off.
    4. Re:So is this the fanboy way to deflect from it? by msauve · · Score: 1

      Aw, hell. You don't need to go back 20 years to find bugs in Intel processors, just look at the errata for any current one. Same for almost ANY significantly complex processor.

      For example, a simple Google for "intel i7 errata" produces a link to this document, which has a whole errata section. First one:

      A single Data Translation Look Aside Buffer (DTLB) error can incorrectly set the Overflow (bit [62]) in the MCi_Status register. A DTLB error is indicated by MCA error code (bits [15:0]) appearing as binary value, 000x 0000 0001 0100, in the MCi_Status register.
      Implication: Due to this erratum, the Overflow bit in the MCi_Status register may not be an accurate indication of multiple occurrences of DTLB errors. There is no other impact to normal processor functionality.
      Workaround: None identified.
      Status: For the steppings affected, see the Summary Table of Changes.

      There are others that are more severe in their effects ("processor hangs"). AMD has similar errata, that's not news. Processors have bugs. The FDIV one just happens to be the most well known because it came along at the right time (as personal computers were becoming widely used) to make the general public aware that computers aren't perfect. That's what makes it a candidate for comparison.

      --
      "National Security is the chief cause of national insecurity." - Celine's First Law
    5. Re:So is this the fanboy way to deflect from it? by Rockoon · · Score: 1

      The FDIV bug is the most well known because it affected pretty much everything that did lots of FPU work. Everything from games to business applications.

      --
      "His name was James Damore."
  22. Wouldn't worry about it by Sycraft-fu · · Score: 2

    Presumably AMD will announce affected CPUs fairly soon, after they get done testing. This isn't the kind of thing they would be able to sit on, even if they wanted to. If your CPU has been working for you in general it isn't like it is going to suddenly go and beat up your cat or something, it'll be fine for a bit longer while AMD figures out which ones are all affected and figures out how to fix it.

    As I noted in another post, depending on the bug, it may be possible to fix it via microcode. CPUs aren't "pure" hardware these days. They have a bit of software that tells them how to do things and on some of them (Intel CPUs I know for sure) it is field upgradable. So they may find a way to patch out the bug.

    Just keep an eye on their page, maybe send them an e-mail saying you'd like a notice when they know. Should be soon I'd imagine.

  23. A guy like matt by maestroX · · Score: 1

    I'm pretty stoked... it isn't every day that a guy like me gets to find an honest-to-god hardware bug in a major cpu!

    for a guy like me, this is pretty much an honest-to-god bughunt ;)

  24. So BSD is dying because... by vs · · Score: 1

    ...the CPUs suck :-)

    Actually how (Free)BSD made use of available hardware resources back then in the 90s was the big reason for me to use it instead of Linux.

  25. Re:another horrible cpu bug by fuzzyfuzzyfungus · · Score: 1

    I'm pretty sure that TSMC doesn't even fab AMD CPUs. They did GPUs for ATI, and likely still handle some portion of the discrete GPU parts under AMD; but I believe that AMD CPU production is all still in their formerly-in-house-now-spun-off fabs...

  26. 4 isn't a normalised floating point by Anonymous Coward · · Score: 0

    Since adding up terms of half the power of the previous number goes up only to 2, 4 cannot be represented.

    You have to increase the exponent.
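
    A small sketch of that point, assuming the usual convention of a mantissa in [1, 2): the helper below (its name is made up) rewrites a positive value as m * 2**e, and 4 comes out as mantissa 1.0 with a larger exponent rather than as a mantissa of 4.

```python
import math


def normalised(x):
    """Return (m, e) with x == m * 2**e and 1 <= m < 2, for positive finite x."""
    m, e = math.frexp(x)  # frexp uses the convention 0.5 <= m < 1
    return m * 2, e - 1


for value in (1.0, 3.0, 4.0):
    print(value, "->", normalised(value))
```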

  27. Wow... by raehl · · Score: 0

    Can we say "WOOOOSH!"

  28. No list of affected CPU models? by Lisandro · · Score: 1

    I'm very interested on this, since the original posting by Matt Dillon hints at the bug being present in all Opteron and Phenom models. The bug seems hard enough to replicate, but still, corrupting the stack is no minor detail.

    1. Re:No list of affected CPU models? by blade8086 · · Score: 1

      More details available here:

      http://permalink.gmane.org/gmane.os.dragonfly-bsd.kernel/14523

      Also - this issue was underway in ~late december / early january and if you check
      the dragonfly kernel list from that time, it specifically notes this does *not* affect all
      opterons/amd64's what-have-you .

    2. Re:No list of affected CPU models? by Lisandro · · Score: 1

      So it's all of AMD's 10h and 12h. Thank you!

  29. Then why only DragonFly BSD? by DrYak · · Score: 1

    Then why has it only been seen on DragonFly BSD?
    My opinion is that the sequence that leads to the bug is rare. DragonFly BSD happens to have it somewhere (due to sheer luck, because otherwise the sequence is uncommon) and under some stress condition, the bug got triggered.
    Under the same stress condition Linux didn't barf, simply because the sequence is so rare that the Linux kernel didn't have it.

    Can someone do some screening of the binary image of a few kernels and see how common is the pop-pop-nearret sequence ?
    (Lazy and don't have the material to do it here now :-P )
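
    A very crude version of that screening, as a sketch only: it scans raw bytes for two one-byte POP opcodes (0x58 to 0x5F) followed directly by a near RET (0xC3). It is not a disassembler, so it ignores REX-prefixed pops such as pop %r12 and can match bytes that sit inside other instructions or data. The function name is made up, and a hit says nothing about whether the erratum would actually trigger.

```python
def count_pop_pop_ret(code):
    """Count offsets where two one-byte POPs are followed directly by a near RET."""
    def is_pop(byte):
        return 0x58 <= byte <= 0x5F  # POP of rax..rdi, no REX prefix

    hits = 0
    for i in range(len(code) - 2):
        if is_pop(code[i]) and is_pop(code[i + 1]) and code[i + 2] == 0xC3:
            hits += 1
    return hits


# pop %rbx; pop %rbp; ret
print(count_pop_pop_ret(bytes([0x5B, 0x5D, 0xC3])))
```

    To try it on a real image, read the file in binary mode and pass the bytes in; a proper answer would need a real disassembler restricted to the text section.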

    --
    "Sufficiently advanced satire is indistinguishable from reality." - [Tips: 1DrYakQDKCQ6y52z6QbnkxHXAocMZJE61o ]
    1. Re:Then why only DragonFly BSD? by TheRaven64 · · Score: 1

      Then why has it only been seen on DragonFly BSD?

      Who says it is? It's only been reproduced consistently on DragonFly BSD. There's no reason to believe that it hasn't caused stack corruption on other systems...

      --
      I am TheRaven on Soylent News
    2. Re:Then why only DragonFly BSD? by blade8086 · · Score: 1

      Correction - DragonFlyBSD did have it.

      After this issue was found, the offending code was coded around to prevent the crash
      issue, and the bug discussed with AMD, leading to this news article

  30. Bulldozer not effected. by m.dillon · · Score: 5, Informative

    AMD has indicated to me that the Bulldozer is not effected, which is a relief.

    I guess I should have realized this would get slashdotted. In any case, it took quite a bit of effort to track the bug down. It was very difficult to reproduce reliably. It isn't a show stopper in that it really takes a lot of work to get it to happen and most people will never see it, but it's certainly a significant bug owing to the fact that it can be reproduced with normal instruction sequences.

    I began to suspect it might be a cpu bug last year and after exhaustive testing I posted my suspicions in December:

    http://leaf.dragonflybsd.org/mailarchive/kernel/2011-12/msg00025.html

    Older versions of GCC were more prone to generate the sequence of POP's + RET, coupled with a deep recursion and other stack state, that could result in the bug. It just so happened that DragonFly's buildworld hit the right combination inside gcc, and even then the bug only occurred sometimes and only on a small subset of .c files being compiled (like maybe 2-3 files). The bug never manifested anywhere else, doing anything else, running any other application. Ever.

    In particular the bug disappeared with later versions of GCC and disappeared when I messed with the optimizations. We use -O by default, not -O2. The bug disappeared when I produced code with gcc -O2 (using 4.4.7).

    It is really unlikely that Linux is effected... the sensitivity to particular code sequences laid out in the compiler is so fine that adding a single instruction virtually anywhere could make the bug disappear. Even just shifting the stack pointer a little bit would make it disappear.

    In any case, for a programmer like me being able to find an honest-to-god cpu bug in a modern cpu is very cool :-)

    -Matt

    1. Re:Bulldozer not effected. by m.dillon · · Score: 5, Interesting

      Since the cat is out of the bag some further clarification is required so I will include some more of the email I received. I didn't quite mean for it to explode onto the scene this quickly, but oh well.

      Again, note that this is *NOT* an issue with Bulldozer. And they will have a MSR workaround for earlier models.

        >> quote
      "AMD has taken your example and also analyzed the segmentation fault and the fill_sons_in_loop code. We confirm that you have found an erratum with some AMD processor families. The specific compiled version of the fill_sons_in_loop code, through a very specific sequence of consecutive back-to-back pops and (near) return instructions, can create a condition where the processor incorrectly updates the stack pointer.

      AMD will be updating the Revision Guide for AMD Family 10h Processors and the Revision Guide for AMD Family 12h Processors, to document this erratum. In this documentation update, which will be available on amd.com later this month, the erratum number for this issue will be #721. The revision guide will also note a workaround that can be programmed in a model-specific register (MSR)."
          end quote

      They go on to document a specific workaround when the MSR is not programmed, which is basically to add a nop for every five pop+return instructions (though I'm not sure if the nop must occur between sequences or within the sequence). I will note that just the presence of 5xPOP + RET does not trigger the bug alone, it requires a very specific set of circumstances setup prior to that (that gcc's fill_sons_in_loop() procedure was able to trigger when gcc 4.7.x was compiled -O, when compiling particular .c files).

      As I said, this bug was very difficult to reproduce. It took a year to isolate it and find a test case that would reproduce it in a few seconds. Until then it was taking me upwards of 2 days to reproduce it on a 48-core and much longer to reproduce it on a 4-core.

      Since the bug was sensitive to the stack pointer address, the initial stack randomization that DragonFly does multiplied the time it took to reproduce the bug. But without the stack randomization the bug would NOT reproduce at all (I would never have observed it in the first place). In other words, the bug was *very* stack address sensitive on top of everything else.

      I was ultimately able to improve the time it took to reproduce the bug by poring over all my previous buildworld runs and finding the .c files that gcc had compiled that were most statistically likely for gcc to seg-fault in. Then once I isolated the files I iterated all possible starting stack offsets and eventually managed to reproduce the bug within 10 seconds using a gcc loop (10-20 gcc runs on the same file).

      Changing the stack offset by a mere 16 bytes and the bug went away completely. The one or two particular stack offsets that reproduced the bug could then be further offset in multiples of 32K and still reproduce the bug at the same rate. Using a later version of gcc and the bug disappeared. Compiling with virtually any other options (turning on and off optimizations)... the bug disappeared.

      On the bright side, I thought this was a bug in DragonFly for most of last year and set about 'fixing' it, and wound up refactoring most of DragonFly's VM system to get rid of SMP bottlenecks and making it perform much better on SMP in the face of a high VM fault rate. So even though we wound up not doing the 2.12 release the eventual 3.0 release (that we just put out recently) has greatly improved cpu-bound performance on SMP systems.

      -Matt

    2. Re:Bulldozer not effected. by Lisandro · · Score: 1

      +1. Thank you for your great work on this.

    3. Re:Bulldozer not effected. by Anonymous Coward · · Score: 0

      I'm being pedantic, but you mean affected, not effected.

      Anyway, very impressive work.

    4. Re:Bulldozer not effected. by m.dillon · · Score: 1

      Yah, I get those two mixed up all the time, and will continue to probably for the rest of my life. On the bright side people know it's actually me doing the posting when they read that and a few other grammatical mistakes that I often make.

      -Matt

    5. Re:Bulldozer not effected. by Mr+Z · · Score: 1

      Multiples of 32K, eh? This smells like an issue that needs a well timed cache miss to trigger. IIRC, the L1D is 64K, two-way associative. Addresses that are multiples of 32K apart will map to the same cache line.

      For the failing alignments, does one of the POP instructions cause %rsp to cross cache line boundaries?
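
      The 32K arithmetic can be sketched as follows, assuming a 64-byte cache line (not stated in this thread) on top of the 64K, two-way figures quoted above. Under those assumptions the cache has 512 sets, and addresses 32K apart select the same set; they compete for the same two ways rather than landing in literally the same line.

```python
CACHE_BYTES = 64 * 1024  # L1D size quoted above
WAYS = 2                 # associativity quoted above
LINE_BYTES = 64          # assumed line size, not stated in the thread
SETS = CACHE_BYTES // (WAYS * LINE_BYTES)


def set_index(address):
    """Index of the cache set that a byte address maps to."""
    return (address // LINE_BYTES) % SETS


stack_address = 0x7FFF1230  # arbitrary example address
print(SETS, set_index(stack_address), set_index(stack_address + 32 * 1024))
```

      That would be consistent with an offset of 32K preserving the failure while a shift of 16 bytes changes where the pops fall relative to a line boundary, but it is only a plausibility check, not a diagnosis.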

    6. Re:Bulldozer not effected. by Mr+Z · · Score: 1

      I should say, does %rsp cross cache line boundaries in that particular pop sequence at the time of the crash?

  31. self modifying code by Chirs · · Score: 1

    "there is never any need for self modifying code"

    I'm sorry to burst your bubble, but the linux kernel uses self-modifying code to provide the ability to build a kernel that will boot on many machines but can still be tweaked at boot to run faster depending on what the cpu supports.

    1. Re:self modifying code by Anonymous Coward · · Score: 0

      That's not an example of self-modifying code. That's an example of conditional branching. That's been a staple of computers for like, forever.

  32. Re:another horrible cpu bug by hairyfeet · · Score: 1

    Actually I believe some of the APUs, such as the Brazos platform, are made by TSMC. Yes, according to this link they inked a deal with TSMC to use their SOI process for Fusion APUs, and since I can't find any links saying this deal was later canceled I have to assume that is what they did. It makes sense, as TSMC has the experience with ATI GPUs and the whole point of Fusion is wedding the ATI GPU to various AMD CPUs such as Stars (Llano), Bobcat (Brazos) and Bulldozer (FX). Since the overall performance of those chips depends on how well the GPUs come out, and they can't really bin the GPUs like they normally do without ending up with a bazillion variations, it makes sense that they went with the one that has more GPU experience.

    So does anybody have any REAL info on what's going on? What chips? What sockets? Is it only on server, or are desktop and mobile affected as well? Is it a one in a million bug or an easy to hit bug? Considering CPU bugs can go from the Pentium I "you'll likely never hit it" to the Phenom I quad "Good luck getting above 2.4GHz because the third core (Core 2) is buggy and unstable", there can be a pretty wide variance when it comes to odds. As someone who owns several AMD units and has sold even more, some more info would be nice.

    --
    ACs don't waste your time replying, your posts are never seen by me.
  33. Re:This is why I use Intel, Quality by m.dillon · · Score: 4, Insightful

    Intel has had quite a few serious chip bugs too, all in errata. A number of new cpu bugs always appear in new generations of both AMD and Intel chips, but both companies have very large test suites and the number of new bugs goes down in every generation.

    Don't forget that Intel had to recall a sandybridge chipset early in the sandybridge cycle, which cost them something like a billion dollars because the related motherboards had to be thrown away and replaced. That was due to internal on-chip circuitry related to a SATA port burning out.

    Right at this moment AMD has two issues facing it in order to compete on workstations: (1) Power and (2) Performance. Their initial bulldozer release clearly depends too much on compiler optimizations to make full use of the architecture. They will clearly have to bulk-up some of the simplifications they made that made their cpu cores a little too sensitive to instruction sequences generated by compilers and I hope their next few releases will do better.

    On power consumption it comes down to the Fab as much as anything else. Their dependence on the Fab is clearly a problem and they've made a break for it to try to solve it, even though it is costing them dearly. At the same time Intel has made some major advances in their three fabs, to the point where Intel can do their entire production on just two of those three fabs now but they decided to keep the third fab because they think they can 'grow into' it.

    So AMD definitely has some work ahead of it, and I am hoping they reserve some of their focus for the high-end and don't concentrate entirely on laptops. I always like to say that I love AMD, but in the stock market I invest in Intel. That's just business. But I got on the AMD bandwagon big-time when they got to 64-bit first and I stuck with them all the way through the Phenom II.

    Now, at this moment, Intel's SandyBridge has the best value and AMD's Bulldozer is quite far behind, so new purchases for me right now are Intel. That may change in the next year or two and when it does my new purchases will happily be in the AMD camp again. Frankly, AMD only has to get within shouting distance (~8%) of Intel and I will happily use AMD. AMD doesn't have to beat Intel.

    I think there are a number of things AMD can do right now to compete better with Intel. One of the biggest is in the mini-server department (albeit clearly with lower volumes than their current focus on laptops & integrated graphics). AMD consumer cpus (aka Phenom II) always had ECC support but very few motherboards actually supported it, which made it difficult to use AMD for mini-servers and avoid the Intel Xeon tax to get ECC. If AMD worked on the mobo vendors to ALWAYS support an ECC option that would allow them to compete against Intel Xeons on price, even if they are unable to compete on performance.

    On the opterons AMD clearly has the right idea going with high-core-count cpus, but the memory subsystem is lagging too much to really be able to make use of all those cores. That seems to be low-hanging fruit to me, something which should be readily addressable by AMD. The opterons still have a lot of value and potentially can have a radical improvement in value with Bulldozer, but only if AMD can push the core count and improve the memory subsystem.

    On large multi-core boxes AMD also needs to improve CMPXCHG and other atomic instructions in situations where contention is high. Right now multi-chip opteron systems seriously lag Intel on contended latency due to cache coherency inefficiencies. Will Bulldozer fix those latency issues? I don't know.

    AMD only needs to get within shouting distance of Intel for me to buy their chips, and work their mobo producers a bit more to get better overall support for their chips' capabilities. They don't have to beat Intel.

    -Matt

  34. The significant question: *which* AMD processors? by Anonymous Coward · · Score: 0

    I'm getting ready to rebuild my system. Before I buy, can someone tell me *which* AMD "processor families" have this bug?

                    mark

  35. When it is my code the problem is always... by banetbi · · Score: 1

    a hardware issue. When it is developed by someone else, then it is a software issue, at least that is what I tell my boss.

  36. Re:This is why I use Intel, Quality by Anonymous Coward · · Score: 0

    Yeah, Intel. Quality, so long as you don't need to get the right answer from floating point math or locking operations to not eat data. Hardware bugs happen to both vendors, you're a fool if you think Intel are infallible.

    * http://en.wikipedia.org/wiki/Pentium_F00F_bug
    * http://en.wikipedia.org/wiki/Pentium_FDIV_bug

  37. Exact? by Anonymous Coward · · Score: 0

    If you have float scale = 0.1; in your source code, I'd like to know how you get the binary representation to be "exact" while using IEEE 754. We wouldn't need numerical analysis if computers were able to accurately represent real numbers, rather than a subset of the rationals.

    Yes, there are rules for dealing with things, but if you're trying to test if two floats are equal, you're probably doing it wrong if you're not including a bit of a fudge factor, because all sorts of rules that apply to real numbers no longer come out exactly right when you're dealing with numbers the computer can actually represent.
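
    A minimal sketch of both points, in Python rather than C. Python's float is IEEE 754 binary64, so the single-precision `float scale = 0.1;` case is emulated by a struct round-trip; the helper name `to_float32` and the tolerance values are assumptions chosen for illustration, not recommendations.

```python
import math
import struct
from decimal import Decimal


def to_float32(x):
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


# What actually gets stored for the literal 0.1 at each precision.
scale64 = 0.1
scale32 = to_float32(0.1)

# Decimal(float) converts the stored binary value exactly, with no rounding,
# so these show the true values rather than the shortened repr.
exact64 = Decimal(scale64)
exact32 = Decimal(scale32)

# Neither stored value is the rational number 1/10.
is_exact64 = exact64 == Decimal("0.1")   # False
is_exact32 = exact32 == Decimal("0.1")   # False

# Exact comparison trips over accumulated rounding error...
naive_equal = (0.1 + 0.2 == 0.3)         # False

# ...while a comparison with a relative tolerance (the "fudge factor") passes.
close_equal = math.isclose(0.1 + 0.2, 0.3, rel_tol=1e-9)

# The tolerance has to match the precision in play: the binary32 value of 0.1
# is too far from the binary64 one for 1e-9, but fine for 1e-6.
tight = math.isclose(scale32, scale64, rel_tol=1e-9)
loose = math.isclose(scale32, scale64, rel_tol=1e-6)
```

    Printing `exact64` and `exact32` shows the full decimal expansion of what is stored. A relative tolerance alone behaves badly near zero, which is why `math.isclose` also accepts an `abs_tol` argument.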

  38. Re:another horrible cpu bug by dave87656 · · Score: 1

    The AMD CPU was probably designed in the US and made in Germany.