AMD Confirms CPU Bug Found By DragonFly BSD's Matt Dillon

Posted by Soulskill on Monday March 5, 2012 @05:14PM from the it's-not-me-it's-you dept.

An anonymous reader writes "Matt Dillon of DragonFly BSD just announced that AMD confirmed a CPU bug he found. Matt quotes part of the mail exchange and it looks like 'consecutive back-to-back pops and (near) return instructions can create a condition where the processor incorrectly updates the stack pointer.' The specific manifestations in DragonFly were random segmentation faults under heavy load."

This isn't nearly as bad as the division bug by Omnifarious · 2012-03-05 17:24 · Score: 4, Insightful

Though, it's still very serious. At least it generally causes your program to crash rather than spitting out a wrong answer. And it sounds like the sequence of instructions that causes it is not commonly found.
I can well understand the guy who found it being all excited. The CPU is the last place you'd look for a bug, and finding one is pretty impressive, especially a really elusive one like this.

--
Need a Python, C++, Unix, Linux develop
1. Re:This isn't nearly as bad as the division bug by XDirtypunkX · 2012-03-05 17:33 · Score: 4, Insightful
  
  Both are equally bad from the perspective of a software developer who spends a month trying to work out exactly what is wrong with their code, especially if something like this occurs on a test machine but not on a development machine.
2. Re:This isn't nearly as bad as the division bug by icebike · 2012-03-05 17:35 · Score: 3, Informative
  
  And it sounds like the sequence of instructions that causes it is not commonly found.
  Really?
  Pop two off the stack and ret to the calling routine seems fairly common to me. Lots of functions use two arguments and are called with near calls in various programming languages.
  
  --
  Sig Battery depleted. Reverting to safe mode.
3. Re:This isn't nearly as bad as the division bug by GoodNewsJimDotCom · 2012-03-05 17:40 · Score: 4, Funny
  
  I found out about the division bug as a beginner programmer! I was trying to write the first MMORPG using Quick Basic. I remember division not being exactly accurate, so the solution I needed to use was to round up and down results that are really close. It fixed it, but new programmers shouldn't be forced to deal with stuff like that.
  
  I've preferred AMDs to Intels because AMD was one of the first sponsors to Esports back in 99. Too bad Columbine happened and I suspect they wanted to distance themselves from Quake tournaments. Another thing I like about AMD was that their processors don't melt if they get hot because they have a self preservation shutdown mode. People said Intel had this, but I melted a processor just a few months ago on SWTOR.
  
  --
  God spoke to me
4. Re:This isn't nearly as bad as the division bug by sjames · 2012-03-05 17:44 · Score: 4, Insightful
  
  Crash bugs are frustrating, but nowhere NEAR as scary as a bug that results in an incorrect but plausible computation. If the program crashes, you KNOW it crashed and you know the runs before that didn't crash are OK.
  Note that IRL the two cases can overlap. That is, a bug that might trigger a crash or might trigger an incorrect computation that might be plausible depending on luck of the draw.
5. Re:This isn't nearly as bad as the division bug by Smauler · 2012-03-05 17:48 · Score: 5, Funny
  
  I was trying to write the first MMORPG using Quick Basic.
  Sounds like the division bug was the least of your problems....
6. Re:This isn't nearly as bad as the division bug by Corbets · 2012-03-05 17:50 · Score: 4, Funny
  
  I found out about the division bug as a beginner programmer! I was trying to write the first MMORPG using Quick Basic.
  
  I've never heard "choosing the wrong programming language" described as a bug, but hey, however you want to play it off, man.
7. Re:This isn't nearly as bad as the division bug by Anonymous Coward · 2012-03-05 17:52 · Score: 5, Insightful
  
  Floating point operations are never fully precise. Simple numbers such as 4.0 would be represented as 4.0000000000000213 or 3.99999999999973 if you arrive at this after doing a bunch of calculations.
  This is an inherent limitation of how floating point works, and not something that has been "fixed". Programmers still have to worry about this.
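A minimal Python sketch of the point above, assuming IEEE 754 double precision (what Python's `float` uses on common platforms). The helper name `accumulate` is made up for illustration. It shows a value drifting after repeated additions, and why comparisons are usually done with a tolerance rather than `==`.

```python
import math

def accumulate(step: float, count: int) -> float:
    """Add `step` to zero `count` times, rounding after each addition."""
    total = 0.0
    for _ in range(count):
        total += step
    return total

# 0.1 has no exact binary representation, so ten additions drift from 1.0.
total = accumulate(0.1, 10)
print(total == 1.0)              # False: exact comparison fails
print(math.isclose(total, 1.0))  # True: tolerance-based comparison succeeds

# 0.5 is a power of two, so every partial sum here is exact.
print(accumulate(0.5, 8) == 4.0)  # True
```

This is ordinary rounding behavior, not a hardware defect; it is the kind of thing the "round results that are really close" workaround mentioned earlier in the thread papers over.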
8. Re:This isn't nearly as bad as the division bug by synthesizerpatel · 2012-03-05 17:59 · Score: 4, Insightful
  
  If your program is 'the kernel' then that qualifies as 'as bad as the division bug' && 'it's a big deal'.
9. Re:This isn't nearly as bad as the division bug by Forever+Wondering · 2012-03-05 18:19 · Score: 4, Interesting
  
  Though, it's still very serious. At least it generally causes your program to crash rather than spitting out a wrong answer. And it sounds like the sequence of instructions that causes it is not commonly found.
  I can well understand the guy who found it being all excited. The CPU is the last place you'd look for a bug, and finding one is pretty impressive, especially a really elusive one like this.
  Actually, it could be occurring in other places/programs that aren't crashing but are [silently] producing bad results. The floating point bug, once isolated, could be probed for, and compensated for.
  From what I can tell from reading the assembly code, the function is unremarkable except for the fact that it's recursive. It isn't doing anything exotic with the stack (e.g. just pushes at prolog and pops at epilog). The epilog is starting at +160 and the only thing I notice is that there are several conditional jumps there and just above it is a recursion call with a fall through. But, from the AMD analysis, it appears that it's the specific order of the push/pops that is the culprit. In this instance, it's r14, r13, r12, rbp, rbx
  The workaround for this bug might be that the compiler has to put a nop at the start of all function epilogs (e.g. a nop before the pop sequence) on every function because you can't predict which function will be susceptible. Or, you have to guarantee that the push/pop sequence doesn't emit the sequence that causes the problem (e.g. move the rbp push to the first in sequence as I suspect that putting it in the middle is what is causing the problems)
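To make the first idea concrete, here is a rough Python sketch of a text-level pass that puts a `nop` in front of every run of `pop` instructions ending in `ret`. Everything here is an assumption for illustration: the function name `pad_epilogs`, the simplified one-instruction-per-line input, and the Intel-style mnemonics. Nothing in the thread establishes that this padding would actually avoid the erratum.

```python
def pad_epilogs(lines: list[str]) -> list[str]:
    """Insert a `nop` before each run of `pop` instructions that ends in `ret`."""
    out: list[str] = []
    i = 0
    while i < len(lines):
        j = i
        # Scan a run of consecutive pop instructions starting at i.
        while j < len(lines) and lines[j].split()[:1] == ["pop"]:
            j += 1
        if j > i and j < len(lines) and lines[j].split()[:1] == ["ret"]:
            # Pops followed directly by ret: pad the start of the epilog.
            out.append("nop")
            out.extend(lines[i:j + 1])
            i = j + 1
        elif j > i:
            # Pops not followed by ret: copy them through unchanged.
            out.extend(lines[i:j])
            i = j
        else:
            out.append(lines[i])
            i += 1
    return out

# Illustrative epilog only; register order here is a guess at the pop sequence.
epilog = ["pop rbx", "pop rbp", "pop r12", "pop r13", "pop r14", "ret"]
print(pad_epilogs(epilog))
```

A real fix would live in the compiler's code generator or in microcode rather than in a post-processing script; the sketch only shows what "a nop before the pop sequence" means.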
  
  --
  Like a good neighbor, fsck is there ...
10. Re:This isn't nearly as bad as the division bug by bzipitidoo · 2012-03-05 18:22 · Score: 5, Interesting
  
  Oh, I've found CPU bugs before. But I never found one others hadn't already found. The 16MHz 80386 had a bug with counters. If you did a REP MOVSW or similar instruction in a 16 bit mode, starting on an odd address, and you made the pointer registers roll over, the CPU would lock up. Couldn't handle the transition from 0xFFFF to 0x0001 in either direction. That was fixed in all the faster 386's. As I recall, there were about a dozen bugs in the 386. Of course later processors were all checked for those specific bugs, so they never happened again.
  Then there's unintended features such as pipeline oddities. If you have self modifying code, and it changes the destination of a jump instruction immediately before executing it, the computer will jump to the old address. Step through those same instructions in a debugger, and it will jump to the new address. Strictly speaking, jumping to the old address is incorrect, but it doesn't break any good code and fixing it would wreck pipelining. This behavior has been known for a long time, and every CPU from at least the 386 to the Pentium 4 behaves this way. It wasn't an important problem because so little code was self modifying. Wasn't any good as a copy protection method either, as only an amateur would be fooled by it. I think it's been resolved in at least 2 ways. First, by amending the documentation for the instruction set to expressly state that behavior is undefined in such a case, and second, by proving that there is never any need for self modifying code. And making the separation between code and data explicit. Now we have No eXecution bits.
  There are sometimes even Easter eggs. For some processors, a few unassigned opcodes performed a useful operation. It wasn't by design. Is that a bug? Another case was the use of out of bounds values. For instance, the ancient 6502 supports this packed decimal arithmetic mode, in which 0x99 meant 99. So what happened when some joker gave it an illegal value such as 0xFF? 0xFF was interpreted as 15*10+15 = 165, and one could perform some math on it and get correct results. Divide 0xFF by 2 (shift right), and it would compute the correct result of 0x82. That sort of thing makes life tough for emulators, and I have yet to find an Apple II emulator that reproduces that behavior faithfully.
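The packed decimal reading described above is simple arithmetic, sketched here in Python. The function names are hypothetical, and this models only how a byte is read as two decimal digits; it does not claim to reproduce what a real 6502 does with out-of-range values.

```python
def packed_decimal_value(byte: int) -> int:
    """Read a byte as two decimal digits: high nibble is tens, low nibble is ones."""
    if not 0 <= byte <= 0xFF:
        raise ValueError("expected a single byte")
    return (byte >> 4) * 10 + (byte & 0x0F)

def is_valid_bcd(byte: int) -> bool:
    """A byte is valid packed decimal only if both nibbles are 0-9."""
    return (byte >> 4) <= 9 and (byte & 0x0F) <= 9

print(packed_decimal_value(0x99), is_valid_bcd(0x99))  # 99 True
print(packed_decimal_value(0xFF), is_valid_bcd(0xFF))  # 165 False
```

An emulator that rejects or normalizes invalid nibbles up front would give different results from hardware that simply runs them through the same digit arithmetic, which is the difficulty the comment points at.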
  
  --
  Intellectual Property is a monopolistic, selfish, and defective concept. It is "tyranny over the mind of man"
11. Re:This isn't nearly as bad as the division bug by sjames · 2012-03-05 18:31 · Score: 5, Insightful
  
  Imagine, there is a tiny bug that makes your floating point results just slightly wrong once in 1000 times. You run an iterative dynamic simulation of a bridge under load that runs for a million cycles. The results LOOK right...
12. Re:This isn't nearly as bad as the division bug by Darinbob · 2012-03-05 19:04 · Score: 3, Insightful
  
  CPUs have plenty of bugs. It's not necessarily the last place to look, especially for less popular processors. The only reason bugs are rarer with Intel and Intel-copying CPUs is that the market is so much bigger, and therefore so are the resources for QA. Actually, the bigger and more complex processors become, the more likely they are to have bugs. Of course, most are things people don't worry about or that can be worked around by following advice in the errata.
  In fact, so many people assume CPUs have bugs only in the rarest of cases that it's hard to convince others that you have actually found a bug that's not in the errata. The same thing happens with compilers: you tell people that the bug must be in the compiler and they roll their eyes at you.
13. Re:This isn't nearly as bad as the division bug by JWSmythe · 2012-03-05 19:08 · Score: 4, Informative
  
  Anyone who's programmed long enough has found unexplainable bugs that are eventually traced down to some bad hardware. :)
  I've preferred AMD over Intel for years. Long ago, in a distant computer store, far away.... We sold 386s, 486s, and Pentiums (or their reasonable clone) from Intel, IBM, AMD, and Cyrix. At the time, I didn't really care who made the chip, they were just built out for the customer.
  Over the years, I learned to prefer AMD for both the price and performance. Plenty of people will argue "but this Pentium is faster than that AMD". Well, it's all nice, but I don't *have* to stay bleeding edge. I never liquid cooled my CPU, video card, and memory. Friends did. I was always impressed with how much they wasted. I'd just wait 6 months or so, and get something better, faster, and cheaper. :) I do like having a high performance computer, so I upgrade every year or so.
  For example, I just set up a couple servers from COTS parts. They used AMD FX-8120's (8 core, 4.0Ghz turbo) for $199.99/ea. It seems the comparable Intel is the i7-980 (6 core, 3.6Ghz), which is selling at $589.99. For the difference in price, I could build out a 3rd server, and still have money left over. Tom's Hardware suggests the i5-2500K (4 core, 3.7Ghz turbo) for $224.99 or i7-2600K (4 core 3.8Ghz turbo) for $324.99 as comparable. If I wanted to spend a little more, I could have gone with the AMD FX-8150 (8 core, 4.2Ghz turbo) for $249.99. Was $50 for .2Ghz worth it? Not really. Something bigger, better, and faster will be out next year, and the year after, and then I'll buy something new.
  I used newegg.com for all the prices, so it would be fairly even.
  The servers actually use as many cores as I can throw at them, so it's extremely beneficial to have more cores at high speeds.
  My desktop/gaming machine still has a Phenom IIx6 1100T in it. All the games I play, I can leave all the settings turned all the way up. Maybe if I ran benchmarks, I'd see something else gets a slightly faster frame rate, but I can't see any difference. As we all know, various benchmarks show different things.
  
  --
  Serious? Seriousness is well above my pay grade.
14. Re:This isn't nearly as bad as the division bug by GoodNewsJimDotCom · 2012-03-05 19:08 · Score: 5, Interesting
  
  Heh. I coded a nice tile based RPG out of it, but I couldn't make it an MMOG because there is no socket code in Quick Basic. The trick to making big games in Quick Basic is to write your own Virtual Disk so you can get past the 640k memory limit. Once you have a virtual disk, you can write an interpreted language inside Quick Basic, then your code is simply loaded up in a custom database. I rewrote the whole thing in C/C++ because people told me I could get socket libraries in it, but I gave up on my game entirely when Ultima Online came out because I felt I wouldn't be able to build up a market because my graphics are so bad. I was partially right in thinking there is only enough room for one MMORPG at a time back in 97, but I think I shouldn't have given up after having coded for thousands of hours, with things like Farmville succeeding today.
  
  --
  God spoke to me
15. Re:This isn't nearly as bad as the division bug by phantomfive · 2012-03-05 20:07 · Score: 3, Insightful
  
  Just because you found an error in a division when you were programming your MMORPG in Quick Basic doesn't mean you've found the Pentium bug. If you noticed it happening a lot, it probably wasn't the bug, just normal IEEE precision issues.
  
  --
  "First they came for the slanderers and i said nothing."
16. Re:This isn't nearly as bad as the division bug by Pieroxy · 2012-03-05 20:08 · Score: 3, Informative
  
  What is the problem with Quick Basic? It came for free and it was quite ok.
  No network access? Might be fine for you, but for an MMORPG programmer on the other hand...
  
  --
  Write boring code, not shiny code!
17. Re:This isn't nearly as bad as the division bug by AmiMoJo · 2012-03-05 20:37 · Score: 3, Interesting
  
  Most of the undocumented op-codes on older CPUs were down to the fact that they were designed by hand rather than having the circuits computer generated. A computer will make sure all illegal op-codes are caught and generate an exception, but human beings didn't bother. Designers put in test op-codes as well which were usually just left in there for production. Even the way humans design circuits makes them more likely to produce useful undocumented op-codes and side-effects.
  It was somewhat risky to use them though because the manufacturer might decide to change the CPU. The Z80 design was licensed out and any number of companies could supply them, all with their own unique bugs. Some games liked to use these features for copy protection and then broke when the producer switched supplier.
  
  --
  const int one = 65536; (Silvermoon, Texture.cs)
  SJW, n: "Someone I don't like, and by the way I'm a fuckwit" - AC
18. Re:This isn't nearly as bad as the division bug by wvmarle · 2012-03-05 20:37 · Score: 3, Insightful
  
  Google is known to build their servers from cheap parts.
  Like a RAID, but then a RAIS (Redundant Array of Independent Servers). Load distribution may be an issue as it has to seamlessly reassign tasks when a server is down for whatever reason. But for sufficiently large operations (five servers or more) this sounds to me like the way to go. Instead of trying to make every individual server highly reliable, go with the still very reliable user-grade stuff and get your reliability by redundancy. And companies like Google need more than one server anyway.
19. Re:This isn't nearly as bad as the division bug by Anonymous Coward · 2012-03-05 21:49 · Score: 3, Informative
  
  What is the problem with Quick Basic? It came for free and it was quite ok.
  NO it did NOT. What came free was QBasic, which was a stripped-down version of Quick Basic. The full Quick Basic did not have the 640k memory limitation, was able to fully link/compile stand-alone executables, and had a host of other Professional features that QBasic lacked.
  Don't get me wrong- QBasic was great for a free environment (at the time). But it was severely limited, and all the references to "quick basic" in this thread appear to be referring to shortcomings in QBasic, which were not present in Quick Basic.
20. Re:This isn't nearly as bad as the division bug by JWSmythe · 2012-03-05 22:13 · Score: 5, Interesting
  
  Well, $604.94/ea. The memory came with an 8GB Class 4 micro SD, and we got a $10 newegg gift card each. I forget what that was bundled with. If you consider a gift card as cash, they were under $600/ea.
  
  13-131-767 @$94.99/ea ASUS M5A97 AM3+
  17-822-008 @$24.99/ea DIABLOTEK PSDA500 500W RT
  19-103-961 @$199.99/ea AMD 8-CORE FX-8120 3.1G
  20-220-609 @$84.99/ea 4Gx4 PATRIOT PGD316G1600ELQK
  22-148-725 @$99.99/ea Seagate 1.5TB ST1500DL003 (x2)
  All of those are quantity 1, except the hard drives. They are 8 core 4Ghz (always running in Turbo mode), with 16GB ram, and RAID 1 on the drives. I opted to go more like Google's topless server. I used cable ties to mount up everything on wire racks from Home Depot. Ya, the same plastic/rubber coated ones you'd use in your closet. This is serving out of my house on a business FiOS line, so no one at a datacenter can complain. :) They're running amazingly cool. Because there's nothing interrupting normal convection air currents, all the heat sinks and drives are cool to the touch. They're a bit quieter than my desktop PC, because I don't require any extra fans to pull the hot air out of the case. My regular desktop has a 250cfm fan on it to keep it cool. Without it, and with the side on, it can overheat in a few minutes when gaming.
  The room does have an air conditioning return in it, which helps keep the room cool. The only fan I added was a HEPA filter. It's oversized for the room, but it'll help keep dust off the machines. The room is the same temperature as the rest of the house, so I'm happy with it. It serves no purpose for cooling the machines, since it's not even pointed at them. :)
  I have some pretty low load servers. Rather than buying a dozen of anything, I opted for using virtual machines. These two servers are hosting 4 VMs at this time, and there will be more. It's a young setup, and I have a lot of work to do on it. I opted to use VirtualBox. It works very well. I had intended to try VMWare ESXi or Citrix XenServer. Unfortunately, neither would use the crappy software RAID that the boards provide, and I wasn't willing to drop money on real RAID controllers. I looked around a bit, and it seems that you can try to use some workarounds, but I didn't have the time or inclination to do it, where I could have VirtualBox going in less than an hour.
  The VMs are redundant between servers. Further on, you can read more about how I did it in the past between physical boxes. So if a single VM crashes, who cares. If a VM host crashes, well, it's reduced redundancy, but I'm still operating. I'm going to put out more VM hosts, and increase the redundancy. 4 machines with 6 VMs each is like 24 physical boxes. That's a serious savings, especially where the VM host costs about $600.
  Let me give you a little history. :)
  Long before Google made the pictures of the way they do servers, the company I was at was using COTS parts. That was voyeurweb.com (NSFW). They were hosting with a company not to be named (as in, I can't remember), who sold them on a $50k investment of a Sun server. They promised it was more power than anyone could ever want. That lasted about 3 days. It was after this, I got involved with them. We dropped about $15k on 10 servers. They were fairly cheap machines. Asus gaming motherboards, AMD K6/2 300 CPU, 512MB RAM, 8GB and 20GB IDE drives. The most expensive part at the time was the cases. It was pretty much what you'd be using at home at the time.
  We had the occasional failures, but they were usually due to load or CPU fan failures. At the time, they had under 1 million daily viewers, so we could handle that load on 4 of the 10 machines. Load balancing was done with DNS round robin. I know people say it's a poor system, but it worked well. There was typically a 3 second delay if you happened to hit a bad server, and then you'd roll off to the nex
  
  --
  Serious? Seriousness is well above my pay grade.
21. Re:This isn't nearly as bad as the division bug by Rockoon · 2012-03-06 01:14 · Score: 4, Informative
  
  We are not talking about the start of the function, but the end.
  Who is this "we" .. are you, the anonymous coward, teamed up with icebike (68054)? Clearly you shouldn't be, since he most definitely stated his belief that a two-parameter function would pop its two input parameters near its final return statement.
  
  anything the function has pushed onto the stack will be on the top of the stack, before the return functions.
  What you are saying is not news to me. The problem with your argument is that in the x86-64 calling conventions (which is what the article is talking about) there are plenty of volatile registers to use. To be specific there are 7 general purpose registers (64-bit) as well as 6 SSE registers (128-bit) that are considered volatile. If a function really uses so many registers that it requires saving a few of the non-volatile registers, then the function is also most often going to be so non-trivial that it must maintain 16-byte stack alignment.
  
  Only leaf functions can safely violate the 16-byte alignment rule and are allowed to push and pop willy-nilly, but leaf functions also don't need non-volatile registers themselves because they aren't calling anything that might destroy the registers they use. So we are talking about a very narrow situation where the function is (a) A leaf function and (b) Takes many parameters (more than 4, certainly) in order to create the register pressure required to need to spill some of them onto the stack someplace other than the mandatory scratch stack space (for the first 4 arguments) required by the calling convention.
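The alignment rule mentioned above comes down to a little modular arithmetic, sketched here in Python. The assumptions: the stack pointer must be 16-byte aligned at each call, the `call` instruction pushes an 8-byte return address, and each saved register takes 8 bytes. The function name `padding_before_call` and the constants are made up for the example.

```python
RETURN_ADDRESS_BYTES = 8   # pushed by the call instruction
REGISTER_BYTES = 8         # each 64-bit push
ALIGNMENT = 16

def padding_before_call(saved_registers: int, local_bytes: int = 0) -> int:
    """Extra bytes to subtract from rsp so it is 16-byte aligned at the next call."""
    used = RETURN_ADDRESS_BYTES + saved_registers * REGISTER_BYTES + local_bytes
    return (-used) % ALIGNMENT

# An odd number of pushed registers restores alignment; an even number needs 8 more bytes.
for count in range(6):
    print(count, padding_before_call(count))
```

Under these assumptions, a non-leaf function that saves an even number of registers has to adjust the stack pointer as well, so its epilog is not a bare run of pops followed by `ret`.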
  
  My lawn...
  
  --
  "His name was James Damore."
22. Re:This isn't nearly as bad as the division bug by dalias · 2012-03-06 01:52 · Score: 4, Informative
  
  This is not insightful; it's wrong. Floating point on any modern system conforms, or at least is intended and assumed to conform, to IEEE 754. There are exact answers specified for every basic arithmetic operation and non-transcendental functions. Of course there are decimals that have no representation in binary, but 4.0 is not one of them.
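A quick Python check of that claim, assuming IEEE 754 doubles (what Python's `float` uses on common platforms). The helper `is_exact` is a name invented for this sketch; it compares the stored value against the exact decimal using rational arithmetic.

```python
from fractions import Fraction

def is_exact(text: str) -> bool:
    """True if the decimal literal `text` is stored exactly as a Python float."""
    return Fraction(float(text)) == Fraction(text)

print(is_exact("4.0"))   # True: 4.0 is a power of two
print(is_exact("0.1"))   # False: 0.1 is rounded to the nearest double
print(2.0 * 2.0 == 4.0)  # True: the exact product is representable
```

So a basic operation whose exact result is representable returns exactly that result; drift appears only when inputs or intermediate results had to be rounded.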
23. Re:This isn't nearly as bad as the division bug by mcgrew · 2012-03-06 03:55 · Score: 4, Funny
  
  Ah, the memories...
  Though, it's still very serious. At least it generally causes your program to crash rather than spitting out a wrong answer.
  At Intel, Quality is Job 0.99989960954
  Q: What is a mad scientist?
  A: A researcher with a Pentium
  Q: How many Pentium designers does it take to screw in a light bulb?
  A: 1.99904274017, but that's close enough for non-technical people.
  Q: What's another name for the "Intel Inside" sticker they put on Pentiums?
  A: The warning label.
  Q: Why didn't Intel call the Pentium the 586?
  A: Because they added 486 and 100 on the first Pentium and got 585.999983605.
  Q: Did you hear about the new "morning after" pill being developed as a replacement for RU-486???
  A: Its called RU-Pentium. It causes the embryo to not divide correctly.
  
  --
  Free Martian Whores!
24. Re:This isn't nearly as bad as the division bug by JWSmythe · 2012-03-06 05:40 · Score: 3, Interesting
  
  It got more complex as it grew. This is all from memory, so there may be some inconsistencies with what I'm saying.
  The main site primarily used 3 subdomains. There were also 3 pay sites. For a while there were 6 machines for video streaming. The video streaming site went away, mostly due to cost vs income. We added free hosting, which was small to start, and picked up popularity and grew to dozens of servers. There were side projects launched on their own servers. Some worked. Some were dropped.
  Masterstats.com was an example of that. In function, it was similar to Google Analytics. It started as 1 DB and 2 web servers. Because of the load thrown at it, it became 3 DB, 4 web, and 2 offline calculation servers. The offline machines were just to process stats that were too intensive to do live, and created an undue load on the web servers.
  Another example was our backups. If you go digging through my journal posts, you'll find me talking about multi TB arrays, back when the largest drives on the market were 250GB. Back in the first iteration, it was fairly simple to keep backups. That grew as we kept throwing more into it.
  These are the rough counts just for the main sites.
  Before I took over, there were 3 subdomains (www, voy, and ww3.) Each was on one server. If one server failed, that part of the site stopped. Needless to say, that was bad if (and when) www. went down.
  4 servers in one city. All 4 machines had all 3 subdomains.
  8 servers in two cities. That's 4 machines in 2 cities. Each set could handle the full load, in the event we lost a city. That meant each city handled 50% of the traffic normally, or 100% in a city failure. We could operate on 2 per city, but the extra 2 provided for redundancy.
  When we scaled out to 3 cities, we split the 3 subdomains between groups of servers in each city. We also divided up the load equally, so each city got 33% of the traffic. Having a city fail increased the traffic to the other cities by 17%. The sites' data existed on all the servers, so in the event of a failure of one server, we could distribute that load. So now we had 3 to 5 servers per group, in 3 cities. Newer hardware was faster, so those would have combined duties. Clearly a dual 1.4Ghz machine could handle more than a single 350Mhz machine. At this point, we had retired almost all of the 350Mhz machines, except for a few that were recycled to do DNS and other low-load tasks.
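The failover figures above can be checked with a few lines of Python. This is a sketch that assumes traffic is split evenly across cities, as the comment describes; the function names are hypothetical.

```python
def share_per_city(cities: int) -> float:
    """Fraction of total traffic each city carries when load is split evenly."""
    if cities < 1:
        raise ValueError("need at least one city")
    return 1.0 / cities

def increase_after_failure(cities: int) -> float:
    """Percentage points each surviving city gains when one city fails."""
    return (share_per_city(cities - 1) - share_per_city(cities)) * 100

print(round(increase_after_failure(3), 1))  # 16.7, i.e. roughly the 17% quoted
print(round(increase_after_failure(2), 1))  # 50.0, the two-city case
```

Note that the 17% is in percentage points of total traffic: each surviving city goes from about 33% to 50% of the total, which is a 50% rise in its own load.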
  When we got to 5 cities (New York, Los Angeles, San Diego, Tampa, and Frankfurt), the loads got divided up differently. Frankfurt had one 100Mb/s circuit. San Diego had one 100Mb/s circuit. New York had two GigE circuits. Los Angeles had 2 GigE circuits, which grew to 3 when we retired San Diego. Tampa had 2 GigE circuits. Frankfurt was retired, and the traffic was just added to New York. That barely made a blip on the bandwidth graphs.
  Members sites had a warm friendly map so they could pick their city to view from. We looked at other options, but it was simple to give them a map to choose where to serve from, and they could pick another city any time they wanted. Forums had people discussing which ones were "faster". It was very subjective. People in the West would pick Los Angeles or San Diego, with most preferring Los Angeles. People in the East picked Tampa or New York. People in Europe said Frankfurt was too slow, and they had better access via New York or Tampa.
  Each city had different load characteristics. Free hosting servers were deployed in 3 cities. There were some special case servers too. For example, someone who had a very high load "free hosting" site that made a lot of money via the AVS could get their own server or servers.
  Pet projects got their own servers as needed.
  I
  
  --
  Serious? Seriousness is well above my pay grade.
Re:Microcode patch by Omnifarious · 2012-03-05 17:30 · Score: 3, Informative

I'm wondering if they will. This seems like a very odd timing issue that may be a problem in the electronics. Of course, I suppose they could just put in some microcode to wait after certain operations to make sure things settle and so avoid the hardware bug.

--
Need a Python, C++, Unix, Linux develop
Re:another horrible cpu bug by Taco+Cowboy · 2012-03-05 17:31 · Score: 4, Insightful

What has Taiwan got to do with this ?
I mean, was the CPU bug somehow introduced by TSMC ?

--
Muchas Gracias, Señor Edward Snowden !
Re:cool, but...? by ffflala · 2012-03-05 17:40 · Score: 3, Insightful

It matters because it's impressive. It also seems fair to associate some of the positive impression with DragonflyBSD, and I cannot see any downside to throwing good PR at any BSD flavor.
Kudos by Mannfred · 2012-03-05 17:44 · Score: 5, Insightful

I can only imagine the time and effort spent on tracking down this problem - a rare CPU condition is exponentially more difficult to narrow down than most programming mistakes. A lot of progress in IT depends on engineers like this, who obsessively solve problems even when it's much easier to just ignore them, try to hack around them or pass the buck around. Kudos.
Affected CPUs by Anonymous Coward · 2012-03-05 17:49 · Score: 5, Informative

A pertinent addition to the submission would be which CPUs have been found to be affected.
The second link says Opteron 6168 and Phenom II X4 820. For a second I thought that Bulldozer hasn't managed to do anything right, but these two examples are pre-Bulldozer.
No doubt this is not an exhaustive list.
1. Re:Affected CPUs by dargaud · 2012-03-05 22:19 · Score: 3, Interesting
  
  I just went and checked on their microcode page, but the last download is fairly old. Anyway, the explanation on how to update on Linux is not clear:
  
  Support for updating microcode for the AMD processors listed above will be available starting with kernel version 2.6.29. Microcode update for AMD processors uses the firmware loading infrastructure.
  Does that mean that the kernel uploads the new microcode on boot ? How does it get it ?
  
  --
  Non-Linux Penguins ?
2. Re:Affected CPUs by scheme · 2012-03-06 01:39 · Score: 4, Informative
  
  Does that mean that the kernel uploads the new microcode on boot ? How does it get it ?
  The microcode module loads the microcode for the cpu from /lib/firmware/amd if it's newer than the one on the cpu. You can download and place new microcode updates from amd in this directory if needed or just let your distro provider update the microcode files when they push new packages out.
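To see what is currently sitting in that directory, a small Python sketch like the following would do. The default path is the one named in the comment above; whether a given distribution uses exactly that directory is an assumption, and the function name `list_microcode_files` is invented for the example.

```python
from pathlib import Path

def list_microcode_files(firmware_dir: str = "/lib/firmware/amd") -> list[str]:
    """Return the names of files in the firmware directory, or [] if it is absent."""
    directory = Path(firmware_dir)
    if not directory.is_dir():
        return []
    return sorted(p.name for p in directory.iterdir() if p.is_file())

# Prints an empty list on systems where the directory does not exist.
print(list_microcode_files())
```

This only lists files; it does not load anything into the CPU. Loading is done by the kernel's microcode module as described above.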
  
  --
  "When you sit with a nice girl for two hours, it seems like two minutes. When you sit on a hot stove for two minutes, it
Re:cool, but...? by wrook · 2012-03-05 17:51 · Score: 5, Interesting

Matt Dillon is a rather famous programmer (as programmers go). I assume that's why they mention him by name. I think a very large percentage of old Amiga hackers know who he is. He's also done work on the Linux kernel. Despite all that, he's best known for his work on FreeBSD and on his DragonflyBSD project. While a lot of old timers will know that, not everyone else will.
Confirmed CPUs by Jah-Wren+Ryel · 2012-03-05 18:04 · Score: 4, Informative

FWIW:

The failure has been observed on three different machines, all running AMD cpus. A quad opteron 6168 (48 core) box, and two Phenom II x4 820 boxes.

--
When information is power, privacy is freedom.
security exploit? by Anonymous Coward · 2012-03-05 18:20 · Score: 3, Interesting

I have to worry about stack smashing bugs here... can there be a way for (say) a data pattern in a media file, or carefully crafted javascript or java code that's been JIT-compiled, to break out of its sandbox? What about a hostile OS kernel running inside a VPS container taking over the hypervisor or bare iron? Hmm.
Re:you are mistaken by Sir_Sri · 2012-03-05 18:36 · Score: 4, Informative

A floating point precision error. Floating points cannot represent quite a diverse collection of numbers, this is especially problematic when you're doing intersections with small objects. Say a ray projected from an object will, because of the minute errors in floating point, collide with the same object (which produces some cool patterns).
Floating points are kind of crappy. Not that I have a better option with viable performance on a desktop machine. That's not a division bug, that's just the nature of representing numbers in binary with a fixed number of bits.
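To make the representation point concrete, here is a minimal C sketch (the helper name repeated_sum_hits is made up for illustration, and it assumes IEEE 754 doubles). Adding 0.1 ten times does not land exactly on 1.0, because 0.1 has no exact binary representation, while adding 0.25 four times does, because 0.25 is a power of two.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical helper: add `step` to zero `count` times and report
 * whether the running sum compares exactly equal to `target`. */
bool repeated_sum_hits(double step, int count, double target) {
    assert(count >= 0);
    /* volatile forces every partial sum to be stored as a double,
     * so extended-precision registers cannot hide the rounding. */
    volatile double sum = 0.0;
    for (int i = 0; i < count; i++) {
        sum += step;
    }
    return sum == target;
}
```

Calling repeated_sum_hits(0.1, 10, 1.0) should report a miss, while repeated_sum_hits(0.25, 4, 1.0) should report a hit. Nothing is wrong with the division or addition hardware in either case.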
Matt Dillon of Dragon Fly by hcs_$reboot · 2012-03-05 18:46 · Score: 4, Funny

Matt Dillon, desperate after unsuccessfully chasing Mary in There's Something About Mary, radically changed jobs and started to study computer science...

--
Slashdot, fix the reply notifications... You won't get away with it...
x86_64 ABI by DrYak · 2012-03-05 19:12 · Score: 4, Interesting

Pop two off the stack and ret to the calling routine seems fairly common to me. Lots of functions use two arguments and are called with near calls in various programming languages.
That might have been true on 386s.
But it's 2012 now and the most widely used instruction set for Linux on AMD processors is x86_64. Because these 64-bit processors have a larger number of registers, the two arguments will be passed in registers, not on the stack. So the sequence of instructions indeed isn't common.

--
"Sufficiently advanced satire is indistinguishable from reality." - [Tips: 1DrYakQDKCQ6y52z6QbnkxHXAocMZJE61o ]
So is this the fanboy way to deflect from it? by Sycraft-fu · 2012-03-05 19:18 · Score: 4, Interesting

You try and find something that "the other guy" had a problem with and bring it up as worse so as to try and "protect" the thing you are a fan about? Because I see nothing about the FDIV bug anywhere but your post.
Oh and you know what that bug applied to, right? The Intel Pentium, the ORIGINAL Pentium. Not the Pentium MMX, not the Pentium Pro, not the Pentium II, not the Pentium III, not the Pentium 4, not the Core, not the Core 2, not the Core i, not the second generation Core i. And yes, that's how many major processor versions from Intel there have been since then (with another to launch in the next couple weeks). The original Pentium chips that had this problem came out almost 2 decades ago, 1993.
So seriously, leave off it. I get tired of any time there is a problem with $Product_X fans of it will point out how $Product_Y had a similar or worse error way back in the day and that somehow changes things.
No it doesn't. The story is about the AMD chips, nobody gives a shit about the FDIV bug and I'll wager there are people reading Slashdot who weren't alive when it happened.
The good news for AMD is that processors can often patch around this shit in microcode these days so a recall may not be needed. Have to see, but the potential is there for a software (so to speak) fix.
Re:you are mistaken by Rockoon · 2012-03-05 21:07 · Score: 3, Insightful

512bit calculations aren't that expensive
Yes they are.

--
"His name was James Damore."
Re:you are mistaken by neokushan · 2012-03-05 21:48 · Score: 4, Insightful

Except I very much doubt that would solve whatever "problems" this guy was having. As a newbie programmer, it's entirely understandable that he wouldn't know about the fun you can (or can't) have with floating point operations. However, I very much doubt that sheer accuracy was the issue, rather he was probably making assumptions such as a computed value that ought to be 1.0, minus 1.0, being exactly 0.0, when in reality the result isn't necessarily exactly 0.0. Considering it's an MMO, he probably had something like "Why is this guy not dying, he has 4 HP left and this attack does exactly 4 damage? Must be a bug!".
Really, it doesn't matter a huge amount, if such "accuracy" is important to your game then instead of doing "if(Health is less than 0.0) /* die */", you do something like "if (Health is less than 0.0 + epsilon) /* die */", with "epsilon" being a very small number (such as 0.00000001).
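A minimal C sketch of that epsilon check, for illustration only. The names HEALTH_EPSILON, apply_hits and is_dead are invented here, the epsilon is the 0.00000001 from the paragraph above, and doubles are assumed. Repeated subtraction of 0.1 may leave a tiny residue on either side of zero, so the check tolerates it instead of demanding an exact value.

```c
#include <assert.h>
#include <stdbool.h>

/* Tolerance taken from the comment above; the macro name is hypothetical. */
#define HEALTH_EPSILON 1e-8

/* Subtract `damage` from `health` once per hit, the way a game loop might. */
double apply_hits(double health, double damage, int hits) {
    assert(hits >= 0);
    for (int i = 0; i < hits; i++) {
        health -= damage;
    }
    return health;
}

/* Dead if health is below zero, or within epsilon above it. */
bool is_dead(double health) {
    return health < 0.0 + HEALTH_EPSILON;
}
```

With these definitions, is_dead(apply_hits(4.0, 0.1, 40)) comes out true whichever way the rounding falls, while 39 hits still leaves roughly 0.1 HP and the target alive. A suitable epsilon depends on the magnitudes involved, so treat the constant as a placeholder.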
The real fun with floats, however, is that each platform does something different. It's possible that the OP ran the game on Intel hardware and got one result (which may have seemed more "correct"), then ran it on an AMD machine and got a different (seemingly less-correct) result - you can see why he naturally jumped to the conclusion that the AMD system had a bug.
In reality, chances are both systems were "wrong" anyway, they just happen to use different implementations for floating-point logic. To solve this, once again higher-precision calculations aren't the answer, but rather there's a compiler switch (/fp:strict in VS) that will use the ISO standard floating point model. It's not as fast as the other methods, but you will at least get the same results across different platforms (assuming that the CPU has implemented the standard correctly, which these days is almost certain).
There's LOTS of fantastic info on this here: http://gafferongames.com/networking-for-game-programmers/floating-point-determinism/

--
+1 IDisagreeSoHeMustBeATrollOrAnAstroturferOrAShill
Re:you are mistaken by Mr+Z · 2012-03-06 03:26 · Score: 4, Informative

I'm pretty sure it was with the introduction of the Pentium (which had the famous FDIV bug) that John Carmack officially made the switch to single precision FP for most things because it was finally fast enough. FP wasn't cheap, per se, but the simplification it brings over keeping track of binary points and precision/range tradeoffs in integerized algorithms should not be underestimated either.
For example, if I want to do a floating point multiply and add, I just say: f3 = f0 * f1 + f2. Before I even start writing a fixed-point multiply and add, I need to ask what the Q points (binary points) are for each of the terms, what Q point you'd like for the result, and what sort of rounding (if any) the result requires for stability. You can end up with a monstrosity like this, assuming all four numbers are at the same Q point:
x3 = (int)(((long long)x0 * x1 + (1LL << (Q - 1))) >> Q) + x2;
Ok, maybe you hide that behind a macro, but what about cases where some of the terms are at different Q points? A fully general macro (which is no fun to write, BTW) would also have a ton of arguments, and only reduce you to something like x3 = FXMULADD(x0, Q0, x1, Q1, x2, Q2, Q3); which won't win you any awards in the clarity department.
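For the simple case where all four numbers share one Q point, the expression above can be wrapped up roughly like this. It is only a sketch: fx_muladd is a made-up name, and it assumes the compiler does an arithmetic right shift on negative values, which C leaves implementation-defined.

```c
#include <assert.h>
#include <stdint.h>

/* Fixed-point multiply-add: returns x0 * x1 + x2, where all three
 * inputs and the result have q fractional bits. */
int32_t fx_muladd(int32_t x0, int32_t x1, int32_t x2, int q) {
    assert(q > 0 && q < 31);
    /* Widen first: the raw product carries 2*q fractional bits. */
    int64_t product = (int64_t)x0 * x1;
    /* Add half an output LSB so the shift rounds to nearest. */
    product += (int64_t)1 << (q - 1);
    /* Drop q fractional bits to return to the common Q point. */
    return (int32_t)(product >> q) + x2;
}
```

At Q16, 1.5 is 98304, 2.0 is 131072 and 0.25 is 16384, so fx_muladd(98304, 131072, 16384, 16) gives 212992, which is 3.25. Compare that with just writing f0 * f1 + f2.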
And look at the operations themselves, too. You have type promotion, extra adds and shifts... the instruction sequence itself isn't super efficient. It pays off when floating point takes 10s and 100s of cycles, but is a dubious win when most of the core FP starts coming down into the single digits. With the Pentium's dual pipes and the fact you could keep integer instructions flowing in parallel to the float, that's effectively what happened. And notice we haven't even talked about dynamic range and overflow errors and how they screw you up. If you have to add tests for that... yuck. With floating point, you degrade gracefully if your dynamic range spikes a little higher than you expect.
Anyway, getting back on topic: This isn't the first time an x86 has had a stack-pointer related bug. I remember the 80386s that had the so-called "POPAD bug". That one was a bit easier to hit.
Hopefully, AMD will be able to publish a microcode update or something to work around theirs. That's one thing modern x86s have over their predecessors: A good number of CPU bugs can be patched around with microcode updates. I believe Intel added that with the Pentium Pro, and AMD followed suit. I believe my Phenom is one of the affected parts. I guess I'll have to keep an eye out for such a patch.

--
Program Intellivision!
Bulldozer not affected. by m.dillon · 2012-03-06 03:57 · Score: 5, Informative

AMD has indicated to me that the Bulldozer is not affected, which is a relief.
I guess I should have realized this would get slashdotted. In any case, it took quite a bit of effort to track the bug down. It was very difficult to reproduce reliably. It isn't a show stopper in that it really takes a lot of work to get it to happen and most people will never see it, but it's certainly a significant bug owing to the fact that it can be reproduced with normal instruction sequences.
I began to suspect it might be a cpu bug last year and after exhaustive testing I posted my suspicions in December:
http://leaf.dragonflybsd.org/mailarchive/kernel/2011-12/msg00025.html
Older versions of GCC were more prone to generate the sequence of POPs + RET, coupled with a deep recursion and other stack state, that could result in the bug. It just so happened that DragonFly's buildworld hit the right combination inside gcc, and even then the bug only occurred sometimes and only on a small subset of .c files being compiled (like maybe 2-3 files). The bug never manifested anywhere else, doing anything else, running any other application. Ever.
In particular the bug disappeared with later versions of GCC and disappeared when I messed with the optimizations. We use -O by default, not -O2. The bug disappeared when I produced code with gcc -O2 (using 4.4.7).
It is really unlikely that Linux is affected... the sensitivity to particular code sequences laid out in the compiler is so fine that adding a single instruction virtually anywhere could make the bug disappear. Even just shifting the stack pointer a little bit would make it disappear.
In any case, for a programmer like me being able to find an honest-to-god cpu bug in a modern cpu is very cool :-)
-Matt
1. Re:Bulldozer not affected. by m.dillon · 2012-03-06 04:17 · Score: 5, Interesting
  
  Since the cat is out of the bag some further clarification is required so I will include some more of the email I received. I didn't quite mean for it to explode onto the scene this quickly, but oh well.
  Again, note that this is *NOT* an issue with Bulldozer. And they will have a MSR workaround for earlier models.
  >> quote
  "AMD has taken your example and also analyzed the segmentation fault and the fill_sons_in_loop code. We confirm that you have found an erratum with some AMD processor families. The specific compiled version of the fill_sons_in_loop code, through a very specific sequence of consecutive back-to-back pops and (near) return instructions, can create a condition where the processor incorrectly updates the stack pointer.
  AMD will be updating the Revision Guide for AMD Family 10h Processors and the Revision Guide for AMD Family 12h Processors, to document this erratum. In this documentation update, which will be available on amd.com later this month, the erratum number for this issue will be #721. The revision guide will also note a workaround that can be programmed in a model-specific register (MSR)."
  end quote
  They go on to document a specific workaround when the MSR is not programmed, which is basically to add a nop for every five pop+return instructions (though I'm not sure if the nop must occur between sequences or within the sequence). I will note that just the presence of 5xPOP + RET does not trigger the bug alone, it requires a very specific set of circumstances set up prior to that (that gcc's fill_sons_in_loop() procedure was able to trigger when gcc 4.7.x was compiled -O, when compiling particular .c files).
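To make the 5xPOP + RET pattern concrete, here is a rough C sketch that scans a buffer of x86-64 machine code for runs of register POPs followed directly by a near RET. It is a heuristic, not a disassembler, and not anything AMD or DragonFly ships: count_pop_ret_runs is an invented name, and the scan only knows the one-byte POP opcodes 0x58 to 0x5F, an optional 0x41 REX prefix for r8-r15, and the 0xC3 near return. Since it walks raw bytes, it can misread data or instruction operands as opcodes. Finding the pattern says nothing about whether the erratum would actually fire, since that needs the additional stack state described above.

```c
#include <assert.h>
#include <stddef.h>

/* Count places where at least `min_pops` consecutive POP r64
 * instructions are immediately followed by a near RET (0xC3). */
size_t count_pop_ret_runs(const unsigned char *code, size_t len,
                          unsigned min_pops) {
    assert(code != NULL || len == 0);
    size_t hits = 0;
    unsigned run = 0;   /* POPs seen back-to-back so far */
    size_t i = 0;
    while (i < len) {
        size_t j = i;
        /* Skip an optional REX.B prefix (pop r8..r15). */
        if (code[j] == 0x41 && j + 1 < len) {
            j++;
        }
        if (code[j] >= 0x58 && code[j] <= 0x5F) {
            run++;          /* one more POP in the current run */
            i = j + 1;
            continue;
        }
        if (code[i] == 0xC3 && run >= min_pops) {
            hits++;         /* run of POPs ended in a near RET */
        }
        run = 0;            /* any other byte breaks the run */
        i++;
    }
    return hits;
}
```

A typical epilogue such as pop rbx; pop rbp; pop r12; pop r13; pop r14; ret encodes as 5B 5D 41 5C 41 5D 41 5E C3 and would be counted once with min_pops set to 5.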
  As I said, this bug was very difficult to reproduce. It took a year to isolate it and find a test case that would reproduce it in a few seconds. Until then it was taking me upwards of 2 days to reproduce it on a 48-core and much longer to reproduce it on a 4-core.
  Since the bug is sensitive to the stack pointer address, the initial stack randomization that DragonFly does multiplied the time it took to reproduce the bug. But without the stack randomization the bug would NOT reproduce at all (I would never have observed it in the first place). In other words, the bug was *very* stack address sensitive on top of everything else.
  I was ultimately able to improve the time it took to reproduce the bug by poring over all my previous buildworld runs and finding the .c files that gcc had compiled that were most statistically likely for gcc to seg-fault in. Then once I isolated the files I iterated all possible starting stack offsets and eventually managed to reproduce the bug within 10 seconds using a gcc loop (10-20 gcc runs on the same file).
  Changing the stack offset by a mere 16 bytes made the bug go away completely. The one or two particular stack offsets that reproduced the bug could then be further offset in multiples of 32K and still reproduce the bug at the same rate. With a later version of gcc, the bug disappeared. Compiling with virtually any other options (turning optimizations on and off)... the bug disappeared.
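The offset behaviour described above amounts to simple modular arithmetic, sketched here in C. This is only an illustration of the observation: REPRO_STRIDE and in_same_repro_class are made-up names, and the thread does not say that every other offset is safe, only that a 16-byte change was and that 32K multiples were not.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* The 32K stride reported above; the macro name is hypothetical. */
#define REPRO_STRIDE ((size_t)32 * 1024)

/* True if `candidate` differs from a known reproducing stack offset
 * by a whole multiple of 32K, i.e. it should behave the same way
 * according to the observation above. */
bool in_same_repro_class(size_t known_bad_offset, size_t candidate) {
    return (candidate % REPRO_STRIDE) == (known_bad_offset % REPRO_STRIDE);
}
```

So an offset of base + 3 * 32768 lands in the same class as base, while base + 16 does not, which matches the bug vanishing after a 16-byte shift.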
  On the bright side, I thought this was a bug in DragonFly for most of last year and set about 'fixing' it, and wound up refactoring most of DragonFly's VM system to get rid of SMP bottlenecks and making it perform much better on SMP in the face of a high VM fault rate. So even though we wound up not doing the 2.12 release, the eventual 3.0 release (that we just put out recently) has greatly improved cpu-bound performance on SMP systems.
  -Matt
Re:cool, but...? by m.dillon · 2012-03-06 04:33 · Score: 4, Funny

What's really amusing is that I've been on the scene for so long if you google my name 'Matthew Dillon', the first entry is actually... me! And not the actor(s). I'm sure that grinds a bit but I do bask in the occasional fan mail reaching my inbox, just before I hit the 'delete' key.
In recent years it's started to flip back and forth, and I expect Hollywood will again take over the top spot after things die down again :-)
-Matt
Re:This is why I use Intel, Quality by m.dillon · 2012-03-06 05:20 · Score: 4, Insightful

Intel has had quite a few serious chip bugs too, all in errata. A number of new cpu bugs in both AMD and Intel chips always appear in new generations, but both companies have very large test suites and the number of new bugs goes down in every generation.
Don't forget that Intel had to recall a sandybridge chipset early in the sandybridge cycle, which cost them something like a billion dollars because the related motherboards had to be thrown away and replaced. That was due to internal on-chip circuitry related to a SATA port burning out.
Right at this moment AMD has two issues facing it in order to compete on workstations: (1) Power and (2) Performance. Their initial Bulldozer release clearly depends too much on compiler optimizations to make full use of the architecture. They will clearly have to bulk up some of the simplifications they made that made their cpu cores a little too sensitive to instruction sequences generated by compilers and I hope their next few releases will do better.
On power consumption it comes down to the Fab as much as anything else. Their dependence on the Fab is clearly a problem and they've made a break for it to try to solve it, even though it is costing them dearly. At the same time Intel has made some major advances in their three fabs, to the point where Intel can do their entire production on just two of those three fabs now but they decided to keep the third fab because they think they can 'grow into' it.
So AMD definitely has some work ahead of it, and I am hoping they reserve some of their focus for the high-end and don't concentrate entirely on laptops. I always like to say that I love AMD, but in the stock market I invest in Intel. That's just business. But I got on the AMD bandwagon big-time when they got to 64-bit first and I stuck with them all the way through the Phenom II.
Now, at this moment, Intel's SandyBridge has the best value and AMD's Bulldozer is quite far behind, so new purchases for me right now are Intel. That may change in the next year or two and when it does my new purchases will happily be in the AMD camp again. Frankly, AMD only has to get within shouting distance (~8%) of Intel and I will happily use AMD. AMD doesn't have to beat Intel.
I think there are a number of things AMD can do right now to compete better with Intel. One of the biggest is in the mini-server department (albeit clearly with lower volumes than their current focus on laptops & integrated graphics). AMD consumer cpus (aka Phenom II) always had ECC support but very few motherboards actually supported it, which made it difficult to use AMD for mini-servers and avoid the Intel Xeon tax to get ECC. If AMD worked on the mobo vendors to ALWAYS support an ECC option that would allow them to compete against Intel Xeons on price, even if they are unable to compete on performance.
On the opterons AMD clearly has the right idea going with high-core-count cpus, but the memory subsystem is lagging too much to really be able to make use of all those cores. That seems to be low-hanging fruit to me, something which should be readily addressable by AMD. The opterons still have a lot of value and potentially can have a radical improvement in value with Bulldozer, but only if AMD can push the core count and improve the memory subsystem.
On large multi-core boxes AMD also needs to improve CMPXCHG and other atomic instructions in situations where contention is high. Right now multi-chip opteron systems seriously lag Intel on contended latency due to cache coherency inefficiencies. Will Bulldozer fix those latency issues? I don't know.
AMD only needs to get within shouting distance of Intel for me to buy their chips, and work their mobo producers a bit more to get better overall support for their chip's capabilities. They don't have to beat Intel.
-Matt