AMD Confirms CPU Bug Found By DragonFly BSD's Matt Dillon
An anonymous reader writes "Matt Dillon of DragonFly BSD just announced that AMD confirmed a CPU bug he found. Matt quotes part of the mail exchange and it looks like 'consecutive back-to-back pops and (near) return instructions can create a condition where the processor incorrectly updates the stack pointer.' The specific manifestations in DragonFly were random segmentation faults under heavy load."
I can only imagine the time and effort spent on tracking down this problem - a rare CPU condition is exponentially more difficult to narrow down than most programming mistakes. A lot of progress in IT depends on engineers like this, who obsessively solve problems even when it's much easier to just ignore them, try to hack around them or pass the buck around. Kudos.
I was trying to write the first MMORPG using Quick Basic.
Sounds like the division bug was the least of your problems....
A pertinent addition to the submission would be which CPUs have been found to be affected.
The second link says Opteron 6168 and Phenom II X4 820. For a second I thought that Bulldozer hadn't managed to do anything right, but these two examples are pre-Bulldozer.
No doubt this is not an exhaustive list.
Matt Dillon is a rather famous programmer (as programmers go). I assume that's why they mention him by name. I think a very large percentage of old Amiga hackers know who he is. He's also done work on the Linux kernel. Despite all that, he's best known for his work on FreeBSD and on his DragonFly BSD project. While a lot of old timers will know that, not everyone else will.
Floating point operations are not always exact. A simple number such as 4.0 can come out as 4.0000000000000213 or 3.99999999999973 if you arrive at it after doing a bunch of calculations.
This is an inherent limitation of how floating point works, and not something that has been "fixed". Programmers still have to worry about this.
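To make the rounding point concrete, here is a minimal Python sketch (not from the original comment; the helper name `accumulate` is made up for illustration). It assumes ordinary IEEE 754 double precision, which is what Python floats use on common platforms. It adds 0.1 ten times and compares the result to 1.0 both exactly and with a tolerance.

```python
import math


def accumulate(step: float, count: int) -> float:
    """Add `step` to a running total `count` times (hypothetical helper)."""
    total = 0.0
    for _ in range(count):
        total += step
    return total


total = accumulate(0.1, 10)

# 0.1 has no exact binary representation, so the sum drifts off 1.0.
print(repr(total))
print(total == 1.0)               # exact comparison fails
print(math.isclose(total, 1.0))   # comparison with a tolerance succeeds

# A small integer value such as 4.0 is exactly representable on its own;
# the error only creeps in through inexact intermediate steps.
print(2.0 + 2.0 == 4.0)
```

The usual advice follows from this: compare floating point results with a tolerance rather than with `==`.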
Oh, I've found CPU bugs before. But I never found one others hadn't already found. The 16MHz 80386 had a bug with counters. If you did a REP MOVSW or similar instruction in a 16 bit mode, starting on an odd address, and you made the pointer registers roll over, the CPU would lock up. Couldn't handle the transition from 0xFFFF to 0x0001 in either direction. That was fixed in all the faster 386s. As I recall, there were about a dozen bugs in the 386. Of course later processors were all checked for those specific bugs, so they never happened again.
Then there are unintended features such as pipeline oddities. If you have self modifying code, and it changes the destination of a jump instruction immediately before executing it, the computer will jump to the old address. Step through those same instructions in a debugger, and it will jump to the new address. Strictly speaking, jumping to the old address is incorrect, but it doesn't break any good code and fixing it would wreck pipelining. This behavior has been known for a long time, and every CPU from at least the 386 to the Pentium 4 behaves this way. It wasn't an important problem because so little code was self modifying. Wasn't any good as a copy protection method either, as only an amateur would be fooled by it. I think it's been resolved in at least 2 ways. First, by amending the documentation for the instruction set to expressly state that behavior is undefined in such a case, and second, by proving that there is never any need for self modifying code. And making the separation between code and data explicit. Now we have No eXecute (NX) bits.
There are sometimes even Easter eggs. For some processors, a few unassigned opcodes performed a useful operation. It wasn't by design. Is that a bug? Another case was the use of out of bounds values. For instance, the ancient 6502 supports this packed decimal arithmetic mode, in which 0x99 meant 99. So what happened when some joker gave it an illegal value such as 0xFF? 0xFF was interpreted as 15*10+15 = 165, and one could perform some math on it and get correct results. Divide 0xFF by 2 (shift right), and it would compute the correct result of 0x82. That sort of thing makes life tough for emulators, and I have yet to find an Apple II emulator that reproduces that behavior faithfully.
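The arithmetic in that 6502 example can be spelled out with a small Python sketch. This only illustrates the "high nibble times ten plus low nibble" reading described above; it does not model a real 6502's decimal mode, and the function name `decode_packed` is made up for illustration.

```python
def decode_packed(byte: int) -> int:
    """Read a byte as two decimal digits: high nibble * 10 + low nibble.

    Nibbles above 9 are not valid packed decimal; this hypothetical helper
    simply uses their numeric value, as the comment above describes.
    """
    high = (byte >> 4) & 0x0F
    low = byte & 0x0F
    return high * 10 + low


print(decode_packed(0x99))        # 99, a valid packed decimal value
print(decode_packed(0xFF))        # 165, i.e. 15*10 + 15
print(decode_packed(0xFF) // 2)   # 82, the value that 0x82 stands for
print(decode_packed(0x82))        # 82
```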
Intellectual Property is a monopolistic, selfish, and defective concept. It is "tyranny over the mind of man"
Imagine, there is a tiny bug that makes your floating point results just slightly wrong once in 1000 times. You run an iterative dynamic simulation of a bridge under load that runs for a million cycles. The results LOOK right...
Heh. I coded a nice tile based RPG out of it, but I couldn't make it MMOG because there is no socket code in Quick Basic. The trick to making big games in Quick Basic is to write your own Virtual Disk so you can get past the 640k memory limit. Once you have a virtual disk, you can write an interpreted language inside Quick Basic, then your code is simply loaded up in a custom database. I rewrote the whole thing in C/C++ because people told me I could get socket libraries in it, but I gave up on my game entirely when Ultima Online came out because I felt I wouldn't be able to build up a market because my graphics were so bad. I was partially right in thinking there was only enough room for one MMORPG at a time back in '97, but with things like Farmville succeeding today, I think I shouldn't have given up after having coded for thousands of hours.
God spoke to me
Well, $604.94/ea. The memory came with an 8GB Class 4 micro SD, and we got a $10 newegg gift card each. I forget what that was bundled with. If you consider a gift card as cash, they were under $600/ea.
All of those are quantity 1, except the hard drives. They are 8 core 4GHz (always running in Turbo mode), with 16GB RAM, and RAID 1 on the drives. I opted to go more like Google's topless server. I used cable ties to mount up everything on wire racks from Home Depot. Ya, the same plastic/rubber coated ones you'd use in your closet. This is serving out of my house on a business FiOS line, so no one at a datacenter can complain. :) They're running amazingly cool. Because there's nothing interrupting normal convection air currents, all the heat sinks and drives are cool to the touch. They're a bit quieter than my desktop PC, because I don't require extra fans to pull the hot air out of the case. My regular desktop has a 250cfm fan on it to keep it cool. Without it, and with the side on, it can overheat in a few minutes when gaming.
The room does have an air conditioning return in it, which helps keep the room cool. The only fan I added was a HEPA filter. It's oversized for the room, but it'll help keep dust off the machines. The room is the same temperature as the rest of the house, so I'm happy with it. It serves no purpose for cooling the machines, since it's not even pointed at them. :)
I have some pretty low load servers. Rather than buying a dozen of anything, I opted for using virtual machines. These two servers are hosting 4 VMs at this time, and there will be more. It's a young setup, and I have a lot of work to do on it. I opted to use VirtualBox. It works very well. I had intended to try VMware ESXi or Citrix XenServer. Unfortunately, neither would use the crappy software RAID that the boards provide, and I wasn't willing to drop money on real RAID controllers. I looked around a bit, and it seems that you can try to use some workarounds, but I didn't have the time or inclination to do it, where I could have VirtualBox going in less than an hour.
The VMs are redundant between servers. Further on, you can read more about how I did it in the past between physical boxes. So if a single VM crashes, who cares. If a VM host crashes, well, it's reduced redundancy, but I'm still operating. I'm going to put out more VM hosts, and increase the redundancy. 4 machines with 6 VMs each is like 24 physical boxes. That's a serious savings, especially where the VM host costs about $600.
Let me give you a little history. :)
Long before Google made the pictures of the way they do servers, the company I was at was using COTS parts. That was voyeurweb.com (NSFW). They were hosting with a company not to be named (as in, I can't remember), who sold them on a $50k investment of a Sun server. They promised it was more power than anyone could ever want. That lasted about 3 days. It was after this, I got involved with them. We dropped about $15k on 10 servers. They were fairly cheap machines. Asus gaming motherboards, AMD K6/2 300 CPU, 512MB RAM, 8GB and 20GB IDE drives. The most expensive part at the time was the cases. It was pretty much what you'd be using at home at the time.
We had the occasional failures, but they were usually due to load or CPU fan failures. At the time, they had under 1 million daily viewers, so we could handle that load on 4 of the 10 machines. Load balancing was done with DNS round robin. I know people say it's a poor system, but it worked well. There was typically a 3 second delay if you happened to hit a bad server, and then you'd roll off to the nex
Serious? Seriousness is well above my pay grade.
AMD has indicated to me that the Bulldozer is not affected, which is a relief.
I guess I should have realized this would get slashdotted. In any case, it took quite a bit of effort to track the bug down. It was very difficult to reproduce reliably. It isn't a show stopper in that it really takes a lot of work to get it to happen and most people will never see it, but it's certainly a significant bug owing to the fact that it can be reproduced with normal instruction sequences.
I began to suspect it might be a cpu bug last year and after exhaustive testing I posted my suspicions in December:
http://leaf.dragonflybsd.org/mailarchive/kernel/2011-12/msg00025.html
Older versions of GCC were more prone to generate the sequence of POPs + RET, coupled with a deep recursion and other stack state, that could result in the bug. It just so happened that DragonFly's buildworld hit the right combination inside gcc, and even then the bug only occurred sometimes and only on a small subset of .c files being compiled (like maybe 2-3 files). The bug never manifested anywhere else, doing anything else, running any other application. Ever.
In particular the bug disappeared with later versions of GCC and disappeared when I messed with the optimizations. We use -O by default, not -O2. The bug disappeared when I produced code with gcc -O2 (using 4.4.7).
It is really unlikely that Linux is affected... the sensitivity to particular code sequences laid out in the compiler is so fine that adding a single instruction virtually anywhere could make the bug disappear. Even just shifting the stack pointer a little bit would make it disappear.
In any case, for a programmer like me being able to find an honest-to-god cpu bug in a modern cpu is very cool :-)
-Matt