AMD Confirms CPU Bug Found By DragonFly BSD's Matt Dillon
An anonymous reader writes "Matt Dillon of DragonFly BSD just announced that AMD confirmed a CPU bug he found. Matt quotes part of the mail exchange and it looks like 'consecutive back-to-back pops and (near) return instructions can create a condition where the processor incorrectly updates the stack pointer.' The specific manifestations in DragonFly were random segmentation faults under heavy load."
Matt Dillon is a rather famous programmer (as programmers go). I assume that's why they mention him by name. I think a very large percentage of old Amiga hackers know who he is. He's also done work on the Linux kernel. Despite all that, he's best known for his work on FreeBSD and on his DragonFly BSD project. A lot of old timers will know that, but not everyone else will.
Oh, I've found CPU bugs before. But I never found one others hadn't already found. The 16MHz 80386 had a bug with counters. If you did a REP MOVSW or similar instruction in a 16 bit mode, starting on an odd address, and you made the pointer registers roll over, the CPU would lock up. Couldn't handle the transition from 0xFFFF to 0x0001 in either direction. That was fixed in all the faster 386's. As I recall, there were about a dozen bugs in the 386. Of course later processors were all checked for those specific bugs, so they never happened again.
Then there are unintended features such as pipeline oddities. If you have self-modifying code, and it changes the destination of a jump instruction immediately before executing it, the computer will jump to the old address. Step through those same instructions in a debugger, and it will jump to the new address. Strictly speaking, jumping to the old address is incorrect, but it doesn't break any good code and fixing it would wreck pipelining. This behavior has been known for a long time, and every CPU from at least the 386 to the Pentium 4 behaves this way. It wasn't an important problem because so little code was self-modifying. It wasn't any good as a copy protection method either, as only an amateur would be fooled by it. I think it's been resolved in at least 2 ways. First, by amending the documentation for the instruction set to expressly state that behavior is undefined in such a case, and second, by proving that there is never any need for self-modifying code. There's also the move to make the separation between code and data explicit. Now we have No eXecution bits.
There are sometimes even Easter eggs. For some processors, a few unassigned opcodes performed a useful operation. It wasn't by design. Is that a bug? Another case was the use of out of bounds values. For instance, the ancient 6502 supports this packed decimal arithmetic mode, in which 0x99 meant 99. So what happened when some joker gave it an illegal value such as 0xFF? 0xFF was interpreted as 15*10+15 = 165, and one could perform some math on it and get correct results. Divide 0xFF by 2 (shift right), and it would compute the correct result of 0x82. That sort of thing makes life tough for emulators, and I have yet to find an Apple II emulator that reproduces that behavior faithfully.
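To make the arithmetic in that last example concrete, here is a minimal sketch in Python. It only reproduces the nibble math described above (high nibble times 10 plus low nibble); it does not model what a real 6502 does internally, and the function names are made up for illustration.

```python
def packed_bcd_value(byte: int) -> int:
    """Read a byte as two decimal digits: high nibble * 10 + low nibble.

    Nibbles above 9 are not rejected, which is exactly the
    out-of-bounds case the comment above describes.
    """
    if not 0 <= byte <= 0xFF:
        raise ValueError("expected a single byte")
    high, low = byte >> 4, byte & 0x0F
    return high * 10 + low


def is_valid_bcd(byte: int) -> bool:
    """True only when both nibbles are legal decimal digits (0-9)."""
    return (byte >> 4) <= 9 and (byte & 0x0F) <= 9


print(packed_bcd_value(0x99))  # 99, a legal packed value
print(packed_bcd_value(0xFF))  # 165, i.e. 15*10 + 15
print(is_valid_bcd(0xFF))      # False
```

Under this reading, 0x82 stands for 82, which is 165 halved and rounded down, so the numbers in the comment are at least self-consistent.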
Intellectual Property is a monopolistic, selfish, and defective concept. It is "tyranny over the mind of man"
Heh. I coded a nice tile based RPG out of it, but I couldn't make it MMOG because there is no socket code in Quick Basic. The trick to making big games in Quick Basic is to write your own Virtual Disk so you can get past the 640k memory limit. Once you have a virtual disk, you can write an interpreted language inside Quick Basic, then your code is simply loaded up in a custom database. I rewrote the whole thing in C/C++ because people told me I could get socket libraries in it, but I gave up on my game entirely when Ultima Online came out because I felt I wouldn't be able to build up a market because my graphics are so bad. I was partially right in thinking there was only enough room for one MMORPG at a time back in 97, but with things like Farmville succeeding today, I think I shouldn't have given up after having coded for thousands of hours.
God spoke to me
Well, $604.94/ea. The memory came with an 8GB Class 4 micro SD, and we got a $10 newegg gift card each. I forget what that was bundled with. If you consider a gift card as cash, they were under $600/ea.
All of those are quantity 1, except the hard drives. They are 8 core 4GHz (always running in Turbo mode), with 16GB RAM, and RAID 1 on the drives. I opted to go more like Google's topless server. I used cable ties to mount up everything on wire racks from Home Depot. Ya, the same plastic/rubber coated ones you'd use in your closet. This is serving out of my house on a business FiOS line, so no one at a datacenter can complain. :) They're running amazingly cool. Because there's nothing interrupting normal convection air currents, all the heat sinks and drives are cool to the touch. They're a bit quieter than my desktop PC, because I don't require extra fans to pull the hot air out of the case. My regular desktop has a 250cfm fan on it to keep it cool. Without it, and with the side on, it can overheat in a few minutes when gaming.
The room does have an air conditioning return in it, which helps keep the room cool. The only fan I added was a HEPA filter. It's oversized for the room, but it'll help keep dust off the machines. The room is the same temperature as the rest of the house, so I'm happy with it. It serves no purpose for cooling the machines, since it's not even pointed at them. :)
I have some pretty low load servers. Rather than buying a dozen of anything, I opted for using virtual machines. These two servers are hosting 4 VMs at this time, and there will be more. It's a young setup, and I have a lot of work to do on it. I opted to use VirtualBox. It works very well. I had intended to try VMware ESXi or Citrix XenServer. Unfortunately, neither would use the crappy software RAID that the boards provide, and I wasn't willing to drop money on real RAID controllers. I looked around a bit, and it seems that you can try to use some workarounds, but I didn't have the time or inclination to do it, where I could have VirtualBox going in less than an hour.
The VMs are redundant between servers. Further on, you can read more about how I did it in the past between physical boxes. So if a single VM crashes, who cares. If a VM host crashes, well, it's reduced redundancy, but I'm still operating. I'm going to put out more VM hosts, and increase the redundancy. 4 machines with 6 VMs each is like 24 physical boxes. That's a serious savings, especially where the VM host costs about $600.
Let me give you a little history. :)
Long before Google published the pictures of the way they do servers, the company I was at was using COTS parts. That was voyeurweb.com (NSFW). They were hosting with a company not to be named (as in, I can't remember), who sold them on a $50k investment in a Sun server. They promised it was more power than anyone could ever want. That lasted about 3 days. It was after this that I got involved with them. We dropped about $15k on 10 servers. They were fairly cheap machines. Asus gaming motherboards, AMD K6/2 300 CPU, 512MB RAM, 8GB and 20GB IDE drives. The most expensive part at the time was the cases. It was pretty much what you'd be using at home at the time.
We had the occasional failures, but they were usually due to load or CPU fan failures. At the time, they had under 1 million daily viewers, so we could handle that load on 4 of the 10 machines. Load balancing was done with DNS round robin. I know people say it's a poor system, but it worked well. There was typically a 3 second delay if you happened to hit a bad server, and then you'd roll off to the nex
Serious? Seriousness is well above my pay grade.
Since the cat is out of the bag some further clarification is required so I will include some more of the email I received. I didn't quite mean for it to explode onto the scene this quickly, but oh well.
Again, note that this is *NOT* an issue with Bulldozer. And they will have an MSR workaround for earlier models.
>> quote
"AMD has taken your example and also analyzed the segmentation fault and the fill_sons_in_loop code. We confirm that you have found an erratum with some AMD processor families. The specific compiled version of the fill_sons_in_loop code, through a very specific sequence of consecutive back-to-back pops and (near) return instructions, can create a condition where the processor incorrectly updates the stack pointer.
AMD will be updating the Revision Guide for AMD Family 10h Processors and the Revision Guide for AMD Family 12h Processors, to document this erratum. In this documentation update, which will be available on amd.com later this month, the erratum number for this issue will be #721. The revision guide will also note a workaround that can be programmed in a model-specific register (MSR)."
end quote
They go on to document a specific workaround when the MSR is not programmed, which is basically to add a nop for every five pop+return instructions (though I'm not sure if the nop must occur between sequences or within the sequence). I will note that just the presence of 5xPOP + RET does not trigger the bug alone; it requires a very specific set of circumstances set up prior to that (which gcc's fill_sons_in_loop() procedure was able to trigger when gcc 4.7.x was compiled with -O, when compiling particular .c files).
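For anyone wanting to see where such sequences sit in their own binaries, here is a rough sketch in Python that flags a RET directly preceded by five or more consecutive POPs in a list of instruction mnemonics. This is only an illustration of the pattern named above: the function name and the idea of feeding it mnemonics pulled from a disassembly listing are assumptions, and as noted, the pattern by itself is not enough to trigger the erratum.

```python
def find_pop_ret_runs(mnemonics, min_pops=5):
    """Return the index of every RET that directly follows a run of at
    least min_pops consecutive POP instructions.

    mnemonics is a plain list of opcode names (hypothetical input,
    e.g. extracted from a disassembly listing by some other tool).
    """
    hits = []
    run = 0  # length of the current run of consecutive POPs
    for i, op in enumerate(mnemonics):
        name = op.lower()
        if name == "pop":
            run += 1
            continue
        if name == "ret" and run >= min_pops:
            hits.append(i)
        run = 0  # any non-POP instruction ends the run
    return hits


epilogue = ["add", "pop", "pop", "pop", "pop", "pop", "ret"]
print(find_pop_ret_runs(epilogue))  # [6]
```

A hit only marks a candidate spot; whether a nop belongs before, after, or inside the sequence is the open question mentioned above, so the sketch does not try to patch anything.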
As I said, this bug was very difficult to reproduce. It took a year to isolate it and find a test case that would reproduce it in a few seconds. Until then it was taking me upwards of 2 days to reproduce it on a 48-core and much longer to reproduce it on a 4-core.
Since the bug is sensitive to the stack pointer address, the initial stack randomization that DragonFly does multiplied the time it took to reproduce the bug. But without the stack randomization the bug would NOT reproduce at all (I would never have observed it in the first place). In other words, the bug was *very* stack address sensitive on top of everything else.
I was ultimately able to improve the time it took to reproduce the bug by poring over all my previous buildworld runs and finding the .c files that gcc had compiled that were most statistically likely for gcc to seg-fault in. Then once I isolated the files I iterated all possible starting stack offsets and eventually managed to reproduce the bug within 10 seconds using a gcc loop (10-20 gcc runs on the same file).
Change the stack offset by a mere 16 bytes and the bug went away completely. The one or two particular stack offsets that reproduced the bug could then be further offset in multiples of 32K and still reproduce the bug at the same rate. Use a later version of gcc and the bug disappeared. Compile with virtually any other options (turning on and off optimizations)... the bug disappeared.
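The offset search described above can be sketched roughly as follows in Python. Everything here is an assumption made for illustration: run_once stands in for "launch gcc on the suspect file with this starting stack offset and report whether it survived", and make_fake_run is a simulated stand-in so the sketch runs anywhere. The bad offset 0x1230 is an invented value; only the 16-byte step and the 32K period come from the description.

```python
import random


def sweep_offsets(run_once, offsets, runs_per_offset=20):
    """Try each stack offset runs_per_offset times.

    run_once(offset) must return True on success and False on a crash.
    Returns {offset: failure_count} for offsets that failed at least once.
    """
    failures = {}
    for off in offsets:
        count = sum(1 for _ in range(runs_per_offset) if not run_once(off))
        if count:
            failures[off] = count
    return failures


def make_fake_run(bad_offset=0x1230, period=32 * 1024, fail_rate=0.5, seed=0):
    """Simulated test run (hypothetical): crashes with probability
    fail_rate, but only at bad_offset plus multiples of period."""
    rng = random.Random(seed)

    def run_once(offset):
        if offset % period == bad_offset:
            return rng.random() >= fail_rate  # True means the run survived
        return True

    return run_once


# Sweep 64K of offsets in 16-byte steps against the simulated bug.
result = sweep_offsets(make_fake_run(), range(0, 64 * 1024, 16))
print(sorted(hex(off) for off in result))
```

With the simulated bug, only the bad offset and its 32K-shifted copy should show failures, while neighbors 16 bytes away stay clean, which mirrors the behavior described in the post.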
On the bright side, I thought this was a bug in DragonFly for most of last year and set about 'fixing' it, and wound up refactoring most of DragonFly's VM system to get rid of SMP bottlenecks and make it perform much better on SMP in the face of a high VM fault rate. So even though we wound up not doing the 2.12 release, the eventual 3.0 release (that we just put out recently) has greatly improved cpu-bound performance on SMP systems.
-Matt