Whose Bug Is This Anyway?


Posted by Soulskill on Tuesday December 18, 2012 @01:04PM from the it's-nobody's-fault-and-everybody's-angry dept.

An anonymous reader writes "Patrick Wyatt, one of the developers behind the original Warcraft and StarCraft games, as well as Diablo and Guild Wars, has a post about some of the bug hunting he's done throughout his career. He covers familiar topics — crunch time leading to stupid mistakes and finding bugs in compilers rather than game code — and shares a story about finding a way to diagnose hardware failure for players of Guild Wars. Quoting: '[Mike O'Brien] wrote a module ("OsStress") which would allocate a block of memory, perform calculations in that memory block, and then compare the results of the calculation to a table of known answers. He encoded this stress-test into the main game loop so that the computer would perform this verification step about 30-50 times per second. On a properly functioning computer this stress test should never fail, but surprisingly we discovered that on about 1% of the computers being used to play Guild Wars it did fail! One percent might not sound like a big deal, but when one million gamers play the game on any given day that means 10,000 would have at least one crash bug. Our programming team could spend weeks researching the bugs for just one day at that rate!'"
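A rough sketch of the kind of check the quote describes. This is not the actual OsStress code: the function name, block size, and the calculation are all assumptions chosen so the known answer can be verified by hand.

```c
#include <stdint.h>
#include <stdlib.h>

#define STRESS_WORDS    4096u
/* Known answer: 0 + 1 + ... + 4095 = 4096 * 4095 / 2. */
#define STRESS_EXPECTED 8386560u

/* Hypothetical per-frame check. Returns 1 if the result matches the
   known answer, 0 on a mismatch, -1 if the block could not be allocated. */
int stress_check(void)
{
    /* volatile so the compiler really performs the loads and stores */
    volatile uint32_t *block = malloc(STRESS_WORDS * sizeof *block);
    uint32_t sum = 0;
    uint32_t i;

    if (block == NULL)
        return -1;

    for (i = 0; i < STRESS_WORDS; i++)
        block[i] = i;            /* write a known sequence into the block */
    for (i = 0; i < STRESS_WORDS; i++)
        sum += block[i];         /* read it back and do arithmetic on it */

    free((void *)block);
    return sum == STRESS_EXPECTED;
}
```

On healthy hardware a call like this should always return 1; a game loop could call it once per frame and log any 0 as a likely hardware fault rather than a code bug.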


The memory thing... by Loopy · 2012-12-18 13:15 · Score: 5, Informative

...is pretty much what those of us that build our own systems do anytime we upgrade components (RAM/CPU/MB) or experience unexplained errors. It's similar to running the Prime95 torture tests overnight, which also checks calculations in memory against known data sets for expected values.
Good stuff for those that don't already have a knack for QA.
1. Re:The memory thing... by AaronLS · 2012-12-18 13:46 · Score: 5, Interesting
  
  "The defect rate on hardware is so low you don't need to"
  I think the point of the article is to cast significant doubt on statements like this.
2. Re:The memory thing... by DMUTPeregrine · 2012-12-18 13:47 · Score: 5, Informative
  
  Unless you're trying to overclock.
  Admittedly that's a small percentage of the populace, even among people who build their own systems.
  
  --
  Not a sentence!
3. Re:The memory thing... by Alwin+Henseler · 2012-12-18 14:39 · Score: 5, Informative
  
  "The defect rate on hardware is so low you don't need to - buy your stuff from Newegg, assemble, and install. Either it's DOA or runs forever."
  
  Look up "bathtub curve" sometime. Even well-built, perfectly working gear is aging, aging usually translates into "reduced performance / reliability", and any electronic part will fail sometime. Possibly gradually. Especially the just-makes-it-past-warranty crap that's sold these days. And there may be instabilities / incompatibilities that only show under very specific conditions (like when a system is pushed really hard).
  That's ignoring things like ambient temperature variations, CPU coolers clogging with dust over the years, sporadic contact problems on connectors, or the odd cosmic ray that nukes a bit in RAM (yes that happens, too). A lot of things must come together to have (and keep) a reliable working computer, so a lot of things can go wrong and put an end to that.
4. Re:The memory thing... by DigiShaman · 2012-12-18 15:38 · Score: 5, Interesting
  
  I think it's a crying shame that the PC industry hasn't forced ECC as a mandatory standard. Servers and workstations have it, and with memory as cheap as it is to fab, there's absolutely -zero- excuse not to use ECC!!! With the transistor count as densely packed and small, errors will occur. I'll go a step further and even recommend ECC throughout the entire motherboard bridge buses. End-to-end error correction should be a requirement!
  
  --
  Life is not for the lazy.
Wait its possible?! by Anonymous Coward · 2012-12-18 13:16 · Score: 5, Funny

You mean all those times when my code was 'fine' and i gave up it really could have been the compiler or a memory problem
shit i'm a much better programmer than i realized
1. Re:Wait its possible?! by disambiguated · 2012-12-18 15:58 · Score: 5, Insightful
  
  You're a better programmer for assuming it's not a compiler bug and trying harder to figure out what you did wrong.
  
  I've been programming professionally for over 20 years, mostly in C/C++ (MSVC, GCC, and recently Clang, plus others back in the olden days). I've seen maybe two serious compiler bugs in the past 10 years. They used to be common.
  
  On the other hand, I can't count how many times I've seen coders insist there must be a compiler bug when after investigation, the compiler had done exactly what it should according to the standard (or according to the compiler vendor's documentation when the compiler intentionally deviated from the standard).
  
  By "serious", I mean the compiler itself doesn't crash, issues no warnings or errors, but generates incorrect code. Maybe I've just been lucky. (Or maybe QA just never found them ;-)
  
  Oh, and btw, yes I realize you were joking (and I found it funny.)
OsStress by larry+bagina · 2012-12-18 13:20 · Score: 5, Informative

Microsoft found similar impossible bugs when overclocking was involved.

--
Do you even lift?
These aren't the 'roids you're looking for.
1. Re:OsStress by Anonymous Coward · 2012-12-18 14:33 · Score: 5, Interesting
  
  Then again, it might not be overclocking after all.
  More relevantly, Microsoft has access to an enormous wealth of data about hardware failures from Windows Error Reporting. This paper has some fascinating data in it:
  - Machines with at least 30 days of accumulated CPU time over an 8 month period had a 1 in 190 chance of crashing due to a CPU subsystem fault
  - Machines that crashed once had a probability of 1 in 3.3 of crashing a second time
  - The probability of a hard disk failure in the first 5 days of uptime is 1 in 470
  - Once you've had one hard disk failure, the probability of a second failure is 1 in 3.4
  - Once you've had two failures, the probability of a third failure is 1 in 1.9
  Conclusion: When you get a hard disk failure, replace the drive immediately.
2. Re:OsStress by Anonymous Coward · 2012-12-18 14:48 · Score: 5, Insightful
  
  Bullshit. While Intel does occasionally bin processors into lower speeds to fulfill quotas and such, often times those processors are binned lower because they can't pass the QA process at their full speed. But they can pass the QA process when running at a lower speed. These processors were meant to be the same as the more expensive line, but due to minor defects can't run stably or reliably at the higher speed. Or at least not enough for Intel to sell them at full speed.
  Which is a large part of why some processors in the same batch can handle it when others can't.
  As much as I hate Intel, I think we could at least realize that they are often times doing this with good reason.
Caution: by fahrbot-bot · 2012-12-18 13:25 · Score: 5, Funny

Bug hunts on LV-426 often end badly.

--
It must have been something you assimilated. . . .
Re:stress test by SJHillman · 2012-12-18 13:43 · Score: 5, Funny

In my field, I have a bunch of grass, a few shrubs and even a small tree. Lots of rodents and birds. If a computer can survive two weeks sitting in my field and still power on, you have a damned good system. If not, you're left with people wondering why you left your computer in my field for two weeks.
Re:I don't believe 1% of computers give wrong answ by PaladinAlpha · 2012-12-18 13:49 · Score: 5, Insightful

You don't have any idea what you're talking about, and that's why you don't understand what he's talking about.
Re:stress test by AaronLS · 2012-12-18 13:50 · Score: 5, Funny

He didn't say anything about a computer: "In my field, if YOU can survive"... scary...
How to deal with compiler bugs by MtHuurne · 2012-12-18 13:52 · Score: 5, Insightful

If you suspect the compiler is generating invalid machine code, try to make a minimal test case for it. If you succeed, file a bug report and add that test case; the compiler developers will appreciate it. If you don't succeed in finding a minimal test case that triggers the same issue, it's likely not a compiler bug but an issue in your program in some place where you weren't expecting it.
Re:I don't believe 1% of computers give wrong answ by Jeremi · 2012-12-18 14:06 · Score: 5, Insightful

"I think this is bull. I just don't believe 1% of computers give wrong answers"
1% of all computers? Probably not.
1% of gamers' computers, in an era when PC gaming technology was progressing very quickly, and so gamers were often running overclocked (or otherwise poorly set up) hardware? Sounds plausible enough.

--

I don't care if it's 90,000 hectares. That lake was not my doing.
Re:I don't believe 1% of computers give wrong answ by MtHuurne · 2012-12-18 14:08 · Score: 5, Insightful

He said 1% of computers that were used to play Guild Wars gave wrong answers. Gaming PCs are more likely than the average PC to be overclocked too far, to have undersized power supplies, or to have overheating issues. 1% doesn't sound unrealistically high to me.
Yep, seen it all by russotto · 2012-12-18 14:26 · Score: 5, Insightful

I've had compilers miscompile my code, assemblers mis-assemble it, and even in a few cases CPUs mis-execute it consistently (look up CPU6 and MSP430). Random crashes due to bad memory/CPU... yep. But on very rare occasions, I find that the bug is indeed in my own code, so I check there first.
Typical for safety cert programs by Okian+Warrior · 2012-12-18 14:32 · Score: 5, Interesting

We deal with this type of bug all the time in safety-certified systems (medical apps, aircraft, &c).
Most of the time an embedded program doesn't use up 100% of the CPU time. What can you do in the idle moments?
Each module supplies a function "xxxBIT" (where "BIT" stands for "Built In Test") which checks the module variables for consistency.
The serial driver (SerialBIT) checks that the buffer pointers still point within the buffer, checks that the serial port registers haven't changed, and so on.
The memory manager knows the last-used static address for the program (i.e., the end of .data), and fills all unused memory with a pattern. In its spare time (MemoryBIT) it checks to make sure the unused memory still has the pattern. This finds all sorts of "thrown pointer" errors. (Checking all of memory takes a long time, so MemoryBIT only checked 1K each call.)
The stack pointer was checked - we put a pattern at the end of the stack, and if it ever changed we knew something went recursive or used too much stack.
The EEPROM was checksummed periodically.
Every module had a BIT function and we checked every imaginable error in the processor's spare time - over and over, continuously.
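The MemoryBIT idea above could look roughly like this. It is a sketch only: the names `MemoryINIT`, `unused_start` and `unused_len` are made up for illustration, the 0xA5 fill byte is an arbitrary choice, and on a real target the region would come from the linker map rather than a function argument.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define FILL_PATTERN 0xA5u
#define BIT_CHUNK    1024u      /* bytes checked per call (1K, as described) */

static uint8_t *unused_start;   /* hypothetical: first byte after .data */
static size_t   unused_len;     /* size of the unused region in bytes */
static size_t   next_offset;    /* where the next MemoryBIT call resumes */

/* Fill the unused region with the pattern once, at startup. */
void MemoryINIT(uint8_t *start, size_t len)
{
    unused_start = start;
    unused_len   = len;
    next_offset  = 0;
    memset(start, FILL_PATTERN, len);
}

/* Check the next chunk of the region. Returns 1 if the pattern is
   intact, 0 if something (a stray pointer, say) has overwritten it. */
int MemoryBIT(void)
{
    size_t end = next_offset + BIT_CHUNK;
    size_t i;
    int ok = 1;

    if (end > unused_len)
        end = unused_len;
    for (i = next_offset; i < end; i++) {
        if (unused_start[i] != FILL_PATTERN)
            ok = 0;
    }
    /* wrap around so repeated calls sweep the region forever */
    next_offset = (end >= unused_len) ? 0 : end;
    return ok;
}
```

Called from the idle loop, this sweeps the whole region a little at a time; the same pattern-and-recheck trick applies to the word at the end of the stack.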
Also, every function began with a set of ASSERTs that check the arguments for validity. These were active in the released code. The extra time spent was only significant in a handful of functions, so we removed the ASSERTs only in those cases. Overall the extra time spent was negligible.
The overall effect was a very "stiff" program - one that would either work completely or wouldn't work at all. In particular, it wouldn't give erroneous or misleading results: showing a blank screen is better than showing bad information, or even showing a frozen screen.
(Situation specific: Blank screen is OK for aircraft, but not medical. You can still detect errors, log the problem, and alert the user.)
Everyone says to only use error checking during development, and remove it on released code. I don't see it that way - done right, error checking has negligible impact, and coupled with good error logging it can turbocharge your bug-fixing.
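For anyone wondering what "ASSERTs active in the released code" might look like, here is one possible shape. The macro and the `tx_queue` function are hypothetical, not taken from any real project; the point is only that, unlike `assert()` from `<assert.h>`, this check does not disappear when NDEBUG is defined.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical always-on assertion: stays active in release builds. */
#define ASSERT(cond)                                           \
    do {                                                       \
        if (!(cond)) {                                         \
            fprintf(stderr, "ASSERT failed: %s (%s:%d)\n",     \
                    #cond, __FILE__, __LINE__);                \
            abort();   /* stop rather than carry on wrong */   \
        }                                                      \
    } while (0)

#define TX_BUF_SIZE 64

/* Example function: copy len bytes into a fixed-size transmit buffer.
   The argument checks come first, as described in the comment above. */
size_t tx_queue(unsigned char *tx_buf, const unsigned char *data, size_t len)
{
    size_t i;

    ASSERT(tx_buf != NULL);
    ASSERT(data != NULL);
    ASSERT(len <= TX_BUF_SIZE);

    for (i = 0; i < len; i++)
        tx_buf[i] = data[i];
    return len;
}
```

In a real safety-certified system the failure path would log the error and go to a safe state instead of calling `abort()`; that part is deliberately left out here.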
Re:Reminded me of my first C application by richardcavell · 2012-12-18 15:46 · Score: 5, Informative

I just want to correct this, not to prove how smart I am but because there are novice programmers out there who will learn from this case. The statement:

if (i = 1) {

is equivalent to:

i = 1; /* correction */ if (i) {
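
A small function that makes the effect visible (the function name is invented for this example). Whatever value goes in, the branch is taken and `i` comes out as 1. The extra parentheses are only there to silence the warning that GCC and Clang emit for this pattern under -Wall.

```c
/* Returns the value of i after the mistaken test; *taken reports
   whether the if-branch ran. */
int assign_in_condition(int i, int *taken)
{
    *taken = 0;
    if ((i = 1)) {      /* assignment, not comparison: i becomes 1 */
        *taken = 1;     /* always reached, because 1 is non-zero */
    }
    return i;
}
```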
Even better! by Roger+W+Moore · 2012-12-18 16:15 · Score: 5, Funny

"the compiler had done exactly what it should according to the standard..."
That's even better - it means that you've found a bug in the standard! ;-)
More error checking by Okian+Warrior · 2012-12-18 16:21 · Score: 5, Interesting

My previous comment was modded up, so here are some more checks.
During boot, the system would execute a representative sample of CPU instructions, in order to test that the CPU wasn't damaged. Every mode of memory storage (ptr, ptr++, --ptr), add, subtract, multiply, divide, increment &c.
During boot, all memory was checked - not a burn-in test, just a quick check for integrity. The system wrote 0x00, 0xFF, 0xA5, 0x5A and read the output back. This checked for wires shorted to ground/VCC, and wires shorted together.
During boot, the .bss segment was filled with a pattern, and as a rule, all programs were required to initialize all of their static variables. Each routine had an xxxINIT function which was called at boot. You could never assume a static variable was initialized to zero - this caught a lot of "uninitialized variable" errors.
(This allowed us to reboot specific subsystems without rebooting the whole system. Call the SerialINIT function, and don't worry about reinitializing that section's static vars.)
The program code was checksummed (1K at a time) continuously.
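The quick boot-time memory check described above might be sketched like this. The function name is an assumption, and the test is destructive, so it has to run before anything important lives in the region being checked.

```c
#include <stddef.h>
#include <stdint.h>

/* Write each pattern to every byte and read it back. Returns 1 if every
   byte held every pattern, 0 at the first mismatch. 0x00 and 0xFF catch
   lines stuck high or low; 0xA5 and 0x5A alternate neighbouring bits,
   which helps catch lines shorted together. */
int memory_quick_check(volatile uint8_t *mem, size_t len)
{
    static const uint8_t patterns[] = { 0x00, 0xFF, 0xA5, 0x5A };
    size_t p, i;

    for (p = 0; p < sizeof patterns; p++) {
        for (i = 0; i < len; i++)
            mem[i] = patterns[p];
        for (i = 0; i < len; i++) {
            if (mem[i] != patterns[p])
                return 0;
        }
    }
    return 1;
}
```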
When filling memory, what pattern should you use? The theory was that any program using an uninitialized variable would crash immediately because of the pattern. 0xA5 is a good choice:
1) It's not 0, 1, or -1, which are common program constants.
2) It's not a printable character
3) It's a *really big* number (negative or unsigned), so array indexing should fail
4) Read as an IEEE 754 float or double, it's a tiny negative number (not NaN or infinity, but nothing a program would normally compute)
5) Being odd, it's not a valid pointer
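A few of these properties can be checked directly. This is just a sketch with an invented function name; it fills a 32-bit integer with the 0xA5 byte and returns what the program would see if it read that variable without initializing it.

```c
#include <stdint.h>
#include <string.h>

/* Returns a 32-bit value whose four bytes are all 0xA5, i.e. what an
   uninitialized int32_t would contain after the pattern fill. Every byte
   is the same, so the result does not depend on byte order. */
int32_t poisoned_int32(void)
{
    int32_t v;
    memset(&v, 0xA5, sizeof v);
    return v;
}
```

The result is 0xA5A5A5A5, which is -1515870811 as a signed value and 2779096485 as unsigned: far outside any sane array bound, odd, and clearly not 0, 1 or -1.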
Whenever we use enums, we always start the first one at a different number; ie:
enum Day { Sat = 100, Sun, Mon... }
enum Month { Jan = 200, Feb, Mar, ... }
Note that the enums for Day aren't the same as Month, so if the program inadvertently stores one in the other, the program will crash. Also, the enums aren't small integers (ie - 0, 1, 2), which are used for lots of things in other places. Storing a zero in a Day will cause an error.
(This was easy to implement. Just grep for "enum" in the code, and ensure that each one starts on a different "hundred" (ie - one starts at 100, one starts at 200, and so on).)
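A sketch of how the separate "hundreds" pay off. The enum members after the ones listed above are filled in here for illustration, and the validity functions are hypothetical names, not part of the original system.

```c
enum Day   { Sat = 100, Sun, Mon, Tue, Wed, Thu, Fri };
enum Month { Jan = 200, Feb, Mar, Apr, May, Jun,
             Jul, Aug, Sep, Oct, Nov, Dec };

/* Range checks that an ASSERT at the top of a function could use.
   They take int so a bad value can be tested without a cast. */
int day_is_valid(int d)   { return d >= Sat && d <= Fri; }
int month_is_valid(int m) { return m >= Jan && m <= Dec; }
```

Because the ranges don't overlap, a Month stored in a Day (or a stray 0) fails the check immediately instead of quietly being treated as some other day.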
The nice thing about safety cert is that the hardware engineer was completely into it as well. If there was any way for the CPU to test the hardware, he'd put it into the design.
You could loopback the serial port (ARINC on aircraft) to see if the transmitter hardware was working, you could switch the A/D converters to a voltage reference, he put resistors in the control switches so that we could test for broken wires, and so on.
(Recent Australian driver couldn't get his vehicle out of cruise-control because the on/off control wasn't working. He also couldn't turn the engine off (modern vehicle) nor shift to neutral (shift-by-wire). Hilarity ensued. Vehicle CPU should abort cruise control if it doesn't see a periodic heartbeat from the steering-wheel computer. But, I digress...)
If you're interested in software safety systems, look up the Therac-25 some time. Particularly, the analysis of the software bugs. Had the system been peppered with ASSERTs, no deaths would have occurred.
P.S. - If you happen to be building a safety cert system, I'm available to answer questions.