Whose Bug Is This Anyway?
An anonymous reader writes "Patrick Wyatt, one of the developers behind the original Warcraft and StarCraft games, as well as Diablo and Guild Wars, has a post about some of the bug hunting he's done throughout his career. He covers familiar topics — crunch time leading to stupid mistakes and finding bugs in compilers rather than game code — and shares a story about finding a way to diagnose hardware failure for players of Guild Wars. Quoting: '[Mike O'Brien] wrote a module ("OsStress") which would allocate a block of memory, perform calculations in that memory block, and then compare the results of the calculation to a table of known answers. He encoded this stress-test into the main game loop so that the computer would perform this verification step about 30-50 times per second. On a properly functioning computer this stress test should never fail, but surprisingly we discovered that on about 1% of the computers being used to play Guild Wars it did fail! One percent might not sound like a big deal, but when one million gamers play the game on any given day that means 10,000 would have at least one crash bug. Our programming team could spend weeks researching the bugs for just one day at that rate!'"
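For readers who want a feel for the check described in the quote, here is a minimal sketch in C. It is not the actual OsStress code: the function name `stress_check`, the arithmetic, and the closed-form "known answer" are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for the stress test in the quote: allocate a block,
 * compute into it, and compare the result with an answer known in advance.
 * Returns 1 if the answer matches, 0 on a mismatch, -1 if malloc fails. */
int stress_check(size_t n)
{
    uint64_t *block = malloc(n * sizeof *block);
    if (block == NULL)
        return -1;

    /* Write a predictable sequence into the block. */
    for (size_t i = 0; i < n; i++)
        block[i] = (uint64_t)i * 3u + 1u;

    /* Read it back and accumulate. */
    uint64_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += block[i];
    free(block);

    /* Known answer: the sum of (3i + 1) for i in [0, n) is 3*n*(n-1)/2 + n. */
    uint64_t expected = 3u * ((uint64_t)n * (n - 1) / 2u) + n;
    return sum == expected;
}
```

On healthy hardware `stress_check(65536)` should return 1 every time. A real version would also need to stop the compiler from optimizing the memory traffic away.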
...is pretty much what those of us that build our own systems do anytime we upgrade components (RAM/CPU/MB) or experience unexplained errors. It's similar to running the Prime95 torture tests overnight, which also check calculations in memory against known data sets for expected values.
Good stuff for those that don't already have a knack for QA.
You mean all those times when my code was 'fine' and I gave up, it really could have been the compiler or a memory problem?
Shit, I'm a much better programmer than I realized.
Microsoft found similar impossible bugs when overclocking was involved.
Do you even lift?
These aren't the 'roids you're looking for.
Bug hunts on LV-426 often end badly.
It must have been something you assimilated. . . .
I think this is bull. I just don't believe 1% of computers give wrong answers. There are many reasons why a precomputed table might differ - threading, reordering of floating point operations, etc. Basically, compilers guarantee a certain precision, not a bit-for-bit deterministic result (unless you set certain IEEE flags, which are not on by default).
In my field, if you can survive a gcc (gcc.gnu.org) testsuite run, twice, and get the same answer, you have a verified good system. If not, you have a steaming pile of trash you should throw away. That begins and ends all the stress testing you need to do.
I think this is bull. I just don't believe 1% of computers give wrong answers.
Why would he lie about it?
I found this out the hard way: by buying different DIMM modules, combining them, and of course NOT combining them. Nevertheless, when you do a lot of multitasking, play 2-3 games...ok, ok, only one, but the other two are still in the background, run some VMs, etc., you will find out how fragile computers are nowadays. The solution? Buy a Dell, and be happy with the most stable and least performant computer.
You don't have any idea what you're talking about, and that's why you don't understand what he's talking about.
If you suspect the compiler is generating invalid machine code, try to make a minimal test case for it. If you succeed, file a bug report and add that test case; the compiler developers will appreciate it. If you don't succeed in finding a minimal test case that triggers the same issue, it's likely not a compiler bug but an issue in your program in some place where you weren't expecting it.
I actually believe it. I am sure they must have thought of the floating point precision problem, but most likely they only used integers. That's what Prime95 and memtest are doing. Integer and memory operations uncover the most common hardware failures. I have encountered many computers whose hardware turned out to be faulty when stressed. And I am sure Guild Wars was stressful.
For a skilled developer, I can't believe he didn't think that Dev/Test/Prod build environments running different versions of the compiler would be an issue (obviously, until it was an issue).
That's Development Cycle 101.
Linux O Muerte!
I think this is bull. I just don't believe 1% of computers give wrong answers
1% of all computers? Probably not.
1% of gamers' computers, in an era when PC gaming technology was progressing very quickly, and so gamers were often running overclocked (or otherwise poorly set up) hardware? Sounds plausible enough.
I don't care if it's 90,000 hectares. That lake was not my doing.
I won't go into specific reasons you mention, but it is perfectly possible to write code that has a known, fully deterministic result. After all: compilers produce machine code, and the bulk of that is integer operations which have exactly defined behavior with 0 room for interpretation (when it comes to digital logic like CPU's, "defined" is deterministic). Maybe there are exceptions (like floating point? don't count on it), maybe for some types of operations you need to sidestep a compiler and code some assembly directly, but that's beside the point.
With that in hand, expect some of the computed results to turn out wrong. Knowing what junk parts go into computers sometimes, how shoddily some machines are built, and how some people abuse their computers, I'd think a 1% failure rate is probably on the low end of the scale.
For example, try running Memtest86 sometime, leave running for a few hours, repeat for other computers you encounter, and see how many computers you need to try before you see it spit out errors. You might be surprised.
He said 1% of computers that were used to play Guild Wars gave wrong answers. Gaming PCs are more likely to be overclocked too far, have under-dimensioned power supplies or overheating issues than the average PC. 1% doesn't sound unrealistically high to me.
Shows what pulling 12+ hour days does. At least nobody died because of it. Yes, that has happened in the past with trains, trucks, airplanes, etc.
> I actually believe it. I am sure they must have thought of the floating point precision problem.
I can believe it. Ten years ago on one of the PC games I worked on there were significant floating-point differences between Intel and AMD. Fortunately it was an RTS so we could get away with fixed-point. If we had been forced to deal with floats it would have been a hassle to keep them "in sync."
Floating-point is an approximation anyways, so IMHO
a) the server should be making the authoritative decision(s), and
b) should be sending a quantized result to the clients.
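A minimal sketch of what "sending a quantized result" could look like, assuming a grid of 1/256 units. The scale and the function names `quantize` and `dequantize` are my own choices for illustration, not anything from a real game.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: snap a float to a fixed-point grid of 1/256 units so
 * every client receives exactly the same integer from the server. */
#define FIXED_SCALE 256

int32_t quantize(float x)
{
    float scaled = x * FIXED_SCALE;
    /* Add half a step and truncate, which picks the nearest grid point
     * for ordinary values. */
    return (int32_t)(scaled >= 0.0f ? scaled + 0.5f : scaled - 0.5f);
}

float dequantize(int32_t q)
{
    return (float)q / FIXED_SCALE;
}
```

The server would call `quantize` once and ship the integer. Clients only ever call `dequantize`, so small floating-point differences between CPUs never feed back into the game state.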
I've had compilers miscompile my code, assemblers mis-assemble it, and even in a few cases CPUs mis-execute it consistently (look up CPU6 and msp430). Random crashes due to bad memory/cpu... yep. But on very rare occasions, I find that the bug is indeed in my own code, so I check there first.
We deal with this type of bug all the time in safety-certified systems (medical apps, aircraft, &c).
Most of the time an embedded program doesn't use up 100% of the CPU time. What can you do in the idle moments?
Each module supplies a function "xxxBIT" (where "BIT" stands for "Built In Test") which checks the module variables for consistency.
The serial driver (SerialBIT) checks that the buffer pointers still point within the buffer, checks that the serial port registers haven't changed, and so on.
The memory manager knows the last-used static address for the program (ie - the end of .data), and fills all unused memory with a pattern. In its spare time (MemoryBIT) it checks to make sure the unused memory still has the pattern. This finds all sorts of "thrown pointer" errors. (Checking all of memory takes a long time, so MemoryBIT only checked 1K each call.)
The stack pointer was checked - we put a pattern at the end of the stack, and if it ever changed we knew something went recursive or used too much stack.
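The incremental pattern check could be sketched roughly like this. The struct, the function names, and the 0xA5 fill byte are assumptions for illustration; on a real target the region would start at the end of .data rather than being passed in.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define FILL_PATTERN 0xA5u   /* assumed fill byte */
#define CHUNK_BYTES  1024u   /* check 1K per call, as in the description */

/* Hypothetical state for the sketch: the unused region and a scan cursor. */
typedef struct {
    uint8_t *base;
    size_t   size;
    size_t   cursor;
} MemoryBitState;

/* Fill the unused region with the pattern once at startup. */
void memory_bit_init(MemoryBitState *s, uint8_t *unused, size_t size)
{
    s->base = unused;
    s->size = size;
    s->cursor = 0;
    for (size_t i = 0; i < size; i++)
        unused[i] = FILL_PATTERN;
}

/* Check the next chunk; returns 1 if the pattern is intact, 0 if not. */
int memory_bit(MemoryBitState *s)
{
    size_t end = s->cursor + CHUNK_BYTES;
    if (end > s->size)
        end = s->size;
    for (size_t i = s->cursor; i < end; i++)
        if (s->base[i] != FILL_PATTERN)
            return 0;
    s->cursor = (end == s->size) ? 0 : end;   /* wrap around when done */
    return 1;
}
```

Calling `memory_bit` from the idle loop walks the whole region 1K at a time, so a stray write is noticed within one full pass.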
The EEPROM was checksummed periodically.
Every module had a BIT function and we checked every imaginable error in the processor's spare time - over and over continuously.
Also, every function began with a set of ASSERTs that check the arguments for validity. These were active in the released code. The extra time spent was only significant in a handful of functions, so we removed the ASSERTs only in those cases. Overall the extra time spent was negligible.
The overall effect was a very "stiff" program - one that would either work completely or wouldn't work at all. In particular, it wouldn't give erroneous or misleading results: showing a blank screen is better than showing bad information, or even showing a frozen screen.
(Situation specific: Blank screen is OK for aircraft, but not medical. You can still detect errors, log the problem, and alert the user.)
Everyone says to only use error checking during development, and remove it on released code. I don't see it that way - done right, error checking has negligible impact, and coupled with good error logging it can turbocharge your bug-fixing.
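As a rough illustration of checks that stay in released code, here is a sketch of a macro that logs and lets the caller refuse to continue, instead of being compiled out the way assert() is under NDEBUG. The names `CHECK`, `check_failed` and `average` are hypothetical.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical release-build check: it is not removed by NDEBUG. It logs the
 * failure, counts it, and evaluates to 0 so the caller can bail out. */
int check_failures = 0;

int check_failed(const char *expr, const char *file, int line)
{
    check_failures++;
    fprintf(stderr, "CHECK failed: %s (%s:%d)\n", expr, file, line);
    return 0;
}

#define CHECK(cond) ((cond) ? 1 : check_failed(#cond, __FILE__, __LINE__))

/* Example: validate arguments on entry and refuse to produce a bad result. */
int average(const int *values, int count, int *out)
{
    if (!CHECK(values != NULL) || !CHECK(count > 0) || !CHECK(out != NULL))
        return -1;
    long sum = 0;
    for (int i = 0; i < count; i++)
        sum += values[i];
    *out = (int)(sum / count);
    return 0;
}
```

The cost is a few compares per call, and the log line tells you which check tripped and where.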
This is why stress testing is so important. The system may seem stable at overclocked speeds but only while it is lightly or even moderately loaded, and not every error will result in a kernel panic. The hardest errors to get stable are often the subtle ones that cause cascades elsewhere, minutes or hours after the load finished.
I start by getting it stable enough to pass memtest86+ tests 5 and 7 at (or as close as possible) my target frequencies/dividers. This is pretty easy to do nowadays, but it's a good sanity check starting point before booting the OS and minimizes gross misconfigurations that cause filesystem corruption. Then I run prime95, then linpack, then y cruncher, then loops of a few 3dmark versions. Sometimes I run the number crunchers simultaneously across all cores, first configured to stress the cpu/cache, then with large sets to stress ram (but not swap! in fact turn swap off for this). The minimum time for all of this really should be 12 hrs.. 24 is best, or more if you're paranoid. A variety of loads over this time is important because the synthetic ones are often highly repetitious, and this can sometimes fail to expose problems despite the load the system's under. The 3dmark (or pick a scriptable util of your choice) stresses bus IO as well as all the really cranky and picky gfx driver code. As a unique stressor, I use a quake 3 map compile that eats most of the ram and pegs the cpu for hours.. q3map2 is a bitch and it usually finds those subtle 'non-fatal' hardware errors if they exist.
If the boot survives without an application or kernel crash (or other wonky behavior), I run a few games in timedemo loops. In the old days this was quake1/2/3, but these days I stick with games like metro 2033 which have their own bench utilities. These tests are still valid even if your intended use is 'workstation' class work and you don't game much but still want to squeeze as much performance as you can from your hardware. I do both with mine and have had great success with this method.
I get hilarious impossible errors on my gaming rig all the time. Memtest revealed my RAM is throwing errors about once in a couple million memory transactions. Leaky gate I guess. Not about to replace the RAM any time soon, though -- nobody is really making DDR2 anymore and the utterly random errors I get when data comes back incorrect are very instructive as to which companies practice sanity checking and which skip it. Oh, and how annoying it can be when Windows decides device driver errors are worthy of halting the OS over.
Don't forget to do out of the box testing / testing for stuff that you may not think of off hand.
I can't remember the exact code sequence, but in a loop, I had the statement:
if (i = 1) {
Where "i" was the loop counter.
Most of the time, the code would work properly as other conditions would take program execution but every once in a while the loop would continue indefinitely.
I finally decided to look at the assembly code and discovered that in the conditional statement, I was setting the loop counter to 1 which was keeping it from executing.
I'm proud to say that my solution to preventing this from happening is to never place a literal last in a condition, instead it always goes first like:
if (1 = i) {
So the compiler can flag the error.
I'm still amazed at how rarely this trick is taught in programming classes and how many programmers this mistake still trips up.
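A small sketch of the literal-first habit in a loop. The function `count_until_one` is made up for illustration and is not the original code.

```c
#include <assert.h>

/* Hypothetical loop showing the literal-first habit. Writing `1 = i` here by
 * mistake is a compile error, whereas `i = 1` compiles and resets the counter. */
int count_until_one(int start)
{
    int steps = 0;
    for (int i = start; i > 0; i--) {
        if (1 == i) {          /* literal on the left */
            break;
        }
        steps++;
    }
    return steps;
}
```

With the comparison written correctly, `count_until_one(5)` counts the iterations for 5, 4, 3 and 2 and then leaves the loop.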
myke
Mimetics Inc. Twitter
If you suspect the compiler is generating invalid machine code, try to make a minimal test case for it. If you succeed, file a bug report and add that test case; the compiler developers will appreciate it. If you don't succeed in finding a minimal test case that triggers the same issue, it's likely not a compiler bug but an issue in your program in some place where you weren't expecting it.
Yeah, right. Let's see how that works out in practice.
I go to the home page of the project with bug in hand (including sample code). Where do I log the problem?
I have to register with your site. One more external agent gets my E-mail, or I have to take pains to manage multiple E-mails to avoid spam. (I don't want to be part of your community! I just thought you wanted to make your product better.)
Once registered, I'm subscribed to your newsletter. (My temp E-mail has been getting status updates from the GCC crowd for years. My mail reader does something funky with the subject line, so responding with "unsubscribe" doesn't work for me.)
Once entered, my E-mail and/or name is publicly available on the bug report for the next millennium. In plain text in the bug report, and sometimes in the publicly-accessible changelog - naked for the world to see (CPAN is especially flagrant).
Sometimes the authors think it's the user's problem (no, really? This program causes gcc to core dump. How can that be *my* fault?) Sometimes the authors interpret the spec differently from everyone else (Opera - I'm looking at you). Sometimes you're just ignored, sometimes they say "We're rewriting the core system, see if it's still there at the next release", and sometimes they say "it's fixed in the next release, should be available in 6 months".
What you really do is figure out the sequence of events that causes the problem, change the code to do the same thing in a different way (which *doesn't* trigger the error), and get on with your life. I've given up reporting bugs. It's a waste of time.
That's how you deal with compiler bugs: figure out how to get around them and get on with your work.
No, I'm not bitter...
In my own coding, I tend to *gasp* make mistakes. Sometimes, really, really dumb ones.
One of the biggest problems with my coding, is that I am often the only real coder looking at it. Even my FOSS work seldom gets reviewed by coders.
I can't say enough about peer review. I wish I had more. It can really suck, as one thing that geeks LOVE to do, is cut down other geeks. However, they are sometimes right, and should be heard.
Negative feedback makes the product better. Positive feedback makes the producer feel better.
I prefer a better product, but that's just me.
I had an interesting bug just the other day in my FOSS project. It's an iOS (iPhone/iPad) app that uses the MapKit Framework API.
The bug was on this line.
The original code is here.
So that folks don't have to look at a whole bunch of source, here's the problematic two lines:
[mapSearchView addAnnotation:myMarker]; [mapSearchView setDelegate:self];
When iOS 6 came out (with Apple's...wonderful...new maps), the black marker suddenly started showing as the default marker (this only works on iPads, so no one seemed to see it).
I went nuts trying to figure it out (actually, I've been nuts for a long time, but now I have something to blame it on).
I traced into the callbacks, and saw that they were being called with an empty annotation. Whiskey Tango Foxtrot?
Then, just for s's and g's, tried this:
[mapSearchView setDelegate:self]; [mapSearchView addAnnotation:myMarker];
Damn if that didn't fix it.
It was a case of an ambiguous API contract. The Apple maps call the annotation setup as soon as the annotation is set, and the old Google API waited until a few things were set up, so the delegate call set after the annotation worked.
I could rail against the framework, but it was really my own fault, and I am just glad I figgered it out.
"For every complex problem there is an answer that is clear, simple, and wrong."
-H. L. Mencken
I believe it, I am actually surprised by the low number. The problem with memory is that a simple memory test often fails to detect errors, so a POST will not find anything but the easiest faults.
memtest86+ is actually really smart: it tries a lot of different kinds of access methods and bit patterns for certain types of errors. Even so, memtest86+ can fail to detect memory errors.
Interestingly enough the best test for memory errors they found is using the gcc compiler, try to compile gentoo (including X, kde, perl (perl is huge)) with many parallel compiles so that it will hit swap slightly and hope that it won't segfault.
This article sounds interesting, I haven't read it yet. On Dr. Dobb's there was also an article not long ago about checking (like fsck) your application data structures continuously to check for errors. I am almost thinking that you should run a CRC on all your data structures. Nothing is worse than a customer who has faulty memory.
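A minimal sketch of the "CRC on your data structures" idea, using the standard bitwise CRC-32 (reflected polynomial 0xEDB88320). The `PlayerRecord` struct and the seal/check function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-32 (reflected polynomial 0xEDB88320, as used by zlib and PNG). */
uint32_t crc32_bytes(const void *data, size_t len)
{
    const uint8_t *p = data;
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= p[i];
        for (int bit = 0; bit < 8; bit++)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return ~crc;
}

/* Hypothetical record that carries a CRC of its own payload. */
typedef struct {
    int32_t  hit_points;
    int32_t  gold;
    uint32_t crc;
} PlayerRecord;

/* Recompute and store the CRC after every legitimate update. */
void record_seal(PlayerRecord *r)
{
    r->crc = crc32_bytes(r, offsetof(PlayerRecord, crc));
}

/* Returns 1 if the payload still matches the stored CRC. */
int record_ok(const PlayerRecord *r)
{
    return r->crc == crc32_bytes(r, offsetof(PlayerRecord, crc));
}
```

A flipped bit in the payload makes `record_ok` return 0, which at least tells you the data went bad in memory rather than in your logic.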
Ok trolling is one thing, but what you wrote is a sick joke.
which, well, it can be.
I can't believe you don't understand that the brain doesn't work 100% reliably when you force it past the breaking point like this. It's Work 101.
It's just a matter of whether you realize it or not.
The blatant ones cause an application or OS crash. But depending on what got corrupted, it might just cause a momentary application glitch, or even cause an alteration in the contents of a file that you won't notice for weeks... if ever.
When I build PCs, they get an overnight Memtest run at a minimum. Most of the time I also use ECC RAM to protect against random flipped bits and DIMMs that fail after being in use for a while.
Anything that both assigns and tests a loop index is by definition a fucking accident waiting to happen. It's like driving a car without wearing a seatbelt and then deciding the 'solution' is to put the steering wheel in the back seat instead of the front.
I agree, but
if (i = 1) {
is a perfectly valid "C" (and Java) statement - there was no intention of putting an assignment in a conditional statement.
Modern compilers now issue warnings on statements like this, but at the time nothing was returned.
myke
Fortunately it was an RTS so we could get away with fixed-point.
Does it really vary by genre? For a game world the size of Liechtenstein, a 32-bit fixed-point length gives precision down to 10 microns or so. And even in a vast open world, you start to get glitches like the far lands in Minecraft if you stray more than 12.5 million units from the origin.
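The "10 microns or so" figure is easy to sanity-check. The helper name `fixed32_resolution_m` is made up, and the 25 km extent used below is only my rough assumption for the longest dimension of Liechtenstein.

```c
#include <assert.h>

/* Resolution of an unsigned 32-bit fixed-point coordinate spread evenly over
 * a world of the given extent in meters. */
double fixed32_resolution_m(double extent_m)
{
    return extent_m / 4294967296.0;   /* 2^32 distinct positions */
}
```

`fixed32_resolution_m(25000.0)` comes out at a little under 6 microns, the same order of magnitude as the estimate above.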
He said 1% of computers that were used to play Guild Wars gave wrong answers. Gaming PCs are more likely to be overclocked too far, have under-dimensioned power supplies or overheating issues than the average PC. 1% doesn't sound unrealistically high to me.
Guild Wars runs okay on a crappy netbook, let alone anyone's PC. If you need to OC to play Guild Wars then you are using a Pentium III.
Just saying.
I've fixed a lot of PCs. In fact, I pride myself on the fact that I can usually figure out what is wrong with a computer, software or hardware. And I've seen some funky hardware in my time. And lost a lot of hardware that went bad.
I bet more than 1% of computers are problematic and the owners don't know it. The last chunk of memory could be bad, but if they don't use all the memory (or never test it), they might never find out. They might dismiss random shutdowns without understanding there is a problem.
In my experience, there is a lot of computer hardware out there that is crappily made and shouldn't be sold, let alone in someone's computer.
Be seeing you...
Worse, the article hints at a bigger problem:
"We had "pushed" a new build out to end-users, and now none of them could play the game!"
Which I read as: developers write & debug code, that code goes through a build server which builds it & combines with game data etc, result of that is pushed to users. The obvious step missing here: make sure the exact same stuff you're pushing to users, is working & tested thoroughly before release. Seems like a gaping Quality Assurance fail right there, forget differences between developer and production systems.
Skip that step and you're implicitly assuming that correct code (like, what's known to work well on developer's system) will produce correct working end product. Even if developer's system and production systems are configured 100% the same, that assumption is still flawed: there's always the possibility of file corruption, eg. a random single-bit error that occurs somewhere during the build process, or anything else that goes into the end product which a developer doesn't check directly.
Of course it's best to make sure individual steps in the process are reliable, but whatever you do: at the very least check what you kick out the door. QA 101.
Hiya,
I just noticed the confusion when I put in "keeping it from executing" when I meant to say "keeping it from exiting [the loop]".
Sorry about that,
myke
I don't think this should be modded as flamebait. Personally I view the article with a similar degree of skepticism and incredulity. I have a good friend who works at a major chip manufacturer and specializes in fault detection. He related to me that, essentially they have never seen a case of an undetected (by the CPU) fault, despite running tests like this on massively huge systems. Between a game programmer and the company who makes their bread and butter doing this, I'm going to have to go with the latter until someone posts code or something more concrete than what I view as a lot of speculation.
the compiler had done exactly what it should according to the standard...
That's even better - it means that you've found a bug in the standard! ;-)
My previous was modded up, so here's some more checks.
During boot, the system would execute a representative sample of CPU instructions, in order to test that the CPU wasn't damaged. Every mode of memory storage (ptr, ptr++, --ptr), add, subtract, multiply, divide, increment &c.
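A boot-time instruction check along these lines might look like the sketch below. It is written in portable C rather than assembly, so it only approximates "a representative sample of CPU instructions". The function name `cpu_self_test` and the operand values are assumptions.

```c
#include <assert.h>
#include <stdint.h>

/* Operands are volatile so the compiler cannot fold the arithmetic at build
 * time. Returns 1 if every result matches its known answer, 0 otherwise. */
int cpu_self_test(void)
{
    volatile uint32_t a = 1000u, b = 7u;
    int ok = 1;

    /* Arithmetic against known answers. */
    ok &= (a + b == 1007u);
    ok &= (a - b == 993u);
    ok &= (a * b == 7000u);
    ok &= (a / b == 142u);
    ok &= (a % b == 6u);

    /* Addressing modes: ptr, ptr++, --ptr. */
    volatile uint32_t cells[3] = { 11u, 22u, 33u };
    volatile uint32_t *p = cells;
    ok &= (*p == 11u);
    ok &= (*p++ == 11u);   /* read, then advance to cells[1] */
    ok &= (*p == 22u);
    p += 2;                /* one past the end, never dereferenced there */
    ok &= (*--p == 33u);   /* step back to cells[2], then read */

    return ok;
}
```

A real safety-certified system would do this in assembly to control exactly which instructions run. The C version only shows the shape of the check.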
During boot, all memory was checked - not a burn-in test, just a quick check for integrity. The system wrote 0x00, 0xFF, 0xA5, 0x5A and read the values back. This checked for wires shorted to ground/VCC, and wires shorted together.
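The quick pattern check can be sketched as follows. The function name `quick_memory_check` is hypothetical, and this version runs over an ordinary buffer rather than raw physical memory.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Quick integrity check: write each pattern to every byte and read it back.
 * Returns the number of mismatches; the previous contents are destroyed. */
size_t quick_memory_check(volatile uint8_t *mem, size_t len)
{
    static const uint8_t patterns[] = { 0x00, 0xFF, 0xA5, 0x5A };
    size_t errors = 0;

    for (size_t p = 0; p < sizeof patterns; p++) {
        for (size_t i = 0; i < len; i++)
            mem[i] = patterns[p];
        for (size_t i = 0; i < len; i++)
            if (mem[i] != patterns[p])
                errors++;
    }
    return errors;
}
```

0x00 and 0xFF catch bits stuck high or low. 0xA5 and 0x5A are bitwise complements with alternating bits, so adjacent lines shorted together show up as a mismatch.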
During boot, the .bss segment was filled with a pattern, and as a rule, all programs were required to initialize all of their static variables. Each routine had an xxxINIT function which was called at boot. You could never assume a static variable was initialized to zero - this caught a lot of "uninitialized variable" errors.
(This allowed us to reboot specific subsystems without rebooting the whole system. Call the SerialINIT function, and don't worry about reinitializing that section's static vars.)
The program code was checksummed (1K at a time) continuously.
When filling memory, what pattern should you use? The theory was that any program using an uninitialized variable would crash immediately because of the pattern. 0xA5 is a good choice:
1) It's not 0, 1, or -1, which are common program constants.
2) It's not a printable character
3) It's a *really big* number (negative or unsigned), so array indexing should fail
4) It's not a plausible floating point or double value (read as an IEEE 754 single, 0xA5A5A5A5 is a tiny negative number, roughly -2.9e-16)
5) Being odd, it's not a valid pointer
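Some of these properties are easy to confirm in code. The helpers `poisoned_word` and `is_printable_ascii` are names I made up for this sketch.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Fill a 32-bit variable with the 0xA5 byte pattern, the way a never
 * initialized static variable would look after the boot-time fill. */
uint32_t poisoned_word(void)
{
    uint32_t w;
    memset(&w, 0xA5, sizeof w);
    return w;
}

/* Printable ASCII runs from 0x20 (space) to 0x7E (tilde). */
int is_printable_ascii(uint8_t b)
{
    return b >= 0x20 && b <= 0x7E;
}
```

The word comes back as 0xA5A5A5A5: about 2.78 billion as an unsigned value, negative on the usual two's-complement targets when read as signed, and odd.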
Whenever we use enums, we always start the first one at a different number; ie:
enum Day { Sat = 100, Sun, Mon, ... };
enum Month { Jan = 200, Feb, Mar, ... };
Note that the enums for Day aren't the same as Month, so if the program inadvertently stores one in the other, the program will crash. Also, the enums aren't small integers (ie - 0, 1, 2), which are used for lots of things in other places. Storing a zero in a Day will cause an error.
(This was easy to implement. Just grep for "enum" in the code, and ensure that each one starts on a different "hundred" (ie - one starts at 100, one starts at 200, and so on).)
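Distinct bases only help if something checks the value, so here is a sketch of the kind of range check that would go with them. The shortened enum lists and the `is_valid_*` helpers are my own illustration, not the poster's code.

```c
#include <assert.h>

/* Shortened lists for illustration; the 100 and 200 bases follow the comment. */
enum Day   { Sat = 100, Sun, Mon, Tue, Wed, Thu, Fri };
enum Month { Jan = 200, Feb, Mar };

/* Range checks reject zero, small integers, and values from the other enum. */
int is_valid_day(int v)   { return v >= Sat && v <= Fri; }
int is_valid_month(int v) { return v >= Jan && v <= Mar; }
```

An ASSERT on `is_valid_day(d)` at the top of any function that takes a Day then catches a stored Month, a zero, or a stray array index.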
The nice thing about safety cert is that the hardware engineer was completely into it as well. If there was any way for the CPU to test the hardware, he'd put it into the design.
You could loopback the serial port (ARINC on aircraft) to see if the transmitter hardware was working, you could switch the A/D converters to a voltage reference, he put resistors in the control switches so that we could test for broken wires, and so on.
(Recent Australian driver couldn't get his vehicle out of cruise-control because the on/off control wasn't working. He also couldn't turn the engine off (modern vehicle) nor shift to neutral (shift-by-wire). Hilarity ensued. Vehicle CPU should abort cruise control if it doesn't see a periodic heartbeat from the steering-wheel computer. But, I digress...)
If you're interested in the software safety systems, look up the Therac some time. Particularly, the analysis of the software bugs. Had the system been peppered with ASSERTs, no deaths would have occurred.
P.S. - If you happen to be building a safety cert system, I'm available to answer questions.
It's not valid Java. Java requires the condition to be a boolean, so assigning an int inside an if or loop condition like this doesn't compile.
I've never coded an FPS, but I can imagine it would involve more raycasting (and higher precision) than an RTS, where I have one shipped game's worth of experience.
Lots of girls play Guild Wars. I've been asked by a few to take a look at their computer and why it's crashing all the time.
It's always been cheap laptops with hello kitty stickers or something on them, conveniently placed on a cutesy little carpet used as a laptop tray.
This and the cat hair from the inevitable cutesy little cat that likes to sit on the warm laptop do the rest.
Faulty hardware is a common problem in clusters. We still have a 5-year-old cluster at work that is falling to pieces due to faulty processors, faulty memory, faulty power supplies and faulty hard drives. Some errors are just weird. Ever seen a machine show errors in memtest86 that stop once you swap two DIMMs? I see that from time to time and I don't spend much time dealing directly with hardware.
The problem is that you can't really test an overclocked CPU. CPUs do not work perfectly or catastrophically fail (crash the software). There is a range of errors between these two extremes. Some of the overclocking induced errors are quite subtle, a simple incorrect answer. The problem is that these subtle errors are sometimes dependent upon a certain sequence of instructions or certain data patterns and the sequences and data, as well as the failing instruction and the clock speed where these errors manifest, can vary from one individual CPU to the next.
You can run all sorts of test utilities and you will probably only discover the CPUs with the more severe problems. You will probably not be able to tell the CPUs that have subtle failures from CPUs that are functioning properly.
I do both HW and SW and (too) much repair work. As many of you know, bad capacitors are epidemic. I've traced all kinds of HW and SW problems down to bad capacitors. They frequently show no external sign of failure.
Before repairing them, I had systems that ran Linux reliably but when Windows booted on the same system, glitches, freezes, bluescreens, etc. I blame it on Windows being a little less efficient with CPU do-nothing cycles (System Idle Process), thus drawing more power and more power supply ripple and more glitching on the digital signals. I'm amazed that they're only finding 1% of systems with problems. Probably because gamers tend to buy (much) better hardware, although not all high-end hardware has good caps.
While at a large game company I wrote the code that collected CPU make and model, video make and model, amount of RAM, OS version, etc. Basically the type of info you see under minimum system requirements. The CPUID instruction can return a vendor string indicating who made the CPU. Intel CPUs return "GenuineIntel". On very very rare and often transient occasions the reported string had a misspelling, the misspellings generally indicated a single bit error. Whether an overclocked CPU generating subtle errors or bad RAM or a bad power supply or something else is responsible I can't say. All I really know is that outside of the CPU manufacturing facility things do go wrong in hardware. The article is consistent with various things I have seen.
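Spotting a "single bit" misspelling is just a bit-difference count against the expected string. The sketch below compares plain strings and does not execute CPUID itself. The name `bit_distance` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Count differing bits between two byte strings of equal length n. */
int bit_distance(const char *a, const char *b, size_t n)
{
    int bits = 0;
    for (size_t i = 0; i < n; i++) {
        unsigned char diff = (unsigned char)a[i] ^ (unsigned char)b[i];
        while (diff) {
            bits += diff & 1u;
            diff >>= 1;
        }
    }
    return bits;
}
```

For example, "GenuineIntdl" differs from "GenuineIntel" in exactly one bit ('d' is 0x64, 'e' is 0x65), which points at flaky hardware rather than an unknown vendor.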
Higher end "gamer" motherboards come with default overclock settings. It doesn't require anything more than leaving the default settings as they are for the motherboard to attempt optimal settings rather than purely the settings your CPU/memory themselves report. Going even further is also pretty easy, requiring little more than selecting performance mode from a nice graphical screen.
Yes, extreme overclocking is still for enthusiasts, but running your hardware slightly faster than recommended by your CPU/memory maker is child's play today on anything beyond an office computer.
MMO Quests are like orgasms:
You may solo them, I prefer them in a group.
Most wrong answers fail to crash the computer, so are unnoticed. Bit errors happen far more often than you'd believe. Copy 100 GB of data with no validation and no retries, you're almost guaranteed to have at least 1 bit error. Don't care how you do it - memory to memory, hard disk to hard disk. Copy 100 GB of data over the Internet, and you're VERY likely to have a bit error that gets past the TCP/IP checksums.
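To show how an error can slip past a 16-bit checksum of the kind TCP uses, here is a simplified one's-complement sum in the style of RFC 1071. It handles even lengths only, and the function name `inet_checksum` is my own.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* 16-bit one's-complement sum over big-endian 16-bit words (RFC 1071 style).
 * len must be even in this simplified sketch. */
uint16_t inet_checksum(const uint8_t *data, size_t len)
{
    uint32_t sum = 0;
    for (size_t i = 0; i + 1 < len; i += 2)
        sum += ((uint32_t)data[i] << 8) | data[i + 1];
    while (sum >> 16)
        sum = (sum & 0xFFFFu) + (sum >> 16);   /* fold carries back in */
    return (uint16_t)~sum;
}
```

Because the checksum is just a sum, two flips in the same bit position of different words, one going 1 to 0 and the other 0 to 1, cancel out. The words 0x0001, 0x0002 and 0x0000, 0x0003 produce the same checksum.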
Crashes due to "that bit is not right, don't know how that happened, but it wasn't a bug, it's hardware" happen about once per thousand hours of play for us; we've been tracking that for years and working hard to work around it.
Eh... read the FIRST story, it is the more common one. As a fellow coder, I agree the FIRST story should be wiped from the face of the earth, lest the fact that we nerds are not perfect at writing code that compiles on the first run every time ruin our image with the females of the species.
It is of course one of the annoying things about having to deal with bug reports: so many originate between the chair and the keyboard, and since most of those bug reporters pay your salary you can't kill them. Because then they won't pay you. And that is bad. Sure, you can ask yourself what Jesus would do, but Jesus didn't have a mortgage to pay, did he?
That's quite an old and sensible trick with C, I was taught it at uni in the late 80's on a System V terminal. These days C/C++ compilers will flag an assignment in a logical expression with a warning.
The strangest error I got from a compiler was with MSVC 1.52, it was a real head scratcher so I had to resort to the debugger, the culprit was an innocent looking i++ statement, when disassembled it was clear the compiler had inserted two INC statements. Thus the counter was incrementing by 2 each time it executed the statement. We rebuilt the same code on the same machine and the problem disappeared. When weird shit like that happens everyone just shrugs and keeps going, but it sticks in your head for a long time and often comes to mind at inappropriate times such as during take off in a modern plane.
My favorite linker error also comes from System V: "Bad magic number."
And did you exchange a walk on part in the war for a lead role in a cage? - Pink Floyd.
FLAC has a verify mode when encoding which, in parallel, decodes the encoded output and compares it against the original input to make sure they're identical. Every once in a while I'd get a report that there were verification failures, implying FLAC had a bug.
If it were actually a FLAC bug, the error would be repeatable* (same error in the same place) because the algorithm is deterministic, but upon rerunning the exact same command the users would get no error, or (rarely) an error in a different place. Then they'd run some other hardware checker and find the real problem.
Turns out FLAC encoding is also a nice little hardware stressor.
(* Pedants: yes, there could be some pseudo-random memory corruption, etc but that never turned out to be the case. PS I love valgrind.)
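The verify idea above (decode what you just encoded and compare it with the input) can be sketched in a few lines. This is only an illustration under loud assumptions: it uses a toy run-length codec, not FLAC, and the names rle_encode, rle_decode and encode_verified are made up for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy run-length encoder (NOT FLAC): emits (count, value) byte pairs.
 * Returns the encoded length, or 0 if the output buffer is too small. */
size_t rle_encode(const unsigned char *in, size_t n,
                  unsigned char *out, size_t cap)
{
    size_t o = 0;
    for (size_t i = 0; i < n; ) {
        unsigned char v = in[i];
        size_t run = 1;
        while (i + run < n && in[i + run] == v && run < 255) {
            run++;
        }
        if (o + 2 > cap) {
            return 0;
        }
        out[o++] = (unsigned char)run;
        out[o++] = v;
        i += run;
    }
    return o;
}

/* Matching decoder. Returns the decoded length, or 0 on malformed input
 * or if the output buffer is too small. */
size_t rle_decode(const unsigned char *in, size_t n,
                  unsigned char *out, size_t cap)
{
    size_t o = 0;
    if (n % 2 != 0) {
        return 0;
    }
    for (size_t i = 0; i < n; i += 2) {
        size_t run = in[i];
        if (o + run > cap) {
            return 0;
        }
        memset(out + o, in[i + 1], run);
        o += run;
    }
    return o;
}

/* Encode, then immediately decode into a scratch buffer and compare with
 * the original input. Returns 1 if the round trip is byte-identical,
 * 0 otherwise. With a deterministic codec, a failure that does not
 * reproduce on a rerun points away from the codec itself. */
int encode_verified(const unsigned char *in, size_t n,
                    unsigned char *enc, size_t enc_cap, size_t *enc_len,
                    unsigned char *scratch, size_t scratch_cap)
{
    *enc_len = rle_encode(in, n, enc, enc_cap);
    if (*enc_len == 0 && n != 0) {
        return 0;
    }
    size_t dec_len = rle_decode(enc, *enc_len, scratch, scratch_cap);
    return dec_len == n && memcmp(in, scratch, n) == 0;
}
```

The point of the design is that the check costs one extra decode per encode, and any mismatch is caught at the moment the bad output is produced rather than when someone plays the file back.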
FLAC - Free Lossless Audio Codec
Yeah, but people don't buy one PC for Guild Wars and a separate one for other games. Gamers as a whole are disproportionately likely to overclock even if this particular game doesn't need it. It may be that overclockers have a 10% fail rate but only 10% of people are overclockers, or some other mix.
Excuse me but...WHOOSH...
The GP's tip is about avoiding accidentally typing an assignment operator "=" when you intended an equality operator "==". If you follow the GP's habit there can be no such accidents, because it's illegal to assign a value to a constant. The coding tip itself has been circulating for 23 years that I personally know about. In those days compilers did not warn you about assignments within conditionals; the compiler saw valid syntax and assumed you knew what you were doing. This is why lint was so popular at the time. Even with warnings enabled it is still good advice, since following the habit will always produce a hard error when (by Murphy's law) you inevitably make that particular typo.
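For anyone who hasn't seen the habit, here is a minimal sketch in C. MAX_RETRIES and both function names are invented for the example.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_RETRIES 3  /* hypothetical constant, for illustration only */

/* Constant-first comparison. If "==" is mistyped as "=", the line becomes
 * "if (3 = retries)", an assignment to a literal, and the compiler
 * rejects it with a hard error instead of a warning. */
bool retries_exhausted(int retries)
{
    if (MAX_RETRIES == retries) {
        return true;
    }
    return false;
}

/* Variable-first comparison. The same typo here would give
 * "retries = MAX_RETRIES", which is valid C, always true for a nonzero
 * constant, and at best produces a warning. */
bool retries_exhausted_plain(int retries)
{
    return retries == MAX_RETRIES;
}
```

Both functions behave identically when typed correctly; the only difference is what happens when you slip.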
That was me. My bad.
Help stamp out iliturcy.
a) the server should be making the authoritative decision(s), and
b) should be sending a quantized result to the clients.
This is already the case. The user side of Guild Wars only renders and 'continues', actual changes in the world always occur on the server side (you can see this at a network failure). And it has to be, if you want to avoid cheaters.
I thought the article was talking about 1% of computers very very occasionally giving wrong answers?
It's not like all their operations are failing. The test was run dozens of times per second, and if it fails once every few days, it's still a really low "failure rate" for a failing computer -- the only problem is that we expect the failure rate to be Zero.
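A rough sketch of what such a per-frame check could look like, in C. This is not the actual OsStress code, which isn't shown anywhere in the article. As a loudly labeled substitution, it compares against a reference value computed by an earlier run of the same function rather than a precomputed table of known answers, and BLOCK_WORDS, stress_pass and stress_check are made-up names.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define BLOCK_WORDS 4096u  /* hypothetical block size */

/* One pass: allocate a block, fill it from a fixed-seed linear
 * congruential generator, then fold the block into a checksum.
 * Integer-only math, so a healthy machine gives the same result every
 * time. Returns 1 on success, 0 if the allocation failed.
 * Note: an aggressive optimizer may keep values in registers, so a real
 * test would need to make sure memory is actually exercised. */
int stress_pass(uint32_t seed, uint32_t *result)
{
    uint32_t *block = malloc(BLOCK_WORDS * sizeof *block);
    if (block == NULL) {
        return 0;
    }
    uint32_t x = seed;
    for (size_t i = 0; i < BLOCK_WORDS; i++) {
        x = x * 1664525u + 1013904223u;
        block[i] = x;
    }
    uint32_t sum = 0;
    for (size_t i = 0; i < BLOCK_WORDS; i++) {
        sum = ((sum << 1) | (sum >> 31)) ^ block[i];  /* rotate and mix */
    }
    free(block);
    *result = sum;
    return 1;
}

/* Compare one pass against an expected value.
 * Returns 1 on match, 0 on mismatch, -1 if the pass could not run. */
int stress_check(uint32_t seed, uint32_t expected)
{
    uint32_t got;
    if (!stress_pass(seed, &got)) {
        return -1;
    }
    return got == expected;
}
```

Called once per frame, a mismatch from something like this can't be blamed on game logic, which is what makes even a very rare failure informative.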
Don't quote me on this.
Well it should be modded ignorant/overrated then.
;).
The return rates for most PC components are about 1-3%. Yes some of the returns are user error, but the rest are faulty stuff. And if you look at the laptop malfunction rates they're from 15-25%!
So it should not be surprising that 1% of the PCs that "appear to work" don't actually work correctly 100% of the time.
I think a fair number of corporate IT people handling faulty PCs every day would actually be surprised the figure is so low.
Incorrect - Java will allow an assignment inside a conditional statement so long as the condition as a whole evaluates to a boolean. Most decent IDEs will flag it as suspicious, but it will compile, e.g.: if((i = 3) == 2){ }
Well, the minimum CPU requirement for Guild Wars is an 800 MHz Pentium III. That was current when Guild Wars came out.
So yeah, overclocking something like that would probably have been reasonably common at the time.
Guild Wars runs okay on a crappy netbook, let alone anyone's PC. If you need to OC to play Guild Wars then you are using a Pentium III.
But chances are the same PC is also used to play games that require a lot more resources, and that the user isn't fiddling with his hardware all the time.
Perhaps only 1 in 1,000 of the chips etched on a silicon wafer actually CAN run at the top rated speed. Hence the extreme cost of the top-end CPUs from Intel (look at the price per 1,000 lots of the 3960 extreme Core CPU).
Hm. I can't remember the last time I was allowed to compile stuff that I bought for Windows myself. As it is a shipped binary blob, the creator of the test can guarantee that your compiler-based assumptions won't happen.
Every piece might be fine by itself. That doesn't mean they work together. A lot of cheap stuff is barely within the allowed tolerances. Usually the CPU (unless overclocked) and the GPU (unless overclocked) are the only ones clinging tight to the spec.
Almost every time an OS crashes due to hardware, people are going to blame it on the OS or some piece of software. Hardware is rarely the first go-to, unless the user is blatantly told.
How difficult would it be to offer an option to test all the hardware when installing the OS? And to randomly check hardware while the computer is idle? Doesn't even need to be in depth tests, you could check just a fraction at a time and cover all the hardware over the period of a few days or weeks.
Almost every bit of hardware can be tested from inside the OS. Eurosoft makes a program called QAWin32+ which can test all the memory, the entire hard drive, the processor, and a whole bunch of other hardware while the computer is being used.
Yeah, right. Let's see how that works out in practice.
The only time I was on a project where someone discovered an architecture-specific bug in GCC, he reported it with an included test case, and the code was patched in branch in the next six days. Of course the next point release was 8 months later, so we just disabled the problematic optimization setting, but we could have used a nightly build of GCC.
The actual error was more nuanced than I had room to describe in the comment.
Car systems pass messages to each other using an internal bus. In this particular incident, one of the systems failed in a way that made it continuously spew out messages, using all the bandwidth so that no other system could get a message through.
The brakes will normally abort cruise control, but there's no direct wire to the engine computer - it's all messages passed around on the bus. Similar with the On/Off button (newer cars) and shift-by-wire.
It was determined experimentally that the operator is not strong enough to overpower the engine using the brakes. The engine would just downshift for more power.
It all boils down to the safety "mindset": you code the systems to fail safe instead of fail dangerous.
I maintain that the engine CPU should automatically abort cruise control if it doesn't see a periodic heartbeat from the steering-wheel computer.
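To make the heartbeat idea concrete, here is a minimal sketch in C. Everything in it is hypothetical: the EngineState struct, the function names and the 200 ms timeout are made up for illustration and are not taken from any real engine controller.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define HEARTBEAT_TIMEOUT_MS 200u  /* hypothetical deadline */

typedef struct {
    bool cruise_active;
    uint32_t last_heartbeat_ms;  /* time the last heartbeat arrived */
} EngineState;

/* Called whenever a heartbeat message from the steering-wheel computer
 * makes it across the bus. */
void on_heartbeat(EngineState *s, uint32_t now_ms)
{
    s->last_heartbeat_ms = now_ms;
}

/* Engaging cruise control starts the heartbeat clock. */
void engage_cruise(EngineState *s, uint32_t now_ms)
{
    s->cruise_active = true;
    s->last_heartbeat_ms = now_ms;
}

/* Called on every control-loop tick. If the heartbeat is overdue, drop
 * out of cruise control: silence is treated as a fault (fail safe)
 * rather than as "no news is good news" (fail dangerous). Unsigned
 * subtraction keeps the comparison correct across timer wraparound. */
void engine_tick(EngineState *s, uint32_t now_ms)
{
    if (s->cruise_active &&
        (uint32_t)(now_ms - s->last_heartbeat_ms) > HEARTBEAT_TIMEOUT_MS) {
        s->cruise_active = false;
    }
}
```

With this shape, a bus flooded by a babbling node starves the heartbeat, and the engine computer aborts cruise control on its own without needing any message to get through.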
It used to; I don't know about current-gen RTS's. But back around 2000, an RTS would typically run in a lock-step model. We used fixed-point to guarantee each machine was doing the _exact_ same 3D math, because FPU results could not be relied on to match bit for bit across machines. ANY discrepancy and your game state was boned. I believe at the time this decision was due to the network implementation -- I don't know the exact reason though, since I was doing rendering / optimizations.
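For readers who haven't used fixed-point, a minimal sketch of the idea in C follows. The 16.16 format and the fix16 names are assumptions for illustration; the comment above doesn't say which format was actually used.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical 16.16 fixed-point type: the upper 16 bits hold the
 * integer part, the lower 16 bits hold the fraction. */
typedef int32_t fix16;

#define FIX_ONE ((fix16)65536)  /* 1.0 in 16.16 */

fix16 fix_from_int(int32_t v)
{
    return (fix16)(v * 65536);
}

/* Multiply in 64 bits, then scale back down. Integer division truncates
 * toward zero on every conforming C99 compiler, so every machine gets
 * the same bits. (A right shift of a negative value would be
 * implementation-defined, hence the division.) */
fix16 fix_mul(fix16 a, fix16 b)
{
    return (fix16)(((int64_t)a * (int64_t)b) / 65536);
}

/* Scale the dividend up first so the quotient keeps its fraction bits.
 * The caller must ensure b is nonzero. */
fix16 fix_div(fix16 a, fix16 b)
{
    return (fix16)(((int64_t)a * 65536) / (int64_t)b);
}
```

Because everything is integer arithmetic, there is no rounding mode, no extended precision and no instruction-set difference to worry about, which is the property a lock-step simulation needs.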
You also have to keep in mind the context. Back in 2000 AMD's FPU was beating the pants off Intel's (depending on the operation, by as much as 1000% !!). With Intel having such a slow FPU you didn't rely on it unless you had to. Also, using C's 64-bit 'double' was prohibited for two reasons:
a) the PS2 emulated it IN SOFTWARE !
b) it was horrendously SLOW compared to 32-bit floats.
Game programmers stayed as far away as possible from floats (and especially doubles!) as long as (reasonably) possible. For FPS you were forced to go the float route because while Intel hid the latency of the INT-to-FLOAT casts it was just easier to stay entirely in the float domain. That also opened the door for some clever optimizations like Carmack did with over-lapping the FPU and INT units but that was the rare case.
On PS3 you take a HUGE Load-Hit-Store penalty if you try doing the naive INT32-to-FLOAT32 cast, so fixed point has fallen out of favor for performance reasons.
Err, I thought that floating point was always deterministic, even if it's not full accuracy. Do you have an example otherwise?
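One commonly cited example, sketched in C: each individual floating-point operation is deterministic, but addition is not associative, so anything that reorders a sum (an optimizer allowed to reassociate, or threads combining partial sums in a different order) can change the result. The function names are made up for the example, and the values assume ordinary IEEE 754 doubles.

```c
#include <assert.h>

/* Same three operands, two groupings. */
double sum_left(double a, double b, double c)
{
    return (a + b) + c;
}

double sum_right(double a, double b, double c)
{
    return a + (b + c);
}

/* With a = 1e20, b = -1e20, c = 1.0:
 *   sum_left:  (1e20 + -1e20) is exactly 0, then + 1.0 gives 1.0.
 *   sum_right: (-1e20 + 1.0) rounds back to -1e20 because 1.0 is far
 *              below the precision available at that magnitude, so the
 *              final sum is 0.0. */
```

So the same inputs run through the same arithmetic in a different order give 1.0 one way and 0.0 the other, with no hardware fault involved.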
Massively Huge Systems are likely to be put together properly, run at correct speeds, appropriately cooled, and kept clean. How often did he run tests on home-built systems that have 4 years of dust, inconsistent voltage to the PSU, over-drawn PSUs, insufficient cooling, and chips not completely seated properly? Imagine a room of 100 PC MMORPG gamers. Do you think at least 1 of them has a machine that matches the above description, and that it might produce some unusual errors, particularly when the system is under high load?
~Anguirel (lit. Living Star-Iron)
QA: The art of telling someone that their baby is ugly without getting punched.
From time to time /. puts up stories about commercial code checkers that have all kinds of cool AI built in.
Wouldn't a test of "if this then exit, if not this then exit" be flagged as a possible logic bug when there is more code after that last exit but before function end?
In regards to current gen RTS's: Starcraft 2 works like that. Less bandwidth is the reason, I believe.
Ah, I could see that being one reason. Thanks!