Slashdot Mirror


Whose Bug Is This Anyway?

An anonymous reader writes "Patrick Wyatt, one of the developers behind the original Warcraft and StarCraft games, as well as Diablo and Guild Wars, has a post about some of the bug hunting he's done throughout his career. He covers familiar topics — crunch time leading to stupid mistakes and finding bugs in compilers rather than game code — and shares a story about finding a way to diagnose hardware failure for players of Guild Wars. Quoting: '[Mike O'Brien] wrote a module ("OsStress") which would allocate a block of memory, perform calculations in that memory block, and then compare the results of the calculation to a table of known answers. He encoded this stress-test into the main game loop so that the computer would perform this verification step about 30-50 times per second. On a properly functioning computer this stress test should never fail, but surprisingly we discovered that on about 1% of the computers being used to play Guild Wars it did fail! One percent might not sound like a big deal, but when one million gamers play the game on any given day that means 10,000 would have at least one crash bug. Our programming team could spend weeks researching the bugs for just one day at that rate!'"
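The check described in the quote (allocate a block, compute in it, compare against a table of known answers) can be sketched roughly as below. This is only an illustration and not the actual OsStress code: the names `StressCase`, `kKnownAnswers`, `stress_run_case` and `stress_run_all` are hypothetical, and the calculation (summing 0..n-1, whose answer is n(n-1)/2) was picked only because its expected results are easy to verify by hand.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical sketch of an OsStress-like check; not the Guild Wars code. */
typedef struct {
    size_t   words;         /* size of the scratch block, in 32-bit words */
    uint64_t expected_sum;  /* known answer: words * (words - 1) / 2      */
} StressCase;

/* Table of known answers, worked out ahead of time. */
static const StressCase kKnownAnswers[] = {
    {  256,   32640u },
    { 1024,  523776u },
    { 4096, 8386560u },
};

/* Returns 1 if the computed result matches the known answer,
 * 0 on a mismatch (suspect hardware), -1 if allocation failed. */
int stress_run_case(const StressCase *c)
{
    /* volatile discourages the compiler from optimizing the memory traffic away */
    volatile uint32_t *block = malloc(c->words * sizeof(uint32_t));
    if (block == NULL) {
        return -1;
    }

    for (size_t i = 0; i < c->words; ++i) {
        block[i] = (uint32_t)i;          /* write phase */
    }

    uint64_t sum = 0;
    for (size_t i = 0; i < c->words; ++i) {
        sum += block[i];                 /* read back and compute */
    }

    free((void *)block);
    return sum == c->expected_sum;
}

/* Runs every table entry; returns the number of cases that did not pass. */
int stress_run_all(void)
{
    int failures = 0;
    for (size_t k = 0; k < sizeof kKnownAnswers / sizeof kKnownAnswers[0]; ++k) {
        if (stress_run_case(&kKnownAnswers[k]) != 1) {
            ++failures;
        }
    }
    return failures;
}
```

A game loop would call something like `stress_run_all()` once per frame and report any nonzero result instead of treating the later crash as a software bug.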

9 of 241 comments

  1. The memory thing... by Loopy · · Score: 5, Informative

    ...is pretty much what those of us that build our own systems do anytime we upgrade components (RAM/CPU/MB) or experience unexplained errors. It's similar to running the Prime95 torture tests overnight, which also checks calculations in memory against known data sets for expected values.

    Good stuff for those that don't already have a knack for QA.

    1. Re:The memory thing... by DMUTPeregrine · · Score: 5, Informative

      Unless you're trying to overclock.
      Admittedly that's a small percentage of the populace, even among people who build their own systems.

      --
      Not a sentence!
    2. Re:The memory thing... by Alwin+Henseler · · Score: 5, Informative

      The defect rate on hardware is so low you don't need to - buy your stuff from Newegg, assemble, and install. Either it's DOA or runs forever.

      Look up "bathtub curve" sometime. Even well-built, perfectly working gear is aging, aging usually translates into "reduced performance / reliability", and any electronic part will fail sometime. Possibly gradually. Especially the just-makes-it-past-warranty crap that's sold these days. And there may be instabilities / incompatibilities that only show under very specific conditions (like when a system is pushed really hard).

      That's ignoring things like ambient temperature variations, CPU coolers clogging with dust over the years, sporadic contact problems on connectors, or the odd cosmic ray that nukes a bit in RAM (yes that happens, too). A lot of things must come together to have (and keep) a reliable working computer, so a lot of things can go wrong and put an end to that.

    3. Re:The memory thing... by scheme · · Score: 4, Informative

      Yeah, yeah, yeah - I realize a single person's anecdotal evidence doesn't carry much weight. I wonder what the statistics are though? As AaronLS already pointed out, these tests seem to indicate that my situation isn't very unusual. Components age and wear out.

      Check out "A study of DRAM failures in the field" from the Supercomputing 2012 proceedings. They have some interesting stats based on 5 million DIMM days of operation.

      --
      "When you sit with a nice girl for two hours, it seems like two minutes. When you sit on a hot stove for two minutes, it
    4. Re:The memory thing... by Lonewolf666 · · Score: 4, Informative

      Intel also charges you extra for ECC (only in server processors and mainboards), while AMD supports it in their better desktop processors. You still have to check if the mainboard does support it, though.

      A quick online price check shows that for 8 GByte DDR3 RAM (2 sticks), you might have to pay 20 Euros more for the ECC variety, compared to non-ECC from the same vendor. The more limited choice in mainboards might end up costing you another 10-20 Euros, so let's say +40 Euros to get your AMD PC with ECC RAM.

      On the Intel side, it is more like +50 Euros for a small Xeon instead of a matching i5, +100 Euros for an ECC-capable board and the same +20 for the RAM as with AMD. That makes about +170 Euros to get an Intel with ECC RAM, and was the main reason why my current PC is still an AMD...

      --
      C - the footgun of programming languages
  2. OsStress by larry+bagina · · Score: 5, Informative

    Microsoft found similar impossible bugs when overclocking was involved.

    --
    Do you even lift?

    These aren't the 'roids you're looking for.

  3. Re:I don't believe 1% of computers give wrong answ by godrik · · Score: 4, Informative

    I actually believe it. I am sure they thought about floating-point precision problems, but most likely they only used integers. That's what Prime95 and memtest are doing. Integer and memory operations uncover the most common hardware failures. I have encountered many computers whose hardware turned out to be faulty when stressed. And I am sure Guild Wars was stressful.
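For readers who have not seen an integer memory check, here is a minimal sketch of the write-then-verify idea. It is not the algorithm used by memtest or Prime95; the function name `pattern_test` and the choice of bit patterns are assumptions for illustration. Because it runs in user space, it mostly exercises caches and whatever physical pages the OS happens to hand out.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: writes several bit patterns into a buffer and
 * reads them back. Returns the number of mismatching words, or -1 if
 * the buffer could not be allocated. */
long pattern_test(size_t words)
{
    static const uint32_t patterns[] = {
        0x00000000u, 0xFFFFFFFFu, 0xAAAAAAAAu, 0x55555555u
    };

    /* volatile so the compiler really performs the loads and stores */
    volatile uint32_t *buf = malloc(words * sizeof(uint32_t));
    if (buf == NULL) {
        return -1;
    }

    long errors = 0;
    for (size_t p = 0; p < sizeof patterns / sizeof patterns[0]; ++p) {
        /* Mix the index in so neighbouring words hold different values. */
        for (size_t i = 0; i < words; ++i) {
            buf[i] = patterns[p] ^ (uint32_t)i;
        }
        for (size_t i = 0; i < words; ++i) {
            if (buf[i] != (patterns[p] ^ (uint32_t)i)) {
                ++errors;
            }
        }
    }

    free((void *)buf);
    return errors;
}
```

On healthy hardware `pattern_test(65536)` should return 0; any positive count means a word read back differently from what was written.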

  4. Re:Reminded me of my first C application by safetyinnumbers · · Score: 4, Informative

    That's known as "Yoda style"
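A small sketch of what "Yoda style" buys you in C, with hypothetical function names. Putting the constant on the left means that mistyping `==` as `=` becomes a compile error (`1 = i` is not a valid assignment), whereas the conventional order silently compiles as an assignment.

```c
#include <stdbool.h>

/* The classic typo: this assigns 1 to i, then tests i, so the branch is
 * always taken. The extra parentheses only silence the compiler warning
 * that would otherwise flag the mistake. */
bool is_one_buggy(int i)
{
    if ((i = 1)) {
        return true;
    }
    return false;
}

/* Yoda style: constant first. Had this been mistyped as `1 = i`,
 * the compiler would reject it instead of producing a wrong program. */
bool is_one_yoda(int i)
{
    if (1 == i) {
        return true;
    }
    return false;
}
```

Calling `is_one_buggy(0)` returns true, while `is_one_yoda(0)` returns false. Modern compilers also warn about an unparenthesized `if (i = 1)` when warnings are enabled, so the style is partly a matter of taste.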

  5. Re:Reminded me of my first C application by richardcavell · · Score: 5, Informative

    I just want to correct this, not to prove how smart I am but because there are novice programmers out there who will learn from this case. The statement:

    if (i = 1) {

    is equivalent to:

    i = 1; /* correction */
    if (i) {