Slashdot Mirror


Whose Bug Is This Anyway?

An anonymous reader writes "Patrick Wyatt, one of the developers behind the original Warcraft and StarCraft games, as well as Diablo and Guild Wars, has a post about some of the bug hunting he's done throughout his career. He covers familiar topics — crunch time leading to stupid mistakes and finding bugs in compilers rather than game code — and shares a story about finding a way to diagnose hardware failure for players of Guild Wars. Quoting: '[Mike O'Brien] wrote a module ("OsStress") which would allocate a block of memory, perform calculations in that memory block, and then compare the results of the calculation to a table of known answers. He encoded this stress-test into the main game loop so that the computer would perform this verification step about 30-50 times per second. On a properly functioning computer this stress test should never fail, but surprisingly we discovered that on about 1% of the computers being used to play Guild Wars it did fail! One percent might not sound like a big deal, but when one million gamers play the game on any given day that means 10,000 would have at least one crash bug. Our programming team could spend weeks researching the bugs for just one day at that rate!'"
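
The stress test described in the quote can be sketched roughly as follows. This is a minimal illustration, not the actual OsStress module: the function names, the buffer size, and the particular calculation (a CRC-32 over a deterministic byte pattern) are all assumptions made for the example.

```python
import zlib

BUFFER_SIZE = 4 * 1024  # assumed block size; the original size is not stated


def fill_pattern(buf: bytearray, seed: int) -> None:
    """Fill the buffer with a deterministic pseudo-random byte pattern."""
    state = seed & 0xFFFFFFFF
    for i in range(len(buf)):
        # Simple 32-bit linear congruential generator (illustrative only).
        state = (state * 1664525 + 1013904223) & 0xFFFFFFFF
        buf[i] = state >> 24


def compute_checksum(seed: int) -> int:
    """Allocate a block, do the calculation in it, and checksum the result."""
    buf = bytearray(BUFFER_SIZE)
    fill_pattern(buf, seed)
    return zlib.crc32(buf)


def build_known_answers(seeds):
    """Build the answer table. A real test would ship this table precomputed
    on a trusted machine instead of generating it on the machine under test."""
    return {seed: compute_checksum(seed) for seed in seeds}


def stress_check(known_answers) -> bool:
    """Return True if every recomputed checksum matches the known answer."""
    return all(compute_checksum(seed) == expected
               for seed, expected in known_answers.items())
```

In a game, something like `stress_check` would presumably run from the main loop, and crash reports from any machine where it ever returns False would be treated as suspect.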

241 comments

  1. The memory thing... by Loopy · · Score: 5, Informative

    ...is pretty much what those of us that build our own systems do anytime we upgrade components (RAM/CPU/MB) or experience unexplained errors. It's similar to running the Prime95 torture tests overnight, which also checks calculations in memory against known data sets for expected values.

    Good stuff for those that don't already have a knack for QA.

    1. Re:The memory thing... by Gothmolly · · Score: 0, Troll

      Nobody does that anymore. The defect rate on hardware is so low you don't need to - buy your stuff from Newegg, assemble, and install. Either it's DOA or runs forever.

      --
      I want to delete my account but Slashdot doesn't allow it.
    2. Re:The memory thing... by AaronLS · · Score: 5, Interesting

      "The defect rate on hardware is so low you don't need to"

      I think the point of the article is to cast significant doubt on statements like this.

    3. Re:The memory thing... by DMUTPeregrine · · Score: 5, Informative

      Unless you're trying to overclock.
      Admittedly that's a small percentage of the populace, even among people who build their own systems.

      --
      Not a sentence!
    4. Re:The memory thing... by Runaway1956 · · Score: 3, Informative

      " Either it's DOA or runs forever."

      Nonsense. I bought 8 gig of memory about 4 years ago, for an Opteron rig. That computer recently started having serious problems, with corrupted data and crashing. I looked at all the other components first, then finally ran memory tests. Memtest failed immediately. I removed three modules and ran memtest again; it failed immediately. Replaced with another module, memtest ran for a while, then failed. The other two modules proved to be good, so I am now running that aging Opteron with 4 gig of memory.

      Yeah, yeah, yeah - I realize a single person's anecdotal evidence doesn't carry much weight. I wonder what the statistics are though? As AaronLS already pointed out, these tests seem to indicate that my situation isn't very unusual. Components age and wear out.

      --
      "Windows is like the faint smell of piss in a subway: it's there, and there's nothing you can do about it." - Charlie Br
    5. Re:The memory thing... by Alwin+Henseler · · Score: 5, Informative

      The defect rate on hardware is so low you don't need to - buy your stuff from Newegg, assemble, and install. Either it's DOA or runs forever.

      Look up "bathtub curve" sometime. Even well-built, perfectly working gear is aging, aging usually translates into "reduced performance / reliability", and any electronic part will fail sometime. Possibly gradually. Especially the just-makes-it-past-warranty crap that's sold these days. And there may be instabilities / incompatibilities that only show under very specific conditions (like when a system is pushed really hard).

      That's ignoring things like ambient temperature variations, CPU coolers clogging with dust over the years, sporadic contact problems on connectors, or the odd cosmic ray that nukes a bit in RAM (yes that happens, too). A lot of things must come together to have (and keep) a reliable working computer, so a lot of things can go wrong and put an end to that.
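
      For anyone who does look it up: one common way to model a bathtub curve is as the sum of three failure-rate terms, namely a falling "infant mortality" term, a constant random-failure term, and a rising wear-out term. The sketch below uses Weibull hazard functions for the first and last. Every parameter value here is made up purely to get the shape; none of them describe real hardware.

```python
def weibull_hazard(t: float, shape: float, scale: float) -> float:
    """Weibull hazard rate h(t) = (shape / scale) * (t / scale) ** (shape - 1).

    Requires t > 0 when shape < 1, because the rate diverges at t = 0.
    """
    return (shape / scale) * (t / scale) ** (shape - 1)


def bathtub_hazard(t_years: float) -> float:
    """Illustrative failure rate (failures per unit-year) at age t_years."""
    infant = weibull_hazard(t_years, shape=0.5, scale=10.0)   # falls with age
    random_failures = 0.02                                    # constant floor
    wear_out = weibull_hazard(t_years, shape=5.0, scale=8.0)  # rises with age
    return infant + random_failures + wear_out


for age in (0.1, 1.0, 3.0, 6.0, 10.0):
    print(f"age {age:>4} years: hazard {bathtub_hazard(age):.3f}")
```

      With these made-up parameters the rate starts high, bottoms out after a few years, and climbs again as wear-out takes over, which is the "bathtub" shape.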

    6. Re:The memory thing... by scheme · · Score: 1

      That's not true. There was a recent paper looking at memory defects and causes on the Jaguar supercomputer, and memory errors were moderately common. Just as surprisingly, there were errors where a single DIMM going bad would cause errors for all the DIMMs on that channel.

      So, memory does go bad and it does that more frequently than you'd expect.

      --
      "When you sit with a nice girl for two hours, it seems like two minutes. When you sit on a hot stove for two minutes, it
    7. Re:The memory thing... by scheme · · Score: 4, Informative

      Yeah, yeah, yeah - I realize a single person's anecdotal evidence doesn't carry much weight. I wonder what the statistics are though? As AaronLS already pointed out, these tests seem to indicate that my situation isn't very unusual. Components age and wear out.

      Check out "A study of DRAM failures in the field" from the supercomputing 2012 proceedings. They have some interesting stats based on 5 million DIMM days of operation.

      --
      "When you sit with a nice girl for two hours, it seems like two minutes. When you sit on a hot stove for two minutes, it
    8. Re:The memory thing... by Anonymous Coward · · Score: 0

      Add in that hardware can fail in strange ways. Years and years ago I had a client's computer that would run Windows for a little while and then suddenly crash. Running diagnostics on it didn't show anything wrong unless the diagnostics ran long enough. The diagnostics thoroughly checked the operation of each motherboard component one after the other. Running the diagnostics briefly didn't turn up a damn thing. Only by letting them run for hours did the cause of the crashes make itself known. One of the interrupt controllers would suddenly start failing after 30000 or so tests. It took hours for the diagnostics to hit the interrupt controller enough to make it fail. When running Windows of course the interrupt controller got to the point of failure a whole heck of a lot faster.

    9. Re:The memory thing... by TubeSteak · · Score: 3, Interesting

      Especially the just-makes-it-past-warranty crap that's sold these days.

      Actually, to get 95% of your product past the warranty period, you have to overengineer because, statistically, some of your product will fail earlier than you expect.

      So if you have a 3 year warranty, you better be engineering for 4+ years or you're going to spend a lot on replacements for the near end of the bathtub curve.

      I've had an unfortunate amount of experience with made in china crap that's ended up being replaced a few times within the warranty period.

      --
      [Fuck Beta]
      o0t!
    10. Re:The memory thing... by DigiShaman · · Score: 5, Interesting

      I think it's a crying shame that the PC industry hasn't forced ECC as a mandatory standard. Servers and workstations have it, and with memory as cheap as it is to fab, there's absolutely -zero- excuse not to use ECC!!! With the transistor count as densely packed and small, errors will occur. I'll go a step further and even recommend ECC throughout the entire motherboard bridge buses. End-to-end error correction should be a requirement!

      --
      Life is not for the lazy.
    11. Re:The memory thing... by Anonymous Coward · · Score: 1

      Paywall.

      I thought you were going to talk about the google study because it's invalid since they are using low grade RAM.

    12. Re:The memory thing... by Sir_Sri · · Score: 3, Informative

      Even if you have a small calculation failure rate, it's not practical for an end user to recognize that as a partial hardware failure rather than a software bug.

      From the perspective of the average user, yes, it either works or it doesn't. If you use something big (like wow/guildwars or the like) and they can diagnose it for you then you might have an argument. But even then, 1% could be overclocking or, as the author of TFA says, heat or PSU undersupply issues. That's not 'defective' hardware, that's temperamental hardware or the user doing it wrong. And because it's rare it's not necessarily serious, most users can handle the odd application crash in something like an MMO once every few days.

      It does mean a bug hunter needs to know what is happening though.

    13. Re:The memory thing... by phantomfive · · Score: 1
      --
      "First they came for the slanderers and i said nothing."
    14. Re:The memory thing... by Greyfox · · Score: 3, Insightful

      Heh, back in the day when I was doing OS/2 phone support, I had a customer call up with a trap zero during install. Now I'd seen a lot of odd shit during an OS/2 install, but I'd never seen a trap zero. Turns out that was a divide by zero error. Fucker made me start filling out the paperwork to send him to level 2 before admitting that he was trying to overclock his processor. If memory serves me correctly (which it might not, nearly two decades later) he was trying to go from 8 MHz to 20 MHz, and was also getting a lot of crashes in DOS and DOS applications. I told him that was probably what his problem was and if I tried to send this on to level 2 it'd be rejected with a "Don't do that," so I was just going to save him some time and tell him "don't do that" now.

      --

      I'm trying to teach myself to set people on fire with my mind... Is it hot in here?

    15. Re:The memory thing... by Animats · · Score: 3, Interesting

      "The defect rate on hardware is so low you don't need to" I think the point of the article is to cast significant doubt on statements like this.

      Right. Google assumes their server hardware (which is cheap, not good) is flaky, and designs their software to deal with that. I've heard a Google engineer say that if they sort a terabyte twice, they get two different results.

    16. Re:The memory thing... by afidel · · Score: 3, Insightful

      Nah, that's pretty typical. In fact ram is the only component other than HDD's to have a statistically significant AFR in my datacenter. At the peak I had a bit over 200 servers and we'd have a DIMM go bad about once every other month (so say 6 of 1200 DIMMs per year). Heck with my Proliants the fans and PSUs were more reliable as we've only lost a handful of each over the last 6 years.
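
      Spelling out the arithmetic behind those numbers (the function name is just for illustration, and the figures are the rough ones quoted above):

```python
def annualized_failure_rate(failures_per_year: float, units_in_service: int) -> float:
    """Fraction of the installed units that fail in a year."""
    return failures_per_year / units_in_service


# One DIMM failure every other month is about 6 per year, across ~1200 DIMMs.
failures_per_year = 12 / 2
afr = annualized_failure_rate(failures_per_year, 1200)
print(f"DIMM AFR: {afr:.2%}")
```

      That works out to roughly half a percent of DIMMs per year.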

      --
      There are 4 boxes to use in the defense of liberty: soap, ballot, jury, ammo. Use in that order. Starting now.
    17. Re:The memory thing... by Ritchie70 · · Score: 0

      I don't know what the software architecture of these games looks like, but there is an alternative to faulty hardware tripping this test.

      It's another execution unit (thread or similar) running rampant through memory.

      I've "fixed" bugs in cooperative multitasking software by putting big buffers around variables that were getting stomped on. Usually the buffers got removed before the software was considered "done." But not always.
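
      A rough simulation of that padding trick, using a bytearray to stand in for raw memory. The class name, sizes, and fill byte are assumptions for the example, and the integrity check is an extra: the padding described above only absorbs stray writes, whereas filling it with a known pattern also lets you notice them.

```python
GUARD_SIZE = 16    # assumed padding size on each side of the variable
GUARD_BYTE = 0xA5  # assumed fill pattern for the padding


class GuardedSlot:
    """A value stored in a simulated memory arena with padding on both sides."""

    def __init__(self, size: int):
        self.size = size
        # Layout: [front guard][value bytes][back guard]
        self.arena = bytearray([GUARD_BYTE]) * (size + 2 * GUARD_SIZE)
        self.start = GUARD_SIZE
        self.arena[self.start:self.start + size] = bytes(size)  # zero the value

    def value(self) -> bytes:
        """Return a copy of the protected value."""
        return bytes(self.arena[self.start:self.start + self.size])

    def guards_intact(self) -> bool:
        """True if nothing has written into either guard region."""
        front = self.arena[:self.start]
        back = self.arena[self.start + self.size:]
        return all(b == GUARD_BYTE for b in front + back)


slot = GuardedSlot(4)
slot.arena[slot.start + slot.size] = 0x00  # simulate a write one byte past the end
print(slot.value(), slot.guards_intact())
```

      The stray write lands in the padding, so the value itself survives, and the changed guard byte shows that something is still running off the end.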

      --
      The preferred solution is to not have a problem.
    18. Re:The memory thing... by afidel · · Score: 1

      I don't think any modern bus would survive without ECC. Heck Intel even does ECC on the cache lines inside the processor these days (a feature brought down from the Itanium to the Xeons for the Nehalem generation, check out RAS for more info). The more interesting idea to me is T10-DIF which allows ECC from disk to application and back, I'm kind of surprised it hasn't taken off.

      --
      There are 4 boxes to use in the defense of liberty: soap, ballot, jury, ammo. Use in that order. Starting now.
    19. Re:The memory thing... by adolf · · Score: 3, Insightful

      Especially the just-makes-it-past-warranty crap that's sold these days.

      I've been hearing this for the entirety of my worldly awareness (several decades), and the song remains the same.

      Eventually, I'd hoped that folks would realize that they were unlucky or were just buying garbage, instead of insipidly assuming that such-and-such widget was so perfectly constructed and planned that it failed within hours/days of the warranty expiring -- just as designed.

      The truth is that no matter what the nature of the item, or the term of the limited warranty: Given sufficient quantity, some of them are going to fail mere seconds after the warranty is gone.

      Such as it is.

      We all want everything we buy to work perfectly and last forever, but nothing ever does. It should be no surprise that this is not the result of any conspiracy, but just life. Things wear out. (Even DIMMs.)

    20. Re:The memory thing... by adolf · · Score: 1

      Zero?

      Adding a bit-per-byte for ECC means multiplying the number of bits required by 1.125.

      Not zero.

      *ahem*

    21. Re:The memory thing... by DarwinSurvivor · · Score: 1

      Tell that to the DIMM stick I sent back to System76 after it kept crashing my desktop.

    22. Re:The memory thing... by magic+maverick+ · · Score: 1

      Meh, I bought a two gig stick of RAM about 3 years ago. It failed within the three years. If I was still in the country where I'd bought it (a place with decent consumer protection laws) I would have taken it back and complained.
      It caused some real trouble because my computer had been really stable, no crashes, etc. Ubuntu 10.04 (which I wish was going to be supported longer and then I'd still use it; 12.04 is much more unstable). What really confused me was that I'd just a few weeks before it failed run MemTest+ and turned up no errors. I didn't think RAM would be my problem.

      TLDR: RAM can and does die. Live in a country with good consumer protection laws and make use of them.

      --
      HELP MY ACCOUNT HAS BEEN HACKED BY AN ILLIBERAL ART STUDENT SET TO DESTROY THE INTERWEBZ!
    23. Re:The memory thing... by sjames · · Score: 1

      From the developer's standpoint, it most certainly IS defective hardware. The computer as a whole does not produce the results that it is supposed to produce. No bug report from that machine can be trusted because it executed something similar to but not exactly the software being reported.

    24. Re:The memory thing... by Anonymous Coward · · Score: 1

      Reminds me of the old Win95/NT4 days: lots of hardware had been running Win95 reasonably OK for a long time, but when you tried to install NT4, the installation wouldn't even get past the memory check.

    25. Re:The memory thing... by Impy+the+Impiuos+Imp · · Score: 3, Insightful

      Open computer and blow it out with a leaf blower every six months. Solves 80% of your boot problems, no need to reinstall or re-seat components.

      --
      (-1: Post disagrees with my already-settled worldview) is not a valid mod option.
    26. Re:The memory thing... by dackroyd · · Score: 1

      Any chance of providing a link? The site http://sc12.supercomputing.org/ doesn't appear to have downloadable papers.

      --
      "Free software as in beer, copy protection as in racket" - Telsa Gwynne
    27. Re:The memory thing... by FireFury03 · · Score: 3, Interesting

      Look up "bathtub curve" sometime.

      This is exactly why I cringe when I hear people saying "we need to replace that hardware because it's been running for a few years now so might fail soon" - the chances of your brand new hardware going pop are often far higher than the tired old hardware. Eventually the old kit will of course die, but in my experience that is far further into the future than most people imagine.

      I've not quite figured out the optimal hardware replacement frequency, but I tend to think that for servers (excluding the hard drives) the time you want to replace it is largely when it is no longer powerful enough to do what you want, rather than because it's a bit old and creaky and you're worried it might break.

      Hard drives, on the other hand, seem to break with reasonable frequency whatever their age, so usually I just run them (in a RAID) until they either give up, or SMART tells me they are reallocating large numbers of sectors, rather than trying to preemptively replace them.

    28. Re:The memory thing... by V+for+Vendetta · · Score: 1

      Eventually, I'd hoped that folks would realize that they were unlucky or were just buying garbage, instead of insipidly assuming that such-and-such widget was so perfectly constructed and planned that it failed within hours/days of the warranty expiring -- just as designed.

      While I agree that there's a lot of conspiracy craziness out there, it's also not uncommon to run into "hard-coded" end-of-life events. In IT, printer manufacturers have been shown more than once to make printers deliberately fail after <event> (such as a number of pages printed). It's also practiced elsewhere. See Planned Obsolescence.

    29. Re:The memory thing... by Anonymous Coward · · Score: 0

      I've had an unfortunate amount of experience with made in china crap that's ended up being replaced a few times within the warranty period.

      Just a hunch, but was much of that the capacitors on various boards? Start to bulge and leak, before the hardware just stops working? Got hit by that years ago on several bits of hardware. I believe there was some skullduggery with some Chinese/Taiwanese manufacturers, one jumping ship to another company and stealing a formula... and the formula was bad, thus leaving us with shoddy, broken equipment.

    30. Re:The memory thing... by mcgrew · · Score: 3, Informative

      My experience goes along with this. A few times I've had dual-boot computers constantly crashing on the Windows side, so I was blaming MS for their buggy software -- until the flaky hardware that made Windows flaky failed completely. Turns out that Linux is simply far more hardware fault-tolerant than Windows, rather than Windows being a bug-ridden piece of shit.

    31. Re:The memory thing... by Lonewolf666 · · Score: 4, Informative

      Intel also charges you extra for ECC (only in server processors and mainboards), while AMD supports it in their better desktop processors. You still have to check if the mainboard does support it, though.

      A quick online price check shows that for 8 GByte DDR3 RAM (2 sticks), you might have to pay 20 Euros more for the ECC variety, compared to non-ECC from the same vendor. The more limited choice in mainboards might end up costing you another 10-20 Euros, so let's say +40 Euros to get your AMD PC with ECC RAM.

      On the Intel side, it is more like +50 Euros for a small Xeon instead of a matching i5, +100 Euros for an ECC-capable board and the same +20 for the RAM as with AMD. That makes about +170 Euros to get an Intel with ECC RAM, and was the main reason why my current PC is still an AMD...

      --
      C - the footgun of programming languages
    32. Re:The memory thing... by mgbastard · · Score: 3, Informative

      Without the paywall: The study was performed on the Jaguar Supercomputer at Oak Ridge National Laboratories http://softerrors.info/selse/images/selse_2012/Papers/selse2012_submission_4.pdf

      --
      Anyone seen my low uid? last seen 10 years ago while panning the #@$# out of Taco's 'web based discussion system'
    33. Re:The memory thing... by Anonymous Coward · · Score: 1

      I hit one of those. Thankfully I found a youtube video on how to reset it (without downloading skeevy software) with about 10 packets sent from the Windows telnet program, including the opening and closing handshakes that told the printer I had replaced three $20 paper sensors. (Yep, $20 for each of the sensors, and that doesn't include what they wanted to charge me to ship the printer to them and take the time to replace them.) So, just by telling the printer I replaced them, they magically started working again.

    34. Re:The memory thing... by Hatta · · Score: 1

      instead of insipidly assuming that such-and-such widget was so perfectly constructed and planned that it failed within hours/days of the warranty expiring -- just as designed

      I'd imagine warranty lengths are designed around the known failure rates of your hardware, instead of the other way around.

      --
      Give me Classic Slashdot or give me death!
    35. Re:The memory thing... by drkoemans · · Score: 2

      Look up "bathtub curve" sometime.

      This is exactly why I cringe when I hear people saying "we need to replace that hardware because it's been running for a few years now so might fail soon" - the chances of your brand new hardware going pop are often far higher than the tired old hardware. Eventually the old kit will of course die, but in my experience that is far further into the future than most people imagine.

      I think that could be taken as generally true, particularly with RAM and CPU but I'm still seeing fallout from the electrolytic fluid/capacitor debacle from the 2000s. Power supplies and main/daughter boards are still failing unpredictably in older hardware. Even newer equipment has suffered as "new old stock" components were integrated in equipment manufactured after 2006. The tolerance of the capacitors was close enough to get them past the initial failure period but off enough to eventually cause problems.
      http://en.wikipedia.org/wiki/Capacitor_plague

    36. Re:The memory thing... by Anonymous Coward · · Score: 0

      Look up "bathtub curve" sometime.

      This is exactly why I cringe when I hear people saying "we need to replace that hardware because it's been running for a few years now so might fail soon" - the chances of your brand new hardware going pop are often far higher than the tired old hardware. Eventually the old kit will of course die, but in my experience that is far further into the future than most people imagine.

      I've not quite figured out the optimal hardware replacement frequency, but I tend to think that for servers (excluding the hard drives) the time you want to replace it is largely when it is no longer powerful enough to do what you want, rather than because it's a bit old and creaky and you're worried it might break.

      Hard drives, on the other hand, seem to break with reasonable frequency whatever their age, so usually I just run them (in a RAID) until they either give up, or SMART tells me they are reallocating large numbers of sectors, rather than trying to preemptively replace them.

      So...? At the tail end of the bathtub curve you could be out of warranty, at the beginning you are under warranty. The stuff's going to fail eventually, you're better off keeping it under warranty.

    37. Re:The memory thing... by barjam · · Score: 1

      My experience has always been the opposite. Linux for me has been far less tolerant of bad hardware than windows but honestly both have done well for me over the years.

    38. Re:The memory thing... by jandrese · · Score: 2

      For me the biggest source of failure in semi-modern equipment has been capacitor failure. You know when you open a box and see those bulging capacitors, sometimes with brown gunk seeping out of the top, that someone tried to save a few cents and is making you pay the price. This problem was supposed to have been fixed years ago, but it still keeps happening to random equipment for me. I lost a pair of low-high end speakers (M-Audio AV30s) that were manufactured in 2010 to bad caps; worse, they covered the whole board in a thick coat (half an inch thick!) of epoxy, making repair nigh impossible.

      --

      I read the internet for the articles.
    39. Re:The memory thing... by AaronLS · · Score: 1

      "But even then, 1% could be overclocking or, as the author of TFA says, heat or PSU undersupply issues. That's not 'defective' hardware, that's temperamental hardware or the user doing it wrong."

      The hardware produced the wrong results. That is the very definition of a defect. The concept of who is at fault is completely orthogonal to whether or not the hardware is functioning properly. No, it's not a manufacturing defect, not the kind of defect that would give you a right to RMA, but it is still operating in a defective state. From the perspective of the person researching the crash report, they are wanting to know "Is it my software, or their hardware?" Once they determine it's the hardware, that's the end of the line for them. It isn't the software engineer's job to go further down the rabbit hole and help the user test his PSU, or determine if overclocking is the cause. They are not doing support. Their goal is to determine if there is a bug that is causing the crash, and as a first step in that, rule out other possibilities.

      "And because it's rare it's not necessarily serious, most users can handle the odd application crash in something like an MMO once every few days."
      From the perspective of a developer, if I have a bug that causes a crash for all users every few days, that is a huge issue and worth investigating.
      1) The amount of work that goes into eliminating that bug probably would be very small in comparison to the thousands of minutes that would accumulate in users' wasted time, restarting the app and getting back to where they left off. If it is indeed a legitimate bug, and is affecting all users every few days, that is a huge amount of time and frustration that accumulates over the lifetime of that application.
      2) If you are lazy about investigating such a bug, as the app evolves over time and new bugs crop up unfixed, you will slowly accumulate many such bugs over time. Eventually your game/app will be one of those that is well known for random crashes. The job of pinpointing the cause will be more difficult. If you investigate bugs as they are discovered, you can focus on recently changed code as a potential cause. Additionally, as you collect crash logs, it will be hard to focus on a single bug, because you could have three different bugs causing crashes, and so trying to find commonality between crash reports will be very difficult.

      Some shops are very ignorant of how buggy their applications are. They don't realize that for every person that reports a bug, there are many more experiencing that bug. The kind of people who take the time to email/file a bug report are pretty rare.

    40. Re:The memory thing... by Anonymous Coward · · Score: 0

      I think there would be a lot more interest if there were a technology that could multiply the number of bits required by zero.

    41. Re:The memory thing... by Anonymous Coward · · Score: 0

      Zero?

      Adding a bit-per-byte for ECC means multiplying the number of bits required by 1.125.

      Not zero.

      *ahem*

      WRONG!

      One bit per byte is not ECC - it's parity. That lets you detect an error, but ECC stands for Error CORRECTING Code. It requires (from memory) 3 bits per byte if you want to be able to correct all one-bit errors (as a by-product it lets you detect all two-bit errors).

    42. Re:The memory thing... by DrJimbo · · Score: 1

      ... ECC stands for Error CORRECTING Code. It requires (from memory) 3 bits per byte if you want to be able to correct all one-bit errors (as a by-product it lets you detect all two-bit errors).

      It depends on what your block size is. You are assuming an 8-bit block size. ECC RAM uses a 64-bit block size and requires just one extra bit per byte (nine bits per byte instead of the standard 8) just like the GP said.
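
      The block-size dependence can be checked with the usual Hamming bound: correcting any single-bit error in k data bits needs the smallest r with 2^r >= k + r + 1 check bits, and one more overall parity bit adds double-error detection (SECDED). A minimal sketch follows; the function names are mine, and it assumes a plain Hamming-style code rather than whatever a particular memory controller actually implements.

```python
def hamming_check_bits(data_bits: int) -> int:
    """Smallest r with 2**r >= data_bits + r + 1 (single-error correction)."""
    r = 0
    while 2 ** r < data_bits + r + 1:
        r += 1
    return r


def secded_check_bits(data_bits: int) -> int:
    """Single-error-correcting check bits plus one overall parity bit."""
    return hamming_check_bits(data_bits) + 1


for k in (8, 64):
    total = k + secded_check_bits(k)
    print(f"{k} data bits: {hamming_check_bits(k)} check bits to correct, "
          f"{secded_check_bits(k)} with double-error detection, "
          f"overhead factor {total / k}")
```

      By this bound an 8-bit block needs 4 check bits (5 with double-error detection), while a 64-bit block needs 7 (8 with double-error detection). That gives 72 bits stored per 64 bits of data, which is the 1.125 factor mentioned upthread.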

      --
      We don't see the world as it is, we see it as we are.
      -- Anais Nin
    43. Re:The memory thing... by Anonymous Coward · · Score: 0

      wtf is it not a serious problem when the engineers at Google are getting inconsistent data from the same HDD

  2. Wait its possible?! by Anonymous Coward · · Score: 5, Funny

    You mean all those times when my code was 'fine' and i gave up it really could have been the compiler or a memory problem

    shit i'm a much better programmer than i realized

    1. Re:Wait its possible?! by meerling · · Score: 1

      I started bringing my personal laptop to my programming classes for a simple reason. About 20% (seemed like 65%, but that's probably just a trick of memory) of the class computers had their compilers borked by another student at any particular time. You had no idea that somebody had put in a weird setting in the compiler, or had just outright broke something, until after you'd done way too much troubleshooting. I found I got a whole lot more done on my personal box that nobody else could mess up. :)

    2. Re:Wait its possible?! by Anonymous Coward · · Score: 0

      No, he's saying that about 1% of your users could have issues that aren't related to the program, and crash reports from said systems are likely suspect, and should be ignored.

    3. Re:Wait its possible?! by hazah · · Score: 2

      Are you talking about actual compilers or the IDE?

    4. Re:Wait its possible?! by Anonymous Coward · · Score: 0

      How is that possible? Students shouldn't have sufficient privileges to change anything on the computer. Also, the computers should be resetting themselves at least every night, if not at every log in. Your school has incompetent IT staff.

    5. Re:Wait its possible?! by disambiguated · · Score: 5, Insightful

      You're a better programmer for assuming it's not a compiler bug and trying harder to figure out what you did wrong.

      I've been programming professionally for over 20 years, mostly in C/C++ (MSVC, GCC, and recently Clang (and others back in the olden days)). I've seen maybe two serious compiler bugs in the past 10 years. They used to be common.

      On the other hand, I can't count how many times I've seen coders insist there must be a compiler bug when after investigation, the compiler had done exactly what it should according to the standard (or according to the compiler vendor's documentation when the compiler intentionally deviated from the standard).

      By "serious", I mean the compiler itself doesn't crash, issues no warnings or errors, but generates incorrect code. Maybe I've just been lucky. (Or maybe QA just never found them ;-)

      Oh, and btw, yes I realize you were joking (and I found it funny.)

    6. Re:Wait its possible?! by Anonymous Coward · · Score: 0

      Students shouldn't have sufficient privileges to change anything on the computer.

      Yes they should. They can do as much damage with a shared user (which I imagine was the case) as with root. And what if they want to install something, like valgrind, vim or clang? It's much easier to do "sudo aptitude install something-with-way-too-many-dependencies" than hunting every single dependency. Sure public machines may be running keylogger software, but if you trust a public machine then you're already screwed one way or the other (how can you be sure it isn't running a hardware keylogger, for instance?).

    7. Re:Wait its possible?! by caseih · · Score: 2

      Wow you had really crappy computer installations to work with. In my labs, gcc was in /usr/bin and I didn't have any write permission to that directory at all to mess things up. gcc and make just always worked for me.

    8. Re:Wait its possible?! by tlhIngan · · Score: 3, Interesting

      By "serious", I mean the compiler itself doesn't crash, issues no warnings or errors, but generates incorrect code. Maybe I've just been lucky. (Or maybe QA just never found them ;-)

      I saw this once - took me weeks to solve it. Basically I had a flash driver that would occasionally erase the boot block (bad!). It was odd because we had protected the boot block both in the higher level OS as well as the code itself.

      Well, it happened and I ended up tracing through the assembly code - it turned out the optimizer worked a bit TOO well - it completely optimized out a macro call used to translate between parameters. (The function to erase the block required a sector number, but the OS called with the block number, so a simple multiplication was needed to convert.) End result: the checks worked fine, but because the multiplication never happened, it erased the wrong block. (The erase code erased the block a sector belonged to, so sectors 0, 1, ... NUM_SECTORS_PER_BLOCK-1 all erased the first block.)

      A little #pragma to disable optimizations on that one function and the bug was fixed.
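
      A minimal sketch of the failure mode described above, not the original driver. The geometry (8 sectors per block), the boot block being block 0, and all function names are assumptions made for illustration. It shows why the protection check can pass while the erase still lands on the boot block once the block-to-sector multiplication is skipped:

```python
SECTORS_PER_BLOCK = 8   # hypothetical flash geometry
BOOT_BLOCK = 0          # assumption: the boot block is block 0


def block_to_sector(block):
    """First sector of a block: the multiplication the optimizer dropped."""
    return block * SECTORS_PER_BLOCK


def block_containing(sector):
    """Block that the erase routine wipes for a given sector number."""
    return sector // SECTORS_PER_BLOCK


def erase_target(block, convert=True):
    """Return the block that would actually be erased for a request."""
    # The protection check sees the correct block number and passes.
    if block == BOOT_BLOCK:
        raise ValueError("refusing to erase the boot block")
    # With convert=False the block number is passed through as a sector.
    sector = block_to_sector(block) if convert else block
    return block_containing(sector)


print(erase_target(3))                 # 3: the intended block
print(erase_target(3, convert=False))  # 0: the boot block
```

      With these assumed numbers, any requested block below SECTORS_PER_BLOCK ends up erasing block 0 when the conversion is missing.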

    9. Re:Wait its possible?! by Anonymous Coward · · Score: 0

      I've been programming professionally for over 20 years, mostly in C/C++ (MSVC, GCC, recently Clang, and others back in the olden days). I've seen maybe two serious compiler bugs in the past 10 years. They used to be common.

      It must be nice to not work with microcontrollers.

      "The compiler output was fine, it turned out to be a bug in the CPU."

    10. Re:Wait its possible?! by Paradise+Pete · · Score: 3, Insightful

      Your school has incompetent IT staff.

      Welcome to planet Earth. If your species expects competence in its dealings with humans you should have done more research before landing. Didn't you get those episodes of I Love Lucy we kept sending you?

    11. Re:Wait its possible?! by mcvos · · Score: 1

      Interpreting that first bug as a compiler problem is incomprehensible to me. Even if you don't see right away what a bunch of `if-then return` statements do, surely you notice how it works the first time you walk through it?

      The first time I read it, I assumed that the bug was the exact opposite: that it did reach the line of code it wasn't supposed to reach, which can only happen if the value of `UnitIsHarvester(unit)` changes during the execution of that code. That would have been an interesting bug. As it is, it smells of incompetence and bad practice. (But then, 12-hour days are bad practice. Horrid code is probably to be expected.)

    12. Re:Wait its possible?! by Megane · · Score: 2

      I've had optimizer problems that caused SPI transfers to break (resulting in bricked SD cards!), probably due to optimizing out a wait loop, and another that replaced a call to a delay subroutine with a branch-to-self instruction. (FWIW: It was A commeRcial cross-compiler, not gcc or LLVM)

      --
      #naabhaprzrag, #sverubfr-000, #agi-fcbafberq, negvpyr[pynff*=' negvpyr-ary-'] { qvfcynl: abar !vzcbegnag; }
    13. Re:Wait its possible?! by PhxBlue · · Score: 1

      Students shouldn't have sufficient privileges to change anything on the computer. Also, the computers should be resetting themselves at least every night, if not at every log in. Your school has incompetent IT staff.

      Schools aren't exactly bursting with excess cash to hire competent IT staff -- or any IT staff, for that matter. At most schools, the "IT staff" probably comprises the library employee(s).

      That said, you can probably say the same thing about most schools in the United States -- including the school where my son goes (and where he installed Minecraft on several computers before he got busted).

      --
      !#@%*)anks for hanging up the phone, dear.
    14. Re:Wait its possible?! by Anonymous Coward · · Score: 0

      I've had optimizer problems that caused SPI transfers to break (resulting in bricked SD cards!), probably due to optimizing out a wait loop.

      I know you didn't call it a compiler bug, but it is funny that you replied on this thread:

      On the other hand, I can't count how many times I've seen coders insist there must be a compiler bug when after investigation, the compiler had done exactly what it should according to the standard (or according to the compiler vendor's documentation when the compiler intentionally deviated from the standard).

    15. Re:Wait its possible?! by MightyYar · · Score: 1

      I'm old and crusty, but 20 years ago when I was in school they imaged the drives every night when the computer center closed. I remember the panic of having a bad floppy as they were about to close.

      --
      W..w..W - Willy Waterloo washes Warren Wiggins who is washing Waldo Woo.
    16. Re:Wait its possible?! by Megane · · Score: 1

      So what is your point, exactly? It's a compiler bug when you turn off optimization and the code starts working properly. Optimization is not supposed to cause the code to behave differently.

      And in the second case, the "branch to self" was clearly visible in the assembly code of the .lst file.

      I also had a case where the compiler generated an instruction with an invalid operand, which resulted in an assembler error! Fortunately that one was fixable by adding a "volatile" keyword to a variable declaration. And all of these are within the past two years.

      The biggest case of finding compiler bugs for me was in college back in the '80s. I was writing a Pascal compiler for a CS class, and kept finding all sorts of bugs in TML Pascal. The most annoying was if you had a set with more than 32 elements, "constant IN set" would be compiled as a 32-bit AND instruction, completely ignoring some of the involved data. Then Apple made MPW free (as in beer) and I didn't have to worry about those bugs anymore.
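
      A rough sketch of that set bug, modelling a Pascal set as an integer bitmask. This is an illustration under assumptions, not TML Pascal's actual code generation, and the function names are hypothetical. It shows how a membership test that only ANDs the low 32 bits silently loses every element numbered 32 or higher:

```python
def in_set(element, mask):
    """Correct membership test on a bitmask of arbitrary width."""
    return ((mask >> element) & 1) == 1


def in_set_32bit_and(element, mask):
    """Membership test that only ANDs the low 32 bits (the assumed bug)."""
    return ((1 << element) & mask & 0xFFFFFFFF) != 0


# A set with more than 32 possible elements, containing 5 and 40.
big_set = (1 << 5) | (1 << 40)

print(in_set(5, big_set), in_set_32bit_and(5, big_set))    # True True
print(in_set(40, big_set), in_set_32bit_and(40, big_set))  # True False
```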

      --
      #naabhaprzrag, #sverubfr-000, #agi-fcbafberq, negvpyr[pynff*=' negvpyr-ary-'] { qvfcynl: abar !vzcbegnag; }
  3. OsStress by larry+bagina · · Score: 5, Informative

    Microsoft found similar impossible bugs when overclocking was involved.

    --
    Do you even lift?

    These aren't the 'roids you're looking for.

    1. Re:OsStress by Anonymous Coward · · Score: 2, Informative

      That's not too surprising. For instance, if you try to read too fast from memory, the data you read may not be what was actually in the memory location. Some bits may be correct, some may not. Sometimes the incorrect values may relate to the data that was on the bus last cycle, e.g. there has not been enough time for the change to propagate through. This can easily lead to the data apparently read being a value that should not be possible. This is why overclocking is not a good idea for mission-critical systems, although of course it can be fun to push a system a bit harder to get better performance for non-critical applications.
      John

    2. Re:OsStress by DNS-and-BIND · · Score: 2, Insightful

      We all realize that when Intel bakes a bunch of processors, they come out all the same, and then Intel labels some as highspeed, some as middle, and some as low. They are then sold for different prices. However, they are the exact same CPU.

      Overclocking isn't the issue, because the CPUs are the same. The problem arises when aggressive overclocking is done by ignorant hobbyists or money-grubbing computer retailers. They overclock the computer to where it crashes, and then back off just a little bit. "There! Now I've got a real MEAN MACHINE," he thinks.

      --
      Shutting down free speech with violence isn't fighting fascism. It IS fascism!
    3. Re:OsStress by Anonymous Coward · · Score: 5, Interesting

      Then again, it might not be overclocking after all.

      More relevantly, Microsoft has access to an enormous wealth of data about hardware failures from Windows Error Reporting. This paper has some fascinating data in it:

      - Machines with at least 30 days of accumulated CPU time over an 8 month period had a 1 in 190 chance of crashing due to a CPU subsystem fault
      - Machines that crashed once had a probability of 1 in 3.3 of crashing a second time
      - The probability of a hard disk failure in the first 5 days of uptime is 1 in 470
      - Once you've had one hard disk failure, the probability of a second failure is 1 in 3.4
      - Once you've had two failures, the probability of a third failure is 1 in 1.9

      Conclusion: When you get a hard disk failure, replace the drive immediately.

    4. Re:OsStress by Anonymous Coward · · Score: 5, Insightful

      Bullshit. While Intel does occasionally bin processors into lower speeds to fulfill quotas and such, often times those processors are binned lower because they can't pass the QA process at their full speed. But they can pass the QA process when running at a lower speed. These processors were meant to be the same as the more expensive line, but due to minor defects can't run stably or reliably at the higher speed. Or at least not enough for Intel to sell them at full speed.

      Which is a large part of why some processors in the same batch can handle it when others can't.

      As much as I hate Intel, I think we could at least realize that they are often times doing this with good reason.

    5. Re:OsStress by Anonymous Coward · · Score: 2, Informative

      We all realize that when Intel bakes a bunch of processors, they come out all the same, and then Intel labels some as highspeed, some as middle, and some as low. They are then sold for different prices. However, they are the exact same CPU.

      This is not 100% correct. When Intel or other fabricators of microprocessors make the things, they do use the same "mold" to "stamp out" all the processors in large bunches, all based on the same design; however, they don't get the exact same result each time. There are little differences from chip to chip: on this chip some transistors ended up a few atoms closer together than the optimum distance, so this part of the processor will heat up more when in use; on that chip someone coughed* during the process and smeared the result, and it's totally unusable; on another chip part of the cache memory is fubared and has to be disabled.

      The end result is they have a chip and they have to test it to see how well it performs because of all these variables in the manufacturing process. One chip might be 100% reliable and operate under the desired temperature at clock speed A, while another chip, due to its unique manufacturing imperfections, has problems at clock speed A: either it's too hot or needs too much voltage or has calculation errors, but when lowered to clock speed B it works just fine. I believe they call this process "binning" and it's the main thing that separates the chips into different speeds and capabilities.

      IT IS HOWEVER a known practice that the chip manufacturers will sometimes take a processor that is just fine and dandy to work at clock speed A but label it a slower clock speed B part, because they are running low on clock speed B parts and it makes better financial sense to sell it as such instead of lowering the price on their clock speed A parts. Sometimes it's more than a clock speed; sometimes it's the intentional disabling of capabilities of the processor to make it match their budget models, like disabling some of the on-board cache memory or some of the (working) cores.

      What it comes down to is that it costs the processor manufacturers the exact same amount to make all the different speed processors in a given family, but they don't all come out the same. The worst ones are put on the low end, the better quality ones on the high and expensive end, and sometimes there is a perfect high quality one that is sold as a low end one because they need to produce and ship more low quality ones. If you get one of those, then consider yourself lucky and overclock the shit out of it. All processors can be overclocked, as the manufacturers make the official speed the 100% stable and error free operation with a normal (not aftermarket) cooling solution that will last for the lifetime of the warranty. You just sometimes get lucky and have a processor that is super easily overclocked because it could have been labeled and sold as a higher speed to begin with. This is never guaranteed however.

      *Yeah, it's more complicated: they aren't using molds that they press, they are using some complicated look-it-up-and-read-if-you-are-really-interested stuff to make the things so amazingly small.

    6. Re:OsStress by Anonymous Coward · · Score: 1

      That's so shockingly wrong that I hope no one confuses it for the truth. I assume you have no actual engineering experience with Si process technology, validation of processors, or anything related to electrical engineering.

      A fab process has a target set of parameters. Sometimes there are deviations. A processor is designed to yield well at a certain frequency for the process targets. Each wafer is slightly different. Individual die on a wafer are different. Some are outright flawed and thrown away. Some have higher static leakage and must be run at lower voltage to hit the TDP spec, and often are down-binned. Some are slower and some are faster. There is a complex process on a high volume tester to find the sweet spot of frequency, voltage, and TDP to bin a part in the right SKU. Then, each part is dynamically fused with various voltage levels for min frequency, max frequency, turbo frequency, etc.

      On a really mature process node with a really mature design that's not being pushed to the limits of performance, all the processors may be mostly the same. Not the latest and greatest stuff.

    7. Re:OsStress by Anonymous Coward · · Score: 0

      We all realize that when Intel bakes a bunch of processors, they come out all the same, and then Intel labels some as highspeed, some as middle, and some as low. They are then sold for different prices. However, they are the exact same CPU.

      No, the CPUs are not exactly the same. Each and every CPU is unique, just like a snowflake. None of them is entirely perfect, otherwise they'd all run exactly the same at any speed. And we know that they don't.

    8. Re:OsStress by DNS-and-BIND · · Score: 2

      Nope! It's the same processor. Sure, some come out different, but oftentimes there are loads of perfectly good processors that get underclocked for marketing reasons only. It's not like the ratios come out perfectly every time, which is what you seem to be implying. They often times don't do it with good reason. Intel is very big into marketing. If they were an engineering firm, they'd sell one product at one price and be done with it.

      --
      Shutting down free speech with violence isn't fighting fascism. It IS fascism!
    9. Re:OsStress by Anonymous Coward · · Score: 0

      Sure, some come out different

      Which contradicts:

      they come out all the same

      So which is it?

    10. Re:OsStress by Anonymous Coward · · Score: 0

      When you get a hard disk failure, replace the drive immediately.

      Ehh, run RAID1 and run the drive into the ground, then replace.

    11. Re:OsStress by epine · · Score: 3, Informative

      Nope! It's the same processor. Sure, some come out different, but oftentimes there are loads of perfectly good processors that get underclocked for marketing reasons only.

      When the day arrives that we achieve molecular assembly, even then for two devices identically assembled with atom for atom correspondence, there will likely be enough variation in molecular or crystalline conformation remaining to classify the two devices at the margin as "not quite the same".

      Binning levels are determined by the weakest transistor out of billions, the one with a gate thickness three deviations below the mean, and a junction length a deviation above. There is probably some facility for defective block substitution at the level of on-chip SRAM (cache memory), and maybe you can laser out an entirely defective core or two.

      As production ramps, Intel has a rough model of how the binning will play out, but this is a constantly moving target. Meanwhile, marketing is making promises to the channel on prices and volumes at the various tiers. There's no sane way to do this without sometimes shifting chips down a grade from the highest level of validation in order to meet your promises at all levels despite ripples experienced in actual production.

      Intel is also concerned--for good reason--about dishonest remarking in the channel. There's huge profit in it, and it comes mainly at the expense of Intel's reputation. Multiplier locks help to discourage this kind of shady business practice. So yeah, a few chips do get locked into a speed grade less than the chip could feasibly achieve. This is all common sense from gizzard to gullet. What's your point, then?

      If they were an engineering firm, they'd sell one product at one price and be done with it.

      Where you even find so many stupid engineers? The College of Engineering for Engineers Who Think Statistics is One Big Cosmic Joke presided over by the Edwin J. Goodwin Chair of Defining Pi As Equal to 22/7?

    12. Re:OsStress by epine · · Score: 1

      In my previous post, the value implied for Pi by that figure is actually 32/10. And there's a "would" missing from my final sentence. I was in thrall momentarily to reductive epilepsy.

    13. Re:OsStress by Anonymous Coward · · Score: 0

      This is partially true.

      If a processor fails QA, it gets binned lower. Every chip goes through QA, and gets binned appropriately.

      Unfortunately, the Engineers are better than the Economy and the Marketing department. For quite some years, apart from very early on in the cycle, the Yield the Engineers get has been many times better than what's needed. The end result is that a certain, large number of chips get binned lower to fill demand rather than because of any faults.
      It's always a bit of a lottery on the other end, but you'd be shocked just how high the odds are of getting a better-spec chip than what it's sold as.

    14. Re:OsStress by Anonymous Coward · · Score: 1

      The problem with binning is that naturally most processors come out in the high end (average amount of flaws), not ultra high end, but still in the very expensive part of binning. Most processors sold, however, are in the low end. To make all those low end processors, chip manufacturers either need to lower the price of the high end parts OR move parts binned as high end to low end bins. Since they always do the latter, there is a 90-99% chance that any low end part was not actually binned as a low end part to begin with, but was originally a higher end part.

    15. Re:OsStress by gnasher719 · · Score: 1

      Bullshit. While Intel does occasionally bin processors into lower speeds to fulfill quotas and such, often times those processors are binned lower because they can't pass the QA process at their full speed. But they can pass the QA process when running at a lower speed. These processors were meant to be the same as the more expensive line, but due to minor defects can't run stably or reliably at the higher speed. Or at least not enough for Intel to sell them at full speed.

      According to an Intel employee who posted this years ago, sometimes you get complete wafers that work fine at a lower speed and not at all at the intended speed, but you also get chips where a tiny number of transistors don't work at the intended speed. The second type is the problem for overclockers. These chips will run at high speed most of the time, and only very rarely will they go wrong.

    16. Re:OsStress by sjames · · Score: 1

      They do NOT come out all the same. They are creating features on the wafer near the limits of the technology's resolution. That's why each and every one is tested separately. Some test fine with all features at the highest rated speed. Some have failed cache or some cores don't work at all. Those get a quick zap to disable the failed parts and get sold as a lesser processor (that's how Celeron started). Others have working features, but not at full speed. Those are binned as lower speed processors.

      THEN, they will mark faster CPUs as slower as required to meet actual sales. If you buy a slower marked part and overclock it, sometimes you get a part that was de-rated to meet sales volume and you win. Other times you get a part marked down for good reason and you lose. Sometimes (often) your testing isn't as good as theirs and you get mysterious failures from time to time that you write off as software bugs.

      Near the end of a product's lifecycle, you tend to win frequently. Over the lifecycle they have tweaked the process to get a high quality yield and so end up marking a lot of parts below their test results. Near the beginning of a lifecycle, especially if it's a die shrink, you'll lose more often.

      As for how much of an issue that is, it really depends on what you do with the machine and your willingness to acknowledge the potential pitfalls. If you're just gaming and you're willing to accept that your overclocking might cause the occasional crash, it's all good. If you dun the game for each and every crash insisting it couldn't be your monster gaming platform, not so good. If you do your taxes on it, GOOD LUCK!

  4. Caution: by fahrbot-bot · · Score: 5, Funny

    Bug hunts on LV-426 often end badly.

    --
    It must have been something you assimilated. . . .
    1. Re:Caution: by asliarun · · Score: 1

      It's them damn cosmic rays, I tell ya.

      The death of Moore's law, they will be.

    2. Re:Caution: by Anonymous Coward · · Score: 0

      Obligatory:
      Nuke them from orbit. It's the only way to be sure.

    3. Re:Caution: by fragMasterFlash · · Score: 1

      It's them damn cosmic rays, I tell ya.

      The death of Moore's law, they will be.

      Or the reason semiconductor houses switch from conventional (bulk CMOS) processes to Silicon-on-Insulator. Many SOI processes are rad hardened by default.

    4. Re:Caution: by Anonymous Coward · · Score: 0

      Maybe they want to control leakage at small geometries?
      Rad hardening is a side effect, not a feature they are after.

      Besides, true rad hardening takes more than just the chip process; other things need to be taken care of, e.g.
      - Shielding the package against radioactivity, to reduce exposure of the bare device.
      - Shielding the chips themselves by use of depleted boron (consisting only of isotope Boron-11) in the borophosphosilicate glass passivation layer protecting the chips, as boron-10 readily captures neutrons and undergoes alpha decay (see soft error).
      - Redundant elements can be used at the system level.
      - Redundant elements may be used at the circuit level.
      - Hardened latches may be used.
      http://en.wikipedia.org/wiki/Radiation_hardening

    5. Re:Caution: by Roger+W+Moore · · Score: 2

      Many SOI processes are rad hardened by default.

      Rad hard usually means that they are not damaged by radiation, e.g. you can stick them close to an LHC beam as part of a detector and the massive radiation dose they receive will not cause the device to permanently cease functioning (or at least it will last longer before it fails). On the other hand, cosmic rays which slow down and stop in material can cause a large amount of local ionization. This can be enough to flip the state of a memory bit, which can cause crashes. As devices get smaller, the charge needed to flip the state gets smaller, so more cosmic rays are capable of depositing enough charge to make a difference. These two processes are different: one is a permanent failure of the device, the other just temporarily flips the state.

  5. I don't believe 1% of computers give wrong answers by PineGreen · · Score: 0, Flamebait

    I think this is bull. I just don't believe 1% of computers give wrong answers. There are many reasons why a precomputed table might differ - threading, reordering of floating point operations, etc. Basically, compilers guarantee a certain precision, not bit-for-bit deterministic results (unless you set up certain IEEE flags, which are not on by default).
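
    To make the reordering point concrete, here is a small sketch with arbitrarily chosen values: floating-point addition is not associative, so evaluating the same sum in a different order can change the last bits of the result, while integer addition gives the same answer in any order.

```python
# Floating-point addition is not associative: grouping changes the rounding.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)
floats_agree = (left == right)

# Integer addition is exact, so any evaluation order gives the same value.
ints_agree = ((1 + 2) + 3) == (1 + (2 + 3))

print(floats_agree, ints_agree)  # False True
```

    This says nothing about what OsStress actually computed; it only shows which kind of arithmetic a bit-exact table of known answers can safely rely on.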

  6. stress test by Anonymous Coward · · Score: 1

    In my field, if you can survive a gcc (gcc.gnu.org) testsuite run, twice, and get the same answer, you have a verified good system. If not, you have a steaming pile of trash you should throw away. That begins and ends all the stress testing you need to do.

    1. Re:stress test by SJHillman · · Score: 5, Funny

      In my field, I have a bunch of grass, a few shrubs and even a small tree. Lots of rodents and birds. If a computer can survive two weeks sitting in my field and still power on, you have a damned good system. If not, you're left with people wondering why you left your computer in my field for two weeks.

    2. Re:stress test by AaronLS · · Score: 5, Funny

      He didn't say anything about a computer: "In my field, if YOU can survive"... scary...

    3. Re:stress test by godrik · · Score: 1

      So do you suggest Guild Wars incorporate a gcc testsuite run in parallel with the game?

    4. Re:stress test by AaronLS · · Score: 1

      Funny though, I like what you did there.

  7. Re:I don't believe 1% of computers give wrong answ by Desler · · Score: 1

    I think this is bull. I just don't believe 1% of computers give wrong answers.

    Why would he lie about it?

  8. Memory modules by stanlyb · · Score: 0

    I found this out the hard way: by buying different DIMM modules, combining them, and of course NOT combining them. Nevertheless, when you do a lot of multitasking, play 2-3 games...ok, ok, only one, but the other two are still in the background, run some VMs, etc., you will find out how fragile computers are nowadays. The solution? Buy a Dell, and be happy with the most stable, lowest-performance computer.

  9. Re:I don't believe 1% of computers give wrong answ by PaladinAlpha · · Score: 5, Insightful

    You don't have any idea what you're talking about, and that's why you don't understand what he's talking about.

  10. How to deal with compiler bugs by MtHuurne · · Score: 5, Insightful

    If you suspect the compiler is generating invalid machine code, try to make a minimal test case for it. If you succeed, file a bug report and add that test case; the compiler developers will appreciate it. If you don't succeed in finding a minimal test case that triggers the same issue, it's likely not a compiler bug but an issue in your program in some place where you weren't expecting it.
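
    One way to approach the "minimal test case" step is a greedy reduction loop: keep deleting lines as long as the problem still reproduces. The sketch below is only an illustration; `still_fails` is a hypothetical predicate standing in for "compile this candidate and check whether the bad machine code is still produced", and real reducers are considerably smarter.

```python
def reduce_lines(lines, still_fails):
    """Greedily drop lines while the failure still reproduces."""
    current = list(lines)
    changed = True
    while changed:
        changed = False
        for i in range(len(current)):
            candidate = current[:i] + current[i + 1:]
            if still_fails(candidate):
                # The line was not needed to trigger the bug; drop it
                # and restart the scan on the smaller input.
                current = candidate
                changed = True
                break
    return current


# Toy stand-in: the "bug" needs both line "b" and line "d" to be present.
source = ["a", "b", "c", "d"]
minimal = reduce_lines(source, lambda ls: "b" in ls and "d" in ls)
print(minimal)  # ['b', 'd']
```

    If the loop cannot shrink the input to something small and self-contained, that is a hint that the trigger depends on the rest of the program, which fits the advice above.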

    1. Re:How to deal with compiler bugs by Anonymous Coward · · Score: 0

      No kidding captain obvious, that's standard bug reporting 101.

      http://www.chiark.greenend.org.uk/~sgtatham/bugs.html

    2. Re:How to deal with compiler bugs by jittles · · Score: 1

      I've only seen a compiler issue once in my entire career. Granted, I have probably worked on more mature platforms than some of the cowboys that started the industry but here is what it was: It was using GCC 3 on MontaVista 2.4 Linux. I was building an insanely complex system library w/ a lot of templated code and a large number of classes inheriting from a parent. The problem is that a pthread_mutex_t was not being constructed properly. When we would try and lock the mutex, it would crash the entire application. The exact same code worked flawlessly on our MontaVista 2.6 toolchain, but our flash drive was too small to fit a 2.6 install. We spent weeks working with MontaVista to build a test case that reproduced the issue and could not. So how do we know it was a compiler issue? Well like I said, we knew it worked on all of our other Linux/Windows platforms. We also told the compiler not to assemble the code, and we looked at the assembly output by the compiler. Sure enough it was doing something stupid (I can't remember what). MontaVista said that they had no idea how to fix it. We ended up doing #ifdef for that particular platform and using sem_t in place of pthread_mutex_t and it worked like a champ.

    3. Re:How to deal with compiler bugs by Anonymous Coward · · Score: 0

      Or even better, use a verified compiler like CompCert.

    4. Re:How to deal with compiler bugs by MtHuurne · · Score: 1

      I see internal compiler errors (assertion-style) and crashes quite regularly, as in multiple times per year. Valid language constructs being rejected or invalid ones accepted happens too, although I don't know the language specs well enough to spot most of those. And occasionally a code generation bug (I can remember two from the last 10 years).

      How often you find compiler bugs depends a lot on what part of the compiler you're stressing: you'll encounter more bugs on MIPS than on x86, for example, since x86 has far more users. And if you're trying new C++11 features it's far more likely you'll see a bug than when you're compiling plain C.

  11. Re:I don't believe 1% of computers give wrong answ by godrik · · Score: 4, Informative

    I actually believe it. I am sure they thought about floating-point precision problems, but most likely they only used integers. That's what Prime95 and memtest do. Integer and memory operations uncover the most common hardware failures. I have encountered many computers whose hardware turned out to be faulty when stressed. And I am sure Guild Wars was stressful.
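
    A rough sketch of an integer-only check in the spirit of the one described in the summary (fill a block of memory, then compare against known answers). This is not the actual OsStress code: the pattern function, the multiplier, the buffer size and all names are assumptions chosen for illustration. Because only integer operations are used, a healthy machine should always report zero mismatches.

```python
MASK32 = 0xFFFFFFFF


def expected_word(i, seed):
    """Known answer for slot i: integer-only, so the result is exact."""
    return ((i + seed) * 2654435761) & MASK32


def fill(n_words, seed):
    """Allocate a buffer and fill it with the expected pattern."""
    return [expected_word(i, seed) for i in range(n_words)]


def count_mismatches(buf, seed):
    """Re-read the buffer and count words that differ from the known answer."""
    return sum(1 for i, w in enumerate(buf) if w != expected_word(i, seed))


buf = fill(4096, seed=7)
healthy = count_mismatches(buf, seed=7)

buf[1234] ^= 1 << 5            # simulate a single flipped bit in memory
faulty = count_mismatches(buf, seed=7)

print(healthy, faulty)  # 0 1
```

    The simulated bit flip stands in for the kind of memory or CPU fault that such a check is meant to catch.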

  12. Compilers by Mullen · · Score: 4, Funny

    For a skilled developer, I can't believe he didn't think that Dev/Test/Prod build environments running different versions of the compiler would be an issue (obviously, until it was one).

    That's Development Cycle 101.

    --
    Linux O Muerte!
    1. Re:Compilers by Anonymous Coward · · Score: 0

      It's funny the amount of 'not my problem' that creeps into the dev process. You are working on your code. There is a guy in the corner whose JOB is to make sure it builds. But only for dev. QA has their own build guy because the head of QA wants to build it himself because of 'process'. So now you have 2 builds. You have people who ignore known bugs because they 'do not want to rock the boat'. You have people shipping with ancient tools for years on end because 'upgrading is too big of a hassle'.

      Then you have people who think Visual Studio 6 is cutting edge, do not want to upgrade, and only grudgingly move up to 2002 (more well tested).

      People are funny about dev tools. They think they somehow get better with age.

    2. Re:Compilers by Anonymous Coward · · Score: 0

      All screw-ups of this nature are totally obvious in hindsight.

      This isn't a case of him saying "it doesn't matter whether the build server is up to date". This is a case of nobody remembering to update the build server. If they had been a large enough company to have a single person whose full-time job was to support build servers, it would have been that person's job, but I'm going to guess that Guild Wars was made by a relatively small team who wore multiple hats. The story certainly makes it sound that way. So it was, at the same time, everyone's fault and no-one's fault; anyone could have remembered to update the server, but no-one did, and no single person was uniquely in charge of maintaining the build server.

      The ease of making mistakes is why very important jobs, like airplane pilot, rely so much on checklists. If there had been a "Development Cycle 101 Checklist" available to the Guild Wars team at the time of the story, I'm sure they would have used it, and when going down the checklist they would have said "oh, check that the build machine is up to date."

      So, it's easy to laugh at them for this mistake, but I think of mistakes I have made and I don't look down on them for this one. He posted it so that others could learn from his experience rather than learning this the hard way.

    3. Re:Compilers by disambiguated · · Score: 1

      Wow, I feel for you. If QA is not testing against the same build the developers are using, they're doing it horribly wrong. Or did you mean QA is doing their own build for their testing tools? That I can understand.

    4. Re:Compilers by Epicaxia · · Score: 1

      You missed the part where he said it was a deliberate feature of the Prod build:

      A program that doesn’t run uses a lot less CPU.

    5. Re:Compilers by Anonymous Coward · · Score: 1

      Orig ac here. No I do not allow that. I would yell till my face is red. I have however seen orgs which are that dysfunctional. I skipped taking the job :)

      Some orgs think that VC6 was the gold standard ever in compilers and no more money need be spent on that. Never mind they can not find anyone who will want to work with stuff that creaky anymore...

      For some reason some orgs gentrify is the point I was trying to make. Then the management can not figure out why the tool vendors will no longer support them.

      There is never 'any time to upgrade'.

  13. Re:I don't believe 1% of computers give wrong answ by Jeremi · · Score: 5, Insightful

    I think this is bull. I just don't believe 1% of computers give wrong answers

    1% of all computers? Probably not.

    1% of gamers' computers, in an era when PC gaming technology was progressing very quickly, and so gamers were often running overclocked (or otherwise poorly set up) hardware? Sounds plausible enough.

    --
    I don't care if it's 90,000 hectares. That lake was not my doing.
  14. Re:I don't believe 1% of computers give wrong answ by Alwin+Henseler · · Score: 2

    I won't go into specific reasons you mention, but it is perfectly possible to write code that has a known, fully deterministic result. After all: compilers produce machine code, and the bulk of that is integer operations which have exactly defined behavior with 0 room for interpretation (when it comes to digital logic like CPUs, "defined" is deterministic). Maybe there are exceptions (like floating point? don't count on it), maybe for some types of operations you need to sidestep a compiler and code some assembly directly, but that's beside the point.

    With that in hand, expect some of the computed results to turn out wrong. Knowing what junk parts go into computers sometimes, how shoddily some machines are built, and how some people abuse their computers, I'd think a 1% failure rate is probably on the low end of the scale.

    For example, try running Memtest86 sometime, leave it running for a few hours, repeat for other computers you encounter, and see how many computers you need to try before you see it spit out errors. You might be surprised.

  15. Re:I don't believe 1% of computers give wrong answ by MtHuurne · · Score: 5, Insightful

    He said 1% of computers that were used to play Guild Wars gave wrong answers. Gaming PCs are more likely to be overclocked too far, have under-dimensioned power supplies or overheating issues than the average PC. 1% doesn't sound unrealistically high to me.

  16. shows what happens pulling 12+ hour days does by Joe_Dragon · · Score: 0

    This shows what pulling 12+ hour days does; at least nobody died because of it. Yes that has happened in the past with trains, trucks, airplanes , ect's.

    1. Re:shows what happens pulling 12+ hour days does by DrVxD · · Score: 1

      Yes that has happened in the past with trains, trucks, airplanes , ect's.

      12+ hours of ECT is just asking for trouble...

      --
      Not everything that can be measured matters; Not everything that matters can be measured.
  17. Re:I don't believe 1% of computers give wrong answ by UnknownSoldier · · Score: 2

    > I actually believe it. I am sure they might have think of floating point precision problem.

    I can believe it. Ten years ago on one of the PC games I worked on there were significant floating-point differences between Intel and AMD. Fortunately it was an RTS so we could get away with fixed-point. If we had been forced to deal with floats it would have been a hassle to keep them "in sync."

    Floating-point is an approximation anyways, so IMHO

    a) the server should be making the authoritative decision(s), and
    b) should be sending a quantized result to the clients.
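
    A rough sketch of what fixed-point plus a quantized result can look like in C. The 16.16 format, the 1/256 quantization step, and all names here are assumptions for illustration, not what that game actually used. The point is that every step is integer math, so Intel and AMD machines produce identical bits.

```c
#include <assert.h>
#include <stdint.h>

typedef int32_t fixed_t;                 /* assumed 16.16 fixed-point format */
#define FIXED_SHIFT 16
#define FIXED_ONE   ((fixed_t)1 << FIXED_SHIFT)

/* Convert a whole number to fixed point. */
fixed_t fixed_from_int(int32_t n)
{
    assert(n >= -32768 && n <= 32767);   /* must fit in the 16 integer bits */
    return (fixed_t)(n * FIXED_ONE);
}

/* Multiply two fixed-point values. The product is widened to 64 bits so the
 * intermediate cannot overflow; the caller must keep the result in range. */
fixed_t fixed_mul(fixed_t a, fixed_t b)
{
    return (fixed_t)(((int64_t)a * b) / FIXED_ONE);
}

/* Round toward zero to a multiple of 1/256, e.g. before a server sends
 * the value to its clients. */
fixed_t fixed_quantize(fixed_t v)
{
    const fixed_t step = FIXED_ONE / 256;
    return (v / step) * step;
}
```

    Division is used instead of a right shift because shifting negative values is implementation-defined in C, which would defeat the purpose.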

  18. Yep, seen it all by russotto · · Score: 5, Insightful

    I've had compilers miscompile my code, assemblers mis-assemble it, and even in a few cases CPUs mis-execute it consistently (look up CPU6 and msp430). Random crashes due to bad memory/cpu... yep. But on very rare occasions, I find that the bug is indeed in my own code, so I check there first.

    1. Re:Yep, seen it all by Belial6 · · Score: 1

      I hate it when the bugs are not in my code.

  19. Typical for safety cert programs by Okian+Warrior · · Score: 5, Interesting

    We deal with this type of bug all the time in safety-certified systems (medical apps, aircraft, &c).

    Most of the time an embedded program doesn't use up 100% of the CPU time. What can you do in the idle moments?

    Each module supplies a function "xxxBIT" (where "BIT" stands for "Built In Test") which checks the module variables for consistency.

    The serial driver (SerialBIT) checks that the buffer pointers still point within the buffer, checks that the serial port registers haven't changed, and so on.

    The memory manager knows the last-used static address for the program (ie - the end of .data), and fills all unused memory with a pattern. In its spare time (MemoryBIT) it checks to make sure the unused memory still has the pattern. This finds all sorts of "thrown pointer" errors. (Checking all of memory takes a long time, so MemoryBIT only checked 1K each call.)

    The stack pointer was checked - we put a pattern at the end of the stack, and if it ever changed we knew something went recursive or used too much stack.

    The EEPROM was checksummed periodically.

    Every module had a BIT function and we check every imaginable error in the processor's spare time - over and over continuously.

    Also, every function began with a set of ASSERTs that check the arguments for validity. These were active in the released code. The extra time spent was only significant in a handful of functions, so we removed the ASSERTs only in those cases. Overall the extra time spent was negligible.

    The overall effect was a very "stiff" program - one that would either work completely or wouldn't work at all. In particular, it wouldn't give erroneous or misleading results: showing a blank screen is better than showing bad information, or even showing a frozen screen.

    (Situation specific: Blank screen is OK for aircraft, but not medical. You can still detect errors, log the problem, and alert the user.)

    Everyone says to only use error checking during development, and remove it on released code. I don't see it that way - done right, error checking has negligible impact, and coupled with good error logging it can turbocharge your bug-fixing.
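
    A minimal sketch of the MemoryBIT idea described above, in C. The buffer `unused_mem`, the fill byte 0xA5, and the function names are illustrative assumptions, not the original code; a real system would take the region bounds from the linker map instead of using an array.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define FILL_PATTERN 0xA5u   /* assumed fill byte; any recognizable value works */
#define BIT_CHUNK    1024u   /* bytes checked per call, per the 1K in the post */

/* Hypothetical stand-in for the unused memory after the end of .data. */
uint8_t unused_mem[8u * 1024u];
static size_t next_offset;   /* where the next MemoryBIT call resumes */

/* Fill the whole unused region with the pattern once at start-up. */
void MemoryInit(void)
{
    for (size_t i = 0; i < sizeof unused_mem; i++)
        unused_mem[i] = FILL_PATTERN;
    next_offset = 0;
}

/* Check one chunk per call. Returns 1 if the chunk is intact, 0 if a byte
 * changed (for example through a stray pointer write). */
int MemoryBIT(void)
{
    assert(next_offset < sizeof unused_mem);

    size_t end = next_offset + BIT_CHUNK;
    if (end > sizeof unused_mem)
        end = sizeof unused_mem;

    int ok = 1;
    for (size_t i = next_offset; i < end; i++) {
        if (unused_mem[i] != FILL_PATTERN) {
            ok = 0;
            break;
        }
    }

    /* Advance to the next chunk, wrapping to the start after the last one. */
    next_offset = (end == sizeof unused_mem) ? 0 : end;
    return ok;
}
```

    Calling MemoryBIT() once per idle slot walks the whole region over time. The same pattern trick applies to the end-of-stack check.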

    1. Re:Typical for safety cert programs by UnknownSoldier · · Score: 1

      > Everyone says to only use error checking during development, and remove it on released code. I don't see it that way - done right, error checking has negligible impact,

      That depends on the _type_ of app. In the games industry a _debug_ build runs TOO slow to be even practical. You are forced to run optimized code if you want to have any hope of going above 1 fps.

      TINSFAAFL. Error checking costs. If I was doing software where somebody's life depended on it -- hell yeah, you're spot on! But for a "game" you can't afford the frame rate hit to do it "right". :-/

    2. Re:Typical for safety cert programs by Gaygirlie · · Score: 1

      But for a "game" you can't afford the frame rate hit to do it "right". :-/

      Yes, you can, just time it right: if the CPU is under-utilized at the moment then you can safely have it run a check or another. Of course it would be silly to run a check when the CPU is already pegged at 100%, but then again, OP clearly said they run their checks when there's spare resources.

    3. Re:Typical for safety cert programs by UnknownSoldier · · Score: 1

      With multi-core machines, yes, we actually have spare CPU cycles again but the problem is the check needs to be deterministic time-wise. Another monkey wrench is that games unfortunately have uneven latency. Sadly not all gamers value > 60 fps. :-(

    4. Re:Typical for safety cert programs by Gaygirlie · · Score: 1

      Another monkey wrench is that games unfortunately have uneven latency. Sadly not all gamers value > 60 fps. :-(

      Why should they value such? Most displays these days are 60Hz so having higher FPS gives absolutely nothing. Also, high average FPS doesn't guarantee in any way or form that there won't be occasional severe framerate drops: having 350 FPS when you're roaming around in general and not doing anything specific is useless if the framerate drops to 15 FPS during the short moments when the framerate matters. No, even framerate across the board matters more than having high average FPS.

    5. Re:Typical for safety cert programs by UnknownSoldier · · Score: 1

      > Most displays these days are 60Hz so having higher FPS gives absolutely nothing.

      For single player I would agree.

      For multiplayer you NEED headroom for when you have tons of explosions going off with 16 - 32+ players. This way you can STILL keep everything silky smooth.

      > even framerate across the board matters more than having high average FPS.
      That is correct, but ideally you want to guarantee an even 60+ fps for SMOOTH motion due to lack of temporal aliasing.

      If your frame rate is dipping down into the teens you have some serious optimization problems to look into. Or you need to bring out the clue-stick for the artists/designers and see just what the hell they are doing. :-)

    6. Re:Typical for safety cert programs by Gaygirlie · · Score: 1

      For multiplayer you NEED headroom for when you have tons of explosions going off with 16 - 32+ players. This way you can STILL keep everything silky smooth.

      That's exactly the thing I was saying: having high average FPS does NOT guarantee that the FPS will be acceptable during the short, hectic moments where it would matter. To use a car analogy: the way you think is similar to driving with just one gear on at all times, at 100kmph in 60kmph areas, just so that when you come upon an upwards slope the speed wouldn't dip under that 60kmph, whereas the proper way would be to drive at 60kmph until you come upon an upwards slope, then switch gear while you're on the slope to maintain that 60kmph, and then switch gear back when you come off of it. The first method is the brute-force method and the second one is the right method -- it's just that the brute-force method is easier to employ and therefore it's used so much in games.

    7. Re:Typical for safety cert programs by Bram+Stolk · · Score: 1

      asserts should be used in games as well.
      If you keep them out of inner loops, performance impact will be very close to zero.

      --
      Bram Stolk http://stolk.org/tlctc/
  20. Stress testing: most critical Overclocking step! by epyT-R · · Score: 2

    This is why stress testing is so important. The system may seem stable at overclocked speeds but only while it is lightly or even moderately loaded, and not every error will result in a kernel panic. The hardest errors to get stable are often the subtle ones that cause cascades elsewhere, minutes or hours after the load finished.

    I start by getting it stable enough to pass memtest86+ tests 5 and 7 at (or as close as possible) my target frequencies/dividers. This is pretty easy to do nowadays, but it's a good sanity check starting point before booting the OS and minimizes gross misconfigurations that cause filesystem corruption. Then I run prime95, then linpack, then y cruncher, then loops of a few 3dmark versions. Sometimes I run the number crunchers simultaneously across all cores, first configured to stress the cpu/cache, then with large sets to stress ram (but not swap! in fact turn swap off for this). The minimum time for all of this really should be 12 hrs.. 24 is best, or more if you're paranoid. A variety of loads over this time is important because the synthetic ones are often highly repetitious, and this can sometimes fail to expose problems despite the load the system's under. The 3dmark (or pick a scriptable util of your choice) stresses bus IO as well as all the really cranky and picky gfx driver code. As a unique stressor, I use a quake 3 map compile that eats most of the ram and pegs the cpu for hours.. q3map2 is a bitch and it usually finds those subtle 'non-fatal' hardware errors if they exist.

    If the boot survives without an application or kernel crash (or other wonky behavior), I run a few games in timedemo loops. In the old days this was quake1/2/3, but these days I stick with games like metro 2033 which have their own bench utilities. These tests are still valid even if your intended use is 'workstation' class work and you don't game much, but still want to squeeze as much performance as you can from your hardware. I do both with mine and have had great success with this method.

  21. Random bluescreens by Anonymous Coward · · Score: 0

    I get hilarious impossible errors on my gaming rig all the time. Memtest revealed my RAM is throwing errors about once in a couple million memory transactions. Leaky gate I guess. Not about to replace the RAM any time soon, though -- nobody is really making DDR2 anymore and the utterly random errors I get when data comes back incorrect are very instructive as to which companies practice sanity checking and which skip it. Oh, and how annoying it can be when Windows decides device driver errors are worthy of halting the OS over.

    1. Re:Random bluescreens by Just+Brew+It! · · Score: 2

      If every value read from memory had to be sanity checked there would be little CPU horsepower left to perform useful work. Furthermore, if the bad value happens to be code instead of data, the application (or OS) is probably gonna crash before you even have a chance to check anything.

    2. Re:Random bluescreens by Anonymous Coward · · Score: 0

      At most, in a proper design, only one clock of latency is added to check read-data from ram. No CPU horsepower is used.

    3. Re:Random bluescreens by Just+Brew+It! · · Score: 1

      I may have mis-interpreted the post I replied to. I thought he was referring to sanity checks in software; upon re-reading he may have been referring to ECC RAM. I agree, the performance penalty for ECC RAM is minimal (and I use it in most of the systems I build).

    4. Re:Random bluescreens by DrVxD · · Score: 1

      If every value read from memory had to be sanity checked there would be little CPU horsepower left to perform useful work

      That's why ECC memory exists - the 'sanity check' is a real-time hardware operation that (typically) generates an NMI when it fails. It's not common in consumer-grade PCs (due to cost), but it does get used in some critical applications.

      --
      Not everything that can be measured matters; Not everything that matters can be measured.
  22. Don't forget to do out of the box testing by Joe_Dragon · · Score: 1

    Don't forget to do out of the box testing / testing for stuff that you may not think of off hand.

  23. Reminded me of my first C application by mykepredko · · Score: 3, Interesting

    I can't remember the exact code sequence, but in a loop, I had the statement:

    if (i = 1) {

    Where "i" was the loop counter.

    Most of the time, the code would work properly as other conditions would take program execution but every once in a while the loop would continue indefinitely.

    I finally decided to look at the assembly code and discovered that in the conditional statement, I was setting the loop counter to 1 which was keeping the loop from terminating.

    I'm proud to say that my solution to preventing this from happening is to never place a literal last in a condition, instead it always goes first like:

    if (1 = i) {

    So the compiler can flag the error.

    I'm still amazed at how rarely this trick is taught in programming classes and how many programmers it still trips up.

    myke

    1. Re:Reminded me of my first C application by Anonymous Coward · · Score: 0

      how would "assign 1 to value i" ever return false?

    2. Re:Reminded me of my first C application by Anonymous Coward · · Score: 0

      One of the reasons why C++, C# and Java use the "==" operator to do Boolean evaluation. AFAIK, most of the modern implementations of C compilers prefer this as well, to prevent Boolean evaluations from becoming assignments. The only time I really remember encountering this issue was decades ago when I was still learning to program BASIC on my VIC 20, and C for me was still just the third letter in the alphabet.

    3. Re:Reminded me of my first C application by Anonymous Coward · · Score: 0

      how would "assign 1 to value i" ever return false?

      It wouldn't, but partly because it wouldn't compile.

    4. Re:Reminded me of my first C application by safetyinnumbers · · Score: 4, Informative

      That's known as "Yoda style"

    5. Re:Reminded me of my first C application by mykepredko · · Score: 1, Informative

      The statement:

      if (i = 1) {

      is equivalent to:

      i = i;
      if (i) { // Always true because i = 1 and i != 0

      myke

    6. Re:Reminded me of my first C application by Anonymous Coward · · Score: 0

      Took me a second too, but what he was saying is that he had an error that the compiler didn't flag because the syntax was correct. By putting the literal first, the left side of the assignment is no longer an lvalue, which is illegal and thus will cause the compiler to throw an error.

    7. Re:Reminded me of my first C application by mykepredko · · Score: 1

      The reference, like, do I.

      myke

    8. Re:Reminded me of my first C application by dido · · Score: 1

      Which is why I always compile with -Wall -Werror on gcc. I get: "warning: suggest parentheses around assignment used as truth value [-Wparentheses]" for code which looks like that. I consider code that generates compiler warnings as being a bad sign, and always make it a point to clean them up before considering any code suitable. I don't know why this doesn't seem to be as widely done as it should be.

      --
      Qu'on me donne six lignes écrites de la main du plus honnête homme, j'y trouverai de quoi le faire pendre.
    9. Re:Reminded me of my first C application by Anonymous Coward · · Score: 0

      Most modern IDEs are smart enough to flag this visually, and can usually be configured to flag this at an "error" level. What should concern you is the lack of interest shown by most developers in making the best use of their IDE.

      If you're a die-hard VI or Emacs dude... well, there really is no hope then, is there?

    10. Re:Reminded me of my first C application by richardcavell · · Score: 5, Informative

      I just want to correct this, not to prove how smart I am but because there are novice programmers out there who will learn from this case. The statement:

      if (i = 1) {

      is equivalent to:

      i = 1; /* correction */
      if (i) {
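
      For the novices, a small sketch in C that makes the side effect visible. The function names are made up for illustration. The counter is passed by pointer so the caller can see it being overwritten.

```c
#include <assert.h>
#include <stddef.h>

/* Buggy: meant to test whether *i equals 1, but assigns 1 instead.
 * The extra parentheses only keep gcc and clang from warning here. */
int buggy_is_one(int *i)
{
    assert(i != NULL);
    if ((*i = 1))
        return 1;   /* always reached, because the assigned value is nonzero */
    return 0;
}

/* Intended version: compares and leaves *i untouched. */
int is_one(const int *i)
{
    assert(i != NULL);
    if (*i == 1)
        return 1;
    return 0;
}
```

      After calling buggy_is_one() on a variable holding 0, the variable holds 1 and the function reports "true", which is exactly how a loop counter gets stuck.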

    11. Re:Reminded me of my first C application by MichaelSmith · · Score: 1

      if ( 1 == i ) {

    12. Re:Reminded me of my first C application by Anonymous Coward · · Score: 0

      if ( 1 == i ) {

      What type of error/warning does that result in?

    13. Re:Reminded me of my first C application by Bill+Currie · · Score: 1

      Ick, that's what real compilers (eg, gcc) are for: good warning messages (such as "suggest parentheses around assignment used as truth value"), and better yet, -Werror. "if (1 == i)" is completely unnatural (for an English speaker anyway), which makes it more likely to forget to do 1 == i than it is to forget to double the equals sign. I too used to make the same mistake when I first started with C (having come from Pascal: that was fun := became =, = became ==), but I quickly learned to double check my tests first when bug hunting. While mixing up = and == has become extremely rare for me (it helps that I usually test against some variant of 0 and thus can avoid using any operator other than !), I often mix up my other tests...

      --

      Bill - aka taniwha
      --
      Leave others their otherness. -- Aratak

    14. Re:Reminded me of my first C application by RedHackTea · · Score: 1

      The (1 = i) trick was taught in my college. The university's main language was Java, and we were also taught to do this:

      String name = null;
      // This will not cause a NullPointerException
      if("bob".equals(name)) { /* do stuff */ }
      // This will
      if(name.equals("bob")) { /* do stuff */ }

      Not only is it safer, but it's actually a microsecond faster than if you were to always do this:
      if(name != null && name.equals("bob")) { /* do stuff */ }

      Here's another major gotcha:

      String password = getPasswordInputFromUser();
      if("password".equals(password));
      {
      // access granted
      }

      Did you notice the semi-colon at the end of the if-statement? That means access will always be granted, as it isn't seeing it as really an if-block but just a block; the if-statement would be executed, but has no true "then" clause. My co-worker ran into this problem a couple of weeks ago during test... luckily, it was during development and not production.

      Despite these concerns, I'll still write code every now and then that is in the unsafer way, as it's more natural in English to say in your head "does the user's name equal to bob?" than to say "does bob equal to the user's name?" Of course, if(i = 1) is invalid in Java anyway (as it has to be a boolean result; it can't be an integer like in C/C++), but I can write if((--i) >= 0). I feel comfortable writing this way though as I feel like I know most of the gotchas. I'm not trying to be arrogant here. I just write code the natural way to me and how my subconscious has written it since I first started coding. Human error will always be a problem; it's just important to always be paranoid.

      --
      The G
    15. Re:Reminded me of my first C application by Anonymous Coward · · Score: 0

      Because the return value of (i = 1) is 1. You can think of it as though you just called a function with the signature:
      int assignValue(int *left, int right)

    16. Re:Reminded me of my first C application by Anonymous Coward · · Score: 0

      Yoda style made sense before compilers started warning if you don't put extra parentheses (i.e. "if ((i = 1)) { ... }"), but now that every compiler does it with appropriate warning level flags, I can safely say that bad code smell yoda style is.

    17. Re:Reminded me of my first C application by Anonymous Coward · · Score: 0

      I even use -pedantic as a compile flag. Nowadays I compile with both gcc and clang and, if I have the chance, the clang static analyser.
      The number of subtle bugs that can be found by removing all warnings is staggering.

      You also find out about things that look perfectly normal, like casting through a pointer, that are actually illegal in C; I have cast through a union ever since. There are many more examples but I had a bad night's sleep and can't remember one off hand.

      Also Yoda style, I really hate it with a passion. A program should be able to be read aloud.
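
      A sketch of the pointer-cast problem mentioned above, in C. The function names are made up, and it assumes float and uint32_t are the same size. Reading a float's bits through a cast pointer breaks the aliasing rules; going through a union is accepted in C (though not in C++), and memcpy works in both.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* What NOT to do (undefined behaviour under C's aliasing rules):
 *     uint32_t bits = *(uint32_t *)&f;
 */

/* Reinterpret the bits of a float by writing one union member and
 * reading the other. */
uint32_t float_bits_union(float f)
{
    union { float f; uint32_t u; } pun;
    assert(sizeof(float) == sizeof(uint32_t));
    pun.f = f;
    return pun.u;
}

/* Same result by copying the bytes; compilers typically optimize the
 * memcpy away. */
uint32_t float_bits_memcpy(float f)
{
    uint32_t u;
    assert(sizeof f == sizeof u);
    memcpy(&u, &f, sizeof u);
    return u;
}
```

      Both functions return the same value for any input; which bit pattern that is depends on the platform's float format.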

    18. Re:Reminded me of my first C application by elbonia · · Score: 1

      Most compilers can catch errors like this. For example in this C situation you would find it when compiling with gcc and looking for the warning:
      warning: suggest parentheses around assignment used as truth value
      http://www.network-theory.co.uk/docs/gccintro/gccintro_94.html

      That way if you really wanted to assign 1 to i in the if you would need to use if ( (i=1) ) {.
      Assignment like this is very common in C/C++ for pointers, esp when working with lists.
      while ( (list = list->next) )
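
      A self-contained sketch of that list idiom in C. The `struct node` layout and function names are assumptions for illustration. The first version uses the double parentheses to mark the assignment as intentional; the second spells the comparison out, which some compilers and readers prefer.

```c
#include <assert.h>
#include <stddef.h>

struct node {
    int value;
    struct node *next;
};

/* Return the last node, using an intentional assignment as the loop test. */
struct node *list_tail(struct node *list)
{
    assert(list != NULL);
    struct node *next;
    while ((next = list->next))      /* double parentheses mark intent */
        list = next;
    return list;
}

/* Same walk with the comparison written out explicitly. */
struct node *list_tail_explicit(struct node *list)
{
    assert(list != NULL);
    struct node *next;
    while ((next = list->next) != NULL)
        list = next;
    return list;
}
```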

    19. Re:Reminded me of my first C application by Anonymous Coward · · Score: 0

      I'm still amazed at how rarely this trick is taught in programming classes and how many programmers it still trips up.

      I'm still amazed how many programmers use crap languages designed to let you write more bugs more easily.

    20. Re:Reminded me of my first C application by Anonymous Coward · · Score: 0

      That leaves you with one error: testing a known variable. The troublesome form is actually `if (p = GetPointer( ) ) p->Foo = 1; `

    21. Re:Reminded me of my first C application by Anonymous Coward · · Score: 0

      Hah! But i actually got passed by reference and my other thread changed it between the last two instructions! Muarf, arf, arf! :-)

    22. Re:Reminded me of my first C application by Anonymous Coward · · Score: 0

      Very interesting in this regard is the ability of clang to flag exactly this error. You have to use double brackets to not get a warning for this.

    23. Re:Reminded me of my first C application by Quietust · · Score: 1

      Microsoft's compiler has a similar warning - "C4706: assignment within conditional expression", and it actually doesn't let you suppress it just by adding extra parentheses - instead, you have to add a comparison around it.
      Thus, your second example would have to be "while ((list = list->next) != NULL)", which is probably more readable anyways.

      --
      * Q
      P.S. If you don't get this note, let me know and I'll write you another.
    24. Re:Reminded me of my first C application by Imagix · · Score: 1

      There's no excuse for this to still be a problem. GCC has been emitting warnings about this code construct for many, many years now. You do compile with at least "-Wall -Werror", right? In the vast majority of cases that I've found, if the compiler is throwing a warning at you, you've probably done something shady. "If you lie to the compiler it will get its revenge." (Henry Spencer)

    25. Re:Reminded me of my first C application by mykepredko · · Score: 1

      This was Microsoft C Version 1.0 and in 1985.

      myke

    26. Re:Reminded me of my first C application by gnasher719 · · Score: 1

      I'm proud to say that my solution to preventing this from happening is to never place a literal last in a condition, instead it always goes first like: if (1 = i)

      My solution is to turn on all sensible warnings in Clang, and set "warnings = errors". if (i = 1) won't compile, and neither will if ((i == 1)). Hope you can figure out why the second one gives a warning. That solution is much better than writing code that is hard to read.

    27. Re:Reminded me of my first C application by Anonymous Coward · · Score: 0

      Which is equivalent to:

      if(1){

      or

      if(true){

      in almost every circumstance.

    28. Re:Reminded me of my first C application by Kjella · · Score: 1

      I personally can't stand Yoda style, I'd rather they make an ":=" operator and banish the single "=" as an operator entirely.

      a := b <-- assignment
      a == b <-- comparison
      a = b <-- invalid code

      Pretty hard both to typo wrong and read wrong, as a transition you could use a compiler flag to set it to off/optional/mandatory.

      --
      Live today, because you never know what tomorrow brings
    29. Re:Reminded me of my first C application by whyde · · Score: 1

      I know I'm late to the party, but the better way to write this is always to put the constant first:

      if (1 == i) {

      ...so that if you forget to use '==' it will cause a syntax error.

    30. Re:Reminded me of my first C application by DrVxD · · Score: 1

      I've been programming in C for over 30 years, and honestly can't remember how long it's been since I used a compiler that wouldn't warn me about something like if( i = 1 ). Mind you, in my early days (when compilers were more concerned about code generation than correctness), I used to push everything through lint every so often - which would have generated a similar warning.

      All that being said, to this day I still favour if( 1 == x ), despite the raised eyebrows it gets from the younger crowd.

      --
      Not everything that can be measured matters; Not everything that matters can be measured.
  24. How to lose time and sanity by Okian+Warrior · · Score: 4, Interesting

    If you suspect the compiler is generating invalid machine code, try to make a minimal test case for it. If you succeed, file a bug report and add that test case; the compiler developers will appreciate it. If you don't succeed in finding a minimal test case that triggers the same issue, it's likely not a compiler bug but an issue in your program in some place where you weren't expecting it.

    Yeah, right. Let's see how that works out in practice.

    I go to the home page of the project with bug in hand (including sample code). Where do I log the problem?

    I have to register with your site. One more external agent gets my E-mail, or I have to take pains to manage multiple E-mails to avoid spam. (I don't want to be part of your community! I just thought you wanted to make your product better.)

    Once registered, I'm subscribed to your newsletter. (My temp E-mail has been getting status updates from the GCC crowd for years. My mail reader does something funky with the subject line, so responding with "unsubscribe" doesn't work for me.)

    Once entered, my E-mail and/or name is publicly available on the bug report for the next millenium. In plain text in the bug report, and sometimes in the publicly-accessible changelog - naked for the world to see (CPAN is especially fragrant).

    Some times the authors think it's the user's problem (no, really? This program causes gcc to core dump. How can that be *my* fault?) Some times the authors interpret the spec differently from everyone else (Opera - I'm looking at you). Some times you're just ignored, some times they say "We're rewriting the core system, see if it's still there at the next release", and some times they say "it's fixed in the next release, should be available in 6 months".

    What you really do is figure out the sequence of events that causes the problem, change the code to do the same thing in a different way (which *doesn't* trigger the error), and get on with your life. I've given up reporting bugs. It's a waste of time.

    That's how you deal with compiler bugs: figure out how to get around them and get on with your work.

    No, I'm not bitter...

    1. Re:How to lose time and sanity by BertieBaggio · · Score: 1

      Once entered, my E-mail and/or name is publicly available on the bug report for the next millenium. In plain text in the bug report, and sometimes in the publicly-accessible changelog - naked for the world to see (CPAN is especially fragrant).

      Well, at least it smells nice.

      --
      If all you have is a grenade, pretty soon every problem looks like a foxhole -- MightyYar
    2. Re:How to lose time and sanity by kwerle · · Score: 2

      ...I have to register with your site. One more external agent gets my E-mail, or I have to take pains to manage multiple E-mails to avoid spam. (I don't want to be part of your community! I just thought you wanted to make your product better.)...

      Let me help with one aspect.

      If your email address is:
      your_address@gmail.com
      then you supply
      your_address+domain.name@gmail.com

      And if you don't use gmail, then maybe your email supplier does something similar. Or you should learn procmail if you're still managing your own.

      p.s. It looks like your www.o...r.com domain/host is down.

    3. Re:How to lose time and sanity by Anonymous Coward · · Score: 2, Informative

      You're not kidding. I recently discovered a bug in glibc that causes cos() to return values ridiculously outside the range of -1 to +1 when used along with fesetround() on 64-bit systems. After submitting the bug report (and, indeed, they posted my email address online) someone posts a link to an older bug report, from five years earlier, about a similar issue with exp() and cosh(). The bug report ends with something like "well, I fixed it for exp(), cosh() and sinh(). If any other functions have a similar issue, someone should file a separate bug report."

      The bug had been opened five years before it was closed. ...and I'm not sure they even fixed it for cos(), nor do I know if they're going to because I can't find the glibc bug tracker anymore. (how I found it the first time I have no idea) Lesson learned: Don't expect glibc to know how to do math. So I just changed the code so it no longer used fesetround().

      Sadly, I was only using it in order to get GCC to stop using glibc and use my FPU directly instead, as glibc's math is slow, but unfortunately the same tricks don't work on 64-bit systems, where glibc seems to always be used. The result is that simply by using the tricks listed in my blog post and recompiling for 32-bit, your math code runs twice as fast as it does when compiled for 64-bit. However, the same tricks cause the math functions to return wildly incorrect results when compiled for 64-bit.

    4. Re:How to lose time and sanity by Anonymous Coward · · Score: 0

      I think he means it stinks.

    5. Re:How to lose time and sanity by KiloByte · · Score: 1

      Sorry, but 99.99% of people who scream about bugs in the compiler have a bug in their own code. The GCC team is small and overworked; without a properly reduced test case your report is typically a waste of time. And clang, whose team has a number of paid developers, dismisses reports even more readily.

      --
      The creatures outside looked from Alt-Right to Antifa; but already it was impossible to say which was which.
    6. Re:How to lose time and sanity by Anonymous Coward · · Score: 0

      My mail reader does something funky with the subject line, so responding with "unsubscribe" doesn't work for me.

      One wonders why you continue to use a mail reader that can't manage to send an email without mangling the subject header.

    7. Re:How to lose time and sanity by Anonymous Coward · · Score: 0

      Once registered, I'm subscribed to your newsletter. (My temp E-mail has been getting status updates from the GCC crowd for years. My mail reader does something funky with the subject line, so responding with "unsubscribe" doesn't work for me.)

      You should file a bug report with your mail reader!

    8. Re:How to lose time and sanity by V+for+Vendetta · · Score: 3, Insightful

      One wonders why you continue to use a mail reader that can't manage to send an email without mangling the subject header.

      He wrote a bug report, but it was ignored.

    9. Re:How to lose time and sanity by Kjella · · Score: 1

      I've found that to be true of most bugs actually. If you have to hand it off for someone else to fix, it's highly unlikely they'll rush to fix it at a pace that solves your immediate problem, so you'd better work around it yourself. At least with one particular product I worked with, though, I often found that bugs would get fixed 6-12 months later. Apparently every so often they'd gather up all the low priority bugs in that area of the code and go bug stomping. I guess it's easier for the developers to understand more of the code and fix many things at once than to snipe one bug at a time, so I can understand that. If you're planning to use the product over any length of time, those kinds of cleanups are worth the bug reporting time. Otherwise you'll just keep running into it again and again.

      --
      Live today, because you never know what tomorrow brings
    10. Re:How to lose time and sanity by MtHuurne · · Score: 1

      My e-mail address was harvested years ago: I have to rely on spam filters anyway, so I don't really worry about publishing it in bug trackers.

      Registration is a pain, I agree. However, on a project that I participate in (openMSX) we did decide to stop accepting anonymous bug reports, since the majority of bug reports lacked essential information to be able to reproduce the bug. If there is no way to contact the user who filed the bug, the only thing a developer can do is close it as non-reproducible ("worksforme" in Bugzilla - a very poor choice of words in my opinion).

      How well a project responds to bugs differs a lot per project and also per individual bug. Some are fixed years later, some within a day. Some are marked invalid even though they are valid, others are indeed invalid or are duplicates. The compiler projects (GCC, LLVM, Intel C++) have been relatively good with responding to my bug reports, so I will report new bugs to them when I find any. For some other projects, I don't bother anymore unless it's a data destroyer or security issue.

  25. Bugs A Noy by ios+and+web+coder · · Score: 1

    In my own coding, I tend to *gasp* make mistakes. Sometimes, really, really dumb ones.

    One of the biggest problems with my coding, is that I am often the only real coder looking at it. Even my FOSS work seldom gets reviewed by coders.

    I can't say enough about peer review. I wish I had more. It can really suck, as one thing that geeks LOVE to do, is cut down other geeks. However, they are sometimes right, and should be heard.

    Negative feedback makes the product better. Positive feedback makes the producer feel better.

    I prefer a better product, but that's just me.

    I had an interesting bug just the other day in my FOSS project. It's an iOS (iPhone/iPad) app that uses the MapKit Framework API.

    The bug was on this line.

    The original code is here.

    So that folks don't have to look at a whole bunch of source, here's the problematic two lines:

    [mapSearchView addAnnotation:myMarker];
    [mapSearchView setDelegate:self];

    When iOS 6 came out (with Apple's...wonderful...new maps), the black marker suddenly started showing as the default marker (this only works on iPads, so no one seemed to see it).

    I went nuts trying to figure it out (actually, I've been nuts for a long time, but now I have something to blame it on).

    I traced into the callbacks, and saw that they were being called with an empty annotation. Whiskey Tango Foxtrot?

    Then, just for s's and g's, tried this:

    [mapSearchView setDelegate:self];
    [mapSearchView addAnnotation:myMarker];

    Damn if that didn't fix it.

    It was a case of an ambiguous API contract. The Apple maps call the annotation setup as soon as the annotation is set, and the old Google API waited until a few things were set up, so the delegate call set after the annotation worked.

    I could rail against the framework, but it was really my own fault, and I am just glad I figgered it out.

    --

    "For every complex problem there is an answer that is clear, simple, and wrong."

    -H. L. Mencken

    1. Re:Bugs A Noy by DrVxD · · Score: 1

      I can't say enough about peer review. I wish I had more

      This, this and this again. Only more so.

      one thing that geeks LOVE to do, is cut down other geeks

      Folks who are doing that in a code review have completely the wrong approach, and should be battered around the head with a clue stick.
      I've found that the best way to deal with them is to point out that this shouldn't be seen as an opportunity to cut others down, but as an opportunity to show off by educating the reviewee (and, of course, showing off is the other thing that geeks love to do ;-)

      Negative feedback makes the product better. Positive feedback makes the producer feel better.

      That depends on the actual feedback. Negative feedback like "man, this sucks" hurts the project team's morale and thus, long term, the product. Negative feedback like "you can improve this by doing X here, which is better because Y" is actually very positive - it directly improves the product, improves the team's skill set, and improves communication within the team - win, win, win.

      This is one of the things I really like about pair programming - instant, on-the-fly code review. And it doesn't much matter if your partner has a zillion times your experience, or is just writing their first program; both halves of the pair benefit from the process. I have my team spend about 50% of their actual coding time pairing, and rotate the pairs each session. Some folks seem to have the concept of man-hours too deeply ingrained in their psyche to understand how this can be as productive as everyone flying solo, but in my experience, pairing tends to be *more* productive.

      --
      Not everything that can be measured matters; Not everything that matters can be measured.
    2. Re:Bugs A Noy by ios+and+web+coder · · Score: 1

      Thanks. That's some relevant feedback. There are cultural reasons in my organization that prevent me from implementing Pair Programming (or almost anything else that smells "agile").

      I do FOSS stuff for a community that can get real, real nasty and pithy. Negative feedback can often take form as streams of vitriol and invective. It was wonderful training for me. For a while, I would respond in kind (I can get just as nasty, if I want), and nothing ever came of it. However, I started responding in a sincere manner, and often was able to "talk folks off the ledge," and extract some useful information, and, sometimes, real allies and evangelists. I do most of my work alone, but a hell of a lot of folks use what I write.

      "This sucks fat c**k!" is often the first thing I hear, but drawing out the poster often gets me to "It won't let me enter Broadway West." (of course, followed by "That's why it sucks").

      --

      "For every complex problem there is an answer that is clear, simple, and wrong."

      -H. L. Mencken

  26. Re:I don't believe 1% of computers give wrong answ by Anonymous Coward · · Score: 0

    I believe it, I am actually surprised by the low number. The problem with memory is that a simple memory test often fails to detect errors, so a POST will not find anything but the easiest faults.

    memtest86+ is actually really smart: it tries a lot of different kinds of access methods and bit patterns for certain types of errors. Even memtest86+ can fail to detect memory errors.

    Interestingly enough the best test for memory errors they found is using the gcc compiler, try to compile gentoo (including X, kde, perl (perl is huge)) with many parallel compiles so that it will hit swap slightly and hope that it won't segfault.

    This article sounds interesting, I haven't read it yet. On Dr. Dobbs there was also an article not long ago about checking (like fsck) your application data structures continuously to check for errors. I am almost thinking that you should run a CRC on all your data structures. Nothing is worse than a customer who has faulty memory.
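    For what "run a CRC on all your data structures" could look like, here is a minimal sketch in C. The `PlayerState` struct and the `player_seal` / `player_is_intact` names are made up for illustration; the CRC is the ordinary bitwise CRC-32, chosen for brevity rather than speed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Plain bitwise CRC-32 (reflected polynomial 0xEDB88320). */
uint32_t crc32_bytes(const void *data, size_t len)
{
    const uint8_t *p = data;
    uint32_t crc = 0xFFFFFFFFu;

    assert(data != NULL || len == 0);
    for (size_t i = 0; i < len; i++) {
        crc ^= p[i];
        for (int bit = 0; bit < 8; bit++) {
            /* Shift right; XOR in the polynomial when the dropped bit was 1. */
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
        }
    }
    return ~crc;
}

/* Hypothetical application structure; the CRC covers every field before it. */
typedef struct {
    int32_t  hit_points;
    int32_t  gold;
    uint32_t crc;
} PlayerState;

/* Call after every legitimate update. */
void player_seal(PlayerState *s)
{
    s->crc = crc32_bytes(s, offsetof(PlayerState, crc));
}

/* Call periodically, fsck-style; returns 0 when the payload no longer matches. */
int player_is_intact(const PlayerState *s)
{
    return s->crc == crc32_bytes(s, offsetof(PlayerState, crc));
}
```

    The cost is that every writer has to remember to reseal, so this tends to fit structures with a small number of well-defined update paths.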

  27. Re:Mod Down by Anonymous Coward · · Score: 0

    Ok trolling is one thing, but what you wrote is a sick joke.

  28. except if GCC is wrong by decora · · Score: 1

    which, well, it can be.

  29. 12 hours a day for weeks on end by decora · · Score: 2

    i can't believe you don't understand that the brain doesn't work 100% reliably when you force it past the breaking point like this. it's work 101.

  30. Yes, hardware errors happen! by Just+Brew+It! · · Score: 1

    It's just a matter of whether you realize it or not.

    The blatant ones cause an application or OS crash. But depending on what got corrupted, it might just cause a momentary application glitch, or even cause an alteration in the contents of a file that you won't notice for weeks... if ever.

    When I build PCs, they get an overnight Memtest run at a minimum. Most of the time I also use ECC RAM to protect against random flipped bits and DIMMs that fail after being in use for a while.

  31. another trick: stop mixing up testing + assignment by decora · · Score: 1

    anything that both assigns and tests a loop index is by definition a fucking accident waiting to happen. it's like driving a car without wearing a seatbelt and then deciding the 'solution' is to put the steering wheel in the back seat instead of the front.

  32. Re:another trick: stop mixing up testing + assignm by mykepredko · · Score: 1

    I agree, but

    if (i = 1) {

    is a perfectly valid "C" (and Java) statement - there was no intention of putting an assignment in a conditional statement.

    Modern compilers now issue warnings on statements like this, but at the time nothing was returned.
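    A small C sketch of why the statement is legal and what it actually does. The function names are invented for the example; the buggy form itself is kept in a comment so the block compiles without the warning (GCC reports it under -Wparentheses, which -Wall enables).

```c
#include <assert.h>

/* An assignment is an expression whose value is the value assigned.
 * So `if (i = 1) { ... }` stores 1 in i, then tests 1: the branch is
 * always taken and the old value of i is lost.
 * This helper just makes that value visible. */
int assign_one(int *i)
{
    return (*i = 1);
}

/* Constant-on-the-left comparison: mistyping it as `1 = i` is a compile
 * error, because a literal cannot be assigned to. */
int is_one(int i)
{
    if (1 == i) {
        return 1;
    }
    return 0;
}
```

    Calling assign_one on a variable holding 5 returns 1 and leaves the variable at 1, which is exactly what the accidental conditional does behind your back.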

    myke

  33. Re:I don't believe 1% of computers give wrong answ by tepples · · Score: 2

    Fortunately it was an RTS so we could get away with fixed-point.

    Does it really vary by genre? For a game world the size of Liechtenstein, a 32-bit fixed-point length gives precision down to 10 microns or so. And even in a vast open world, you start to get glitches like the far lands in Minecraft if you stray more than 12.5 million units from the origin.
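    The arithmetic behind that figure, as a sketch: spread a world extent evenly over all 2^32 values of a 32-bit coordinate. The 25 km extent is an assumption (roughly Liechtenstein's length), and the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define FIXED32_STEPS 4294967296.0   /* 2^32 distinct coordinate values */

/* Smallest representable step, in metres, for a world `extent_m` metres wide. */
double fixed32_step_m(double extent_m)
{
    assert(extent_m > 0.0);
    return extent_m / FIXED32_STEPS;
}

/* Convert a position in metres to the 32-bit fixed-point coordinate. */
uint32_t metres_to_fixed32(double pos_m, double extent_m)
{
    assert(pos_m >= 0.0 && pos_m < extent_m);
    return (uint32_t)(pos_m / fixed32_step_m(extent_m));
}
```

    With extent_m = 25000.0 the step works out to about 5.8e-6 m, i.e. a little under 6 microns, which is the same ballpark as the "10 microns or so" above.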

  34. Re:I don't believe 1% of computers give wrong answ by Nyder · · Score: 2

    He said 1% of computers that were used to play Guild Wars gave wrong answers. Gaming PCs are more likely to be overclocked too far, have under-dimensioned power supplies or overheating issues than the average PC. 1% doesn't sound unrealistically high to me.

    Guild Wars runs okay on a crappy netbook, let alone anyone's PC. If you need to OC to play Guild Wars then you are using a Pentium III.

    Just saying.

    I've fixed a lot of PCs. In fact, I pride myself on the fact that I can usually figure out what is wrong with a computer, software or hardware. And I've seen some funky hardware in my time. And lost a lot of hardware going bad.

    I bet more than 1% of computers are problematic and the owners don't know it. The last chunk of memory could be bad, but if they don't use all the memory (or never test it), they might never find that out. They might dismiss random shutdowns without understanding there is a problem.

    In my experience, there is a lot of computer hardware out there that is crappily made and shouldn't be sold, let alone in someone's computer.

    --
    Be seeing you...
  35. QA fail by Alwin+Henseler · · Score: 2

    Worse, the article hints at a bigger problem:

    "We had "pushed" a new build out to end-users, and now none of them could play the game!"

    Which I read as: developers write & debug code, that code goes through a build server which builds it & combines with game data etc, result of that is pushed to users. The obvious step missing here: make sure the exact same stuff you're pushing to users, is working & tested thoroughly before release. Seems like a gaping Quality Assurance fail right there, forget differences between developer and production systems.

    Skip that step and you're implicitly assuming that correct code (like, what's known to work well on developer's system) will produce correct working end product. Even if developer's system and production systems are configured 100% the same, that assumption is still flawed: there's always the possibility of file corruption, eg. a random single-bit error that occurs somewhere during the build process, or anything else that goes into the end product which a developer doesn't check directly.

    Of course it's best to make sure individual steps in the process are reliable, but whatever you do: at the very least check what you kick out the door. QA 101.

  36. Confusing poing in parent article by mykepredko · · Score: 1

    Hiya,

    I just noticed the confusion when I put in "keeping it from executing" when I meant to say "keeping it from exiting [the loop]".

    Sorry about that,

    myke

    1. Re:Confusing poing in parent article by mykepredko · · Score: 1

      Should I even note that I misspelled "point"?

      myke

  37. Re:I don't believe 1% of computers give wrong answ by gadzook33 · · Score: 1

    I don't think this should be modded as flamebait. Personally I view the article with a similar degree of skepticism and incredulity. I have a good friend who works at a major chip manufacturer and specializes in fault detection. He related to me that, essentially they have never seen a case of an undetected (by the CPU) fault, despite running tests like this on massively huge systems. Between a game programmer and the company who makes their bread and butter doing this, I'm going to have to go with the latter until someone posts code or something more concrete than what I view as a lot of speculation.

  38. Even better! by Roger+W+Moore · · Score: 5, Funny

    the compiler had done exactly what it should according to the standard...

    That's even better - it means that you've found a bug in the standard! ;-)

    1. Re:Even better! by disambiguated · · Score: 2

      So true. It's a dysfunctional love/hate relationship I have with the C++ standard. And just like most abusive relationships, I refuse to leave her. :)

      I wish D would gain some momentum.

  39. More error checking by Okian+Warrior · · Score: 5, Interesting

    My previous was modded up, so here's some more checks.

    During boot, the system would execute a representative sample of CPU instructions, in order to test that the CPU wasn't damaged. Every mode of memory storage (ptr, ptr++, --ptr), add, subtract, multiply, divide, increment &c.

    During boot, all memory was checked - not a burn-in test, just a quick check for integrity. The system wrote 0, 0xFF, 0xA5, 0x5A and read the output back. This checked for wires shorted to ground/VCC, and wires shorted together.
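    A minimal sketch of that kind of quick pattern check, in C. The function name and the choice to return the failing index are assumptions for illustration; a real boot test would run before the stack is trusted and would walk physical address ranges rather than an ordinary buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Write each pattern over the region and read it back.
 * Returns the index of the first byte that fails, or `len` if all pass.
 * Destructive: the region ends up filled with the last pattern. */
size_t quick_ram_check(volatile uint8_t *mem, size_t len)
{
    static const uint8_t patterns[] = { 0x00, 0xFF, 0xA5, 0x5A };

    assert(mem != NULL || len == 0);
    for (size_t p = 0; p < sizeof patterns; p++) {
        for (size_t i = 0; i < len; i++) {
            mem[i] = patterns[p];
        }
        for (size_t i = 0; i < len; i++) {
            if (mem[i] != patterns[p]) {
                return i;            /* stuck or shorted bit somewhere in this byte */
            }
        }
    }
    return len;
}
```

    0x00 and 0xFF catch bits stuck high or low; 0xA5 and 0x5A are complements with alternating bits, so neighbouring data lines that are shorted together cannot both hold their values. The volatile qualifier keeps the compiler from optimizing the read-back away.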

    During boot, the .bss segment was filled with a pattern, and as a rule, all programs were required to initialize all of their static variables. Each routine had an xxxINIT function which was called at boot. You could never assume a static variable was initialized to zero - this caught a lot of "uninitialized variable" errors.

    (This allowed us to reboot specific subsystems without rebooting the whole system. Call the SerialINIT function, and don't worry about reinitializing that section's static vars.)

    The program code was checksummed (1K at a time) continuously.
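    A sketch of how a 1K-at-a-time background check might be structured, in C. The post does not say which checksum was used, so the simple additive sum here is an assumption, as are the `CodeChecker` type and function name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CHUNK_BYTES 1024u

typedef struct {
    const uint8_t *image;     /* start of the code image */
    size_t         size;      /* image length in bytes */
    size_t         offset;    /* next byte to sum in the current pass */
    uint32_t       running;   /* sum accumulated so far in this pass */
    uint32_t       expected;  /* sum recorded at build time */
} CodeChecker;

/* Sum the next chunk. Call once per main-loop iteration.
 * Returns 0 while a pass is in progress, 1 when a full pass matched,
 * -1 when a full pass did not match. */
int code_checker_step(CodeChecker *c)
{
    assert(c != NULL && c->offset <= c->size);

    size_t left = c->size - c->offset;
    size_t n = left < CHUNK_BYTES ? left : CHUNK_BYTES;

    for (size_t i = 0; i < n; i++) {
        c->running += c->image[c->offset + i];
    }
    c->offset += n;
    if (c->offset < c->size) {
        return 0;
    }

    int ok = (c->running == c->expected);
    c->offset = 0;            /* start the next pass from the top */
    c->running = 0;
    return ok ? 1 : -1;
}
```

    Doing one chunk per loop iteration keeps the cost per iteration bounded while still covering the whole image over time.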

    When filling memory, what pattern should you use? The theory was that any program using an uninitialized variable would crash immediately because of the pattern. 0xA5 is a good choice:

    1) It's not 0, 1, or -1, which are common program constants.
    2) It's not a printable character
    3) It's a *really big* number (negative or unsigned), so array indexing should fail
    4) It's not a valid floating point or double
    5) Being odd, it's not a valid pointer
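    Most of those properties can be checked directly. The sketch below uses hypothetical helper names and covers points 1, 2, 3 and 5. Point 4 is deliberately left out: on IEEE 754 hardware the pattern decodes to a tiny finite negative number rather than an invalid value, so it is better thought of as "obviously wrong" than as unrepresentable.

```c
#include <assert.h>
#include <ctype.h>
#include <stdint.h>
#include <string.h>

#define POISON_BYTE 0xA5

/* Fill an object with the poison byte, as the boot code would fill .bss. */
void poison(void *obj, size_t len)
{
    assert(obj != NULL);
    memset(obj, POISON_BYTE, len);
}

/* What an uninitialized 32-bit unsigned variable would read as. */
uint32_t poisoned_u32(void)
{
    uint32_t v;
    poison(&v, sizeof v);
    return v;                 /* 0xA5A5A5A5 */
}

/* The same bits seen as a signed value (large and negative). */
int32_t poisoned_i32(void)
{
    int32_t v;
    poison(&v, sizeof v);
    return v;
}
```

    As an unsigned index, 0xA5A5A5A5 is about 2.8 billion, far past the end of any sane array; as a pointer it is odd, so any aligned access through it is suspect.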

    Whenever we use enums, we always start the first one at a different number; ie:

    enum Day { Sat = 100, Sun, Mon... }
    enum Month { Jan = 200, Feb, Mar, ... }

    Note that the enums for Day aren't the same as Month, so if the program inadvertently stores one in the other, the program will crash. Also, the enums aren't small integers (ie - 0, 1, 2), which are used for lots of things in other places. Storing a zero in a Day will cause an error.

    (This was easy to implement. Just grep for "enum" in the code, and ensure that each one starts on a different "hundred" (ie - one starts at 100, one starts at 200, and so on).)
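    Spelled out in C, with the range checks that turn a mix-up into an immediate failure. The validator functions and `is_weekend` are illustrative additions; the idea of starting each enum on its own "hundred" is the one described above.

```c
#include <assert.h>

typedef enum { Sat = 100, Sun, Mon, Tue, Wed, Thu, Fri } Day;
typedef enum { Jan = 200, Feb, Mar, Apr, May, Jun,
               Jul, Aug, Sep, Oct, Nov, Dec } Month;

/* Because the ranges do not overlap, a Month (or a stray 0) is never a valid Day. */
int day_is_valid(int value)
{
    return value >= Sat && value <= Fri;
}

int month_is_valid(int value)
{
    return value >= Jan && value <= Dec;
}

/* Example consumer: refuses anything that is not a Day. */
int is_weekend(int day)
{
    assert(day_is_valid(day));
    return day == Sat || day == Sun;
}
```

    C will happily convert between enum types and integers, so the crash only happens where a check like the assert in is_weekend exists; the disjoint ranges are what make that check meaningful.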

    The nice thing about safety cert is that the hardware engineer was completely into it as well. If there was any way for the CPU to test the hardware, he'd put it into the design.

    You could loopback the serial port (ARINC on aircraft) to see if the transmitter hardware was working, you could switch the A/D converters to a voltage reference, he put resistors in the control switches so that we could test for broken wires, and so on.

    (Recent Australian driver couldn't get his vehicle out of cruise-control because the on/off control wasn't working. He also couldn't turn the engine off (modern vehicle) nor shift to neutral (shift-by-wire). Hilarity ensued. Vehicle CPU should abort cruise control if it doesn't see a periodic heartbeat from the steering-wheel computer. But, I digress...)

    If you're interested in software safety systems, look up the Therac-25 some time. Particularly, the analysis of the software bugs. Had the system been peppered with ASSERTs, no deaths would have occurred.

    P.S. - If you happen to be building a safety cert system, I'm available to answer questions.

    1. Re:More error checking by Lehk228 · · Score: 1

      cruise control should suspend when brakes are pressed, in addition the input from cruise control should be masked out in some way by the brakes so even if cc ignores the brakes activation it can't actually fight against the stop

      --
      Snowden and Manning are heroes.
    2. Re:More error checking by lookatmyhorse · · Score: 1

      and brakes are stronger at the wheel than the engine traction, so don't panic

    3. Re:More error checking by Anonymous Coward · · Score: 0

      During boot, do all this stuff INLINE. You cannot expect that RAM is valid and working until you have proven it so.
      So, you can't put all these great checks in a function/subroutine because the first thing the compiler will do is put the return address onto the stack ... which is IN RAM.

    4. Re:More error checking by eulernet · · Score: 1

      You forgot a few other tricks, like:

      1) filling the memory space starting at 0x0000 with magic numbers. NULL is equal to 0 on most compilers, so when you use pointers, it could write at 0x0000. Sometimes NULL is -1, so you can also protect the high memory range.
      2) using different magic numbers at the top and at the bottom of the stack, or in any allocated memory space

      I also like to add more controls (pointers checking) in memory allocation, since I tend to use my own malloc. Of course, my own malloc allocates a few more bytes, some at the beginning, some at the end, and the memory is always filled with magic numbers. As you said, 0x55555555 or 0xAAAAAAAA are good for that.
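      A minimal sketch of such a guarded allocator in C, to make the layout concrete. The names, the 8-byte guard size and the specific byte values are assumptions; it also assumes `size_t` is at most 8 bytes so the stored size fits in the header.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define HEADER_BYTES 16u   /* stored size + front guard; keeps user data aligned */
#define GUARD_BYTES   8u
#define GUARD_FRONT 0x55
#define GUARD_BACK  0xAA
#define FILL_BYTE   0xA5

/* Layout: [size ... front guard][user bytes][back guard] */
void *guarded_malloc(size_t size)
{
    uint8_t *raw = malloc(HEADER_BYTES + size + GUARD_BYTES);
    if (raw == NULL) {
        return NULL;
    }
    memcpy(raw, &size, sizeof size);                  /* remember the user size */
    uint8_t *user = raw + HEADER_BYTES;
    memset(user - GUARD_BYTES, GUARD_FRONT, GUARD_BYTES);
    memset(user, FILL_BYTE, size);                    /* poison fresh memory */
    memset(user + size, GUARD_BACK, GUARD_BYTES);
    return user;
}

/* Returns 1 if both guards are intact, 0 if something wrote past either end. */
int guarded_check(const void *ptr)
{
    const uint8_t *user = ptr;
    size_t size;

    assert(ptr != NULL);
    memcpy(&size, user - HEADER_BYTES, sizeof size);
    const uint8_t *front = user - GUARD_BYTES;
    const uint8_t *back = user + size;
    for (size_t i = 0; i < GUARD_BYTES; i++) {
        if (front[i] != GUARD_FRONT || back[i] != GUARD_BACK) {
            return 0;
        }
    }
    return 1;
}

void guarded_free(void *ptr)
{
    if (ptr == NULL) {
        return;
    }
    assert(guarded_check(ptr));                       /* catch overruns at free time */
    free((uint8_t *)ptr - HEADER_BYTES);
}
```

      Using different values for the front and back guards makes it obvious from a memory dump which side was overrun.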

    5. Re:More error checking by tippen · · Score: 1

      Another favorite in embedded systems: lock the 4K page at address 0x00000000 so the processor halts execution when someone ends up dereferencing a NULL pointer.

      Frequently have to move the IVT around to make this work, but well worth it.

    6. Re:More error checking by Okian+Warrior · · Score: 1

      Like!

      Will add to my list of "Best Practices for Safety".

  40. Re:another trick: stop mixing up testing + assignm by Anonymous Coward · · Score: 0

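    It's not valid Java when i is an int. Java requires the condition to be a boolean, so if (i = 1) fails to compile; only an assignment to a boolean variable, such as if (b = true), still slips through.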

  41. Re:I don't believe 1% of computers give wrong answ by Tagged_84 · · Score: 1

    I've never coded an FPS, but I can imagine it would involve more raycasting (and higher precision) than an RTS, of which I have shipped one.

  42. Re:I don't believe 1% of computers give wrong answ by Anonymous Coward · · Score: 0

    Lots of girls play Guild Wars. I've been asked by a few to take a look at their computer and why it's crashing all the time.

    It's always been cheap laptops with Hello Kitty stickers or something on them, conveniently placed on a cutesy little carpet for their laptop tray.

    This and the cat hair from the invariable cutesy little cat that likes to sit on the warm laptop do the rest.

  43. Re:I don't believe 1% of computers give wrong answ by godrik · · Score: 1

    Faulty hardware is a common problem in clusters. We still have a 5-year-old cluster at work that is falling to pieces due to faulty processors, faulty memory, faulty power supplies and faulty hard drives. Some errors are just weird. Have you never seen a machine show errors in memtest86 and then stop showing them once you swapped two DIMMs? I see that from time to time, and I don't spend much time dealing directly with hardware.

  44. Can't really test an overclocked CPU ... by perpenso · · Score: 1

    The problem is that you can't really test an overclocked CPU. CPUs do not work perfectly or catastrophically fail (crash the software). There is a range of errors between these two extremes. Some of the overclocking induced errors are quite subtle, a simple incorrect answer. The problem is that these subtle errors are sometimes dependent upon a certain sequence of instructions or certain data patterns and the sequences and data, as well as the failing instruction and the clock speed where these errors manifest, can vary from one individual CPU to the next.

    You can run all sorts of test utilities and you will probably only discover the CPUs with the more severe problems. You will probably not be able to tell the CPUs that have subtle failures from CPUs that are functioning properly.

    1. Re:Can't really test an overclocked CPU ... by DNS-and-BIND · · Score: 1

      The typical method used by hobbyists was: overclock step by step until it crashes. Then, back off one step. You are now at the "optimal" speed - i.e. the fastest and therefore best speed. Games crash? Must be software bugs!

      --
      Shutting down free speech with violence isn't fighting fascism. It IS fascism!
    2. Re:Can't really test an overclocked CPU ... by perpenso · · Score: 2

      The typical method used by hobbyists was: overclock step by step until it crashes. Then, back off one step. You are now at the "optimal" speed - i.e. the fastest and therefore best speed. Games crash? Must be software bugs!

      And one step back from an obvious crash may be in the subtle errors region where CPU failures can't be easily distinguished from software bugs. For example the subtle error can simply be an erroneous answer, 2+2=5 sort of stuff. If that erroneous answer is part of the calculation of where to draw something on the screen the error may be of no consequence, one pixel off may be imperceptible. However if that erroneous answer is ultimately part of the calculation of an array index then being one index off may be an array overflow and result in a crash.

    3. Re:Can't really test an overclocked CPU ... by Hatta · · Score: 1

      And one step back from an obvious crash may be in the subtle errors region where CPU failures can't be easily distinguished from software bugs. For example the subtle error can simply be an erroneous answer, 2+2=5 sort of stuff

      That's why you run utilities like SuperPi. You can check that erroneous answer against known values. Benchmarks like this are designed to be as stressful as possible on the CPU, so that there's good reason to expect that if you don't see errors in SuperPi you won't see them anywhere else.
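      The "check against known values" idea boils down to something like the following C sketch. It is a toy, not how SuperPi works: the block is filled with 1..n and the sums are compared against their closed forms, which play the role of the known answers. Function names are invented.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fill the block with 1, 2, ..., n. */
void stress_fill(uint32_t *block, size_t n)
{
    assert(block != NULL);
    for (size_t i = 0; i < n; i++) {
        block[i] = (uint32_t)(i + 1);
    }
}

/* Recompute sums from memory and compare with the closed-form answers:
 * sum = n(n+1)/2 and sum of squares = n(n+1)(2n+1)/6.
 * Returns 1 when both match, 0 on any discrepancy. */
int stress_verify(const uint32_t *block, size_t n)
{
    uint64_t sum = 0, sum_sq = 0, big_n = n;

    assert(block != NULL && n <= (1u << 20));   /* keeps the 64-bit math in range */
    for (size_t i = 0; i < n; i++) {
        sum += block[i];
        sum_sq += (uint64_t)block[i] * block[i];
    }
    return sum == big_n * (big_n + 1) / 2
        && sum_sq == big_n * (big_n + 1) * (2 * big_n + 1) / 6;
}
```

      A check like this only exercises the instructions and data patterns it happens to use, so passing it shows the absence of errors on that particular path and nothing more.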

      --
      Give me Classic Slashdot or give me death!
    4. Re:Can't really test an overclocked CPU ... by sjames · · Score: 1

      The worst part is it might come up with 2+2=5 only under an extensive set of circumstances that hit once in a million times. The odds are decent that it will imperceptibly displace a pixel once in a while and it's not a big deal. Until it just happens to be calculating your taxes one day and you 'win' the bug lottery.

      The manufacturer can find that sort of thing since they can probe the surface of the chip during testing. The home overclocker can't do that.

    5. Re:Can't really test an overclocked CPU ... by perpenso · · Score: 1

      And one step back from an obvious crash may be in the subtle errors region where CPU failures can't be easily distinguished from software bugs. For example the subtle error can simply be an erroneous answer, 2+2=5 sort of stuff

      That's why you run utilities like SuperPi. You can check that erroneous answer against known values. Benchmarks like this are designed to be as stressful as possible on the CPU, so that there's good reason to expect that if you don't see errors in SuperPi you won't see them anywhere else.

      Honestly, that is not the case. These subtle erroneous answers can be dependent upon only a certain sequence of instructions being executed immediately before or certain data patterns being operated upon immediately before and the sequences or patterns may be different for each individual CPU, and the instruction offering the erroneous answer may vary for each individual CPU as well.

      These utilities can only help detect the errors towards the more serious end of the spectrum, not the more subtle.

  45. Bad capacitors by Anonymous Coward · · Score: 0

    I do both HW and SW and (too) much repair work. As many of you know, bad capacitors are epidemic. I've traced all kinds of HW and SW problems down to bad capacitors. They frequently show no external sign of failure.

    Before repairing them, I had systems that ran Linux reliably but when Windows booted on the same system, glitches, freezes, bluescreens, etc. I blame it on Windows being a little less efficient with CPU do-nothing cycles (System Idle Process), thus drawing more power and more power supply ripple and more glitching on the digital signals. I'm amazed that they're only finding 1% of systems with problems. Probably because gamers tend to buy (much) better hardware, although not all high-end hardware has good caps.

  46. Re:I don't believe 1% of computers give wrong answ by perpenso · · Score: 3, Interesting

    While at a large game company I wrote the code that collected CPU make and model, video make and model, amount of RAM, OS version, etc. Basically the type of info you see under minimum system requirements. The CPUID instruction can return a vendor string indicating who made the CPU. Intel CPUs return "GenuineIntel". On very very rare and often transient occasions the reported string had a misspelling, the misspellings generally indicated a single bit error. Whether an overclocked CPU generating subtle errors or bad RAM or a bad power supply or something else is responsible I can't say. All I really know is that outside of the CPU manufacturing facility things do go wrong in hardware. The article is consistent with various things I have seen.
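    Telling a single-bit error from a genuinely different vendor comes down to counting differing bits. A sketch in C follows; the function names and the three-way classification are assumptions for illustration, and the reported string is taken as an argument rather than read with CPUID.

```c
#include <assert.h>
#include <stddef.h>

/* Number of bit positions in which two byte strings differ. */
int bit_distance(const char *a, const char *b, size_t len)
{
    int bits = 0;

    assert(a != NULL && b != NULL);
    for (size_t i = 0; i < len; i++) {
        unsigned char diff = (unsigned char)a[i] ^ (unsigned char)b[i];
        while (diff) {
            bits += diff & 1u;
            diff >>= 1;
        }
    }
    return bits;
}

/* 0 = exact match, 1 = one flipped bit (likely hardware), 2 = anything else. */
int classify_vendor(const char *reported)
{
    int d = bit_distance(reported, "GenuineIntel", 12);

    if (d == 0) {
        return 0;
    }
    return d == 1 ? 1 : 2;
}
```

    For example, "GenuineIotel" differs from the real string by one bit ('n' is 0x6E, 'o' is 0x6F), whereas "AuthenticAMD" differs in dozens.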

  47. Doubtful by SmallFurryCreature · · Score: 2

    Higher end "gamer" motherboards come with default overclock settings; it doesn't require anything more than leaving the default settings as they are for the motherboard to attempt optimal settings rather than purely the settings your CPU/Memory themselves report. Going even further is also pretty easy, requiring little more than selecting performance mode from a nice graphical screen.

    Yes, extreme overclocking is still for enthusiasts, but running your hardware slightly faster than recommended by your CPU/Memory maker is child's play today on anything beyond an office computer.

    --

    MMO Quests are like orgasms:

    You may solo them, I prefer them in a group.

    1. Re:Doubtful by fuzzywig · · Score: 1
      My mobo isn't even that high end, but it still had a 'one-click overclock' setting in the (EFI) BIOS.

      3.2GHz went straight up to 4.0. So far I've not even bothered to try getting more out of it.

    2. Re:Doubtful by DMUTPeregrine · · Score: 1

      Oh, I know, my motherboard has such settings.
      Depending on how extreme an overclock I set I will eventually get errors in stress tests, even when the computer appears to operate fine otherwise. The built-in overclock system is still running the processor outside its stated limits, and isn't guaranteed to provide a stable system. No overclock ever is. That's why it's always a good idea to stress-test after overclocking, and if the test fails decrease the overclock until it passes.

      --
      Not a sentence!
  48. Re:I don't believe 1% of computers give wrong answ by Anonymous Coward · · Score: 0

    Most wrong answers fail to crash the computer, so are unnoticed. Bit errors happen far more often than you'd believe. Copy 100 GB of data with no validation and no retries, you're almost guaranteed to have at least 1 bit error. Don't care how you do it - memory to memory, hard disk to hard disk. Copy 100 GB of data over the Internet, and you're VERY likely to have a bit error that gets past the TCP/IP checksums.

    Crashes due to "that bit is not right, don't know how that happened, but it wasn't a bug, it's hardware" happen about once per thousand hours of play for us; we've been tracking that for years and working hard to work around it.
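
    To illustrate how corruption can slip past the TCP/IP checksum mentioned above, here is a small sketch of the 16-bit ones' complement Internet checksum. The helper names and the four-byte sample are made up for the example. Because the checksum is just a sum of 16-bit words, two bit flips in the same bit position of different words (one 1-to-0, one 0-to-1) cancel out.

```python
def internet_checksum(data: bytes) -> int:
    """16-bit ones' complement checksum as used by IP, TCP and UDP."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
        total = (total & 0xFFFF) + (total >> 16)  # fold the carry back in
    return ~total & 0xFFFF


def flip_bit(data: bytes, byte_index: int, bit: int) -> bytes:
    """Return a copy of data with one bit inverted."""
    out = bytearray(data)
    out[byte_index] ^= 1 << bit
    return bytes(out)


original = b"\x01\x00\x00\x00"
# Two compensating single-bit errors in different 16-bit words.
corrupted = flip_bit(flip_bit(original, 0, 0), 2, 0)
undetected = (corrupted != original
              and internet_checksum(corrupted) == internet_checksum(original))
```

    A single flipped bit is always caught by this checksum; it takes a compensating pair (or reordered words) to get through, which is why a stronger end-to-end hash is the usual answer for bulk copies.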

  49. Eh... read the FIRST story, it is the more common by SmallFurryCreature · · Score: 1

    Eh... read the FIRST story, it is the more common one. As a fellow coder, I agree the FIRST story should be wiped from the face of the earth, lest the fact that we nerds are not perfect in always writing code that compiles on the first run every time ruins our image with the females of the species.

    It is of course one of the annoying things in having to deal with bug reports: so many originate between the chair and the keyboard, and since most of those bug reporters pay your salary you can't kill them. Because then they won't pay you. And that is bad. Sure, you can ask yourself what Jesus would do, but Jesus didn't have a mortgage to pay, did he?

    --

    MMO Quests are like orgasms:

    You may solo them, I prefer them in a group.

  50. Bad magic number. by TapeCutter · · Score: 1

    That's quite an old and sensible trick with C, I was taught it at uni in the late 80's on a System V terminal. These days C/C++ compilers will flag an assignment in a logical expression with a warning.
    The strangest error I got from a compiler was with MSVC 1.52, it was a real head scratcher so I had to resort to the debugger, the culprit was an innocent looking i++ statement, when disassembled it was clear the compiler had inserted two INC statements. Thus the counter was incrementing by 2 each time it executed the statement. We rebuilt the same code on the same machine and the problem disappeared. When weird shit like that happens everyone just shrugs and keeps going, but it sticks in your head for a long time and often comes to mind at inappropriate times such as during take off in a modern plane.
    My favorite linker error also comes from System V: "Bad magic number."

    --
    And did you exchange a walk on part in the war for a lead role in a cage? - Pink Floyd.
  51. I've seen this before... by Josh+Coalson · · Score: 4, Interesting
    I used to get bug reports for FLAC caused by this very same problem.

    FLAC has a verify mode when encoding which, in parallel, decodes the encoded output and compares it against the original input to make sure they're identical. Every once in a while I'd get a report that there were verification failures, implying FLAC had a bug.

    If it were actually a FLAC bug, the error would be repeatable* (same error in the same place) because the algorithm is deterministic, but upon rerunning the exact same command the users would get no error, or (rarely) an error in a different place. Then they'd run some other hardware checker and find the real problem.

    Turns out FLAC encoding is also a nice little hardware stressor.

    (* Pedants: yes, there could be some pseudo-random memory corruption, etc but that never turned out to be the case. PS I love valgrind.)
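
    The repeatability argument can be sketched in a few lines. This is an illustration only: it uses zlib as a stand-in lossless codec, not FLAC's actual encoder or its verify implementation, and the function names and result strings are invented for the example.

```python
import zlib
from typing import Callable, Optional


def first_mismatch(a: bytes, b: bytes) -> Optional[int]:
    """Index of the first differing byte, or None if the inputs are identical."""
    if a == b:
        return None
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    return min(len(a), len(b))  # one input is a prefix of the other


def encode_with_verify(data: bytes,
                       encode: Callable[[bytes], bytes] = zlib.compress,
                       decode: Callable[[bytes], bytes] = zlib.decompress):
    """Encode, then decode the output and compare it with the original input."""
    encoded = encode(data)
    return encoded, first_mismatch(data, decode(encoded))


def diagnose(data: bytes, runs: int = 3,
             encode: Callable[[bytes], bytes] = zlib.compress,
             decode: Callable[[bytes], bytes] = zlib.decompress) -> str:
    """Rerun the same job: a deterministic bug fails identically every time."""
    results = [encode_with_verify(data, encode, decode)[1] for _ in range(runs)]
    if all(r is None for r in results):
        return "ok"
    if len(set(results)) == 1:
        return "repeatable: suspect software"
    return "not repeatable: suspect hardware"
```

    The same error at the same offset on every run points at the codec; errors that vanish or move around on a rerun point at the machine.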

    1. Re:I've seen this before... by Anonymous Coward · · Score: 0

      Linpack is good in similar ways. It's even better if you're testing ECC-enabled systems, where single-bit error correction can hide the fault from a test that only checks for correctness. If your runs don't produce exactly the same FLOP rate when you change out RAM, you know you've got (ECC-correctable) RAM issues, even if the OS is choosing not to report those to you. *cough* OS X *cough*

    2. Re:I've seen this before... by organgtool · · Score: 2

      FLAC has a verify mode when encoding which, in parallel, decodes the encoded output and compares it against the original input to make sure they're identical. Every once in a while I'd get a report that there were verification failures, implying FLAC had a bug.

      While hardware could definitely be at fault, have you considered the possibility of a race condition causing the error? Race conditions may occur very infrequently and can be incredibly difficult to discover. I'm not trying to disparage the FLAC codebase (FLAC is one of my favorite open source projects), but this could be another possibility for the behavior users are experiencing.

  52. Re:I don't believe 1% of computers give wrong answ by Anonymous Coward · · Score: 0

    Yeah, but people don't buy a separate PC for Guild Wars and for other games. Gamers as a whole are disproportionately likely to overclock even if this particular game doesn't need it. It may be that overclockers have a 10% fail rate but only 10% of people are overclockers, or some other mix.
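
    The arithmetic behind that mix, as a tiny sketch. The 10%/10% split is the guess from the paragraph above, and the assumption that non-overclocked machines never fail is added purely to keep the example simple.

```python
from fractions import Fraction


def overall_failure_rate(groups):
    """groups: iterable of (share_of_players, failure_rate_within_group)."""
    return sum(share * rate for share, rate in groups)


mix = [
    (Fraction(1, 10), Fraction(1, 10)),  # overclockers: 10% of players, 10% failing
    (Fraction(9, 10), Fraction(0)),      # everyone else: assumed never failing
]
rate = overall_failure_rate(mix)  # Fraction(1, 100), i.e. 1% of all machines
```

    Other mixes land on the same 1%, for example 5% of players with a 20% failure rate.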

  53. Re:another trick: stop mixing up testing + assignm by TapeCutter · · Score: 2

    Excuse me but...WHOOSH...
    The GP's tip is about avoiding accidentally inserting an assignment operator "=" when you intended to put an equality operator "==". If you follow the GP's habit there can be no such accidents, because it's illegal to assign a value to a constant. The coding tip itself has been circulating for 23 years that I personally know about; in those days compilers did not warn you about assignments within conditionals, the compiler saw valid syntax and assumed you knew what you were doing. This is why lint was so popular at the time. Even with warnings enabled it is still good advice, since following the habit will always produce a hard error when (by Murphy's law) you inevitably make that particular typo.

    --
    And did you exchange a walk on part in the war for a lead role in a cage? - Pink Floyd.
  54. Oh I'm sorry. by symbolset · · Score: 1

    That was me. My bad.

    --
    Help stamp out iliturcy.
  55. Re:I don't believe 1% of computers give wrong answ by Anonymous Coward · · Score: 0

    a) the server should be making the authoritative decision(s), and
    b) should be sending a quantized result to the clients.

    This is already the case. The user side of Guild Wars only renders and 'continues', actual changes in the world always occur on the server side (you can see this at a network failure). And it has to be, if you want to avoid cheaters.

  56. Re:I don't believe 1% of computers give wrong answ by sydneyfong · · Score: 1

    I thought the article was talking about 1% of computers very very occasionally giving wrong answers?

    It's not like all their operations are failing. The test was run dozens of times per second, and if it fails once every few days, it's still a really low "failure rate" for a failing computer -- the only problem is that we expect the failure rate to be Zero.
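
    Putting rough numbers on that, as a sketch: the 40 checks per second and the three days between failures are assumed values picked to match "dozens of times per second" and "once every few days", not figures from the article.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def checks_between_failures(checks_per_second: int, days_between_failures: int) -> int:
    """How many passing-or-failing checks run in the time it takes to see one failure."""
    return checks_per_second * SECONDS_PER_DAY * days_between_failures


runs = checks_between_failures(40, 3)  # 10,368,000 checks per observed failure
per_check_failure_probability = 1 / runs
```

    Under those assumptions a "failing" machine still gets the right answer more than ten million times in a row.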

    --
    Don't quote me on this.
  57. Re:I don't believe 1% of computers give wrong answ by Anonymous Coward · · Score: 0

    Well it should be modded ignorant/overrated then.

    The return rates for most PC components are about 1-3%. Yes some of the returns are user error, but the rest are faulty stuff. And if you look at the laptop malfunction rates they're from 15-25%!

    So it should not be surprising that 1% of the PCs that "appear to work" don't actually work correctly 100% of the time.

    I think a fair number of corporate IT people handling faulty PCs every day would actually be surprised the figure is so low ;).

  58. Re:another trick: stop mixing up testing + assignm by tofarr · · Score: 1

    Incorrect - Java will allow assignment code inside a conditional statement so long as the whole expression is a boolean - most decent IDEs will flag it as suspicious but it will compile. e.g: if((i = 3) == 2){ }

  59. Re:I don't believe 1% of computers give wrong answ by damnbunni · · Score: 1

    Well, the minimum CPU requirement for Guild Wars is an 800 MHz Pentium III. That was current when Guild Wars came out.

    So yeah, overclocking something like that would probably have been reasonably common at the time.

  60. Re:I don't believe 1% of computers give wrong answ by Kergan · · Score: 1

    guild wars runs okay on a crappy netbook, let alone anyones PC, If you need to OC to play Guild Wars then you are using a Pentium III.

    But chances are the same PC is also used to play games that require a lot more resources, and that the user isn't fiddling with his hardware all the time.

  61. Explains the top end $1,000 per CPU cost by Anonymous Coward · · Score: 0

    Perhaps only 1 per 1,000 actually CAN make the speed rated per silicon wafer etched. Hence the extreme cost of the top end CPU's from Intel (look at the price per 1,000 lots of the 3960 extreme Core CPU).

  62. Re:I don't believe 1% of computers give wrong answ by Anonymous Coward · · Score: 0

    Hm. I can't remember the time I was allowed to compile stuff that I bought for Windows myself. As it is a shipped binary blob, the creator of the test can guarantee that your compiler-based assumptions won't happen.

  63. Re:I don't believe 1% of computers give wrong answ by Anonymous Coward · · Score: 0

    Every piece might be fine by itself. That doesn't mean they work together. A lot of cheap stuff is barely within the allowed tolerances. Usually the CPU (unless overclocked) and the GPU (unless overclocked) are the only ones clinging tightly to the spec.

  64. Why don't Operating Systems test HW transparently? by Anonymous Coward · · Score: 0

    Almost every time an OS crashes due to hardware, people are going to blame it on the OS or some piece of software. Hardware is rarely the first go-to, unless the user is blatantly told.

    How difficult would it be to offer an option to test all the hardware when installing the OS? And to randomly check hardware while the computer is idle? Doesn't even need to be in depth tests, you could check just a fraction at a time and cover all the hardware over the period of a few days or weeks.

    Almost every bit of hardware can be tested from inside the OS. Eurosoft makes a program called QAWin32+ which can test all the memory, the entire hard drive, the processor, and a whole bunch of other hardware while the computer is being used.

  65. "Let's see how that works out in practice" by Anonymous Coward · · Score: 0

    Yeah, right. Let's see how that works out in practice.

    The only time I was on a project where someone discovered an architecture-specific bug in GCC, he reported it with an included test case, and the code was patched in branch in the next six days. Of course the next point release was 8 months later, so we just disabled the problematic optimization setting, but we could have used a nightly build of GCC.

  66. Cruise control by Okian+Warrior · · Score: 2

    The actual error was more nuanced than I had room to describe in the comment.

    Car systems pass messages to each other using an internal bus. In this particular incident, one of the systems failed in a way that made it continuously spew out messages, using all the bandwidth so that no other system could get a message through.

    The brakes will normally abort cruise control, but there's no direct wire to the engine computer - it's all messages passed around on the bus. Similar with the On/Off button (newer cars) and shift-by-wire.

    It was determined experimentally that the operator is not strong enough to overpower the engine using the brakes. The engine would just downshift for more power.

    It all boils down to the safety "mindset": you code the systems to fail safe instead of fail dangerous.

    I maintain that the engine CPU should automatically abort cruise control if it doesn't see a periodic heartbeat from the steering-wheel computer.
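
    A minimal sketch of that heartbeat idea, assuming a hypothetical `CruiseControl` class driven by an external clock (times in seconds). It is not modeled on any real engine controller; it only shows the fail-safe default: no fresh heartbeat, no cruise control.

```python
from typing import Optional


class CruiseControl:
    """Disengages itself when heartbeats stop arriving (fail safe)."""

    def __init__(self, timeout_s: float) -> None:
        self.timeout_s = timeout_s
        self.last_heartbeat: Optional[float] = None
        self.engaged = False

    def _heartbeat_fresh(self, now: float) -> bool:
        return (self.last_heartbeat is not None
                and now - self.last_heartbeat <= self.timeout_s)

    def on_heartbeat(self, now: float) -> None:
        """Called whenever a heartbeat message makes it across the bus."""
        self.last_heartbeat = now

    def engage(self, now: float) -> bool:
        """Refuse to engage unless the other side was heard from recently."""
        self.engaged = self._heartbeat_fresh(now)
        return self.engaged

    def tick(self, now: float) -> bool:
        """Periodic check; drops cruise control once the heartbeat goes stale."""
        if self.engaged and not self._heartbeat_fresh(now):
            self.engaged = False
        return self.engaged
```

    With this shape, a bus flooded by a faulty node starves the heartbeat and the engine side drops out of cruise on its own, without needing any message to get through.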

    1. Re:Cruise control by Anonymous Coward · · Score: 1

      It was determined experimentally that the operator is not strong enough to overpower the engine using the brakes.

      Bull. I have never in my life driven or heard of a vehicle where this was true. Brakes can apply many times more force to the wheels than the engine. In every uncontrolled acceleration where the "victim" claims to have their foot on the brake, the victim has their foot on the gas.

    2. Re:Cruise control by Lehk228 · · Score: 1

      Car systems pass messages to each other using an internal bus.

      now i have to go find a pre-ECU car that runs well. who the hell thought that was a good idea.

      --
      Snowden and Manning are heroes.
    3. Re:Cruise control by Okian+Warrior · · Score: 1

      Bull. [...] Brakes can apply many times more force to the wheels than the engine.

      Incorrect. Source.

      I read a more in-depth report of the incident. Mr. Weir phoned the police, who suggested all of the things one would expect: turning off the engine, shifting to neutral, &c.

      Mr. Weir was unable to stop the car using the foot brake alone, and only "just barely" managed to stop the car using both the handbrake *and* the foot brake *while* hopped up on adrenaline.

      Anyone in worse physical shape than Mr. Weir (a 22-year-old male) would have considerable difficulty stopping a vehicle under those circumstances.

      Oh, and read the report: he was speeding for 30 minutes and had time to call the police. He didn't inadvertently have his foot on the gas.

      This might indeed be bull, but I've got good evidence in support of my statements.

  67. Re:I don't believe 1% of computers give wrong answ by UnknownSoldier · · Score: 3, Interesting

    It used to; I don't know about current-gen RTSs. But back around 2000 an RTS would typically run in a lock-step model. We used fixed-point to guarantee each machine was doing the _exact_ same 3D math due to the imprecision of the FPU. ANY discrepancy and your game state was boned. I believe at the time this decision was due to network implementation -- I don't know the exact reason though since I was doing rendering / optimizations.

    You also have to keep in mind the context. Back in 2000 AMD's FPU was beating the pants off Intel's (depending on the operation as much as 1000% !!) With Intel having such a slow FPU you didn't rely on it unless you had to. Also, using C's 64-bit 'double' was prohibited for two reasons:

    a) the PS2 emulated it IN SOFTWARE !
    b) it was horrendously SLOW compared to 32-bit floats.

    Game programmers stayed as far away as possible from floats (and especially doubles!) as long as (reasonably) possible. For FPS you were forced to go the float route because while Intel hid the latency of the INT-to-FLOAT casts it was just easier to stay entirely in the float domain. That also opened the door for some clever optimizations like Carmack did with over-lapping the FPU and INT units but that was the rare case.

    On PS3 you take a HUGE Load-Hit-Store penalty if you try doing the naive INT32-to-FLOAT32 cast, so fixed point has fallen out of favor for performance reasons.
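
    For anyone who hasn't seen the fixed-point trick, here is a small sketch in 16.16 format. The names are made up and the example uses non-negative values only; a real engine of that era would do this in 32-bit C integers rather than Python's arbitrary-precision ints. The point is that every step is plain integer arithmetic, so every machine in a lock-step game computes bit-identical state.

```python
FRAC_BITS = 16
ONE = 1 << FRAC_BITS  # 1.0 in 16.16 fixed point


def to_fixed(x: float) -> int:
    """Convert once, up front; the simulation itself never touches floats."""
    return int(round(x * ONE))


def to_float(a: int) -> float:
    """For display only."""
    return a / ONE


def fx_mul(a: int, b: int) -> int:
    """Multiply two 16.16 values using integer operations only."""
    return (a * b) >> FRAC_BITS


def simulate(steps: int, velocity: int, dt: int) -> int:
    """Advance a position by velocity * dt each tick, entirely in integers."""
    position = 0
    for _ in range(steps):
        position += fx_mul(velocity, dt)
    return position


# Two peers fed the same inputs must end with the same integer state.
peer_a = simulate(32, to_fixed(2.5), to_fixed(1 / 32))
peer_b = simulate(32, to_fixed(2.5), to_fixed(1 / 32))
in_sync = peer_a == peer_b
```

    In a lock-step game each peer would periodically exchange a hash of its state and treat any mismatch as a desync.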

  68. Re:I don't believe 1% of computers give wrong answ by Anonymous Coward · · Score: 0

    Err, I thought that floating point was always deterministic, even if it's not full accuracy. Do you have an example otherwise?

  69. Re:I don't believe 1% of computers give wrong answ by Anguirel · · Score: 1

    Massively Huge Systems are likely to be put together properly, run at correct speeds, appropriately cooled, and kept clean. How often did he run tests on home-built systems that have 4 years of dust, inconsistent voltage to the PSU, over-drawn PSUs, insufficient cooling, and chips not completely seated properly? Imagine a room of 100 PC MMORPG gamers. Do you think at least 1 of them has a machine that matches the above description, and that it might produce some unusual errors, particularly when the system is under high load?

    --
    ~Anguirel (lit. Living Star-Iron)
    QA: The art of telling someone that their baby is ugly without getting punched.
  70. Commercial code checkers? by NorthWay · · Score: 1

    From time to time /. puts up stories about commercial code checkers that have all kinds of cool AI built-in.

    Wouldn't a test of "if this then exit, if not this then exit" be flagged as a possible logic bug when there is more code after that last exit but before function end?

  71. Re:I don't believe 1% of computers give wrong answ by Anonymous Coward · · Score: 0

    In regards to current gen RTS's: Starcraft 2 works like that. Less bandwidth is the reason, I believe.

  72. Re:I don't believe 1% of computers give wrong answ by UnknownSoldier · · Score: 1

    Ah, I could see that being one reason. Thanks!