Slashdot Mirror


Google Finds DRAM Errors More Common Than Believed

An anonymous reader writes "A Google study of DRAM errors in their data centers found that they are hundreds to thousands of times more common than has been previously believed. Hard errors may be the most common failure type. The DIMMs themselves appear to be of good quality, and bad mobo design may be the biggest problem." Here is the study (PDF), which Google engineers published with a researcher from the University of Toronto.

3 of 333 comments (clear)

  1. Dell by ^_^x · · Score: 5, Interesting

    In my experience at work ordering Dell desktops and laptops, by far the most common defect is 1-3% of machines with bad RAM. Typically it's made by Hynix, occasionally Hyundai, and I've never seen other brands fail. On many occasions though, I've predicted Hynix, pulled it, and sure enough theirs was the piece causing the errors in Memtest86+...

    1. Re:Dell by Jah-Wren+Ryel · · Score: 5, Interesting

      Hyundai is Hynix and they are the second largest DRAM manufacturer by marketshare (roughly 20% second to Samsung's 30%).

      Its no surprise that you've only seen Hynix brand fail in Dells, chances are they are in 90%+ of Dell (and HP and Apple) boxes because they primarily buy from Hynix in the first place. Its selection bias.

      --
      When information is power, privacy is freedom.
  2. Misleading, to say the very least. by jhfry · · Score: 5, Interesting

    Read the article and remember they are talking averages here.

    They give it away with this line:

    Only 8% of DIMMs had errors per year on average. Fewer DIMMs = fewer error problems - good news for users of smaller systems

    Essentially, only 8% of their ECC DIMM's reported ANY errors in a given year.

    Also this was pretty telling:

    Besides error rates much higher than expected - which is plenty bad - the study found that error rates were motherboard, not DIMM type or vendor, dependent.

    And this:

    For all platforms they found that 20% of the machines with errors make up more than 90% of all observed errors on that platform.

    So essentially, they are saying that only 8% of DIMMSs reported errors, 90% of which were on 20% of the machines that had errors, mostly because of motherboard issues... yet DIMMs are less reliable than previously thought.

    I would imagine that if you removed all of the bad motherboards, power supplies, environmental, and other issues... that DIMMs are actually more reliable than I previously thought, not less! I wonder what percentage of CPU operations yield incorrect results. With Billions of instructions per second, even an astronomically low average of undetected cpu errors would guarantee an error at least as often as failed DIMMs.

    What I did take from the article was that without ECC ram, you have no way of knowing that your RAM has errors. I guess I should rethink my belief that ECC was a waste of money.

    --
    Sometimes the best solution is to stop wasting time looking for an easy solution.