Google Finds DRAM Errors More Common Than Believed
An anonymous reader writes "A Google study of DRAM errors in their data centers found that they are hundreds to thousands of times more common than has been previously believed. Hard errors may be the most common failure type. The DIMMs themselves appear to be of good quality, and bad mobo design may be the biggest problem." Here is the study (PDF), which Google engineers published with a researcher from the University of Toronto.
"a mean of 3,751 correctable errors per DIMM per year."
I'm much too lazy to do the math. Let's round up - 4k errors per year has to be a vanishingly small percentage for a system that is up 24/7/365, or 5 nines. The fact that these DIMMs were "stressed" makes me wonder about the validity of the test. Heat stress, among other things, will multiply errors far beyond what you will see in normal service.
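For anyone else too lazy to do the math, here is a rough sketch of it. The only input taken from the study is the quoted mean of 3,751 correctable errors per DIMM per year; the function names are made up for illustration, and the result is an average, so it says nothing about how the errors are spread across DIMMs.

```python
# Figure quoted from the study summary; the rest is plain arithmetic.
ERRORS_PER_DIMM_PER_YEAR = 3751
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def errors_per_hour(errors_per_year: float) -> float:
    """Average number of correctable errors per DIMM per hour."""
    return errors_per_year / HOURS_PER_YEAR


def hours_between_errors(errors_per_year: float) -> float:
    """Average gap, in hours, between two correctable errors on one DIMM."""
    return HOURS_PER_YEAR / errors_per_year


rate = errors_per_hour(ERRORS_PER_DIMM_PER_YEAR)
gap = hours_between_errors(ERRORS_PER_DIMM_PER_YEAR)
print(f"{rate:.3f} correctable errors per DIMM per hour")
print(f"one correctable error every {gap:.2f} hours on average")
```

So the average works out to a bit less than one correctable error every two and a half hours per DIMM, which is small next to the number of memory accesses in that time but not obviously negligible.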
"Windows is like the faint smell of piss in a subway: it's there, and there's nothing you can do about it." - Charlie Br
I've always thought it would be a nice-to-have feature for my home system to have ECC - perhaps the system will degrade over time, and it might misbehave less if it could detect and fix some errors. But my normal sources don't seem to stock many choices. E.g. Newegg appears to have 2 motherboards to choose from, both for AMD CPUs, nothing for Intel. Fry's appears to have one, same, AMD only. Is this just the way things are, or do I need to be looking somewhere else? Would picking one of these motherboards end up not working out well for my gaming rig?
is competition good, or is duplication of effort bad?
In my experience at work ordering Dell desktops and laptops, by far the most common defect is 1-3% of machines with bad RAM. Typically it's made by Hynix, occasionally Hyundai, and I've never seen other brands fail. On many occasions though, I've predicted Hynix, pulled it, and sure enough theirs was the piece causing the errors in Memtest86+...
Adding checksumming adds another place for errors to occur though -- if data is written correctly but the checksum is miscalculated, either before it is stored or when the data is being verified -- you'll end up throwing out perfectly good data. If you also have redundancy you're probably willing to live with that, but if you're running on a single disk, ZFS is just adding more opportunities for data corruption in RAM.
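A minimal sketch of that failure mode, assuming a single simulated bit flip in the stored checksum. CRC32 from Python's zlib is only a stand-in here, not what ZFS itself uses, and the helper names are hypothetical.

```python
import zlib
from typing import Tuple


def store(data: bytes) -> Tuple[bytes, int]:
    """Return the data together with its checksum, as a filesystem might."""
    return data, zlib.crc32(data)


def verify(data: bytes, checksum: int) -> bool:
    """Recompute the checksum and compare it with the stored one."""
    return zlib.crc32(data) == checksum


def flip_bit(value: int, bit: int) -> int:
    """Simulate a single-bit RAM error in an integer."""
    return value ^ (1 << bit)


data, checksum = store(b"perfectly good data")

# The data is untouched; only the stored checksum takes a bit flip.
bad_checksum = flip_bit(checksum, 3)

ok_before = verify(data, checksum)
ok_after = verify(data, bad_checksum)
print("intact checksum verifies:", ok_before)
print("flipped checksum verifies:", ok_after)
```

The second check fails even though the data never changed, and without a redundant copy to compare against there is no way to tell which side is wrong.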
Read the article and remember they are talking averages here.
They give it away with this line:
Only 8% of DIMMs had errors per year on average. Fewer DIMMs = fewer error problems - good news for users of smaller systems
Essentially, only 8% of their ECC DIMMs reported ANY errors in a given year.
Also this was pretty telling:
Besides error rates much higher than expected - which is plenty bad - the study found that error rates were motherboard, not DIMM type or vendor, dependent.
And this:
For all platforms they found that 20% of the machines with errors make up more than 90% of all observed errors on that platform.
So essentially, they are saying that only 8% of DIMMs reported errors, 90% of which were on 20% of the machines that had errors, mostly because of motherboard issues... yet DIMMs are less reliable than previously thought.
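To put rough numbers on how lopsided that is, here is a back-of-the-envelope sketch. It assumes the 3,751 mean is taken over all DIMMs (including the error-free ones) and that the 8% figure covers the same population and period; neither assumption is confirmed by the quotes, and the function names are made up.

```python
# Figures quoted in this thread; how they combine is an assumption.
MEAN_ERRORS_PER_DIMM = 3751        # assumed to be averaged over ALL DIMMs
FRACTION_DIMMS_WITH_ERRORS = 0.08
TOP_MACHINE_SHARE = 0.20           # share of the machines that had errors
TOP_ERROR_SHARE = 0.90             # share of errors on those machines


def mean_among_affected(mean_over_all: float, fraction_affected: float) -> float:
    """Mean error count if only the affected fraction contributes errors."""
    return mean_over_all / fraction_affected


def concentration_ratio(top_machine_share: float, top_error_share: float) -> float:
    """Per-machine error rate of the top group relative to the rest."""
    top_rate = top_error_share / top_machine_share
    rest_rate = (1 - top_error_share) / (1 - top_machine_share)
    return top_rate / rest_rate


affected_mean = mean_among_affected(MEAN_ERRORS_PER_DIMM, FRACTION_DIMMS_WITH_ERRORS)
ratio = concentration_ratio(TOP_MACHINE_SHARE, TOP_ERROR_SHARE)
print(f"mean errors per affected DIMM per year: {affected_mean:.0f}")
print(f"worst 20% of machines vs. the rest: about {ratio:.0f}x the errors per machine")
```

Under those assumptions the DIMMs that do report errors average tens of thousands per year, and the worst machines see dozens of times more errors than the other machines with errors, which fits the point that a few bad machines dominate the averages.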
I would imagine that if you removed all of the bad motherboards, power supplies, environmental, and other issues... that DIMMs are actually more reliable than I previously thought, not less! I wonder what percentage of CPU operations yield incorrect results. With billions of instructions per second, even an astronomically low average rate of undetected CPU errors would guarantee an error at least as often as failed DIMMs.
What I did take from the article was that without ECC ram, you have no way of knowing that your RAM has errors. I guess I should rethink my belief that ECC was a waste of money.
Sometimes the best solution is to stop wasting time looking for an easy solution.
RAM is dirt cheap and most server systems support significantly more RAM than most people bother to install. For critical systems, ECC works but that doesn't prevent everything (double bit errors etc.). Is it time for a Redundant Array of Inexpensive DIMMs? Many HA servers now support Memory Mirroring (aka RAID-1 http://www.rackaid.com/resources/rackaid-blog/server-dysfunction/memory_mirroring_to_the_rescue/) but should there be more research into different RAID levels for memory (RAID5-6, 10, etc?)
Thanks,
--
Matt
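For what a RAID-5-style scheme would mean in practice, here is a toy sketch of XOR parity across memory modules. It is purely illustrative: the "DIMMs" are short byte strings, the helper names are hypothetical, and a real implementation would live in the memory controller, not in software.

```python
from functools import reduce
from typing import List, Sequence


def xor_blocks(blocks: Sequence[bytes]) -> bytes:
    """XOR equal-length byte blocks together, column by column."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild_missing(surviving_blocks: Sequence[bytes], parity: bytes) -> bytes:
    """Recover one lost block from the surviving blocks plus the parity block."""
    return xor_blocks(list(surviving_blocks) + [parity])


# Three toy data "DIMMs" and one parity "DIMM".
dimms: List[bytes] = [b"\x10\x20\x30\x40", b"\xaa\xbb\xcc\xdd", b"\x01\x02\x03\x04"]
parity = xor_blocks(dimms)

# Pretend module 1 has failed and rebuild its contents from the others.
lost = 1
survivors = [d for i, d in enumerate(dimms) if i != lost]
rebuilt = rebuild_missing(survivors, parity)
print("rebuilt matches original:", rebuilt == dimms[lost])
```

Note that parity like this only recovers a module that is already known to be bad; it does not detect a silent bit flip by itself, so something like ECC would still be needed to say which module to distrust.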
Seriously. If you download a lot, and I do, you see quite a few checksum mismatches in the log.
Especially if the torrent is old. Some of them may be sabotage activity, but I doubt that, considering the kind of files.
They are not transmission errors: TCP/IP checks for that. Not hard drive errors - again checksums. They can be intrasystem transmission errors though.
I remember folks who did complete checkers wrote that they had a lot of them too.
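A minimal sketch of the kind of per-piece check that produces those mismatches. BitTorrent hashes each piece with SHA-1; the tiny piece size, the sample data, and the helper names here are made up for illustration.

```python
import hashlib
from typing import List

PIECE_SIZE = 4  # hypothetical; real torrents use far larger pieces


def piece_hashes(data: bytes, piece_size: int = PIECE_SIZE) -> List[bytes]:
    """SHA-1 digest of each fixed-size piece of the data."""
    return [
        hashlib.sha1(data[i:i + piece_size]).digest()
        for i in range(0, len(data), piece_size)
    ]


def bad_pieces(data: bytes, expected: List[bytes], piece_size: int = PIECE_SIZE) -> List[int]:
    """Indices of pieces whose hash does not match the expected one."""
    actual = piece_hashes(data, piece_size)
    return [i for i, (a, e) in enumerate(zip(actual, expected)) if a != e]


original = b"abcdefghijkl"
expected = piece_hashes(original)

# Simulate a single bit flip somewhere between the network and the disk.
corrupted = bytearray(original)
corrupted[5] ^= 0x01

mismatches = bad_pieces(bytes(corrupted), expected)
print("pieces failing the hash check:", mismatches)
```

The check flags the damaged piece no matter where the flip happened (RAM, bus, or disk), which is why the log alone cannot say which component is at fault.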
This machine compiled a lot of source (it was a Gentoo box), so surely if errors like these had been happening frequently we'd have known from heaps of signal-elevens killing the compiles all the time, right?
~24 hours of Memtest86 revealed nothing. Googling at the time found someone with the exact same mobo+CPU having problems gunzipping the exact same file (with the correct MD5), and I wondered if there was some specific bit-pattern in the file (or gunzip's state) that b0rked on my mobo. In retrospect I should have tried Solaris x86 on the same machine to try gunzipping the file.
At Purdue, many years ago, one of the engineers mapped the ECC RAM errors in a room with hundreds of SPARCstations and found that they were mostly in a cone shape pointed toward the window. That window looked out to a pile of coal, so the culprit was assumed to be low-level alpha radiation.
All ideas^H^H^H^H^Hprocesses in this post are Patent Pending. (as well as the process of patenting all postings)
When I was building the computer I'm typing this on, I had the grand idea of building it with so much RAM that I could basically work from RAM. Meaning, for example, that all my running programs and the project I was working on would have to fit in RAM.
Of course, with such a dream, I was concerned about the reliability of my memory. So I wanted ECC. I found out that having ECC memory is not just a matter of buying ECC memory. There are different kinds of ECC memory, and you need to find a combination of memory, motherboard, and CPU that works together. Many sites that offer CPUs and/or motherboards don't list support for ECC among the specifications. Searching for it is difficult, because searching for "ECC" also returns hits for things like "non-ECC" and "ECC: no".
Finally, I found a combination of motherboard and CPU that would support unbuffered ECC DDR2, and a matching pair of memory modules to go with it. And then, when I got all the parts, the RAM didn't fit in the motherboard. Turns out the RAM was FB-DIMM, which had not been listed in the advertisement. I gave up and just bought 2GB of non-ECC RAM to just get the system working. The FB-DIMM (all 8GB of it) is still sitting here, because I haven't found anyone who wants to buy it from me.
Lessons learned: 1. The saying "the nice thing about standards is that there are so many to choose from" is still relevant. I don't know why there have to be so many hardware interfaces to memory chips, but there are, so be careful. 2. Apparently, nobody really cares about ECC RAM, otherwise information would be easier to find. 3. Apparently, AMD CPUs and matching motherboards support ECC RAM more often than Intel parts and matching motherboards.
Please correct me if I got my facts wrong.