Google Finds DRAM Errors More Common Than Believed
An anonymous reader writes "A Google study of DRAM errors in their data centers found that they are hundreds to thousands of times more common than has been previously believed. Hard errors may be the most common failure type. The DIMMs themselves appear to be of good quality, and bad mobo design may be the biggest problem." Here is the study (PDF), which Google engineers published with a researcher from the University of Toronto.
"a mean of 3,751 correctable errors per DIMM per year."
I'm much too lazy to do the math. Let's round up - 4k errors per year has to be a vanishingly small percentage for a system that is up 24/7/365, or 5 nines. The fact that these DIMMs were "stressed" makes me wonder about the validity of the test. Heat stress, among other things, will multiply errors far beyond what you will see in normal service.
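For anyone else too lazy to do the math, here is a rough sketch of it. The 3,751 figure is the one quoted above; the 365-day year is my assumption, and since this is a mean over all DIMMs it says nothing about how unevenly the errors are spread.

```python
# Figure quoted from the study summary above.
ERRORS_PER_DIMM_PER_YEAR = 3751
# Assumption: a 365-day year of continuous uptime.
HOURS_PER_YEAR = 365 * 24

errors_per_hour = ERRORS_PER_DIMM_PER_YEAR / HOURS_PER_YEAR
hours_between_errors = HOURS_PER_YEAR / ERRORS_PER_DIMM_PER_YEAR

print(f"{errors_per_hour:.3f} correctable errors per DIMM per hour")
print(f"about one correctable error every {hours_between_errors:.2f} hours")
```

So on average that is one corrected error every couple of hours per DIMM, which is small next to the number of memory accesses in that time but not exactly rare.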
"Windows is like the faint smell of piss in a subway: it's there, and there's nothing you can do about it." - Charlie Br
I use Gentoo; how does this affect me?
This really makes me want to use ZFS.
Give me Classic Slashdot or give me death!
I nmver havm any DRIM error{ on my comp}ter6
What I have seen (and generated) is the occasional (2-3/day) bus error with specific (nasty) data patterns, usually at a few addresses. I write that off to mobo trace design and crosstalk between the signals. Failing to round the corners sufficiently, or leaving spurs, is the likely problem. I think HyperTransport is a balanced design (push-pull differential, like Ethernet) and should be less susceptible.
Maybe this is explainable by today's story that the universe has 100x more entropy than we thought
Bad memory bits? How is thap posqible?
And here I thought people were swearing at me (%^&%$&) in email when it really was just bad bits. Whew... What a relief!
I've always thought ECC would be a nice-to-have feature for my home system - perhaps the machine would misbehave less as it degrades over time if it could detect and fix some errors. But my normal sources don't seem to stock many choices. E.g. Newegg appears to have 2 motherboards to choose from, both for AMD CPUs, nothing for Intel. Frys appears to have one, same, AMD only. Is this just the way things are, or do I need to be looking somewhere else? Would picking one of these motherboards end up not working out well for my gaming rig?
is competition good, or is duplication of effort bad?
In my experience at work ordering Dell desktops and laptops, by far the most common defect is 1-3% of machines with bad RAM. Typically it's made by Hynix, occasionally Hyundai, and I've never seen other brands fail. On many occasions though, I've predicted Hynix, pulled it, and sure enough theirs was the piece causing the errors in Memtest86+...
Isn't Google running many servers at unusually high ambient temperatures and in very uniform custom configurations? The results may not apply to anybody else.
was only a problem for government computers.
Nullius in verba
How sick is it that I contemplated moderating this as funny? The old troll-esque memes were so innocent, I really miss them.
... and Natalie Portman.
Read the article and remember they are talking averages here.
They give it away with this line:
Only 8% of DIMMs had errors per year on average. Fewer DIMMs = fewer error problems - good news for users of smaller systems
Essentially, only 8% of their ECC DIMMs reported ANY errors in a given year.
Also this was pretty telling:
Besides error rates much higher than expected - which is plenty bad - the study found that error rates were motherboard, not DIMM type or vendor, dependent.
And this:
For all platforms they found that 20% of the machines with errors make up more than 90% of all observed errors on that platform.
So essentially, they are saying that only 8% of DIMMs reported errors, 90% of which were on 20% of the machines that had errors, mostly because of motherboard issues... yet DIMMs are less reliable than previously thought.
I would imagine that if you removed all of the bad motherboards, power supplies, environmental problems, and other issues, DIMMs would actually turn out to be more reliable than I previously thought, not less! I wonder what percentage of CPU operations yield incorrect results. With billions of instructions per second, even an astronomically low rate of undetected CPU errors would guarantee an error at least as often as failed DIMMs.
What I did take from the article was that without ECC ram, you have no way of knowing that your RAM has errors. I guess I should rethink my belief that ECC was a waste of money.
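To make the "no way of knowing" point concrete, here is a toy sketch of how extra check bits let you detect and fix a flipped bit. It uses a textbook Hamming(7,4) code as a stand-in; real ECC DIMMs use a wider code over 64-bit words, so this only shows the idea, not what the hardware does, and the function names are mine.

```python
def encode(data):
    """Encode 4 data bits as a 7-bit Hamming codeword (positions 1..7)."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # covers positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def decode(codeword):
    """Return (data bits, error position); position 0 means no error seen."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    position = s1 + 2 * s2 + 4 * s4
    if position:
        c[position - 1] ^= 1  # flip the single bad bit back
    return [c[2], c[4], c[5], c[6]], position


word = [1, 0, 1, 1]
stored = encode(word)
stored[4] ^= 1  # simulate a single-bit memory error at position 5
recovered, where = decode(stored)
print(recovered == word, where)
```

Without the three check bits the flipped bit would simply be read back as valid data. Note that this toy version can only handle one flipped bit per codeword; two flips would be "corrected" into the wrong value.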
Sometimes the best solution is to stop wasting time looking for an easy solution.
RAM is dirt cheap and most server systems support significantly more RAM than most people bother to install. For critical systems, ECC works but that doesn't prevent everything (double bit errors etc.). Is it time for a Redundant Array of Inexpensive DIMMs? Many HA servers now support Memory Mirroring (aka RAID-1 http://www.rackaid.com/resources/rackaid-blog/server-dysfunction/memory_mirroring_to_the_rescue/) but should there be more research into different RAID levels for memory (RAID5-6, 10, etc?)
Thanks,
--
Matt
Seriously. If you download a lot, and I do, you see quite a few checksum mismatches in the log.
Especially if the torrent is old. Some of them may be sabotage activity, but I doubt that, considering kind of files.
They are not transmission errors: TCP/IP checks for that. Not hard drive errors - again checksums. They can be intrasystem transmission errors though.
I remember folks who did complete checkers wrote that they had a lot of them too.
I had a pair of Corsair sticks that caused me months of grief. I would get kernel panics that gave absolutely no indication that memory was to blame, and memory tests and stress tests were never able to reproduce the problem. After 9 months I decided to try ignoring all indications that something else was wrong and bought new RAM. Sure enough, 12 months since then, and I haven't had a single problem. I suspect it's an issue to do with timing more than the storage medium itself, which supports Google's theory that it's often caused by bad motherboard design.
Well, hot grits have never been too rich, not even those you have been keeping in your pants all the time.
Ezekiel 23:20
I run Linux so I am immune to this. Besides, I don't even use ECC so I have ZERO CE counts, so I am immune to that problem as well.
At Purdue, many years ago, one of the engineers mapped the ECC RAM errors in a room with hundreds of sparc stations and found that it was mostly in a cone shape pointed toward the window. That window looked out to a pile of coal, so the culprit was assumed to be low level alpha radiation.
All ideas^H^H^H^H^Hprocesses in this post are Patent Pending. (as well as the process of patenting all postings)
Alrighty then, which mainboards have the lowest error rates? TFA seems to have obfuscated that. That's MS's job, I thought Google was supposed to Do No Evile?
The cost of that cleanup, of course, will be borne by taxpayers, not industry.
Um, what was the topic again? My memory isn't what it used to be.
That window looked out to a pile of coal, so the culprit was assumed to be low level alpha radiation.
Alpha radiation is stopped by a sheet of office paper. It certainly wouldn't make it through the window, through the machine case, electromagnetic shield, circuit board, chip case, and into the silicon. Even beta radiation would be unlikely to make it that far.
What is much more likely: thermal effects. IE, infrared from the sun heating up machines near the window.
Please help metamoderate.
Which really makes me question whether these results have any validity outside of google. The study found that the majority of errors appeared to be related to the motherboard, but didn't list any information about the motherboards in use. If they are all custom built for google, then there is absolutely no way for any of us to know whether the error rate they exhibited is representative of what you'd get from average COTS server-grade motherboards currently on the market. Thus these results are meaningless to anyone who uses different motherboards, ie everyone but google.
My takeaway from this paper is that maybe google should hire more technicians who are experienced with non-ecc ram systems. They even believed, prior to this study, that soft errors were the most common error state. I could have told you from the start that was bunk. In over 15 years of burn-in tests as part of pc maintenance, the number of soft-errors observed is... 0. Either the hardware can make it through the test with no error, or there is a DIMM that will produce several errors over a 24 hour test. This doesn't mean that random soft errors never happen when I'm not looking/testing, but the 'conventional wisdom' that soft errors are the predominant memory error doesn't even pass the laugh test.
From looking at the numbers on this report, I get the feeling that hardware vendors are using ECC as an excuse to overlook flaws on flaky hardware. I would now be really interested in a study that compares the real world reliability of ECC vs non-ECC hardware that has been properly QC'd. I'll wager the results would be very interesting, even if ECC still proves itself worth the extra money.
The length difference between both traces for differential clocks to ram must be less than 10 mils (1/4th of a millimeter) so that travel time of both signals is matched to within 2 picoseconds. Think about it. With such stringent requirements, a marginal(*) PCB design can easily cause errors once in a while.
(*) a PCB design where corners have been cut would actually be a good design in this case. Unless Google does not mount the PCBs in a case. But I digress.
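As a sanity check on the 10 mil / 2 picosecond figures, here is a rough back-of-the-envelope sketch. The effective dielectric constant of 4.0 is my assumption (a typical ballpark for FR-4 inner layers); real boards vary, so treat the result as an order-of-magnitude estimate.

```python
import math

C_MM_PER_PS = 0.299792458  # speed of light in vacuum, in mm per picosecond
MM_PER_MIL = 0.0254        # one mil is a thousandth of an inch


def skew_ps(mismatch_mils, eps_eff=4.0):
    """Timing skew from a trace length mismatch.

    eps_eff is an assumed effective dielectric constant, not a measured one.
    """
    length_mm = mismatch_mils * MM_PER_MIL
    velocity_mm_per_ps = C_MM_PER_PS / math.sqrt(eps_eff)
    return length_mm / velocity_mm_per_ps


print(f"10 mils = {10 * MM_PER_MIL:.3f} mm")
print(f"skew for a 10 mil mismatch: {skew_ps(10):.2f} ps")
```

Under that assumption a 10 mil mismatch comes out a bit under 2 ps, which lines up with the numbers above.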
I'm not a coward by any name.
When I was building the computer I'm typing this on, I had the grand idea of building it with so much RAM that I could basically work from RAM. Meaning, for example, that all my running programs and the project I was working on would have to fit in RAM.
Of course, with such a dream, I was concerned about the reliability of my memory. So I wanted ECC. I found out that having ECC memory is not just a matter of buying ECC memory. There are different kinds of ECC memory, and you need to find a combination of memory, motherboard, and CPU that works together. Many sites that offer CPUs and/or motherboards don't list support for ECC among the specifications. Searching for it is difficult, because searching for "ECC" also returns hits for things like "non-ECC" and "ECC: no".
Finally, I found a combination of motherboard and CPU that would support unbuffered ECC DDR2, and a matching pair of memory modules to go with it. And then, when I got all the parts, the RAM didn't fit in the motherboard. Turns out the RAM was FB-DIMM, which had not been listed in the advertisement. I gave up and just bought 2GB of non-ECC RAM to just get the system working. The FB-DIMM (all 8GB of it) is still sitting here, because I haven't found anyone who wants to buy it from me.
Lessons learned: 1. The saying "the nice thing about standards is that there are so many to choose from" is still relevant. I don't know why there have to be so many hardware interfaces to memory chips, but there are, so be careful. 2. Apparently, nobody really cares about ECC RAM, otherwise information would be easier to find. 3. Apparently, AMD CPUs and matching motherboards more usually support ECC RAM than Intel parts and matching motherboards.
Please correct me if I got my facts wrong.
Did they consider the increase in cosmic rays during the lull in solar activity? They say the rate hasn't changed, but is that due to chips becoming more resistant to errors in recent years? Or does a cosmic ray tend to produce only one error on a chip, and denser chips have kept the error rate the same despite more errors?
Damn! I think my mainboard is set to power off before it gets that hot, maybe even if the CPU gets up there, IIRC. But I'm not stuck in a server room, anyways.
Does anyone know if these ECC corrected errors are errors that Memtest86 would have caught?
was because they qualified memory against the motherboard and would tell us what memory to use by maker's part number. We used ECC and I don't recall any memory problems in hundreds of machine years of use. Of course every machine was burned in for days.
Not the cheapest way but, when you pay, you can pay for the quality parts up front or pay for the maintenance costs in the field.
Since parts are cheaper than people...
Of course that meant you had to be willing to take the long term view of profit being based on customer satisfaction.
Or entropy? We just discovered the same about autism and climate change. What's up? We've been working with one eye closed all this time?
For justice, we must go to Don Corleone
Working at a computer surplus I've seen examples of systems that failed a memory test (occasionally) or crashed during Ubuntu install* (more common). Usually someone missed a blown cap, but the second most common cause is dust: I'll pull the RAM, blow the dust out of the DIMM slots, pop the RAM back in, and it's fine. (I do put it back in the same slots, so I'm not just moving a bad bit around in system memory, and I did run memtest86 the first few times this worked for me just to make sure I wasn't smoking crack.) Some of these have those huge dust bunnies, but the worst is actually the finer dust. I'm assuming it's slightly conductive.
*Our Ubuntu install volume is so high we wore out close to 30 CDs a month and I got sick of burning them all the time. So, it is a netbooted automated Ubuntu 8.04 install... it's great for volume install to just be able to pxe boot a box and walk away (this is using the alternate installer and preseed file.). As a bonus it's enough of a burn-in that it seems to catch most flakey systems.
I find conclusion 7 a bit presumptuous. Soft errors are also caused by alpha particles emitted by contaminants in a chip's packaging in addition to cosmic rays. You could imagine that certain DIMMs might have lower quality (i.e. more contaminated) packaging than other DIMMs.
"Karma can only be portioned out by the cosmos." -Homer Simpson
The commands to do this are:
You can watch the scrub address register incrementing using
setpci -d 1022:1203 60.L 5C.L
Similar commands work on the K8 (single-core Athlon 64), but the device is :1103, and leave the msbyte of 58.L alone (there is no L3 cache scrubber).
...who makes a good board that is tested adequately?
Because Intel has been driving DDR3 adoption, their consumer stuff omits ECC support, and they don't officially support DDR3-1600 (even though people are running DDR3-2000 quite successfully), I haven't found DDR3-1600 ECC anywhere. Pointers would be appreciated.
In general, it's KVR<speed>D<ddr#>E<CAS>, where D stands for double-sided and E for ECC. Suffixes after that are S (thermal sensor), K<n> (kit of n DIMMs), and /<x>G (total size). Note K2/4G is 2x2G sticks, not 2x4G.
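If you want to decode these mechanically, here is a small sketch that follows only the pattern described above. It is not Kingston's official scheme, the field names are mine, and the sample part number is a made-up string that fits the pattern, so check anything important against the vendor's datasheet.

```python
import re

# Pattern built from the description above only; not an official grammar.
KVR_PATTERN = re.compile(
    r"^KVR(?P<speed>\d+)"      # speed rating, e.g. 1333
    r"D(?P<ddr>\d)"            # D, then DDR generation
    r"(?P<kind>[A-Z])"         # E means ECC; other letters are not covered here
    r"(?P<cas>\d+)"            # CAS latency
    r"(?P<sensor>S)?"          # optional thermal sensor
    r"(?:K(?P<kit>\d+))?"      # optional kit of n DIMMs
    r"/(?P<total_gb>\d+)G$"    # total size of the whole kit
)


def parse_kvr(part_number):
    """Split a KVR-style part number into fields, or return None."""
    m = KVR_PATTERN.match(part_number)
    if m is None:
        return None
    kit = int(m.group("kit") or 1)
    total_gb = int(m.group("total_gb"))
    return {
        "speed": int(m.group("speed")),
        "ddr": int(m.group("ddr")),
        "ecc": m.group("kind") == "E",
        "cas": int(m.group("cas")),
        "thermal_sensor": m.group("sensor") is not None,
        "dimms": kit,
        "total_gb": total_gb,
        "gb_per_dimm": total_gb / kit,  # K2/4G means 2 x 2G
    }


print(parse_kvr("KVR1333D3E9SK2/4G"))  # hypothetical example string
```

The gb_per_dimm field is there precisely because of the K2/4G gotcha mentioned above.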
This is why God invented ECC memory.
Honestly, if I have to rebuild a Postgres database one more time I'm going to puke.
Even with ECC memory, gamma-ray- and neutrino-induced ECC memory errors cause our generic x86 systems to corrupt memory, and thus corrupt my database. Half the time it corrupts the system table indices which are always kept as a memory-mapped file. Somehow it manages to corrupt the tables themselves.
This is one of those things that the kids at Postgresql.org have no solution for.
My solution is either to use Sun hardware or an x86 server with a recognizable brand name on it with an equally recognizable brand name memory, but that simply cannot happen due to who and how our systems are procured.
Incidentally, I have syslogs full of successfully recovered ECC errors on Sun Solaris machines. Even the non-recoverable ones have not once induced a data loss. All we need to do in these cases is swap the memory module and all is well.
However, I do not have even ONE line of evidence of a recovered ECC memory error from ANY of our generic x86 Linux machines. All we need to do is restore the database from a backup. It's usually corrupted beyond repair.
Why does Google so often seem to discover things the rest of us already knew and write "whitepapers" about them like they've stumbled upon some big, new discovery?
I say to Google: next time give AOL or IBM a call before you publish.
Kriston
Good heavens, RAM is cheap enough these days, just pass through the extra 6% or 12% mark-up that it would cost to add a few ECC bits to a DIMM and make the same DIMM type STANDARD for consumer / workstation as well as server products. Overall it would reduce the cost of server memory and have only a slight impact on desktop memory due to the higher and more stable market volume benefits as well as lower costs due to warranty / QA issues.
It was considered important enough for servers to have ECC back when they might have 512MBy RAM. Now a typical mid range desktop workstation will have 6GBy or so and a 64 bit OS, and servers and power user workstations several times that much.
Given the vastly increased use of 'sleep' capability in workstations the average "uptime" before reboot / reload is likely to be in the range of several months for many systems, especially as patches less often require reboots. That is plenty of time for RAM to get an error due to radiation, EMI, a motherboard glitch, et al. and cause corrupted data or system instability.
It is especially compelling now that so much ram is used for disk cache, and the average disk drive for a mid range consumer system is measured in the terabyte range. So there's a good likelihood that RAM errors will affect not only immediate calculations / system stability but also be persisted to disc from a write buffer and thus cause permanent corruption. Of course most common filesystems (thanks ZFS!) don't do any kind of buffer / block checksumming and are especially susceptible to happily passing on corrupted I/O buffer data between the disc and the filesystem.
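A minimal sketch of what block checksumming buys you, assuming a CRC32 stored alongside each block. This is only an illustration: it is not how ZFS actually lays out or computes its checksums, and the function names are mine.

```python
import zlib


def write_block(data):
    """Return the block together with a checksum computed at write time."""
    return data, zlib.crc32(data)


def is_intact(data, stored_crc):
    """Recompute the checksum on read and compare with the stored one."""
    return zlib.crc32(data) == stored_crc


block, crc = write_block(b"some buffer headed for the disc")

# Simulate a single flipped bit somewhere between the buffer and the platter.
damaged = bytearray(block)
damaged[3] ^= 0x04

print(is_intact(block, crc))           # clean copy passes
print(is_intact(bytes(damaged), crc))  # flipped bit is caught
```

The catch is that the checksum only protects data from the moment it is computed. If the bit flips in RAM before the checksum is taken, the filesystem will faithfully checksum and store the corrupted data, which is why this complements ECC rather than replacing it.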
A little RAM ECC would go a long way toward making PCs 'appliances' instead of temperamental geek toys that still often enough crash or glitch. Although the average person may not always see how close to the edge of corruption / failure a typical system is, if you run something like memtest86 or the test mode of prime95 for a week or so (if it even lasts that long) it is very common to see data corrupting glitches on many systems running at stock factory "stable" configurations.
GPUs are even worse than PC motherboards in this respect, they need ECC ASAP if we're going to rely on GPGPU as anything useful to deliver accurate long running computations at bleeding edge clock speeds.
"Is there anything I can place in my AUTOEXEC.BAT to prevent memory errors? A software patch or something?"
(if you don't know what I am talking about, google NOSMOKE.EXE. Funny read)
I am currently testing DDR3 1600 ECC and non-ECC DIMMs in my lab, currently not available on the market yet, but direct samples from the vendors. Those Kingston DIMMs actually use Elpida parts, similar to the Crucial memory using Micron parts.
Perhaps they should consider putting back the A/C units in their data centers. I'm just saying.
-dZ.
Carol vs. Ghost
Hmmm... The lottery has astronomical odds, which means I have a microscopic chance of winning.
Program Intellivision!
Hard errors may be the most common failure type. The DIMMs themselves appear to be of good quality, and bad mobo...
Shut yo' mouth!
This paper is a fraud, as said before google doesn't use ecc memory.
Anybody know of a linux kernel module that will fake ECC on a regular system? Yeah, I know, it'll be slower.
My God, it's Full of Source!
OUTSIDE_IP=$(dig +short my.ip @outsideip.net)
Thanks, that helps a lot. The memory I bought is actually Kingston, but I apparently didn't know enough about how to read their part numbers to tell that it was FB-DIMM instead of DDR2 SDRAM. The whole exercise sure taught me to look up the part numbers in the future, and not go by the description alone!