Google Finds DRAM Errors More Common Than Believed
An anonymous reader writes "A Google study of DRAM errors in their data centers found that they are hundreds to thousands of times more common than has been previously believed. Hard errors may be the most common failure type. The DIMMs themselves appear to be of good quality, and bad mobo design may be the biggest problem." Here is the study (PDF), which Google engineers published with a researcher from the University of Toronto.
"a mean of 3,751 correctable errors per DIMM per year."
I'm much too lazy to do the math. Let's round up - 4k errors per year has to be a vanishingly small percentage for a system that is up 24/7/365, or 5 nines. The fact that these DIMMs were "stressed" makes me wonder about the validity of the test. Heat stress, among other things, will multiply errors far beyond what you will see in normal service.
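For anyone who does want the math, here is a back-of-the-envelope sketch in Python. It assumes the quoted mean of 3,751 applies to a single DIMM that is powered on all year, which the summary does not spell out.

```python
# Rough arithmetic on the quoted figure. Assumption: the mean applies to one
# DIMM that is powered on for the whole year.
ERRORS_PER_DIMM_PER_YEAR = 3751   # figure quoted from the study
HOURS_PER_YEAR = 365 * 24         # 8760, ignoring leap years

errors_per_day = ERRORS_PER_DIMM_PER_YEAR / 365
errors_per_hour = ERRORS_PER_DIMM_PER_YEAR / HOURS_PER_YEAR
hours_between_errors = HOURS_PER_YEAR / ERRORS_PER_DIMM_PER_YEAR

print(f"{errors_per_day:.1f} correctable errors per day")
print(f"{errors_per_hour:.2f} correctable errors per hour")
print(f"one correctable error every {hours_between_errors:.1f} hours")
```

So on that reading it works out to roughly ten correctable errors a day per DIMM, or one every couple of hours.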
"Windows is like the faint smell of piss in a subway: it's there, and there's nothing you can do about it." - Charlie Br
I use Gentoo; how does this affect me?
Changing your file system solves RAM errors how?
it reduces the effects of universal entropy, obviously.
What I have seen (and generated) is the occasional (2-3/day) bus error with specific (nasty) data patterns, usually at a few addresses. I write that off to mobo trace design and crosstalk between the signals. Failing to round the corners sufficiently, or leaving spurs, is the likely problem. I think HyperTransport is a balanced design (push-pull differential like Ethernet) and should be less susceptible.
Maybe this is explainable by today's story that the universe has 100x more entropy than we thought
Just as likely to crash, less likely to silently scribble bits of nonsense all over the filesystem before doing so...
Obviously, not having RAM errors would be even nicer; but, if you can at least detect trouble when it arises rather than well afterwards, you can avoid having it propagate further, and get away with using cheap redundancy instead of expensive perfection.
I've always thought it would be a nice-to-have feature for my home system to have ECC - perhaps it might degrade over time and misbehave less if it could detect and fix some errors. But my normal sources don't seem to stock many choices. E.g. Newegg appears to have 2 motherboards to choose from, both for AMD CPUs, nothing for Intel. Frys appears to have one, same, AMD only. Is this just the way things are, or do I need to be looking somewhere else? Would picking one of these motherboards end up in not working out well for my gaming rig?
is competition good, or is duplication of effort bad?
In my experience at work ordering Dell desktops and laptops, by far the most common defect is 1-3% of machines with bad RAM. Typically it's made by Hynix, occasionally Hyundai, and I've never seen other brands fail. On many occasions though, I've predicted Hynix, pulled it, and sure enough theirs was the piece causing the errors in Memtest86+...
was only a problem for government computers.
Nullius in verba
Adding checksumming adds another place for errors to occur though -- if data is written correctly but the checksum is miscalculated, either before it is stored or when the data is being verified -- you'll end up throwing out perfectly good data. If you also have redundancy you're probably willing to live with that, but if you're running on a single disk, ZFS is just adding more opportunities for data corruption in RAM.
Read the article and remember they are talking averages here.
They give it away with this line:
Only 8% of DIMMs had errors per year on average. Fewer DIMMs = fewer error problems - good news for users of smaller systems
Essentially, only 8% of their ECC DIMMs reported ANY errors in a given year.
Also this was pretty telling:
Besides error rates much higher than expected - which is plenty bad - the study found that error rates were motherboard, not DIMM type or vendor, dependent.
And this:
For all platforms they found that 20% of the machines with errors make up more than 90% of all observed errors on that platform.
So essentially, they are saying that only 8% of DIMMs reported errors, 90% of which were on 20% of the machines that had errors, mostly because of motherboard issues... yet DIMMs are less reliable than previously thought.
I would imagine that if you removed all of the bad motherboards, power supplies, environmental, and other issues... that DIMMs are actually more reliable than I previously thought, not less! I wonder what percentage of CPU operations yield incorrect results. With billions of instructions per second, even an astronomically low average of undetected CPU errors would guarantee an error at least as often as failed DIMMs.
What I did take from the article was that without ECC ram, you have no way of knowing that your RAM has errors. I guess I should rethink my belief that ECC was a waste of money.
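To put rough numbers on how skewed that average is, here is a small Python sketch. It assumes the quoted mean of 3,751 is taken over all DIMMs, including the ones that never logged an error, and the fleet size is a made-up round number.

```python
# Assumption: the quoted mean of 3,751 covers ALL DIMMs, including the ~92%
# that logged no errors in a given year. FLEET_DIMMS is hypothetical.
FLEET_DIMMS = 100_000
MEAN_ERRORS_PER_DIMM = 3751
PERCENT_WITH_ERRORS = 8

total_errors = FLEET_DIMMS * MEAN_ERRORS_PER_DIMM
affected_dimms = FLEET_DIMMS * PERCENT_WITH_ERRORS // 100
mean_among_affected = total_errors / affected_dimms

print(f"DIMMs with any error: {affected_dimms}")
print(f"mean errors per affected DIMM: {mean_among_affected:.1f}")
```

Under that assumption the typical DIMM sees nothing at all, while the unlucky 8% average tens of thousands of correctable errors a year each.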
Sometimes the best solution is to stop wasting time looking for an easy solution.
RAM is dirt cheap and most server systems support significantly more RAM than most people bother to install. For critical systems, ECC works but that doesn't prevent everything (double bit errors etc.). Is it time for a Redundant Array of Inexpensive DIMMs? Many HA servers now support Memory Mirroring (aka RAID-1 http://www.rackaid.com/resources/rackaid-blog/server-dysfunction/memory_mirroring_to_the_rescue/) but should there be more research into different RAID levels for memory (RAID5-6, 10, etc?)
Thanks,
--
Matt
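A minimal sketch of what RAID-5 style parity across DIMMs could look like, in Python. This is only an illustration of the XOR idea on byte strings; the "DIMMs" here are hypothetical and no real memory controller is claimed to work this way.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Three hypothetical "DIMMs" holding data, plus one holding their parity.
dimms = [b"\x10\x20\x30\x40", b"\x0a\x0b\x0c\x0d", b"\xff\x00\xff\x00"]
parity = xor_blocks(dimms)

# Pretend DIMM 1 has failed: rebuild it from the survivors plus parity.
survivors = [dimms[0], dimms[2], parity]
rebuilt = xor_blocks(survivors)

print(rebuilt == dimms[1])
```

Note that parity alone only rebuilds a block you already know is bad; something like ECC or a checksum still has to tell you which one failed.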
Seriously. If you download a lot, and I do, you see quite a few checksum mismatches in the log.
Especially if the torrent is old. Some of them may be sabotage activity, but I doubt that, considering the kind of files.
They are not transmission errors: TCP/IP checks for that. Not hard drive errors either - again, checksums. They can be intrasystem transmission errors though.
I remember folks who did complete checkers wrote that they had a lot of them too.
It makes the problem magically go away by redirecting his attention to a catchy new gadget.
Ezekiel 23:20
At Purdue, many years ago, one of the engineers mapped the ECC RAM errors in a room with hundreds of sparc stations and found that it was mostly in a cone shape pointed toward the window. That window looked out to a pile of coal, so the culprit was assumed to be low level alpha radiation.
All ideas^H^H^H^H^Hprocesses in this post are Patent Pending. (as well as the process of patenting all postings)
Alrighty then, which mainboards have the lowest error rates? TFA seems to have obfuscated that. That's MS's job; I thought Google was supposed to Do No Evil?
The cost of that cleanup, of course, will be borne by taxpayers, not industry.
Are you kidding?! It's 100x greater than we thought!!
it reduces the effects of universal entropy, obviously.
Sorry, you're looking for the thread two doors over, "Universe Has 100x More Entropy Than We Thought"
That window looked out to a pile of coal, so the culprit was assumed to be low level alpha radiation.
Alpha radiation is stopped by a sheet of office paper. It certainly wouldn't make it through the window, through the machine case, electromagnetic shield, circuit board, chip case, and into the silicon. Even beta radiation would be unlikely to make it that far.
What is much more likely: thermal effects, i.e., infrared from the sun heating up machines near the window.
Please help metamoderate.
Which really makes me question whether these results have any validity outside of google. The study found that the majority of errors appeared to be related to the motherboard, but didn't list any information about the motherboards in use. If they are all custom built for google, then there is absolutely no way for any of us to know whether the error rate they exhibited is representative of what you'd get from average COTS server-grade motherboards currently on the market. Thus these results are meaningless to anyone who uses different motherboards, ie everyone but google.
My takeaway from this paper is that maybe google should hire more technicians who are experienced with non-ecc ram systems. They even believed, prior to this study, that soft errors were the most common error state. I could have told you from the start that was bunk. In over 15 years of burn-in tests as part of pc maintenance, the number of soft-errors observed is... 0. Either the hardware can make it through the test with no error, or there is a DIMM that will produce several errors over a 24 hour test. This doesn't mean that random soft errors never happen when I'm not looking/testing, but the 'conventional wisdom' that soft errors are the predominant memory error doesn't even pass the laugh test.
From looking at the numbers on this report, I get the feeling that hardware vendors are using ECC as an excuse to overlook flaws on flaky hardware. I would now be really interested in a study that compares the real world reliability of ECC vs non-ECC hardware that has been properly QC'd. I'll wager the results would be very interesting, even if ECC still proves itself worth the extra money.
A second vote for ReiserFS. It even flatly denies that any errors occurred.
The length difference between both traces for differential clocks to ram must be less than 10 mils (1/4th of a millimeter) so that travel time of both signals is matched to within 2 picoseconds. Think about it. With such stringent requirements, a marginal(*) PCB design can easily cause errors once in a while.
(*) a PCB design where corners have been cut would actually be a good design in this case. Unless Google does not mount the PCBs in a case. But I digress.
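A quick sanity check of the 10 mil / 2 picosecond figure, as a Python sketch. The propagation speed is an assumption (roughly half the speed of light, a common rule of thumb for FR-4 boards); the real value depends on the board stackup.

```python
# Assumption: signals on the board travel at about 150 mm per nanosecond
# (roughly half the speed of light). This number is not from the post.
MM_PER_MIL = 0.0254
PROPAGATION_MM_PER_NS = 150.0

mismatch_mils = 10
mismatch_mm = mismatch_mils * MM_PER_MIL
skew_ps = mismatch_mm / PROPAGATION_MM_PER_NS * 1000  # ns -> ps

print(f"{mismatch_mm:.3f} mm of mismatch is about {skew_ps:.2f} ps of skew")
```

With that assumed speed, a 10 mil mismatch comes out a little under 2 ps, which is consistent with the figure above.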
I'm not a coward by any name.
When I was building the computer I'm typing this on, I had the grand idea of building it with so much RAM that I could basically work from RAM. Meaning, for example, that all my running programs and the project I was working on would have to fit in RAM.
Of course, with such a dream, I was concerned about the reliability of my memory. So I wanted ECC. I found out that having ECC memory is not just a matter of buying ECC memory. There are different kinds of ECC memory, and you need to find a combination of memory, motherboard, and CPU that works together. Many sites that offer CPUs and/or motherboards don't list support for ECC among the specifications. Searching for it is difficult, because searching for "ECC" also returns hits for things like "non-ECC" and "ECC: no".
Finally, I found a combination of motherboard and CPU that would support unbuffered ECC DDR2, and a matching pair of memory modules to go with it. And then, when I got all the parts, the RAM didn't fit in the motherboard. Turns out the RAM was FB-DIMM, which had not been listed in the advertisement. I gave up and just bought 2GB of non-ECC RAM to just get the system working. The FB-DIMM (all 8GB of it) is still sitting here, because I haven't found anyone who wants to buy it from me.
Lessons learned: 1. The saying "the nice thing about standards is that there are so many to choose from" is still relevant. I don't know why there have to be so many hardware interfaces to memory chips, but there are, so be careful. 2. Apparently, nobody really cares about ECC RAM, otherwise information would be easier to find. 3. Apparently, AMD CPUs and matching motherboards support ECC RAM more often than Intel parts and matching motherboards.
Please correct me if I got my facts wrong.
it reduces the effects of universal entropy, obviously.
Sorry, you're looking for the thread two doors over, "Universe Has 100x More Entropy Than We Thought"
Wooooosh....
Damn! I think my mainboard is set to power off before it gets that hot, maybe even if the CPU gets up there, IIRC. But I'm not stuck in a server room, anyways.
The cost of that cleanup, of course, will be borne by taxpayers, not industry.
Well, that's the problem, these issues are clearly linked: Once we found out that the entropy is 100x greater than we had thought, all the RAM errors immediately went through the roof. We must push the entropy lower to fix the RAMs.
Ezekiel 23:20
ZFS has its own, superior implementation of RAM. Duh.
USE HOT GRITS WITH STATUE OF NATALIE PORTMAN (NAKED AND PETRIFIED)
Or entropy? We just discovered the same about autism and climate change. What's up? We've been working with one eye closed all this time?
For justice, we must go to Don Corleone
Memtest would catch the errors that occur on its watch (if the same error were to happen on non-ECC RAM, and therefore not be corrected by the hardware before memtest even sees it). However, memtest does not detect the errors that happen when it's not running, which should be the point of ECC: think of it as an always-on memtest that keeps your PC going even in the face of failures.
What I see here that I find odd, however, is that Google (and presumably other large data centres) were operating under the presumption that memory errors are just normal and that it is fine to keep going so long as the ECC is able to correct them. That's what I find hard for my small system mindset to comprehend. In my world, when hardware looks like it's unreliable, I schedule a replacement.
Changing your file system can solve a lot of errors and problems. For instance, if you have "Wife is still alive" errors, you can solve it by changing your file system to ReiserFS.
Come on, it's not that hard to figure out.
Give me Classic Slashdot or give me death!
it reduces the effects of universal entropy, obviously.
Sorry, you're looking for the thread two doors over, "Universe Has 100x More Entropy Than We Thought"
Wooooosh....
If you think I meant it, then yes, wooooosh.
I find conclusion 7 a bit presumptuous. Soft errors are also caused by alpha particles emitted by contaminants in a chip's packaging in addition to cosmic rays. You could imagine that certain DIMMs might have lower quality (i.e. more contaminated) packaging than other DIMMs.
"Karma can only be portioned out by the cosmos." -Homer Simpson
The commands to do this are:
You can watch the scrub address register incrementing using
setpci -d 1022:1203 60.L 5C.L
Similar commands work on the K8 (single-core Athlon 64), but the device is :1103, and leave the msbyte of 58.L alone (there is no L3 cache scrubber).
...who makes a good board that is tested adequately?
Adding checksumming adds another place for errors to occur though -- if data is written correctly but the checksum is miscalculated, either before it is stored or when the data is being verified -- you'll end up throwing out perfectly good data.
Throwing out good, re-computable data is a lot less bad than writing bad un-re-computable data. At worst, you have to recompute it. Admittedly, this can be a serious problem, but it can potentially be mitigated by algorithm analysis and statistics to determine an upper bound on the size of the effect of error introduced as a function of "time" or "steps". For example, when I was working as a research analyst doing data mining, a single bit flip (or more generally, an erroneous "row") would not strongly affect the results of machine training. Certainly not enough to make us want to run the program for another week to fix it.
After all, I am strangely colored.
Well go on then. Explain how ZFS's end to end checksumming won't detect corrupted data due to RAM errors. Go on, I'm waiting.
Give me Classic Slashdot or give me death!
This is why God invented ECC memory.
Honestly, if I have to rebuild a Postgres database one more time I'm going to puke.
Even with ECC memory, gamma-ray- and neutrino-induced ECC memory errors cause our generic x86 systems to corrupt memory, and thus corrupt my database. Half the time it corrupts the system table indices which are always kept as a memory-mapped file. Somehow it manages to corrupt the tables themselves.
This is one of those things that the kids at Postgresql.org have no solution for.
My solution is either to use Sun hardware or an x86 server with a recognizable brand name on it with an equally recognizable brand name memory, but that simply cannot happen due to who and how our systems are procured.
Incidentally, I have syslogs full of successfully recovered ECC errors on Sun Solaris machines. Even the non-recoverable ones have not once induced a data loss. All we need to do in these cases is swap the memory module and all is well.
However, I do not have even ONE line of evidence of a recovered ECC memory error from ANY of our generic x86 Linux machines. All we need to do is restore the database from a backup. It's usually corrupted beyond repair.
Why does Google so often seem to discover things the rest of us already knew and write "whitepapers" about them like they've stumbled upon some big, new discovery?
I say to Google: next time give AOL or IBM a call before you publish.
Kriston
"Is there anything I can place in my AUTOEXEC.BAT to prevent memory errors? A software patch or something?"
(if you don't know what I am talking about, google NOSMOKE.EXE. Funny read)
Changing your file system solves RAM errors how?
"NTFS" - 4 bytes of memory
"ZFS" - 3 bytes of memory
That's 25% fewer bytes to get RAM errors in.
I am anarch of all I survey.
Yo dawg, I herd u want checksums for ur checksums...
Finally had enough. Come see us over at https://soylentnews.org/
Perhaps they should consider putting back the A/C units in their data centers. I'm just saying.
-dZ.
Carol vs. Ghost
Ooh, hand waving. You can do better than that. How do you get from valid data, with a valid checksum on disk, to data in RAM, to corrupted data in RAM, to a checksum for that corrupted data, to corrupted data and checksum on disk, without ZFS noticing that the checksums don't match?
Give me Classic Slashdot or give me death!
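The ordering question can be sketched in a few lines of Python. This is a generic illustration, not ZFS's actual code: SHA-256 stands in for whatever checksum the filesystem uses, and `checksum` and `flip_bit` are hypothetical helpers.

```python
import hashlib

def checksum(block: bytes) -> bytes:
    """Stand-in checksum; SHA-256 is an assumption, not a claim about ZFS."""
    return hashlib.sha256(block).digest()

def flip_bit(block: bytes, bit_index: int) -> bytes:
    """Return a copy of block with one bit flipped, simulating a RAM error."""
    corrupted = bytearray(block)
    corrupted[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(corrupted)

original = b"some block of file data"

# Case 1: the bit flips AFTER the checksum was computed.
stored_sum = checksum(original)
damaged = flip_bit(original, 5)
detected_after = checksum(damaged) != stored_sum

# Case 2: the bit flips BEFORE the checksum is computed, so the checksum
# is taken over data that is already wrong and matches it perfectly.
damaged_early = flip_bit(original, 5)
stored_sum_early = checksum(damaged_early)
detected_before = checksum(damaged_early) != stored_sum_early

print("flip after checksum detected: ", detected_after)
print("flip before checksum detected:", detected_before)
```

In this sketch the mismatch only shows up when the corruption happens after the checksum exists; data that is already bad when it gets checksummed is stored as self-consistent.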
Hmmm... The lottery has astronomical odds, which means I have a microscopic chance of winning.
Program Intellivision!
Anybody know of a linux kernel module that will fake ECC on a regular system? Yeah, I know, it'll be slower.
My God, it's Full of Source!
OUTSIDE_IP=$(dig +short my.ip @outsideip.net)
Fair enough.
Give me Classic Slashdot or give me death!
Thanks, that helps a lot. The memory I bought is actually Kingston, but I apparently didn't know enough about how to read their part numbers to tell that it was FB-DIMM instead of DDR2 SDRAM. The whole exercise sure taught me to look up the part numbers in the future, and not go by the description alone!
Please correct me if I got my facts wrong.