Google Finds DRAM Errors More Common Than Believed

Posted by kdawson on Tuesday October 6, 2009 @06:57AM from the forget-me-not dept.

An anonymous reader writes "A Google study of DRAM errors in their data centers found that they are hundreds to thousands of times more common than has been previously believed. Hard errors may be the most common failure type. The DIMMs themselves appear to be of good quality, and bad mobo design may be the biggest problem." Here is the study (PDF), which Google engineers published with a researcher from the University of Toronto.

9 of 333 comments

Re:Percentage? by gspear · 2009-10-06 07:05 · Score: 5, Informative

From the study's abstract:
"We find that temperature, known to strongly impact DIMM error rates in lab conditions, has a surprisingly small effect on error behavior in the field, when taking all other factors into account."
Re:Percentage? by Runaway1956 · 2009-10-06 07:06 · Score: 5, Informative

No, I don't believe so. They use server boards, custom made to their specs. And, I'm pretty sure that those specs include ECC memory - that is the standard for servers, after all. http://news.cnet.com/8301-1001_3-10209580-92.html If you're really interested, that story gives you a starting point to google from.

--
"Windows is like the faint smell of piss in a subway: it's there, and there's nothing you can do about it." - Charlie Br
Bus errors! by redelm · 2009-10-06 07:11 · Score: 5, Informative

Hard DRAM errors are rather hard to explain if the cells are good -- maybe a bad write. After much DRAM testing (I run memtest86+ for a week at a time), I've yet to find bad cells.
What I have seen (and generated) is the occasional (2-3/day) bus error with specific (nasty) data patterns, usually at a few addresses. I write that off to mobo trace design and crosstalk between the signals. Failing to round the corners sufficiently, or leaving spurs, is the likely problem. I think HyperTransport is a balanced design (push-pull differential, like Ethernet) and should be less susceptible.
Dell by ^_^x · 2009-10-06 07:20 · Score: 5, Interesting

In my experience at work ordering Dell desktops and laptops, by far the most common defect is 1-3% of machines with bad RAM. Typically it's made by Hynix, occasionally Hyundai, and I've never seen other brands fail. On many occasions though, I've predicted Hynix, pulled it, and sure enough theirs was the piece causing the errors in Memtest86+...
1. Re:Dell by Jah-Wren+Ryel · 2009-10-06 07:43 · Score: 5, Interesting
  
  Hyundai is Hynix, and they are the second-largest DRAM manufacturer by market share (roughly 20%, second to Samsung's 30%).
  It's no surprise that you've only seen Hynix-brand RAM fail in Dells: chances are it is in 90%+ of Dell (and HP and Apple) boxes, because they primarily buy from Hynix in the first place. It's selection bias.
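The selection-bias argument above can be checked with a little arithmetic. This is a minimal sketch, not anything from the study: the function name, the "other" vendor bucket, and the 2% failure rate are hypothetical, and the 90% installed share is the commenter's own guess.

```python
def share_of_failures(install_share, failure_rate):
    """Fraction of all observed failures attributable to each vendor.

    install_share: vendor -> fraction of machines using that vendor's RAM
    failure_rate:  vendor -> probability that a given machine's RAM is bad
    """
    weighted = {v: install_share[v] * failure_rate[v] for v in install_share}
    total = sum(weighted.values())
    return {v: w / total for v, w in weighted.items()}


# Hypothetical inputs: 90% installed share (the commenter's guess) and an
# identical 2% failure rate for every vendor (an assumption for illustration).
shares = {"Hynix": 0.90, "other": 0.10}
rates = {"Hynix": 0.02, "other": 0.02}

for vendor, fraction in share_of_failures(shares, rates).items():
    print(f"{vendor}: {fraction:.0%} of observed failures")
```

With equal failure rates, each vendor's share of the failures simply mirrors its installed share, so seeing mostly Hynix failures says nothing about Hynix quality by itself.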
  
  --
  When information is power, privacy is freedom.
Misleading, to say the very least. by jhfry · 2009-10-06 07:25 · Score: 5, Interesting

Read the article and remember they are talking averages here.
They give it away with this line:

Only 8% of DIMMs had errors per year on average. Fewer DIMMs = fewer error problems - good news for users of smaller systems
Essentially, only 8% of their ECC DIMMs reported ANY errors in a given year.
Also this was pretty telling:

Besides error rates much higher than expected - which is plenty bad - the study found that error rates were motherboard, not DIMM type or vendor, dependent.

And this:

For all platforms they found that 20% of the machines with errors make up more than 90% of all observed errors on that platform.
So essentially, they are saying that only 8% of DIMMs reported errors, 90% of which were on 20% of the machines that had errors, mostly because of motherboard issues... yet DIMMs are less reliable than previously thought.
I would imagine that if you removed all of the bad motherboards, power supplies, environmental, and other issues, DIMMs would actually turn out to be more reliable than I previously thought, not less! I wonder what percentage of CPU operations yield incorrect results. With billions of instructions per second, even an astronomically low average rate of undetected CPU errors would guarantee an error at least as often as failed DIMMs.
What I did take from the article was that without ECC RAM, you have no way of knowing that your RAM has errors. I guess I should rethink my belief that ECC was a waste of money.
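The concentration of errors described above can be put into a single number. The sketch below is only an illustration: the function name is made up, and the only inputs are the two figures quoted from the article (20% of the machines with errors, more than 90% of the errors).

```python
def per_machine_skew(top_machine_share, top_error_share):
    """Ratio of average errors per machine: worst group vs. everyone else.

    Both arguments are fractions measured among machines that logged
    at least one error.
    """
    worst = top_error_share / top_machine_share
    rest = (1.0 - top_error_share) / (1.0 - top_machine_share)
    return worst / rest


# Figures quoted from the article: 20% of machines, 90% of errors.
print(round(per_machine_skew(0.20, 0.90), 1))
```

Since the article says "more than 90%", the result (about 36) is a lower bound: an average machine in the worst fifth logs at least 36 times as many errors as an average machine in the rest, which fits the reading that a minority of bad machines drives the totals.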

--
Sometimes the best solution is to stop wasting time looking for an easy solution.
Re:Percentage? by silent_artichoke · 2009-10-06 07:32 · Score: 5, Funny

You know, maybe googling it isn't the best idea in this case. Memory errors and all...
clearly not a radiation engineer by SuperBanana · 2009-10-06 08:12 · Score: 5, Insightful

That window looked out to a pile of coal, so the culprit was assumed to be low level alpha radiation.
Alpha radiation is stopped by a sheet of office paper. It certainly wouldn't make it through the window, through the machine case, electromagnetic shield, circuit board, chip case, and into the silicon. Even beta radiation would be unlikely to make it that far.
What is much more likely: thermal effects, i.e., infrared from the sun heating up machines near the window.

--
Please help metamoderate.
Re:Percentage? by Austerity+Empowers · 2009-10-06 11:28 · Score: 5, Informative

I work on server design, specifically motherboards. ECC is a feature: it helps prevent bit errors from passing through undetected. It is not a method for preventing errors from happening in the first place, nor does it influence the number of bit errors. That number is a property of the motherboard design, the chipset, the DIMM PCB and the DRAM. Second, just because you provide a spec for a mobo does not mean that the spec is all-inclusive. Generally people specify form factor, power and features. They don't specify quality, and in most cases they don't give criteria for what it means for a feature to "work". In fact, most customers I've talked to don't really understand what quality means for hardware (and sometimes in general). Hardware management, much like software management, runs on similar principles of impact versus effort: if customers don't care, we don't test. In other words, if it ain't listed on the box, or the salesman won't write it down, just assume it wasn't done.
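The point that ECC catches errors rather than preventing them can be seen in a toy example. What follows is a textbook Hamming(7,4) code, used here only as a stand-in: it is an assumption for illustration, not the code any particular memory controller uses (server ECC typically protects much wider words), and the function names are made up.

```python
def encode(data):
    """Encode 4 data bits as a 7-bit Hamming codeword.

    Layout by position 1..7: p1, p2, d1, p3, d2, d3, d4.
    """
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # parity over positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4  # parity over positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]


def decode(word):
    """Return (data_bits, error_position); position 0 means no error seen."""
    c = list(word)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    position = s1 + 2 * s2 + 4 * s3  # syndrome names the flipped position
    if position:
        c[position - 1] ^= 1  # repair the single flipped bit
    return [c[2], c[4], c[5], c[6]], position


stored = encode([1, 0, 1, 1])
corrupted = list(stored)
corrupted[4] ^= 1  # simulate one bit flipping somewhere between DRAM and CPU
data, position = decode(corrupted)
print(data, position)
```

The decoder hands back the original data and reports where the flip was, but the flip still happened: the code does nothing to make the bus or the DRAM any quieter. This toy code also only handles one flipped bit per word; two flips in the same word would be silently miscorrected.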
In spite of the fact that computer motherboards are digital electronics, there is in fact anything but a binary determination of "work" and "not work". Digital signals are an engineering approximation, one which falls apart at high speeds, dense routing and inexpensive design. Well designed and tested motherboards have a well known bit error rate, and reliable companies will not ship a new design until they meet their target. I do this on systems I design, but they aren't cheap, not by a lot. It is a very expensive, time consuming process, one which most companies really want to get rid of. Not all systems are so thoroughly tested, in fact the vast majority of boards out there, server or otherwise, aren't tested much at all.
Forking over money for ECC is very similar to paying the mob to protect you. Yes, it will give you more peace of mind, but what you really want is to not be having these problems to begin with. If you care about data integrity, you should be asking what the bit error rate is and how the vendor knows. If they don't know, then you don't want it, ECC or no ECC. Don't assume "the industry" is equal, and don't assume that because a vendor's product X is really good that their product Y is really good too: you WILL be wrong, particularly on computers.
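To see why the bit error rate is the number worth asking for, it helps to multiply it out. The sketch below is a rough illustration only: the function name, the 10 GB/s traffic figure, and the three error rates are all hypothetical and do not come from the study or from any vendor.

```python
def expected_bit_errors(bit_error_rate, bytes_per_second, seconds):
    """Expected number of bit errors for a given rate and traffic volume."""
    bits_transferred = bytes_per_second * 8 * seconds
    return bit_error_rate * bits_transferred


SECONDS_PER_DAY = 24 * 60 * 60

# Hypothetical inputs: 10 GB/s of sustained memory traffic and three
# made-up bit error rates, chosen only to show how the result scales.
for ber in (1e-12, 1e-15, 1e-18):
    per_day = expected_bit_errors(ber, 10e9, SECONDS_PER_DAY)
    print(f"BER {ber:.0e}: about {per_day:.3g} expected bit errors per day")
```

Each factor of a thousand in the error rate is a factor of a thousand in how often ECC has to step in, which is why a vendor who cannot state the number cannot tell you much about the board.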