Google Finds DRAM Errors More Common Than Believed
An anonymous reader writes "A Google study of DRAM errors in their data centers found that they are hundreds to thousands of times more common than has been previously believed. Hard errors may be the most common failure type. The DIMMs themselves appear to be of good quality, and bad mobo design may be the biggest problem." Here is the study (PDF), which Google engineers published with a researcher from the University of Toronto.
"We find that temperature, known to strongly impact DIMM error rates in lab conditions, has a suprisingly small effect on error behavior in the field, when taking all other factors into account."
No, I don't believe so. They use server boards, custom made to their specs. And, I'm pretty sure that those specs include ECC memory - that is the standard for servers, after all. http://news.cnet.com/8301-1001_3-10209580-92.html If you're really interested, that story gives you a starting point to google from.
"Windows is like the faint smell of piss in a subway: it's there, and there's nothing you can do about it." - Charlie Br
What I have seen (and generated) is the occasional (2-3/day) bus error with specific (nasty) data patterns, usually at a few addresses. I write that off to mobo trace design and crosstalk between the signals. Failing to round the corners sufficiently, or leaving spurs, is the likely problem. I think HyperTransport is a balanced design (push-pull differential, like Ethernet) and should be less susceptible.
I work on server design, specifically motherboards. ECC is a feature: it helps prevent bit errors from passing through undetected. It is not a method for preventing errors from happening in the first place, nor does it influence the number of bit errors. That is a property of the motherboard design, the chipset, the DIMM PCB and the DRAM. Second, just because you provide a spec for a mobo does not mean that it is all-inclusive. Generally people specify form factor, power, features. They don't specify quality, and in most cases they don't give any criteria for what it means for a feature to "work". In fact, most customers I've talked to don't really understand what quality means for hardware (and sometimes in general). Hardware management, much like software, works on the principle of impact versus effort: if customers don't care, we don't test. In other words, if it ain't listed on the box, or the salesman won't write it down, just assume it wasn't done.
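To make the "detects, doesn't prevent" point concrete, here is a minimal sketch of a SECDED code (single-error-correct, double-error-detect) in Python. It is a toy: an extended Hamming(8,4) code over a 4-bit value, with function names I made up for illustration. Real server ECC typically protects 64 data bits with 8 check bits and lives in the memory controller, so treat this as an illustration of the idea, not of any particular chipset.

```python
def encode(nibble):
    """Encode 4 data bits as an 8-bit extended Hamming codeword (list of bits).

    Index 0 holds the overall parity bit; indices 1..7 are the classic
    Hamming(7,4) positions with parity at 1, 2, 4 and data at 3, 5, 6, 7.
    """
    code = [0] * 8
    code[3] = nibble & 1
    code[5] = (nibble >> 1) & 1
    code[6] = (nibble >> 2) & 1
    code[7] = (nibble >> 3) & 1
    code[1] = code[3] ^ code[5] ^ code[7]  # covers positions 1, 3, 5, 7
    code[2] = code[3] ^ code[6] ^ code[7]  # covers positions 2, 3, 6, 7
    code[4] = code[5] ^ code[6] ^ code[7]  # covers positions 4, 5, 6, 7
    for bit in code[1:]:
        code[0] ^= bit                     # overall parity over positions 1..7
    return code


def decode(codeword):
    """Return (nibble, status); status is 'ok', 'corrected' or 'uncorrectable'."""
    code = list(codeword)
    syndrome = 0
    overall = 0
    for pos, bit in enumerate(code):
        overall ^= bit
        if bit and pos:
            syndrome ^= pos                # XOR of the positions of all set bits
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        # Odd number of flips: assume one, located at `syndrome`
        # (a syndrome of 0 means the overall parity bit itself flipped).
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Even number of flips with a nonzero syndrome: detected, not fixable.
        return None, "uncorrectable"
    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return nibble, status


def flip(codeword, *positions):
    """Simulate bit errors on the bus or in the DRAM by flipping bits."""
    damaged = list(codeword)
    for pos in positions:
        damaged[pos] ^= 1
    return damaged
```

Note that the flip still happened in both the "corrected" and the "uncorrectable" case. The code only changes what the CPU sees afterwards; the number of errors the board produces is exactly the same with or without it.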
In spite of the fact that computer motherboards are digital electronics, there is anything but a binary determination of "work" and "not work". Digital signals are an engineering approximation, one which falls apart at high speeds, with dense routing and inexpensive design. Well-designed and tested motherboards have a well-known bit error rate, and reliable companies will not ship a new design until it meets their target. I do this on systems I design, but they aren't cheap, not by a long shot. It is a very expensive, time-consuming process, one which most companies really want to get rid of. Not all systems are so thoroughly tested; in fact the vast majority of boards out there, server or otherwise, aren't tested much at all.
Forking over money for ECC is very similar to paying the mob to protect you. Yes, it will give you more peace of mind, but what you really want is to not have these problems to begin with. If you care about data integrity, you should be asking what the bit error rate is and how the vendor knows. If they don't know, then you don't want it, ECC or no ECC. Don't assume "the industry" is equal, and don't assume that because a vendor's product X is really good that their product Y is really good too: you WILL be wrong, particularly on computers.
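For a feel of why "how they know" is expensive, here is a rough back-of-the-envelope sketch in Python. It assumes bit errors are independent and that the test sees zero errors, in which case the probability of a clean run of n bits at error rate p is about exp(-n*p). The function names and the example numbers (target rate, bus throughput) are mine and purely hypothetical, not from any vendor or from the study.

```python
import math


def bits_needed(target_ber, confidence=0.95):
    """Bits that must transfer error-free to claim BER < target_ber.

    Assumes independent errors: P(no errors in n bits) ~= exp(-n * BER),
    so we need exp(-n * target_ber) <= 1 - confidence.
    """
    return -math.log(1.0 - confidence) / target_ber


def soak_hours(target_ber, bits_per_second, confidence=0.95):
    """Hours of error-free traffic required at a given sustained bit rate."""
    return bits_needed(target_ber, confidence) / bits_per_second / 3600.0


# Hypothetical numbers: a 1e-15 target, tested at a sustained 1e10 bits/s.
print(f"{bits_needed(1e-15):.3g} bits, {soak_hours(1e-15, 1e10):.0f} hours")
```

With those made-up numbers you need roughly 3e15 clean bits, or about 83 hours of sustained traffic per configuration, and a single observed error means a longer run. Multiply that by every DIMM population, temperature and voltage corner you want to cover and it is easy to see why most boards never get this treatment.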