Reliability of Computer Memory?
olddoc writes "In the days of 512MB systems, I remember reading about cosmic rays causing memory errors and how errors become more frequent with more RAM. Now, home PCs are stuffed with 6GB or 8GB and no one uses ECC memory in them. Recently I had consistent BSODs with Vista64 on a PC with 4GB; I tried memtest86 and it always failed within hours. Yet when I ran 64-bit Ubuntu at 100% load and using all memory, it ran fine for days. I have two questions: 1) Do people trust a memtest86 error to mean a bad memory module or motherboard or CPU? 2) When I check my email on my desktop 16GB PC next year, should I be running ECC memory?"
My first computer was an 80286 with 1 MB of RAM. That RAM was all parity memory. Cheaper than ECC, but still good enough to positively identify a genuine bit flip with great accuracy. My 80386SX had parity RAM, so did my 486DX4 120. I ran a computer shop for some years, so I went through at least a dozen machines ranging from the 386 era through the Pentium II era, at which point I sold the shop and settled on an AMD K6-2 450. And right about the time that the Pentium was giving way to the Pentium II, non-parity memory started to take hold.
What protection did parity memory provide, anyway? Not much, really. It would detect a flipped memory bit with something like 99.99...% accuracy, but it could not tell you which bit had flipped. The result was that if parity failed, you'd see a generic "MEMORY FAILURE" message and the system would instantly lock up.
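For anyone who wants to see what a parity check actually does, here is a minimal sketch in Python. It assumes one even-parity bit per 8-bit byte; the function names are mine and purely illustrative, not anything from a real BIOS or memory controller. It shows the two limits: a single flip is caught but not located, and a double flip in the same byte is not caught at all.

```python
def parity_bit(byte: int) -> int:
    """Even-parity bit for an 8-bit value: 1 if the number of set bits is odd."""
    return bin(byte & 0xFF).count("1") % 2


def check(byte: int, stored_parity: int) -> bool:
    """True if the byte still agrees with the parity bit stored alongside it."""
    return parity_bit(byte) == stored_parity


original = 0b10110010
p = parity_bit(original)            # parity stored when the byte was written

one_flip = original ^ 0b00000100    # one bit flipped
two_flips = original ^ 0b00000110   # two bits flipped in the same byte

print(check(original, p))    # untouched byte passes the check
print(check(one_flip, p))    # single flip fails the check (detected)
print(check(two_flips, p))   # double flip passes the check (missed)
```

All the check returns is pass or fail, which is why the machine could do nothing smarter than halt.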
I saw this message perhaps three times - it didn't really help much. I had other problems, but when I've had problems with memory, it's usually been due to mismatched sticks, or sticks that are strangely incompatible with a specific motherboard, etc., none of which caused a parity error. So, if it matters, spend the money and get ECC RAM to eliminate the small risk of parity error. If it doesn't, don't bother, at least not now.
Note: having more memory increases your error rate assuming a constant rate of error (per megabyte) in the memory. However, if the error rate drops as technology advances, adding more memory does not necessarily result in a higher system error rate. And based on what I've seen, this most definitely seems to be the case.
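The arithmetic behind that is just rate times size. A quick sketch in Python, where the error rates are made-up numbers chosen only to show the scaling (I am not claiming these are measured figures for any real RAM):

```python
def expected_errors_per_year(memory_mb: float, errors_per_mb_year: float) -> float:
    """Expected error count, assuming errors are independent and uniform per MB."""
    return memory_mb * errors_per_mb_year


# HYPOTHETICAL rate, for illustration only.
OLD_RATE = 1e-3            # errors per MB per year

old_box = expected_errors_per_year(512, OLD_RATE)

# Same per-MB rate, 16x the memory: 16x the expected errors.
new_box_same_rate = expected_errors_per_year(8192, OLD_RATE)

# If the per-MB rate has also dropped 16x, the system rate is unchanged.
new_box_better_ram = expected_errors_per_year(8192, OLD_RATE / 16)

print(old_box, new_box_same_rate, new_box_better_ram)
```

So whether an 8GB box is worse off than a 512MB box depends entirely on how much the per-megabyte rate has improved in the meantime.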
Remember this blog article about the end of RAID 5 in 2009? Come on... are you really going to think that Western Digital is going to be OK with near 100% failure of their drives in a RAID 5 array? They'll do whatever it takes to keep it working because they have to - if the error rate became anywhere near that high, their good name would be trashed because some other company (Seagate, Hitachi, etc) would do the research and pwn3rz the marketplace.
I have no problem with your religion until you decide it's reason to deprive others of the truth.
As for ECC in memory... The problem is that ECC carries a heavy performance hit on write. If you only want to write 1 byte, you still have to read in the whole QWord, change the byte, and write it back to get the ECC to recalculate correctly. It is because of that performance hit that ECC was deprecated. The problem goes away to a large extent if your cache is write-back rather than write-through; though there will be still a significant number of cases where you have to write a set of bytes that has not yet been read into cache and does not comprise a whole ECC word.
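To make the read-modify-write point concrete, here is a toy model in Python. It assumes a 64-bit ECC word, and the check bits are a simple XOR of the bytes, which is only a stand-in: real controllers use a SECDED Hamming-style code. The class and function names are invented for the illustration.

```python
WORD_BYTES = 8  # one 64-bit ECC word (a QWord)


def check_bits(word: bytes) -> int:
    """Stand-in for ECC check bits: XOR of all bytes (NOT a real SECDED code)."""
    acc = 0
    for b in word:
        acc ^= b
    return acc


class EccWord:
    """One ECC-protected word; counts the full-word accesses it sees."""

    def __init__(self) -> None:
        self.data = bytes(WORD_BYTES)
        self.check = check_bits(self.data)
        self.reads = 0
        self.writes = 0

    def read(self) -> bytes:
        self.reads += 1
        return self.data

    def write(self, word: bytes) -> None:
        assert len(word) == WORD_BYTES
        self.writes += 1
        self.data = bytes(word)
        self.check = check_bits(self.data)  # check bits cover the whole word


def write_byte(mem: EccWord, offset: int, value: int) -> None:
    """Partial write: read the whole word, patch one byte, write it all back."""
    word = bytearray(mem.read())
    word[offset] = value
    mem.write(bytes(word))


mem = EccWord()
write_byte(mem, 3, 0xAB)
print(mem.reads, mem.writes)  # one full-word read plus one full-word write
```

Because the check bits depend on every byte in the word, a one-byte store cannot skip the read unless the word is already sitting in cache.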
AFAIK, on modern computer systems all memory is always written in chunks larger than a byte. I seriously doubt there's any system out there that can perform single-bit writes either in the instruction set, or physically down the bus. ECC is most certainly not "deprecated" -- all standard server memory is always ECC, I've certainly never seen anything else in practice from any major vendor.
The real issue is that ECC costs a little bit more than standard memory, including additional traces and logic in the motherboard and memory controller. The differential cost of the memory is some fixed percentage (it needs extra storage for the check bits), but the additional cost in the motherboard is some tiny fixed $ amount. Apparently for most desktop motherboard and memory controllers that few $ extra is far too much, so consumers don't really have a choice. Even if you want to pay the premium for ECC memory, you can't plug it into your desktop, because virtually none of them support it. This results in a situation where the "next step up" is a server class system, which is usually at least 2x the cost of the equivalent speed desktop part for reasons unrelated to the memory controller. Also, because no desktop manufacturers are buying ECC memory in bulk, it's a "rare" part, so instead of, say, 20% more expensive, it's 150% more expensive.
I've asked around for ECC motherboards before, and the answer I got was: "ECC memory is too expensive for end-users, it's an 'enterprise' part, that's why we don't support it." - Of course, it's an expensive 'enterprise' part BECAUSE the desktop manufacturers don't support it. If they did, it'd be only 20% more expensive. This is the kind of circular marketing logic that makes my brain hurt.
I find that a Windows machine, from Windows 2000 on up, remains quite stable and usable as long as you take care not to install too many programs or immature software and junkware. The trouble with Windows is the culture. It seems everything wants to install and run a background process or a quick-launcher or a taskbar icon. It seems many don't care about loading old DLLs over newer ones. There is a lot of software misbehavior in Windows-world. (To be fair, there is software misbehavior in MacOS and Linux as well, but I see it far less often.) But Windows by itself is typically just fine.
Since the problem is Windows culture and not Windows itself, one has to educate one's self in order to avoid the pitfalls that people tend to associate with Windows itself.
Agreed. People who will sit and tell me with a straight face that Vista, in their experience, is unstable are either very unlucky, or liars. Windows stopped being generally unstable years ago. Get with the times.
I'm not convinced. I have a fairly old desktop at work I keep for Outlook use only. After a few days Outlook's toolbar becomes unresponsive, and whenever I shut it down it stalls and requires a poweroff. Task manager doesn't say I'm using that much memory (still got cached files in physical RAM).
I don't use Windows much, so I'm not used to the tricks that keep it running, whereas I probably use those tricks subconsciously to keep my Linux workstation and laptop running.
I wonder if Windows' continued increase in stability is, at least partly, people subconsciously learning how to adapt to it.
To all the posters who think the parent is a bad mechanic I will tell you my anecdote: I have never had a hard drive fail. Never. Not on a fresh computer and not on a decade old one.
Either I have magic hands, hard drives don't fail that often or /.ers can't handle hard drives.
Or people can beat the odds. By chance, sometimes you win in a casino.
Knowledge is power. Knowledge shared is power lost.
People who will sit and tell me with a straight face that Vista, in their experience, is stable are either very lucky, or Microsoft shills.
See? I can say the opposite, and provide just as much evidence. Do I get modded to 5 as well? Where are your statistics on the stability of Vista? It worked well for you, therefore it works well for everyone else?
I worked for a company that bought a laptop of every brand, so that when the higher-ups went into meetings with Dell, HP, Apple, etc. they had laptops that weren't made by a competitor. They have had problems like laptops not starting up the first time due to incompatible software. That was as recent as 6 months ago. My mother-in-law bought a machine that has plenty of Vista-related problems (audio cutting out, USB devices not working, random crashes in explorer) on new mid-range hardware that came with Vista. But I have a neighbor who found it fixed lots of problems with gaming under XP.
There are plenty of issues. Vista's problems weren't just made up because you didn't experience them.
Everybody's experience is different. Quit making blanket statements based on nothing.
I was recently running a server that archived about 2 terabytes of data and got something like one or two bit errors per year that could not be traced to any known source. This server boasted ECC RAM, so the problem didn't occur in the RAM, and it was unlikely for it to have occurred in the FSB.
If you go with non-ECC, I would suggest running memtest86+. If you get errors, swap the memory. If swapping the memory still doesn't take care of it, swap motherboards! I recently had a memory problem in one of my customers' racks, and running memtest86+ got nothing until I had it running on my bench for over a week. There may be some problems with memtest86+... I even had another bit-error that memtest86+ did not find, but a Linux command-line memory tester found a problem almost immediately.
The problem here is that different testing/usage patterns result in different probabilities of finding potentially bad words, e.g. words that may only be bad if you read from them a hundred cycles consecutively. But, if you do see a failure in memtest86+ or the CLI tester, you got yourself a serious problem. The point to take from this is that if you don't see errors, that doesn't mean you don't have errors!
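Here is a toy illustration of why the pattern matters, written in Python against a simulated memory rather than real hardware. The fault model (a single bit stuck at 1) and all the names are assumptions for the example; this is not how memtest86+ is implemented, just the general idea of a write-then-read-back pattern test.

```python
class FaultyMemory:
    """Simulated RAM with one bit stuck at 1 (hypothetical fault model)."""

    def __init__(self, size: int, bad_addr: int, bad_bit: int) -> None:
        self.cells = bytearray(size)
        self.bad_addr = bad_addr
        self.bad_mask = 1 << bad_bit

    def write(self, addr: int, value: int) -> None:
        self.cells[addr] = value & 0xFF

    def read(self, addr: int) -> int:
        value = self.cells[addr]
        if addr == self.bad_addr:
            value |= self.bad_mask  # the stuck bit always reads back as 1
        return value


def pattern_test(mem: FaultyMemory, size: int, pattern: int) -> list:
    """Write one pattern everywhere, read it back, return failing addresses."""
    for addr in range(size):
        mem.write(addr, pattern)
    return [addr for addr in range(size) if mem.read(addr) != pattern]


SIZE = 64
mem = FaultyMemory(SIZE, bad_addr=17, bad_bit=5)

print(pattern_test(mem, SIZE, 0xFF))  # pattern has bit 5 set: fault is invisible
print(pattern_test(mem, SIZE, 0xAA))  # bit 5 set here too: still invisible
print(pattern_test(mem, SIZE, 0x00))  # bit 5 clear: address 17 fails
```

A clean pass on one pattern only tells you the fault, if there is one, does not show up under that pattern. Real faults that depend on timing, temperature or neighbouring cells are harder still to provoke.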
Having said this, I still don't think memory errors among PCs are that common. We have more RAM on machines these days, but at the same time, the manufacturing processes have become better. My personal conviction is that although the likelihood of a word error has gone up with the increased number of words in memory, the RAM itself has become so much more "solid" that the net increase is negligible. Now, if you do dumb things with your computer like running it without a case or not giving it ventilation (learned this the hard way) or overclocking it, you *WILL* still run into problems. But if you design a system with quality and integrity, you typically shouldn't have these issues with memory!
One last thing to point out: there is quality hardware, and there is cheap hardware. My PC-Chips motherboard ran for three months and two days, and I didn't have a problem. Two days out of warranty. Now, take my MSI motherboard. It sets the timing for all memory modules to have the values of a single module. This resulted in stable single module operation, but got flaky for all four modules. I finally moved to ECC before I figured out that I had to manually set the correct timings. This board is an ultra board, but apparently it does not cope with generic (Micron, Corsair, etc!! - tried 'em all) memory modules. People on the Newegg reviews board have memory issues with this board as well that they could not fix with a BIOS update, and it appears that sometimes a design just is bad! Even the "good" manufacturers do not spend a lot of effort to fix issues in some cases.
My words of advice: Do your homework. Read through the reviews. AND DON'T BUY HARDWARE AS SOON AS IT COMES OUT!
Dude! Take a chill pill. This is not FUD. The gp is just relating his experience, and here's a shock, YMMV! So just sit back and have another beer.
BTW, I've also had major hassles with Windows - mostly related to viruses. As it happens this forced me to switch 100% to Linux and I'm happy here, but not everyone who switches is. Personally I like the bandwidth I save from not constantly downloading AV updates, and the speed increase from not running AV. But hey, where you are, computing power and bandwidth are probably cheap. Again, YMMV.
I have determined that my sig is indeterminate.
Not only that, but PC manufacturers install all sorts of crapware. Just because you have a new PC doesn't mean it has a "clean install" of Windows on it.
/usr/games/fortune
You are right and you are wrong. Yes, it's true that Vista, XP or even Windows 2k are rock solid, but only as long as you don't add third party hardware drivers of dubious quality. Unfortunately many hardware vendors don't spend as much effort as they should to develop good drivers. Just using the drivers that come with Windows leaves you with a rather small set of supported hardware, so people install whatever drivers come with the hardware they buy, and as a result they get BSODs if they are unlucky, and then they blame Microsoft.
God is REAL! Unless explicitly declared INTEGER
If you look at the username it's not him at all, it's someone with ID 1344097 pretending to be him. Still, what he says is sensible, and what's wrong with this piece? If it doesn't interest you, why are you reading the comments?
which is totally what she said
Or they're running crappy hardware. Most people blame Windows when their hardware is constantly running on the edge of failure. They have a computer that works fine out of the box, but crashes when the PSU can't keep up with the fifth USB device plugged in. Maybe some heat sinks are clogged with dust.
The OS running on the cheapest hardware with the most clueless user base has the highest failure rate? You don't say!
They use Linux, so the kernel is stable and updates aren't needed unless new features are implemented that the older kernel doesn't support. Since it is already a production machine that scenario never arises. If only you weren't an Anonymous Coward I'd go on (but kudos to at least being smart enough to hide).
;-0
And for those who will go to the security well here, we call it a trade-off. For many systems uptime is more important. It generally isn't a very big risk to run an older Linux kernel, though it is more risky than updating. In a world of blind men, the one-eyed man is king. We can sacrifice a modicum of security, exchanging our plate mail for chain mail, and still feel confident because we are surrounded with weaponless peasants.
Guns don't kill people; Physics kills people! - John Lithgow as Dick Solomon on Third Rock From The Sun
Or could it be that they have a queue full of machines waiting for reinstalls, etc? No. It couldn't be that, since we all know that the thousands of people saying they have had major problems are liars, and we have as evidence a few people who claim that they haven't had major problems, or don't know that they have problems ;-)
I worked for a company that bought a laptop of every brand, so that when the higher-ups went into meetings with Dell, HP, Apple, etc. they had laptops that weren't made by a competitor. They have had problems like laptops not starting up the first time due to incompatible software. That was as recent as 6 months ago. My mother-in-law bought a machine that has plenty of Vista-related problems (audio cutting out, USB devices not working, random crashes in explorer) on new mid-range hardware that came with Vista. But I have a neighbor who found it fixed lots of problems with gaming under XP.
On the other hand, my Linux server freezes up and needs to be reset (sometimes even reboot -f doesn't work) every few days due to a kernel bug, probably some unfortunate interaction with the hardware or BIOS. (I'm using no third-party drivers, only stock Ubuntu 8.04.) And hey, in the ext4 discussions that popped up recently, it emerged that some people had their Linux box freeze every time they quit their game of World of Goo. Just yesterday I had to kill X via SSH on my desktop because the GUI became totally unresponsive, and even the magic SysRq keys didn't seem to work. Computers screw up sometimes.
What's definitely true is that Windows 9x was drastically less stable than any Unix. Nobody could use it and claim otherwise with a straight face. Blue screens were a regular experience for everyone, and even Bill Gates once blue-screened Windows during a freaking tech demo.
This is just not true of NT. I don't know if it's quite as stable as Linux, but reasonably stable, sure. Nowhere near the hell of 9x. I used XP for several years and now Linux for about two years, and in my experience, they're comparable in stability. The only unexpected reboots I had on a regular basis in XP were Windows Update forcing a reboot without permission. Of course there were some random screwups, as with Linux. And of course some configurations showed particularly nasty behavior, as with Linux (see above). But they weren't common.
Of course, you're right that none of us have statistics on any of this, but we all have a pretty decent amount of personal experience. Add together enough personal experience and you get something approaching reality, with any luck.
MediaWiki developer, Total War Center sysadmin