To ECC Or Not To ECC?
MetaHiro asks: "I'm going to be upgrading my system in a couple of weeks. I've been looking around the net for reviews and/or benchmarks for ECC vs. non-ECC in both speed and whether or not it's worth it to shell out the extra bucks for ECC. I'm also wondering whether or not I should buy PC2100 ECC instead of PC2700 non-ECC RAM or wait until PC2700 ECC becomes available."
For what are you using your system? If it's just another gaming PC then ECC isn't worth it.
How is this helpful? The philosophy behind that seems to be that, rather than allow my programs to continue with a corrupt bit of data, it's better to halt all operation and LOSE ALL MY DATA and perhaps corrupt my hard drive. That's "help" I don't need.
Is this universal, or just my OS (W2K), BIOS, or hardware? Is there a way for ECC to simply and calmly report a problem without locking up my machine in the process?
I've got a Tyan S2460 MB w/ 1Gig PC2100 ECC RAM & 2 1.4GHz Athlon MPs. It uses PhoenixServer BIOS, and the BIOS gives me these options re: ECC
SERR Signal Condition: (ECC error conditions that SERR# be asserted [sic])
ECC Config: (No ECC; Checking Only; Checking and Correction; Checking, Correction with Scrubbing)
So, my question would be: if this is basically a home machine with no mission-critical stuff running on it, but I'd like to get some benefits from my expensive ECC RAM without BSODing, what settings would be best in this scenario? Right now I've got everything turned off (Signal Condition: None, ECC Disabled). Oh, and what the heck is scrubbing?
And yes, I've attempted to RTFM but it's little more than a pamphlet and I can't find good, clear info about this.
My understanding is that the speed penalty is very small. Is that correct?
Scrubbing detects and corrects memory errors that are in memory addresses that are idle. This prevents correctable errors from turning into uncorrectable errors in sections of memory that are infrequently accessed by the CPU.
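To see why scrubbing matters for rarely touched memory, here is a rough Python sketch. It is a toy model, not how any particular chipset does it: each idle word just counts its flipped bits, a word with one flipped bit is treated as correctable, and a word with two or more is treated as lost. The function name `simulate`, the flip probability, and the scrub interval are all made-up assumptions for illustration.

```python
import random

def simulate(words, steps, flip_prob, scrub_every=None, seed=1):
    """Return how many idle words end up uncorrectable (2+ flipped bits).

    Toy model: a second flip in a word is always assumed to hit a
    different bit, so two flips always mean an uncorrectable word.
    """
    rng = random.Random(seed)
    flips = [0] * words      # flipped bits currently sitting in each word
    lost = set()             # words that reached 2+ flipped bits
    for step in range(1, steps + 1):
        for w in range(words):
            if rng.random() < flip_prob:
                flips[w] += 1
                if flips[w] >= 2:
                    lost.add(w)
        # A scrub pass reads every word and rewrites the ones that
        # still have a single (correctable) flipped bit.
        if scrub_every and step % scrub_every == 0:
            for w in range(words):
                if flips[w] == 1:
                    flips[w] = 0
    return len(lost)

print("lost without scrubbing:", simulate(1000, 200, 0.01))
print("lost with scrubbing:   ", simulate(1000, 200, 0.01, scrub_every=10))
```

The error rate here is exaggerated by many orders of magnitude so the effect is visible in a short run; the point is only that a periodic scrub clears single-bit errors before a second one can land in the same word.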
You might want to try a few searches on groups.google.com for S2460 and the brand of memory you are using. I have an S2460 and I learned the hard way that it is a very fussy board when it comes to what memory you use with it. In particular, Crucial memory can cause problems (although I have now got my board working OK with Crucial memory). The basic rule appears to be that if the memory is not on the Tyan approved list, it will be problematic.
(Sorry to reply to my own post)
Some other things you might want to try:
- get the 1.4 version of the BIOS
- set 'maxmem' in boot.ini
- if you have an nVidia GeForce, make sure you have the most recent drivers - some earlier drivers cause a lot of problems
ECC is useless if you are running something like Windows 98, which crashes on its own so much anyway, and would BSOD on an ECC error. ECC can't work with a Celeron, only with the PII, PIII, and PIV, which also have their own ECC caches. With something like Windows NT/2000/XP it is worth it, because they have long uptimes. To the poster who BSODs with ECC off: turn it on, like the other poster said.
ECC in the old PC133, PC100, etc. style memory changed the minimum number of wait states you can have (CAS3 instead of CAS2).
I believe ECC is worth having, even if you are not using the computer to run "mission critical" tasks. Memory problems on a computer without parity or ECC can be very difficult to diagnose. The symptoms may look like a flakey operating system or application, not a hardware problem. I had one computer that would only fail when someone ran the FORTRAN compiler. The symptoms looked exactly like a bug in the FORTRAN compiler. It turned out that one of the DRAM chips had a pattern sensitivity problem that was triggered by the image of the FORTRAN compiler. These kinds of problems can be difficult to detect and fix without hardware support. The memory diagnostics in the power-on self-test in the BIOS will detect hard errors, but not more subtle errors.
I've got a Tyan Tiger MP too. Funny thing is, I get all kinds of lockups, blue screens, and reboots when ECC is disabled. The only way I can keep my machine running is to turn "Both" and "ECC Scrubbing" on in the BIOS. Hope it helps...
Set SERR to None, which won't BSOD the machine by raising an error NMI, and set ECC to Checking, Correction w/ Scrubbing.
Whether you use parity, non-parity, or even ECC, you should ALWAYS test your RAM sticks with MemTest86.
Test them when newly purchased (I've received duds from brand-name online memory warehouses.) Test them every few months (they can and do go bad.) Especially test when your computer exhibits otherwise unexplainable behavior, like: Windows BSoD, kernel panics, characters changing themselves on disk willy-nilly, programs crashing for no good reason, or going bad on disk and needing reinstallation. Disk files that go corrupt. Any of the above, even (or especially) when it seems inconsistent, can be caused by a few bad blocks in a RAM stick.
MemTest86 is a program that boots and runs off floppy (has its own boot loader, no OS), and t-h-o-r-o-u-g-h-l-y tests your RAM. It even detects adjacent cell errors, where a 1 in cell n can threshold bias the 0 in cell n+1 or n-1 until it is considered a 1.
It even knows how to differentiate between cache memory errors and RAM errors. Just do it (after nightmare hardware problems, MemTest86 showed me what was broken; can't say enough good things about it). Its user interface could be more informative, but when it spots an error, you'll know.
What the fuck are you smoking? 3400 wrong /.ers is a light day of comments.
Huh? If ECC isn't worth it, then RAID 5 (the minimum-acceptable "poor man's" form of RAID) certainly isn't, for the same reasons.
If you're correct, then I can say this, using the same logic: "In my experiences, RAID-5 is not worthwhile. There are too many ways the data can get corrupted before it ever hits the disk."
Heck, all the built-in hard drive ECC, SMART technology, sector relocation, CRC-checking, etc. are useless, if we follow your argument to its logical conclusion.
Since ECC and RAID-5 are similar technologies and perform similar roles in similar ways, and since RAM is always far more important than disk, at least once the OS is booted, ECC is more important than RAID, yet many data centers skimp on ECC and spend on RAID. What's silly is that if MEMORY IS CORRUPT, THEN DISK CERTAINLY WILL BE -- PERMANENTLY.
A 1-bit error is the most common kind of memory error and can crop up for a multitude of reasons, including static, voltage spikes, bad motherboard timings, cosmic rays, etc. And, you'll still catch the 2-bit errors, the second most common kind. I'd be willing to bet that 1 and 2 bit errors account for 99+% of all memory errors, unless you got a bad chip. ECC was NEVER designed to fix all errors, just the 99+% we actually encounter.
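To make the "fix 1-bit, catch 2-bit" behaviour concrete, here is a minimal Python sketch of a SEC-DED (single-error-correct, double-error-detect) code. It uses an extended Hamming(8,4) code on a 4-bit value purely to keep it readable; real ECC memory typically protects 64 data bits at a time. The names `encode` and `decode` and the bit layout are assumptions for illustration, not anything a memory controller exposes.

```python
def encode(nibble):
    """Encode 4 data bits as an 8-bit extended Hamming codeword (list of bits)."""
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8                    # index 0: overall parity, 1..7: Hamming positions
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]   # covers positions 1, 3, 5, 7
    code[2] = code[3] ^ code[6] ^ code[7]   # covers positions 2, 3, 6, 7
    code[4] = code[5] ^ code[6] ^ code[7]   # covers positions 4, 5, 6, 7
    code[0] = sum(code[1:]) % 2             # makes total parity even
    return code

def decode(code):
    """Return (status, nibble); status is 'ok', 'corrected' or 'uncorrectable'."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos           # XOR of the positions holding a 1
    odd_parity = sum(code) % 2        # 1 means an odd number of bits flipped
    if syndrome == 0 and odd_parity == 0:
        status = "ok"
    elif odd_parity == 1:
        # Single-bit error: the syndrome is the position of the bad bit
        # (0 means the overall parity bit itself flipped).
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Parity looks even but the syndrome is non-zero: two bits flipped.
        return "uncorrectable", None
    nibble = code[3] | code[5] << 1 | code[6] << 2 | code[7] << 3
    return status, nibble

word = encode(0b1011)
word[6] ^= 1                          # one flipped bit
print(decode(word))
word[2] ^= 1                          # a second flipped bit in the same word
print(decode(word))
```

With three or more flipped bits a code like this can be fooled, which is consistent with the point above: it was never meant to fix everything.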
The thing about some
If you're anti-ECC for ANY reason, then, to follow your logic, you should also be anti-RAID and anti-tape backup.
Funny, I actually remember reading specs for some no-name company's RAID 5 controller that was used in one of the Linux hardware companies' boxes. The thing used EDO DRAM without parity or ECC checking. I was like, great, RAID 5 to protect my data and shitty cache RAM to corrupt it!
So, the conclusion was that parity RAM was justified in the original IBM PC, because RAM errors were really common back then, but such a technology would be obsolete today.
Anyone have more info regarding this?
There seems to be a misunderstanding regarding ECC and parity memory, at least in relation to PCs.
PC memory either has some extra bits (one for every eight bits) for redundancy, or it doesn't. There is no dedicated ECC circuitry on PC memory (except maybe IBM Chipkill memory). The difference between parity memory and ECC memory lies in how the memory controller takes advantage of the extra bits. To get an idea of how ECC really works, see Hamming code.
Regards
Roberto de Iriarte
roberto at spock dot cl
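The "same extra bits, different use" point in the post above can be checked with a little arithmetic. The Python sketch below assumes a 64-bit-wide memory word (typical for DIMMs of this era) and a standard Hamming code plus one overall parity bit for double-error detection; the function names are made up for illustration.

```python
def byte_parity_bits(data_bits):
    """Plain parity: one extra bit for every eight data bits."""
    return data_bits // 8

def secded_check_bits(data_bits):
    """Check bits for a Hamming code plus one overall parity bit (SEC-DED).

    A Hamming code needs the smallest r with 2**r >= data_bits + r + 1;
    double-error detection adds one more bit on top.
    """
    r = 0
    while 2 ** r < data_bits + r + 1:
        r += 1
    return r + 1

for width in (8, 32, 64):
    print(width, "data bits:",
          byte_parity_bits(width), "parity bits vs",
          secded_check_bits(width), "SEC-DED check bits")
```

For a 64-bit word both schemes need 8 extra bits, which is why the same 72-bit-wide module can be run as either parity or ECC memory depending on the controller. For an 8-bit word, SEC-DED would need more bits than simple parity, so the wide word is what makes ECC cheap.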
Errk? WTF? Your logic is, ahh, most confusing...
Harddrives fail early, frequently, and often. They are complex mechanical devices with moving parts. While they have an expected lifespan of hundreds of thousands of hours, the actual lifespan is random, guaranteeing a considerable number of premature failures. Manufacturing issues further limit the expected lifespan. Quite frankly, I have seen these things fail left and right. In a large installation, you average replacing a certain number of harddrives each day.
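As a rough sanity check on the "replacing drives every day" claim, here is a back-of-the-envelope Python sketch. It assumes a constant failure rate, and the fleet size and MTBF figure are hypothetical numbers chosen for illustration, not data from this thread.

```python
def expected_failures_per_day(drives, mtbf_hours):
    """Expected drive failures per day, assuming a constant failure rate."""
    return drives * 24 / mtbf_hours

# Hypothetical installation: 10,000 drives rated at 300,000 hours MTBF.
print(expected_failures_per_day(10_000, 300_000))
```

Even with an MTBF in the hundreds of thousands of hours, a large enough fleet sees failures on most days, and early-life failures and bad batches would only push the number up.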
RAM, unless it's damaged by static during install, is pretty solid. Same can be said for the motherboards. Now, granted, some fly-by-night company might cut corners and sell you crap. But diagnostics will pick this up pretty quickly.
I seem to recall hearing something about google switching from harddrives to RAM-based drives... Factoring in the replacement costs, the RAM approach was cheaper and significantly more reliable...
Bottom line: RAID allows me to recover the last N years worth of data after my harddrive dies. I've had half a dozen harddrives fail on me, personally, so far. My RAM has never failed. Ever. There was that one issue with my motherboard being defective and not handling 2 DIMMS. ECC would not have made any difference. Even with that RAM problem, I still suffered no data-loss on my harddrive.
Now, in contrast to my RAM successes, I have had my CPU fan fail. The CPU incinerated itself. The Linux kernel actually panicked. But ya know something? After replacing my CPU, I still had all my data, intact, on the harddrive.
My point is: ECC only protects you against failures directly on the memory stick itself. These failures are extraordinarily rare. They are rare compared to your chances of being hit by floods, tornados, lightning, etc. IMHO, the cost exceeds the benefits. And there are far more important issues to worry about.
Thanks for that tip FreakyGeeky! Before now I could only run Linux or Win2K Server without the BSoD blues... But now my 2KPro is going again, and I can even play Ghost Recon without crashing! woo hoo!
Censorship appears alive and well on Slashdot. Some low-life decided to mark my post down (-1), despite the fact that it actually *IS* relevant to the topic at hand. So I'm reposting...
BTW: NASA uses computers with multiple, redundant CPUs to detect problems. Do you use multiple CPUs as real-time backups? Where do you think a bit is more likely to become corrupted? In the CPU or in the RAM? (Well, neither, unless the heatsink fails...)
---
In my experience, ECC is not worthwhile. There are too many ways the data can get corrupted before it ever hits the memory stick. ECC only helps if the information is accurately present on the memory data lines attached to the RAM module, and then only when the RAM module itself fails. Otherwise you are just recording, with error-correction, incorrect data. And let's face it: if the memory module itself is fried, ECC ain't going to help.
Testing: I had some rather painful experiences with a FIC-503+ motherboard. Turned out to have a design defect that caused problems when both DIMM slots were utilized, regardless of the RAM type.
To test it, under linux (of course), with a minimal boot, running as few processes as possible, I created a large file (${FILE}) of non-uniform data by cat'ing (combining) several arbitrary convenient large files. About 2x - 3x the total size of all my RAM. I then did:
Any problems showed up right away. (Cksum returned different numbers.)
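The exact command did not survive in the post above, so the following is only a sketch of the kind of test described: checksum the same large file several times and see whether the results ever disagree. It is written in Python using `zlib.crc32` rather than the `cksum` utility the poster mentions (the two compute different CRCs, which does not matter for a same-file comparison); the function names and the number of passes are assumptions.

```python
import zlib

def file_crc(path, chunk_size=1 << 20):
    """CRC-32 of a file, read in chunks so the file can be larger than RAM."""
    crc = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            crc = zlib.crc32(chunk, crc)
    return crc

def repeated_checksums(path, passes=5):
    """Checksum the same file several times in a row."""
    return [file_crc(path) for _ in range(passes)]

def all_agree(checksums):
    """True if every pass produced the same checksum."""
    return len(set(checksums)) == 1
```

Something like `all_agree(repeated_checksums("/path/to/bigfile"))` (the path is a placeholder) should return True on a healthy machine. Making the file two to three times the size of RAM, as the post suggests, keeps it from sitting entirely in the page cache, so each pass pushes fresh data through memory.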
This was a simpler approach than, though not quite as good as, the usual "make 100 Linux kernels and diff the make logs" test.
You might also look at: http://www.bitwizard.nl/sig11/
Granted, HD's are far more failure prone than RAM and can be shut off. But so?
I've had RAM die more than once in more than one machine, etc. I've even had ONE MEMORY CELL/WORD DIE! (I've had one of those "support 1400 end-user machines" and 50+ server support jobs, in addition to other places I've worked over the last 9 years.) So, I probably have seen things you've not seen and you've seen things I haven't. Your "in my experience" is anecdotal evidence, not scientific evidence.
No, ECC cannot correct everything. But you're only partially right when you say, "ECC only protects you against failures directly on the memory stick itself." What about power glitches, especially ones that get past the UPS? Often, this IS a 100% correctable error, with ECC. Why do the manufacturers of servers (especially high-end stuff) rely on ECC RAM -- for uptimes? This is especially true of mainframes and high-end Unix servers. Just go ask IBM or HP/COMPAQ....
The fact that maybe 1 in 1,000 or even 1 in 10,000 DIMMs ever fails outside of manufacturer testing vs. 1 in 10 HD's is irrelevant. RAM still fails, plain and simple.
I'd bet that Google uses ECC RAM....
These days, a 256M DDR DIMM w/ ECC is only a few bucks more, for an extra level of safety and only a modest performance loss. Buy ECC, love ECC.
(Use RAID, too -- on servers. Again, RAID 5 and ECC are very similar technologies. If RAID 5, on failure-prone HD's is considered "fault tolerant enough," then why not ECC RAM?)
Again, it's ONLY A FEW BUCKS! Don't believe it's only a few bucks? Price it on http://www.pricewatch.com/ Today's prices (2002-05-06):
256M RAM PC2100: $37
256M RAM PC2100 w/ ECC: $42
512M RAM PC2100: $100
512M RAM PC2100 w/ ECC: $113
Well, that about deflates that argument.
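For anyone who wants the premium as a percentage, here is a quick Python sketch using only the four prices quoted above; the dictionary layout and the function name `ecc_premium` are just for illustration.

```python
# Prices in dollars as quoted above (2002-05-06): size in MB -> (non-ECC, ECC).
prices = {256: (37, 42), 512: (100, 113)}

def ecc_premium(plain, ecc):
    """Extra cost of the ECC module as a fraction of the non-ECC price."""
    return (ecc - plain) / plain

for size, (plain, ecc) in prices.items():
    print(f"{size}M: +${ecc - plain} ({ecc_premium(plain, ecc):.1%})")
```

That works out to a premium of roughly 13 to 14 percent, which is close to the 12.5 percent of extra memory an ECC module typically carries (72 bits stored for every 64 bits of data).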
Yes, as far as more important issues:
UPSes
Backups
World Hunger