To ECC Or Not To ECC?
MetaHiro asks: "I'm going to be upgrading my system in a couple of weeks. I've been looking around the net for reviews and/or benchmarks for ECC vs. non-ECC, both for speed and for whether or not it's worth it to shell out the extra bucks for ECC. I'm also wondering whether I should buy PC2100 ECC instead of PC2700 non-ECC RAM, or wait until PC2700 ECC becomes available."
For what are you using your system? If it's just another gaming PC then ECC isn't worth it.
How is this helpful? The philosophy behind that seems to be rather than allow my programs to continue with a corrupt bit of data, it's better to halt all operation and LOSE ALL MY DATA and perhaps corrupt my hard drive. That's "help" I don't need.
Is this universal, or just my OS (W2K), BIOS, or hardware? Is there a way for ECC to simply and calmly report a problem without locking up my machine in the process?
I've got a Tyan S2460 MB w/ 1 GB PC2100 ECC RAM & two 1.4 GHz Athlon MPs. It uses PhoenixServer BIOS, and the BIOS gives me these options re: ECC
SERR Signal Condition: (ECC error conditions that SERR# be asserted [sic])
ECC Config: (No ECC; Checking Only; Checking and Correction; Checking and Correction with Scrubbing)
So, my question would be: if this is basically a home machine with no mission-critical stuff running on it, but I'd like to get some benefits from my expensive ECC RAM without BSODing, what settings would be best in this scenario? Right now I've got everything turned off (Signal Condition: None, ECC Disabled). Oh, and what the heck is scrubbing?
And yes, I've attempted to RTFM but it's little more than a pamphlet and I can't find good, clear info about this.
My understanding is that the speed penalty is very small. Is that correct?
Scrubbing detects and corrects memory errors that are in memory addresses that are idle. This prevents correctable errors from turning into uncorrectable errors in sections of memory that are infrequently accessed by the CPU.
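To make the scrubbing idea concrete, here is a toy simulation. It is only a sketch under stated assumptions: each "memory cell" holds a 7-bit Hamming(7,4) codeword (real controllers use much wider codes), and the helper names (`encode`, `correct`, `scrub`, `flip`) are made up for this example. The point it illustrates is that a single flipped bit is fixable, but a second flip in the same cell is not, so a periodic scrub pass that rewrites corrected data keeps single errors from piling up in rarely read cells.

```python
# Toy model: each memory cell holds a 7-bit Hamming(7,4) codeword.
# The code layout and all names here are assumptions for illustration only.
DATA_POSITIONS = (3, 5, 6, 7)  # 1-based positions; 1, 2 and 4 hold check bits


def syndrome(word):
    """XOR of the 1-based positions of all set bits; 0 for a valid codeword."""
    s = 0
    for pos in range(1, 8):
        if (word >> (pos - 1)) & 1:
            s ^= pos
    return s


def encode(nibble):
    """Place 4 data bits, then set check bits so the syndrome becomes 0."""
    word = 0
    for i, pos in enumerate(DATA_POSITIONS):
        if (nibble >> i) & 1:
            word |= 1 << (pos - 1)
    s = syndrome(word)
    for check in (1, 2, 4):
        if s & check:
            word |= 1 << (check - 1)
    return word


def correct(word):
    """Fix a single flipped bit; the syndrome names the bad position."""
    s = syndrome(word)
    return word ^ (1 << (s - 1)) if s else word


def decode(word):
    """Correct (at most one error), then extract the 4 data bits."""
    word = correct(word)
    return sum(((word >> (pos - 1)) & 1) << i
               for i, pos in enumerate(DATA_POSITIONS))


def scrub(memory):
    """Walk every cell, write back corrected data, return the number of fixes."""
    fixes = 0
    for addr, word in enumerate(memory):
        fixed = correct(word)
        if fixed != word:
            memory[addr] = fixed
            fixes += 1
    return fixes


def flip(memory, addr, pos):
    """Simulate a soft error: flip the bit at 1-based position pos."""
    memory[addr] ^= 1 << (pos - 1)
```

If you flip two bits in the same cell with no `scrub` call in between, `decode` returns wrong data; with a `scrub` pass between the two flips, the cell still reads back correctly.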
Mea navis aericumbens anguillis abundat
You might want to try a few searches on groups.google.com for S2460 and the brand of memory you are using. I have an S2460 and I learned the hard way that it is a very fussy board when it comes to what memory you use with it. In particular, Crucial memory can cause problems (although I have now got my board working OK with Crucial memory). The basic rule appears to be that if the memory is not on the Tyan approved list it will be problematic.
(Sorry to reply to my own post)
Some other things you might want to try:
- get the 1.4 version of the BIOS
- set 'maxmem' in boot.ini
- if you have an nVidia GeForce, make sure you have the most recent drivers - some earlier drivers cause a lot of problems
ECC is useless if you are running something like Windows 98, which crashes on its own so much anyway and would BSOD on an ECC error. ECC can't work with a Celeron, only with the PII, PIII, and P4, which also have their own ECC caches. With something like Windows NT/2000/XP it is worth it because they have long uptimes. To the poster who gets BSODs with ECC off: turn it on, like the other poster said.
With the older PC100/PC133-style memory, ECC changed the minimum number of wait states you could have (CAS3 instead of CAS2).
I believe ECC is worth having, even if you are not using the computer to run "mission critical" tasks. Memory problems on a computer without parity or ECC can be very difficult to diagnose. The symptoms may look like a flakey operating system or application, not a hardware problem. I had one computer that would only fail when someone ran the FORTRAN compiler. The symptoms looked exactly like a bug in the FORTRAN compiler. It turned out that one of the DRAM chips had a pattern sensitivity problem that was triggered by the image of the FORTRAN compiler. These kinds of problems can be difficult to detect and fix without hardware support. The memory diagnostics in the power-on self-test in the BIOS will detect hard errors, but not more subtle errors.
Do yourself a favor and check out a new month-old internet site called Serence. Since April 2002 they have had a free / no ads / no spyware download for your Windows desktop called Klipfolio and this thing is great. According to the site statistics 3400+ slashdotters have already downloaded the Slashdot Klip and after joining them today, I can see why. (No, I don't have any personal vested interest in this, I just think it's cool.) The Slashdot Klip stays on your desktop and downloads the XML feed from Andover/Slashdot containing current article headlines, alerting you when there is a new one. Klips from a few dozen other founding news sources with XML newsfeeds are also available in a scrolling, dockable, resizable, skinnable package. In the lower left corner of the previous link you can suggest a new klip feed to Serence you'd like to see - a great thing for you to do, the more sites that use this, the better for all of us. You can even start up your own personal Klip feed! Rack up your favorite sites in one desktop package and you are really in control...click on a headline, up comes the article, click on the site symbol, up comes the home page. Like any dot-com, Serence's success depends on market penetration and this is one idea I think deserves to be slashdotted so it has a shot at succeeding...
Offtopic? Not really, the article I'm posting under went out as a Slashdot Klip headline. And what good is maxed out karma if not to risk it in spreading the word about a cool new Slashdot feature?
I've got a Tyan Tiger MP too. Funny thing is, I get all kinds of lockups, blue screens, and reboots when ECC is disabled. The only way I can keep my machine running is to turn "Both" and "ECC Scrubbing" on in the BIOS. Hope it helps...
Set SERR to None, so that an ECC error won't raise an NMI and BSOD the machine, and set ECC Config to Checking and Correction with Scrubbing.
Whether you use parity, non-parity, or even ECC, you should ALWAYS test your RAM sticks with MemTest86.
Test them when newly purchased (I've received duds from brand-name online memory warehouses.) Test them every few months (they can and do go bad.) Especially test when your computer exhibits otherwise unexplainable behavior, like: Windows BSoD, kernel panics, characters changing themselves on disk willy-nilly, programs crashing for no good reason, or going bad on disk and needing reinstallation. Disk files that go corrupt. Any of the above, even (or especially) when it seems inconsistent, can be caused by a few bad blocks in a RAM stick.
MemTest86 is a program that boots and runs off a floppy (it has its own boot loader, no OS), and t-h-o-r-o-u-g-h-l-y tests your RAM. It even detects adjacent cell errors, where a 1 in cell n can bias the threshold of the 0 in cell n+1 or n-1 until it is read as a 1.
It even knows how to differentiate between cache memory errors and RAM errors. Just do it (after nightmare hardware problems, MemTest86 showed me what was broken; I can't say enough good things about it). Its user interface could be more informative, but when it spots an error, you'll know.
Big Daddy, Johnny, Burp, Aunt Zelda, Scott, Slurp, Big Momma
In my experience, ECC is not worthwhile. There are too many ways the data can get corrupted before it ever hits the memory stick. ECC only helps if the information is accurately present on the memory data lines attached to the RAM module, and then only when the RAM module itself fails. Otherwise you are just recording, with error correction, incorrect data. And let's face it: if the memory module itself is fried, ECC ain't going to help.
Testing: I had some rather painful experiences with a FIC-503+ motherboard. Turned out to have a design defect that caused problems when both DIMM slots were utilized, regardless of the RAM type.
To test it, under linux (of course), with a minimal boot, running as few processes as possible, I created a large file (${FILE}) of non-uniform data by cat'ing (combining) several arbitrary convenient large files. About 2x - 3x the total size of all my RAM. I then did:
Any problems showed up right away. (Cksum returned different numbers.)
This was a simpler approach, though not quite as good, as the general make 100 linux kernels and diff the make-logs.
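The exact command is not shown in the post above, so the following is not the poster's method, just one way the same idea could be sketched: checksum one big file several times and complain if the passes disagree. The function names are hypothetical, and `zlib.crc32` is used instead of the `cksum` tool, so the numbers will not match what `cksum` prints. As in the description above, the file should be well over the size of RAM (2x to 3x), so each pass really drags the data through memory again rather than coming out of cache.

```python
import zlib


def file_crc32(path, chunk_size=1 << 20):
    """CRC-32 of a file, read in chunks so the file can be larger than RAM."""
    crc = 0
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            crc = zlib.crc32(chunk, crc)
    return crc


def repeated_checksums(path, passes=5):
    """Checksum the same file several times in a row."""
    return [file_crc32(path) for _ in range(passes)]


def all_passes_agree(checksums):
    """True if every pass produced the same checksum."""
    return len(set(checksums)) == 1
```

On a healthy machine `all_passes_agree(repeated_checksums(path))` should always be true. A mismatch means the data changed somewhere between disk, bus, memory and CPU; it does not by itself tell you which of those is at fault.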
You might also look at: http://www.bitwizard.nl/sig11/
So, the conclusion was that parity RAM was justified in the original IBM PC, because RAM errors were really common back then, but such a technology would be obsolete today.
Anyone have more info regarding this?
Did you know you can fertilize your lawn with used motor oil?
There seems to be a misunderstanding regarding ECC and parity memory, at least in relation to PCs.
PC memory either has some extra bits (one for every eight data bits) for redundancy, or it doesn't. There is no dedicated ECC circuitry on PC memory modules (except maybe IBM Chipkill memory). The difference between parity memory and ECC memory lies in how the memory controller takes advantage of the extra bits. To get an idea of how ECC really works, see Hamming code.
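As a rough illustration of that last point, here is a sketch in which the same 8 extra bits per 64 data bits are used in two ways: as one parity bit per byte, and as an extended Hamming code (7 check bits plus one overall parity bit). The bit layout and function names are assumptions made for this example; real chipsets use their own code matrices.

```python
# 64 data bits + 8 extra bits, used two different ways.
# Layout and names are illustrative assumptions, not any chipset's real scheme.
CHECK_POSITIONS = [1, 2, 4, 8, 16, 32, 64]
DATA_POSITIONS = [p for p in range(1, 72) if p not in CHECK_POSITIONS]  # 64 slots


def parity(x):
    """1 if x has an odd number of set bits, else 0."""
    return bin(x).count("1") & 1


# --- Use 1: byte parity (detects an odd number of flips per byte, fixes nothing)
def parity_encode(data):
    bits = 0
    for b in range(8):
        bits |= parity((data >> (8 * b)) & 0xFF) << b
    return bits


def parity_check(data, bits):
    return parity_encode(data) == bits


# --- Use 2: extended Hamming code (corrects 1 flipped bit, detects 2)
def syndrome(code):
    """XOR of the positions (1..71) of all set bits."""
    s = 0
    for pos in range(1, 72):
        if (code >> pos) & 1:
            s ^= pos
    return s


def ecc_encode(data):
    code = 0
    for i, pos in enumerate(DATA_POSITIONS):
        if (data >> i) & 1:
            code |= 1 << pos
    s = syndrome(code)
    for p in CHECK_POSITIONS:      # set check bits so the syndrome becomes 0
        if s & p:
            code |= 1 << p
    return code | parity(code)     # bit 0: overall parity, total weight even


def ecc_decode(code):
    """Return (data, status); data is None if the error is uncorrectable."""
    s, odd = syndrome(code), parity(code)
    if s == 0 and not odd:
        status = "ok"
    elif odd and s <= 71:
        code ^= 1 << s             # single error; s == 0 means bit 0 itself
        status = "corrected"
    else:
        return None, "uncorrectable"   # e.g. two flipped bits
    data = 0
    for i, pos in enumerate(DATA_POSITIONS):
        data |= ((code >> pos) & 1) << i
    return data, status
```

With the parity scheme, a single flipped bit makes `parity_check` fail, but there is no way to tell which bit it was, and two flips in the same byte go unnoticed. With the Hamming scheme, `ecc_decode` hands back the original data after any single flip and reports "uncorrectable" for a double flip. Same number of extra bits, different use by the controller.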
Regards
Roberto de Iriarte
roberto at spock dot cl
Thanks for that tip FreakyGeeky! Before now I could only run linux or Win2K Server without the BSoD blues... But now my 2KPro is going again, and I can even play Ghost Recon without crashing! woo hoo!@
Censorship appears alive and well on Slashdot. Some low-life decided to mark my post down (-1), despite the fact that it actually *IS* relevant to the topic at hand. So I'm reposting...
BTW: NASA uses computers with multiple, redundant CPUs to detect problems. Do you use multiple CPUs as real-time backups? Where do you think a bit is more likely to become corrupted? In the CPU or in the RAM? (Well, neither, unless the heatsink fails...)