Slashdot Mirror


To ECC Or Not To ECC?

MetaHiro asks: "I'm going to be upgrading my system in a couple of weeks. I've been looking around the net for reviews and/or benchmarks for ECC vs. non-ECC in both speed and whether or not it's worth it to shell out the extra bucks for ECC. I'm also wondering whether or not I should buy PC2100 ECC instead of PC2700 non-ECC RAM or wait until PC2700 ECC becomes available."

11 of 46 comments

  1. I can't understand ECC by EricLivingston · · Score: 4, Informative
    I've got PC2100 ECC in my server at home, and I've turned ECC checking off in the BIOS. What I can't fathom is this: In the past several months I've gotten a couple of parity errors in my memory. However, instead of warning me in some way and allowing me to gracefully shut down, the error raises a non-maskable interrupt which halts the machine in its tracks, giving me a Blue Screen of Death and requiring a hard reset.

    How is this helpful? The philosophy behind that seems to be rather than allow my programs to continue with a corrupt bit of data, it's better to halt all operation and LOSE ALL MY DATA and perhaps corrupt my hard drive. That's "help" I don't need.

    Is this universal, or just my OS (W2K), BIOS, or hardware? Is there a way for ECC to simply and calmly report a problem without locking up my machine in the process?

    --
    Please Rate my comment (and help support Fre
    1. Re:I can't understand ECC by geirt · · Score: 3, Informative

      Because you turned off the ECC checking ....

      The idea with ECC is that the ECC controller (a piece of hardware between the CPU and the RAM) should detect the bit error and correct it "on the fly", so that the application should not be affected by the bit error at all. You get a speed penalty by doing ECC, because the ECC controller has to calculate a checksum for every write, and check the checksum for every read. Even worse, the ECC checksum is block based, so the ECC controller has to read the whole block to calculate the checksum even if the CPU only reads a single byte. The same goes for writes. To use ECC you need special RAM which is 72 bits wide instead of 64 bits; the extra bits are used for the checksum. This also explains why ECC RAM is more expensive than non-ECC RAM.
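
A rough sketch of where the 8 extra bits could come from, assuming the common textbook construction (a Hamming code plus one overall parity bit). The post does not say which code a given chipset actually uses, and the function name `secded_check_bits` is made up for this example.

```python
def secded_check_bits(data_bits):
    """Check bits needed by a Hamming code plus one overall parity bit."""
    r = 0
    # Single-error correction needs 2**r >= data_bits + r + 1, so that every
    # bit position (data or check) plus "no error" gets its own syndrome.
    while 2 ** r < data_bits + r + 1:
        r += 1
    # One more overall parity bit allows double-error detection.
    return r + 1


for width in (8, 16, 32, 64):
    extra = secded_check_bits(width)
    print(f"{width} data bits -> {extra} check bits, {width + extra} total")
```

Under that assumption a 64-bit word needs 8 check bits, which matches the 72-bit width. It also suggests why the code is computed over the whole 64-bit word: protecting each byte separately would take 5 check bits per byte.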

      --

      RFC1925
    2. Re:I can't understand ECC by martyb · · Score: 5, Informative

      I've got PC2100 ECC in my server at home, and I've turned ECC checking off in the BIOS. What I can't fathom is this: In the past several months I've gotten a couple of parity errors in my memory. However, instead of warning me in some way and allowing me to gracefully shut down, the error raises a non-maskable interrupt which halts the machine in its tracks, giving me a Blue Screen of Death and requiring a hard reset.

      If you turned off ECC, then when there's a single bit error, the parity can detect it, but the ECC is not there to correct it -- and your computer raises an interrupt to flag it. Microsoft takes that to be a Very Bad Thing and throws up a BSOD. Turn on the ECC and you'll be protected from single bit errors and keep on running. If you're interested, what follows is a brief summary of parity and ECC from my long-ago experience and memory (which does NOT have ECC; so if anything I've written is wrong, I'd appreciate corrections from those with more recent experience/knowledge!)

      But. ECC as implemented on PCs can't fix everything. It was years and years ago, but I once had to write some ECC routines to validate programs read into a diagnostic computer for VAXes and DEC-10s. Like in most things, there's a tradeoff between price, speed, and reliability.

      First off, memory with no parity. (For the sake of example, I'll refer to storage units as bytes, but this could just as easily be applied to larger units of storage; e.g. 16 or 32 bit words.) A byte is stored simply as 8 bits. If there is an error writing or reading a bit from memory, there's no indication that anything is wrong. Your programs just keep running with bad values which, if in an instruction, can rapidly cause a crash. If the error is in data, someone's paycheck may be way off. Very Not Good.

      Next, let's consider memory with parity. Parity comes in two forms: even parity and odd parity. For the sake of example, say we have "even parity". So, for a byte that contains an odd number of one bits, the parity bit would be set to one. If the byte contains an even number of one bits, then the parity bit would be set to zero. When a byte is read from memory, the parity is computed again and compared against the parity that was written when the byte was originally stored in memory. If the stored parity matches the calculated parity, all is well. If there is a discrepancy, it would raise an interrupt and you get the BSOD. But what happens if there are TWO bits that are in error? They'd cancel out each other in the parity calculation and it would appear things are okay. No BSOD, but things are not right.
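
The even-parity scheme described above can be sketched in a few lines. This is only an illustration: the helper names `even_parity_bit` and `parity_ok` are invented here, and real hardware does this in logic gates, not software.

```python
def even_parity_bit(byte):
    """Return the bit that makes the total count of one bits even."""
    return bin(byte & 0xFF).count("1") % 2


def parity_ok(byte, stored_parity):
    """Recompute parity on read and compare it with the stored bit."""
    return even_parity_bit(byte) == stored_parity


original = 0b10110010                  # four one bits, so the parity bit is 0
parity = even_parity_bit(original)

one_flip = original ^ 0b00000100       # a single bit error
two_flips = original ^ 0b00010100      # two bit errors in the same byte

print(parity_ok(original, parity))     # stored value reads back clean
print(parity_ok(one_flip, parity))     # single error is caught
print(parity_ok(two_flips, parity))    # double error slips through
```

The last check passes even though the byte is wrong, which is the "No BSOD, but things are not right" case.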

      Finally, let's look at memory with ECC. Although there are various levels of ECC, generally what is implemented in PCs fits the mold of "single error correction, double error detection". So, if there is a single bit error in a memory access, the ECC can detect and fix it. This is done by storing even more bits in addition to the byte to be stored. These bits are computed in such a way that if there is a single bit error, the use of the extra bits can identify and fix the bit that is in error. If there are two bits that are wrong (which would go unnoticed in the parity scheme) the ECC bits can be used to identify that there are two bits in error.
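
To make "single error correction, double error detection" concrete, here is a toy sketch that protects a 4-bit value with a Hamming(7,4) code plus an overall parity bit. It is an assumption-laden illustration, not what any particular chipset does: PC memory controllers work on 64-bit words, and the names `encode`, `decode` and `syndrome_of` are invented for this example.

```python
DATA_POSITIONS = (3, 5, 6, 7)   # Hamming positions that hold data bits


def syndrome_of(code):
    """XOR together the positions (1..7) of all set bits; 0 for a valid word."""
    s = 0
    for pos in range(1, 8):
        if (code >> pos) & 1:
            s ^= pos
    return s


def encode(nibble):
    """Pack 4 data bits into an 8-bit codeword (bit 0 = overall parity)."""
    code = 0
    for bit, pos in enumerate(DATA_POSITIONS):
        if (nibble >> bit) & 1:
            code |= 1 << pos
    s = syndrome_of(code)
    for p in (1, 2, 4):              # set check bits so the syndrome becomes 0
        if s & p:
            code |= 1 << p
    if bin(code).count("1") % 2:     # make the total number of one bits even
        code |= 1
    return code


def decode(code):
    """Return (nibble, status); nibble is None when the error is uncorrectable."""
    s = syndrome_of(code)
    odd = bin(code).count("1") % 2
    if odd:                          # odd overall parity: treat as one flipped bit
        code ^= 1 << s               # the syndrome names the position to repair
        status = "corrected"
    elif s:                          # even parity but nonzero syndrome: two flips
        return None, "double error detected"
    else:
        status = "ok"
    nibble = 0
    for bit, pos in enumerate(DATA_POSITIONS):
        nibble |= ((code >> pos) & 1) << bit
    return nibble, status


word = encode(0b1011)
print(decode(word))
print(decode(word ^ (1 << 5)))               # one bit flipped
print(decode(word ^ (1 << 5) ^ (1 << 2)))    # two bits flipped
```

Three or more flipped bits are outside what this code guarantees; they can be mistaken for a single error and "corrected" to the wrong value.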

      For the truly paranoid, or where uptime is absolutely mandatory, it's possible to construct ECCs such that any bit errors could be detected and corrected. But, the tradeoff is that it would take a lot more bits and it would take more time to perform these calculations -- on every single memory access. And, of course, it would cost a lot more to have all those extra bits around as well as the circuitry to perform the ECC calculations.

      So, if a BSOD is just an inconvenience for you, ignore the ECC. You'll get better performance at a lower cost for the memory.
      If you're developing accounting programs or some medical application, then the downtime from a BSOD would be a Majorly Bad Event. ECC would protect you from single bit errors and your application would keep running; definitely a Very Good Thing.

      In short, unless you're doing something completely out of the ordinary for a home user, just stick with the usual parity-backed memory. Hope this helps!

    3. Re:I can't understand ECC by tzanger · · Score: 3, Informative

      You get a speed penalty by doing ECC, because the ECC controller have to calculate a check sum for every write, and check the check sum for every read. Even worse, the ECC check sum is block based, so the ECC controller have to read the whole block to calculate the check sum even if the CPU only reads a single byte.

      It's my understanding that ECC works on the smallest data width available, which is 64 bits on SDRAM anyway. When your P4/2.1G grabs a single byte and there's a cache miss, it fetches the entire row (8 bytes) from RAM anyway. Block-based, yes, but no bigger than normal.

      As for speed performance, give me a break. The RAM is zillions of dollars more expensive not only because of the extra memory cells but also because (IIRC) the code generation is done on-chip. The checking is done by the chipset, IIRC, and that is all done in hardware and I'm willing to bet in the same amount of time that you can do a normal read in anyway. I haven't been able to google up performance benchmarks on it though.

      Why the original poster was getting parity errors was not only because he had ECC checking turned off in BIOS, but also because ECC cannot correct all bit errors. Most ECC checks can correct single-bit errors, but double-bit and higher errors cannot be corrected since enough correction code is not stored. It's just like hard drives, CDROMs and DVDs: they can spot and correct incredible amounts of relatively small errors without telling anybody about it, but if the error is just too large it has to pass back a "bad data" message.

      In fact some of the only reasons that we have such huge amounts of storage in formats that we can actually paw up with our hands is because there is such a vast amount of error correction coding on the media.

    4. Re:I can't understand ECC by waytoomuchcoffee · · Score: 5, Informative

      ECC is NOT parity checking. Parity checking is able to tell if one bit is wrong, and if so, to send a parity error (it keeps an extra bit to check against). However, it can't tell which bit was flipped, so it can't correct it. ECC, on the other hand, CAN tell which bit is bad, and therefore can correct it. It can also detect a two-bit error, but has to send a parity error, because it can't correct them both.

      Actually, it's your chipset that reads the data from the memory and sends the parity error, and/or makes the actual corrections in the case of ECC. Even though your ECC is turned off, parity is still active. Your chipset reads the extra bit in your ECC memory, sees it doesn't add up to the rest of the bits, and sends out a parity error. The solution is to turn your ECC back on, and they should go away, as it will use the ECC info from your ECC memory to correct it instead (unless they are the much rarer two-bit kind -- if you get these often your memory is probably defective).

      Also, to comment on someone else, the older ECC correction slowed your system down by around 5 percent. Recent changes will slow it down 1-2 percent, no big deal.

    5. Re:I can't understand ECC by geirt · · Score: 3, Informative

      tzanger wrote:
      > It's my understanding that ECC works on the smallest data width available, which is 64 bits on SDRAM
      > anyway. When your P4/2.1G grabs a single byte and there's a cache miss, it fetches the entire row
      > (8 bytes) from RAM anyway. Block-based, yes, but no bigger than normal.

      Wrong. SDRAM is 64 bit wide, but can be written in units of 8 bit. A read is always 64 bit wide. A single 8 bit write to ECC RAM becomes a 64 bit read, check ECC, update byte to be written, recalculate ECC, write. So you are going to get a small speed penalty compared to the non ECC case of just a single 8 bit wide write.
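
The read-modify-write sequence can be sketched as a small simulation. Everything here is hypothetical: `EccWordMemory` is an invented class, and `check_bits` is a stand-in for whatever code the chipset really computes. The only property that matters for the example is that the check value depends on all 64 bits, so it cannot be updated from one byte alone.

```python
MASK64 = (1 << 64) - 1


def check_bits(word):
    """Stand-in check code: XOR of the (1-based) positions of all set bits."""
    c = 0
    for i in range(64):
        if (word >> i) & 1:
            c ^= i + 1
    return c


class EccWordMemory:
    """Simulated memory made of 64-bit words, each stored with a check value."""

    def __init__(self, words):
        self.data = [0] * words
        self.check = [check_bits(0)] * words
        self.reads = 0
        self.writes = 0

    def read_word(self, index):
        self.reads += 1
        word = self.data[index]
        if check_bits(word) != self.check[index]:
            raise RuntimeError("ECC mismatch")
        return word

    def write_word(self, index, word):
        self.writes += 1
        self.data[index] = word & MASK64
        self.check[index] = check_bits(self.data[index])  # recalculate ECC

    def write_byte(self, index, lane, value):
        word = self.read_word(index)                      # 64-bit read + check
        shift = 8 * lane
        word = (word & ~(0xFF << shift)) | ((value & 0xFF) << shift)  # merge byte
        self.write_word(index, word)                      # full-width write


mem = EccWordMemory(4)
mem.write_word(0, 0x1122334455667788)
mem.write_byte(0, 0, 0xAA)
print(hex(mem.read_word(0)), mem.reads, mem.writes)
```

In this model a one-byte store costs an extra full-width read that a plain byte write would not need, which is the small penalty being described.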

      --

      RFC1925
  2. Re:I can't understand ECC - specifics by EricLivingston · · Score: 3, Interesting
    Thank you for that explanation - I think I understand it better, though I'm confused a bit because the two times I've gotten a BSOD were with ECC enabled, which is why I turned it off to see if they would go away...

    I've got a Tyan S2460 MB w/ 1Gig PC2100 ECC RAM & 2 1.4GHz Athlon MPs. It uses PhoenixServer BIOS, and the BIOS gives me these options re: ECC

    SERR Signal Condition: (ECC error conditions that SERR# be asserted [sic])

    • None
    • Single Bit
    • Multiple Bit
    • Both

    ECC Config: (No ECC, Checking Only, Checking and Correction and Checking, Correction with Scrubbing)

    • Disabled
    • Checking Only
    • Checking and Correction
    • Checking, Correction w/ Scrubbing

    So, my question would be: if this is basically a home machine with no mission-critical stuff running on it, but I'd like to get some benefits from my expensive ECC RAM without BSODing, what settings would be best in this scenario? Right now I've got everything turned off (Signal Condition: None, ECC Disabled). Oh, and what the heck is scrubbing?

    And yes, I've attempted to RTFM but it's little more than a pamphlet and I can't find good, clear info about this.

    --
    Please Rate my comment (and help support Fre
  3. Re:I can't understand ECC - specifics by Detritus · · Score: 4, Informative
    You want "Checking, Correction w/ Scrubbing".

    Scrubbing detects and corrects memory errors that are in memory addresses that are idle. This prevents correctable errors from turning into uncorrectable errors in sections of memory that are infrequently accessed by the CPU.
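
A toy simulation of why that matters, assuming a single-error-correcting, double-error-detecting code (here a Hamming(7,4) code plus overall parity on 4-bit values, chosen only to keep the example short). The names `encode`, `correct` and `scrub` are invented, and a real chipset scrubs in hardware on its own schedule.

```python
def syndrome_of(code):
    """XOR of the positions (1..7) of set bits; 0 for a valid codeword."""
    s = 0
    for pos in range(1, 8):
        if (code >> pos) & 1:
            s ^= pos
    return s


def encode(nibble):
    """Build an 8-bit codeword: data at positions 3,5,6,7, bit 0 = overall parity."""
    code = 0
    for bit, pos in enumerate((3, 5, 6, 7)):
        if (nibble >> bit) & 1:
            code |= 1 << pos
    s = syndrome_of(code)
    for p in (1, 2, 4):
        if s & p:
            code |= 1 << p
    if bin(code).count("1") % 2:
        code |= 1
    return code


def correct(code):
    """Return (codeword, status), repairing a single flipped bit if present."""
    s = syndrome_of(code)
    if bin(code).count("1") % 2:      # odd parity: one flipped bit at position s
        return code ^ (1 << s), "corrected"
    if s:                             # even parity, bad syndrome: two flipped bits
        return code, "uncorrectable"
    return code, "ok"


def scrub(memory):
    """Walk every word, write back any single-bit repairs, return the count."""
    fixed = 0
    for addr, code in enumerate(memory):
        repaired, status = correct(code)
        if status == "corrected":
            memory[addr] = repaired
            fixed += 1
    return fixed


clean = [encode(n) for n in range(16)]

# With scrubbing: the first flip is repaired before the second one arrives.
scrubbed = list(clean)
scrubbed[5] ^= 1 << 3
print(scrub(scrubbed))
scrubbed[5] ^= 1 << 6
print(correct(scrubbed[5])[1])

# Without scrubbing: both flips pile up in the same rarely read word.
idle = clean[5] ^ (1 << 3) ^ (1 << 6)
print(correct(idle)[1])
```

The same two bit flips end up "corrected" in one case and "uncorrectable" in the other; the only difference is whether a scrub pass ran in between.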

    --
    Mea navis aericumbens anguillis abundat
  4. ECC is worth having by Detritus · · Score: 3, Insightful

    I believe ECC is worth having, even if you are not using the computer to run "mission critical" tasks. Memory problems on a computer without parity or ECC can be very difficult to diagnose. The symptoms may look like a flakey operating system or application, not a hardware problem. I had one computer that would only fail when someone ran the FORTRAN compiler. The symptoms looked exactly like a bug in the FORTRAN compiler. It turned out that one of the DRAM chips had a pattern sensitivity problem that was triggered by the image of the FORTRAN compiler. These kinds of problems can be difficult to detect and fix without hardware support. The memory diagnostics in the power-on self-test in the BIOS will detect hard errors, but not more subtle errors.

    --
    Mea navis aericumbens anguillis abundat
  5. Re:ECC where useful / speed compromise by Detritus · · Score: 4, Informative

    ECC protection of main memory is distinct from ECC protection of CPU cache memory. They are independent. You can have ECC main memory with or without ECC cache. On PCs, the ECC encoder/decoder for cache is on the CPU chip, the ECC encoder/decoder for main memory is part of the chipset.

    --
    Mea navis aericumbens anguillis abundat
  6. MemTest86 Early, MemTest86 Often by schmaltz · · Score: 5, Informative

    Whether you use parity, non-parity, or even ECC, you should ALWAYS test your RAM sticks with MemTest86.

    Test them when newly purchased (I've received duds from brand-name online memory warehouses.) Test them every few months (they can and do go bad.) Especially test when your computer exhibits otherwise unexplainable behavior, like: Windows BSoD, kernel panics, characters changing themselves on disk willy-nilly, programs crashing for no good reason, or going bad on disk and needing reinstallation. Disk files that go corrupt. Any of the above, even (or especially) when it seems inconsistent, can be caused by a few bad blocks in a RAM stick.

    MemTest86 is a program that boots and runs off floppy (has its own boot loader, no OS), and t-h-o-r-o-u-g-h-l-y tests your ram. It even detects adjacent cell errors, where a 1 in cell n can threshold bias the 0 in cell n+1 or n-1 until it is considered a 1.
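
To show how a pattern test can catch that kind of adjacent cell error, here is a sketch using March C-, a textbook memory test, run against a simulated faulty memory. This is not claimed to be MemTest86's own algorithm; `FaultyMemory` and `march_test` are invented names, and the fault model (writing a 1 to one cell forces its neighbour to 1) is a simplification.

```python
class FaultyMemory:
    """Simulated bit cells; writing 1 to `aggressor` also forces `victim` to 1."""

    def __init__(self, size, aggressor=None, victim=None):
        self.cells = [0] * size
        self.aggressor = aggressor
        self.victim = victim

    def write(self, addr, bit):
        self.cells[addr] = bit
        if addr == self.aggressor and bit == 1:
            self.cells[self.victim] = 1

    def read(self, addr):
        return self.cells[addr]


# March C-: (direction, value expected on read, value written afterwards)
MARCH_C_MINUS = [
    (+1, None, 0),   # fill with 0
    (+1, 0, 1),      # ascending: read 0, write 1
    (+1, 1, 0),      # ascending: read 1, write 0
    (-1, 0, 1),      # descending: read 0, write 1
    (-1, 1, 0),      # descending: read 1, write 0
    (+1, 0, None),   # final read of 0
]


def march_test(mem, size):
    """Return the sorted addresses where a read did not match the expected bit."""
    bad = set()
    for direction, expect, value in MARCH_C_MINUS:
        addrs = range(size) if direction > 0 else range(size - 1, -1, -1)
        for addr in addrs:
            if expect is not None and mem.read(addr) != expect:
                bad.add(addr)
            if value is not None:
                mem.write(addr, value)
    return sorted(bad)


print(march_test(FaultyMemory(8), 8))                         # healthy memory
print(march_test(FaultyMemory(8, aggressor=4, victim=5), 8))  # cell n disturbs n+1
print(march_test(FaultyMemory(8, aggressor=4, victim=3), 8))  # cell n disturbs n-1
```

Walking the addresses in both directions is what lets the test see a disturbed neighbour on either side; a single ascending pass would miss the n-1 case in this model.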

    It even knows how to differentiate between cache memory errors and RAM errors. Just do it (after nightmare hardware problems, MemTest86 showed me what was broken- can't say enough good things about it.) Its user interface could be more informative, but when it spots an error, you'll know.

    --
    Big Daddy, Johnny, Burp, Aunt Zelda, Scott, Slurp, Big Momma ... where's Siggy?