Slashdot Mirror


Tracking Down a Single-Bit RAM Error

Hanji writes "We have discussed here before the potential effects of and protections against cosmic ray radiation, but for the average computer user, it's an obscure threat that doesn't affect them in any real way. Well, here's a blog post that describes a strange segfault and, after extensive debugging, traces it down to a single bit flip, probably caused by a stray cosmic ray. Lots of helpful descriptions of Linux debugging techniques in this one, and a pretty clear demonstration that this can be a real problem. I know I'm never buying a desktop without ECC RAM ever again!" The author acknowledges that it might not have been a cosmic ray-based error, but the troubleshooting steps are interesting no matter what the cause.

8 of 277 comments

  1. It's not cosmic. It's from the die/package by EmagGeek · · Score: 5, Informative

    Soft errors in DRAM are far more likely to be the result of alpha particles emitted by radioactive decay in the die and packaging materials.

    1. Re:It's not cosmic. It's from the die/package by Anonymous Coward · · Score: 3, Informative

      People don't realize that lead is mildly radioactive, and the decay from solders on the connectors or chassis can also cause bit flips. Very old processed lead, such as that used for the roofs of some European cathedrals, has been used to build supercomputers since more of the radioactivity has decayed.

      I'm unclear as to how this "processing" of the lead has reduced its natural radioactivity...

      Pb-210 is in the U-238 and Rn-222 decay chains, so lead ore in the ground has a constant source of Pb-210 being generated due to uranium contamination. Likewise, radon gas can seep into the lead ore deposits and provide a fresh influx of Pb-210. Once the lead is smelted and purified, the uranium contamination is removed and it's no longer exposed to radon, so the number of Pb-210 atoms in the sample starts decreasing significantly.
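
      A rough sketch of why age matters, assuming a Pb-210 half-life of about 22.3 years (that figure is not from the posts above) and no fresh Pb-210 coming in once the lead has been purified. The function name is made up for illustration.

```python
# Assumption: Pb-210 half-life of roughly 22.3 years; not stated in the thread.
PB210_HALF_LIFE_YEARS = 22.3

def fraction_remaining(years, half_life=PB210_HALF_LIFE_YEARS):
    """Fraction of the original Pb-210 atoms left after `years` with no resupply."""
    return 0.5 ** (years / half_life)

# Compare freshly smelted lead with lead that has sat for decades or centuries.
for age in (0, 22.3, 100, 500, 2000):
    print(f"{age:>6} years since smelting: {fraction_remaining(age):.3e} of the Pb-210 left")
```

      Under that assumption, lead that was smelted centuries ago has essentially none of its original Pb-210 left, which is the appeal of cathedral roofs and Roman ingots.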

  2. Re:erm.... by JesseL · · Score: 3, Informative

    Would it really be so hard to read the article before posting?

    --
    "Prefiero morir de pie que vivir siempre arrodillado!"
  3. Also by Sycraft-fu · · Score: 5, Informative

    Disks have a lot, and I mean a LOT of ECC on them. It is not a situation of "I need to write a 1 so I'll place one at this location on the drive." They use a complex encoding scheme so that bit errors on the disk don't yield data errors to the user.
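
    To give a feel for what an error-correcting code does, here is a minimal sketch of the textbook Hamming(7,4) code, which repairs any single flipped bit in a 7-bit codeword. This is an illustration only: it is not the scheme any real drive uses (drives use much stronger codes), and the function names are made up for the example.

```python
def hamming74_encode(data):
    """Encode 4 data bits [d1, d2, d3, d4] into a 7-bit Hamming codeword."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4   # parity over codeword positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4   # parity over codeword positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4   # parity over codeword positions 4, 5, 6, 7
    # Codeword layout, positions 1..7: p1 p2 d1 p3 d2 d3 d4
    return [p1, p2, d1, p3, d2, d3, d4]

def hamming74_decode(codeword):
    """Fix at most one flipped bit; return (data_bits, error_position).

    error_position is 0 when no error was detected, otherwise the
    1-based position of the bit that was corrected.
    """
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    position = s1 + 2 * s2 + 4 * s3   # the syndrome spells out the bad position
    if position:
        c[position - 1] ^= 1          # flip the bad bit back
    return [c[2], c[4], c[5], c[6]], position

word = [1, 0, 1, 1]
received = hamming74_encode(word)
received[4] ^= 1                      # simulate a single bit flip on the medium
decoded, position = hamming74_decode(received)
print(decoded == word, position)
```

    The user never sees the flipped bit; the decoder quietly repairs it, which is the point being made above.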

    Then there's the fact that bits aren't even stored as bits really. All current drives use (E)PRML, which is (Extended) Partial Response Maximum Likelihood. What this means is bits aren't encoded as a high-low state or FM wave or any of that. They are written using flux reversals, but the level is not carefully controlled; it can't be. So when you read the data the drive actually looks at an analogue wave. It samples the partial response it gets, and then finds the most likely bit pattern that matches.
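
    A toy sketch of the "maximum likelihood" part, under loud assumptions: it models the read channel as a simple dicode partial response (each sample is the current bit minus the previous one), where real drives use higher-order targets and do this in hardware. The detector is a small Viterbi search for the bit sequence whose ideal samples are closest to what was actually read. All names here are invented for the example.

```python
def channel(bits, prev=0):
    """Ideal dicode partial-response samples: y[k] = x[k] - x[k-1]."""
    samples = []
    for bit in bits:
        samples.append(bit - prev)
        prev = bit
    return samples

def viterbi_detect(samples, start=0):
    """Return the bit sequence whose ideal samples best match `samples`.

    "Best" means least total squared error, which is the maximum
    likelihood choice if the noise is Gaussian (an assumption).
    """
    inf = float("inf")
    # cost[s] is the best error of any path whose latest bit is s.
    cost = [0.0 if s == start else inf for s in (0, 1)]
    paths = [[], []]
    for y in samples:
        new_cost = [inf, inf]
        new_paths = [None, None]
        for prev in (0, 1):
            if cost[prev] == inf:
                continue              # this state is not reachable yet
            for bit in (0, 1):
                metric = cost[prev] + (y - (bit - prev)) ** 2
                if metric < new_cost[bit]:
                    new_cost[bit] = metric
                    new_paths[bit] = paths[prev] + [bit]
        cost, paths = new_cost, new_paths
    return paths[0] if cost[0] <= cost[1] else paths[1]

bits = [1, 0, 1, 1, 0, 0, 1]
noise = [0.3, -0.2, 0.4, -0.3, 0.2, 0.1, -0.4]
read_back = [y + n for y, n in zip(channel(bits), noise)]
print(viterbi_detect(read_back) == bits)
```

    The detector never decides any single sample in isolation; it scores whole sequences, which is why sloppy signal levels are tolerable.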

    Sounds like voodoo but works really well. Things are not simple thresholds or the like, it is a complex system and ends up being quite robust and resilient to error.

    So it is highly unlikely that you had a bit flipped on a disk. Would require some amazing circumstances to happen. The RAM error is far more likely. Not just the cosmic ray thing but, as the parent noted, bad RAM. Normally when RAM fails, it fails catastrophically and it is immediately apparent. Not always though. It can fail not only at single bit locations, but also only during certain ops. That is why memtest does so many different tests. One kind might work fine, another might fail. Rare, but I've seen it on a few systems.
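
    A minimal sketch of why several test patterns are needed, using a simulated RAM with one bit stuck at 0. The fault model, class, and function names are hypothetical, and this is not how memtest is actually implemented; it only shows that one pattern can pass while another fails on the same bad cell.

```python
class FaultyMemory:
    """Simulated 8-bit-wide RAM with one bit stuck at 0 (hypothetical fault)."""

    def __init__(self, size, bad_addr, bad_bit):
        self.cells = [0] * size
        self.bad_addr = bad_addr
        self.bad_mask = 1 << bad_bit

    def write(self, addr, value):
        if addr == self.bad_addr:
            value &= ~self.bad_mask   # the stuck bit can never hold a 1
        self.cells[addr] = value & 0xFF

    def read(self, addr):
        return self.cells[addr]

def pattern_test(mem, size, pattern):
    """Fill memory with `pattern`, read it back, return mismatching addresses."""
    for addr in range(size):
        mem.write(addr, pattern)
    return [addr for addr in range(size) if mem.read(addr) != pattern]

def walking_ones_test(mem, size):
    """Try a single 1 bit in each position in turn; return (addr, bit) failures."""
    failures = []
    for bit in range(8):
        for addr in pattern_test(mem, size, 1 << bit):
            failures.append((addr, bit))
    return failures

mem = FaultyMemory(size=16, bad_addr=5, bad_bit=3)
for pattern in (0x00, 0x55, 0xAA, 0xFF):
    print(f"pattern 0x{pattern:02X}: bad addresses {pattern_test(mem, 16, pattern)}")
print("walking ones:", walking_ones_test(mem, 16))
```

    The all-zeros and 0x55 patterns never try to store a 1 in the stuck bit, so they pass; only patterns that exercise that bit expose the fault.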

  4. Re:Old, old story by Anonymous Coward · · Score: 5, Informative

    For a more recent analysis (by folks at Google and U.Toronto) see "DRAM Errors in the Wild: A Large-Scale Field Study" in ACM SIGMETRICS/Performance 09.

    They did an extensive analysis of DRAM failures across many vendors, debunk several myths, and indicate that the soft error rate can be much higher than previously thought.

    Well worth a read...

    http://www.cs.toronto.edu/~bianca/papers/sigmetrics09.pdf

     

  5. Re:Ugh, single bit errors by rudy_wayne · · Score: 3, Informative

    I'm not sure why you'd want ECC RAM in a desktop, unless it's some sort of business-critical machine that you're willing to spend 5 or 6 times what a normal desktop costs on.

    This may have been true at one time, but ECC RAM is no longer that expensive. I just looked at prices on Newegg:

    8 GB DDR3 $214.99

    8 GB DDR3 ECC $274.99

    In some cases, depending on the brand and the speed, ECC is actually *CHEAPER*.

  6. Roman ingots to shield particle detector by drerwk · · Score: 3, Informative

    Roman ingots to shield particle detector
    http://www.nature.com/news/2010/100415/full/news.2010.186.html

  7. Re:Ugh, single bit errors by billcopc · · Score: 3, Informative

    Depends on the type of desktop. ECC these days doesn't cost much more than non-ECC... Dell and HP may not want to admit it, but I buy ECC DDR3 all the time as I build a lot of white-box servers, and frankly even the lamest "gaming" RAM carries a higher premium than ECC.

    The tricky thing is that while most (all?) current AMD boards can take ECC RAM (unbuffered, not registered), no consumer Intel boards can handle ECC; you need to step up to a Xeon processor and chipset. Luckily the single-processor setups don't cost all that much more than their mid-range consumer equivalents, but you do have to sacrifice buzzy features like USB 3.0, SLI/Crossfire, eSATA and overclocking. One exception to this is the EVGA Classified SR-2, which has absolutely everything, but it's $600 and requires a special oversized chassis (or a lot of dremel work).

    I'm going to put this out there: if someone is genuinely concerned about bit errors, to the degree that losing work to a minor crash or reboot really matters, go ahead and spend an extra 10% on ECC. Even if you pack that board with 96 GB of memory, it's still cheaper than six months of therapy and thorazine :P

    --
    -Billco, Fnarg.com