Slashdot Mirror


Tracking Down a Single-Bit RAM Error

Hanji writes "We have discussed here before the potential effects of and protections against cosmic ray radiation, but for the average computer user, it's an obscure threat that doesn't affect them in any real way. Well, here's a blog post that describes a strange segfault and, after extensive debugging, traces it down to a single bit flip, probably caused by a stray cosmic ray. Lots of helpful descriptions of Linux debugging techniques in this one, and a pretty clear demonstration that this can be a real problem. I know I'm never buying a desktop without ECC RAM ever again!" The author acknowledges that it might not have been a cosmic ray-based error, but the troubleshooting steps are interesting no matter what the cause.

18 of 277 comments

  1. Takes me back by tsotha · · Score: 4, Interesting

    When I was in college one of my physics professors told us he doubted programs would ever get bigger than a few hundred kilobytes because cosmic rays would cause the larger programs to fail too frequently.

  2. Easter Earthquake by ushering05401 · · Score: 5, Interesting

    I don't know about cosmic rays, but immediately following the Easter day Earthquake in Guadalupe Victoria (about three hundred miles from where I was located) I tried to fire up my laptop and then my desktop, both of which had been suspended to RAM. Neither one would wake up, though the lappie displayed a garbled screen. No errors in the log files (Ubuntu 9.10 on the sys76 lappie, Deb Lenny on desktop).

  3. RAM error? by Camel Pilot · · Score: 5, Interesting

    Forget a RAM error, I have seen a bit on a file on the disk flip.

    After years of successful operation, a Perl script quit working. On investigation, a G had been transformed to a W, a difference of one bit. The file mod date was years old.
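
    The one-bit claim is easy to check by XORing the two ASCII codes. A minimal sketch follows; the helper name `differing_bits` is made up for illustration and is not from the original post.

```python
def differing_bits(a, b):
    """Return the bit positions (0 = least significant) where two characters differ."""
    diff = ord(a) ^ ord(b)  # XOR leaves a 1 wherever the two codes disagree
    return [i for i in range(diff.bit_length()) if (diff >> i) & 1]


print(f"G = {ord('G'):08b}")
print(f"W = {ord('W'):08b}")
print("differing bit positions:", differing_bits("G", "W"))
```

    G is 0x47 and W is 0x57, so the only difference is bit 4 (value 0x10).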

    1. Re:RAM error? by marcansoft · · Score: 5, Interesting

      I experienced almost exactly that issue with a RAM error. My system was apparently stable, and then one day I got a syntax error in a system Perl script: one character had changed. The script was owned by root and otherwise untouched. After puzzling over it for quite a while I realized it could be a RAM error and ran memtest86. It reported a single permanently stuck bit in my 512MB of RAM. I found a kernel patch to manually mark problem RAM areas as reserved and kept on running with that RAM for a few years.

      Are you sure that perl script issue was caused by a drive error? A RAM error can cause the same apparent problem, if the corruption happens in the kernel's cache. However, it shouldn't be permanent as it will not be written back to disk (the cache won't be dirty) unless someone actually modifies the file.

  4. It's not cosmic. It's from the die/package by EmagGeek · · Score: 5, Informative

    Soft errors in DRAM are far more likely to be caused by alpha particles emitted by the decay of materials in the die and packaging.

  5. faulty RAM by mojo-raisin · · Score: 4, Interesting

    I've been working with some large microarray datasets recently, and so had to double my computer's memory to 8GB.

    As I've done for years, I went to Fry's to get some Corsair chips... installed F13 64bit to replace my older 32bit distro... and crash-o-matic began. Mostly from Chrome and Mercurial.

    I ran memtest86+ and sure enough, verified my first purchase of faulty memory.

    So, I went back to Fry's and exchanged for another pair of Corsair 2GB chips. This time, I ran memtest86+ first thing... ANOTHER bad set, so back it went to Fry's.

    *Third* set of memory was Kingston, and a trip through memtest86+ verified no errors. Yay!

    Computer has been stable, too.

    With more and more RAM in computers, my next box will have ECC.

  6. fascinating by vux984 · · Score: 4, Insightful

    It's interesting to me because my first instinct would have been to assume something got corrupted, and my first step would have been to reboot. If the problem persisted through a reboot then I might have gone down the rabbit hole in similar fashion to try and find and fix the root cause.

    There are enough software bugs, kernel bugs, driver bugs, hardware hiccups due to marginal equipment, power fluctuation, interference, random noise... and I suppose even cosmic radiation that I would rarely think to spend the time to trace a transient problem unless it was reproducible across reboots, or at least happened on multiple separate occasions.

  7. Old, old story by jmichaelg · · Score: 5, Interesting

    Back in the early 80's, HP published a paper on random bit errors in RAM. They looked at chips from a variety of vendors and determined that the RAM coming out of Japan was the most reliable. That paper caused a lot of US RAM vendors to shutter their doors as there was a sea change in purchasing habits.

    A few years later, I ran into John Sculley while we were waiting for a flight. I mentioned the paper to him and asked him how Apple could seriously expect to sell a Macintosh specifically aimed at the scientific community if it didn't have ECC. He blithely said "it's not a problem..." 20+ years on, most of us still don't have ECC, so it seems he was right.

    1. Re:Old, old story by Anonymous Coward · · Score: 5, Informative

      For a more recent analysis (by folks at Google and U.Toronto) see "DRAM Errors in the Wild: A Large-Scale Field Study" in ACM SIGMETRICS/Performance 09.

      They did an extensive analysis of DRAM failures from many vendors and debunk several myths as well as indicating that the soft error rate can be much higher than previously thought.

      Well worth a read...

      http://www.cs.toronto.edu/~bianca/papers/sigmetrics09.pdf

       

  8. Also by Sycraft-fu · · Score: 5, Informative

    Disks have a lot, and I mean a LOT of ECC on them. It is not a situation of "I need to write a 1 so I'll place one at this location on the drive." They use a complex encoding scheme so that bit errors on the disk don't yield data errors to the user.

    Then there's the fact that bits aren't even stored as bits really. All current drives use (E)PRML which is (Enhanced) Partial Response Maximum Likelihood. What this means is bits aren't encoded as a high-low state or FM wave or any of that. They are written using flux reversals, but the level is not carefully controlled, it can't be. So when you read the data the drive actually looks at an analogue wave. It encodes the partial response it gets, and then finds the most likely pattern that matches.

    Sounds like voodoo but works really well. Things are not simple thresholds or the like, it is a complex system and ends up being quite robust and resilient to error.

    So it is highly unlikely that you had a bit flipped on a disk. That would require some amazing circumstances. The RAM error is far more likely. Not just the cosmic ray thing but, as the parent noted, bad RAM. Normally when RAM fails, it fails catastrophically and it is immediately apparent. Not always though. It can fail not only at single bit locations, but only during certain ops. That is why memtest runs so many different tests. One kind might work fine, another might fail. Rare, but I've seen it on a few systems.
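
    To illustrate why a single test pattern is not enough, here is a toy model of RAM with one bit stuck at 0. The `FaultyMemory` class and `pattern_test` helper are hypothetical names for this sketch, and it is not how memtest is implemented; it only shows that some patterns pass over a fault that others catch.

```python
class FaultyMemory:
    """Simulated byte-addressable RAM with one bit permanently stuck at 0 (toy model)."""

    def __init__(self, size, bad_addr, bad_bit):
        self.cells = bytearray(size)
        self.bad_addr = bad_addr
        self.mask = 0xFF & ~(1 << bad_bit)  # clears the stuck bit

    def write(self, addr, value):
        if addr == self.bad_addr:
            value &= self.mask  # the faulty cell can never store a 1 in that bit
        self.cells[addr] = value

    def read(self, addr):
        return self.cells[addr]


def pattern_test(mem, size, pattern):
    """Fill memory with one byte pattern, read it back, return the failing addresses."""
    for addr in range(size):
        mem.write(addr, pattern)
    return [addr for addr in range(size) if mem.read(addr) != pattern]


SIZE = 1024
mem = FaultyMemory(SIZE, bad_addr=300, bad_bit=5)
for pattern in (0x00, 0x55, 0xAA, 0xFF):
    print(f"pattern {pattern:#04x}: failures at {pattern_test(mem, SIZE, pattern)}")
```

    Patterns 0x00 and 0x55 never try to set bit 5, so they report nothing; 0xAA and 0xFF both flag address 300. Faults that depend on timing or neighbouring cells need still other access patterns.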

    1. Re:Also by Scaba · · Score: 5, Funny

      Then there's the fact that bits aren't even stored as bits really. All current drives use (E)PRML which is (Enhanced) Partial Response Maximum Likelihood. What this means is bits aren't encoded as a high-low state or FM wave or any of that. They are written using flux reversals, but the level is not carefully controlled, it can't be. So when you read the data the drive actually looks at an analogue wave. It encodes the partial response it gets, and then finds the most likely pattern that matches.

      I doubt this is true. The disk would have to be spinning at 88 mph in order to activate the flux capacitor, and the power brick would need to supply 1.21 gigawatts to the drive, which exceeds the capacity of even the most tricked-out gaming PC. I think you'd better check your science, my friend.

  9. Cosmic rays, my ass. Occam's Razor time. by Anonymous Coward · · Score: 5, Insightful

    You are on the right track. As someone with over a quarter century of background in combined embedded software and hardware design (the most recent decade for life-dependent systems), it always amazes me how quickly pseudo-technical people jump to wild speculation for observations that they cannot explain.

    They fail to understand that a hardware system is an imperfect representation of the theory (probably the biggest failure in the schooling of software developers, and even some hardware developers, is not getting this message into their heads). While they feel comfort in the theory of a binary system, they utterly fail to understand that our real systems, like us, are imperfect and, like us, live in an analog world. Simple things like temperature variations, noise from common (rather than cosmic) sources, marginal design timing, imperfect components, simple intermittents, etc., are 10^24 times more likely the cause.

    But they're not as fascinating as wild speculation, are they?

  10. Re:Ugh, single bit errors by hawguy · · Score: 4, Interesting

    I think the original article showed why you'd want ECC in a desktop machine -- random bit errors do happen in real life. I don't see how a warranty makes this less of an issue -- if my machine silently corrupts data due to a bit error, getting a $50 replacement DIMM isn't really going to satisfy me. Does ECC really cost 5X over non-ECC?

    If he was processing data or editing a spreadsheet, then that bit error could have corrupted his data. If he was compiling a program for distribution (perhaps to thousands of machines), that bit error could have corrupted his executable, causing errors on all of the machines it was deployed to.

    After reading this article, the question that comes to mind is why am I *not* running ECC on my desktop?
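
    For a feel of what ECC buys you, here is a minimal sketch of a Hamming(7,4) code, which corrects any single flipped bit in a 7-bit codeword. This is a deliberately small stand-in: ECC DIMMs typically use a wider single-error-correcting, double-error-detecting code over 64-bit words, and the function names here are made up for illustration.

```python
def hamming74_encode(data):
    """Encode 4 data bits [d1, d2, d3, d4] into a 7-bit codeword (positions 1..7)."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # parity over positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # parity over positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def hamming74_decode(codeword):
    """Correct at most one flipped bit; return (data bits, error position or 0)."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s4  # 1-based position of the bad bit, 0 if clean
    if syndrome:
        c[syndrome - 1] ^= 1  # flip the bad bit back
    return [c[2], c[4], c[5], c[6]], syndrome


data = [1, 0, 1, 1]
stored = hamming74_encode(data)
stored[4] ^= 1  # simulate a single bit flip at position 5
recovered, position = hamming74_decode(stored)
print("recovered:", recovered, "corrected position:", position)
```

    Without the three parity bits, the flipped bit would simply have been read back as wrong data, which is the silent-corruption case described above.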

  11. Re:This would be important by Vellmont · · Score: 4, Interesting


    Billions of years in the ground, and only a few centuries on the roof and all of the radioactivity is gone! Wow!

    The author needs to provide a reference, but there are a few ways I can think of that a processing stage and a few centuries would produce something less radioactive than something produced more recently. I think all of them stem from the ore containing a source material that gets separated through the refining process, but the daughter products from the source don't. Here's one scenario:

    Ore = Lead + radio-isotope A + radio-isotope B.

    radio-isotope A decays to radio-isotope B

    radio-isotope A: 4 billion year half-life.
    radio-isotope B: 20 year half life, decays to stable isotope C.

    During refining, radio-isotope A gets nearly completely refined out to parts per trillion. Radio-isotope B is similar to lead chemically, and remains at 1 part per million (at time of refining).

    200 years go by (10 half-lives of radio-isotope B).
    Radio-isotope B is now at 1/2^10 of its original concentration, or about 1 part per billion. Significantly less than when it was first refined. The added radioactivity from radio-isotope A decaying into B is negligible due to the long half-life of A.

    These numbers and process are obviously made up to show how it MIGHT work. It still remains to be seen if it's actually true or not.
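
    The arithmetic in this made-up scenario can be checked in a few lines. The sketch below uses only the hypothetical numbers from the comment (1 ppm of isotope B, a 20-year half-life, 200 years elapsed, a 4-billion-year half-life for A); the function name is an assumption for illustration.

```python
def remaining_fraction(elapsed_years, half_life_years):
    """Fraction of a radioactive isotope left after the given time."""
    return 0.5 ** (elapsed_years / half_life_years)


initial_b_ppm = 1.0  # isotope B concentration at refining, parts per million
b_fraction = remaining_fraction(200, 20)  # ten half-lives
b_ppb = initial_b_ppm * b_fraction * 1000  # convert ppm to parts per billion

a_decayed = 1 - remaining_fraction(200, 4e9)  # share of isotope A that decayed

print(f"B remaining: 1/{1 / b_fraction:.0f} of the original, about {b_ppb:.2f} ppb")
print(f"A decayed in 200 years: about {a_decayed:.1e} of what was present")
```

    Ten half-lives leave 1/1024 of the original amount, so 1 ppm drops to a little under 1 part per billion, while only a few hundred-millionths of the remaining isotope A decays over the same period.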

    --
    AccountKiller
  12. Re:Ugh, single bit errors by Timothy Brownawell · · Score: 5, Insightful

    I'm not sure why you'd want ECC RAM in a desktop, unless it's some sort of business-critical machine that you're willing to spend 5 or 6 times what a normal desktop costs. For day to day use, ECC is overkill.

    My desktop has 8GB of ECC in it. This cost, I think, $40 more than non-ECC, and meant I got an Athlon II X4 instead of a Core i5. That "5 or 6 times what a normal desktop costs" is either bullshit or Intel-onlyism (which is just another kind of bullshit).

  13. Ksplice ... go figure by GNUALMAFUERTE · · Score: 5, Interesting

    The guy that posted this is a Ksplice developer. In case you didn't know, Ksplice allows you to patch your running kernel without rebooting. Nice.

    Anyway, this guy sees a random memory error. He conveniently goes on a debugging rampage, while we all know the most logical first step would be rebooting that damn machine. Random memory errors do happen.

    He says he "hasn't gotten around" to memtesting his RAM yet. So, let me get this straight ... he implies that random cosmic rays caused the error, but he hasn't yet tested his RAM for what is the most likely cause of the issue?

    Then he goes on to explain that you don't even need to reboot your machine due to damn cosmic radiation. Or kernel updates. Because you have Ksplice.

    Come on.

    --
    WTF am I doing replying to an AC at 5 A.M on a Friday night?
  14. Re:Cosmic rays, my ass. Occam's Razor time. by Anonymous Coward · · Score: 5, Interesting

    On the subject of the imperfect nature of machines, I found this post by Richard D. James (aka Aphex Twin, a noted electronic music composer) quite interesting. He describes how the physical machinery of analog electronic music machines means it is near impossible to duplicate them in digital programs.

    link

    Author: analord
    Date: 02-07-05 03:14

    some people bought the analogue equipment when it was unfashionable and very cheap though.
    some of us are over 30 you know!
    anyone remember when 303`s were £50? and coke was 16p a tin? crisps 5p

    also you have overlooked A LOT of other points because its not all about the overall frequency response of the recording system its how the sound gets there in the first place.
    here are some things which you can`t get from a plugin,they are often emulated but due to their hugely complex nature are always pretty crass approximations..

    the sound of analogue equipment including EQ, changes very noticeably over even a few hours due to temperature changes within a circuit.
    Anyone who has tried to make tracs on a few analogue synths and make them stay in tune can tell you this,you leave a trac running for a few hours come back and think Im sure I didnt fucking write that,I must be going mental!

    this affects all the components in a synth/EQ in an almost infinite amount of tiny ways.
    and the amount differs from circuit to circuit depending on the design.

    the interaction of different channels and their respective signals with an analogue mixer are very complex,EQ,dynamics....
    any fx, analogue or digital that are plugged into it all have their own special complex characteristics and all interact with each other differently and change depending on their routing.
    Nobody that ive heard of has even begun to start emulating analogue mixer circuitry in software,just the aesthetics,it will come but im sure it will be a crap half hearted effort like most pretend synth plugins are.
    they should be called PST synths, P for pretend not virtual.

    Every piece of outboard gear has its own sound ,reverbs,modulation effects etc
    real room reverb, this in itself companies have spent decades trying to emulate and not even got close in my opinion, even the best attempts like Quantec and EMT only scratch the surface.

    analogue EQ is currently impossible in theory to be emulated digitally,quite intense maths shit involved in this if youre really that interested,you could look it up...good luck.

    your soundcard will always make things sound like its come from THAT soundcard..they ALL impose their different sound characteristics onto whatever comes out of them they are far from being totally neutral devices.

    all the components of a circuit like resistors and capacitors subtly differ from each other depending on their quality but even the most high quality military spec ones are never EXACTLY the same.

    no two analogue synths can ever be built exactly the same,there are tiny human/automated errors in building the circuits,tweaking the trimpots for example which is usually done manually in a lot of analogue shit.
    just compare the sound of 2 808 drum machines next to each other and you will see what I mean,you always thought an 808 was an 808 right?
    same goes for 303`s they all sound subtly different,different voltage scaling of the oscillator is usually quite noticeable.

    VST plugins are restricted by a finite number of calculations per second these factors are WAY beyond their CURRENT capability.

    Then there is the question of the physicality of the instrument this affects the way a human will emotionally interact with it and therefore affect what they will actually do with it! often overlooked from the maths heads,this is probably the biggest factor I think.
    for example the smell of analogue stuff as well as the look of it puts y

  15. ha! by serbanp · · Score: 5, Insightful

    The really impressive thing is that this guy resisted the urge to just reboot his machine. Otherwise, the clues would have vanished and the expr binary would have run again without any issue.

    Maybe that's why the first step one takes when something behaves weird on a Windows system is to reboot it...