Reliability of Computer Memory?
olddoc writes "In the days of 512MB systems, I remember reading about cosmic rays causing memory errors and how errors become more frequent with more RAM. Now, home PCs are stuffed with 6GB or 8GB and no one uses ECC memory in them. Recently I had consistent BSODs with Vista64 on a PC with 4GB; I tried memtest86 and it always failed within hours. Yet when I ran 64-bit Ubuntu at 100% load and using all memory, it ran fine for days. I have two questions: 1) Do people trust a memtest86 error to mean a bad memory module or motherboard or CPU? 2) When I check my email on my desktop 16GB PC next year, should I be running ECC memory?"
Recently I had consistent BSODs with Vista64 on a PC with 4GB...
This was a surprise?
My experience with memtest is that you can trust the results if it says the memory is bad; however, if the memory passes, it could still be bad. Troubleshooting your scenario should involve replacing the DIMMs in question with known-good modules while running Windows.
brandelf -t FreeBSD
wrap your _whole_ computer in tinfoil to deflect those pesky cosmic rays. it also works to keep them out of your head too.
If a system gives memtest86 errors, I break it down and swap components until it doesn't. The test pattern it uses can find subtle errors you're unlikely to run into with any application-based testing even when run for a few days. Any failures it reports should be taken seriously. Also: you should pay attention to the memory speed value it reports; that's a surprisingly effective simple benchmark for figuring out if you've set up your RAM optimally. The last system I built, I ended up purchasing 4 different sets of RAM, and there was about a 30% delta between how well the best and worst performed on the memtest86 results--correlated extremely well with other benchmarks I ran too.
At the same time, I've had memory that memtest86 said was fine, but the system itself still crashed under a heavy Linux-based test. I consider both a full memtest86 test and a moderate workload Linux test to be necessary before I consider a new system to have baseline usable reliability.
There are a few separate problems here that are worthwhile to distinguish among. A significant amount of RAM doesn't work reliably when tested fully. Once you've culled those out, only using the good stuff, some of that will degrade over time to where it will no longer pass a repeat of the initial tests; I recently had a perfectly good set of RAM degrade to useless in only 3 months here. After you take out those two problematic sources for bad RAM, is the remainder likely enough to have problems that it's worth upgrading to ECC RAM? I don't think it is for my home systems, because I'm OK with initial and periodic culling to kick out borderline modules. And things like power reliability cause me more downtime than RAM issues do. If you don't know how, or don't have the time, to do that sort of thing yourself, though, you could easily be better off buying more redundant RAM.
1) Yes
2) No
Now to be serious. Home PCs do not yet come with 6GB or 8GB. Most new home PCs still seem to have between 1GB and 4GB, and the 4GB variety is rare because most home PCs still come with a 32-bit operating system. 3GB seems to be the sweet spot for higher-end home PCs. Your home PC will most likely not have 16GB next year. Your workstation at work, perhaps, but even then only perhaps.
At the risk of sounding like "640KByte is enough for everyone", I have to ask why you think you need 16GB to check your email next year. I'm typing this on a 6 year old computer, I'm running quite a few applications at the same time and I know a second user is logged in. Current memory usage: 764Meg RAM. As a general rule, I know that Windows XP runs fine on 512Meg RAM and is comfortable with 1GB RAM. The same is true for GNU/Linux running Gnome.
Now, at work with Eclipse loaded, a couple of application servers, a database and a few VMs... Yeah, there indeed you get memory starved quickly. You have to keep in mind that such usage pattern is not that of a typical office worker. I can imagine that a heavy Photoshop user would want every bit of RAM he can get too. The Word-wielding-office-worker? I don't think so.
Now, I can't speak for Vista. I heard it runs well on 2GB systems, but I can't say. I got a new work laptop last week and booted briefly in Vista. It felt extremely sluggish and my machine does have 4Gig RAM. Anyway, I didn't bother and put Debian Lenny/amd64 on it and didn't look back.
In my opinion, you have quite a twisted sense of reality regarding the computers people actually use.
Oh, and frankly... If cosmic rays would be a big issue by now with huge memories, don't you think that more people would be complaining? I can't say why Ubuntu/amd64 ran fine on your machine. Perhaps GNU/Linux has built-in error correction and marks bad RAM as "bad".
Ahhh...the great dumpster continuum. Many a free computer will be found there. -- sowth (748135)
Is ECC memory worth the money in a machine you use to check your E-mail? Can't you just reboot and/or replace the memory if errors occur?
I could see it happening when the cost of ECC memory is no higher than normal memory and using ECC memory has little or no impact on performance. Until then, I don't expect to start seeing it in desktop machines.
If you want ECC memory on your desktop, feel free to build your own machine with a motherboard that supports ECC memory. Some high end desktops do support ECC memory already.
My first computer was an 80286 with 1 MB of RAM. That RAM was all parity memory. Cheaper than ECC, but still good enough to positively identify a genuine bit flip with great accuracy. My 80386SX had parity RAM, so did my 486DX4 120. I ran a computer shop for some years, so I went through at least a dozen machines ranging from the 386 era through the Pentium II era, at which point I sold the shop and settled on an AMD K6-2 450. And right about the time that the Pentium was giving way to the Pentium II, non-parity memory started to take hold.
What protection did parity memory provide, anyway? Not much, really. It would detect with 99.99...? % accuracy when a memory bit had flipped, but provided no answer as to which one. The result was that if parity failed, you'd see a generic "MEMORY FAILURE" message and the system would instantly lock up.
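To make the "detects but can't locate" point concrete, here is a minimal sketch of even parity over one byte. The function names are made up for illustration and this is not how any particular memory controller implements it. It also shows the known blind spot: an even number of flipped bits in the same byte goes unnoticed.

```python
def parity_bit(byte: int) -> int:
    """Even parity: 1 if the byte has an odd number of 1 bits, else 0."""
    return bin(byte & 0xFF).count("1") % 2


def parity_ok(byte: int, stored_parity: int) -> bool:
    """True if the byte still matches the parity bit stored alongside it."""
    return parity_bit(byte) == stored_parity


original = 0b10110010
stored = parity_bit(original)          # written together with the data

one_flip = original ^ 0b00001000       # a single bit flipped
two_flips = original ^ 0b00001001      # two bits flipped in the same byte

print(parity_ok(original, stored))     # intact data passes
print(parity_ok(one_flip, stored))     # mismatch: detected, but not which bit
print(parity_ok(two_flips, stored))    # passes: a double flip is invisible
```

The check only ever answers "matches" or "doesn't match", which is why the machine could do nothing smarter than halt.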
I saw this message perhaps three times - it didn't really help much. I had other problems, but when I've had problems with memory, it's usually been due to mismatched sticks, or sticks that are strangely incompatible with a specific motherboard, etc. none of which caused a parity error. So, if it matters, spend the money and get ECC RAM to eliminate the small risk of parity error. If it doesn't, don't bother, at least not now.
Note: having more memory increases your error rate assuming a constant rate of error (per megabyte) in the memory. However, if the error rate drops as technology advances, adding more memory does not necessarily result in a higher system error rate. And based on what I've seen, this most definitely seems to be the case.
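A quick back-of-the-envelope version of that argument, as a sketch. The per-megabyte rates below are invented placeholders, not measurements; only the scaling matters.

```python
def expected_errors_per_year(size_mb: float, errors_per_mb_year: float) -> float:
    """System-wide error rate if every megabyte fails independently at the same rate."""
    return size_mb * errors_per_mb_year


# Hypothetical per-MB rate, chosen only to illustrate the scaling.
OLD_RATE = 1e-3

old_system = expected_errors_per_year(512, OLD_RATE)
# 16x the RAM at the same per-MB rate: 16x the errors.
same_tech = expected_errors_per_year(8192, OLD_RATE)
# 16x the RAM, but each MB is assumed to be 16x more reliable: no change.
better_tech = expected_errors_per_year(8192, OLD_RATE / 16)

print(old_system, same_tech, better_tech)
```

So whether a bigger machine sees more errors depends entirely on how the per-MB rate moved in the meantime.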
Remember this blog article about the end of RAID 5 in 2009? Come on... are you really going to think that Western Digital is going to be OK with near 100% failure of their drives in a RAID 5 array? They'll do whatever it takes to keep it working because they have to - if the error rate became anywhere near that high, their good name would be trashed because some other company (Seagate, Hitachi, etc) would do the research and pwn3rz the marketplace.
I have no problem with your religion until you decide it's reason to deprive others of the truth.
With memory becoming so plentiful these days (I haven't seen many home PCs with 6 or 8GB granted, but we're getting there) it seems that a single error on a large capacity chip is getting more and more trivial. Isn't it a waste to throw away a whole DIMM? Why isn't it possible to "remap" this known-bad address, or allocate some amount of RAM for parity the way software like PAR2 works? Hard drive manufacturers already remap bad blocks on new drives. Also it seems to me that, being a solid state device, small failures in RAM aren't necessarily indicative of a failing component like bad sectors on a hard drive are. Am I missing something really obvious here or is it really just easier/cheaper to throw it away?
First, it was not cosmic rays; memory was tested in a lead vault and showed the same error rate. Turns out to have been alpha particles emitted by the epoxy / ceramic that the memory chips were encapsulated in.
That said: Quite clearly given your experience, Vista and Ubuntu load the memory subsystem quite differently. It is possible that Vista, with its all-over-the-map program flow, is missing cache a lot more often and so is hitting DRAM harder; I don't have the background to really know. I believe that Memtest86, in order to put the most strain on memory and thus test it in the most pessimal conditions, tries to access memory in patterns that equally hit physical memory hardest. But, what I have found is that some OSs, apparently including Ubuntu, will run on memory that is marginal, memory that Memtest86 picks up as bad.
As for ECC in memory... The problem is that ECC carries a heavy performance hit on write. If you only want to write 1 byte, you still have to read in the whole QWord, change the byte, and write it back to get the ECC to recalculate correctly. It is because of that performance hit that ECC was deprecated. The problem goes away to a large extent if your cache is write-back rather than write-through; though there will be still a significant number of cases where you have to write a set of bytes that has not yet been read into cache and does not comprise a whole ECC word.
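Here is a toy model of that read-modify-write cycle, just to show where the extra memory access comes from. Everything in it is an assumption for illustration: the class name is made up, and the check code is a simple XOR of the word's bytes standing in for real SECDED check bits.

```python
class EccWordMemory:
    """Toy model: each 64-bit word is stored with check bits over the whole word."""

    def __init__(self, words: int):
        self.data = [0] * words
        self.check = [self._code(0)] * words
        self.reads = 0
        self.writes = 0

    @staticmethod
    def _code(word: int) -> int:
        # Stand-in for real ECC check bits: XOR of the word's eight bytes.
        code = 0
        for lane in range(8):
            code ^= (word >> (8 * lane)) & 0xFF
        return code

    def read_word(self, idx: int) -> int:
        self.reads += 1
        word = self.data[idx]
        if self._code(word) != self.check[idx]:
            raise ValueError("check bits do not match stored word")
        return word

    def write_word(self, idx: int, word: int) -> None:
        self.writes += 1
        self.data[idx] = word & (2**64 - 1)
        self.check[idx] = self._code(self.data[idx])

    def write_byte(self, idx: int, lane: int, value: int) -> None:
        # The check bits cover the whole word, so a one-byte store has to
        # fetch the word first, patch one lane, and write everything back.
        word = self.read_word(idx)
        shift = 8 * lane
        word = (word & ~(0xFF << shift)) | ((value & 0xFF) << shift)
        self.write_word(idx, word)


mem = EccWordMemory(4)
mem.write_word(0, 0x1122334455667788)   # full-word store: 1 write, 0 reads
mem.write_byte(0, 0, 0xAA)              # one-byte store: 1 read + 1 write
print(hex(mem.data[0]), mem.reads, mem.writes)
```

A full-word store never needs the read, which is why a write-back cache that evicts whole lines hides most of the cost.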
That said, it is still used on servers...
But I don't expect it will reappear on desktops any time soon. Apparently they have managed to control the alpha radiation to a great extent, and so the actual radiation-caused errors are now occurring at a much lower rate, significantly lower than software-induced BSODs.
My experience with a server that recorded about 15TB of data is something like 6 bit-errors per year that could not be traced to any source. This was a server with ECC RAM, so the problem likely occurred in buses, network cards, and the like, not in RAM.
For non-ECC memory, I would strongly suggest running memtest86+ for at least a day before using the system, and if it gives you errors, replace the memory. I had one very persistent bit-error in a PC in a cluster that actually required 2 days of memtest86+ to show up once, but did occur about once per hour for some computations. I also had one other bit-error that memtest86+ did not find, but the Linux commandline memory tester found after about 12 hours.
The problem here is that different testing/usage patterns result in different occurrence probabilities for weak bits, i.e. bits that only sometimes fail. Any failure in memtest86+ or any other RAM tester indicates a serious problem. The absence of errors in a RAM test does not indicate the memory is necessarily fine.
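A small simulation of why the pattern matters. This is only a sketch: it does not touch physical RAM, and `StuckBitBuffer` is a made-up stand-in for a module with one cell whose lowest bit is stuck at 0. Real testers use far more patterns and access orders than this.

```python
class StuckBitBuffer:
    """Simulated memory with one faulty cell whose masked bits always read 0."""

    def __init__(self, size: int, bad_index: int, stuck_mask: int):
        self._cells = bytearray(size)
        self._bad_index = bad_index
        self._stuck_mask = stuck_mask

    def __len__(self) -> int:
        return len(self._cells)

    def __setitem__(self, index: int, value: int) -> None:
        if index == self._bad_index:
            value &= ~self._stuck_mask & 0xFF   # the stuck bits never take a 1
        self._cells[index] = value

    def __getitem__(self, index: int) -> int:
        return self._cells[index]


def check_buffer(buf, patterns=(0x00, 0xFF, 0xAA, 0x55)):
    """Fill the buffer with each pattern, read it back, and list mismatches."""
    failures = []
    for pattern in patterns:
        for i in range(len(buf)):
            buf[i] = pattern
        for i in range(len(buf)):
            if buf[i] != pattern:
                failures.append((i, pattern, buf[i]))
    return failures


faulty = StuckBitBuffer(16, bad_index=5, stuck_mask=0x01)
print(check_buffer(faulty, patterns=(0x00, 0xAA)))   # these never set bit 0: clean
print(check_buffer(faulty, patterns=(0xFF, 0x55)))   # these do: the fault shows up
```

The same bad cell passes one set of patterns and fails another, so a clean run only tells you that the patterns you tried didn't trip anything.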
That said, I do not believe memory errors have become more common on a per computer basis. RAM has become larger, but also more reliable. Of course, people participating in the stupidity called "overclocking" will see a lot more memory errors and other errors as well. But a well-designed system with quality hardware and a thorough initial test should typically not have memory issues.
However, there is "quality" hardware that gets it wrong. My ASUS board sets the timing for 2 and 4 memory modules to the values for 1 module. This resulted in stable 1 and 2 module operation, but got flaky for 4 modules. Finally I moved to ECC memory before I figured out that I had to manually set the correct timings. (No BIOS upgrade available that fixed this...) This board has a "professional" in its name, but apparently, "professional" does not include use of generic (Kingston, no less) memory modules. Other people have memory issues with this board as well that they could not fix this way; it seems that sometimes a design is just bad, or even reputable manufacturers do not spend a lot of effort to fix issues in some cases. I can only advise you to do a thorough forum search before buying a specific mainboard.
Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
Then it would proba%ly alter not just one byte, b%t a chain of them. The cha%n of modified bytes would be stru%g out, in a regular patter%. Now if only there were so%e way to read memory in%a chain of bytes, as if it w%re a string, to visu%lize the cosmic ray mod%fication. hmmm...
intellectual property law is philosophically incoherent. it is your moral duty to ignore it or sabotage it
Not all memory is created equal. Memory can be bad if Memtest detects errors, or you can simply be running it at the wrong settings. Usually there are both "normal" and "performance" settings for memory on higher end motherboards, or sometimes you can tweak all sorts of cycle-level stuff manually (CAS latency etc.).
Try running your memory with the most conservative settings before you assume it's bad.
.: Max Romantschuk
Depending on where it fails (if it fails in the same spot), you can relatively easily work around it and not throw out the remaining good portion of the stick. I wrote a howto:
http://gquigs.blogspot.com/2009/01/bad-memory-howto.html
I've been running on Option 3 for quite some time now. No, it's not as good as ECC, but it doesn't cost you anything.
With today's wide buses, parity RAM is ECC RAM. It's worth paying the extra couple dollars.
Several years back I experienced disk corruption that seemed to be due to a bitflip that had happened in RAM and got committed to disk. That machine didn't have ECC RAM. I went to ECC for everything after that. That was back in the 128MB days, and no I don't overclock.
(Well, not aggressively. My machine is overclocked by about 1%.)
Program Intellivision!
1) Do people trust a memtest86 error to mean a bad memory module or motherboard or CPU?
Well, I'd add some other possibilities such as:
Bad power supply,
Memory isn't seated properly in its socket.
Incorrect timing set in BIOS.
Memory is incompatible with your motherboard.
etc..
But yeah, if memtest86 says there's a problem then there really is something wrong.
is to swap the memory modules to find out which is causing the problem, if not the motherboard. Also, I don't see how memory tests running inside an OS can be effective; I'd much rather boot off of a smaller system on a DVD, USB stick or floppy to run a memory test. Dell servers have those Dell Diagnostics CDs that are very small in memory footprint just in order to run diagnostics on memory. But even they're not perfect, so you often have to take memory out and see if you can reproduce errors.
The probability of a cosmic ray at precisely the right angle and speed to cause a single bit error and cause an app to crash is somewhere on the same order as your chances of getting hit by a car, getting struck by lightning, getting torn apart by rabid wolves, and having sex in the back of a red 1948 Buick convertible at a drive-in movie theater on Tuesday night, Feb. 29th under a blue moon... all at the same time.... Sure, given enough bits, it's bound to happen sooner or later, but it isn't something I'd worry about. :-)
The probability of RAM just plain being defective---failing to operate correctly due to bugs in handling of certain low power states, having actual bad bits, having insufficient decoupling capacitance to work correctly in the presence of power supply rail noise, etc---is probably several hundred thousand orders of magnitude greater (probably on the order of a one in several thousand chance of a given part being bad versus happening to a given part a few times before the heat death of the universe).
Memory test failures (other than mapping errors) are pretty much always caused by hardware failing. If running memtest86 in Linux works correctly for days, this probably means one of three things:
I couldn't tell you which of these is the case without swapping out parts, of course. You should definitely take the time to replace whatever is bad even if it seems to be "working" in Linux. In the worst case, you have a few bad bits of RAM, they're somewhere in the middle of your disk cache in Linux, and you are slowly and silently corrupting data periodically on its way out to disk.... You definitely need to figure out what's wrong with the hardware and why it is only failing in Windows, and it sounds like the only way to do that is to swap out parts, boot into Windows, and see if the problem is still reproducible in under a couple of days, repeating with different part swaps until the problem goes away. Don't forget to try a different power supply.
Check out my sci-fi/humor trilogy at PatriotsBooks.
several hundred thousand orders of magnitude
We've crossed beyond the realm of the astronomical and into something else entirely. Surely you meant several orders of magnitude, aka, hundreds of thousands of times? Let's keep things on this side of the googol.
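For anyone who wants the arithmetic, a two-line sketch: an order of magnitude is a factor of ten, so the count is just the base-10 logarithm of the ratio. The 300,000 figure is an arbitrary stand-in for "several hundred thousand".

```python
import math

ratio = 300_000                      # "several hundred thousand times"
orders = math.log10(ratio)           # orders of magnitude in that ratio
googol_orders = 100                  # a googol is 10**100

print(round(orders, 2), orders < googol_orders)
```

Several hundred thousand times comes out at five to six orders of magnitude, comfortably on this side of the googol.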
Evidently, the key to understanding recursion is to begin by understanding recursion. The rest is easy.
really, it's not that much more expensive. Search newegg for unbuffered ecc, if you are using a desktop class system that can't handle registered ram.
You wouldn't put data you care about on a hard drive without raid, would you?
Was it cosmic rays, or Alpha particle decay from impure materials that was going to do in our memory soon? IIRC it was the latter.
"It's the height of ridiculousness to say for those 9 lines you get hundreds of millions."
Yes. I do, anyway; I've never had it report a false positive, and it's always been one of the three (and even if it was cosmic rays, it wouldn't consistently come up bad, then, would it?). Then again, it could also mean that you could be using RAM requiring a higher voltage than what your motherboard is giving it. If it's brand-name RAM, you should look up the model number and see what voltage the RAM requires. Things like Crucial Ballistix and Corsair Dominator usually require around 2.1V.
Depends. If you're doing really important stuff then sure. ECC memory is quite a boon in that case. If you're just using your desktop for word processing and web browsing, it's a waste of money.
Screw the rules, I have green hair!
and having sex in the back of a red 1948 Buick convertible at a drive-in movie theater on Tuesday night, Feb. 29th under a blue moon... all at the same time....
Mom?
I usually wear medieval armour. Not only does that work as efficiently as tinfoil, it's also very fashionable.
When rudely swiping at other people, at least stop dribbling nonsense like "several hundred thousand orders of magnitude greater". I don't think you know what you are talking about. >>10^100000?
So I discount the rest of your "contribution" accordingly. Actually, several other parts of your answer are independently rubbish too: have you considered a career in tabloid journalism? Wish I had mod points...
Rgds
Damon
http://m.earth.org.uk/
Not only silicon... NASA astronauts consistently observe bright flashes in orbit, whether their eyes are open or closed. It is believed that these flashes are the result of cosmic rays interacting with the astronauts' retinas.
Rampant carbon sequestration destroyed the Dinosaurs' tropical paradise. I'm here to help repair the damage.
You're right that I've never run memtest86 at all. I hadn't regularly worked with any hardware based on an Intel architecture until about two years ago, and haven't experienced any RAM problems in that relatively short period. That is the sole valid criticism in your post, and even that was redundant. The rest of your post consists of you putting words in my mouth that I did not say.
Regarding point A., many Linux systems do perform at least rudimentary RAM checks. What I said was that it is remotely possible that it got lucky and detected the problem during such screening, then flagged that page of physical RAM as defective. I never said anything about checking every write to RAM. That was you putting words in my mouth, and completely ludicrous words that I'd have to know almost nothing about hardware to say, at that. Nice straw man.
Regarding point B., that's not a baseless troll argument by any stretch of the imagination. First, running a lean Linux distro will almost certainly thrash pages around far less than 64-bit Vista simply because the OS uses far less RAM. Second, last time I used it, Linux wired down a -lot- of pages in the kernel. All of those pages are just going to sit there. If anything, this was a criticism of Linux's tendency to wire too many pages, not any sort of "pro-Linux" comment. Maybe it might be taken to mean that Linux is less likely to eject pages belonging to one process in favor of another process---indeed, my experience has been that it does seem to do so less frequently than some other operating systems, though this can either be good or bad depending on the workload in question---but that was in no way implied by my previous comment, nor certainly was there any value judgment on my part as to whether such behavior is good or bad.
Likewise on point C., I was actually being harshly critical of Linux's power management, albeit without coming right out and saying it. Nowhere in my statement did I in ANY way insinuate that failing to switch into the lowest power states was in any way a good thing. It isn't. Poor power management leads to diminished battery life in portables and increased electric bills from computers of all types.
Before you go painting me as a pro-Linux troll, you need to learn some reading comprehension skills and stop trying to put words in my mouth. It only makes you look like a troll yourself.
Check out my sci-fi/humor trilogy at PatriotsBooks.
http://www.ida.liu.se/~abdmo/SNDFT/docs/ram-soft.html
This references an IBM study, which is what I think I actually remember but could not find quickly this morning.
"In a study by IBM, it was noted that errors in cache memory were twice as common above an altitude of 2600 feet as at sea level. The soft error rate of cache memory above 2600 feet was five times the rate at sea level, and the soft error rate in Denver (5280 feet) was ten times the rate at sea level."
My bet is that it is Cerenkov radiation as a high speed charged particle breaks the speed of light in the fluid in the eyeball.
Indeed, these flashes have pretty much already been identified as the result of Cerenkov radiation.
Looks like these truths are not so self-evident after all...
I think it occurs quite a bit more often than once every few days. It is however rare that you'll notice since the data corrupted is often not in code that would lead to a crash, actual program code being such a small percentage of RAM usage these days. A flip in graphical data, sound or text data will very likely go unnoticed. Same goes for flips in code-paths that are rarely used.
I was recently running a server that archived about 2 terabytes of data and got something like one or two bit errors per year that could not be traced to any known source. This server boasted ECC RAM, so the problem didn't occur in the RAM, and it was unlikely for it to have occurred in the FSB.
If you go with non-ECC, I would suggest running memtest86+. If you get errors, swap the memory. If swapping the memory still doesn't take care of it, swap motherboards! I recently had a memory problem in one of my customers' racks, and running memtest86+ got nothing until I had it running on my bench for over a week. There may be some problems with memtest86+... I even had another bit-error that memtest86+ did not find, but a Linux commandline memory tester found a problem almost immediately.
The problem here is that different testing/usage patterns result in different probabilities of finding potentially bad words, e.g. words that may only be bad if you read from them a hundred cycles consecutively. But, if you do see a failure in memtest86+ or the CLI tester, you got yourself a serious problem. The point to take from this is that if you don't see errors, that doesn't mean you don't have errors!
Having said this, I still don't think memory errors among PCs are that common. We have more RAM on machines these days, but at the same time, the manufacturing processes have become better. I have a personal conviction that though the likelihood of word error due to the increased amount of words in memory has increased, the RAM itself has become so much more "solid" that the effect of the increase in memory is negligible. Now, if you do dumb things with your computer like running it without a case or not giving it ventilation (learned this the hard way) or overclocking it, you *WILL* still run into problems. But if you design a system with quality and integrity, you typically shouldn't have these issues with memory!
One last thing to point out: there is quality hardware, and there is cheap hardware. My PC-Chips motherboard ran for three months and two days, and I didn't have a problem. Two days out of warranty. Now, take my MSI motherboard. It sets the timing for all memory modules to have the values of a single module. This resulted in stable single module operation, but got flaky for all four modules. I finally moved to ECC before I figured out that I had to manually set the correct timings. This board is an ultra board, but apparently, it does not include use of generic (Micron, Corsair, etc!! - tried 'em all) memory modules. People on the Newegg reviews board have memory issues with this board as well that they could not fix with a BIOS update, and it appears that sometimes a design just is bad! Even the "good" manufacturers do not spend a lot of effort to fix issues in some cases.
My words of advice: Do your homework. Read through the reviews. AND DON'T BUY HARDWARE AS SOON AS IT COMES OUT!
I see you've never experienced the joys of J2EE.
...if you catch it while hibernating
Be careful. Vista hibernates with one eye open. It can wake itself up from hibernation to do updates. I dual boot my laptop with Linux Mint (an Ubuntu variant). Every week, I'd go to turn on my computer only to find that the battery was dead. Checking the startup logs showed that Linux was starting up at about 3:00 in the morning. After googling, I found out that many people were having that problem. The suggested solution was to turn off Vista automatic updates. I checked my Vista, and sure enough, it was set to update at 3am. I turned that setting off: no battery issue. I turned it back on: battery drained.
My CMOS settings pages do not have any facility for waking the laptop at a specific time, so I don't know how Vista manages it. I only know that it can. So beware! Vista hibernates with one eye open.
When our name is on the back of your car, we're behind you all the way!
if you look at the username it's not him at all, it's someone with ID 1344097 pretending to be him. Still, what he says is sensible, and what's wrong with this piece? If it doesn't interest you, why are you reading the comments?
which is totally what she said
The real issue with memory cells flipping is not cosmic rays -- at least not with terrestrially deployed memory, it's alpha particle emissions from radioactive decay of the plastics in the memory package. Yes, the plastics surrounding the silicon.
A lot of work has been done to reduce the radioactivity of plastics used in IC packaging from normal background levels that you don't worry about in day-to-day life, to as quiet as possible, by carefully selecting source materials that have few naturally occurring radioisotopes.
From my chip-designer days I recall that the minimum charge required on a dynamic memory cell (like the ones in your computer's DRAM) to prevent spurious bit flips is one million electrons, give-or-take. The various designs back then were coming up with ways to reduce the footprint of the elements used to store that charge.
That said, it's been about 10 years since I've been in that line of work, and things have probably changed -- strike that, they've definitely changed -- substantially.
Put my fist through my alarm clock with its ding-dong death inside my ear. - The Blackjacks.
I do get it, but some people (like me) automatically skip past sigs, and the guy created his frickin name as KDawson's name plus his userID, what about that don't you get?
which is totally what she said
You must be unlucky or the cause.
This would make a great slogan for Microsoft's new ad campaign:
"Not an actor, but he plays one on TV."
http://www.ida.liu.se/~abdmo/SNDFT/docs/ram-soft.html
This references an IBM study, which is what I think I actually remember but could not find quickly this morning.
"In a study by IBM, it was noted that errors in cache memory were twice as common above an altitude of 2600 feet as at sea level. The soft error rate of cache memory above 2600 feet was five times the rate at sea level, and the soft error rate in Denver (5280 feet) was ten times the rate at sea level."
IBM research is a wonderful resource in the area of soft errors. I do remember reading exactly your quote. I didn't bother to track down the exact article, but it should be part of this special issue: http://www.research.ibm.com/journal/rd40-1.html. The banner article mentions Denver but doesn't have the exact quote. The web shows it would be "Terrestrial Cosmic Rays", the second article in that issue. They have a more recent special issue on the same subject: http://www.research.ibm.com/journal/rd52-3.html
Some simple tests:
Being one who has maintained an 1100 node cluster with 8800 pieces of ECC RAM, I can tell you we chase bad RAM sticks ALL THE TIME! It's not necessarily due to cosmic activity; the RAM just exhibits bad behavior as the circuits get older and things start to separate and break down due to thermal load over time. Even a small defect that would let the RAM pass the manufacturer's tests will eventually lead to a DIMM failure down the road. Most average human beings will never determine why their machine crashes every few days if it is a RAM issue. Some power users will even overlook it because they have too much faith in RAM that *was* good when they bought it, but now that it's two or three years old ...
I wouldn't trust a single app to verify your RAM. Run a couple different tests and see if you can nail down the problem. I can look and see how we're tracking that and get back to you.