Reliability of Computer Memory?
olddoc writes "In the days of 512MB systems, I remember reading about cosmic rays causing memory errors and how errors become more frequent with more RAM. Now, home PCs are stuffed with 6GB or 8GB and no one uses ECC memory in them. Recently I had consistent BSODs with Vista64 on a PC with 4GB; I tried memtest86 and it always failed within hours. Yet when I ran 64-bit Ubuntu at 100% load and using all memory, it ran fine for days. I have two questions: 1) Do people trust a memtest86 error to mean a bad memory module or motherboard or CPU? 2) When I check my email on my desktop 16GB PC next year, should I be running ECC memory?"
Recently I had consistent BSODs with Vista64 on a PC with 4GB...
This was a surprise?
My experience with memtest is that you can trust the results if it says the memory is bad; however, if the memory passes, it could still be bad. Troubleshooting your scenario should involve replacing the DIMMs in question with known-good modules while running Windows.
brandelf -t FreeBSD
wrap your _whole_ computer in tinfoil to deflect those pesky cosmic rays. it also works to keep them out of your head too.
I doubt this is any major problem, I've had my main computer up and running for a few months without a reboot and so far I haven't had any malfunction or program crash.
It's the lowest of the low end of the market that doesn't use ECC, or at least Parity RAM. For anything where reliability and veracity are important, you simply must use ECC.
If a system gives memtest86 errors, I break it down and swap components until it doesn't. The test pattern it uses can find subtle errors you're unlikely to run into with any application-based testing even when run for a few days. Any failures it reports should be taken seriously. Also, pay attention to the memory speed value it reports; that's a surprisingly effective simple benchmark for figuring out if you've set up your RAM optimally. For the last system I built, I ended up purchasing 4 different sets of RAM, and there was about a 30% delta between how well the best and worst performed on the memtest86 results, which correlated extremely well with other benchmarks I ran too.
At the same time, I've had memory that memtest86 said was fine, but the system itself still crashed under a heavy Linux-based test. I consider both a full memtest86 test and a moderate workload Linux test to be necessary before I consider a new system to have baseline usable reliability.
There are a few separate problems here that are worthwhile to distinguish among. A significant amount of RAM doesn't work reliably when tested fully. Once you've culled those out, only using the good stuff, some of that will degrade over time to where it will no longer pass a repeat of the initial tests; I recently had a perfectly good set of RAM degrade to useless in only 3 months here. After you take out those two problematic sources for bad RAM, is the remainder likely enough to have problems that it's worth upgrading to ECC RAM? I don't think it is for my home systems, because I'm OK with initial and periodic culling to kick out borderline modules. And things like power reliability cause me more downtime than RAM issues do. If you don't know how or have the time to do that sort of thing yourself though, you could easily be better off buying more redundant RAM.
1) Yes
2) No
Now to be serious. Home PCs do not yet come with 6GB or 8GB. Most new home PCs still seem to have between 1GB and 4GB, and the 4GB variety is rare because most home PCs still come with a 32-bit operating system. 3GB seems to be the sweet spot for higher-end home PCs. Your home PC will most likely not have 16GB next year. Your workstation at work, perhaps, but even that is only a perhaps.
At the risk of sounding like "640KByte is enough for everyone", I have to ask why you think you need 16GB to check your email next year. I'm typing this on a 6 year old computer, I'm running quite a few applications at the same time and I know a second user is logged in. Current memory usage: 764Meg RAM. As a general rule, I know that Windows XP runs fine on 512Meg RAM and is comfortable with 1GB RAM. The same is true for GNU/Linux running Gnome.
Now, at work with Eclipse loaded, a couple of application servers, a database and a few VMs... Yeah, there indeed you get memory starved quickly. You have to keep in mind that such usage pattern is not that of a typical office worker. I can imagine that a heavy Photoshop user would want every bit of RAM he can get too. The Word-wielding-office-worker? I don't think so.
Now, I can't speak for Vista. I heard it runs well on 2GB systems, but I can't say. I got a new work laptop last week and booted briefly in Vista. It felt extremely sluggish and my machine does have 4Gig RAM. Anyway, I didn't bother and put Debian Lenny/amd64 on it and didn't look back.
In my view, you have quite a twisted sense of reality regarding the computers people actually use.
Oh, and frankly... If cosmic rays would be a big issue by now with huge memories, don't you think that more people would be complaining? I can't say why Ubuntu/amd64 ran fine on your machine. Perhaps GNU/Linux has built-in error correction and marks bad RAM as "bad".
Ahhh...the great dumpster continuum. Many a free computer will be found there. -- sowth (748135)
I have found that memtester (http://pyropus.ca/software/memtester/), which is run as a user-level process under linux, does an excellent job of finding bad ram. I had two instances of memory modules that passed memtest86+ but failed memtester.
All of my computers that run for days on end without rebooting have ECC ram in them (Home server, workstation at work). Others must be rebooted every now and then.
Are there laptops that use ECC RAM? I wish I could buy some.
In Soviet Russia, articles before post read *you*!
Is ECC memory worth the money in a machine you use to check your E-mail? Can't you just reboot and/or replace the memory if errors occur?
I could see it happening when the cost of ECC memory is no higher than normal memory and using ECC memory has no or minimal impact on performance. Until then, I don't expect to start seeing it in desktop machines.
If you want ECC memory on your desktop, feel free to build your own machine with a motherboard that supports ECC memory. Some high end desktops do support ECC memory already.
My first computer was an 80286 with 1 MB of RAM. That RAM was all parity memory. Cheaper than ECC, but still good enough to positively identify a genuine bit flip with great accuracy. My 80386SX had parity RAM, and so did my 486DX4 120. I ran a computer shop for some years, so I went through at least a dozen machines ranging from the 386 era through the Pentium II era, at which point I sold the shop and settled on an AMD K6-2 450. And right about the time that the Pentium was giving way to the Pentium II, non-parity memory started to take hold.
What protection did parity memory provide, anyway? Not much, really. It would detect with 99.99...? % accuracy when a memory bit had flipped, but provided no answer as to which one. The result was that if parity failed, you'd see a generic "MEMORY FAILURE" message and the system would instantly lock up.
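To make the "detects but cannot locate" point concrete, here is a minimal sketch of even parity over a single byte. The function names are made up for illustration, and real parity RAM did this in hardware with one extra bit per byte; this only shows the arithmetic.

```python
def parity_bit(byte: int) -> int:
    """Even-parity bit for an 8-bit value: 1 if the number of set bits is odd."""
    return bin(byte & 0xFF).count("1") % 2


def check(byte: int, stored_parity: int) -> bool:
    """True if the byte still agrees with the parity bit stored next to it."""
    return parity_bit(byte) == stored_parity


original = 0b10110010
stored = parity_bit(original)

one_flip = original ^ 0b00001000    # a single flipped bit
two_flips = original ^ 0b00011000   # two flipped bits in the same byte

print(check(original, stored))    # intact byte passes the check
print(check(one_flip, stored))    # single flip fails the check
print(check(two_flips, stored))   # double flip passes the check unnoticed
```

The check only says "something in this byte changed". It cannot say which bit, and it misses any even number of flips in the same byte, which is why the machine could do nothing better than halt.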
I saw this message perhaps three times - it didn't really help much. I had other problems, but when I've had problems with memory, it's usually been due to mismatched sticks, or sticks that are strangely incompatible with a specific motherboard, etc. none of which caused a parity error. So, if it matters, spend the money and get ECC RAM to eliminate the small risk of parity error. If it doesn't, don't bother, at least not now.
Note: having more memory increases your error rate assuming a constant rate of error (per megabyte) in the memory. However, if the error rate drops as technology advances, adding more memory does not necessarily result in a higher system error rate. And based on what I've seen, this most definitely seems to be the case.
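The arithmetic behind that note can be sketched as follows. The per-megabyte rates below are made-up illustrative numbers, not measurements; the only point is that the system rate is the per-megabyte rate times the capacity, so a falling per-megabyte rate can cancel a growing capacity.

```python
def system_error_rate(per_mb_rate: float, megabytes: int) -> float:
    """Expected errors per unit time for the whole machine (simple linear model)."""
    return per_mb_rate * megabytes


# Hypothetical rates, chosen only to illustrate the scaling.
old_pc = system_error_rate(per_mb_rate=1e-6, megabytes=512)
new_pc_same_tech = system_error_rate(per_mb_rate=1e-6, megabytes=8192)
new_pc_better_tech = system_error_rate(per_mb_rate=1e-6 / 16, megabytes=8192)

print(new_pc_same_tech / old_pc)     # 16x the memory at the same per-MB rate
print(new_pc_better_tech / old_pc)   # 16x the memory at 1/16 the per-MB rate
```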
Remember this blog article about the end of RAID 5 in 2009? Come on... are you really going to think that Western Digital is going to be OK with near 100% failure of their drives in a RAID 5 array? They'll do whatever it takes to keep it working because they have to - if the error rate became anywhere near that high, their good name would be trashed because some other company (Seagate, Hitachi, etc) would do the research and pwn3rz the marketplace.
I have no problem with your religion until you decide it's reason to deprive others of the truth.
With memory becoming so plentiful these days (I haven't seen many home PC's with 6 or 8GB granted, but we're getting there) it seems that a single error on a large capacity chip is getting more and more trivial. Isn't it a waste to throw away a whole DIMM? Why isn't it possible to "remap" this known-bad address, or allocate some amount of RAM for parity the way software like PAR2 works? Hard drive manufacturers already remap bad blocks on new drives. Also it seems to me that, being a solid state device, small failures in RAM aren't necessarily indicative of a failing component like bad sectors on a hard drive are. Am I missing something really obvious here or is it really just easier/cheaper to throw it away?
First, it was not cosmic rays; memory was tested in a lead vault and showed the same error rate. Turns out to have been alpha particles emitted by the epoxy / ceramic that the memory chips were encapsulated in.
That said: Quite clearly given your experience, Vista and Ubuntu load the memory subsystem quite differently. It is possible that Vista, with its all-over-the-map program flow, is missing cache a lot more often and so is hitting DRAM harder; I don't have the background to really know. I believe that Memtest86, in order to put the most strain on memory and thus test it in the most pessimal conditions, tries to access memory in patterns that equally hit physical memory hardest. But, what I have found is that some OSs, apparently including Ubuntu, will run on memory that is marginal, memory that Memtest86 picks up as bad.
As for ECC in memory... The problem is that ECC carries a heavy performance hit on write. If you only want to write 1 byte, you still have to read in the whole QWord, change the byte, and write it back to get the ECC to recalculate correctly. It is because of that performance hit that ECC was deprecated. The problem goes away to a large extent if your cache is write-back rather than write-through; though there will be still a significant number of cases where you have to write a set of bytes that has not yet been read into cache and does not comprise a whole ECC word.
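The read-modify-write cost described above can be sketched like this. It is a toy model: `check_bits` is a stand-in (a plain XOR of the bytes), not the SECDED Hamming code real ECC modules use, and the class and counters are hypothetical names used only to count memory operations.

```python
WORD_BYTES = 8  # one 64-bit ECC word


def check_bits(word: bytes) -> int:
    """Stand-in for the real ECC code: XOR of all bytes in the word."""
    code = 0
    for b in word:
        code ^= b
    return code


class EccWord:
    """One ECC-protected word plus counters for the memory operations it needed."""

    def __init__(self) -> None:
        self.data = bytes(WORD_BYTES)
        self.code = check_bits(self.data)
        self.reads = 0
        self.writes = 0

    def write_word(self, word: bytes) -> None:
        """Full-width write: check bits are computed from the new data alone."""
        assert len(word) == WORD_BYTES
        self.data, self.code = word, check_bits(word)
        self.writes += 1

    def write_byte(self, offset: int, value: int) -> None:
        """Partial write: read the old word, patch one byte, write it all back."""
        self.reads += 1
        patched = bytearray(self.data)
        patched[offset] = value
        self.write_word(bytes(patched))


w = EccWord()
w.write_word(bytes(range(8)))   # one write, no read needed
w.write_byte(3, 0xFF)           # one extra read plus one write
print(w.reads, w.writes)
```

A write-back cache hides most of this because whole cache lines, which are multiples of the ECC word, are written out at once.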
That said, it is still used on servers...
But I don't expect it will reappear on desktops any time soon. Apparently they have managed to control the alpha radiation to a great extent, and so the actual radiation-caused errors are now occurring at a much lower rate, significantly lower than software-induced BSODs.
My experience with a server that recorded about 15TB of data is something like 6 bit-errors per year that could not be traced to any source. This was a server with ECC RAM, so the problem likely occurred in busses, network cards, and the like, not in RAM.
For non-ECC memory, I would strongly suggest running memtest86+ for at least a day before using the system and, if it gives you errors, replacing the memory. I had one very persistent bit-error in a PC in a cluster that actually required 2 days of memtest86+ to show up once, but did occur about once per hour for some computations. I also had one other bit-error that memtest86+ did not find, but the Linux commandline memory tester found after about 12 hours.
The problem here is that different testing/usage patterns result in different occurrence probability for weak bits, i.e. bits that only sometimes fail. Any failure in memtest86+ or any other RAM tester indicates a serious problem. The absence of errors in a RAM test does not indicate the memory is necessarily fine.
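A toy illustration of why one test pattern can miss a weak bit that another pattern catches. The fault model here (a cell that loses its lowest bit only when its left neighbour holds 0xFF) is invented for the example; real testers such as memtest86+ run against physical memory and use many more patterns.

```python
class FlakyMemory:
    """Simulated memory with one pattern-sensitive cell (hypothetical fault model)."""

    def __init__(self, size: int, weak_addr: int) -> None:
        self.cells = bytearray(size)
        self.weak_addr = weak_addr

    def write(self, addr: int, value: int) -> None:
        self.cells[addr] = value

    def read(self, addr: int) -> int:
        value = self.cells[addr]
        # The weak cell drops bit 0 only when its left neighbour holds 0xFF.
        if addr == self.weak_addr and addr > 0 and self.cells[addr - 1] == 0xFF:
            value &= 0xFE
        return value


def run_pattern(mem: FlakyMemory, pattern: list) -> list:
    """Fill memory with a repeating byte pattern, read back, return failing addresses."""
    size = len(mem.cells)
    for addr in range(size):
        mem.write(addr, pattern[addr % len(pattern)])
    return [a for a in range(size) if mem.read(a) != pattern[a % len(pattern)]]


mem = FlakyMemory(size=64, weak_addr=9)
print(run_pattern(mem, [0x55]))         # uniform pattern: the weak cell looks fine
print(run_pattern(mem, [0xFF, 0x01]))   # alternating pattern: address 9 fails
```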
That said, I do not believe memory errors have become more common on a per computer basis. RAM has become larger, but also more reliable. Of course, people participating in the stupidity called "overclocking" will see a lot more memory errors and other errors as well. But a well-designed system with quality hardware and a thorough initial test should typically not have memory issues.
However, there is "quality" hardware that gets it wrong. My ASUS board sets the timing for 2 and 4 memory modules to the values for 1 module. This resulted in stable 1 and 2 module operation, but got flaky for 4 modules. I finally moved to ECC memory before I figured out that I had to manually set the correct timings. (No BIOS upgrade available that fixed this...) This board has a "professional" in its name, but apparently, "professional" does not include use of generic (Kingston, no less) memory modules. Other people have memory issues with this board as well that they could not fix this way. It seems that sometimes a design is just bad, or that even reputable manufacturers do not spend a lot of effort fixing issues in some cases. I can only advise you to do a thorough forum search before buying a specific mainboard.
Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
my experience has been to never use computer memory around reza lockwood, she walks around and the whole building shakes (she's about five hundy or so) and simms and dimms get unseated. then the document that you are working on disappears. if you are like me you hate it when this happens
One thing I've noticed is that Linux and Windows have much different access patterns for memory.
So one OS may show problems while the other is running just fine. So... there could likely BE a problem, but it just does not show up as often in one OS or another.
These days 'memory' problems are not always caused by the RAM itself. A lot of boards based on Nvidia's 680i chipset, for example, exhibited a problem when using all 4 slots on the board. They would run fine on 2 slots with most combinations of DIMMs... but would start having issues when you used all 4.
Anyhow, I would be less worried about cosmic rays and more about the general configuration of your PC.
Ram, CPU (especially with the memory controller on chip these days), Motherboard, etc... all can create stability problems in regards to memory.
(As a side comment, both a friend and I were bitten by the 680i problem back when the boards had come out. Bummed us out to the point of refusing to use another nvidia northbridge.)
Then it would proba%ly alter not just one byte, b%t a chain of them. The cha%n of modified bytes would be stru%g out, in a regular patter%. Now if only there were so%e way to read memory in%a chain of bytes, as if it w%re a string, to visu%lize the cosmic ray mod%fication. hmmm...
intellectual property law is philosophically incoherent. it is your moral duty to ignore it or sabotage it
Not all memory is created equal. Memory can be bad if Memtest detects errors, or you can simply be running it at the wrong settings. Usually there are both "normal" and "performance" settings for memory on higher end motherboards, or sometimes you can tweak all sorts of cycle-level stuff manually (CAS latency etc.).
Try running your memory with the most conservative settings before you assume it's bad.
.: Max Romantschuk
Depending on where it fails (if it fails in the same spot) you can relatively easily work around it and not throw out the remaining good portion of the stick. I wrote a howto..
http://gquigs.blogspot.com/2009/01/bad-memory-howto.html
I've been running on Option 3 for quite some time now. No, it's not as good as ECC, but it doesn't cost you anything.
1) Do people trust a memtest86 error to mean a bad memory module or motherboard or CPU?
Well, I'd add some other possibilities such as:
Bad power supply,
Memory isn't seated properly in its socket.
Incorrect timing set in BIOS.
Memory is incompatible with your motherboard.
etc..
But yeah, if memtest86 says there's a problem then there really is something wrong.
is to swap the memory modules to find out which is causing the problem, if not the motherboard. Also, I don't see how memory tests running inside an OS can be effective; I'd much rather boot off a smaller system on a DVD, USB stick or floppy to run a memory test. Dell servers have those Dell Diagnostics CDs that are very small in memory footprint just in order to run diagnostics on memory. But even they're not perfect, so you often have to take memory out and see if you can reproduce errors.
Memtest86 is the usual test tool for a couple of reasons (and only one of those is price).
Chances are very good you have a problem. Definitely worth checking it out.
1) Re-run the test and see if the error is in the same place. If it is, you can pretty much guarantee the RAM is bad at that position.
2) Swap the memory out and try again. You're best to do this while you still can under warranty.
Bottom line is you're not paranoid and you probably do have a problem. You can either deal with it up front or live with a compromised system that eventually bites you on the backside.
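Step 1 above (re-run and see whether the error lands in the same place) can be sketched as a small comparison. This assumes you have copied the failing addresses from each run by hand; the addresses below are invented, and `classify_failures` is a hypothetical helper, not part of Memtest86.

```python
def classify_failures(runs):
    """Split failing addresses into those seen in every run and those seen only sometimes."""
    if not runs:
        return [], []
    persistent = set.intersection(*runs)
    seen = set.union(*runs)
    return sorted(persistent), sorted(seen - persistent)


# Made-up failing addresses from three separate test runs.
runs = [
    {0x1F3A0040, 0x2B000010},
    {0x1F3A0040},
    {0x1F3A0040, 0x30000000},
]

persistent, intermittent = classify_failures(runs)
print([hex(a) for a in persistent])     # fails every time: likely a bad cell
print([hex(a) for a in intermittent])   # comes and goes: timing, voltage, seating?
```

An address that fails in every run points at the module itself; failures that move around are more consistent with the timing, voltage, power supply or seating causes raised elsewhere in the thread.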
These posts express my own personal views, not those of my employer
Please, please tell me this is an early April fool's joke. If not, dear submitter, I hope that you're either very tired or very drunk right now because you literally just asked:
"Windows is crashing randomly and the program that I ran to test the memory is reporting errors. Does that mean the memory in my computer is bad?"
You should have also tried running a hacked version of OS X to serve as a tie-breaker.
It has been completely stupid to dispense with parity memory and ECC memory in PC's. Apple was the first to go to 8-bit memory bytes long ago (and they still cost more!) and now it seems everyone below the server level is happy playing without a net. Even GPU cards, if used for highly parallel FP calculations should have the ability to detect when a memory error has happened and signal the application to handle. Completely stupid, and beyond completely stupid, that we trust our calculations to a system now that can't even determine if it has made an error!
"It's the height of ridiculousness to say for those 9 lines you get hundreds of millions."
really, it's not that much more expensive. Search newegg for unbuffered ecc, if you are using a desktop class system that can't handle registered ram.
You wouldn't put data you care about on a hard drive without raid, would you?
Was it cosmic rays, or Alpha particle decay from impure materials that was going to do in our memory soon? IIRC it was the latter.
Yes. I do, anyway; I've never had it report a false-positive, and it's always been one of the three (and even if it was cosmic rays, it wouldn't consistently come up bad, then, would it?). Then again, it could also mean that you could be using RAM requiring a higher voltage than what your motherboard is giving it. If it's brand-name RAM, you should look up the model number and see what voltage the RAM requires. Things like Crucial Ballistix and Corsair Dominator usually require around 2.1v.
Depends. If you're doing really important stuff then sure. ECC memory is quite a boon in that case. If you're just using your desktop for word processing and web browsing, it's a waste of money.
Screw the rules, I have green hair!
My ASUS mobo (A8N-SLI) would automatically reduce the memory timings if I put 4 memory modules in. I hated that, so I used the BIOS to undo it. I ran MemTest to make sure it was okay.
Oddly, the only RAM I've ever really had problems with was some bad-ass Corsair memory I bought for my 800 FSB P4 early on. The timings in the SPD would prevent the system from booting, even if it was the only RAM in the system. I overrode this in the BIOS (on one of the rare occasions that it booted) and it was okay unless I cleared CMOS. After a few months I removed that RAM to send it back and put in the cheapest PC3200 RAM I could find at Fry's. That fixed it, and as I altered the settings in the CMOS over time, I could overclock this RAM to the same speed as the Corsair stuff. And it would work. And if I cleared CMOS it would just slow back down instead of failing.
To Corsair's credit, they replaced my RAM, although the replacement was in the same series, it did not have the same timings as the original RAM. But at least it worked.
It's funny, if you look up DDR SDRAM on Wikipedia, it has pictures of essentially the Corsair RAM I used. The version number is the same as the later RAM I got that worked; the earlier stuff was v1.1 or something but otherwise looked the same. The C2 in the name is supposed to mean it is CAS latency 2 RAM, but as I mentioned, the replacement RAM I received was not actually as fast; it was CL3.
http://lkml.org/lkml/2005/8/20/95
Most failures will appear when a PC is heavily stressed.
Combine the Memtest86 test with some continuously running programs that use memory, hard disk, CPU and network.
If a system survives this test for a whole day, it's in perfect shape.
I have the same configuration, 4GB DDR2-6400, and Vista/Win7 would either run for days without a crash, or it would have one day where it would crash every 30 minutes. No BSOD, just a blank screen like the video card stopped working (HDCP DVI monitor too). When I pressed the reset button, Windows would come back but there would be blue lines all over the place; turning the monitor off and then back on would fix that.
But it wasn't until I disconnected one of the front chassis fans this all (apparently) went away. I disconnected it because it kept rattling. That had me think that either the +5v rail was overloaded on the power supply (all 4 fans were in parallel, not including the CPU, GPU and PSU fans themselves.) But it could also have been the proximity to the hard drives, as the fan disconnected was in the array in front of the hard drives. Cooling or vibration? I don't know, but if it doesn't black-screen of death to me for another two weeks I'm going to call that the culprit.
But as for the OP, only really rubbish memory is going to be picked up as bad by memtest86 and such. If you run commercial software like pccheck/thetroubleshooter you'll often find that certain hardware repeatedly fails, but works anyway. This is because the software is much more of a "stress test" that forces the hardware to register a fail state under conditions that may never be met by the operating system.
In the day of 500$ computers, it's quite naive to assume you'd get good equipment without building it yourself, however unless you spend 2000$ on (all quality parts) hardware you won't get decent parts... you may as well buy a mac pro.
If you aren't going for ECC memory, you should be buying whatever is the maximum performance memory for your mainboard, and stay away from rubbish boards made by biostar. Want to know if the motherboard manufacturer is any good? See if the manufacturer has updated the BIOS of the previous generation board more than 3 years after it was manufactured. You will usually find the rubbish boards made by biostar and MSI in eMachines/Gateway systems and virtually all low-end desktops and servers.
Are all biostar and MSI boards bad? yes in my experience. These are the brands that traditionally used the third party north and south bridge chips that don't quite work.
I'd pick Asus or Gigabyte over any other brand based on experience. These are the only two companies that produce over-engineered desktop boards that don't fail. I'm not including primary server boards, but the dell and tyan boards used in my datacenter haven't failed once, but all the biostar's have. They are the ones that don't have or support ECC memory. Go figure.
While lots of people are making fun of your seemingly paranoid concern about the destructive, deadly cosmic ray, I'm here to support you. We get hit by cosmic rays every second, and our skulls are just not thick enough to resist all those penetrations; that's why we lose memory from time to time. Have you ever found you forgot something that happened just an hour ago? That's why. Wearing a tinfoil hat is the only safeguard against unexpected memory degradation.
Besides cosmic rays, other forms of harmful radiation should not be neglected. The radiation emitted from computer processors and monitors can also cause deformation of your unborn children, therefore you should buy anti-radiation suits for your pregnant loved ones. Remember, the harmful effect is irreversible; you don't want to take the risk.
Last but not least, reports have shown a high correlation between impotence and prolonged computer use. I have friends who had their balls fried by sitting near a computer with a 2.4GHz CPU, because the frequency is exactly the same as your microwave oven. We just can't tell how many poor dudes have had their balls disabled this way. Sad.
You can't be too careful. At the end of days, when the streets are stuffed with mindless, impotent zombies, guess who has the last laugh.
I've seen lots of RAM errors as the speed of memory has increased, especially with the AMD64 Hammer chips. What it usually boils down to is someone not manufacturing their components such that they truly meet their spec.
If you slow down your memory and the errors go away, it's not cosmic rays. AFAIK, cosmic rays will flip bits regardless of how fast the RAM is being run at.
That's only a little more than a few orders of magnitude of orders of magnitude.
paintball
I'm old enough to remember that rubbish in the press as well. But it started before 512Meg. I remember asking a clerk if my $450 4Meg SIMM would degrade in potentially high-radiation environments, like the crap that comes from my microwave.
For a cosmic ray to have enough energy to flip a bit of memory would be fairly impressive. Has to hit the right spot + be the right energy to stimulate a relatively large device to think it's got a fresh signal. Not too likely.
I'd be more worried about that same cosmic ray causing a DNA error and giving me cancer.
Now of course if you were designing interstellar probes you would have a definite concern on your hands. Once out past the Oort cloud ( OK farther than that ) you no longer have the magnetic shield of our sun. Now we are in the cosmic ray bath. This is where the odds of a bit flip start to get high. Now let's add the fact that your little probe is going to be out there a LONG time. I can bet a few bucks that yah you are going to suffer from a bit flip or 7. :)
---
Oh Vista 64bit is a nightmare. 3 machines of mine have had it. ALL had major issues. back to XP 32 and ZERO issues. All of them just sing along now. My home server and a few minor devices run Ubuntu NEVER had an issue.
Am I looking forward to Windows 7? Nope. It means Win XP will really die and MS won't patch it. ( Prediction. Win 7 will suck as bad as Vista when people figure out it doesn't actually work on a EEEEEEeeeeeeeeeeePC after they install crap. ) ( Second prediction. The much touted touch screen interface additions in Win7 actually are really annoying to use. )
----
Back to the Oort cloud. Why are you sending a PC out there again?
The stuff is so cheap now, it only costs a tiny bit more to buy the brand name stuff so you're fine.
When it was 800$ AUD for 64mb of ram, the cheap '500$ stuff!!' was an option, sadly it was the wrong one but all we could afford back then. :/
If anything needs more quality control it's either hard disks or high end gaming video cards, which literally seem to burn out between 3 and 24 months nowadays
Wrap your computer (or RAM) in 10 cm of lead, and see if the error persists.
If not, buy lead producers' stock to profit off the heavy metal computer buyers of tomorrow.
I usually wear medieval armour. Not only does that work as efficient as tinfoil, it's also very fashionable.
I've built many a system, and discovered that
1. removing & re-socketing the DIMM in question, sometimes 3 or 4 times ( testing between ),
*almost always* fixes the problem:
it's a CONNECTION problem, not a DIMM problem.
( dust in the socket? slight pressure-differences between "pins" and contact-pads?
whatever you do, don't wobble the DIMM when socketing it, or that'll push the contacts away *just enough* to make the connection erratic ).
2. Also, ASUS makes AMD based motherboards that DO have ECC, and some of us, who LIVE by our computers, insist on 'em.
( *drastically* cheaper than an Opteron/Xeon system, but still ECC reliability? Yowza, baby! )
3. The PSU oft is the culprit: RAM that tests fine, when the disks aren't in use, is being under-volted by the PSU, so it *becomes* erratic, when the *whole system* is under load.
Cheers, people!
sorry buddy, but random crashes on a vista machine are a feature not a bug. I've been using linux for the last 6 years and not once have I had an unexplained crash. I started a new job in a developer environment which is exclusively windows and within the first week my vista machine had to be wiped and reinstalled to solve the problem of repeated and unexplained crashes. Colleagues have had their machines simply reboot for no reason, and the performance of the machine is painful. I can unzip the same file faster in a virtual box linux guest than I can in the vista host, pardon my language, but it's a f**king joke that vista is being pushed as a "replacement"
prepare the survey weasels.
I recently had a lot of problems with the new rig I bought. It ran stable for a couple of weeks and then started BSOD'ing on me and failing in memtest86+
To cut a long story short, it turned out that the ASUS P5Q motherboard by default (auto) gives only 1.8V to the RAM. But reading on my ram I noticed a tiny 2.1V label. Setting the bios to that manually made the system run stable.
So be sure to check your bios settings before concluding that the memory is bad!
I'd mod you down but you're already at -1. Stop whining about kdawson and whine about the posts instead! n00b
Requiem for the American Dream
much MUCH smaller.
Travelling 1 um through paper doesn't get you to the other side. It will get you through several bit sites in modern RAM.
Get the latest BIOS for your motherboard and flash it to that version.
I recently had a perfectly good set of RAM degrade to useless in
only 3 months here
This could be caused by contact failure, especially if tin/lead
contacts are mixed with gold connectors. Any electrical contact is subject to
fretting corrosion that eventually makes the contact unreliable.
Here are some articles showing why fretting corrosion occurs and
what to do about it:
http://www.chemassociates.com/products/findett/PPEs_Swedish_Cell.pdf
http://www.nyelubricants.com/lubenotes/LN_Sta_Sep_Elec-04-2.pdf
http://archives.sensorsmag.com/articles/0500/78/main.shtml
http://eprints.soton.ac.uk/51024/01/Final_Tribology_paperJMcB__(A).pdf
An old radio engineer's trick from the 1930's is to coat the contact
with ordinary vaseline. It is a hydrocarbon and cleans the grime and
oxides from the surface allowing a true metal-to-metal contact. This
reduces the contact resistance by a factor of ten and stabilizes it.
The vaseline leaves a film that lubricates the contact and
eliminates the fretting corrosion. It works on memory cards, power
connections, SATA connectors, pcb contact fingers, and any other
connector in the PC.
For more information, please see my post on mysteryonion's page on
solving Kenmore front load washer fault codes at
http://www.flickr.com/photos/mysteryonionpatch/471156850
To find it, search for "monettsys". It is dated Wed Feb 25, 2009,
11:58:03 pm, near the bottom of the page.
Regards,
Mike Monett
I always run Memtest86 (or Memtest86+) for at least a couple of full passes (preferably overnight) whenever I build a new system or upgrade RAM. It gives you a pretty good "zeroth order" indication of whether your motherboard and RAM are stable. I don't even bother trying to install an OS until I can get Memtest86 to run clean. As others have already noted, if it reports errors, you can be pretty certain that you have a problem; if it runs clean overnight the RAM is probably OK, but there is still a slight chance that there may be some issues.
Regarding ECC, I think it is a travesty that most consumer desktop motherboards do not even have the option of enabling it in the BIOS. When I put together PCs, I try to use motherboards which I know support ECC, and pay the extra few dollars for ECC memory modules. I've found that Asus desktop motherboards often do support ECC (unlike most of the other brands).
I would be interested to see if you could use memory errors as a method to detect cosmic rays.
xterm -n 8
Why are the editors above criticism? There are plenty of better issues to cover, not only that but it is mostly kdawson's style to post below average to simply out-right bad articles with biased and/or misleading summaries with provocative titles.
Articles tagged Flamebait
It always seems to be mostly the same few people that keep popping up, guess who is one of them?
Since nobody's mentioned it yet:
More recent versions of Red Hat come with EDAC (formerly known as bluesmoke) enabled and will throw parity errors to the syslog ...
http://bluesmoke.sf.net/
http://buttersideup.com/edacwiki/Main_Page
Its predecessor, Linux-ECC, also has a plug by DJB for its use with some decent details:
http://cr.yp.to/hardware/ecc.html
http://www.anime.net/~goemon/linux-ecc/
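If you'd rather poll the counters than grep the syslog, here's a rough sketch. It assumes the EDAC driver for your chipset is loaded and exposes the usual per-controller ce_count and ue_count files under /sys/devices/system/edac/mc. The function name and the root argument are just mine for illustration.

```python
from pathlib import Path


def read_edac_counts(root="/sys/devices/system/edac/mc"):
    """Return {controller_name: (corrected, uncorrected)} from EDAC sysfs counters."""
    counts = {}
    base = Path(root)
    if not base.is_dir():
        return counts  # EDAC not loaded, or no driver for this chipset
    for mc in sorted(base.glob("mc[0-9]*")):
        try:
            ce = int((mc / "ce_count").read_text().strip())
            ue = int((mc / "ue_count").read_text().strip())
        except (OSError, ValueError):
            continue  # skip controllers whose counters cannot be read
        counts[mc.name] = (ce, ue)
    return counts


if __name__ == "__main__":
    for name, (ce, ue) in read_edac_counts().items():
        print(f"{name}: corrected={ce} uncorrected={ue}")
```

An empty result most likely means the driver isn't loaded, not that the RAM is clean.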
o/~ Join us now and share the software
When you take a look at computers from a movie's perspective, such as Tron, you can see the reason why we are having so many issues with high error rates: it's just the programs not wanting to work with each other. ECC is nice for servers and big data crunchers who can't have an error except once every million or ten bytes. If you're willing to shell out the cash for it, more power to you. But 80% if not more of computer users don't even notice the errors because they almost never end in a blue screen of death. Personally I blame it all on Microsoft.
This is a Mac, what you have there is an embarrassment to your fellow computer users.
My experience with Memtest86(+) has been good so far. Several years ago after plugging in more RAM, my PC seemed to work fine at first, but then crashed during games.
So I started up memtest86. The first tests always passed without errors, but then I think the blockmove test showed lots of errors.
I tried different BIOS settings and the only thing that got it to work was setting 'command rate' to 2T. The errors in Memtest were gone and the crashes in those games as well.
Since then I believe it's a very nice tool and if my PC seems unstable, I run it a few hours to make sure, every test passes multiple times without error.
( Last time, after upgrading to 4GB RAM, I ran memtest for 8 hours non-stop over night and there was not a single error. )
1) consistent BSODs with Vista64...
2) memtest86 and it always failed within hours...
3) 64-bit Ubuntu at 100% load... ran fine for days...
Saw a similar thing with an unusual cause:
Is it an Intel with the separate northbridge chip?
Turns out Vista puts more strain on the northbridge than Ubuntu. The seriously overworked and pernickety NB chips used on many mobos will start playing up when under heavy use.
So memory errors and BSODs on Vista but not Linux. Added extra cooling to the mobo northbridge, all problems went away.
http://www.ida.liu.se/~abdmo/SNDFT/docs/ram-soft.html
This references an IBM study, which is what I think I actually remember but could not find quickly this morning.
"In a study by IBM, it was noted that errors in cache memory were twice as common above an altitude of 2600 feet as at sea level. The soft error rate of cache memory above 2600 feet was five times the rate at sea level, and the soft error rate in Denver (5280 feet) was ten times the rate at sea level."
If you buy cheap shit RAM, you'll get exactly the same as the OP. Get decent RAM and you'll not have an issue. For the last decade, I've bought nothing but Crucial and never had an issue with it. The odd time I've bought cheapass generic in that time, it's bitten me in the ass without fail.
I only please one person per day. Today is not your day. Tomorrow isn't looking good either. - Scott Adams
MacPro Towers can use ECC memory, lots of it.
music lover since 1969
I was recently running a server that archived about 2 terabytes of data and got something like one or two bit errors per year that could not be traced to any known source. This server boasted ECC RAM, so the problem didn't occur in the ram, and it was unlikely for it to have occurred in the FSB.
If you go with non-ECC, I would suggest running memtest86+. If you get errors, swap the memory. If swapping the memory still doesn't take care of it, swap motherboards! I recently had a memory problem in one of my customers' racks, and running memtest86+ got nothing until I had it running on my bench for over a week. There may be some problems with memtest86+...I even had another bit-error that memtest86+ did not find, but a Linux commandline memory tester found a problem almost immediately
The problem here is that different testing/usage patterns result in different probabilities of finding potentially bad words, e.g. words that may only be bad if you read from them a hundred cycles consecutively. But, if you do see a failure in memtest86+ or the CLI tester, you got yourself a serious problem. The point to take from this is that if you don't see errors, that doesn't mean you don't have errors!
Having said this, I still don't think memory errors among PCs are that common. We have more RAM in machines these days, but at the same time, the manufacturing processes have become better. My personal conviction is that although the likelihood of a word error has increased with the number of words in memory, the RAM itself has become so much more "solid" that the effect of the extra memory is negligible. Now, if you do dumb things with your computer like running it without a case, not giving it ventilation (learned this the hard way), or overclocking it, you *WILL* still run into problems. But if you design a system with quality and integrity, you typically shouldn't have these issues with memory!
One last thing to point out: there is quality hardware, and there is cheap hardware. My PC-Chips motherboard ran for three months and two days, and I didn't have a problem. Two days out of warranty. Now, take my MSI motherboard. It sets the timing for all memory modules to the values of a single module. This resulted in stable single-module operation, but it got flaky with all four modules. I finally moved to ECC before I figured out that I had to manually set the correct timings. This board is an ultra board, but apparently that does not extend to working with generic (Micron, Corsair, etc!! - tried 'em all) memory modules. People on the Newegg reviews board have memory issues with this board as well that they could not fix with a BIOS update, and it appears that sometimes a design just is bad! Even the "good" manufacturers do not spend a lot of effort to fix issues in some cases.
My words of advice: Do your homework. Read through the reviews. AND DON'T BUY HARDWARE AS SOON AS IT COMES OUT!
why do you think the bsod is a memory glitch? particularly after running another OS that ran fine? Don't they teach troubleshooting anywhere? You should go learn it.
-- Programming with boost is like building a house with lego. It's a cool but I wouldn't want to live in it
I don't see why this dude is at -1
This is pretty stupid.
While you keep the cosmic rays out you also keep the electromagnetic waves in.
You get a standing wave in which energy is injected all the time.
More precisely until a plasma forms and the computer explodes.
Don't do this !
Actually, 3GB isn't as sweet a spot as people seem to assume.
In a system where you have 2 DIMM slots (eg a laptop), it's very much advisable to put in 4GB, being 2x2GB modules and still have dual-channel access to your memory.
Dual-channel doesn't work with DIMMs that are different in size.
For most 'normal' boards, this is the same. Using 4x1 or 2x2 is performance-wise better than using 1+2. The cost involved in this is negligible nowadays, so there's no reason not to do it, even if you do 'lose' 1G.
The 3GB limit also entirely depends on whether your system maps 1G to PCI/Graphics/etc. Some systems only map 512M, some 768, etc. So depending on that, you might actually end up with 3.5 or 3.25 usable.
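To put numbers on that, here's a quick back-of-the-envelope sketch. It assumes a plain 32-bit address space (4096 MiB) with no memory remapping above 4G, and the function name is just something I made up.

```python
def usable_ram_mib(installed_mib, reserved_mib, address_space_mib=4096):
    """RAM left visible after the chipset reserves address space for PCI/graphics."""
    return min(installed_mib, address_space_mib - reserved_mib)


if __name__ == "__main__":
    # 4 GiB installed, with the three mapping sizes mentioned above
    for reserved in (1024, 768, 512):
        usable = usable_ram_mib(4096, reserved)
        print(f"{reserved} MiB mapped -> {usable / 1024} GiB usable")
```

With only 3 GiB installed and 1 GiB mapped, nothing is lost, which is where the "3GB sweet spot" idea comes from.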
Regards,
Splut
Coz eternity my friend, is a long *ing time.
2) A prerequisite (and an expensive and hard-to-find one) for ECC memory is having a motherboard/chipset that *supports* ECC memory. Usually this means a server-class motherboard, but Intel usually has at least one high-end desktop motherboard on their line-up with ECC capability.
3) I always buy machines with support for ECC memory on the motherboard/chipset (except notebooks, where it doesn't seem to be an option), and always used ECC memory on them.
Best Regards,
Durval Menezes.
I have never met a computer that didn't like me.
Are you sure your motherboard applies the correct voltage for the memory modules? This can be verified in the BIOS I believe?
I'm not sure but, I will say this. Some memory brands definitely have more issues than others. I bought (2) 1GB sticks of Patriot memory and had to RMA both after about six months, and the pair I got back lasted two months before they too started crashing my machine. I finally just yanked them and am running the original (2) sticks of 1GB Crucial that I've had since the beginning. I had bought them to run VMs, but I'm not running them now so I don't need the extra memory now anyhow.
AMD desktop CPUs can use ECC. Intel needs a Xeon, and Intel Xeon boards cost more and have less I/O than desktop i7 boards, e.g. only one PCI-E x16 slot; some don't even have one full x16 slot.
...if you catch it while hibernating
Be careful. Vista hibernates with one eye open. It can wake itself up from hibernation to do updates. I dual boot my laptop with Linux Mint (an Ubuntu variant). Every week, I'd go to turn on my computer only to find that the battery was dead. Checking the startup logs showed that linux was starting up at about 3:00 in the morning. After googling, I found out that many people were having that problem. The suggested solution was to turn off Vista automatic updates. I checked my Vista, and sure enough, it was set to update at 3am. I turned that setting off: no battery issue. I turned it back on: battery drained.
My CMOS settings pages do not have any facility for waking the laptop at a specific time, so I don't know how Vista manages it. I only know that it can. So beware! Vista hibernates with one eye open.
When our name is on the back of your car, we're behind you all the way!
If you want to prevent that kind of thing, remove the battery and power supply.
Knowledge is how to play a game, intelligence is how to win, wisdom is knowing what game to play.
Both Vista and Ubuntu have trouble with my older GeForce 6200 graphics card... Vista sometimes "snows" up when the screen goes black due to idle, but it stays up for weeks at a time and I use it for development day in and out. Ubuntu, on the other hand, blows away the nvidia driver every time I get a kernel update, leaving me stuck at 800x600, and that -really- sucks.
This is my sig.
One thing that many people seem to forget is that you have the OS talking through the chipset to various components in the system. If you are using an Intel processor from the pre-i7 days, the chipset driver may be the source of many problems.
Now, the fact that Linux works just shows that Linux may be doing things in a sane and more organized way when it comes to accessing the CPU and chipset (no matter how crazy the kernel code may seem). I have noticed that 64 bit driver support is rather weak on the Windows side of things, so that could be the source of your problems as well.
I would not jump to the conclusion that memory is your problem, but look for other factors. How good is your power supply? Power issues can cause all sorts of unpredictable behavior.
I have to ask why you think you need 16GB to check your email next year.
Between the sex spammers throwing in virus-laden teaser videos and the 419 spammers throwing in virus-laden bootleg Hollywood videos once 64-bit OSes become standard, I'm quite sure I'll need more than 16GB to read email next year.
Knowledge is how to play a game, intelligence is how to win, wisdom is knowing what game to play.
I had a similar issue. I was using an NVIDIA mother board (to SLI my NVIDIA Video cards) and Vista would freeze all. the. time. Linux was fine.
Back in October, NVIDIA released new drivers. I tried them with Windows and, surprise, no more memory issues.
The reason was Linux loaded generic chipset drivers that didn't use some of the higher-end features, same with MemTest. The Windows drivers used these features, and since they weren't accessed correctly, problems developed.
OS is irrelevant. I've had more than a handful of memory modules go bad over the years. If memtest86 (there's a memtest86+ now too) detects an error get new memory. Running different OS or tasks or switching modules around may seem to "fix" it but it's not. It's either avoiding the problem spot or the memory error didn't cause whatever to blow up. It eventually will. You might be writing bad data to your disk, miscalculating something, etc. Sometimes it looks like the OS is corrupted. Writing this makes me think it a good idea to run memtest once a month or at least once a quarter.
I used to work for a well-known computer company, fixing people's stuff over the phone.
Usually in this case, the Vista install was fubarred. HOWEVER, there were r-a-r-e times when we had a cx unpackage perfectly good RAM and pop it into a known-good mobo, and... no go. Would NOT start up. Ran fine with Ubuntu, but Vista for whatever reason wasn't happy until we took it down to two sticks. Didn't matter which slots it was in, it would only take two sticks. Even checked the BIOS to see if there was anything amiss- nothing. Their version of Vista just didn't like it, and it didn't matter how many times they reloaded with the OS disk. ::shrugs:: That's why I dual-boot with both OSs.
I have a Windows XP box that I maintain only for the purpose of playing games. There's nothing installed on that box except an A/V and Steam (Steam discussions go somewhere else, this is NOT the point I'm making).
What happens on this box is I'll have sysfader.exe bomb out and explorer.exe will often crash. This hardware runs rock solid under Ubuntu. These process errors keep popping back up and won't go away. Windows may be "fine" overall, but like any large product it has its fair share of bugs and skeletons in the closet.
Long story short: things other than cosmic rays can cause memory errors.
I once had a box that wouldn't pass memtest86 if run with 3 DIMMs installed. The sticks all tested fine individually, in any socket or combination, except for when all 3 were installed. It turned out that some cabling in the box was hanging nearby, and when moved away from the RAM the setup became rock solid.
Stability does not mean only "no crashes". It also means that you can run the same installation for years without having to re-install your OS for whatever reason.
I know it is possible to do that. I have a Windows XP installation from 2002 still going without a re-install, but it does require expert knowledge, really cautious browsing and constant tinkering and cleaning up of garbage from the registry, temp files, residues of programs that I have uninstalled (or upgraded to newer versions) etc.
Since I have switched to Mac, I have to baby sit my OS installation way less if at all. It's much more resilient to user using it :D.
As the island of our knowledge grows, so does the shore of our ignorance.
He isn't claiming to be KDAWSON. If you bothered reading his sig, you would realize that.
But the OP was talking about a *HARDWARE* problem.
you had me at #!
Parent is right... as it says: Disclaimer: This comment is not officially stated by kdawson (3715).
What about that don't you get?
You must be unlucky or the cause.
This would make a great slogan for Microsoft's new ad campaign:
"Not an actor, but he plays one on TV."
"...no one uses ECC memory..." You calling me a nobody? ECC came in this Mac Pro here. Comes in all Mac Pro models.
If I didn't have absolutely NOTHING to do, I wouldn't be here.
I'm running an ASUS M3A78-CM here.
It has the AMD780V chipset which supports ECC and it was pretty cheap. (One of the cheapest ASUS-MBs for AM2+)
The Kingston "ValueRAM" ECC module is also only slightly more expensive. (I'm running 1x1GiB, but I plan to add 2x2GiB (All DDR2-800))
"That makes no sense, and is totally wrong. You're a moron."
Now that's a clinching argument!
I had a problem with bad RAM on a brand new machine and nearly ended up sending the DIMMs back. But I wondered how that could happen, since they were Corsair Dominator, supposedly a reliable brand. Turns out that when my BIOS "loaded" the "extreme" profile to run the memory at 1600 instead of 1333, it didn't actually set the correct voltages, but only displayed them. Since the values were grayed out, I had assumed it would just set them, and so didn't check further at first. Later, after some corrupted data and many many memtest errors it occurred to me to look at the BIOS settings again. Raising the voltage on the DIMM to the correct value (which BIOS was displaying as its "extreme profile") eliminated all the errors right away. (This was on GA-EX58-UD5, by the way)
I work for a large academic HPC organization which operates ~10k cores. our typical config has 2G/core, so we have a lot of dimms, all ECC. the majority of our systems have no corrected errors (CE); a few have modest rates (few hundred/day). we replace dimms which cause either uncorrected errors or > 1k CE/day. these are typically 8 GB machines with 1G ecc dimms - the bios hides details like whether the ECC is chipkill or not, or whether it's scrubbing. but the fact that large samples of COTS dimms generate no ECCs implies that a smaller-memory desktop stands a good chance of operating without random corruption. dimms are from Micron; systems are 1U servers in pretty decent machinerooms, at close to sealevel.
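That replacement rule is simple enough to write down. This is only a sketch of the rule of thumb as described above (any uncorrected error, or more than 1k corrected errors per day); the function and parameter names are hypothetical, and how you collect the counts is up to your monitoring.

```python
def should_replace_dimm(corrected_per_day, uncorrected_errors, ce_limit=1000):
    """True if a DIMM shows any uncorrected error or exceeds ce_limit corrected errors per day."""
    return uncorrected_errors > 0 or corrected_per_day > ce_limit


if __name__ == "__main__":
    # A DIMM with a "modest" rate of a few hundred CE/day stays in service
    print(should_replace_dimm(corrected_per_day=300, uncorrected_errors=0))
    # A DIMM with any uncorrected error gets pulled
    print(should_replace_dimm(corrected_per_day=0, uncorrected_errors=1))
```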
I find that when a Windows machine, from Windows 2000 on up, when taken care not to install too many programs and/or immature or junk-ware, then Windows remains quite stable and usable.
So basically, Windows is great and stable as long as you don't try to, y'know, run programs on it? So then why would it ever be considered useful or, going a step further, worth half the cost of a peecee with Windows installed? You can get the same hardware as parts and assemble the machine yourself, or buy the whole thing from a store that will put it together for you, save the Windows Tax, and install Ubuntu on it for free. On a simple PC, which these days has 1 GB of memory and a pretty fast processor, that's a huge discount on the price, plus you get a more stable system than you would get by paying a lot more to clog up the machine with Windows.
I had some awful experiences running just Microsoft Access on Windows NT and 2000, and I generally had to switch off the power due to a serious crash at least once a day. Due to the ridiculous instability, I closed everything except Access, and the install was pretty clean, because I've never liked the various additional toolbars and launchers and stuff. So is Microsoft's own Office "too many programs and/or immature or junk-ware"?
There is a lot of software misbehavior in Windows-world. (To be fair, there is software misbehavior in MacOS and Linux as well, but I see it far less often.)
I see crappy software for Linux and OSX all the time. The difference is that I don't see the crappy software for those platforms bringing down the whole OS.
In addition to those dark Windows-and-Access days, I also have some experience writing programs to solve systems of coupled nonlinear partial differential equations representing models of certain polymer systems. Those programs were written in C on Sun and SGI workstations (and on some dumb terminals talking to those workstations). In those programs, I had to do a lot of big matrix calculations, which involved me allocating and deallocating memory quite a bit. My understanding is that those are "dangerous" operations if you do them wrong, and I know I am not a great programmer. Even so, I can only remember two problems, neither one of which came close to a BSOD or a crash requiring a hard restart.
One was that I would occasionally make a small mistake and the program (my own program, which I don't mind calling immature or junk-ware) would crash, giving me something like "Segmentation fault. Core dumped." That problem was more common when my programs were new, in about 1993-1995. When I went back in 2000 to finish my Ph.D. after having left and worked at two different companies for a while, many of the old workstations had been replaced by beige box peecees running I-don't-remember-which distro of Linux. I still modified my programs some for different situations, but I got very few segmentation faults (I don't remember any from that time, but I wouldn't be shocked if there were one or a few that I had forgotten). I do remember one occasion when the UI died on a machine I was using. I went to another and managed to determine that the machine on which I had been working was still OK and my processes were still running there. I went to the computing services guy responsible for that computer lab. He verified what I'd told him, reset the X Server on the machine I'd been using, and showed me how to do the same thing in case it happened again. After that one time, I never even had UI problems again, in stark contrast to the unpleasant experiences I've had dealing with Windows.
In stability and security (another big subject) terms, there is still a huge difference between the 40 year old UNIX model on which OSX and Linux are based and the Windows "this time for sure" advertising gimmick of the year. Remember when Vista was going to be better than OSX and Linux? Great days. Now even Microsoft itself is talking about what an utter piece of crap Vista is, and trying to get y'all to hang on for Windows 7. You gonna fall for that again?
"It is nice to know that the computer understands the problem. But I would like to understand it too." --Eugene Wigner
Short version:
1) As per my experience, yes.
2) It just depends on how critical using your PC is.
Long version: :-)
1) I have a Debian notebook which looked fine, except I noticed frequent browser malfunctions. The whole system didn't crash, but some processes apparently did. The machine wasn't critical, and only recently did I decide to run MemTest myself, which revealed the issue. Removing a bad memory module fixed the problem.
2) As long as you can afford it, the "better" your memory is, the safer you are. That means you're reasonably safe, as much as your money can buy that safety, from two possible scenarios:
a) your data gets corrupted; if it's an animated GIF on a webpage showing an annoying ad, it's a non-issue; but imagine if it was your cash balance, which then suddenly got overwritten on the filesystem (disk) by some auto-save feature
b) a running program (a "process") gets stuck on bad code; you may have even more unpredictable outcomes, but the most likely one is that you get a GPF (General Protection Fault, meaning the process was trying to read or write somewhere outside its own memory) and the process is terminated because of that.
Obviously, the odds of encountering something really nasty diminish with the bad/good RAM ratio; but who actually WANTS to take such a risk? Well, that depends: when a shoot-em-all game machine behaves badly, it's not exactly the same as when, say, a CAD workstation, or a server, does the same.
My 2 Eurocents. :-)
On the other hand performance-wise, Windows XP running on a Core i7 with 3GB of triple channel DDR3 ram should be sufficient for checking your email. Personally I am using a 1x2GB module. This was recommended to me so I could upgrade easily when 64 bit becomes more common and/or RAM drops in price. In the meanwhile I save a few watts of power by only having one module of RAM.
When you buy memory from an outlet, watch how the staff assume ESD damage is not possible as long as they grab the memory by the ends with thumb and forefinger. I have had to tell shops that they would be sacked on the spot if they were building or handling memory that way at the manufacturer. They proudly tell you they've never damaged memory, just because purchasers have not hotfooted it back to the shop minutes later to say it's broken. The damage they've caused may not result in failure for months, but it will. Even working with a large antistatic mat and a monitored wrist strap, I have found memory is the most frequently failing component.
If you haven't yet, download Prime95 right now and run it on your Windows systems.
http://www.mersenne.org/freesoft/
I've found that the prime95 stress test will catch errors that memtest won't. You need to run prime95 twice. Once right after booting the system, and a second time after the system has been running for more than a week (and in use for that week). Each run should be at least 1 hour.
In theory, memtest could pass this location automatically, or at the click of a button, which would be a lot more time efficient than replacement... assuming the bad RAM doesn't waste your time again later.
There are a number of reasons that might come in handy. I find modern motherboards very easy to damage however, so IMHO anything that involves opening the case should be avoided. I've kept gaming rigs that had minor hardware faults. Also while waiting for the RAM to arrive a "just don't use the bad bits" may come in handy.
but it doesn't report correctable ECC errors if it doesn't know the chipset or if it doesn't have reliable support for the chipset (that's what it means if ecc is set to 'no' in memtest on a board with ecc ram)
that said, you are right, some motherboards don't support ECC. I'm just saying that's not what the 'ecc' 'on/off' field in memtest86 means.
Zowie. I had no idea (living in Denver) that it was 10x sea-level. Now I'd like to run that test during the summer, when I'm living in Leadville, elevation 10,000ft.
That's a very cool article. Thanks for posting it.
Nostalgia's not what it used to be.
There are two of them:
http://www.memtest86.com/
http://www.memtest.org/
Or use both?
Ant(Dude) @ Quality Foraged Links (AQFL.net) & The Ant Farm (antfarm.ma.cx / antfarm.home.dhs.org).
Memtest86 will find some memory issues, but not all. I find that you have to really stress the memory interface, itself (i.e.: not just typical reads/writes) in order to see truly "interesting" failures...ones that could easily cause "unexplained" lockups, etc. For this sort of thing, I recommend this. This will slam the memory i/f much more than just reads/writes...which is what you really ought to be doing anyway.
"...The smart and lazy ones I make my commanders." - Erwin Rommel
Back in my day, we tested our souped up 386 and 486 machines by running an overnight XFree86 compile job which tended to go wrong in the middle somewhere if you were having memory, bus, or disk I/O corruption issues.
http://www.ida.liu.se/~abdmo/SNDFT/docs/ram-soft.html
This references an IBM study, which is what I think I actually remember but could not find quickly this morning.
"In a study by IBM, it was noted that errors in cache memory were twice as common above an altitude of 2600 feet as at sea level. The soft error rate of cache memory above 2600 feet was five times the rate at sea level, and the soft error rate in Denver (5280 feet) was ten times the rate at sea level."
IBM research is a wonderful resource in the area of soft errors. I do remember exactly reading your quote, I didn't bother to track the exact article, but it should be part of this special issue http://www.research.ibm.com/journal/rd40-1.html, the banner article mentions Denver but doesn't have the exact quote. The web shows it would be "Terrestrial Cosmic Rays", the second article in that issue. They have a more recent special issue on the same subject http://www.research.ibm.com/journal/rd52-3.html
I bought the cheapest DDR400 RAM I could find at Fry's and it failed the memtest86. I had to manually change the BIOS to DDR333 for it to run reliably and pass the memory test. It has worked fine ever since, which has been almost a year.
"Meaningless!, Meaningless!" says the Teacher. "Utterly meaningless!"
I have found that if Memtest86 reports an error, that error is real and will matter sooner or later.
Memtest86 does a rather thorough scan including writing and reading patterns that are by design worst case. For example, some memory faults are such that a particular zero bit may flip to a one if all bits surrounding it on the chip are set to one. Memtest86 uses a series of bit patterns meant to trigger such problems. That's why it may find problems that don't or haven't yet crashed your system in practice. Note well that it's pure luck if there's no crash and there COULD be silent data corruption going on. Imagine a single I/O buffer out of thousands that flips 1 bit occasionally if just the wrong bit pattern is stored.
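To make the idea concrete, here is a toy sketch of a "moving inversions" style pass over simulated memory. This is not Memtest86's actual code, just my simplified illustration: the StuckBitMemory class is a made-up stand-in for a faulty chip, and it models a simple stuck-at-1 bit, not the neighbour-coupled fault described above.

```python
class StuckBitMemory:
    """Simulated RAM in which one bit of one byte always reads back as 1."""

    def __init__(self, size, bad_index, stuck_mask):
        self._cells = bytearray(size)
        self._bad_index = bad_index
        self._stuck_mask = stuck_mask

    def __len__(self):
        return len(self._cells)

    def __setitem__(self, index, value):
        self._cells[index] = value

    def __getitem__(self, index):
        value = self._cells[index]
        if index == self._bad_index:
            value |= self._stuck_mask  # the faulty bit reads as 1 regardless
        return value


def moving_inversions(mem, pattern=0x55):
    """Fill with pattern, verify and invert ascending, then verify descending.

    Returns a list of (index, expected, actual) mismatches.
    """
    errors = []
    inverse = pattern ^ 0xFF
    size = len(mem)
    for i in range(size):
        mem[i] = pattern
    for i in range(size):  # ascending: check the pattern, write its inverse
        if mem[i] != pattern:
            errors.append((i, pattern, mem[i]))
        mem[i] = inverse
    for i in reversed(range(size)):  # descending: check the inverse, restore
        if mem[i] != inverse:
            errors.append((i, inverse, mem[i]))
        mem[i] = pattern
    return errors


if __name__ == "__main__":
    faulty = StuckBitMemory(16, bad_index=5, stuck_mask=0x02)
    print(moving_inversions(faulty))         # one mismatch at index 5
    print(moving_inversions(bytearray(16)))  # healthy memory: no mismatches
```

Note the fault only shows up when the written value has a 0 in the stuck position, which is why a tester cycles through a pattern and its complement and does not rely on a single write/read.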
That's why I run my memory in RAID mode (Redundant Array of Inexpensive Dram)!
today is spelling optional day.
Some simple tests:
Being one who has maintained an 1100 node cluster with 8800 pieces of ECC RAM I can tell you we chase bad RAM sticks ALL THE TIME! It's not necessarily due to cosmic activity, the RAM just exhibits bad behavior as the circuits get older and things start to separate and break down due to thermal load over time. Even a small defect that would let the RAM pass the manufacturer's tests will eventually lead to a DIMM failure down the road. Most average human beings will never determine why their machine crashes every few days if it is a RAM issue. Some power users will even overlook it because they have too much faith in RAM that *was* good when they bought it, but now that it's two or three years old ...
I wouldn't trust a single app to verify your RAM. Run a couple different tests and see if you can nail down the problem. I can look and see how we're tracking that and get back to you.
Would that mean people using notebook computers in airplanes should expect to see more errors than they do on the ground?
Or do the airplanes themselves shield enough of the alpha particles?
Look at the Apple Mac Pro. It comes with ECC memory. It's not the only computer either.
Most people buy PCs based on price. If they can save $5 they will. These people don't get ECC but then they aren't doing anything critical with their computers, just games or the Internet. Anyone running a critical service would have some kind of redundant setup to either deal with the crashes or prevent/reduce them with things like ECC, RAID, dual power supplies and so on. Sun's servers even allow you to "boot around" failed hardware.
Memory testing will NOT protect from cosmic rays and flipped bits, what they call "soft errors".
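For anyone wondering how ECC copes with a flipped bit when testing can't, here's a toy sketch using a Hamming(7,4) code. Real ECC DIMMs use a wider code over 64-bit words (typically single-error-correct, double-error-detect), so treat this as an illustration of the principle only; the function names are mine.

```python
def hamming74_encode(data_bits):
    """Encode 4 data bits as a 7-bit codeword laid out as p1 p2 d1 p3 d2 d3 d4."""
    d1, d2, d3, d4 = data_bits
    p1 = d1 ^ d2 ^ d4  # parity over codeword positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over codeword positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4  # parity over codeword positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_decode(codeword):
    """Correct at most one flipped bit; return (data_bits, error_position or 0)."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s3  # 1-based position of the flipped bit
    if syndrome:
        c[syndrome - 1] ^= 1
    return [c[2], c[4], c[5], c[6]], syndrome


if __name__ == "__main__":
    word = hamming74_encode([1, 0, 1, 1])
    word[4] ^= 1  # simulate a soft error: flip the bit at position 5
    print(hamming74_decode(word))
```

The point is that the correction happens on every read, while the machine is running, which a one-off memory test can never do for a random flip that hasn't happened yet.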
That is not normal behavior for either XP or Vista. Are you running Windows in an emulator or something?
...which as you note is brought up often on these boards. What is not brought up is that most people replace their computers long before the HD is even thinking about going south on you.
I've had hard drives physically fail on me before, but in my personal experience it's been pretty rare, and I'm used to working on computers that are over half a decade old.
I would suspect you've got problematic hardware. That's the most likely cause of Windows FUBAR.
My experience with Linux, Windows, and glitchy hardware is fairly epic. Back in late '96, I got my first personal computer. It was from a fly-by-night shop, with sub-par parts: an integrated SiS motherboard, cheap RAM, Quantum Bigfoot drive. In retrospect, this computer likely had RAM problems when I got it, as in W95 it was more unstable than I was expecting.
Years went on, and I kept using this computer: I was a young geek who couldn't afford another $1500+ for a computer, after all. I heard that there was this linux thing that was rock-solid stable, so I decided to give it a try.
Long story short, the instability in the system got worse and worse - to the point where Windows would occasionally crash while or shortly after booting, but always several times an hour. I was down to using Linux for 99% of everything, and had mostly stopped gaming. Linux, while it wasn't rock-solid-stable, would only crash 2-3 times a day at the outside (in '99).
In '99, I did manage to patch and build the kernel with BadRAM, and that improved things measurably. (But it was time for a new system, anyway.)
Might try patching your kernel with BadRAM (not sure if Ubuntu does by default): https://help.ubuntu.com/community/BadRAM
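For anyone wondering what a BadRAM pattern actually encodes: as far as I understand it, each pattern is a fault address plus a mask, and an address is treated as bad when it agrees with the fault address on every bit that is set in the mask. A rough Python sketch of that matching rule follows. The helper name and the pattern values are made up for illustration; they are not output from a real memtest run.

```python
def is_bad(addr, patterns):
    """Return True if addr matches any (fault_addr, mask) pair.

    Assumed rule: a 1 bit in the mask means "this bit must equal the
    fault address"; a 0 bit means "don't care".
    """
    return any((addr & mask) == (fault & mask) for fault, mask in patterns)


# Hypothetical patterns: one single bad address, one aligned 4 KiB block.
patterns = [
    (0x12345678, 0xFFFFFFFF),  # exactly one address
    (0x0ABCD000, 0xFFFFF000),  # every address in the 4 KiB page at 0x0ABCD000
]

print(is_bad(0x12345678, patterns))  # the exact faulty address
print(is_bad(0x0ABCD7FF, patterns))  # inside the masked-out page
print(is_bad(0x0ABCE000, patterns))  # first address of the next page
```

The point of the mask is that one pair can fence off a whole region, so a handful of patterns can cover a stick with a bad row or column.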
~/ssh slashdot.org ssh: connect to host slashdot.org port 22: too many beers
This is either a troll, or you asked your question in the worst possible way. It's not a windows/linux question. Bad ram is bad, period. It sounds like your ram is bad. Probably the way linux is allocating that chunk of ram just happens to be non-crucial and not crashing your box. It's probably using it for cache and trashing your files or something. ;)
If it's taking a couple hours, it may be heat related, or it might just be a borderline chip.
Tim May is sitting in a hot tub, laughing...
...write a better one. :)
From real-world experience, the cosmic ray explanation is as dumb as it sounds. Many of the historical problems with bit flipping have been resolved by miniaturization (less room for "sticking") and changes to packaging materials to avoid problems induced by secondary reactions.
If memtest says something is wrong then you can believe that something is wrong. Using linux for x hours/days/years at a time may not necessarily uncover a problem.
Also linux and windows fill memory differently so a memory related hardware problem on one platform may be much more or less apparent than the other depending on the nature and location of the problem. It does NOT mean one platform is better or worse than the other in this regard.
From experience with modern DIMMs, some of the cheaper DIMMs on the market have whacked voltages and/or improperly programmed memory timings.
I've had great success playing musical DIMMs (esp. between pairs) and manually adjusting various memory timing parameters in the BIOS to turn a once-an-hour memtest86+ error (totally missed by other memory diag software) into days of running (the limit of how long I was willing to wait) without a single issue.
Of course this process is all trial-and-error, may not even work, and can take quite a lot of your time (days), so it's best not to start with cheap crap in the first place.
So true.
I think we should bring back segmented addressing while we are at it too. I was so much more productive when I had to use DS, SS, GS, FS, and CS. I miss them so. All those segments made the computer run so fast, too!
Many people seem unaware of the fact that mixing memory modules of different speeds or different timings can and does cause problems. Even the engineers in our IT department at work don't get it.
http://www.research.ibm.com/journal/rd/401/ziegler.pdf
The article does mention concrete shielding, but nothing about metal?
"Recently, experiments on cosmic ray effects at airplane
altitudes have been published by IBM, Boeing, and others
[14, 15]. We do not review this specialized field, other than
to note that the fail rate of electronics at airplane altitudes
is about one hundred times worse than at terrestrial
altitudes, as was predicted in an IBM paper 15 years
earlier [16]."
Windows Internals has a pretty fun tool which by the click of a button will do bad things. One of the 'bad' things it can do is randomly overwrite kernel memory.
What is fun about the tool is that it is like Russian Roulette: You can click the button several times, corrupting memory, and eventually you will corrupt something important and bring the machine down. Or you can click the button a couple of times, and see how long you can use your machine before that memory path is hit and your system comes down.
The tool consists of two components, Notmyfault.exe and Myfault.sys (which IIRC is embedded within the exe and loaded into the kernel when the tool is run with admin rights). Notmyfault.exe cannot itself take down the machine, as it has no 'rights' to stomp on memory outside of its sandbox. This is why it has to request that the bad deed be done by the kernel component, Myfault.sys.
You can find a link to the tool below:
http://download.sysinternals.com/Files/Notmyfault.zip
Related to this thread - you can corrupt memory and not see any adverse effects if nothing important is located in that memory space. This would easily explain why Windows might crash, but not Linux. It just depends on where modules are loaded, and what code/data is corrupted.
Run memtest or use the Ultimate Boot CD. If it has 2 sticks of memory, take out the 2nd stick and re-test, then take out the 1st stick, put the 2nd stick in the 1st slot and re-test. It is usually only one or the other, or a seating problem. Good luck.
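To put a rough number on the Russian Roulette idea: if each random overwrite has some chance p of landing on memory that matters, the chance of surviving n of them is (1 - p)^n. Here is a quick Python sketch with a Monte Carlo cross-check. The value of p is a made-up placeholder, not a measurement of Windows, Linux or anything else, and it assumes every hit is independent.

```python
import random

def survival_probability(p_critical, n_hits):
    """Chance that none of n independent random corruptions hits critical memory."""
    return (1.0 - p_critical) ** n_hits

def simulate(p_critical, n_hits, trials=20_000, seed=1):
    """Monte Carlo estimate of the same survival probability."""
    rng = random.Random(seed)
    survived = 0
    for _ in range(trials):
        # A run survives only if every single hit misses the critical region.
        if all(rng.random() >= p_critical for _ in range(n_hits)):
            survived += 1
    return survived / trials

p = 0.05  # hypothetical: 5% of memory holds something "important"
for n in (1, 10, 50):
    print(n, round(survival_probability(p, n), 3), round(simulate(p, n), 3))
```

Two systems that lay out memory differently effectively have a different p for the same faulty cell, which is all it takes for one to crash and the other to keep running.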
Just palpate your memories and check for lumps.
I was just researching ECC memory and memory errors a few weeks ago. I am planning to build a computer with so much RAM that it can basically keep all my most frequently accessed files and programs in main memory, all the time. However, this made me seriously consider the reliability of RAM. I was pleasantly surprised that ECC RAM is quite easy to find and not that much more expensive than non-ECC RAM. However, figuring out which motherboards and/or CPUs support ECC RAM was so tedious that I gave up.
I have not been able to find any clear information about whether ECC support is purely a motherboard issue, purely a CPU issue, or a combination, for most kinds of CPU. My best guess is that if the CPU does not have an integrated memory controller (Intel CPUs and some AMD CPUs), ECC support is a motherboard issue, and if the CPU does have an integrated memory controller (many AMD CPUs), then ECC needs to be supported by both the CPU and the motherboard. In all cases, ECC needs to be supported by the RAM (although, in theory, this doesn't have to be the case).
With these somewhat shaky assumptions, I went to investigate what combination of 64-bit CPU and motherboard I could buy that would support at least 8 and preferably at least 16 GB of ECC RAM, preferably non-registered. As it turns out, whether a particular motherboard or CPU supports ECC memory is often not listed by sellers. Searching the web for this kind of information is difficult, because searching for "ECC" will give you many hits that are, in fact, "non-ECC" - which means that the component supports non-ECC memory, but doesn't mean it supports _only_ non-ECC memory.
All this really confused me. If it is so easy to find ECC-memory, it seems reasonable to assume there is a fairly strong demand for it. But if that is the case, why is it so difficult to find out which CPU/motherboard combinations support ECC? And, actually, why aren't we all using ECC memory nowadays? I would say that the probability of memory errors occurring must have increased with increased megabits per chip, both because there are more bits that can be flipped and because the energy needed to flip a bit is smaller. Given that, I would say memory reliability is a real issue. But it is hard to even find up to date statistics on this.
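The "more bits, more flips" part of that argument can be sketched in a few lines of Python. This assumes every gigabyte fails independently at the same constant rate, so the expected count grows linearly with capacity and the chance of at least one flip follows a Poisson model. The rate used here is a pure placeholder so the script runs; I do not have a measured figure, which is exactly the statistic I could not find.

```python
import math

def expected_flips(capacity_gb, hours, flips_per_gb_hour):
    """Expected soft errors if every GB fails independently at a constant rate."""
    return capacity_gb * hours * flips_per_gb_hour

def p_at_least_one(capacity_gb, hours, flips_per_gb_hour):
    """Poisson probability of seeing one or more flips in the interval."""
    return 1.0 - math.exp(-expected_flips(capacity_gb, hours, flips_per_gb_hour))

RATE = 1e-5     # placeholder flips per GB per hour, NOT a measured figure
MONTH = 24 * 30 # hours of uptime considered

for gb in (0.5, 4, 16):
    print(gb, "GB:", round(p_at_least_one(gb, MONTH, RATE), 4))
```

Whatever the real rate is, going from 512MB to 16GB multiplies the expected number of flips by 32 under this model; a lower energy per bit would push the rate itself up on top of that.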
Now, I know all of my failures to find information can just be attributed to me not searching in the right way, but I really want to know the answers. So, any help and advice appreciated.
As for the computer I want to build:
- 64-bit CPU with virtualization extensions, preferably multi-core
- 8 or 16 GB of ECC RAM
- 80 or more GB solid state disk
- Video card with fast 3D using open-source drivers (this probably means AMD)
- I care about power consumption, especially when the system is idle
- Preferably no moving parts
I know this is ambitious, but everything besides figuring out if it will support ECC seems feasible.
Please correct me if I got my facts wrong.
I've especially run into issues with DDR2 sticks in that they may not use the default 1.8/1.9V setting on most systems, but require 2.0-2.3V to operate, especially if they're "high-performance" memory meant to run at 1066 speed. Default timings can also be an issue with the speed levels programmed into the chips: you can check for this by setting the RAM to run 1 or 2 speeds lower (say DDR2-800 running with a 333 MHz clock (DDR2-667) instead of 400 MHz).
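For reference, the relation between the clock you set in the BIOS and the DDR2 speed grade is just a factor of two, since DDR transfers on both clock edges. A small Python sketch of the arithmetic, assuming a standard 64-bit (8-byte) DIMM; the function names are my own:

```python
def ddr2_name(io_clock_mhz):
    """DDR transfers data on both clock edges, so MT/s = 2 x I/O clock."""
    return f"DDR2-{round(2 * io_clock_mhz)}"

def peak_bandwidth_mb_s(io_clock_mhz, bus_bytes=8):
    """Theoretical peak transfer rate of one 64-bit (8-byte) DIMM channel."""
    return 2 * io_clock_mhz * bus_bytes

# 400 MHz, 333.33 MHz and 266.67 MHz I/O clocks
for clock in (400.0, 1000 / 3, 800 / 3):
    print(ddr2_name(clock), round(peak_bandwidth_mb_s(clock)), "MB/s")
```

So stepping a DDR2-800 stick down to the 333 MHz setting costs about a sixth of the theoretical peak bandwidth, which is a cheap price if it makes the memtest errors go away.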
Life is irony, and nothing ever goes as planned.
Apple's Mac Pro uses (and shipped with) fully buffered ECC... at least my Jan 2008 model does.
I guess that extra cost *does* go somewhere. ;)
What interests me is the fact that DDR costs at least twice as much as DDR2. Now why is that?
Wanted : A Signature.
It is serious.
I think people get their ideas from Maths. There, things are perfect, without catches.
There are no such cases in the real world. This gentleman has just pointed out that our systems cannot get bigger and bigger without rethinking their reliability.
Take things seriously. Recheck from time to time whether one of your assumptions has failed.
okay it's an 8 bit 6500-style processor.
This is a really good question and it points to the need for more utilities to check and fix bit level errors in files and memory images.
I wish there was a set of Linux utilities for generating and using error correction checksums.
The usual checksum tool simply reports if a file has an error.
What I would like to see is a checksum tool capable of fixing a multi-byte, single byte or single bit error.
These are checksums using Hamming codes I think?
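For what it's worth, the single-bit case is small enough to sketch in Python. Below is the textbook Hamming(7,4) code: 4 data bits are stored as 7 bits, and the decoder can locate and flip back any one bad bit in that block. This is only an illustration of the idea, not an existing Linux utility, and the function names are my own.

```python
def encode(d):
    """Encode 4 data bits [d1, d2, d3, d4] into a 7-bit Hamming codeword.

    Positions are numbered 1..7; parity bits sit at positions 1, 2 and 4.
    """
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4  # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # covers positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]

def decode(c):
    """Return (data_bits, error_position); position 0 means no error found."""
    c = list(c)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    pos = s1 + 2 * s2 + 4 * s4  # the syndrome spells out the bad position
    if pos:
        c[pos - 1] ^= 1  # flip the bad bit back
    return [c[2], c[4], c[5], c[6]], pos

word = [1, 0, 1, 1]
code = encode(word)
code[5] ^= 1  # simulate a single flipped bit at position 6
print(decode(code))  # -> ([1, 0, 1, 1], 6)
```

This only repairs one flipped bit per 7-bit block; two flips in the same block get "corrected" into the wrong data. Repairing multi-byte damage needs a stronger code, Reed-Solomon being the usual choice for burst errors.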
I strongly suspect that huge hunks of don't care data like JPG photos and DVI movies develop bit level errors during storage, handling, copying, and editing.
A utility like gpg will spot errors. And diff spots character level errors. But they are both real clumsy at finding and precisely repairing individual bits. With diff you wind up unsure if your reference or original file changed.
This Linux system I am writing on had a bad RAM memory bit and the random number generator loaded over that location.
The random number generator didn't work but Linux mostly ran OK. I thought it was a software problem for several weeks. Finally after replacing the random number code and even checking bug reports, I finally looked at the RAM memory. Bing!
Ever think that the faster a computer runs and the more memory it handles, the higher the probability of an error? If OSes hadn't improved, imagine the hell we would be in now. Win 95 crashed every 30 minutes or more on a 486 running at 33 MHz. At 3,000 MHz (90.9 times faster) that could be seen as 30 min / 90.9, or a crash every 20 seconds. Windows isn't more sensitive to a memory error, it just depends on where in RAM the error is. Vista loads the OS files in different memory locations with each boot to prevent 'badware' from finding a sweet spot in RAM to attach to. Hence a memory error which in XP may affect some incidental program and go unnoticed can, in Vista, cause errors that vary with each boot cycle. Yani
1) I usually trust the results of memtest after a cycle or three. It is true that this doesn't give any conclusive evidence, but generally when memtest tells me a module bit the dust, the machine can be revived by replacing that module. If memtest has nothing to say and the machine is still wonky, I would investigate the hard drive, processor etc. first before re-evaluating the memory.
2) It depends how important your email is to you. Whether you've got 64k or 16GB of RAM, every single word on its own can be corrupted. So, even with better production, the end product is still dependent on each word of memory being correct. Each corruption may lead to an error. If that error is likely to crash a system you depend on, go ECC. If it's more likely to mean that you just need to reboot your gaming rig, you don't need ECC. When running your machine in Ubuntu, the faulty words of your memory were allocated for some random number or something not addressed often enough to cause corruption.
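The back-of-the-envelope figure above is easy to check in Python. This simply scales the crash interval by the clock ratio, which is the naive assumption being made here (crashes per clock cycle stay constant); it is not a real reliability model.

```python
def scaled_crash_interval_s(base_interval_min, base_mhz, new_mhz):
    """Naively scale a crash interval (minutes in, seconds out) by the clock ratio."""
    speedup = new_mhz / base_mhz
    return base_interval_min * 60 / speedup

print(round(3000 / 33, 1))                               # clock ratio, about 90.9
print(round(scaled_crash_interval_s(30, 33, 3000), 1))   # about 19.8 seconds
```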
16GB NON-ECC OCZ DDR2 800 here (4 x 4GB) in an ASUS P45 board. No problems. Passes memtest. Works fine in Linux and Windows.
I had a similar problem that was very hard to isolate and resolve; Windows xp was locking up while vista was stable. Then vista started locking up but Ubuntu ran fine, even running full stress tests. Eventually I figured that the extra local heat from two 8800gtxs being dumped onto the southbridge (due to removing the water cooling from both cards and SB and replacing with stock heatsinks) was causing the lockups. I dropped the overclock on the CPU and the lockups stopped.